How To Scan and Repair Disk Errors on Linux

Maintaining a healthy filesystem is crucial for any Linux system’s stability and performance. Over time, your Linux disk drives can develop errors due to unexpected shutdowns, power failures, hardware issues, or general wear and tear. Left unaddressed, these errors can lead to data corruption, system instability, or even complete system failure. Fortunately, Linux provides powerful tools to scan, detect, and repair disk errors before they become critical issues.

This guide walks through scanning and repairing disk errors on Linux systems. Whether you’re a system administrator managing servers or a home user maintaining a personal Linux machine, these techniques will help you keep your storage devices healthy and reduce the risk of data loss.

Understanding Disk Errors in Linux

Disk errors in Linux can manifest in various ways and understanding their nature is the first step toward effective troubleshooting. These errors typically fall into two categories: logical errors (filesystem corruption) and physical errors (hardware issues).

Linux filesystems organize data using complex structures, including inodes, superblocks, and data blocks. When these structures become damaged or corrupted, your system might exhibit symptoms such as:

Unexpected system freezes or crashes
Files that suddenly become unreadable or corrupted
Strange error messages during boot or operation
Slow disk performance or excessive disk activity
System failing to boot completely
Input/output errors when accessing certain files

Several factors can contribute to disk errors in Linux environments:

Improper system shutdowns (power outages, hard resets)
Physical damage to storage devices
Aging hardware (all storage media have a finite lifespan)
Software bugs or filesystem driver issues
Magnetic interference (for traditional HDDs)
Bad sectors developing on the disk surface

Regular filesystem checks are essential preventive maintenance tasks for any Linux system. Most modern Linux distributions use journaling filesystems like ext4 and XFS, or copy-on-write filesystems like Btrfs, all of which are more resilient to corruption than older filesystems. However, even these advanced filesystems can develop issues that require manual intervention.

Preparing for Disk Checks

Before diving into disk repair operations, proper preparation is essential to avoid further damage and ensure effective troubleshooting.

Identifying Your Disks and Partitions

The first step is to identify which disk or partition requires checking. Linux provides several commands to help you gather this information:

lsblk -o NAME,FSTYPE,SIZE,MOUNTPOINT,LABEL

This command displays a hierarchical view of all block devices with their filesystem types, sizes, mount points, and labels. The output will look something like:

NAME        FSTYPE   SIZE MOUNTPOINT    LABEL
sda                 500G               
├─sda1      ext4     50G /             root
├─sda2      swap      8G [SWAP]        swap
└─sda3      ext4    442G /home         home
sdb                   1T               
└─sdb1      xfs       1T /data         data

Alternatively, you can use the df command to see disk usage for mounted filesystems:

df -h

For detailed partition information on a specific disk, use parted:

sudo parted /dev/sda print

To see the filesystem UUID and other detailed information, you can use:

sudo blkid

Take note of the device names (like /dev/sda1) of the partitions you need to check and repair.
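
When many partitions are involved, the same lsblk data can be filtered in a script. The following is a minimal sketch, not a definitive tool: the function name devices_with_fstype is invented for illustration, and it assumes input in the two-column form printed by lsblk -rno NAME,FSTYPE. Device-mapper volumes live under /dev/mapper rather than /dev, so treat the printed paths as a starting point.

```shell
# Hypothetical helper: print /dev paths of block devices whose filesystem
# type matches $1. Expects "NAME FSTYPE" pairs on stdin, one per line,
# in the form printed by: lsblk -rno NAME,FSTYPE
devices_with_fstype() {
  awk -v want="$1" '$2 == want { print "/dev/" $1 }'
}

# Example (not run here):
#   lsblk -rno NAME,FSTYPE | devices_with_fstype ext4
```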

Unmounting the Filesystem

Critical warning: Most filesystem check and repair operations require the target filesystem to be unmounted. Performing checks on mounted filesystems can lead to data corruption or loss.

To unmount a filesystem, use the umount command:

sudo umount /dev/sdb1

If you’re unmounting by mount point instead of device name:

sudo umount /data

You can verify if the unmount was successful by checking mount or lsblk output again.

If the system indicates that the filesystem is busy, you may need to identify and close applications using the filesystem:

sudo lsof /data

For filesystems that are in constant use, like the root filesystem, special procedures are required, which we’ll cover later in this guide.
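
Before running any checker from a script, it helps to confirm that the device really is unmounted. The sketch below assumes the device appears under the same name in /proc/mounts; the function name is_mounted and its optional second argument (an alternative mounts table, handy for testing) are illustrative choices, not part of any standard tool.

```shell
# Hypothetical helper: exit 0 if device $1 is listed as a mount source.
# $2 defaults to /proc/mounts and can point at another file for testing.
is_mounted() {
  awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Example (not run here):
#   if is_mounted /dev/sdb1; then
#     echo "Unmount /dev/sdb1 before checking it"
#   fi
```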

The Primary Tool: fsck (File System Consistency Check)

The fsck (File System Consistency Check) utility is the primary tool for checking and repairing filesystem errors in Linux. It serves as a front-end for filesystem-specific checkers, automatically detecting the filesystem type and calling the appropriate checker.

Understanding fsck

The fsck utility performs several critical functions:

Checks filesystem integrity and consistency
Detects errors in the filesystem structure
Repairs corrupted metadata such as inodes, superblocks, and block allocation maps
Fixes directory structure issues
Recovers orphaned files (files without proper directory entries)
Corrects file and directory counts

Think of fsck as the Linux equivalent of Windows’ chkdsk utility, but with more flexibility and advanced options.

Basic fsck Usage

The simplest form of the fsck command is:

sudo fsck /dev/sdb1

This checks the specified partition and reports any errors found. If errors are detected, fsck will prompt you for confirmation before making repairs.

For a more informative output, add the verbose flag:

sudo fsck -v /dev/sdb1

To specify the filesystem type explicitly (useful if automatic detection fails):

sudo fsck -t ext4 /dev/sdb1

Understanding fsck Error Codes

After running, fsck returns an exit code that indicates the outcome of the check. Understanding these codes helps interpret the results:

Code	Meaning
0	No errors were found
1	Filesystem errors were corrected
2	System should be rebooted
4	Filesystem errors were left uncorrected
8	Operational error occurred
16	Usage or syntax error
32	Checking was canceled by user request
128	Shared-library error

You can check the return code after running fsck with:

echo $?

A return code of 0 or 1 generally indicates success, while higher values may require additional attention.
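
The exit status is the sum of the values in the table, so a status of 5 means that some errors were corrected (1) and others were left uncorrected (4). A script can test each bit separately. This is a minimal sketch; the function name describe_fsck_status and the wording of the messages are made up for illustration.

```shell
# Hypothetical helper: print one line for each condition encoded in an
# fsck exit status ($1). The meanings follow the table above.
describe_fsck_status() {
  status=$1
  if [ "$status" -eq 0 ]; then
    echo "no errors"
    return 0
  fi
  [ $(( status & 1 )) -ne 0 ]   && echo "errors corrected"
  [ $(( status & 2 )) -ne 0 ]   && echo "reboot recommended"
  [ $(( status & 4 )) -ne 0 ]   && echo "errors left uncorrected"
  [ $(( status & 8 )) -ne 0 ]   && echo "operational error"
  [ $(( status & 16 )) -ne 0 ]  && echo "usage or syntax error"
  [ $(( status & 32 )) -ne 0 ]  && echo "canceled by user"
  [ $(( status & 128 )) -ne 0 ] && echo "shared-library error"
  return 0
}

# Example (not run here):
#   sudo fsck /dev/sdb1; describe_fsck_status $?
```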

Advanced fsck Options

For automatic repair without prompts (useful for scripts):

sudo fsck -y /dev/sdb1

The -y flag automatically answers “yes” to all repair prompts. Use this option with caution, as it will make changes without asking for confirmation.

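For interactive repair with more control, run the checker without -y or -p:

sudo fsck /dev/sdb1

When run from a terminal, e2fsck then asks for confirmation before each repair. The -r flag is sometimes described as the interactive option, but current util-linux versions of fsck use it to report per-check statistics, and e2fsck accepts it only for backward compatibility.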

To check all filesystems listed in /etc/fstab (entries whose sixth field, the fsck pass number, is 0 are skipped):

sudo fsck -A

To skip the root filesystem when checking all filesystems:

sudo fsck -AR

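To preview which checker command fsck would run, without executing it (for an actual read-only check of an ext filesystem, use -n instead, which answers “no” to every repair prompt):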

sudo fsck -N /dev/sdb1

For a thorough check that forces checking even if the filesystem appears clean:

sudo fsck -f /dev/sdb1

Filesystem-Specific Tools

While fsck provides a universal interface for checking filesystems, Linux also offers specialized tools for specific filesystem types. These tools often provide more options and better control for their respective filesystems.

Checking ext2/ext3/ext4 Filesystems

The e2fsck tool is designed specifically for the ext family of filesystems (ext2, ext3, and ext4), which are among the most common in Linux systems.

For a basic check with verbose output:

sudo e2fsck -v /dev/sdb1

To force a complete check even if the filesystem appears clean:

sudo e2fsck -f /dev/sdb1

For automatic repair without prompts:

sudo e2fsck -p /dev/sdb1

The -p flag attempts to automatically fix any problems without user intervention but will abort if it encounters serious issues.

For a more aggressive approach that automatically answers “yes” to all questions:

sudo e2fsck -y /dev/sdb1

To display the progress of the check in real-time (useful for large filesystems):

sudo e2fsck -C0 /dev/sdb1

The -C0 flag shows a progress bar during the check, making it easier to monitor on large partitions.

Checking XFS Filesystems

XFS filesystems, often used in enterprise environments and for large storage arrays, require different tools for maintenance. The primary utility for checking and repairing XFS filesystems is xfs_repair.

To check an XFS filesystem without performing any repairs:

sudo xfs_repair -n /dev/sdb1

The -n flag performs a check without modifying the filesystem, similar to a dry run.

To perform repairs on an XFS filesystem:

sudo xfs_repair /dev/sdb1

For verbose output with detailed information about the repair process:

sudo xfs_repair -v /dev/sdb1

For even more detailed output:

sudo xfs_repair -v -v /dev/sdb1

Each added -v increases the verbosity level.

Important: Always unmount XFS filesystems before checking them. Unlike some other filesystem types, XFS absolutely requires unmounting before repair operations.

After completing the check and repair process, you can remount the filesystem:

sudo mount -a

This command mounts all filesystems listed in /etc/fstab that aren’t already mounted.
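
Since the ext family and XFS use different checkers, a script that handles several disks needs to pick the right one. The following sketch only prints a suggested read-only command and runs nothing. The function name readonly_check_command is an assumption for illustration, and the filesystem type is expected to come from something like lsblk -no FSTYPE.

```shell
# Hypothetical helper: print a read-only check command for device $2,
# given its filesystem type $1. Nothing is executed.
readonly_check_command() {
  case "$1" in
    ext2|ext3|ext4) echo "e2fsck -n $2" ;;     # -n: open read-only, answer "no"
    xfs)            echo "xfs_repair -n $2" ;; # -n: check only, no modification
    *)              echo "fsck -N $2" ;;       # -N: only show what would be run
  esac
}

# Example (not run here):
#   readonly_check_command "$(lsblk -no FSTYPE /dev/sdb1)" /dev/sdb1
```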

Dealing with Bad Sectors

Bad sectors are physical areas of a storage device that have become damaged and can no longer reliably store data. These defects can cause data corruption and system instability if not properly managed.

Detecting Bad Sectors

The badblocks utility is specifically designed to scan storage devices for bad sectors:

sudo badblocks -v /dev/sdb

This command performs a read-only test and displays all bad blocks found. The -v flag provides verbose output during the scan.

For a more thorough test that performs a non-destructive read-write test:

sudo badblocks -nsv /dev/sdb

The -n flag performs a non-destructive read-write test, -s shows progress, and -v provides verbose output.

Warning: For the most thorough test, you can use a destructive write test, but this will erase all data on the device:

sudo badblocks -wsv /dev/sdb

The -w flag performs a destructive write test. Only use this on disks with no valuable data or after backing up all data.

To save the list of bad blocks to a file for further processing:

sudo badblocks -v /dev/sdb > bad-blocks.txt

Repairing Bad Sectors

While physical bad sectors cannot be truly “repaired,” Linux can mark them as unusable to prevent data corruption. The e2fsck command can automatically handle bad sectors when used with specific options:

sudo e2fsck -c -v /dev/sdb1

The -c flag tells e2fsck to run badblocks in read-only mode and mark any bad blocks as unusable.

For a more thorough check with a read-write test:

sudo e2fsck -cc -v /dev/sdb1

Using -cc runs a more thorough read-write test with badblocks.

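If you’ve already run badblocks and saved the output to a file, you can pass the list to e2fsck. The list must come from a scan of the same partition (here /dev/sdb1, not the whole disk /dev/sdb) made with the filesystem’s block size (badblocks -b); otherwise the block numbers will not line up with the filesystem:

sudo e2fsck -l bad-blocks.txt /dev/sdb1

The -l flag instructs e2fsck to add the blocks listed in the file to the filesystem’s bad block list. Because of the block-size pitfall, letting e2fsck run badblocks itself with -c is usually the safer choice.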

Important: A growing number of bad sectors often indicates impending drive failure. If your drive reports multiple bad sectors, especially if the number increases over time, consider backing up your data and replacing the drive soon. Regular S.M.A.R.T. monitoring (covered later) can help you track this trend.
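
To see whether the number of bad blocks grows between scans, a script can count the entries in each saved list and compare the totals over time. This is a minimal sketch: count_bad_blocks is a name invented here, and it assumes the file holds one block number per line, which is how badblocks writes its list.

```shell
# Hypothetical helper: count the block numbers in a saved badblocks list ($1).
count_bad_blocks() {
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
  grep -c '^[0-9][0-9]*$' "$1" || true
}

# Example (not run here):
#   echo "$(date +%F) $(count_bad_blocks bad-blocks.txt)" >> bad-blocks-history.txt
```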

Checking and Repairing the Root Filesystem

Checking the root filesystem presents a unique challenge because it cannot be unmounted while the system is running. Linux provides several methods to address this limitation.

Method 1: Using Force Check at Boot

The simplest approach is to schedule a filesystem check during the next system boot:

sudo touch /forcefsck

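This creates an empty file named forcefsck in the root directory. On distributions whose boot scripts honor it (traditionally SysV init systems), the file is detected during the next boot and fsck runs on the root filesystem before it is mounted read-write. Many current systemd-based distributions ignore this file.

On systems using systemd (most modern distributions), request the check with a kernel parameter instead. At the GRUB menu, highlight the boot entry, press e, append the following to the line that begins with linux, and then press Ctrl+X or F10 to boot:

fsck.mode=force

This applies to a single boot only. The systemd-fsck-root.service unit that performs the check is started automatically during boot and does not need to be enabled with systemctl.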

Method 2: Using Live Media

For more severe issues, booting from a Linux live USB or DVD provides full access to your system’s disks while they’re unmounted:

Create a bootable Linux live media (Ubuntu, Fedora, or specialized rescue distributions like SystemRescue)
Boot your computer from this media
Open a terminal in the live environment
Identify your root partition:
```
lsblk
```
Run fsck on the unmounted root partition:
```
sudo fsck -f -y /dev/sda1
```
(Replace /dev/sda1 with your actual root partition)
After completion, reboot into your regular system

This method provides the most thorough check since the filesystem is completely unmounted and not in use.

Method 3: Using Recovery Mode

Many Linux distributions include a recovery or maintenance mode that can be accessed from the boot menu:

Reboot your computer
Access the GRUB menu (usually by holding Shift on BIOS systems, or pressing Esc on UEFI systems, during boot)
Select recovery mode or advanced options
Choose “fsck” or “root shell” from the recovery menu
If you choose root shell, the system will likely mount the root filesystem as read-only, allowing you to run:
```
fsck -f /dev/sda1
```
After completion, reboot with the command:
```
reboot
```

This method doesn’t require additional boot media but may not provide as complete access as a live environment.

Checking S.M.A.R.T. Disk Health

Beyond filesystem errors, monitoring the physical health of your storage devices is crucial. Modern storage devices include Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.), which provides valuable insights into drive health and can predict impending failures.

Installing smartmontools

First, install the required package:

# For Debian/Ubuntu-based distributions
sudo apt update
sudo apt install smartmontools

# For Fedora/RHEL-based distributions
sudo dnf install smartmontools

# For Arch Linux
sudo pacman -S smartmontools

# For openSUSE
sudo zypper install smartmontools

Basic S.M.A.R.T. Health Check

To check if a drive supports S.M.A.R.T. and verify its basic health status:

sudo smartctl -i -H /dev/sda

The -i flag displays drive information, and -H performs a health check. The output will include a line like:

SMART overall-health self-assessment test result: PASSED

Or, for failing drives:

SMART overall-health self-assessment test result: FAILED

A “FAILED” result indicates serious problems, and you should back up your data immediately and consider replacing the drive.

Comprehensive S.M.A.R.T. Data

For detailed information about your drive’s health:

sudo smartctl -a /dev/sda

This displays all S.M.A.R.T. attributes tracked by the drive, including:

Raw read error rate
Spin-up time
Start/stop count
Reallocated sector count
Seek error rate
Power-on hours
Temperature
Current pending sectors
Offline uncorrectable sectors

To run a short self-test on the drive:

sudo smartctl -t short /dev/sda

This initiates a brief diagnostic that checks the drive’s mechanical and electrical components. For a more thorough examination:

sudo smartctl -t long /dev/sda

A long test can take several hours but provides a comprehensive assessment of the drive’s condition.

After the test completes, view the results with:

sudo smartctl -l selftest /dev/sda

Interpreting S.M.A.R.T. Data

When analyzing S.M.A.R.T. data, pay particular attention to these critical attributes:

Reallocated Sectors Count: Indicates how many sectors have been remapped due to errors. Any non-zero value warrants monitoring, and a growing count suggests drive deterioration.
Current Pending Sectors: Sectors waiting to be remapped. A non-zero value here often indicates problems.
Uncorrectable Sectors: Sectors that couldn’t be read or written, even after error correction. Any uncorrectable sectors are cause for concern.
Command Timeout: Indicates instances where drive commands failed to complete in time. Frequent timeouts suggest mechanical issues.
Power-On Hours: Shows the drive’s total operating time. While not directly indicating problems, older drives are generally more prone to failure.

A steady increase in any error-related attribute typically indicates progressive drive deterioration. If you notice this pattern, consider backing up your data and planning for drive replacement.
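
The attributes above can be pulled out of smartctl output automatically. The sketch below is one way to do it, under stated assumptions: it expects the ATA attribute table printed by smartctl -A, where the raw value is the tenth column, and it watches attribute IDs 5, 197, and 198 (reallocated, pending, and offline uncorrectable sectors). NVMe drives report health in a different format, and smart_warnings is a name chosen for illustration.

```shell
# Hypothetical helper: read "smartctl -A" output on stdin and print
# NAME=RAW_VALUE for sector-related attributes with a non-zero raw value.
# Assumes the ATA table layout (ID in column 1, name in 2, raw value in 10).
smart_warnings() {
  awk '$1 ~ /^(5|197|198)$/ && $10 + 0 > 0 { print $2 "=" $10 }'
}

# Example (not run here):
#   sudo smartctl -A /dev/sda | smart_warnings
```

No output means none of the three counters has moved off zero.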

Preventive Maintenance and Best Practices

Proactive maintenance helps prevent disk errors and data loss. Implementing these best practices will significantly improve your system’s reliability.

Regular Scheduled Checks

Set up periodic filesystem checks to catch issues before they become serious:

Create a monthly check script:
```
sudo nano /etc/cron.monthly/fscheck
```

Add the following content:

#!/bin/bash
# Monthly check of /dev/sdb1, normally mounted at /data
DEV=/dev/sdb1
MNT=/data
LOG=/var/log/fscheck.log

echo "Starting filesystem check at $(date)" > "$LOG"

# Unmount first if needed; never run fsck on a mounted filesystem
if mountpoint -q "$MNT" && ! umount "$MNT"; then
  echo "Failed to unmount $MNT, check aborted" >> "$LOG"
  exit 1
fi

fsck -y "$DEV" >> "$LOG" 2>&1
status=$?
echo "Filesystem check finished at $(date) with exit code $status" >> "$LOG"

# Remount only if no errors were left uncorrected (exit code below 4)
if [ "$status" -lt 4 ]; then
  mount "$MNT"
else
  echo "Errors remain on $DEV, leaving $MNT unmounted" >> "$LOG"
fi

Make the script executable:
```
sudo chmod +x /etc/cron.monthly/fscheck
```

Adjust the script to match your specific partitions and mount points.

Control Filesystem Check Frequency

For ext filesystems, you can configure when automatic checks occur:

sudo tune2fs -c 20 -i 3m /dev/sdb1

This sets checks to occur every 20 mounts or 3 months, whichever comes first.

To view current settings:

sudo tune2fs -l /dev/sdb1 | grep -Ei 'mount count|check interval'

To disable automatic checks based on mount count:

sudo tune2fs -c -1 /dev/sdb1

To disable time-based checks:

sudo tune2fs -i 0 /dev/sdb1

Note that disabling automatic checks is generally not recommended unless you have another maintenance strategy in place.
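
To see how close a filesystem is to its next mount-count check, the two counters from tune2fs -l can be subtracted. This is a rough sketch that assumes the "Mount count" and "Maximum mount count" labels printed by tune2fs -l; the function name mounts_until_check is hypothetical.

```shell
# Hypothetical helper: read "tune2fs -l" output on stdin and print how many
# mounts remain before a mount-count check, or "disabled" if the maximum
# is 0, -1, or missing.
mounts_until_check() {
  awk -F': *' '
    $1 == "Mount count"         { count = $2 }
    $1 == "Maximum mount count" { max = $2 }
    END {
      if (max <= 0) print "disabled"
      else print max - count
    }'
}

# Example (not run here):
#   sudo tune2fs -l /dev/sdb1 | mounts_until_check
```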

Use Proper Shutdown Procedures

Always shut down Linux systems properly to prevent filesystem corruption:

sudo shutdown -h now

Or:

sudo poweroff

Avoid pressing the power button or unplugging the system unless absolutely necessary.

Implement Power Protection

Consider using an Uninterruptible Power Supply (UPS) for your Linux systems. A UPS provides backup power during outages, allowing for proper shutdown instead of abrupt power loss.

For servers or critical systems, you can configure automatic shutdown during power failures:

Install the UPS management software:

sudo apt install apcupsd  # For APC UPSes

Or:

sudo apt install nut  # Network UPS Tools for various UPS brands

Configure the software to monitor your UPS and trigger a clean shutdown when battery power gets low.

Monitor Disk Health Proactively

Set up automated S.M.A.R.T. monitoring with the smartd daemon:

Edit the configuration file:
```
sudo nano /etc/smartd.conf
```
Add a line for each disk. For example:
```
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com
```
This applies the default set of checks to /dev/sda (-a), enables automatic offline tests (-o on) and attribute autosave (-S on), runs a short self-test every day at 2 AM and a long self-test every Saturday at 3 AM (-s), and emails alerts to admin@example.com (-m).

Enable and start the service:

sudo systemctl enable smartd
sudo systemctl start smartd

Troubleshooting Common Issues

Even with proactive maintenance, you may encounter specific issues requiring special attention. Here’s how to address common problems:

Superblock Errors

The superblock contains critical filesystem information. If damaged, you’ll see errors like “bad superblock” during boot or when mounting. Fortunately, ext filesystems maintain backup superblocks:

Find backup superblock locations:
```
sudo mke2fs -n /dev/sdb1
```
This shows information without creating a new filesystem.
Use a backup superblock for repair:
```
sudo e2fsck -b 32768 /dev/sdb1
```
Replace 32768 with one of the backup superblock locations from the previous command.

For serious superblock corruption:

sudo e2fsck -f -y -v -b 32768 /dev/sdb1

This forces a check, automatically repairs issues, provides verbose output, and uses the backup superblock at block 32768.
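
The backup locations can be extracted from the mke2fs -n output so that each one can be tried in turn. The sketch below assumes the output contains a "Superblock backups stored on blocks:" line followed by comma-separated block numbers and then a blank line; backup_superblocks is a name made up for this example. Keep in mind that mke2fs -n reports accurate locations only when it is given the same parameters (such as block size) that were used to create the filesystem.

```shell
# Hypothetical helper: read "mke2fs -n" output on stdin and print each
# backup superblock location on its own line.
backup_superblocks() {
  sed -n '/Superblock backups stored on blocks/,/^$/p' | grep -oE '[0-9]+'
}

# Example (not run here):
#   sudo mke2fs -n /dev/sdb1 2>&1 | backup_superblocks
```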

“Device is Busy” Errors

If you can’t unmount a filesystem due to “device is busy” errors:

Identify processes using the filesystem:

sudo fuser -m /mount/point

Or:

sudo lsof | grep /mount/point

Terminate those processes:
```
sudo kill PID
```
Replace PID with the process ID from the previous command.
For stubborn cases, use the force option (use with caution as it may cause data loss; it is intended mainly for unreachable network filesystems such as NFS):
```
sudo umount -f /mount/point
```
As a last resort on very stubborn mounts:
```
sudo umount -l /mount/point
```
The -l option performs a lazy unmount, detaching the filesystem immediately and cleaning up references when they’re no longer busy.

Interrupted fsck Processes

If an fsck check gets interrupted (by power loss or a system crash), the filesystem may be marked as “in use” or “dirty.” When you try to mount it, you might see messages about the filesystem being in an inconsistent state.

To resolve this:

Run a manual fsck with force option:
```
sudo fsck -f /dev/sdb1
```
If that doesn’t work, try:
```
sudo fsck -y -f /dev/sdb1
```
For more persistent issues:
```
sudo e2fsck -f -p -v /dev/sdb1
```

Severely Corrupted Filesystems

For severely corrupted filesystems where normal repair attempts fail:

Try more aggressive options:
```
sudo e2fsck -f -y -v /dev/sdb1
```
If that fails, consider data recovery before reformatting:
- Try tools like testdisk or photorec:
```
sudo apt install testdisk
sudo testdisk
```
- Consider professional data recovery services for critical data
As a last resort, if the filesystem is beyond repair and data has been backed up:
```
sudo mkfs.ext4 /dev/sdb1
```
This reformats the partition with a new ext4 filesystem.

Handling Read-Only Filesystems

Sometimes filesystems mount as read-only due to errors:

Check system logs for error messages:

dmesg | grep -i error

Or:

journalctl -xb | grep -i error

Remount in read-write mode after fixing errors:

sudo mount -o remount,rw /dev/sdb1 /mount/point

If remounting fails, a full filesystem check is likely required:

sudo umount /mount/point
sudo fsck -f /dev/sdb1
sudo mount /dev/sdb1 /mount/point
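
To spot filesystems that have silently switched to read-only, a script can scan the mounts table for the ro option. This is a minimal sketch: readonly_mounts is an illustrative name, it only considers sources under /dev, and its optional argument (an alternative mounts file) exists mainly for testing.

```shell
# Hypothetical helper: print "device mountpoint" for every /dev-backed
# filesystem mounted with the "ro" option. $1 defaults to /proc/mounts.
readonly_mounts() {
  awk '$1 ~ /^\/dev\// && $4 ~ /(^|,)ro(,|$)/ { print $1, $2 }' "${1:-/proc/mounts}"
}

# Example (not run here):
#   readonly_mounts
```

Intentionally read-only mounts, such as optical media, will also appear in the output.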

r00t