How To Scan and Repair Disk Errors on Linux
Maintaining a healthy filesystem is crucial for any Linux system’s stability and performance. Over time, your Linux disk drives can develop errors due to unexpected shutdowns, power failures, hardware issues, or general wear and tear. Left unaddressed, these errors can lead to data corruption, system instability, or even complete system failure. Fortunately, Linux provides powerful tools to scan, detect, and repair disk errors before they become critical issues.
This comprehensive guide will walk you through the process of scanning and repairing disk errors on Linux systems. Whether you’re a system administrator managing servers or a home user maintaining a personal Linux machine, these techniques will help you keep your storage devices in optimal condition and prevent data loss.
Understanding Disk Errors in Linux
Disk errors in Linux can manifest in various ways, and understanding their nature is the first step toward effective troubleshooting. These errors typically fall into two categories: logical errors (filesystem corruption) and physical errors (hardware issues).
Linux filesystems organize data using complex structures, including inodes, superblocks, and data blocks. When these structures become damaged or corrupted, your system might exhibit symptoms such as:
- Unexpected system freezes or crashes
- Files that suddenly become unreadable or corrupted
- Strange error messages during boot or operation
- Slow disk performance or excessive disk activity
- System failing to boot completely
- Input/output errors when accessing certain files
Several factors can contribute to disk errors in Linux environments:
- Improper system shutdowns (power outages, hard resets)
- Physical damage to storage devices
- Aging hardware (all storage media have a finite lifespan)
- Software bugs or filesystem driver issues
- Magnetic interference (for traditional HDDs)
- Bad sectors developing on the disk surface
Regular filesystem checks are essential preventive maintenance tasks for any Linux system. Most modern Linux distributions use journaling filesystems such as ext4 or XFS, or the copy-on-write filesystem Btrfs, all of which are more resilient to corruption than older filesystems. However, even these advanced filesystems can develop issues that require manual intervention.
Preparing for Disk Checks
Before diving into disk repair operations, proper preparation is essential to avoid further damage and ensure effective troubleshooting.
Identifying Your Disks and Partitions
The first step is to identify which disk or partition requires checking. Linux provides several commands to help you gather this information:
lsblk -o NAME,FSTYPE,SIZE,MOUNTPOINT,LABEL
This command displays a hierarchical view of all block devices with their filesystem types, sizes, mount points, and labels. The output will look something like:
NAME   FSTYPE  SIZE MOUNTPOINT LABEL
sda            500G
├─sda1 ext4     50G /          root
├─sda2 swap      8G [SWAP]     swap
└─sda3 ext4    442G /home      home
sdb              1T
└─sdb1 xfs       1T /data      data
Alternatively, you can use the df command to see disk usage for mounted filesystems:
df -h
For detailed partition information on a specific disk, use parted:
sudo parted /dev/sda print
To see the filesystem UUID and other detailed information, you can use:
sudo blkid
Take note of the device names (like /dev/sda1) of the partitions you need to check and repair.
Unmounting the Filesystem
Critical warning: Most filesystem check and repair operations require the target filesystem to be unmounted. Performing checks on mounted filesystems can lead to data corruption or loss.
To unmount a filesystem, use the umount command:
sudo umount /dev/sdb1
If you’re unmounting by mount point instead of device name:
sudo umount /data
You can verify whether the unmount was successful by checking the mount or lsblk output again.
If the system indicates that the filesystem is busy, you may need to identify and close applications using the filesystem:
sudo lsof /data
For filesystems that are in constant use, like the root filesystem, special procedures are required, which we’ll cover later in this guide.
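Before running any checker from a script, it helps to refuse outright when the target is still mounted. The following is a minimal sketch. It assumes bash and util-linux's findmnt, and require_unmounted is a hypothetical helper name, not a standard command.

```shell
# Hypothetical helper: succeed only if DEVICE is not currently mounted.
# Assumes findmnt (util-linux), which exits non-zero when no mount matches.
require_unmounted() {
    local dev="$1"
    if [ -z "$dev" ]; then
        echo "usage: require_unmounted DEVICE" >&2
        return 2
    fi
    # Fail closed: without findmnt we cannot tell, so never report "safe"
    if ! command -v findmnt >/dev/null 2>&1; then
        echo "findmnt not found; cannot verify mount state" >&2
        return 3
    fi
    # -S/--source matches the mount source (a device path, UUID=, LABEL=, ...)
    if findmnt -rn -S "$dev" >/dev/null 2>&1; then
        echo "$dev is still mounted; unmount it first" >&2
        return 1
    fi
    return 0
}
```

For example, from a root shell: require_unmounted /dev/sdb1 && fsck /dev/sdb1. If findmnt is missing, the helper returns an error instead of assuming the device is unmounted.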
The Primary Tool: fsck (File System Consistency Check)
The fsck (File System Consistency Check) utility is the primary tool for checking and repairing filesystem errors in Linux. It serves as a front-end for filesystem-specific checkers, automatically detecting the filesystem type and calling the appropriate checker.
Understanding fsck
The fsck utility performs several critical functions:
- Checks filesystem integrity and consistency
- Detects errors in the filesystem structure
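- Repairs corrupted filesystem metadata such as inodes, superblocks, and block allocation bitmaps
- Fixes directory structure issues
- Recovers orphaned files (files without proper directory entries), which on ext filesystems are reconnected in the lost+found directory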
- Corrects file and directory counts
Think of fsck as the Linux equivalent of Windows’ chkdsk utility, but with more flexibility and advanced options.
Basic fsck Usage
The simplest form of the fsck command is:
sudo fsck /dev/sdb1
This checks the specified partition and reports any errors found. If errors are detected, fsck will prompt you for confirmation before making repairs.
For a more informative output, add the verbose flag:
sudo fsck -v /dev/sdb1
To specify the filesystem type explicitly (useful if automatic detection fails):
sudo fsck -t ext4 /dev/sdb1
Understanding fsck Error Codes
After running, fsck returns an exit code that indicates the outcome of the check. Understanding these codes helps interpret the results:
| Code | Meaning |
|---|---|
| 0 | No errors were found |
| 1 | Filesystem errors were corrected |
| 2 | System should be rebooted |
| 4 | Filesystem errors were left uncorrected |
| 8 | Operational error occurred |
| 16 | Usage or syntax error |
| 32 | Checking was canceled by user request |
| 128 | Shared-library error |
You can check the return code after running fsck with:
echo $?
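The exit code is the sum of the applicable values in the table, so a code of 3, for example, means errors were corrected (1) and the system should be rebooted (2). A return code of 0 or 1 generally indicates success, while any other value requires further attention.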
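Because fsck's exit status adds together the values in the table above, a small function can split it back into readable messages. This is a minimal sketch in bash. describe_fsck_status is a hypothetical name, and its messages simply mirror the table.

```shell
# Hypothetical helper: print the meaning of each value included in an fsck exit code.
# The exit code is a sum (bitwise OR) of the values in the table above.
describe_fsck_status() {
    local code="$1" bit
    if [ "$code" -eq 0 ]; then
        echo "No errors were found"
        return 0
    fi
    for bit in 1 2 4 8 16 32 128; do
        if [ $((code & bit)) -ne 0 ]; then
            case "$bit" in
                1)   echo "Filesystem errors were corrected" ;;
                2)   echo "System should be rebooted" ;;
                4)   echo "Filesystem errors were left uncorrected" ;;
                8)   echo "Operational error occurred" ;;
                16)  echo "Usage or syntax error" ;;
                32)  echo "Checking was canceled by user request" ;;
                128) echo "Shared-library error" ;;
            esac
        fi
    done
    return 0
}
```

For example, run describe_fsck_status $? immediately after fsck. A code of 12 prints both the "left uncorrected" (4) and "operational error" (8) lines.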
Advanced fsck Options
For automatic repair without prompts (useful for scripts):
sudo fsck -y /dev/sdb1
The -y flag automatically answers “yes” to all repair prompts. Use this option with caution, as it will make changes without asking for confirmation.
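For interactive repair with more control, leave out -y (and -p). On ext filesystems the checker then runs interactively by default and asks for confirmation before each repair, as in the basic example above.
Note that -r does not enable interactive mode. util-linux fsck uses -r to report statistics such as the exit status and run time of each check, and e2fsck accepts -r only for backward compatibility.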
To check all filesystems listed in /etc/fstab (entries whose sixth field, fs_passno, is 0 are skipped):
sudo fsck -A
To skip the root filesystem when checking all filesystems:
sudo fsck -AR
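To see which checker commands fsck would run, without actually running them (dry run):
sudo fsck -N /dev/sdb1
To actually check a filesystem without changing it, use -n instead. fsck passes it to the filesystem-specific checker, and e2fsck then opens the filesystem read-only and answers “no” to every repair prompt:
sudo fsck -n /dev/sdb1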
For a thorough check that forces checking even if the filesystem appears clean:
sudo fsck -f /dev/sdb1
Filesystem-Specific Tools
While fsck provides a universal interface for checking filesystems, Linux also offers specialized tools for specific filesystem types. These tools often provide more options and better control for their respective filesystems.
Checking ext2/ext3/ext4 Filesystems
The e2fsck tool is designed specifically for the ext family of filesystems (ext2, ext3, and ext4), which are among the most common in Linux systems.
For a basic check with verbose output:
sudo e2fsck -v /dev/sdb1
To force a complete check even if the filesystem appears clean:
sudo e2fsck -f /dev/sdb1
For automatic repair without prompts:
sudo e2fsck -p /dev/sdb1
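The -p (“preen”) flag automatically fixes problems that can be safely repaired without user intervention. If it encounters a problem that needs an administrator’s judgment, it describes the problem and exits with 4 included in its exit code, leaving that problem unrepaired.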
For a more aggressive approach that automatically answers “yes” to all questions:
sudo e2fsck -y /dev/sdb1
To display the progress of the check in real-time (useful for large filesystems):
sudo e2fsck -C0 /dev/sdb1
The -C0 flag shows a progress bar during the check, making it easier to monitor on large partitions.
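The options above combine naturally into a cautious sequence: a read-only pass first, then an automatic preen only if that pass finds problems. This is a minimal sketch. It assumes bash, a root shell, and an unmounted ext2/ext3/ext4 filesystem. check_then_preen is a hypothetical helper, and its exit-code handling follows the fsck table earlier in this guide.

```shell
# Hypothetical helper: read-only check first, then an automatic "preen" repair.
# Assumes DEVICE is an unmounted ext2/ext3/ext4 filesystem and we run as root.
check_then_preen() {
    local dev="$1" rc=0
    # -f forces a full check; -n opens read-only and answers "no" to every prompt
    e2fsck -f -n "$dev" || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "$dev: clean"
        return 0
    fi
    # Codes of 8 or more mean the check itself could not complete
    if [ "$rc" -ge 8 ]; then
        echo "$dev: e2fsck could not check the filesystem (exit $rc)" >&2
        return "$rc"
    fi
    echo "$dev: problems found (exit $rc), attempting automatic repair"
    rc=0
    # -p repairs only what is safe to fix without human intervention
    e2fsck -f -p "$dev" || rc=$?
    if [ $((rc & 4)) -ne 0 ]; then
        echo "$dev: errors left uncorrected; rerun e2fsck -f interactively" >&2
    fi
    return "$rc"
}
```

Because sudo cannot run shell functions, define and call this from a root shell (for example after sudo -i): check_then_preen /dev/sdb1.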
Checking XFS Filesystems
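XFS filesystems, often used in enterprise environments and for large storage arrays, require different tools for maintenance. Running fsck on an XFS filesystem invokes fsck.xfs, which normally does nothing and exits successfully, because XFS replays its journal at mount time. The primary utility for checking and repairing XFS filesystems is xfs_repair.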
To check an XFS filesystem without performing any repairs:
sudo xfs_repair -n /dev/sdb1
The -n flag performs a check without modifying the filesystem, similar to a dry run.
To perform repairs on an XFS filesystem:
sudo xfs_repair /dev/sdb1
For verbose output with detailed information about the repair process:
sudo xfs_repair -v /dev/sdb1
For even more detailed output:
sudo xfs_repair -v -v /dev/sdb1
Each added -v increases the verbosity level.
Important: Always unmount XFS filesystems before checking them. xfs_repair normally refuses to operate on a mounted filesystem.
After completing the check and repair process, you can remount the filesystem:
sudo mount -a
This command mounts all filesystems listed in /etc/fstab that aren’t already mounted.
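Because plain fsck does nothing useful on XFS, a script that handles both filesystem families has to pick the tool itself. This is a minimal sketch. It assumes bash, a root shell, and blkid from util-linux. check_readonly is a hypothetical name, and it handles only ext2/3/4 and XFS.

```shell
# Hypothetical helper: run the matching no-modify checker for a filesystem type.
# Uses blkid to detect the type; only ext2/3/4 and XFS are handled here.
check_readonly() {
    local dev="$1" fstype
    fstype=$(blkid -o value -s TYPE "$dev") || {
        echo "$dev: could not detect filesystem type" >&2
        return 8
    }
    case "$fstype" in
        ext2|ext3|ext4) e2fsck -f -n "$dev" ;;   # forced, read-only check
        xfs)            xfs_repair -n "$dev" ;;  # no-modify mode
        *)
            echo "$dev: unsupported filesystem type '$fstype'" >&2
            return 8
            ;;
    esac
}
```

Both branches run in no-modify mode, so the helper only reports problems. Follow up with the repair commands for whichever tool it selected.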
Dealing with Bad Sectors
Bad sectors are physical areas of a storage device that have become damaged and can no longer reliably store data. These defects can cause data corruption and system instability if not properly managed.
Detecting Bad Sectors
The badblocks utility is specifically designed to scan storage devices for bad sectors:
sudo badblocks -v /dev/sdb
This command performs a read-only test and displays all bad blocks found. The -v flag provides verbose output during the scan.
For a more thorough check, run a non-destructive read-write test. badblocks refuses to run this test on a mounted device, and it is considerably slower than the read-only test:
sudo badblocks -nsv /dev/sdb
The -n flag performs a non-destructive read-write test, -s shows progress, and -v provides verbose output.
Warning: For the most thorough test, you can use a destructive write test, but this will erase all data on the device:
sudo badblocks -wsv /dev/sdb
The -w flag performs a destructive write test. Only use this on disks with no valuable data or after backing up all data.
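To save the list of bad blocks to a file for later use with e2fsck, scan the partition rather than the whole disk. Also pass the filesystem’s block size with -b, because e2fsck interprets the numbers as filesystem blocks. First look up the block size:
sudo tune2fs -l /dev/sdb1 | grep 'Block size'
Then write the list with -o (this example assumes a 4096-byte block size):
sudo badblocks -b 4096 -v -o bad-blocks.txt /dev/sdb1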
Repairing Bad Sectors
While physical bad sectors cannot be truly “repaired,” Linux can mark them as unusable to prevent data corruption. The e2fsck command can automatically handle bad sectors when used with specific options:
sudo e2fsck -c -v /dev/sdb1
The -c flag tells e2fsck to run badblocks in read-only mode and mark any bad blocks as unusable.
For a more thorough check with a read-write test:
sudo e2fsck -cc -v /dev/sdb1
Using -cc runs a more thorough, non-destructive read-write test with badblocks.
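If you’ve already run badblocks on the partition and saved the output to a file, you can use:
sudo e2fsck -l bad-blocks.txt /dev/sdb1
The -l flag adds the block numbers listed in the file to the filesystem’s bad block list. Those numbers are interpreted as filesystem blocks, so the list must come from a scan of the partition itself (not the whole disk) using the filesystem’s block size (badblocks -b). Otherwise, the wrong blocks are marked. The -c option avoids this pitfall because e2fsck then runs badblocks with the correct parameters itself.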
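The scan and the marking step are safest when they share the same block size, so they can be wrapped together. This is a minimal sketch. It assumes bash, a root shell, and an unmounted ext filesystem. fs_block_size and scan_and_mark_bad_blocks are hypothetical helper names. In practice, e2fsck -c does the same job in one step.

```shell
# Hypothetical helper: print the block size recorded in an ext superblock.
# tune2fs -l prints a line such as "Block size:               4096".
fs_block_size() {
    tune2fs -l "$1" | awk -F: '/^Block size:/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Hypothetical helper: scan a partition with a matching block size, then
# add any bad blocks found to the filesystem's bad block list.
scan_and_mark_bad_blocks() {
    local dev="$1" list bs rc=0
    bs=$(fs_block_size "$dev")
    if [ -z "$bs" ]; then
        echo "$dev: could not read the filesystem block size" >&2
        return 8
    fi
    list=$(mktemp) || return 8
    # -b matches the filesystem block size; -o writes the bad block list to a file
    if ! badblocks -b "$bs" -s -v -o "$list" "$dev"; then
        rm -f "$list"
        return 8
    fi
    # -l adds the listed block numbers to the bad block list
    e2fsck -f -l "$list" "$dev" || rc=$?
    rm -f "$list"
    return "$rc"
}
```

For example, from a root shell: scan_and_mark_bad_blocks /dev/sdb1. The return value is e2fsck's exit code, so the fsck table above applies.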
Important: A growing number of bad sectors often indicates impending drive failure. If your drive reports multiple bad sectors, especially if the number increases over time, consider backing up your data and replacing the drive soon. Regular S.M.A.R.T. monitoring (covered later) can help you track this trend.
Checking and Repairing the Root Filesystem
Checking the root filesystem presents a unique challenge because it cannot be unmounted while the system is running. Linux provides several methods to address this limitation.
Method 1: Using Force Check at Boot
The simplest approach is to schedule a filesystem check during the next system boot:
sudo touch /forcefsck
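This creates an empty file named forcefsck in the root directory. On systems that honor this file, the next boot detects it and runs fsck on the root filesystem early in the boot process. Recent systemd versions treat /forcefsck as deprecated and may ignore it, so on systemd-based distributions prefer the fsck.mode=force kernel parameter.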
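On systemd-based distributions (most modern ones), boot-time checks are run by systemd-fsck. It is controlled by kernel parameters, not by enabling a service. To force a check on the next boot, press e at the GRUB menu, append the following to the line that starts with linux, and boot with Ctrl+X or F10:
fsck.mode=force fsck.repair=yes
fsck.mode=force checks filesystems even if they are marked clean, and fsck.repair=yes answers “yes” to all repair questions. Because the change is made only in the boot menu, it applies to that single boot.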
Method 2: Using Live Media
For more severe issues, booting from a Linux live USB or DVD provides full access to your system’s disks while they’re unmounted:
- Create a bootable Linux live media (Ubuntu, Fedora, or specialized rescue distributions like SystemRescue)
- Boot your computer from this media
- Open a terminal in the live environment
- Identify your root partition:
lsblk
- Run fsck on the unmounted root partition:
sudo fsck -f -y /dev/sda1
(Replace /dev/sda1 with your actual root partition.)
- After completion, reboot into your regular system
This method provides the most thorough check since the filesystem is completely unmounted and not in use.
Method 3: Using Recovery Mode
Many Linux distributions include a recovery or maintenance mode that can be accessed from the boot menu:
- Reboot your computer
- Access the GRUB menu (usually by holding Shift during boot on BIOS systems, or by pressing Esc on UEFI systems)
- Select recovery mode or advanced options
- Choose “fsck” or “root shell” from the recovery menu
- If you choose root shell, the system will likely mount the root filesystem as read-only, allowing you to run:
fsck -f /dev/sda1
- After completion, reboot with the command:
reboot
This method doesn’t require additional boot media but may not provide as complete access as a live environment.
Checking S.M.A.R.T. Disk Health
Beyond filesystem errors, monitoring the physical health of your storage devices is crucial. Modern storage devices include Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.), which provides valuable insights into drive health and can predict impending failures.
Installing smartmontools
First, install the required package:
# For Debian/Ubuntu-based distributions
sudo apt update
sudo apt install smartmontools
# For Fedora/RHEL-based distributions
sudo dnf install smartmontools
# For Arch Linux
sudo pacman -S smartmontools
# For openSUSE
sudo zypper install smartmontools
Basic S.M.A.R.T. Health Check
To check if a drive supports S.M.A.R.T. and verify its basic health status:
sudo smartctl -i -H /dev/sda
The -i flag displays drive information, and -H performs a health check. The output will include a line like:
SMART overall-health self-assessment test result: PASSED
Or, for failing drives:
SMART overall-health self-assessment test result: FAILED
A “FAILED” result indicates serious problems, and you should back up your data immediately and consider replacing the drive.
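For scripts or cron jobs, the PASSED/FAILED line can be turned into an exit status. This is a minimal sketch. It assumes bash and a root shell. smart_health_ok is a hypothetical name. It accepts the ATA wording shown above as well as the "SMART Health Status: OK" line that smartctl prints for SCSI/SAS devices.

```shell
# Hypothetical helper: return 0 only when smartctl reports a passing health check.
smart_health_ok() {
    local dev="$1" out
    # smartctl's own exit status is a bitmask, so inspect the text instead
    out=$(smartctl -H "$dev") || true
    case "$out" in
        *"self-assessment test result: PASSED"*|*"SMART Health Status: OK"*)
            return 0 ;;
        *)
            echo "$dev: SMART health check did not pass" >&2
            return 1 ;;
    esac
}
```

For example: smart_health_ok /dev/sda || echo "back up this drive now".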
Comprehensive S.M.A.R.T. Data
For detailed information about your drive’s health:
sudo smartctl -a /dev/sda
This displays all S.M.A.R.T. attributes tracked by the drive, including:
- Raw read error rate
- Spin-up time
- Start/stop count
- Reallocated sector count
- Seek error rate
- Power-on hours
- Temperature
- Current pending sectors
- Offline uncorrectable sectors
To run a short self-test on the drive:
sudo smartctl -t short /dev/sda
This initiates a brief diagnostic that checks the drive’s mechanical and electrical components. For a more thorough examination:
sudo smartctl -t long /dev/sda
A long test can take several hours but provides a comprehensive assessment of the drive’s condition.
After the test completes, view the results with:
sudo smartctl -l selftest /dev/sda
Interpreting S.M.A.R.T. Data
When analyzing S.M.A.R.T. data, pay particular attention to these critical attributes:
- Reallocated Sectors Count: Indicates how many sectors have been remapped due to errors. Any non-zero value warrants monitoring, and a growing count suggests drive deterioration.
- Current Pending Sectors: Sectors waiting to be remapped. A non-zero value here often indicates problems.
- Uncorrectable Sectors: Sectors that couldn’t be read or written, even after error correction. Any uncorrectable sectors are cause for concern.
- Command Timeout: Indicates instances where drive commands failed to complete in time. Frequent timeouts can point to a failing drive, but also to cabling or power-supply problems.
- Power-On Hours: Shows the drive’s total operating time. While not directly indicating problems, older drives are generally more prone to failure.
A steady increase in any error-related attribute typically indicates progressive drive deterioration. If you notice this pattern, consider backing up your data and planning for drive replacement.
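The three sector-related attributes above can be pulled out of smartctl -A output automatically, which makes trends easier to track. This is a minimal sketch. It assumes the default ATA attribute table, in which the ID is the first column and RAW_VALUE the tenth. check_sector_attributes is a hypothetical name. NVMe drives report health in a different format that this sketch does not parse.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin, print the raw values
# of the sector-related attributes, and exit 1 if any of them is non-zero.
check_sector_attributes() {
    awk '
        # 5 = Reallocated_Sector_Ct, 197 = Current_Pending_Sector,
        # 198 = Offline_Uncorrectable; RAW_VALUE is the tenth column
        $1 == 5 || $1 == 197 || $1 == 198 {
            print $2 "=" $10
            if ($10 + 0 > 0) bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

For example: sudo smartctl -A /dev/sda | check_sector_attributes || echo "sector problems reported". Saving the printed values on each run lets you see whether they grow over time.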
Preventive Maintenance and Best Practices
Proactive maintenance helps prevent disk errors and data loss. Implementing these best practices will significantly improve your system’s reliability.
Regular Scheduled Checks
Set up periodic filesystem checks to catch issues before they become serious:
- Create a monthly check script:
sudo nano /etc/cron.monthly/fscheck
- Add the following content:
#!/bin/bash
# Monthly filesystem check for /dev/sdb1, normally mounted at /data.
# Adjust DEVICE and MOUNTPOINT to match your system.
DEVICE=/dev/sdb1
MOUNTPOINT=/data
LOG=/var/log/fscheck.log

# Log start time (overwrites the previous run's log)
echo "Starting filesystem check at $(date)" > "$LOG"

# Unmount first if the filesystem is mounted; abort if that fails
if mountpoint -q "$MOUNTPOINT"; then
    if ! umount "$MOUNTPOINT"; then
        echo "Failed to unmount $MOUNTPOINT, check aborted" >> "$LOG"
        exit 1
    fi
fi

# -y answers "yes" to every repair prompt; record fsck's exit code as well
fsck -y "$DEVICE" >> "$LOG" 2>&1
echo "fsck exit code: $?" >> "$LOG"

# Remount using the /etc/fstab entry for the mount point
mount "$MOUNTPOINT"
echo "Filesystem check completed at $(date)" >> "$LOG"
- Make the script executable:
sudo chmod +x /etc/cron.monthly/fscheck
Adjust the script to match your specific partitions and mount points.
Control Filesystem Check Frequency
For ext filesystems, you can configure when automatic checks occur:
sudo tune2fs -c 20 -i 3m /dev/sdb1
This sets checks to occur every 20 mounts or 3 months, whichever comes first.
To view current settings:
sudo tune2fs -l /dev/sdb1 | grep -Ei 'mount count|check interval|last checked'
To disable automatic checks based on mount count:
sudo tune2fs -c -1 /dev/sdb1
To disable time-based checks:
sudo tune2fs -i 0 /dev/sdb1
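Note that disabling automatic checks is generally not recommended unless you have another maintenance strategy in place. Recent e2fsprogs releases create new filesystems with both checks disabled by default, so verify the current values before assuming periodic checks are active.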
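To see at a glance whether either automatic check is active, you can summarize the relevant tune2fs -l fields. This is a minimal sketch. It assumes bash and a root shell. periodic_check_status is a hypothetical name. It treats a maximum mount count of 0 or -1, and a check interval of 0, as disabled, matching the tune2fs settings described above.

```shell
# Hypothetical helper: summarize the periodic-check settings of an ext filesystem.
periodic_check_status() {
    tune2fs -l "$1" | awk -F: '
        /^Mount count:/         { gsub(/[[:space:]]/, "", $2); count = $2 }
        /^Maximum mount count:/ { gsub(/[[:space:]]/, "", $2); max = $2 }
        # e.g. "Check interval:  7776000 (3 months)"; keep only the seconds
        /^Check interval:/      { split($2, parts, " "); interval = parts[1] }
        END {
            if (max + 0 <= 0) print "mount-count check: disabled (" count " mounts so far)"
            else print "mount-count check: " count " of " max " mounts"
            if (interval + 0 == 0) print "time-based check: disabled"
            else print "time-based check: every " interval " seconds"
        }
    '
}
```

For example, from a root shell: periodic_check_status /dev/sdb1.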
Use Proper Shutdown Procedures
Always shut down Linux systems properly to prevent filesystem corruption:
sudo shutdown -h now
Or:
sudo poweroff
Avoid pressing the power button or unplugging the system unless absolutely necessary.
Implement Power Protection
Consider using an Uninterruptible Power Supply (UPS) for your Linux systems. A UPS provides backup power during outages, allowing for proper shutdown instead of abrupt power loss.
For servers or critical systems, you can configure automatic shutdown during power failures:
- Install the UPS management software:
sudo apt install apcupsd # For APC UPSes
Or:
sudo apt install nut # Network UPS Tools for various UPS brands
- Configure the software to monitor your UPS and trigger a clean shutdown when battery power gets low.
Monitor Disk Health Proactively
Set up automated S.M.A.R.T. monitoring with the smartd daemon:
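- Edit the configuration file (/etc/smartd.conf on most distributions; /etc/smartmontools/smartd.conf on Fedora and RHEL):
sudo nano /etc/smartd.conf
- Add a line for each disk. If the file contains an active DEVICESCAN line, comment it out, because smartd ignores any lines that follow DEVICESCAN. For example: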
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com
This monitors /dev/sda with the default set of checks (-a) and enables automatic offline testing (-o on) and attribute autosave (-S on). It also runs a short self-test every day at 2 AM and a long self-test every Saturday at 3 AM, and emails alerts to admin@example.com.
- Enable and start the service:
sudo systemctl enable smartd
sudo systemctl start smartd
(On Debian and Ubuntu, the unit may be named smartmontools instead of smartd.)
Troubleshooting Common Issues
Even with proactive maintenance, you may encounter specific issues requiring special attention. Here’s how to address common problems:
Superblock Errors
The superblock contains critical filesystem information. If damaged, you’ll see errors like “bad superblock” during boot or when mounting. Fortunately, ext filesystems maintain backup superblocks:
- Find backup superblock locations:
sudo mke2fs -n /dev/sdb1
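The -n flag makes mke2fs display what it would do without creating a filesystem. Double-check it, because the same command without -n formats the partition. The reported locations are only accurate if mke2fs would choose the same parameters, such as block size, as when the filesystem was created.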
- Use a backup superblock for repair:
sudo e2fsck -b 32768 /dev/sdb1
Replace 32768 with one of the backup superblock locations from the previous command (32768 is the usual first backup on filesystems with 4 KiB blocks).
For serious superblock corruption:
sudo e2fsck -f -y -v -b 32768 /dev/sdb1
This forces a check, automatically repairs issues, provides verbose output, and uses the backup superblock at block 32768.
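When the first backup also fails, trying the remaining locations by hand is tedious. This is a minimal sketch. It assumes bash, a root shell, and the mke2fs -n output format in which the numbers follow a "Superblock backups stored on blocks:" line. backup_superblocks and find_usable_superblock are hypothetical names. Each attempt uses e2fsck -n, so nothing is modified. An exit code below 8 is taken to mean that e2fsck could open the filesystem with that backup.

```shell
# Hypothetical helper: extract backup superblock numbers from saved
# `mke2fs -n` output read on stdin, one number per line.
backup_superblocks() {
    awk '
        /Superblock backups stored on blocks:/ { f = 1; sub(/.*blocks:/, "") }
        f && seen && /^[[:space:]]*$/ { exit }
        f {
            n = split($0, parts, "[^0-9]+")
            for (i = 1; i <= n; i++) if (parts[i] != "") { print parts[i]; seen = 1 }
        }
    '
}

# Hypothetical helper: print the first backup (given as arguments) that
# e2fsck can open read-only.
find_usable_superblock() {
    local dev="$1" sb rc
    shift
    for sb in "$@"; do
        rc=0
        # -n keeps the filesystem untouched; -b selects the backup superblock
        e2fsck -n -b "$sb" "$dev" >/dev/null 2>&1 || rc=$?
        # Codes below 8 mean e2fsck could open the filesystem with this backup
        if [ "$rc" -lt 8 ]; then
            echo "$sb"
            return 0
        fi
    done
    return 1
}
```

For example, after running mke2fs -n /dev/sdb1 on its own and noting the backup list, run find_usable_superblock /dev/sdb1 32768 98304 163840. Then pass the printed location to e2fsck -b for the actual repair. If you saved the mke2fs output to a file, backup_superblocks reads the numbers from it on stdin.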
“Device is Busy” Errors
If you can’t unmount a filesystem due to “device is busy” errors:
- Identify processes using the filesystem:
sudo fuser -m /mount/point
Or:
sudo lsof | grep /mount/point
- Terminate those processes:
sudo kill PID
Replace PID with the process ID from the previous command.
- For stubborn cases, use the force option (use with caution as it may cause data loss):
sudo umount -f /mount/point
- As a last resort on very stubborn mounts:
sudo umount -l /mount/point
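The -l option performs a lazy unmount. It detaches the filesystem from the directory tree immediately and cleans up references once it is no longer busy. Because the filesystem remains in use until those processes exit, do not run fsck on it until they are gone.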
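The first steps above can be scripted so that a failed unmount immediately shows which processes are holding the filesystem, without escalating to -f or -l automatically. This is a minimal sketch. It assumes bash, a root shell, and fuser from the psmisc package. unmount_or_report is a hypothetical name.

```shell
# Hypothetical helper: try a normal unmount; on failure, list the processes
# that still use the mount point (fuser -v -m) and return non-zero.
unmount_or_report() {
    local mp="$1"
    if umount "$mp"; then
        return 0
    fi
    echo "$mp is busy; processes using it:" >&2
    # fuser writes its verbose table to stderr; send the PID list there too
    fuser -v -m "$mp" >&2 || true
    return 1
}
```

For example, from a root shell: unmount_or_report /mount/point && fsck -f /dev/sdb1.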
Interrupted fsck Processes
If an fsck check gets interrupted (by power loss or a system crash), the filesystem may be marked as “in use” or “dirty.” When you try to mount it, you might see messages about the filesystem being in an inconsistent state.
To resolve this:
- Run a manual fsck with force option:
sudo fsck -f /dev/sdb1
- If that doesn’t work, try:
sudo fsck -y -f /dev/sdb1
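- For more persistent issues on ext filesystems, run e2fsck directly with verbose output and answer its prompts yourself, so you can see exactly what it reports:
sudo e2fsck -f -v /dev/sdb1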
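To see whether an ext filesystem is still flagged after an interrupted check, you can read the state recorded in its superblock. This is a minimal sketch. It assumes bash and a root shell. fs_state_clean is a hypothetical name. tune2fs -l reports states such as "clean", "not clean", or "clean with errors".

```shell
# Hypothetical helper: print the "Filesystem state" recorded in an ext
# superblock and succeed only when it is exactly "clean".
fs_state_clean() {
    local state
    state=$(tune2fs -l "$1" | sed -n 's/^Filesystem state:[[:space:]]*//p')
    echo "$1: ${state:-unknown}"
    [ "$state" = "clean" ]
}
```

For example, from a root shell: fs_state_clean /dev/sdb1 || fsck -f /dev/sdb1.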
Severely Corrupted Filesystems
For severely corrupted filesystems where normal repair attempts fail:
- Try a forced full check that answers “yes” to every repair (e2fsck accepts only one of -p, -n, or -y at a time):
sudo e2fsck -f -y -v /dev/sdb1
- If that fails, consider data recovery before reformatting:
- Try tools like testdisk or photorec (photorec is installed with the testdisk package):
sudo apt install testdisk
sudo testdisk
- Consider professional data recovery services for critical data
- As a last resort, if the filesystem is beyond repair and data has been backed up:
sudo mkfs.ext4 /dev/sdb1
This reformats the partition with a new ext4 filesystem.
Handling Read-Only Filesystems
Sometimes filesystems mount as read-only due to errors:
- Check system logs for error messages:
dmesg | grep -i error
Or:
journalctl -xb | grep -i error
- Remount in read-write mode after fixing errors:
sudo mount -o remount,rw /dev/sdb1 /mount/point
- If remounting fails, a full filesystem check is likely required:
sudo umount /mount/point
sudo fsck -f /dev/sdb1
sudo mount /dev/sdb1 /mount/point
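Whether a filesystem has been remounted read-only can be checked from its current mount options. This is a minimal sketch. It assumes bash and util-linux's findmnt. has_ro_option and mount_is_readonly are hypothetical names. Matching whole comma-separated options matters because a writable mount can still carry errors=remount-ro, which only names the action to take on errors.

```shell
# Hypothetical helper: true if a comma-separated option string contains "ro".
has_ro_option() {
    case ",$1," in
        *,ro,*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Hypothetical helper: true if the given mount point is currently read-only.
# findmnt -n -o OPTIONS -M prints the options of exactly that mount point.
mount_is_readonly() {
    local opts
    opts=$(findmnt -n -o OPTIONS -M "$1") || return 2
    has_ro_option "$opts"
}
```

For example: mount_is_readonly /mount/point && dmesg | grep -i error.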