Troubleshooting File System Problems
A corrupt or damaged file system can cause problems ranging from data loss to an unbootable system. Smart IT pros will therefore take steps to maintain their servers' file systems and will know how to systematically troubleshoot disks when things go wrong. This article discusses preventive disk maintenance and provides some tips for using various tools to maintain and troubleshoot file systems on Windows servers.
Seven Golden Rules for Disk Maintenance
Let's begin with a proactive approach to file system maintenance. What steps should an administrator take to help prevent file system problems from happening in the first place? Here are my seven golden rules on the subject, in no particular order:
1. Upgrade your servers to Windows Server 2003. There's real value in doing this as far as disk maintenance is concerned, for example:
- The chkdsk command in Windows Server 2003 runs a lot faster than the Windows 2000 version of this utility, plus it can fix things like a corrupt Master File Table (MFT) that the previous version of the utility would choke on.
- Powerful new command-line tools like DiskPart.exe, Fsutil.exe and Defrag.exe give you more flexibility for managing disks from the command line instead of the GUI. These tools can be scripted to automate the disk management tasks you need to perform on a regular basis.
- The new Automated System Recovery (ASR) feature greatly simplifies the task of restoring your system/boot volume in the event of catastrophic disk failure.
2. Use hardware redundancy. RAID 1 disk mirroring lets you recover from catastrophic system volume failure with zero downtime, while RAID 5 is a great way of protecting your data volumes. Windows servers include built-in support for software RAID, but you'll get better performance and true hot-swap redundancy by investing more money and buying a hardware RAID controller for your system instead. Don't forget to keep a few spare drives handy so you can swap them in during an emergency; redundancy is useless if you don't have the replacement hardware around to use. Note that if you do choose to go with the software RAID provided by Windows, mirroring your boot and system volumes requires that these volumes be one and the same, i.e. one volume is both your boot volume (contains operating system files) and your system volume (contains hardware-specific boot files).
3. Use a good antivirus program. Viruses can be nasty, and one of the things they can do when they infect a machine is to corrupt the Master Boot Record (MBR) and other critical portions of your hard drives. Not only should you have AV installed on your servers, you should also avoid risky behaviors such as running scripts from untrusted sources, browsing the web, and so on. These are just the kinds of behavior that can lead to infecting your system, so avoid doing things like this on your production servers.
4. Defragment your file systems on a regular basis. This is especially important on servers that handle a high number of transactional operations, as their file systems can quickly become fragmented, dragging down the performance of applications running on the server. To perform a successful defrag you should really have at least 15% free space left on your disk, so make sure you don't let critical system or data disks fill up too much or they'll be harder to maintain. The new command-line Defrag.exe tool of Windows Server 2003 is useful here since you can use the Schtasks.exe command to schedule it to run regularly during off-hours instead of having to defrag manually or buy a third-party defrag tool.
5. Run chkdsk /r on a regular basis. This command finds bad sectors on your disk and tries to recover readable data from them and move it elsewhere. You can run this command either from a command-prompt window or from the Recovery Console if you can't boot your system normally. Remember that when you try to run chkdsk.exe on your system or boot volume, Windows configures autochk.exe (the boot version of chkdsk.exe) to run at your next reboot. This means you'll need to schedule downtime for your server when you perform this kind of maintenance so that autochk.exe can run.
6. Check your event logs regularly for any disk-related events. Windows sometimes determines on its own when a disk is "dirty" i.e. there are file system errors present on it. In that case, Windows automatically schedules autochk.exe to run at the next reboot, but it also writes an event to the Application log using either the source name "Chkdsk" or "Winlogon". So filter your Application log to view these kinds of events on a regular basis or collect them using Microsoft Operations Manager (MOM) or whatever other systems management tool you use on your network.
7. Back up all your volumes regularly. As a last recourse in the event of a disaster, having working backups of both your system/boot volume and data volumes is critical. ASR in Windows Server 2003 makes backing up the boot/system volume easier, while backing up your data volumes can be done using the Windows Backup (ntbackup.exe) tool or any other backup tool such as one from a third-party vendor. Whatever way you choose to back up your system, do it regularly and verify your backups to ensure you can recover your system using them.
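Rule 4 above lends itself to a little automation. The following Python sketch shows one way to check the 15% free-space guideline and to assemble a Schtasks.exe command line that runs Defrag.exe weekly. The function names, the task name, the day, and the start time are all illustrative assumptions rather than anything prescribed here, and the accepted /st time format differs between Windows versions, so check schtasks /create /? on your own server before using it.

```python
import shutil

MIN_FREE_FRACTION = 0.15  # the 15% free-space guideline from rule 4


def has_defrag_headroom(total_bytes, free_bytes, minimum=MIN_FREE_FRACTION):
    """Return True when at least `minimum` of the volume is free."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes >= minimum


def volume_has_headroom(root):
    """Check a live volume, for example root="D:\\" on a Windows server."""
    usage = shutil.disk_usage(root)  # named tuple: total, used, free
    return has_defrag_headroom(usage.total, usage.free)


def build_defrag_task(volume, task_name="WeeklyDefrag", day="SUN", start="02:00"):
    """Build a Schtasks.exe argument list that runs Defrag.exe once a week.

    task_name, day, and start are hypothetical defaults for illustration.
    """
    return [
        "schtasks", "/create",
        "/tn", task_name,                # name of the scheduled task
        "/tr", f"defrag.exe {volume}",   # program the task runs
        "/sc", "weekly",
        "/d", day,
        "/st", start,                    # time format varies by Windows version
        "/ru", "SYSTEM",                 # run under the local System account
    ]
```

On the server itself, the list returned by build_defrag_task("D:") could be passed to subprocess.run, or joined with spaces and pasted into a command prompt after review.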
I should also add an eighth and final rule:
8. (the Platinum rule) If your disk starts to make funny sounds, don't ignore them; do something. Disk failure is often preceded by funny sounds emanating from your computer. These clicking, scraping, screeching, or other types of sounds mean trouble, so when you hear them it's time to make sure you've got a recent backup and a spare disk handy just in case. And it's also time to check your event logs, run chkdsk /r, and use other maintenance and troubleshooting tools to check the health of your disks. Don't ignore these funny sounds!
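Checking the Application log for the "Chkdsk" and "Winlogon" sources (rule 6) is easy to script once the events have been exported. The Python sketch below assumes each event has already been loaded as a dictionary with a "source" key; that record layout, the "note" field, and the sample data are made up for illustration and do not reflect any particular export format. Since Winlogon also logs events unrelated to disk checks, treat the result as a shortlist to read through rather than a final answer.

```python
DISK_CHECK_SOURCES = {"chkdsk", "winlogon"}  # sources named in rule 6


def disk_check_events(records, sources=DISK_CHECK_SOURCES):
    """Return the records whose event source matches, ignoring case."""
    return [r for r in records if r.get("source", "").lower() in sources]


# Made-up sample records; a real run would load them from an exported log.
sample = [
    {"source": "Winlogon", "note": "sample record A"},
    {"source": "SomeOtherApp", "note": "sample record B"},
    {"source": "Chkdsk", "note": "sample record C"},
]

for event in disk_check_events(sample):
    print(event["source"], "-", event["note"])
```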
Tips for Troubleshooting
While a proactive approach to maintaining disks and their file systems is important, it's also inevitable that disasters will occur and you'll need to react to them appropriately. Here are some tips for using one of the key maintenance tools for disks and file systems that is included with Windows Server 2003, namely Chkdsk.exe:
- Make sure you know you have a good recent backup before you run chkdsk.exe.
- Never interrupt Chkdsk.exe while it's doing its job.
- Make sure you have enough time during your maintenance downtime window to run Chkdsk.exe, because on very large volumes this command can take a long time to finish its work. To speed up the operation of Chkdsk.exe on very large NTFS volumes, you can run it in a "light" form by specifying chkdsk drive_letter /f /c /i before you try running the slower chkdsk /r. The /c switch skips checking for cycles in the folder structure and the /i switch performs a less thorough check of index entries, so this pass is quicker but also less complete.
- Chkdsk.exe can't repair the boot/system volume while Windows is running, and it also can't repair data volumes while file handles are open on them, because in both situations Chkdsk.exe is unable to lock the volume for its exclusive use. In these cases, Chkdsk.exe can instead be scheduled to run at the next system restart.
- If you think your volume may be dirty but you don't want Autochk.exe to run when it reboots (for instance, if your server is heavily used and you can't afford the downtime while Autochk.exe runs), you can use the Chkntfs.exe command to first determine whether the volume is dirty or not, and second to find out whether Autochk.exe is currently scheduled to run at the next restart. If you determine that the volume is dirty and Autochk.exe is scheduled to run at next restart, you can postpone the check by excluding the volume from the boot-time check with the chkntfs /x command; running chkntfs /d later restores the default behavior, in which every dirty volume is checked at startup. Note however that doing this is risky: if your volume is dirty you should deal with it as soon as possible and not procrastinate.
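Two of the tips above can be sketched in Python: running the lighter Chkdsk.exe pass before the full one, and reading Chkntfs.exe output to decide what to do next. The function names are hypothetical, and the phrases matched in parse_chkntfs ("is not dirty", "is dirty", "scheduled") are an assumption about English-language output that should be verified against what your own server prints.

```python
def chkdsk_passes(volume):
    """Return the quicker pass first, then the full bad-sector scan."""
    light = ["chkdsk", volume, "/f", "/c", "/i"]
    full = ["chkdsk", volume, "/r"]
    return [light, full]


def parse_chkntfs(output):
    """Interpret Chkntfs.exe output text.

    The matched phrases are assumed English wording; verify them against
    real output before relying on this.
    """
    text = output.lower()
    if "is not dirty" in text:
        dirty = False
    elif "is dirty" in text:
        dirty = True
    else:
        dirty = None  # wording not recognised, so make no claim
    scheduled = "scheduled" in text
    return {"dirty": dirty, "scheduled": scheduled}
```

Returning None when the wording is not recognised keeps the script from reporting a volume as clean just because the output looked different from what was expected.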