Implementing Fault Tolerance on Windows Networks
A fundamental rule of computing (and life in general) is that things break down. Because of this, it's important to ensure that the components of a Windows server fail infrequently and can be quickly restored after a failure occurs. Fault tolerant technologies are hardware and software features that prevent failures from occurring (high reliability) and enable failed components to be replaced or restored with minimal service interruption (high availability). This article outlines the different options for implementing fault tolerance using Windows Server 2003, Enterprise Edition. We'll look briefly at fault tolerance in three main areas: hardware, storage, and network applications.
While hardware fault tolerance is mainly implemented in the system (motherboard) itself, Windows indirectly provides support for hardware fault tolerance by supporting the underlying system hardware that enables such fault tolerance. Examples of hardware fault tolerance on Windows systems include:
- Hot add memory to allow adding more RAM while the system is turned on and running, with no reboot required to recognize the new memory.
- Hot swap hard disks to allow adding or removing SATA or SCSI disks while the system is turned on and running.
- Hot plug PCI-X slots to allow adding or removing PCI cards while the system is turned on and running.
- Redundant power supplies and cooling fans to allow the system to continue to function when a power supply fails or a fan stops working.
In addition to these advanced hardware technologies, which are generally only available on more expensive enterprise-level systems, reliability can be enhanced by ensuring the surrounding environment is also fault tolerant. For example:
- Uninterruptible Power Supply (UPS) to ensure systems can shut down properly when the electrical power to your site fails.
- Generators to allow critical systems to continue running during a long term power blackout.
- Voltage filters (usually built into the UPS) to ensure voltage variations (spikes) don't damage components or cause loss of data.
- Redundancy and fault tolerance in network infrastructure devices such as switches and routers.
- Redundant WAN links to provide secondary network connections between sites should the primary WAN link go down.
- Redundant ISPs (multihoming) to ensure highly reliable Internet access.
Finally, hot spare servers that are fully configured and ready to go online should a production server experience catastrophic failure provide a simple but expensive way to approach 100% uptime in business-critical environments. Given an endless supply of money in your IT budget, a system admin could implement all of these technologies. Practically speaking, however, you have to pick and choose what you can afford in the realm of advanced hardware support and plan for the best.
Something that is sometimes overlooked when implementing fault tolerant hardware is keeping an adequate supply of spare parts onsite and easily accessible. It's not much help if your server supports hot swap hard disks but you don't have any spares around, or if you keep them in a different building or have to sign for them to get them. Another thing to check is that the hardware is listed in the Windows Server Catalog, which indicates that it is fully compliant with and supported by Windows Server 2003.
Probably the best-known fault tolerant technology supported by Windows is software RAID, which is available on systems where basic disks have been converted to dynamic disks. RAID 1 (disk mirroring) is an excellent method for providing fault tolerance for boot/system volumes, while RAID 5 (disk striping with parity) increases both the speed and reliability of high-transaction data volumes such as those hosting databases. Software RAID means that RAID is implemented within Windows itself. For even higher performance and greater fault tolerance you can choose to implement hardware RAID instead, though this is generally more expensive than software RAID. Traditionally most software RAID systems have used SCSI, but another option common nowadays is SATA (Serial ATA), which usually costs only a fraction of what SCSI does while offering almost comparable performance.
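To make "striping with parity" concrete, here is a minimal Python sketch of the XOR parity idea behind RAID 5. It is an illustration only and not how Windows implements software RAID: the function name and the three four-byte "disks" are hypothetical, and a real array rotates parity across disks and works on much larger stripes.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks, one per disk in the stripe.
data_blocks = [b"ABCD", b"EFGH", b"IJKL"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data_blocks)

# Simulate losing the second disk: XOR the survivors with the parity block
# to rebuild the missing data.
surviving = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(surviving)
```

Because XOR is its own inverse, any single missing block in a stripe can be recomputed from the remaining blocks plus parity, which is why a RAID 5 volume survives the loss of one disk but not two.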
There's more to fault tolerance in storage than RAID, however. By implementing the Distributed File System (DFS) on your network and replicating DFS roots using the File Replication Service (FRS), you can ensure maximum redundancy for shared volumes, allowing users to access shared files on your network not only more easily but also when a particular file server goes down. For more information on how DFS works and how to implement it, see Andrew Tabona's article Windows 2003 DFS (Distributed File System) here on WindowsNetworking.com.
Another useful technology is the Volume Shadow Copy Service (VSS), which lets Windows keep point-in-time snapshots of data volumes so users can recover accidentally deleted files or revert to earlier versions of documents they are working on. While not strictly a fault tolerant technology, VSS does provide increased availability for user data and helps protect it from accidental loss or destruction. For more information on how Shadow Copy works and how to implement it, see Brien Posey's article Working with the Windows Server 2003 Volume Shadow Copy Service here on WindowsNetworking.com.
Reliable Network Applications
Distributed network applications themselves become more available and reliable when combined with several key fault tolerant technologies in Windows Server 2003, Enterprise Edition. One such technology is server clusters, a high availability solution implemented in Enterprise Edition using the Cluster Service. Server clusters can be configured in a variety of different ways and can include up to 8 nodes (servers) within each cluster. Some of the configurations by which server clusters can provide fault tolerance for network applications include:
- Active/active clustering, where several nodes share the load of processing client requests. If one node goes down, other active nodes can take up the slack until the failed node is brought back up.
- Active/passive clustering, where one or more nodes are on standby and can be brought up quickly when a working node fails.
- Hot standby clustering, where multiple failover nodes are consolidated into a single hot standby node that can be brought online if necessary to take up the work of any failed node in the cluster.
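The failover behavior these configurations share can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Cluster Service's logic: the function name and node names are hypothetical, and real clusters move resource groups according to configured failover policies.

```python
def fail_over(active, standby, failed_node):
    """Return new (active, standby) lists after failed_node goes down.

    If a standby node is available it is promoted to replace the failed
    node (active/passive or hot standby). If none is available, the
    remaining active nodes simply carry the load (active/active).
    """
    if failed_node not in active:
        # Nothing to do: the node was not serving clients.
        return list(active), list(standby)
    remaining = [node for node in active if node != failed_node]
    spares = list(standby)
    if spares:
        remaining.append(spares.pop(0))
    return remaining, spares

# Hypothetical three-node cluster with one node on standby.
active, standby = fail_over(["node1", "node2"], ["node3"], "node1")
```

Here `active` becomes `["node2", "node3"]` and `standby` becomes empty, so a second failure would fall back to the active/active behavior of the survivors sharing the load.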
Depending on how your clusters are implemented, you can characterize network applications running on your cluster in one of three different ways:
- Single instance applications, where one instance of the application is running on the cluster at any given moment. This approach is usually used for hosting network services, for example a clustered DHCP server.
- Multiple instance cloned applications, where identical code running against identical data on several nodes gives the appearance to network clients of a single instance application. This approach is generally used for running stateless applications, but stateful applications can also be implemented this way provided connecting clients are provided with tokens that track their session state.
- Multiple instance partitioned applications, where application code and data are partitioned to run on different nodes in a cluster. This approach is generally used for running stateful applications, for example an SQL database application that can be easily partitioned into different portions for different groups of customers.
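As a rough illustration of the partitioned model, the Python sketch below routes each customer to the node that owns that customer's slice of the data. The partition map, node names, and the choice to split alphabetically by customer name are all assumptions made for the example; a real application would partition on whatever key suits its data.

```python
# Hypothetical partition map: each node owns a range of first letters.
PARTITIONS = {
    "node1": ("A", "H"),
    "node2": ("I", "P"),
    "node3": ("Q", "Z"),
}

def node_for_customer(name):
    """Return the node that holds the data partition for this customer."""
    first = name[0].upper()
    for node, (low, high) in PARTITIONS.items():
        if low <= first <= high:
            return node
    raise ValueError(f"no partition covers customer {name!r}")
```

With this map, `node_for_customer("Contoso")` returns `"node1"`. Contrast this with the cloned model, where any node could serve any customer because every node runs identical code against identical data.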
Windows Server 2003, Enterprise Edition, also supports a new feature called majority node set clustering, which allows the nodes within a cluster to be geographically dispersed from one another while still maintaining internal consistency, so that fault tolerance can be implemented in a distributed sense among several sites. For more information on this feature see the tip called Geographically Dispersed Clusters in Windows Server 2003 in the Admin Knowledge Base here on WindowsNetworking.com.
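The "majority" idea can be expressed as a one-line rule: a group of nodes may keep running the cluster only if it can see more than half of the configured nodes. The Python sketch below shows that rule in isolation. It is a simplification for illustration, with a hypothetical function name, and does not reproduce the Cluster Service's actual quorum handling.

```python
def has_quorum(total_nodes, reachable_nodes):
    """True if the reachable nodes form a strict majority of the cluster."""
    return reachable_nodes > total_nodes // 2

# A 5-node cluster split 3/2 across two sites by a WAN outage:
site_a_continues = has_quorum(5, 3)  # majority side keeps running
site_b_continues = has_quorum(5, 2)  # minority side must stop
```

Requiring a strict majority means at most one side of a network split can continue, which is what keeps geographically dispersed nodes consistent. It also means an even split, such as 2 of 4 nodes, leaves neither side with quorum.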
A second way of implementing fault tolerance for distributed client/server applications is to use the Network Load Balancing (NLB) component of Windows Server 2003. This feature can be used to provide failover support for applications and services running on IP networks, for example web applications running on Internet Information Services (IIS). Using NLB you can scale an application out to run on as many as 32 separate servers. While the main purpose of this approach is to increase availability and scalability, NLB also provides fault tolerance that increases reliability.
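The general idea of spreading clients across hosts and continuing when one host drops out can be sketched as follows. This Python example is an assumption-laden illustration: it uses a simple CRC32 hash of the client address, which is not the filtering algorithm NLB itself uses, and the function name, host names, and addresses are hypothetical.

```python
import zlib

def pick_host(client_ip, hosts):
    """Map a client address to one of the currently available hosts."""
    if not hosts:
        raise ValueError("no hosts available")
    # CRC32 is deterministic, so a given client maps to the same host
    # for as long as the host list is unchanged.
    return hosts[zlib.crc32(client_ip.encode("ascii")) % len(hosts)]

hosts = ["web1", "web2", "web3"]
before = pick_host("192.0.2.10", hosts)

# Simulate the chosen host failing: the client is mapped to a survivor.
survivors = [h for h in hosts if h != before]
after = pick_host("192.0.2.10", survivors)
```

The point of the sketch is the failover property: once a failed host is removed from the list, every client still resolves to a working host without any change on the client side.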
For a quick introduction to some of the basic concepts of clustering on Windows Server 2003, see Brien Posey's article Understanding How Cluster Quorums Work here on WindowsNetworking.com.
The different hardware and software fault tolerant technologies supported by Windows Server 2003, Enterprise Edition, make it a powerful platform for mission-critical business applications. By judiciously choosing which of these technologies to implement, you can ensure high reliability and high availability while being careful not to break your budget.