In part I, we looked at Windows 2000 & Windows Server 2003 Clustering & Load Balancing for high availability, as well as general planning information.
If you missed the first part in this series, see "Windows Server 2003 Disaster Recovery Planning (Part 1)".
What could happen?
In this section, we discuss Disaster Recovery Planning. The first article of the series covered very general details; now we look at what could actually happen and what a disaster could do to you and your organization if you aren't prepared for it. A disaster is a catastrophe that occurs unexpectedly and often can't be avoided. Recovery is the process of going from disaster back to full production. So what constitutes a disaster? Here are a few disasters you could experience.
- Hackers, exploits, and security breaches
- System failure, disk failure, and so forth
- Power failure
- Fire accidents
- Storm accidents
- Water accidents, flooding
- Earthquake accidents
- Terrorist attacks
- Crime and vandalism
- Extreme weather, such as cold, heat, dryness, and humidity
- Loss of staff that operated or maintained such systems
As you can see, a disaster can stem from nearly anything! In this section, you learn what it could take for you to recover from a disaster by using a Disaster Recovery Plan (DRP).
Building the Disaster Recovery Plan
If you think about it, having high availability in any solution is just like having a built-in disaster recovery plan! If you have a two-node cluster and one node fails, the disaster is the failure of the node and the recovery is the failover to the other node. This is a form of disaster recovery. Disaster struck and you recovered because you were prepared. To make this process more formalized and presentable to management, you'll want to build this into a documented plan, but the mechanics of being redundant and failsafe are the fundamentals of the plan itself.
To start your DRP, you must first assess your business and its running solution. Here is an initial question: What is an acceptable amount of downtime? I ask this question frequently and I always get a blank stare. That's because, many times, businesses think that by implementing a DRP, they immediately evade disaster. Sorry, that's not how it works. You have different levels of Disaster Recovery that dictate how much you can recover and how quickly. When detailing downtime, management needs to talk to customers and other users of services to consider how much of a hit the business can take during downtime and still survive.
Here's an example: You're the owner of an ecommerce site that sells widgets online. If you sell widgets 24 hours a day to international and domestic markets, then you're generating revenue 24 hours a day from your web sites. You would want this load balanced and redundant. If your site was down for more than 30 minutes, you could have your buyers go to some other widget seller and they might never return. And this is after only one failure! You could lose business that quickly without a DRP and solution in place, so your amount of acceptable downtime is little to none, if possible.
Another example is an application server that resides on your company's intranet. If your engineers access the server only during working hours, then you have an acceptable downtime of little-to-none during working hours, and all maintenance must be completed in off-work hours. Taking the same scenario further, if the engineers could lose access to the company's documents and drawings for up to three hours at a time without the company losing money, then your acceptable downtime is three hours. If acceptable downtime is high, then the cost of your solution is low, and vice versa.
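To compare an acceptable-downtime figure against the availability numbers vendors quote, it can help to express it as a percentage of the hours the service is actually needed. Here is a minimal sketch in Python. The function name, the yearly downtime budgets, and the 50-hour working week are all assumptions made up for illustration, not figures from a real deployment.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def required_availability(service_hours_per_year: float,
                          acceptable_downtime_hours_per_year: float) -> float:
    """Return the availability (as a percentage) implied by a downtime budget."""
    if service_hours_per_year <= 0:
        raise ValueError("service window must be positive")
    if not 0 <= acceptable_downtime_hours_per_year <= service_hours_per_year:
        raise ValueError("downtime budget must fit inside the service window")
    uptime = service_hours_per_year - acceptable_downtime_hours_per_year
    return 100.0 * uptime / service_hours_per_year


# Hypothetical figures: the widget site sells around the clock and tolerates
# 30 minutes of downtime a year; the intranet server is needed 50 hours a week
# and tolerates one three-hour outage a month.
ecommerce = required_availability(HOURS_PER_YEAR, 0.5)
intranet = required_availability(50 * 52, 3 * 12)

print(f"E-commerce site: {ecommerce:.3f}% availability required")
print(f"Intranet server: {intranet:.3f}% availability required")
```

The tighter the downtime budget, the closer the result gets to 100 percent, and the more redundancy (and money) it usually takes to get there.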
Disaster Recovery and Management
You need to have your management buy into the DRP. I've seen too many management teams toss DRPs out the window because of costs. But disasters can always strike, so it behooves management to take ownership of an effective DRP. Senior management must understand and support the business impacts and risks associated with a complete system failure. If you're a public company, you might even be held liable, to a certain degree, if negligence can be proved. This is a serious matter when data is involved. Management needs to understand the risks with and without implementing a high-availability solution, as well as how to fund the DRP.
Identify Possible Disaster Impact
Now, let's discuss what impact-based questions you can ask to help guide your business to a highly available and disaster-free environment.
How much of the company's material resources would be lost?
This is an important question to ask. While it isn't one of the biggest reasons for having a high-availability solution, it's an important one, nonetheless. If you lose material-based resources because of disaster, it could be costly to business. Think of what might happen if you had a Windows 2000 cluster with SAP R/3 running on it and controlling all the resources for your company. SAP R/3 is an Enterprise Resource Planning (ERP) application that helps you manage your company's material goods. If you had a disaster on your system and all the data was lost, you would risk losing all the shipping information, perhaps your material database, or even worse, inventory. All these items are critical to business and without them you might be unable to run your business. Because of this alone, it's critical for you to assess the possible loss of your material resources data.
What are the total costs involved with the disaster?
This is the number one reason you need to make an assessment. You can take the total-cost figure and use it in a scenario to justify the cost of what you plan to put into the high-availability solution. I use this number (which I get from analysis and statistics) to explain the total cost of ownership (TCO) of the high-availability solution. Total costs include every cost incurred from start to finish of any disaster that takes place.
For example, if the hard disk fails on a server and the server doesn't fail over, then the business lost during the time it took to replace that drive, the cost of the employee who has to take time out of the work week to fix the problem, and the costs of any hardware and software that might be needed all count toward total costs.
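The disk-failure example can be turned into a simple calculation. The sketch below is only an illustration: the class name, the field names, and every dollar figure are hypothetical, and a real assessment would use numbers from your own analysis.

```python
from dataclasses import dataclass


@dataclass
class IncidentCost:
    """Cost inputs for a single outage. All field names are illustrative."""
    downtime_hours: float          # how long the service was unavailable
    revenue_per_hour: float        # business lost per hour of downtime
    staff_hours: float             # employee time spent on the repair
    staff_hourly_rate: float       # cost of that employee per hour
    hardware_software_cost: float  # replacement parts, licenses, and so on

    def lost_business(self) -> float:
        return self.downtime_hours * self.revenue_per_hour

    def labor(self) -> float:
        return self.staff_hours * self.staff_hourly_rate

    def total(self) -> float:
        return self.lost_business() + self.labor() + self.hardware_software_cost


# Hypothetical numbers for a failed disk on a server with no failover.
disk_failure = IncidentCost(
    downtime_hours=4,
    revenue_per_hour=1_500.0,
    staff_hours=6,
    staff_hourly_rate=60.0,
    hardware_software_cost=400.0,
)

print(f"Lost business: {disk_failure.lost_business():,.2f}")
print(f"Labor:         {disk_failure.labor():,.2f}")
print(f"Total cost:    {disk_failure.total():,.2f}")
```

Keeping the three components separate makes it easier to show management which part of the total a high-availability solution would actually reduce.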
What costs and human resources are required for rebuilding?
If you experience a disaster that's outside the scope or realm of what your organization is staffed to deal with, then outside help or consulting services might be in your future. If this is the case, you need to factor this price/cost into the entire high-availability solution and DRP.
How long will it take to recover if a disaster strikes?
You know what they say: time is money. Assess how long it could take to get your company back online after a disaster and how long until it's fully recovered. You need to address the fact that if you're down due to a disaster, then the longer it takes to bring your systems back online, the more money your business could potentially lose.
What is the impact on the end users?
End users are your workers. They're the fuel for the engine. If they aren't working, then little-to-nothing will get done. This is important if you value the term "productivity" in your organization. If disaster strikes, depending on the impact of the disaster (and possible lack of a DRP), you might find your workforce is sitting around or hanging out at the water cooler.
What is the impact on the suppliers and business partners?
Having a disaster can disrupt your relations with your business partners who might rely on your services. Nothing is worse than losing business yourself and taking your partners down with you. This is considered highly unacceptable and needs to be factored into your overall DRP.
What is the effect on your share price and consumer confidence?
If you're a publicly held company, your stockholders could lose capital from your disasters and pull money out from your stock. This isn't good and it can only hurt the business image, as well as the revenue stream.
What is the impact on the overall organization?
This is the sum of all the previous questions. If you think about it, having a disaster and having all the previous questions answered negatively might force your company out of business. Always ask questions of this type if you're debating whether you should have a DRP.
Systems, Network, and Applications Priority Levels
Now that you have a good reason to have a DRP, you need to start fleshing it out a bit more. Regarding your systems, network, and applications, you need to create a system that classifies them on a chart, for example, a three-layer chart using an Excel spreadsheet. This ensures resources, money, and effort all get channeled to the system, network, or application that's deemed most important. Usually mainframes, e-mail, routers, and switches turn up as number one on my list of mission-critical components, but this is for you and your analysis to decide. Let's look at my levels:
- Mission critical or high priority is anything you can't live without. Damage or disruption to these systems would cause the most impact on your business. An example is if your systems were completely inoperable.
- Important or medium priority is any system that, if disrupted, would cause a moderate, but still real, problem for you and your network systems. An example is a problem (like a disk drive error) that, if neglected, could potentially cause a business interruption for you.
- Minor or low priority is any outage that's easily restored, brought back online, or corrected with little damage or disruption. This is still a disruption, but it doesn't significantly impact your systems or your business. An example is if a system has a problem with its monitor.
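If you'd rather keep the chart somewhere other than a spreadsheet, the same three levels can be sketched in a few lines of Python. The names `Priority`, `Asset`, and `recovery_order`, along with the sample inventory, are assumptions for illustration only. Your own analysis decides what belongs at each level.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """The three levels described above; lower value means recover first."""
    MISSION_CRITICAL = 1
    IMPORTANT = 2
    MINOR = 3


@dataclass(frozen=True)
class Asset:
    name: str
    kind: str          # "system", "network", or "application"
    priority: Priority


def recovery_order(assets: list[Asset]) -> list[Asset]:
    """Sort assets so the highest-priority items come first, then by name."""
    return sorted(assets, key=lambda a: (a.priority, a.name))


# Hypothetical inventory, not a recommendation.
inventory = [
    Asset("Workstation monitors", "system", Priority.MINOR),
    Asset("E-mail", "application", Priority.MISSION_CRITICAL),
    Asset("File server disk array", "system", Priority.IMPORTANT),
    Asset("Core routers and switches", "network", Priority.MISSION_CRITICAL),
]

for asset in recovery_order(inventory):
    print(f"{asset.priority.value}  {asset.priority.name:<17} {asset.kind:<12} {asset.name}")
```

Sorting by level first gives you a rough order in which to spend resources, money, and effort.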
Resiliency of Services
When working with Highly Available solutions, you need to add resiliency to your plan. Cisco, as well as other network vendors, defines network resiliency as "the ability to recover from any network failure or issue, whether it is related to a disaster, link, hardware, design, or network services." Resiliency should provide you, the implementer of such technologies, with a comfort level that if you have a failure, you could survive it with highly available solutions. You need to plan for resiliency by checking the following areas of your network:
- Make sure your WAN links are redundant. You can implement secondary Frame Relay connections or point-to-point links, or dial backup lines with ISDN.
- Make sure your routing protocols are dynamic if you want them to learn other paths in case of disaster. Static paths won't necessarily do this for you.
- Make sure you have multiple networks or Telco carriers. If one carrier has an issue, you can fall back on the other one. MCI WorldCom is a perfect example of this.
- Make sure you have hardware resiliency in every form: hard disks, routers, firewalls, cabling, you name it.
- Make sure you have power redundancy in the form of UPS or backup generators.
- Make sure your network services, such as DHCP, are resilient in case of failure.
This isn't a definitive list because it all depends on what you have at your location, but make sure you make your own list, based on what your network has and uses.
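One simple way to keep your own list honest is to record each area along with whether it's covered, and then report the gaps. The following is a minimal sketch in Python. The dictionary contents and the `find_gaps` helper are hypothetical, and the True/False values are made up for the example.

```python
# Each key is one area from the list above; the value records whether a
# redundant or fallback option is in place. The values here are made up.
resiliency_checklist = {
    "Redundant WAN links": True,
    "Dynamic routing protocols": True,
    "Multiple carriers": False,
    "Hardware redundancy": True,
    "Power redundancy (UPS or generator)": False,
    "Network services (DHCP and so forth)": True,
}


def find_gaps(checklist: dict[str, bool]) -> list[str]:
    """Return the areas that have no redundancy yet, in checklist order."""
    return [area for area, covered in checklist.items() if not covered]


gaps = find_gaps(resiliency_checklist)
covered = len(resiliency_checklist) - len(gaps)
print(f"{covered} of {len(resiliency_checklist)} areas covered")
for area in gaps:
    print(f"Missing: {area}")
```

Add or remove entries to match what your network actually has and uses.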
Delivering a Disaster Recovery Plan
Now you have a plan on paper! So, what's next? Be sure the plan is full of details and is well documented. Make certain your staff studies it. Schedule a class for everyone to learn about the plan and include a verbal test on the DRP as part of the class. In our next two articles, we will get into other aspects of DRP and business continuity planning (BCP), including system DRP and more. Stand by!
This should give you a good running start on advanced planning for high availability, and it gives you many things to check and think about, especially when you're done with your implementation.