In the first article of this series, “Planning for High Availability”, we covered Windows Server 2003 clustering and load balancing for high availability (HA), as well as general planning information. This article expands on that plan by looking at a full-blown disaster and why it’s important to have a plan for one.
“For a complete guide to high availability, check out ‘Windows Server 2003 Clustering and Load Balancing’ from Amazon.com.”
What could happen?
In this section, we discuss disaster recovery planning. The first article of the series (found in the Links and References section of this article) covered very general details; now we look at what could actually happen. If you have ever experienced a disaster on a large scale, you know exactly what could happen: anything. To prepare for anything, start by brainstorming with your team about what could happen and put it in a bulleted list. Here are some examples of potential disasters to your organization’s business that will most likely come up.
- Hackers, exploits, and security breaches
- System failure, disk failure, and so forth
- Power failure
- Fire accidents
- Storm accidents
- Water accidents, flooding
- Earthquake accidents
- Terrorist attacks
- Crime and vandalism
- Extreme weather, such as cold, heat, dryness, and humidity
- Loss of staff that operated or maintained such systems
A “disaster” is an unavoidable catastrophe that usually occurs unexpectedly. “Recovery” is getting from disaster back to full production, either by averting the disaster completely or by being able to bounce back from it. Make no mistake: disaster will happen at some time, large or small in scale, and when it does, you will be able to recover from it only if you are prepared. As the list shows, a disaster can stem from nearly anything! In the next section, we cover using a Disaster Recovery Plan (DRP) to help you recover from a disaster in an organized and quick manner. Time is of the essence when it comes to disaster; every second counts.
Building the Disaster Recovery Plan
If you think about it, having high availability in any solution is just like having a built-in disaster recovery plan! If you have a two-node cluster and one node fails, the disaster is the failure of that node and the recovery is the failover to the other node. This is a form of disaster recovery. There are many forms: each disaster is unique, and each system’s recovery could also be unique. Because there is so much ‘uniqueness’ in it, if you really value your operation’s ability to do business in a paperless society, it’s imperative that you take this into consideration and build a Disaster Recovery Plan. With so many resources online these days, there really shouldn’t be any excuse for a system failing without a plan, other than ‘it’s not in the budget this year, but you did let management know we may have a problem’.
If disaster struck and you recovered, it was only because you were either prepared or really lucky. I like being prepared; it totally takes the Las Vegas aspect of networking out of the equation.
To make this process more formalized and presentable to management, you’ll want to build this into a documented plan, but the mechanics of being redundant and failsafe are the fundamentals of the plan itself. To start your DRP, you must first assess your business and its running solution. Here are some initial thoughts.
What is an acceptable amount of downtime?
I ask this question frequently and I often get a blank stare. I say this because, many times, businesses think that by implementing a DRP, they immediately evade disaster. Sorry, that’s not how it works. You have different levels of Disaster Recovery that dictate how much you can recover and how quickly. When detailing downtime, management needs to talk to customers and other users of services to consider how much of a hit business can take during a downtime and still survive. Do companies talk about this often? In my personal experience, not often enough.
Here’s an example: You’re the owner of an e-commerce site that sells widgets online. If you sell widgets 24 hours a day to international and domestic markets, then you’re generating revenue 24 hours a day from your web sites. You would want this load balanced and redundant. If your site was down for more than 30 minutes, you could have your buyers go to some other widget seller and they might never return. And this is after only one failure! You could lose business that quickly without a DRP and solution in place, so your amount of acceptable downtime is little to none, if possible.
Another example is an application server that resides on your company’s intranet.
If you have engineers who access the server only during working hours, then you have an acceptable downtime of little-to-none during working hours. All maintenance must be completed in off-work hours. You can extend this same scenario: if the engineers could lose access to the company’s documents and drawings for up to three hours at a time without the company losing money, then your acceptable downtime is three hours. The more downtime you can accept, the less the solution costs, and vice versa.
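The two examples above can be written down as simple per-system downtime rules. The following Python sketch is only an illustration: the class and function names are made up for this example, the 30-minute and three-hour limits come from the scenarios above, and the working hours of 09:00 to 17:00, Monday to Friday, are an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DowntimePolicy:
    system: str
    max_outage_minutes: int   # acceptable downtime per incident
    working_hours_only: bool  # True if downtime only matters during working hours


def counted_minutes(policy, start, end):
    """Count the minutes of an outage that count against the policy."""
    minutes = 0
    t = start
    while t < end:
        # Assumed working hours: Monday to Friday, 09:00 to 17:00.
        in_working_hours = t.weekday() < 5 and 9 <= t.hour < 17
        if not policy.working_hours_only or in_working_hours:
            minutes += 1
        t += timedelta(minutes=1)
    return minutes


def is_acceptable(policy, start, end):
    """True if the counted outage time stays within the acceptable downtime."""
    return counted_minutes(policy, start, end) <= policy.max_outage_minutes


store = DowntimePolicy("e-commerce site", 30, working_hours_only=False)
intranet = DowntimePolicy("intranet application server", 180, working_hours_only=True)

# Hypothetical outage: Tuesday 2 January 2024, 16:00 to 20:00.
start = datetime(2024, 1, 2, 16, 0)
end = datetime(2024, 1, 2, 20, 0)

for policy in (store, intranet):
    verdict = "acceptable" if is_acceptable(policy, start, end) else "NOT acceptable"
    print(f"{policy.system}: {counted_minutes(policy, start, end)} counted minutes, {verdict}")
```

The same four-hour outage is a serious problem for the 24-hour web site but stays within the limit for the intranet server, because only one hour of it falls inside working hours.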
Basically, assess what each disaster you can think of could do to each of your critical systems, assess how to ‘avoid’ those dangers, and then assess building ‘high availability’ into the solution. Avoiding disaster entirely would be very costly, because guaranteeing that no disaster could strike means applying DRP to everything you can think of (servers, infrastructure, power, and so on) for every disaster you could encounter. The trick to keeping this from getting out of hand is to list the top applications and software that your business runs on, or whatever you make money from; that’s probably what you want to assess for disaster recovery. Once you have assessed all the possible issues and their problems, you build a plan that helps facilitate the recovery ‘in the time of the disaster’. The plan should have detailed steps, it should be organized, it should be staffed with responsible team members, and it should be tested for quality. Each time a disaster recovery plan is enacted, follow up with a discussion on how to make it better. The plan should be reviewed by management, and enforced and backed by management.
Disaster Recovery (DR) and Management
You need to have your management buy into the DRP. I’ve seen too many management teams toss DRPs out the window because of costs. But disasters can always strike, so it behooves management to take ownership of an effective DRP. Senior management must understand and support the business impacts and risks associated with a complete system failure. If you’re a public company, you might even be held liable, to a certain degree, if negligence can be proved. This is a serious matter when data is involved. Management needs to understand the risks with and without implementing a high-availability solution, as well as how to fund the DRP. Think about a person’s medical charts at a medical facility. If you break down the up-front costs of HA and DR, could you really tell any patient that their medical history is gone because you saved the company a few dollars and put a bonus in your pocket? Would you do that to your own parent or loved one? Would you burn your mother’s medical records for a swimming pool? When I break it down this way, people laugh, but how true is it? I never see companies take this stuff seriously until it’s too late; very few proactively plan out a way to get back to business if something serious were to happen. Data backups and RAID systems can’t stop an earthquake from removing any hope of using ‘that’ facility again. If you don’t have a standby site available to go to, with good data backups, you’re not going back into business. It’s that simple. A paperless society needs a way to keep its data alive. DRP: back it, embrace it. The more you embrace it, the better your chances of averting a disaster in the first place.
Identify Possible Disaster Impact
Now, let’s discuss what impact-based questions you as a Systems Administrator or Engineer can ask to help guide your business to a highly available and disaster-free environment.
How much of the company’s material resources would be lost?
This question is important to assess. While it isn’t one of the biggest reasons for having a high-availability solution, it’s an important one, nonetheless. If you lose material-based resources because of disaster, it could be costly to your business. Think of what might happen if you had a Windows Server 2003 cluster with SAP R/3 running on it and controlling all the resources for your company. SAP R/3 is an Enterprise Resource Planning (ERP) application that helps you manage your company’s material goods. If you had a disaster on your system and all the data was lost, you would risk losing all the shipping information, perhaps your material database, or even worse, inventory. All these items are critical to business and without them you might be unable to run your business. Because of this alone, it’s critical for you to assess the possible loss of your material resources data.
What are the total costs involved with the disaster?
This is the number one reason you need to make an assessment. You can take the total cost figure and use it in a scenario to justify the cost of what you plan to put into the high-availability solution. I use this number (which I get from analysis and statistics) to explain the TCO (Total Cost of Ownership) of the high-availability solution. Total costs are every cost incurred from start to finish of any disaster that takes place. For example, if the hard disk fails on a server and the server doesn’t fail over, then the total costs include the business lost during the time it took to replace that drive, the cost of the employee who has to take time out of the work week to fix the problem, and the cost of any hardware and software that might be needed.
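The failed-disk example can be turned into a quick calculation. This is a minimal Python sketch, not a costing method from the article: the field names are hypothetical, every figure is made up for illustration, and lost business is assumed to scale linearly with downtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    downtime_hours: float      # how long the system was unavailable
    revenue_per_hour: float    # business lost per hour of downtime (assumed constant)
    staff_hours: float         # employee time spent fixing the problem
    staff_hourly_rate: float   # cost of that employee per hour
    replacement_costs: float   # hardware and software that had to be bought


def total_cost(incident):
    """Sum every cost incurred from start to finish of the incident."""
    lost_business = incident.downtime_hours * incident.revenue_per_hour
    labor = incident.staff_hours * incident.staff_hourly_rate
    return lost_business + labor + incident.replacement_costs


# Hypothetical figures for a failed hard disk on a server without failover.
disk_failure = Incident(
    downtime_hours=4.0,
    revenue_per_hour=1000.0,
    staff_hours=6.0,
    staff_hourly_rate=50.0,
    replacement_costs=400.0,
)

print(f"Total cost of the incident: {total_cost(disk_failure):.2f}")
```

Running the same calculation for each disaster on your list gives you the number to set against the price of the high-availability solution.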
What costs and human resources are required for rebuilding?
If you experience a disaster that’s outside the scope or realm of what your organization is staffed to deal with, then outside help or consulting services might be in your future. If this is the case, you need to factor this price/cost into the entire high-availability solution and DRP.
How long will it take to recover if a disaster strikes?
You know what they say: time is money. Assess how long it could take to get your company back online after a disaster and how long until it’s fully recovered. You need to address the fact that if you’re down due to a disaster, then the longer it takes to bring your systems back online, the more money your business could potentially lose.
What is the impact on the end users?
End users are your workers. They’re the fuel for the engine. If they aren’t working, then little-to-nothing will get done. This is important if you value the term “productivity” in your organization. If disaster strikes, depending on the impact of the disaster (and possible lack of a DRP), you might find your workforce is sitting around or hanging out at the water cooler.
What is the impact on the suppliers and business partners?
Having a disaster can disrupt your relations with your business partners who might rely on your services. Nothing is worse than losing business yourself and taking your partners down with you. This is considered highly unacceptable and needs to be factored into your overall DRP.
What is the effect on your share price and consumer confidence?
If you’re a publicly held company, your stockholders could lose capital from your disasters and pull their money out of your stock. This isn’t good and it can only hurt the business image, as well as the revenue stream.
What is the impact on the overall organization?
This is the sum of all the previous questions. If you think about it, having a disaster and having all the previous questions answered negatively might force your company out of business. Always ask questions of this type if you’re debating whether you should have a DRP.
Systems, Network, and Applications Priority Levels
Now that you have a good reason to have a DRP, you need to start fleshing it out a bit more. Regarding your systems, network, and applications, you need to create a system that classifies them on a chart, for example, a three-layer chart using an Excel spreadsheet. This ensures resources, money, and effort all get channeled to the system, network, or application that’s deemed most important. Usually mainframes, e-mail, routers, and switches turn up as number one on my list of mission-critical components, but this is for you and your analysis to decide.
Let’s look at my levels:
- Mission critical, or high priority, is anything you can’t live without. Damage or disruption to these systems would have the greatest impact on your business. An example is your systems being completely inoperable.
- Important, or medium priority, is any system that, if disrupted, would cause a moderate, but still real, problem for you and your network systems. An example is a problem (like a disk drive error) that, if neglected, could potentially cause a business interruption for you.
- Minor, or low priority, is any outage that’s easily restored, brought back online, or corrected with little damage or disruption. This is still a disruption, but it doesn’t impact your systems or your business. An example is a system that has a problem with its monitor.
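The three levels can be kept in a spreadsheet, as suggested above, or in a few lines of code. The Python sketch below is one possible way to do it; the inventory is hypothetical (only e-mail, routers, and switches come from my own list above), and the names `Priority` and `by_priority` are made up for this example.

```python
from enum import IntEnum


class Priority(IntEnum):
    MISSION_CRITICAL = 1  # high priority: you can't live without it
    IMPORTANT = 2         # medium priority: a moderate problem if disrupted
    MINOR = 3             # low priority: easily restored, little disruption


# Hypothetical inventory; your own analysis decides what goes where.
inventory = {
    "e-mail": Priority.MISSION_CRITICAL,
    "core routers and switches": Priority.MISSION_CRITICAL,
    "intranet file server": Priority.IMPORTANT,
    "lab workstation": Priority.MINOR,
}


def by_priority(items):
    """Return (name, priority) pairs, most critical first, then by name."""
    return sorted(items.items(), key=lambda pair: (pair[1], pair[0]))


for name, priority in by_priority(inventory):
    label = priority.name.replace("_", " ").title()
    print(f"{priority.value}  {label:<17} {name}")
```

Sorting by level puts the systems that should receive resources, money, and effort first at the top of the chart.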
Resiliency of Services
When working with highly available solutions, you need to add resiliency to your plan. Cisco, as well as other network vendors, defines network resiliency as “the ability to recover from any network failure or issue, whether it is related to a disaster, link, hardware, design, or network services.” Resiliency should give you, the implementer of such technologies, a level of comfort that if you have a failure, you can survive it with highly available solutions.
You need to plan for resiliency by checking the following areas of your network:
- Make sure your WAN links are redundant. You can implement secondary Frame Relay connections, point-to-point links, or dial backup lines with ISDN.
- Make sure your routing protocols are dynamic if you want them to learn other paths in case of disaster. Static paths won’t necessarily do this for you.
- Make sure you have multiple network or telco carriers. If one carrier has an issue, you can fall back on the other one. MCI WorldCom is a perfect example of this.
- Make sure you have hardware resiliency in every form: hard disks, routers, firewalls, cabling, you name it.
- Make sure you have power redundancy in the form of UPS or backup generators.
- Make sure you have resiliency for network services, such as DHCP, in case of failure.
This isn’t a definitive list because it all depends on what you have at your location, but make sure you make your own list, based on what your network has and uses.
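Once you have your own list, you can check it for areas that depend on a single component. The Python sketch below assumes a hypothetical site inventory; the area names, component names, and the rule of "at least two independent components per area" are assumptions made for illustration, not a standard.

```python
# Hypothetical inventory: resiliency area -> independent components in place.
site = {
    "WAN links": ["primary point-to-point link", "ISDN dial backup"],
    "telco carriers": ["carrier A"],
    "power": ["utility feed", "UPS", "backup generator"],
    "DHCP servers": ["dhcp-01"],
}


def single_points_of_failure(areas, minimum=2):
    """Return the areas that have fewer than `minimum` independent components."""
    return sorted(area for area, parts in areas.items() if len(parts) < minimum)


for area in single_points_of_failure(site):
    print(f"No redundancy for: {area}")
```

Anything the check reports is a candidate for the next round of resiliency planning.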
Delivering a Disaster Recovery Plan
Now you know why the plan is important and how to develop one for any application, system, or device on your network. Power, security, whatever: apply the same questions and methodology to any system or service and you can build a good plan with ease. Now you have a plan on paper! So, what’s next? Be sure the plan is full of details and is well documented. Make certain your staff studies it. Schedule a class for everyone to learn about the plan, and include a verbal test on the DRP as part of the class. In the next article, we will get into other aspects of DRP and BCP (business continuity planning), including system DRP and so on. Stand by!
Planning for High Availability: Disaster Recovery Planning
A Windows Networking Article – Part I in the ‘Planning for High Availability’ Series
Written by Robert Shimonski