Windows Server 2003 Disaster Recovery Planning (Part 1)


For a complete guide to security, check out ‘Security+ Study Guide and DVD Training System’.

Planning for High Availability  

Windows Server disaster recovery planning can be a chore, but if you have the details and a plan, setup can go smoothly, and it will be a lifesaver when your systems start to smoke and your VPs are knocking on your office door asking what the heck is going on! In this section we will look at how to plan for High Availability.

Taking the time to plan and design is the key to your success, and it’s not only the design, but also the study efforts you put in. I always joke with my administrators and tell them they’re doctors of technology. I say, “When you become a doctor, you’re expected to be a professional and maintain that professionalism by educational growth through constant learning and updating of your skills.” Many IT staff technicians think their job is 9 to 5, with no studying done after hours. I have one word for them: Wrong! You need to treat your profession as if you’re a highly trained surgeon except, instead of working on human life, you’re working on technology. And that’s how planning for High Availability solutions needs to be addressed. You can’t simply wing it and you can’t guess at it. You must be precise, otherwise, your investment goes down the drain – and all the work you put in will be not only useless, but also wasteful.

Plan Your Downtime

You need to achieve as close to 100 percent uptime as possible. You know 100 percent uptime isn’t realistic, though, and it can never be guaranteed. Breakdowns occur because of disk crashes, power or UPS failures, application problems resulting in system crashes, or any other hardware or software malfunction. So, the next best thing is 99.999 percent, which is still somewhat reasonable with today’s technology. You can also define in a Service Level Agreement (SLA) what 99.999 percent means to both parties. If you promised someone 99.999 percent uptime for a single year, that translates to only about five minutes of downtime. I would strive for a larger number, one that’s more realistic given scheduled outages and the disaster-recovery testing performed by your staff. Go for 99.9 percent uptime, which allows for roughly nine hours of downtime per year. This is more practical and feasible to obtain. Whether providing or receiving such a service, both sides should test planned outages to see if delivery schedules can be met.

You can figure this formula by taking the number of hours in a day (24) and multiplying it by the number of days in a year (365), which equals 8,760 hours in a year. Use the following equation:

percent of uptime per year = (8,760 - total hours down per year) / 8,760

If you schedule eight hours of downtime per month for maintenance and outages (96 hours total), the percentage of uptime per year is 8,760 minus 96, divided by 8,760, which works out to about 98.9 percent uptime for your systems. This should be an easy way for you to provide an accurate accounting of your downtime. Remember, you must account for downtime accurately when you plan for high availability. Downtime can be planned or, worse, unexpected. Sources of unexpected downtime include the following:

  • Disk crash or failure
  • Power or UPS failure
  • Application problems resulting in system crashes
  • Any other hardware or software malfunction
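The uptime equation above is easy to put into code. Here’s a minimal sketch (the function names are my own, not from the text) that reproduces the 98.9 percent figure and works backward from an SLA target to the downtime it allows:

```python
# Uptime arithmetic from the text:
# percent uptime per year = (8,760 - hours down per year) / 8,760
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def percent_uptime(hours_down_per_year: float) -> float:
    """Return yearly uptime as a percentage."""
    return (HOURS_PER_YEAR - hours_down_per_year) / HOURS_PER_YEAR * 100

def allowed_downtime_hours(target_percent: float) -> float:
    """Hours of downtime a given SLA target allows per year."""
    return HOURS_PER_YEAR * (1 - target_percent / 100)

# Eight hours of scheduled maintenance per month = 96 hours per year
print(round(percent_uptime(96), 1))                    # 98.9

# Working backward from SLA targets
print(round(allowed_downtime_hours(99.9), 2))          # 8.76 hours
print(round(allowed_downtime_hours(99.999) * 60, 1))   # 5.3 minutes
```

Running the numbers this way makes it obvious why "five nines" leaves almost no room for planned maintenance, while 99.9 percent gives you a workable yearly window.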

Building the Highly Available Solution’s Plan

Let’s look at the plan to use a Highly Available design in your organization and review the many questions you need to ask before implementing it ‘live’. Remember, if the server is down, people can’t work, and millions of dollars can be lost within hours. The following is a list of what could happen in sequence:

  1. A company uses a server to access an application that accepts orders and does transactions.
  2. The application, when it runs, serves not only the sales staff, but also three other companies who do business-to-business (B2B) transactions. The estimate is, within one hour’s time, the peak money made exceeded 2.5 million dollars.
  3. The server crashes and you don’t have a Highly Available solution in place. This means no failover, redundancy, or load balancing exists at all. It simply fails.
  4. It takes you (the systems engineer) 5 minutes to be paged, but about 15 minutes to get onsite. You then take 40 minutes to troubleshoot and resolve the problem.
  5. The company’s server is brought back online and connections are reestablished.

Everything appears functional again. The problem was simple this time: an application glitch caused a service to stop and, once the service was restarted, everything was okay. Now, the problem with this whole scenario is this: although it was a true disaster, it was also a simple one. The systems engineer happened to be nearby and was able to diagnose the problem quickly. Even better, the problem was a simple fix. Yet this easy problem still took the companies’ shared application down for at least one hour and, if this had been a peak-time period, over 2 million dollars could have been lost. Management will want assurance that the possibility of 2 million dollars in sales evaporating never occurs again. Worse still, the companies you connect to and your own clientele start to lose faith in your ability to serve them. This could also cost you revenue and the chance of acquiring new clients moving forward. People talk, and the uneducated could take this small glitch as a major problem with your company’s people instead of its technology. Let’s look at this scenario again, except with a Highly Available solution in place:

  1. A company uses a server to access an application that accepts orders and does transactions.
  2. The application, when it runs, serves not only the sales staff, but also three other companies who do business-to-business (B2B) transactions. The estimate is, within one hour’s time, the peak money made exceeded 2.5 million dollars.
  3. The server crashes, but you do have a Highly Available solution in place. (Note, at this point, it doesn’t matter what the solution is. What matters is that you added redundancy into the service.)
  4. Server and application are redundant, so when a glitch takes place, the redundancy spares the application from failing.
  5. Customers are unaffected. Business resumes as normal. Nothing is lost and no downtime is accumulated.
  6. The ‘one hour’ you saved your business in downtime just paid for the entire Highly Available solution you implemented.
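The arithmetic behind the first scenario is worth making explicit. This is a quick sketch using the figures from the steps above (the function name is hypothetical):

```python
# Estimate revenue at risk in the first (non-redundant) scenario.
# The response times and peak-hour figure come from the scenario above;
# everything else is illustrative.
PEAK_REVENUE_PER_HOUR = 2_500_000  # dollars in a peak B2B hour

def outage_cost(minutes_down: float, revenue_per_hour: float) -> float:
    """Rough revenue exposure for an outage of the given length."""
    return minutes_down / 60 * revenue_per_hour

# 5 minutes to page + 15 minutes to get onsite + 40 minutes to troubleshoot
total_minutes = 5 + 15 + 40
print(total_minutes)                                       # 60
print(outage_cost(total_minutes, PEAK_REVENUE_PER_HOUR))   # 2500000.0
```

One outage at peak time can cost more than the redundant hardware would have, which is exactly the point of step 6.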

Human Resources and Highly Available Solutions

Human Resources (people) need to be trained and work on site to deal with a disaster. They also need to know how to work under fire. As a former United States Marine, I know about the “fog of war,” where you find yourself tired, disoriented, and probably unfocused on the job. These characteristics don’t help your response time with management. In any organization, especially with a system as complex as one that’s highly available, you need the right people to run it.

Managing Your Services

In this section, you see all the factors to consider while designing a Highly Available solution. The following is a list of the main services to remember:

  • Service Management is the management of the true components of Highly Available solutions: the people, the process in place, and the technology needed to create the solution. Keeping this balance to have a truly viable solution is important. Service Management includes the design and deployment phases.

  • Change Management is crucial to the ongoing success of the solution during the production phase. This type of management is used to monitor and log changes on the system.
  • Problem Management addresses the process for Help Desks and Server monitoring.
  • Security Management, as discussed in Chapter 7, is tasked to prevent unauthorized penetrations of the system.
  • Performance Management is discussed in greater detail in this chapter. This type of management addresses the overall performance of the service, its availability, and its reliability.

Other main services also exist, but the most important ones are highlighted here. Service Management is crucial to the development of your Highly Available solution. You must cater to your customer’s demands for uptime. If you promise it, you better deliver it.

Highly Available System Assessment Ideas

The following is a list of items for you to use during the postproduction-planning phase. Make sure you’ve covered all your bases with this list:

  • Now that you have your solution configured, document it! A lack of documentation will surely spell disaster for you. Documentation isn’t difficult to do; it’s simply tedious, but all that work will pay off in the end if you need it.
  • Train your staff. Make sure your staff has access to a test lab, books to read, and advanced training classes. Go to free seminars to learn more about High Availability. If you can ignore the sales pitch, they’re quite informative.
  • Test your staff with incident response drills and disaster scenarios. Written procedures are important, but live drills are even better for seeing how your staff responds. Remember, if you have a failure on a system, it could fail over to another system, but you must quickly resolve the problem on the first system that failed. You could have the same issue on the other nodes in your cluster and, if that’s the case, you’re on borrowed time. Set up a scenario and test it.
  • Assess your current business climate, so you know what’s expected of your systems at all times. Plan for future capacity, especially as you add new applications and as hardware and traffic increase.
  • Revisit your overall business goals and objectives. Make sure what you intend to do with your high-availability solution is being provided. If you want faster access to the systems, is it, in fact, faster? When you have a problem, is the failover seamless? Are customers affected? You don’t want to implement a high-availability solution and have performance that gets worse. This won’t look good for you!

  • Do a data-flow analysis on the connections the high-availability solution uses. You’d be surprised at the effect that damaged NICs, the wrong drivers, excessive protocols, bottlenecks, and mismatched port speeds and duplex settings, to name a few problems, have on a system. I’ve made significant speed differences in networks simply by running an analysis on the data flow on the wire. A good example: suppose you had old ISA-based NICs that only ran at 10 Mbps. If you plugged your system into a port that uses 100 Mbps, you would still only run at 10, because that’s as fast as the NIC will go. And what would happen if the switch port was set to 100 Mbps and not to autonegotiate? The NIC wouldn’t communicate on the network at all because of the mismatch in speeds. Issues like this are common on networks and could quite possibly be the reason for poor or no data flow on your network.

  • Monitor the services you consider essential to operation and make sure they’re always up and operational. Never assume a system will run flawlessly just because no change has been implemented; at times, systems choke up on themselves through a hung thread or process. You can use network-monitoring tools like GFI, Tivoli, NetIQ, or Argent’s software solutions to monitor such services.
  • Assess your total cost of ownership (TCO) and see if it was all worth it.
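As a companion to the monitoring bullet above, here is a bare-bones sketch of a service check. The host names and ports below are hypothetical, and a real shop would rely on a dedicated monitoring product such as the ones named above; this only shows the underlying idea of probing an essential TCP service:

```python
# Minimal TCP service check -- the core idea behind service monitoring.
# Host names and ports are hypothetical examples.
import socket

def service_is_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Hypothetical essential services to watch
    essential_services = [
        ("orders.example.com", 443),   # order-entry application
        ("sql.example.com", 1433),     # back-end database
    ]
    for host, port in essential_services:
        status = "UP" if service_is_up(host, port) else "DOWN -- alert on-call staff"
        print(f"{host}:{port} {status}")
```

A scheduled task running a check like this every few minutes, paired with paging on failure, is the skeleton that full monitoring suites build on.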

Cost Analysis

Do a final cost analysis to check if you made the right decision. The best way to determine TCO is to go online and use a TCO calculator program that shows you TCO based on your own unique business model. Because, for the most part, all business models are different, the best approach is to run the calculator and figure TCO from your own answers to the calculator’s questions. Many are available online; just run a search in a search engine for ROI/TCO calculators and you will see them.
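If you’d rather see the shape of the arithmetic such a calculator performs, a simplified model looks something like this. The cost categories and dollar figures below are entirely hypothetical; a real TCO calculator asks far more detailed questions about your business model:

```python
# Simplified, hypothetical TCO model for a high-availability rollout.
# Categories and dollar figures are illustrative only.
def total_cost_of_ownership(hardware: float, software: float,
                            training: float, yearly_support: float,
                            years: int) -> float:
    """One-time costs plus recurring support over the life of the solution."""
    return hardware + software + training + yearly_support * years

tco = total_cost_of_ownership(
    hardware=80_000, software=40_000,
    training=15_000, yearly_support=20_000, years=3,
)
print(tco)  # 195000
```

Comparing a figure like this against the revenue an outage puts at risk is what tells you whether the solution paid for itself.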

Testing a High Availability System

Now that you have the planning and design fundamentals down, let’s discuss the process of testing your high-availability systems. You need to ensure the test runs long enough to get a solid sampling of how the system operates normally without stress (or activity) and how it runs with activity. Run the test long enough to obtain a solid baseline, so you know how your systems operate on a daily basis, then use that baseline for comparison during times of activity.

In Sum

This should give you a good running start on advanced planning for high availability, and it gives you many things to check and think about, especially when you’re done with your implementation.
