If you missed the other articles in this series please read:
- Designing and Implementing Effective Disaster Recovery Strategies with Citrix Technology Part 2: The Strategy
Over the last five years, we have witnessed some truly catastrophic events. The September 11th attacks and Hurricane Katrina brought terrible devastation to people and places alike. Beyond that, they had a severe impact on organizations and their ability to survive in times of crisis. Whole data centers were wiped out. People were ordered to evacuate and could not remain behind to manage what was left. Massive damage was done to the infrastructures of both New York and New Orleans. Suddenly IT managers were faced with critical staffing decisions and the realization that they had no way to get their organizations up and running again. Disaster Recovery (DR) has become a major part of the IT lexicon.
This article isn’t going to walk you through DR for your entire company. Those types of exercises take years to put in place and far more space than I have here. Instead, we will discuss general DR best practices and then focus on a small chunk of your infrastructure: your Citrix environment. As we go through it, though, look at all the things you peripherally interact with just to keep that environment working. Good DR strategies don’t try to answer every what-if; they provide a roadmap for what to do next.
Define Your Criticality Levels
The first step is to define your levels of criticality. In the worst-case scenario, your data center has just burned down, blown up, or flooded. Whatever the case, it is inaccessible. What kind of organization are you? Are you a financial institution, where every minute of downtime costs money? Do you have a production line dependent on your applications, or shippers that need constant access to inventory numbers? All of these are critical components. Your organization needs to understand the criticality level of each one. Is email critical to your success, or can it wait 24 hours? Which is more important to your ongoing operations, the external web servers or your payroll system? These are decisions that only you can make.
Many organizations define classes of Business Criticality (BC). For instance, at my current company we define five levels of BC, with 5 being the lowest priority and 1 the highest. Each BC level has a different response level attached, and any application or infrastructure component brought into the environment is assigned a BC as part of the process. For instance, a BC of 5 means that we as an organization can live for up to a month without this piece; a BC of 1 means we cannot live more than 4 hours without it. Once you have a structure for your BC levels, it becomes easier to think about the individual components and the requirements for each of them.
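A BC scheme like this is easy to capture as a simple lookup table. The article only defines the two endpoints (BC1 = 4 hours, BC5 = one month), so the intermediate values in this sketch are illustrative assumptions, not the author's actual definitions:

```python
# Maximum tolerable downtime per Business Criticality (BC) level, in hours.
# BC1 and BC5 come from the text; BC2-BC4 are assumed values for illustration.
BC_MAX_DOWNTIME_HOURS = {
    1: 4,      # cannot live more than 4 hours without it
    2: 24,     # assumed: one business day
    3: 72,     # assumed: three days
    4: 168,    # assumed: one week
    5: 720,    # can live up to a month (30 days * 24 hours) without it
}

def max_downtime(bc_level: int) -> int:
    """Return the maximum tolerable downtime in hours for a BC level."""
    return BC_MAX_DOWNTIME_HOURS[bc_level]
```

With a table like this in hand, assigning a BC to a new component becomes a concrete question: how many hours can the business run without it?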
Evaluate Your Environment
Now that you have a starting point, begin your environment evaluation and assign the appropriate BC levels. Some will be obvious. Power is probably a BC1; you can’t do a whole lot without it. Network may well be a BC1 too: after all, if your servers can’t communicate, that is a showstopper. How about your email environment? Is it as critical a component? Could you live 4 hours, 8 hours, or 24 hours without email? Each component will have to be evaluated on its own merits and as a piece of the infrastructure puzzle. For instance, if an application server is a BC1 but relies on an Oracle database, then you can’t rate that database a BC3. It can be a very complicated web, especially in large and complex environments with multiple two- or three-tier applications to restore.
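The rule implied here, that a component can never be less critical than something that depends on it (remembering that a lower BC number means higher criticality), can be checked mechanically once you have your BC assignments written down. A minimal sketch, with made-up component names:

```python
def find_bc_violations(bc_levels, dependencies):
    """Flag cases where a component depends on something with a higher
    BC number (i.e. lower criticality) than its own.

    bc_levels:    dict mapping component name -> BC level (1 = most critical)
    dependencies: dict mapping component name -> list of components it needs
    """
    violations = []
    for component, needs in dependencies.items():
        for dep in needs:
            if bc_levels[dep] > bc_levels[component]:
                violations.append((component, dep))
    return violations

# Example from the text: a BC1 app server wrongly backed by a BC3 database.
levels = {"app_server": 1, "oracle_db": 3}
deps = {"app_server": ["oracle_db"]}
print(find_bc_violations(levels, deps))  # flags the pair; the database must be BC1
```

Even a crude check like this catches the mismatches before a disaster does.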
One trick is to divide your infrastructure into larger segments, where each segment contains all of the applications, servers, and other components that need to interact with one another. You assign the BC to the entire chunk, and then define the recovery plan so that the pieces are brought up in order of criticality within that segment. In this case, all BC1s are not created equal! Your Accounting segment, for instance, could have six different systems, each with an overall BC1. But within the segment, when you are designing your recovery procedures, you know that the Oracle database has to be the first thing up, then your PeopleSoft server, or SAP, or however you define that priority. This gives you a good method to make sure that each portion gets restored in a properly defined timeframe, and it also helps you understand the relationships between the pieces and the order in which they need to be prioritized, even within the same BC level.
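Once the dependencies within a segment are written down, the restore order described above (database first, then PeopleSoft or SAP, and so on) is just a topological sort of the dependency graph. A small sketch using Python's standard-library graphlib; the component names are illustrative:

```python
from graphlib import TopologicalSorter

# Within the Accounting segment: each component maps to the set of
# components that must be restored *before* it.
restore_deps = {
    "oracle_db": set(),           # depends on nothing; first up
    "peoplesoft": {"oracle_db"},  # needs the database online
    "reporting": {"peoplesoft"},  # needs PeopleSoft online
}

# static_order() yields components with all prerequisites first.
restore_order = list(TopologicalSorter(restore_deps).static_order())
print(restore_order)  # oracle_db before peoplesoft, peoplesoft before reporting
```

The point is not the tooling but the discipline: if the graph has a cycle, the sorter raises an error, which is exactly the kind of planning flaw you want to find on paper rather than during a disaster.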
Obviously this is the worst-case scenario, where you have literally lost everything. In fact, most DR situations involve only one or two applications that need to be brought back online. Still, if you have plans in place for the worst case, everything else tends to fall in line behind it. So now you have defined your BC levels and evaluated your current environment. If disaster strikes, where exactly will you restore?
Picking a DR Site
Choosing your DR site can be an expensive proposition. For smaller organizations, there might not be a second site at all; many will try to restore at the main facility, or figure that if it is gone there is not much they can do anyway. Organizations large enough to run multiple data centers can use them as failover sites. A third choice is to pay one of the larger recovery providers, such as Iron Mountain or SunGard, for space in their facility to recover your environment. These companies often operate secure sites in other major metropolitan areas and charge based on your requirements for space and storage. This can be an attractive option if you don’t have the resources for your own facility, but there are often time and space constraints at these sites. You may face challenges getting the environment configured to your specifications, and if you bring in your own hardware for leased floor space, you will get little support from the facility managers.
For those organizations that already maintain additional data centers, planning for DR means considering what the added load will do to the surviving facility. Each facility will need enough capacity overhead for the anticipated spike in a DR situation. Your infrastructure will also have to support switching the network over to the new data center, rerouting clients, and so on. Thankfully, with a good DNS infrastructure this task is a lot easier than it used to be. The last consideration is your restore mechanism.
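The DNS point deserves one concrete number: clients keep cached answers until a record's TTL expires, so the TTL you publish bounds your worst-case cutover time. A toy calculation (the TTL values are illustrative, not a recommendation from the article):

```python
def worst_case_cutover_minutes(ttl_seconds: int) -> float:
    """Worst case: a client cached the old record just before the switch,
    so it keeps pointing at the dead data center for one full TTL."""
    return ttl_seconds / 60

# A common default TTL of 24 hours vs. a DR-friendly 5 minutes.
print(worst_case_cutover_minutes(86400))  # 1440.0 minutes: a full day
print(worst_case_cutover_minutes(300))    # 5.0 minutes
```

This is why DR runbooks often include lowering TTLs on key records well ahead of a planned failover test.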
How Do You Restore?
Obviously, if a real disaster strikes, you may not have access to your production servers. That means your ability to restore will depend entirely on your backups, software library, and installation documentation. Regular, tested backups of key components make your restore at least feasible. Larger organizations usually keep these backups in an off-site storage facility in case of a disaster. Whatever backup mechanism you use, you must have a way of restoring those backups at your DR facility. I have participated in several DR tests where the backup tape formats differed between the two sites and neither site could restore the other’s tapes. It seems ridiculous, but it happens all the time.
If your company follows the ITIL guidelines, you might already have a software library for all installed software. This can be a valuable tool in a DR situation, although a common roadblock is finding someone who remembers how an application was installed in the first place. Keeping hard copies of installation media, instructions, and related documentation offsite is an overlooked but extremely important DR component.
This is simply a brief look at how to tackle DR in your organization. In the next part of this article, I will present a specific scenario for a Citrix environment and discuss how all of these DR steps apply. DR is a complicated, expensive, and often overlooked component of a stable infrastructure. In this day and age you really can’t be too careful with your DR strategy.