Early last winter, before the whole pandemic thing started, I was at a conference talking with a couple of admins who used to manage datacenters. One of them had once experienced a full-blown disaster that knocked the local power grid out for several days, and the scope of the event brought to light several deficiencies in that datacenter's disaster preparedness and recovery plan. As we tossed our comments around, one of them said something that stuck firmly in my mind:
If you have an emergency system in place and it hasn’t been properly tested, you don’t have an emergency system in place.
The keyword there is “properly,” and this led to a discussion among us about what proper disaster preparedness is for the IT professional. At the conclusion of our discussion, we decided that we didn’t really have a good answer to our question.
Here’s why. What often happens in a disaster is the unexpected, and hence the unplanned for. In my colleague’s case, when the power grid went down in the area where the datacenter was located, the UPS kicked in, the diesel generators came on, and servers started booting up again. Then, after a few more minutes, the generators shut down and wouldn’t restart. What had happened?
What’s going on?
The skeleton crew on duty at the time quickly got to work investigating what might be wrong but soon got nowhere. Eventually, one of them suggested they take a look outside to see if anything strange might be going on. Upon examining the generators, which were situated outside the building, they discovered that the air filters were clogged with what looked like fluff. A later investigation traced the fluff to the numerous dandelions that had been running riot in the unmowed grass of the vacant lot near the datacenter. Does this mean that neglecting regular landscaping work should be included as part of the testing procedure for IT disaster preparedness?
As we reflected upon this, one of the other admins in our group told a story he had heard from a colleague that illustrated a similar challenge in adequately preparing a datacenter for possible disasters. In that incident, the power grid failed at the datacenter, the generators kicked in, and everything seemed OK. Then, after several hours, the generators coughed and died.
Disaster recovery plan and the real world
An investigation quickly found that the diesel holding tank on the roof of the building was empty, yet it had been verified as full only two weeks earlier. There weren’t any leaks in it; the problem was that the tank was simply too small to hold more than a few hours of fuel for the generators. The customers that used the datacenter were rightly upset about the outage they experienced, and the company that owned the datacenter came down hard on the datacenter’s manager for installing such a small fuel tank on the roof. But the manager was eventually cleared of any fault in the matter because it turned out that the local civic government had an ordinance that prevented large tanks of fuel from being located on top of buildings. When the datacenter was being built, the company had applied for a special permit to waive the ordinance, but this was denied, probably rightly so, because it wouldn’t be nice if the tank somehow sprang a leak and oceans of diesel fuel saturated the exterior of the building and its surrounding grounds. Does this mean that you should run your disaster recovery plan by an army of lawyers before you have it approved by management?
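The arithmetic behind that outage is worth sketching, because it’s the kind of back-of-the-envelope check a DR plan review should include. The figures below are entirely hypothetical (real burn rates depend on generator model and load), but the calculation itself is the point:

```python
# Hypothetical figures -- check your generator's datasheet for real burn rates.
TANK_CAPACITY_LITERS = 1000       # a small rooftop day tank
BURN_RATE_LITERS_PER_HOUR = 220   # a large diesel generator under heavy load


def runtime_hours(capacity_liters, burn_rate_lph):
    """Hours of generator runtime a full tank provides at a steady burn rate."""
    return capacity_liters / burn_rate_lph


hours = runtime_hours(TANK_CAPACITY_LITERS, BURN_RATE_LITERS_PER_HOUR)
print(f"A full rooftop tank lasts roughly {hours:.1f} hours")
```

With numbers like these, the tank is dry in under five hours, which is exactly the failure mode in the story: everything "works" until a grid outage outlasts the tank.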
This prompted another of our group to share another story along similar lines. He had heard about a datacenter that had built a small diesel holding tank on the roof and a larger primary holding tank in the basement. There was, of course, a transfer pump designed to pump fuel from the basement to the roof tank to keep it full at all times. The pump’s electrical power was supposed to come from the generator instead of the power grid. That way, if the grid suddenly went down, the fuel in the rooftop tank would let the generator start running, which would then keep the pump running to keep the rooftop tank topped up. Unfortunately, the electrical contractors who installed the pump had made a mistake and left it connected to the main AC power grid instead, so the fuel in the large basement holding tank never got transferred to the rooftop tank, and the generator quickly ran out of fuel. Does this mean you should hire a second contractor to review every single electrical connection your primary contractor has made when setting up systems in your datacenter?
A geek’s solution
As we drew near the end of our discussion, we decided that the proper way of testing the disaster readiness and recovery plan for a datacenter would be to blow up the town’s electrical grid and watch what happened over the several days that followed. Would the generators start running properly? Would they keep on running until the grid was repaired? This, of course, is a geek’s solution to the problem and typically unrealistic, since geeks like us IT pros tend to think of everything in the universe as revolving around the technologies we work with.
OK, what if we just killed the power supply from the local substation to our datacenter? Say for 24 hours to see if everything runs OK on generators? Sorry, nix to that — our city has pollution and noise regulations that prevent businesses from running large generators for longer than 30 minutes unless there’s an actual emergency happening or lives are at stake.
OK, let’s pull the main breakers and let the generator run for 30 minutes and see whether it works as intended. We could do that every six months to verify we’re ready for a disaster. No, wait, didn’t I read somewhere that diesel generators need to be run for at least 30 minutes every 30 days? OK, so now we’ve got some connection between disaster preparedness and proper regular maintenance. And we have a budget to prepare our disaster recovery plan, but our maintenance budget doesn’t have a line item for this, so it’s back to the drawing board.
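If that 30-minutes-every-30-days rule applies to your generators (check the manufacturer’s manual and standards like NFPA 110 rather than taking my half-remembered figure for it), flagging overdue test runs is the kind of thing a few lines of script can do. The log format, thresholds, and dates below are all illustrative assumptions:

```python
from datetime import date, timedelta

# Assumed maintenance rule: at least one 30-minute test run every 30 days.
MAX_GAP = timedelta(days=30)
MIN_RUN_MINUTES = 30

# Illustrative log of (test-run date, minutes the generator ran).
test_runs = [
    (date(2020, 1, 5), 35),
    (date(2020, 2, 2), 30),
    (date(2020, 3, 20), 40),  # more than 30 days after the previous run
]


def overdue_gaps(runs, max_gap=MAX_GAP, min_minutes=MIN_RUN_MINUTES):
    """Return pairs of consecutive qualifying runs separated by more than max_gap.

    Runs shorter than min_minutes don't count as a proper test.
    """
    qualifying = sorted(d for d, minutes in runs if minutes >= min_minutes)
    return [(a, b) for a, b in zip(qualifying, qualifying[1:]) if b - a > max_gap]


for a, b in overdue_gaps(test_runs):
    print(f"Maintenance gap: {(b - a).days} days between {a} and {b}")
```

A check like this only tells you that a test run happened, of course, not that the run was done under realistic load, which is a separate argument for another evening.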
Wait a minute, maybe pulling the breaker once isn’t enough to really test electrical integrity. My civil engineering friend who specializes in HVAC systems for office towers says you should always pull and re-strike a breaker at least three times to properly test whether it works. OK, better write that into our DR plan somewhere. Does this mean we should hire an engineer or two to go over our plan to ensure we’ve covered everything? Do we have enough budget for that?
We broke up at that point and went to our hotel rooms. I couldn’t fall asleep for a while that night as I kept having a nagging feeling that the disaster recovery plan for our own business might be failing somewhere.
How well do you sleep at night?
Featured image: Shutterstock