Rats in the datacenter?
No, maybe that wasn’t the actual cause of the recent disastrous outage that Delta Air Lines experienced in its datacenter. The airline originally revealed only that a power outage in its datacenter had led to hundreds of Delta flights being cancelled, leaving thousands of unhappy customers. A few days later the airline told the Associated Press that a piece of equipment had malfunctioned badly and caused a “small fire” that somehow snowballed into a major outage at its Atlanta-area datacenter, shutting down its entire flight operations and reservations systems.
A number of news and tech outlets have speculated about how such a critical incident could have happened. But we don’t really know the details of what occurred, and my guess is that Delta will wait months if not years before explaining in detail what went wrong and — more importantly — why the event occurred.
That doesn’t lessen the value of considering what might have happened at Delta, and to help me think this through I asked my friend and colleague Florian Klaffenbach for his thoughts on what can go wrong in datacenter environments. Flo is a technical architect based in Berlin, Germany, and he specializes in Microsoft private and public cloud technologies. He is also a Microsoft MVP awardee in the area of Cloud and Datacenter Management, and he runs a popular site called Flo’s Datacenter Report where he blogs regularly about Windows Server, Hyper-V, and datacenter equipment.
Strange happenings in the datacenter
I started by asking Flo whether he’d ever seen anything in his work as a datacenter consultant that could have caused something similar to what Delta experienced. He replied that in his last 10 years visiting the datacenters of many different commercial and governmental institutions, he had seen “many strange things, but the most disturbing things I saw were always around server systems, which are basically the lifeblood of the organization.” I asked him what sort of stuff he’d seen go wrong with servers in datacenters, and he replied with a long list of things that had shocked and even horrified him.
“I’ve seen systems running on old, no-longer-supported hardware or software that was past its end of lifecycle,” he says. “Often, no one working at the datacenter knew why that hardware or software was still there or even how it worked. I’ve seen critical systems with no backup or failover in place when it comes to disaster protection.”
There’s also the case of management by mismanagement. “I’ve seen systems managed by a service provider or administration team that has no clue about how to handle the applications that run on them, or that doesn’t even give a shit about SLAs [service level agreements],” he says. “And I’ve often seen management underestimate the importance of a service and put revenue above everything while ignoring the losses that could be incurred by an outage of that application.”
And as every IT professional knows, money is often a big part of the problem. “I’ve also seen organizations try to re-engineer programs with open source stuff to save a few pennies, and the outcome is often unstable, as if it had been engineered by a child instead of an adult,” Flo says.
When mismanagement and lack of support are the problem, IT pros often must cut corners. “I’ve also seen IT staff that have no time for testing their systems and applications, so they end up having to do all their deployments and updates on hot systems,” Flo says. “In fact, the motto most IT departments still seem to live by is never touch a running system. Worst of all, the IT staff usually know about all these issues and know that they’re running everything on a high-risk basis. But it seems the only thing they can do is pray that the service doesn’t fail on their watch.”
Which reminds me of this classic Dilbert comic strip about engineers hiding their heads in the sand hoping that nothing major will go wrong on their watch. One of my engineering buddies tells me this is all too true of his profession.
Root causes, possible solutions
Why do IT departments let their datacenters get into these kinds of messes? Flo thinks there are several reasons. “Lack of knowledge and support by management,” he says. “Management decisions based on having no clue about the business side of IT. Admins with such extreme workloads in their daily work that they have no time for maintenance and renewal projects. And of course cost-cutting, cost-cutting, and cost-cutting.”
How can an organization dig itself out of this mess? “Most importantly, just be strong, collect your facts, and challenge management,” Flo says. “IT departments should not start off by complaining and begging management to fix the situation. Instead they should go the other way around and think about IT more from a business perspective than a purely technical one.”
He also says that organizations should spend time rethinking their IT and services and work up a budget for the changes they want to implement. “Then present the issues and your solutions to management, being sure to include a business plan,” he says. “Give them some room to discuss your proposal, but try to point them in the right direction, and push management toward a clear decision and a full commitment to your initiative. And if necessary, just pay for an external expert to come in and help you out.”
While these steps that Flo suggests can help resolve problems from a management perspective, there’s also the technical side of things to consider. Flo’s tips and best practices to “Delta-proof your datacenter” include the following:
- Stick with standards in your environment.
- Choose partners who develop and work with state-of-the-art technologies.
- Leverage cloud environments such as Microsoft Azure when you don’t have your own resources or if you need resources only for a limited period of time, for example for testing purposes.
- Involve well-known experts such as Microsoft MVPs who have a solid personal ethic behind their products and solutions.
“Remember that there are no holy cows in your infrastructure,” Flo says. “If a system or application needs to be replaced, it needs to be replaced — do it. And think twice — in fact, think three years into the future — before you do anything.”
The bottom line for datacenter IT pros? Do what you gotta do and don’t let yourself get Delta’d.
Photo credits: FreeRangeStock, Pixabay