February 28, 2017, will be remembered as a rude reality check for the global IT enterprise. The reason: Amazon Web Services’ Simple Storage Service (S3) was down for nearly four hours. Forrester Research cloud analyst Dave Bartoletti has famously compared AWS S3 to “air” in the “cloud” context; that’s how big it is. The downtime affected thousands of websites, in ways both quantifiable and unquantifiable.
How did it happen?
The S3 team was trying to get to the root cause of a problem that was slowing down the S3 billing system. During this debugging exercise, the team ran a command intended to remove a small number of servers from one subsystem. An error in the command’s input, however, took a much larger set of servers offline.
Restarting these servers took longer than anticipated, and until they were back, S3 could not serve requests in the affected region at its expected capacity. This is how a routine day at the office for an S3 support team soon transformed into the catastrophe of the year. And no, it was not because the Yankees won the World Series by having the most money or the Patriots cheated their way to another Super Bowl. This is something else entirely!
Web retailers felt the biggest impact from the S3 outage. Their websites suffered serious performance problems, with page load times rising to 10 times their normal values. In an ecommerce context, it’s estimated that more than a third of shoppers will switch websites if a web page takes even 10 seconds to load. That is about as long as it takes LaVar Ball to make another ridiculous statement!
The worst-affected retailers observed page load times of 50 seconds and higher, and in many cases even 80 seconds. For an ecommerce company, slower page loads translate directly into lost potential customers, and the S3 outage affected more than 20 of the top 100 web-based retailers.
Lessons for enterprises to learn
While the S3 outage meant serious dollar losses for top ecommerce websites, it also delivered some hard lessons. Here are the key takeaways for IT decision makers.
Single cloud vendor: Not a good idea
During the S3 outage, several web retailers did just fine because they stored only part of their data on the Amazon cloud, and hence were able to weather the storm with minimal or no damage. Other retailers managed to render images using data stored on other vendors’ cloud platforms, thus achieving business continuity despite the S3 downtime.
This highlights the importance of a multivendor cloud adoption model, where enterprises spread their data assets across cloud solutions from more than one vendor. This mitigates business continuity risks in outages such as the one suffered by S3. Why put all your eggs in one basket?
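The multivendor idea can be sketched as a read path that tries one storage provider and falls back to the next. This is a minimal illustration, not any vendor's API: `fetch_with_fallback`, `primary_store`, and `secondary_store` are hypothetical names, and the two providers are simulated with plain functions.

```python
from typing import Callable, Dict

# Hypothetical fetchers: each takes an object key and returns bytes,
# or raises an exception if that provider is unavailable.
Fetcher = Callable[[str], bytes]


class AllProvidersFailed(Exception):
    """Raised when no configured provider could return the object."""


def fetch_with_fallback(key: str, providers: Dict[str, Fetcher]) -> bytes:
    """Try each provider in insertion order; return the first success."""
    errors = {}
    for name, fetch in providers.items():
        try:
            return fetch(key)
        except Exception as exc:  # a real system would catch narrower errors
            errors[name] = exc
    raise AllProvidersFailed(f"could not fetch {key!r}: {errors}")


# Simulated providers, for illustration only.
def primary_store(key: str) -> bytes:
    raise ConnectionError("primary object store is down")


def secondary_store(key: str) -> bytes:
    return b"image-bytes-for-" + key.encode()


providers = {"primary": primary_store, "secondary": secondary_store}
result = fetch_with_fallback("logo.png", providers)
```

Here the primary store is simulated as down, so the image bytes come from the secondary store. In practice the hard part is keeping the copies in sync across vendors, which this sketch leaves out.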
Innovative disaster-recovery methods
Having a robust contingency plan in case a third-party cloud provider fails is just good sense. A decentralized cloud strategy with a focus on disaster recovery is the way forward for enterprises. Data backup in the cloud is not really as daunting as it sounds. It is a necessity. Sort of like avoiding eating in a Wal-Mart McDonald’s if you want to avoid being ripped off with a cold burger that should be hot!
Enterprises that have looked beyond traditional cloud disaster-recovery approaches are more future-proof than others. By leveraging virtualization, companies can deploy disaster recovery sites in the cloud as a reliable and robust method of making critical data available in case of outages.
Proactive resilience testing
Why wait for an outage to realize that your enterprise’s systems are not fail-proof? Proactive resilience testing is a crazy yet effective method adopted by the leading consumers of the cloud, such as Netflix. By deliberately killing nodes, overloading regions, and disrupting services, these enterprises check the disaster readiness of their systems as a whole.
Netflix even uses Chaos Gorilla and Chaos Monkey (open-source tools) to automatically kill internal systems and test their fault tolerance in real time. Is it any wonder, then, that Netflix did not report any major troubles during the S3 outage?
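To make the idea concrete, here is a toy version of deliberately killing a node and then checking that the service still answers. It is a simulation under stated assumptions, not how Chaos Monkey itself is implemented: `Node`, `Cluster`, and `chaos_round` are hypothetical names, and the "nodes" are in-memory objects rather than real servers.

```python
import random
from typing import List


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def handle(self, request: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Cluster:
    def __init__(self, nodes: List[Node]) -> None:
        self.nodes = nodes

    def handle(self, request: str) -> str:
        # Try nodes in order and skip any that are down.
        for node in self.nodes:
            try:
                return node.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy node available")


def chaos_round(cluster: Cluster, rng: random.Random) -> Node:
    """Kill one randomly chosen live node and return it."""
    live = [n for n in cluster.nodes if n.alive]
    victim = rng.choice(live)
    victim.alive = False
    return victim


rng = random.Random(42)  # seeded so the experiment is repeatable
cluster = Cluster([Node("node-a"), Node("node-b"), Node("node-c")])
killed = chaos_round(cluster, rng)
response = cluster.handle("GET /index.html")  # should still succeed
```

The useful part is the assertion that follows the kill: if `cluster.handle` raises after a single node dies, the system is not as redundant as it was assumed to be.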
Real-time failure identification and automated responses
Knowing quickly that your enterprise is dealing with a cloud system outage can help reduce the impact of the downtime. This can be achieved in two phases. Phase one is to analyze the normal behavior of the systems and establish baseline values. Phase two is to compare real-time values against that baseline and raise alarms based on deviations and tolerances. A wide range of automated responses can then be configured for these situations.
Vendors such as AWS offer several facilities to enable this. For instance, AWS uses a concept called Health Checks to offer a customized view of the server resources used by your websites, thus paving the way to easy identification of anomalies. Then, Amazon CloudWatch can be set up to track metrics such as server availability, raise alarms, trigger automated reactions to failures, and monitor log files.
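The two phases can be sketched in a few lines. This is one simple way to do it, assuming the metric is roughly stable under normal conditions: the baseline is a mean and standard deviation, and the tolerance is a number of standard deviations. The function names and sample values are made up for illustration.

```python
from statistics import mean, stdev
from typing import List, Tuple


def learn_baseline(samples: List[float]) -> Tuple[float, float]:
    """Phase one: summarize normal behavior as (mean, standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value: float, baseline: Tuple[float, float],
                 tolerance: float = 3.0) -> bool:
    """Phase two: flag values more than `tolerance` deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) > tolerance * sigma


# Hypothetical page load times in seconds, recorded on a normal day.
normal_load_times = [1.9, 2.1, 2.0, 2.2, 1.8, 2.0]
baseline = learn_baseline(normal_load_times)

# Live readings: two ordinary values and one outage-sized spike.
alarms = [t for t in [2.1, 2.3, 50.0] if is_anomalous(t, baseline)]
```

Only the 50-second reading ends up in `alarms`. A real monitoring setup would feed that alarm into an automated response rather than a list.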
Building systems with redundancy
Redundancy helps enterprises overcome the ravages of a cloud-platform outage quickly and with minimal human intervention. To build redundancy into your enterprise cloud systems, the “standby” approach is a reliable method.
With standby redundancy, applications automatically detect disconnections or lack of resource allocation from the main server, and in such events, automatically shift over to backup and redundant systems. Ideally, backup systems should remain off unless the need to switch over arises. An alternative strategy is to have backup systems always running in the background, so that the failover time is minimal. Of course, the latter option is more expensive.
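A standby switchover can be sketched as a small wrapper around two servers. This is an in-memory illustration only: `Server` and `StandbyPair` are hypothetical names, and the backup is modeled as "off" until it is needed, which corresponds to the cheaper of the two strategies described above.

```python
class Server:
    def __init__(self, name: str, running: bool) -> None:
        self.name = name
        self.running = running

    def start(self) -> None:
        self.running = True

    def serve(self, request: str) -> str:
        if not self.running:
            raise ConnectionError(f"{self.name} is not running")
        return f"{self.name}: {request}"


class StandbyPair:
    """Route to the active server; start and switch to the standby on failure."""

    def __init__(self, primary: Server, standby: Server) -> None:
        self.active = primary
        self.standby = standby

    def serve(self, request: str) -> str:
        try:
            return self.active.serve(request)
        except ConnectionError:
            # The backup stays off until needed, so it is started here.
            if not self.standby.running:
                self.standby.start()
            self.active, self.standby = self.standby, self.active
            return self.active.serve(request)


primary = Server("primary", running=True)
backup = Server("backup", running=False)
pair = StandbyPair(primary, backup)

before = pair.serve("GET /cart")
primary.running = False  # simulate an outage of the main server
after = pair.serve("GET /cart")
```

The call to `start()` is where the failover delay lives. Keeping the backup running all the time removes that delay at the price of paying for an idle server.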
The limitation of the “standby” option is the time between error detection and the switchover to a redundant system. Active redundancy, wherein applications are distributed across several redundant systems, addresses this. In this approach, the workload is spread over multiple resources, and if one fails, the others start handling the increased workload, with close to zero switchover time.
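The effect of active redundancy on load can be shown with a simple round-robin distribution. This is a sketch under the assumption that every replica can serve every request; `distribute` and the replica names are hypothetical.

```python
from collections import Counter
from itertools import cycle
from typing import Dict, List


def distribute(requests: List[str], healthy: Dict[str, bool]) -> Counter:
    """Round-robin requests over the replicas currently marked healthy."""
    live = [name for name, ok in healthy.items() if ok]
    if not live:
        raise RuntimeError("no healthy replicas")
    load: Counter = Counter()
    # zip stops when the requests run out; cycle repeats the live replicas.
    for _request, replica in zip(requests, cycle(live)):
        load[replica] += 1
    return load


requests = [f"req-{i}" for i in range(12)]

all_up = distribute(requests, {"r1": True, "r2": True, "r3": True})
one_down = distribute(requests, {"r1": True, "r2": False, "r3": True})
```

With three healthy replicas each handles 4 of the 12 requests. With `r2` down, the remaining two handle 6 each. Nothing has to be started or switched, which is why the switchover time is close to zero, but the survivors need enough spare capacity to absorb the extra load.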
To everyone who’s in love with the panic that spread because of the S3 outage, and to everyone who can’t help but say that “cloud is dead,” we say: get a life! S3 has been powering the cloud infrastructure for years now, and this single instance of downtime will only make a repeat event less likely. You actually have a better shot at seeing a fantastic Wolverine movie, which apparently is a pretty rare phenomenon. But it beats Star Wars!
Of course, enterprises will want to learn the right lessons from the experience, to ensure that their business websites and apps remain accessible and functional in spite of such outages in the future.