Back in March 2010, Brian Snow, the former technical director of the National Security Agency, said he didn’t trust the public cloud. We’re 7 ½ years down the line, and pretty much every enterprise on the globe uses a public cloud-based service, or will start using one very soon. Now, nobody can deny the strong grounds on which enterprises justify their millions of dollars’ worth of investments in public cloud services. Public cloud services have transformed the way the world of business works, right from communication to data management. Is there any place, then, for a question on the trustworthiness of the public cloud? We explore the answers in this guide.
What prompts this question?
Well, for starters, let’s think for a moment about the events of February 28, 2017, when Amazon’s Simple Storage Service (S3) suffered downtime and took hundreds of business applications down with it. Here’s a summary of what happened, and what Amazon did to salvage the situation.
An administrator mistyped a command during maintenance work on some S3 servers. The error took more servers offline than intended. As a result, the US East Zone S3 environment edged too close to its full capacity, causing accessibility issues in thousands of web services that depend on S3.
Amazon has since modified its maintenance tools to ensure that they don’t take too many servers offline at one time. The S3 system was then refactored into smaller cells to limit how far the impact of a failure can spread. Also, Amazon audited other systems to make sure they did not suffer from the same flaws.
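To make the first of those fixes concrete, here is a minimal sketch of the kind of guard a maintenance tool could apply before taking servers offline. It is not Amazon's actual tooling: the function name `validate_removal`, the `UnsafeRemovalError` exception, and the `min_required` and `max_batch` limits are all hypothetical, chosen only to illustrate the idea of refusing a request that removes too much capacity at once.

```python
class UnsafeRemovalError(Exception):
    """Raised when a removal request would cut too deeply into capacity."""


def validate_removal(requested: int, active: int,
                     min_required: int, max_batch: int) -> int:
    """Check a request to take `requested` servers offline.

    All limits are hypothetical: `max_batch` caps how many servers one
    command may remove, and `min_required` is the smallest fleet size
    that still serves traffic safely. Returns the number of servers
    left active if the request is allowed.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")
    if requested > max_batch:
        raise UnsafeRemovalError(
            f"refusing to remove {requested} servers; batch limit is {max_batch}"
        )
    remaining = active - requested
    if remaining < min_required:
        raise UnsafeRemovalError(
            f"removal would leave {remaining} servers; minimum is {min_required}"
        )
    return remaining


if __name__ == "__main__":
    # A small, safe request goes through.
    print(validate_removal(requested=5, active=100, min_required=80, max_batch=10))
    # A mistyped request (50 instead of 5) is rejected before anything goes offline.
    try:
        validate_removal(requested=50, active=100, min_required=80, max_batch=10)
    except UnsafeRemovalError as err:
        print("blocked:", err)
```

The point of the sketch is that a typo turns into a rejected command rather than an outage; the real limits would have to come from the vendor's own capacity planning.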
AWS was launched in 2006. Considering it took Amazon almost 11 years to figure out that a maintenance aid tool could take down several servers at once, it’s scary to think of how many other such unidentified failure points might exist in the public cloud. The question, then, is not entirely misplaced.
Is it time to rethink the public cloud?
Now, instead of treating the incident at a symptomatic level, it’s more important to dig deeper and identify the real reasons behind such an event and its impact. When public cloud services first started making an impact, the understanding was that computing would work along the lines of utilities such as electricity.
There could be complexity in terms of the number of suppliers, but the model of delivery would make consumption very simple. All the innovation, disruptions, and transformations change the quality, cost, and user experience, but essentially the utility’s delivery remains as simple as ever. The public cloud was imagined as an equivalent system, from where enterprises could consume as much storage and computing resources as needed.
However, that has stopped being the case recently because of intense competition among the major public cloud vendors, Amazon included. Market pressure is driving them to make continuous changes to the core of their cloud computing systems, in search of improvements that help them retain and acquire customers.
Public cloud vendors need to realize that all their tweaking, experimenting, and improvement efforts need to happen without even the slightest possibility of disrupting services as they currently run. Public cloud services are something that the global business engine counts on. When we add a force of change into the mix, we can never assume 100 percent availability.
Gaps in enterprise understanding of public cloud services
The way a vendor markets its public cloud services depends on the questions it anticipates from consumers. In this sense, enterprise consumers should have been able to drive more transparency in the way vendors promoted, demonstrated, and explained their solutions. However, there is still a cloud (pun intended) around enterprise understanding of public cloud solutions. To make things clearer, questions along these lines need to be voiced more often:
How willing is the vendor to share the operational status of the cloud environment with customers? Now, there are tools for this, for instance CloudWatch for Amazon. However, these tools don’t really reflect the core operational health of the cloud environment.
What extent of human intervention exists in routine maintenance and upkeep activities of the public cloud system? What measures are in place to make sure that human error doesn’t bring systems down?
What kind of metrics are being used to measure critical parameters such as performance, throughput, capacity, utilization, and contention? Which parameters are the most critical for the public cloud vendor, and so receive the most improvement effort? More importantly, what’s deemed as normal and healthy behavior for these metrics, and what is the alarm state?
How many single points of failure (SPOF) exist in the environment? How does the vendor intend to mitigate them in the coming years? The indexing subsystem proved to be a SPOF in Amazon S3’s case; it’s anybody’s guess how many consumers knew of the fact. And if even Amazon didn’t know it, it’s an even scarier proposition! Almost as scary as watching the first five (3 + 2 = 5) “Spider-man” movies. How fake can a group of movies be?
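The question about normal versus alarm states can be made concrete with a small sketch. The `MetricBand` class, the metric name, and the threshold values below are hypothetical, not taken from any vendor's tooling; they only show the kind of explicit definition an enterprise could ask a vendor to share.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricBand:
    """Hypothetical thresholds for one metric where higher values are worse."""
    name: str
    warn_at: float   # at or above this value, the metric needs attention
    alarm_at: float  # at or above this value, the metric is in the alarm state

    def classify(self, value: float) -> str:
        """Map a measured value to 'normal', 'warning', or 'alarm'."""
        if value >= self.alarm_at:
            return "alarm"
        if value >= self.warn_at:
            return "warning"
        return "normal"


if __name__ == "__main__":
    # Assumed example: utilization expressed as a fraction of total capacity.
    utilization = MetricBand(name="capacity_utilization", warn_at=0.70, alarm_at=0.90)
    for sample in (0.55, 0.75, 0.95):
        print(utilization.name, sample, utilization.classify(sample))
```

Whatever the real thresholds are, having them written down this plainly is what lets a customer judge whether "healthy" means the same thing to the vendor as it does to the business.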
What’s the answer, then?
Enterprises just can’t afford to be skeptical about public cloud services, considering the cost, scalability, and convenience benefits they deliver. How do they make sure that they don’t suffer the fate that hundreds of enterprises faced when S3 went down? Here are some recommendations:
- Bring resiliency to your cloud-hosted apps by deploying them across multiple availability zones within the same public cloud, or even across different public clouds.
- Maintain a disaster-recovery site in a private cloud, a hybrid cloud, or colocation facility.
- Before deploying applications on the cloud, make sure they are resilient, immune to in-application SPOFs, and can be scaled out.
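The first recommendation can be sketched in a few lines. This is a simplified illustration, not a production pattern: `fetch_with_failover` and the backend names are hypothetical, and each backend is represented by a plain callable standing in for a real client that reads from one availability zone or one cloud.

```python
from typing import Callable, List, Sequence, Tuple


def fetch_with_failover(
    backends: Sequence[Tuple[str, Callable[[], bytes]]]
) -> Tuple[str, bytes]:
    """Try each backend in order and return the first successful result.

    `backends` is an ordered list of (name, fetch) pairs, where `fetch`
    is any zero-argument callable that returns data or raises on failure.
    """
    errors: List[str] = []
    for name, fetch in backends:
        try:
            return name, fetch()
        except Exception as exc:  # any failure means: move on to the next backend
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


if __name__ == "__main__":
    def primary_zone() -> bytes:
        # Stand-in for a zone that is currently unreachable.
        raise ConnectionError("zone unavailable")

    def secondary_zone() -> bytes:
        # Stand-in for a healthy replica in another zone or cloud.
        return b"report.csv contents"

    served_by, data = fetch_with_failover(
        [("zone-a", primary_zone), ("zone-b", secondary_zone)]
    )
    print(served_by, data)
```

The sketch only works if the data is already replicated to the second location, which is why the recommendations on disaster recovery and in-application SPOFs belong together with this one.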
Acknowledge the risks
The public cloud is here to stay. Unfortunately, there have been enough instances to indicate that the ecosystem might not be as stable as enterprises would want it to be. However, by acknowledging the potential risks, and assuming the responsibility of doing everything needed to ensure business continuity, enterprises can avoid the bitter taste of events such as the AWS S3 outage.