Are we overdependent on the AWS cloud?

Most cloud providers boast uptimes of up to 99.99 percent, which is what you would expect from big names like Amazon, Google, and Microsoft. (Amazon S3 is also designed for 99.999999999 percent durability, but durability describes the odds of losing stored data, not how often the service is reachable.) But when that 0.01 percent of the time does arrive and S3 actually goes down, as it did on February 28, 2017, it affects a third of the Internet.

Customers suddenly realize that they’ve got all their eggs in one basket and frantically start looking at other options. In fairness to Amazon, it is entitled to that 0.01 percent. However, casually mentioning that the outage was caused by a typo not only makes people sit up and wonder, but also rubs salt into a fresh wound. When you’re the world’s largest provider of cloud services and you go down, you take everyone down with you. Unfortunately, a lot of those “everyones” are startups that can’t afford to go down or have their reputation tarnished so early in the game.

A bump in the road

February 28 was an eventful day for AWS: from about midday until about 5 p.m. Eastern Time, when services were fully restored, thousands of websites and apps experienced degraded performance and errors. Though Amazon initially had no explanation for what went wrong, it later released a statement attributing the outage to an incorrectly entered command. A lot of people are asking why there weren’t backups or safety measures in place to prepare for such an eventuality, but what they forget is that even the cloud giant is not above making an honest mistake.

When you back a racehorse because it’s the best and the most dominant, you have to take the good with the bad. Even thoroughbreds have their off days, and AWS should be allowed one. The problem probably arises from people’s expectation that the laws of physics will suddenly cease to apply to the No. 1 cloud provider. If people had it in their heads that 99.99 percent means that one day it will go down, they probably would have a backup server at another location that could be brought online in such an emergency.

All credit to the cloud

This isn’t really bad publicity for the cloud, either. In fact, it’s more a testament to how quickly the technology has spread through the enterprise and how dominant AWS has been in this sector. The bright side here is that all the apps and websites that went down with AWS have seen their share of downtime, which only lasted about five hours, and they can now look forward to the other 99.99 percent — hopefully.
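It helps to put numbers on that "other 99.99 percent." The sketch below is plain arithmetic, not an official SLA calculation: it assumes a 365-day year, and the function names are made up for illustration. It shows how much downtime 99.99 percent allows per year, and what a single five-hour outage does to a year's availability.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service may be down and still hit the given uptime."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def availability_after_outage(outage_minutes: float) -> float:
    """Yearly availability (in percent) if this outage is the only downtime."""
    return 100 * (1 - outage_minutes / MINUTES_PER_YEAR)


# 99.99 percent uptime leaves room for under an hour of downtime per year.
print(round(allowed_downtime_minutes(99.99), 2))   # 52.56

# A single five-hour outage, with no other downtime that year.
print(round(availability_after_outage(5 * 60), 3))  # 99.943
```

In other words, a five-hour outage uses up several years' worth of a 99.99 percent budget in one afternoon, which is why planning for it matters more than being surprised by it.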

Could AWS have avoided such a blunder? Anyone can avoid a blunder in hindsight. Expecting someone not to make an honest mistake is expecting someone not to be human. AWS’s honesty with regard to how a simple typo caused the shutdown is in fact to its credit, though the truth can often be a hard pill to swallow. Speculating over whether a disaster could have been avoided is a recipe for madness and a road that ends nowhere. AWS will definitely put a number of measures in place to make sure that it doesn’t happen again, but that 0.01 percent could just as easily come from somewhere else.

Cloud economics

One thing that’s important to remember is the price of cloud servers compared to owning and maintaining your own physical servers. To put it bluntly, a lot of startups wouldn’t even exist in the first place if it weren’t for cloud computing. With platforms like AWS, Azure, and Google Cloud going out of their way to court startups, the cloud is almost like fairy-tale land for startups right now. Add the fact that most cloud providers offer a free tier that is good enough for almost 250,000 daily visits, and a five-hour downtime (that can be prepared for) is a small price to pay.

The reason the cloud is an ideal option for startups is that going into any new business means dealing with a lot of “unknown” factors. Setting up in the cloud gives you the elasticity to grow as required as well as the ability to handle demand spikes. The combination of these advantages really makes the cloud invaluable to startup developers who probably already have their hands full with their applications.

Disaster recovery in the cloud

A lot of companies, including AWS, offer some pretty good disaster recovery options. Considering 25 percent of companies don’t reopen after a disaster and about 43 percent don’t survive a catastrophic data loss, planning ahead is probably a good idea. AWS’s offering is called Pilot Light, named by analogy with a heater’s pilot flame that can ignite the main burner at any time: a dormant DR (disaster recovery) server that can kick in whenever needed and effectively save your business. In this multisite architecture, what AWS proposes is to build and update a DR site as a mirror of your production site, which can then be stored in 10 isolated AWS regions with multiple availability zones. Unlike traditional backup and restore for servers and storage, Pilot Light includes applications and the entire production environment via Amazon Machine Images (AMIs).

Now, as far as disaster recovery goes, the last thing you want to do is to put all your eggs in one basket again. Though AWS has pretty good backup and recovery options, if you’re already on AWS it would make sense to keep your backup with another provider. For pure disaster management, Azure’s offering seems like a pretty good bet. Azure Site Recovery (ASR) allows you to replicate and update your VMs in Azure so that they can be started at any time as a disaster recovery option. Microsoft really goes out of its way to make its offerings irresistible, and at $54 a month per instance, you can’t ask for more. Additionally, compute and storage costs are charged only when you actually run the VM, and since you’re only going to run it once every four years (hopefully), this is a really sweet option. Yet another cool feature from Microsoft here is the ability to test your disaster recovery plan without disrupting production.
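To make the "second basket" idea concrete, here is a minimal sketch of the decision a failover script or client has to make: try the primary site first, and fall back to a standby hosted elsewhere. It is an illustration only, not how ASR or Pilot Light work internally. The URLs and the names `http_health_check`, `pick_endpoint`, and `primary_is_down` are hypothetical, and a real setup would more likely rely on DNS failover or a load balancer.

```python
import urllib.request
from typing import Callable, Iterable, Optional


def http_health_check(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # URLError, HTTPError, and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def pick_endpoint(
    endpoints: Iterable[str],
    is_healthy: Callable[[str], bool] = http_health_check,
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None."""
    for url in endpoints:
        if is_healthy(url):
            return url
    return None


# Hypothetical endpoints: production on one provider, standby on another.
ENDPOINTS = ["https://app.primary.example", "https://app.standby.example"]


def primary_is_down(url: str) -> bool:
    """Stand-in health check that simulates an outage at the primary site."""
    return "primary" not in url


print(pick_endpoint(ENDPOINTS, is_healthy=primary_is_down))
# https://app.standby.example
```

The health check is passed in as a parameter so the failover logic can be rehearsed without waiting for a real outage, which is the same idea as testing your DR plan nondisruptively.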

Backup or bust

As for what to look for in a backup server, since you’re hoping never to use it, pay-as-you-go or per-usage pricing is the first thing you want. As far as maintenance goes, the less you have to do the better, so something like ASR or Pilot Light, where your backup server is automatically kept up to date with your production server, is great. As far as pricing goes, your backup server is really not the place to cut costs, so go with the best solution your money can buy, even though, if you’re really lucky, you may never have to use it at all.

When disaster actually strikes, things like the Recovery Time Objective (RTO, the acceptable amount of time within which an IT service must be restored) and the Recovery Point Objective (RPO, the acceptable amount of data loss, measured in time) are critical and must be confirmed before you choose your backup or DR-management service. The S3 outage, or the S3izure as it has become known, was all in all just a few hours of downtime from the company that owns close to 40 percent of the cloud market. While centralization does have its drawbacks, the benefits far outweigh them. This event was probably even good for business, as now more enterprises are going to sign up for a DR server or two.
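RTO and RPO are easier to reason about with timestamps in front of you. The sketch below is one way to check an incident against both objectives. The class and function names, the one-hour RTO, the 15-minute RPO, and the replication time are all assumptions made up for the example; only the midday-to-5 p.m. window comes from the outage described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # longest acceptable time to restore the service
    rpo: timedelta  # longest acceptable window of lost data


def check_incident(
    objectives: RecoveryObjectives,
    last_replication: datetime,
    outage_start: datetime,
    service_restored: datetime,
) -> dict:
    """Compare what actually happened in an incident with the objectives."""
    data_loss_window = outage_start - last_replication
    time_to_restore = service_restored - outage_start
    return {
        "rpo_met": data_loss_window <= objectives.rpo,
        "rto_met": time_to_restore <= objectives.rto,
    }


# Hypothetical targets: back online within 1 hour, lose at most 15 minutes of data.
targets = RecoveryObjectives(rto=timedelta(hours=1), rpo=timedelta(minutes=15))

result = check_incident(
    targets,
    last_replication=datetime(2017, 2, 28, 11, 50),  # assumed last sync
    outage_start=datetime(2017, 2, 28, 12, 0),       # about midday
    service_restored=datetime(2017, 2, 28, 17, 0),   # about 5 p.m.
)
print(result)  # {'rpo_met': True, 'rto_met': False}
```

With these made-up targets, waiting out the five-hour outage would have met the RPO but missed the RTO by four hours, which is exactly the gap a standby site is meant to close.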
