Almost perfect: Can anyone really deliver ‘five-nines’ availability?

Most enterprise-level data systems aim for the “five-nines,” meaning 99.999 percent of perceived uptime. Hitting that target limits unscheduled downtime to a little over five minutes a year, but most applications come nowhere close to the level of redundancy and protection it requires. In theory, some websites can be configured to approach 99.999 percent uptime, as long as the company is ready to establish and maintain the necessary high availability and disaster recovery systems.

Mystery surrounding the five-nines

The five-nines standard comes from the telecom industry, where systems and hardware are expected to provide “carrier grade” availability at all times. To achieve five-nines, a system may be down for a maximum of 5.26 minutes every year (or 5.27 in a leap year). However, no ruling body, committee, or standard formalizes the five-nines or what it entails, so this commonly used term means different things depending on perspective. Many providers exclude scheduled downtime from the calculation, which makes five-nines availability easier to claim.
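
The arithmetic behind those figures is easy to check. The short Python sketch below converts an availability target into a yearly downtime budget. The function name `downtime_budget_minutes` is made up for this illustration, and the calculation assumes downtime is measured against every minute of the calendar year.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability: float, days_in_year: int = 365) -> float:
    """Maximum minutes of downtime per year allowed by an availability target."""
    return (1.0 - availability) * days_in_year * MINUTES_PER_DAY


# Compare three, four, and five nines over a normal year.
for label, target in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    budget = downtime_budget_minutes(target)
    print(f"{label}: {budget:.2f} minutes per year")

# The same five-nines target over a leap year.
print(f"five nines (leap year): {downtime_budget_minutes(0.99999, 366):.2f} minutes")
```

Each extra nine cuts the allowance by a factor of ten, which is why the step from four nines to five is so much harder than it sounds.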

What are the components involved in the five-nines?

A full-fledged computing system or application consists of several pieces, including application software, networks, and hardware. Availability suffers if even one of them fails. The datacenter serves as the foundation for it all. Keep in mind that anything running in a single datacenter cannot make the five-nines. The workload must be divided between two datacenters in separate cities or countries, a setup that increases costs considerably.

The network is a key component that requires something extra to make the five-nines possible. What’s more, these are just the components supplied by the vendor. The network infrastructure adds more components, each of which increases the chances of failure. The same argument applies to every component a given application requires.

All this points to the need for resilient power: dual UPS, main power from separate substations, backup generators that cover all equipment, sufficient fuel supplies, and automatic cut-in. Onsite support needs to be available round the clock, with the appropriate tools and skills. Although this setup might fall short of the five-nines, it comes close. Note, too, that just 10 percent of organizations achieve this.

Even if a company engineers the facility to deliver 99.999 percent availability, that still raises the question of how availability gets measured within the service. Operating systems, performance management software, ICT equipment, diagnostics, and everything else used in service delivery all need to be covered. It is also important to calculate the availability of each component the service depends on.
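
One common way to do that calculation, assuming the components fail independently and the service needs all of them at once, is to multiply the individual availabilities together. The sketch below does this in Python. The component names and figures are invented for illustration and do not come from any real system.

```python
import math


def serial_availability(component_availabilities):
    """Availability of a service that needs every component to be up.

    Assumes component failures are independent of each other.
    """
    return math.prod(component_availabilities)


# Hypothetical figures: each component individually meets five-nines.
components = {
    "power": 0.99999,
    "network": 0.99999,
    "hardware": 0.99999,
    "operating system": 0.99999,
    "application": 0.99999,
}

overall = serial_availability(components.values())
print(f"Overall availability: {overall:.5%}")
```

Under these assumptions, five components that each meet five-nines combine to roughly 99.995 percent, which already falls short of the target for the service as a whole.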

There is a common misconception that replicating components cuts downtime in half. In reality, introducing more components also increases complexity and the possibility of failure. Replicated components must be kept in sync and must be switchable between two parallel configurations at any point of failure. When one configuration is active and the other passive, the switches detect a component failure in the active configuration and shift the load to the matching secondary components in the other configuration, which then assume the identity and role of the primary. This setup brings greater resilience but also more components, which nearly doubles the cost.
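
A rough way to see why replication is not a free lunch is to model an active-passive pair in which the failover itself can fail. The Python sketch below is a deliberately simplified model, not a description of any particular product. It assumes the two configurations fail independently, and `failover_success` is a hypothetical probability that the switch detects the failure and moves the load correctly.

```python
def active_passive_availability(component: float, failover_success: float) -> float:
    """Availability of an active-passive pair under a simplified model.

    The service is up when the active side is up, or when the active side
    is down, the failover works, and the passive side is up.
    Assumes both sides have the same availability and fail independently.
    """
    active_down = 1.0 - component
    return component + active_down * failover_success * component


single = 0.999  # hypothetical availability of one configuration
for success in (1.0, 0.99, 0.9):
    paired = active_passive_availability(single, success)
    print(f"failover success {success:.0%}: pair availability {paired:.6%}")
```

In this model a perfect failover lifts two three-nines configurations to about six nines, while a failover that works only 90 percent of the time gives roughly 99.99 percent. The benefit of the second configuration hinges on how reliable the switchover is.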

The five-nines may also be compromised by the loss of operating systems, data, databases, middleware, or application software. Protecting these elements also requires the two systems to be geographically separated, since otherwise a single disaster could affect both. As for security, an outage may just as well come from a security breach as from a software or equipment failure. Another point to consider is how long a fix takes once equipment fails.

Explaining unavailability

Companies dealing with service level or availability agreements will know how hard it is to define unavailability. For example, if a particular function takes longer than the stipulated time, the system is considered down, even though it may be working properly but slowly. Systems that work perfectly may still be considered unavailable by the end user. Think of a computing system that works fine apart from one unavailable function. Application complexity and training are murky areas too: if the end user cannot do something because of missing training or a complex user interface, the function is effectively unavailable.
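
To make the “slow counts as down” rule concrete, here is a minimal Python sketch that scores a batch of probe results against a response-time limit. The `Probe` type, the two-second limit, and the sample data are all assumptions made up for this example. A real service level agreement would define its own thresholds and measurement method.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    succeeded: bool         # did the function return a correct result?
    latency_seconds: float  # how long the response took


def measured_availability(probes, max_latency_seconds):
    """Fraction of probes that succeeded within the agreed response time."""
    if not probes:
        raise ValueError("no probes to evaluate")
    good = sum(
        1 for p in probes
        if p.succeeded and p.latency_seconds <= max_latency_seconds
    )
    return good / len(probes)


# Hypothetical sample: every probe succeeds, but one is too slow.
sample = [Probe(True, 0.4), Probe(True, 0.6), Probe(True, 3.5), Probe(True, 0.5)]
print(measured_availability(sample, max_latency_seconds=2.0))
```

Every probe in the sample succeeds, yet the measured availability is 75 percent because one response exceeded the limit.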

Human faults

A large share of network errors and downtime stems from human error. Ongoing training, monitoring, change control, and postmortems can reduce such issues but cannot eliminate or predict them altogether. What’s more, it is impossible to predict hacker attacks, severe weather, viruses, and terrorist activity. Targeting a maximum of five minutes of downtime each year limits acceptable outages to a small number of events that the monitoring systems fix automatically. When humans are needed to solve an issue, they cannot detect, identify, and rectify it within five minutes. Thus, whenever human involvement is a must, the five-nines plan is gone for good.
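
The numbers make the point bluntly. The Python sketch below compares one outage with the yearly five-nines allowance. The 30-minute incident is a hypothetical figure chosen for illustration, not a measured response time.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def years_of_budget_used(outage_minutes: float, availability: float = 0.99999) -> float:
    """How many years of downtime allowance a single outage consumes."""
    yearly_budget = (1.0 - availability) * MINUTES_PER_YEAR
    return outage_minutes / yearly_budget


# Hypothetical incident: 30 minutes to detect, diagnose, and fix by hand.
print(f"{years_of_budget_used(30):.1f} years of five-nines budget")
```

A single half-hour incident handled by people uses up more than five years of allowance.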

Numerous tools support many-to-one failover for applications, which keeps overall costs down. Many of these replication solutions can even form cluster setups without using the clustering technology supplied by the operating system itself. Thus, if corporate policy prevents clustering for any reason, both remote and local availability remain intact thanks to these software solutions.

Five-nines not an impossibility — but not likely, either

At the end of the day, it is all but impossible to guarantee the five-nines for a whole computing service. The margin for error is too small, and unexpected or Black Swan events cannot be eliminated. Service providers who advertise five-nines often rely on the fine print to make it an easier target.

Sure, five-nines is not an impossibility if the brainpower and budget needed to put together an effective solution are on hand. But is it worth it? After all, you need to work within your enterprise’s procedures and policies to form a solution that provides the necessary protection without breaking the bank or the rules.

About The Author

Rahul Sharma