Some of the web’s largest websites excel at high availability. One visit to updown.io shows excellent uptime stats for websites like Google, Facebook, and GitHub. They’ve reached this point by optimizing every part of their technology stack for high availability. They look to eliminate every single point of failure and to improve performance, scalability, maintainability, and security. Most other organizations, by contrast, find it punishing to lose revenue to downtime and to firefight every time there’s a spike in requests. Where should these organizations start? What should they focus on to improve uptime and bring downtime to almost zero? Let’s discuss.
Leverage the cloud
Major cloud vendors advertise very high availability, in some cases up to five nines (99.999 percent uptime), although the exact figure depends on the service and on choosing the right cloud for the right workload. That level of availability is hard to achieve when you manage your own infrastructure on-premises. Cloud vendors can replicate and back up their servers automatically, without you having to lift a finger. This is not only convenient but can also be more secure than ad hoc, manual backups.
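To make "five nines" concrete, here is a minimal sketch that converts an availability percentage into a yearly downtime budget. The function name and the choice of a 365-day year are assumptions made for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year for simplicity


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the yearly downtime budget, in seconds, for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Print the budget for the commonly quoted availability targets.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_seconds(target) / 60
        print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year, which is why it is so hard to reach with manually managed infrastructure.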
You can even spread your application backend across a cloud vendor’s regions and availability zones. This not only improves uptime but also helps with latency: the closer a server is to the user, the less time each request spends in transit.
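As a rough illustration of routing users to a nearby server, the sketch below picks the closest region by great-circle distance. The region names and coordinates are hypothetical, and distance is only a proxy for latency; a real setup would more likely rely on measured latency or the vendor’s own routing features.

```python
import math

# Hypothetical regions with approximate coordinates (latitude, longitude).
REGIONS = {
    "us-east-example": (40.71, -74.01),
    "eu-central-example": (50.11, 8.68),
    "ap-southeast-example": (1.35, 103.82),
}


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points on Earth, in kilometers."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_region(user_lat: float, user_lon: float) -> str:
    """Return the name of the region closest to the user's location."""
    return min(
        REGIONS,
        key=lambda name: haversine_km(user_lat, user_lon, *REGIONS[name]),
    )
```

For example, `nearest_region(48.86, 2.35)` (a user near Paris) selects the hypothetical European region rather than the American or Asian one.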
The cloud doesn’t totally erase downtime. Occasionally, cloud vendors themselves are the cause of downtime. Snapchat is a great example of a startup that owes much of its success to its reliance on Google Cloud, particularly App Engine. Still, Snap has taken the cautionary step of investing in AWS as a backup cloud over the next few years. Having a backup even in the cloud helps organizations fight downtime in extreme cases. And since cloud capacity is often cheaper than buying on-premises servers, you can enjoy this kind of large-scale backup without breaking the bank.
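The backup-cloud idea can be sketched as a simple failover wrapper that tries providers in order. The two provider functions are hypothetical stand-ins for real endpoints, and the primary is simulated as being down.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every configured provider fails to serve the request."""


def call_with_failover(providers: Sequence[Callable[[], str]]) -> str:
    """Try each provider in order and return the first successful response."""
    errors = []
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(exc)
    raise AllProvidersFailed(f"{len(errors)} provider(s) failed: {errors}")


def primary_cloud() -> str:
    # Hypothetical primary endpoint, simulated here as unavailable.
    raise ConnectionError("primary cloud unavailable")


def backup_cloud() -> str:
    # Hypothetical backup endpoint hosted with a second vendor.
    return "served by backup cloud"


if __name__ == "__main__":
    print(call_with_failover([primary_cloud, backup_cloud]))
```

A production version would likely add timeouts and health checks so that a slow primary does not stall every request before the backup is tried.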
Cloud data storage
The next step is to decide how and where to store your data. The cloud brings improved data recovery, cheaper data storage, faster data transfer, and scale-out elasticity. With a high inflow and a large accumulation of data, it’s important to identify the data storage options available and which of them best suit the kind of data your app requires.
Sometimes your data comes in multiple formats and needs to be stored in multiple locations. When architecting your application’s data, some key questions to ask are: How granular will your data storage be? How will you partition your data? Will each service have its own database? When building microservices applications, it is a best practice to give each service its own database. As with servers, it’s essential to have backups for databases and data-storage volumes. That way, when a database fails, it can be quickly replaced by a backup, and the resulting downtime can be kept to a minimum.
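Here is a small sketch of the database-per-service idea with a standby copy, using Python’s built-in `sqlite3` module and in-memory databases. The service names and schemas are hypothetical, and a real system would use a networked database with continuous replication rather than a one-off copy.

```python
import sqlite3


def make_service_db(schema: str) -> sqlite3.Connection:
    """Create a private in-memory database for one service."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema)
    return conn


# Each hypothetical service owns its own database and schema.
orders_db = make_service_db("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
users_db = make_service_db("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

orders_db.execute("INSERT INTO orders (item) VALUES (?)", ("keyboard",))
orders_db.commit()

# Copy the orders database into a standby using sqlite3's backup API.
orders_standby = sqlite3.connect(":memory:")
orders_db.backup(orders_standby)

# Simulate losing the primary, then serve reads from the standby.
orders_db.close()
rows = orders_standby.execute("SELECT item FROM orders").fetchall()
print(rows)
```

Because the users service never touches the orders database, losing the orders primary affects only that one service, and the standby limits how long even that service is unavailable.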
One recent example of a startup that drew in big funding because of its unique take on data is Snowflake. It separates the storage, processing, and consumption layers of data, enabling you to pick the best implementation for each layer.
Containers help with high availability
Applications that run in containers, rather than directly on servers or virtual machines, stand a better chance of weathering a sudden spike of requests. Kubernetes is the go-to platform today to manage containers at scale. It helps with achieving high availability by avoiding a single point of failure. Kubernetes does this by organizing containers into pods, scheduling pods onto nodes, and grouping nodes into clusters. A cluster acts as a meta-layer that manages how containers operate. It can run on a single host or across many servers, and for even more resilience you can run clusters on cloud servers from different cloud vendors. This kind of distributed management makes containers resilient to any single point of failure, which results in better uptime for the applications that run in them.
Kubernetes is architected to keep the desired number of pod replicas running, so that if a pod or node fails, a replacement is started automatically. As a container-centric management environment, it orchestrates computing, networking, and storage infrastructure to support dynamic workloads and enables portability across environments. Further, with the concept of container immutability, any container that’s vulnerable or faulty can be replaced whole, rather than patched in place and kept running.
In this way, containers and their management by Kubernetes help to enable high availability.
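The self-healing behavior described above can be illustrated with a toy reconciliation loop. This is a simplified model written for this article, not the Kubernetes API; the class and method names are hypothetical.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """A toy stand-in for a cluster; not the Kubernetes API."""

    desired_replicas: int
    pods: set = field(default_factory=set)
    _ids: itertools.count = field(default_factory=itertools.count, repr=False)

    def reconcile(self) -> None:
        """Start or stop pods until the running count matches the desired count."""
        while len(self.pods) < self.desired_replicas:
            self.pods.add(f"pod-{next(self._ids)}")
        while len(self.pods) > self.desired_replicas:
            self.pods.pop()

    def kill(self, pod: str) -> None:
        """Simulate a pod crashing."""
        self.pods.discard(pod)


cluster = ToyCluster(desired_replicas=3)
cluster.reconcile()                # bring up three pods
victim = next(iter(cluster.pods))
cluster.kill(victim)               # one pod fails
cluster.reconcile()                # a fresh pod replaces it
print(sorted(cluster.pods))
```

Note that the failed pod is not repaired: a new pod with a new name takes its place, which mirrors the immutability idea of replacing a faulty container whole.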
Handle the network as a service mesh
As communication between services becomes more complex, and especially during peak activity, the network can become the bottleneck. This is especially true of web-scale microservices applications. Service mesh tools like Istio and Linkerd help you handle network load at a massive scale.
A service mesh brings more visibility into application networking. It does this by separating the control plane from the data plane. The data plane is where requests are processed between the various network endpoints, and the control plane lets an admin manage the flow of those requests. Using a service mesh tool makes it easier to optimize network communication and improve the availability of an application.
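The split between control plane and data plane can be sketched with two small classes: one holds the routing policy, the other applies it to each request. This is a conceptual toy with hypothetical names, not how Istio or Linkerd are implemented.

```python
import random
from typing import Dict


class ControlPlane:
    """Holds routing policy; it never handles requests itself."""

    def __init__(self) -> None:
        self.weights: Dict[str, int] = {}

    def set_weights(self, weights: Dict[str, int]) -> None:
        """Replace the routing policy with new per-backend weights."""
        if any(w < 0 for w in weights.values()) or sum(weights.values()) <= 0:
            raise ValueError("weights must be non-negative with a positive total")
        self.weights = dict(weights)


class SidecarProxy:
    """Data plane: picks a backend for each request using the current policy."""

    def __init__(self, control_plane: ControlPlane, rng: random.Random) -> None:
        self.control_plane = control_plane
        self.rng = rng

    def route(self) -> str:
        """Choose a backend at random, in proportion to its configured weight."""
        weights = self.control_plane.weights
        if not weights:
            raise RuntimeError("no routing policy configured")
        backends = list(weights)
        return self.rng.choices(backends, weights=[weights[b] for b in backends], k=1)[0]


control = ControlPlane()
proxy = SidecarProxy(control, random.Random(0))
control.set_weights({"v1": 90, "v2": 10})   # send most traffic to v1
control.set_weights({"v1": 0, "v2": 100})   # v1 is unhealthy: shift everything to v2
print(proxy.route())
```

The point of the design is that an admin changes policy in one place, and every proxy picks it up without the application code being touched.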
Reduce data latency between frontend and backend
Data latency can plague enterprise apps that deal with large quantities of complex data in the backend. To reduce data latency, firms should adopt consistent patterns for integrating the frontend and backend. Enterprises could use a development platform like Progress Kinvey that unifies the frontend and backend of their applications. By using integration templates to connect the frontend with the backend, these development platforms bring consistency to the flow of data. They help speed up data transfer and reduce the latency caused by slow data loading. Enterprises with large quantities of data in their backend systems can benefit greatly from platforms that organize and access all backend data and make it available to frontend applications.
Employ chaos engineering
Netflix popularized the concept of “chaos engineering” with its Chaos Monkey and Simian Army tools, and now many other organizations are following suit. It involves regularly killing your own services or infrastructure to test the resilience of the software system. It may sound scary, but the idea is to intentionally harm your own systems in order to find bugs, inefficiencies, or vulnerabilities. Running measured attacks with a limited blast radius on your own systems reduces the extent of damage when an actual incident occurs. In simple words, it attempts to avoid failure by failing constantly. It allows DevOps teams to prepare and practice for outages, minimizing both the effects of downtime and the chances of it occurring. Solutions like Gremlin are built on this concept and make it easier to get started with and scale a chaos engineering practice.
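A measured attack with a limited blast radius can be sketched in a few lines. The instance names, the `blast_radius` fraction, and the `min_healthy` threshold are all hypothetical; tools like Chaos Monkey and Gremlin work against real infrastructure and are not modeled here.

```python
import random
from typing import List


def pick_victims(instances: List[str], blast_radius: float, rng: random.Random) -> List[str]:
    """Choose which instances to terminate, capped by the blast radius fraction."""
    if not 0 < blast_radius <= 1:
        raise ValueError("blast_radius must be in (0, 1]")
    # Always leave at least one instance running so the experiment stays measured.
    max_victims = min(int(len(instances) * blast_radius), len(instances) - 1)
    return rng.sample(instances, max(max_victims, 0))


def run_experiment(
    instances: List[str], blast_radius: float, min_healthy: int, rng: random.Random
) -> bool:
    """Terminate some instances and report whether enough of them survive."""
    victims = set(pick_victims(instances, blast_radius, rng))
    survivors = [name for name in instances if name not in victims]
    return len(survivors) >= min_healthy


fleet = [f"web-{i}" for i in range(10)]
rng = random.Random(42)
print(run_experiment(fleet, blast_radius=0.2, min_healthy=8, rng=rng))
```

Starting with a small blast radius and widening it only after the system passes keeps the experiment from turning into the very outage it is meant to prevent.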
High availability: There for the taking
At every layer of the application stack — servers, data, and networking — the idea of eliminating single points of failure and distributing risk is what builds high availability into enterprise applications. Despite how complex software delivery has become in today’s cloud-native age, high availability is there for the taking. It’s up to DevOps teams to seize the opportunity and put in place all the practices that ensure high availability for the applications they deploy.