In the rapidly growing digital world, businesses increasingly depend on mobile apps to operate and generate revenue. In this setting, downtime is one of the most common problems, and it affects even the largest players with ample resources and skills at their disposal. That’s why guaranteed uptime of “five-nines” (99.999 percent) is talked about so much: it means an app is inaccessible for less than one second per day. Cloud computing has set new benchmarks for highly available apps, with availability approaching 100 percent. Most cloud vendors offer a service-level agreement (SLA) that states the level of availability customers can expect. The giant cloud vendors like Amazon, Google, and Microsoft commonly set their availability SLA at around 99.9 percent or higher, depending on the service.
This is considered very reliable uptime, and 99.999 percent, or “five-nines,” is considered excellent. But even five-nines still allows roughly five minutes of downtime in a year, while 99.99 percent allows approximately 52 minutes and 99.9 percent allows nearly nine hours. With so many businesses running on apps, a considerable loss can occur in that time. Any business app being unavailable can carry significant monetary costs, resulting in lost sales and customer confidence.
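The downtime implied by an availability target is simple arithmetic: multiply the unavailable fraction by the length of the period. The sketch below assumes a 365-day year; the function name `allowed_downtime_seconds` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: non-leap year


def allowed_downtime_seconds(availability_percent: float, period_seconds: int) -> float:
    """Return the downtime budget, in seconds, for an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_seconds


# Compare three common targets: three, four, and five nines.
for target in (99.9, 99.99, 99.999):
    per_day_s = allowed_downtime_seconds(target, SECONDS_PER_DAY)
    per_year_min = allowed_downtime_seconds(target, SECONDS_PER_YEAR) / 60
    print(f"{target}%: {per_day_s:.2f} s/day, {per_year_min:.1f} min/year")
```

Under these assumptions, 99.9 percent works out to about 525.6 minutes (roughly 8.8 hours) per year, 99.99 percent to about 52.6 minutes, and 99.999 percent to about 5.3 minutes.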
Below are three key ways to minimize downtime and deliver highly available apps to your customers.
1. Implement a best-of-breed monitoring stack
Using the best available monitoring tool at each layer of the application stack (systems monitoring, application monitoring, web and user monitoring, logging, and error tracking) gives an accurate picture of system health. This best-of-breed approach is rapidly being adopted across the IT industry and is replacing monolithic monitoring architectures in complex, dynamic IT systems.
Advanced infrastructure monitoring systems allow organizations to proactively watch for potential problems, alert the team, and investigate the causes of previous downtime. The monitoring process involves aggregating and recording statistics such as system resource utilization and application performance metrics. Alert rules are evaluated continuously against current metrics to determine when it is appropriate to take action.
Monitoring is often implemented by having each host in the infrastructure gather metrics and report them to a central server. The central server sorts and aggregates the records in a time series database, which specializes in storing and searching timestamped numerical data, and it handles graphing, event search, and alerting. Prometheus and Graphite are two examples of such monitoring systems (Prometheus pulls, or scrapes, metrics from hosts rather than having them pushed).
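As a rough illustration of the flow just described, the sketch below uses a toy in-memory store in place of a real time series database and evaluates one threshold alert rule against it. `MetricStore`, `evaluate_alert`, the host name, and the metric name are all hypothetical; this is not the API of Prometheus or Graphite.

```python
from collections import defaultdict
from statistics import mean


class MetricStore:
    """Toy in-memory stand-in for a time series database (illustrative only)."""

    def __init__(self):
        # (host, metric) -> list of (timestamp, value) samples
        self._series = defaultdict(list)

    def record(self, host, metric, timestamp, value):
        """Store one timestamped sample reported by a host."""
        self._series[(host, metric)].append((timestamp, value))

    def window_average(self, host, metric, start, end):
        """Average the samples with start <= timestamp <= end, or None if empty."""
        samples = self._series.get((host, metric), [])
        values = [v for t, v in samples if start <= t <= end]
        return mean(values) if values else None


def evaluate_alert(store, host, metric, threshold, start, end):
    """Alert rule: fire when the window average exceeds the threshold."""
    average = store.window_average(host, metric, start, end)
    return average is not None and average > threshold


store = MetricStore()
for timestamp, cpu in enumerate([40, 55, 92, 95, 97]):
    store.record("web-1", "cpu_percent", timestamp, cpu)

# The last three samples average above 90, so the rule fires for that window.
print(evaluate_alert(store, "web-1", "cpu_percent", threshold=90, start=2, end=4))
```

A real system would evaluate such rules on a schedule and route firing alerts to the on-call team; the window average here is just one possible rule shape.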
Below are a few useful metrics to collect when attempting to increase reliability and reduce downtime:
1) Latency: How long it takes a server to respond to a request from a client. This mainly pertains to the server’s response time for an HTTP/HTTPS request.
2) Traffic: How many incoming requests the system is experiencing in a specific period of time. This could be the request rate for a web server, network I/O, logins per second, or transactions per second for a database.
3) Errors: The frequency or failure rate of incoming requests or outgoing responses. This is typically measured as the ratio of failed requests to total requests received. Note that not every error is as clear as an HTTP 500 error. For instance, some systems have a policy that clients should receive responses in 1,000 milliseconds (ms) or less. In such cases, any response that takes longer than 1,000ms counts as an error, even though the request was valid and was eventually answered.
4) Saturation: How “occupied” a service is. The occupancy of a service could be measured as the space available on a hard drive, network throughput, or the amount of CPU resource available on a CPU-bound service.
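The four metrics above can be sketched from a small batch of request records. This is a minimal example under stated assumptions: the `Request` record, the `summarize` and `saturation` helpers, and the nearest-rank 95th percentile are illustrative choices, and the 1,000ms limit mirrors the example policy in the errors item.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to respond
    status: int        # HTTP status code


def summarize(requests, window_seconds, latency_limit_ms=1000):
    """Compute latency, traffic, and error figures for one time window."""
    total = len(requests)
    if total == 0:
        return {"latency_p95_ms": None, "traffic_rps": 0.0, "error_rate": 0.0}
    latencies = sorted(r.latency_ms for r in requests)
    rank = math.ceil(0.95 * total)  # nearest-rank 95th percentile
    # A request is an error if the server failed OR the response was too slow.
    errors = sum(
        1 for r in requests if r.status >= 500 or r.latency_ms > latency_limit_ms
    )
    return {
        "latency_p95_ms": latencies[rank - 1],
        "traffic_rps": total / window_seconds,
        "error_rate": errors / total,
    }


def saturation(used, capacity):
    """Fraction of a resource in use, e.g. disk bytes or CPU time."""
    return used / capacity


window = [Request(120, 200), Request(1500, 200), Request(80, 500), Request(200, 200)]
print(summarize(window, window_seconds=4))
print(saturation(used=750, capacity=1000))
```

Counting slow responses as errors is a policy decision; a system with a different latency policy would pass a different `latency_limit_ms` or drop that condition.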
2. Eliminate single points of failure
The term “high availability” covers a set of principles for designing a robust infrastructure made up of redundant and resilient systems. One of the core foundations of high availability is eliminating single points of failure. A single point of failure is any component whose failure brings down the entire system, resulting in downtime. High availability is usually achieved by placing load balancers between the clients and the servers. To keep apps fast and efficient, workloads must be scaled and balanced across multiple servers, a practice known as load balancing. In simple terms, when one node becomes unavailable for any unforeseen reason, another node running in parallel continues to handle incoming requests and send responses within a set time frame.
The task of a load balancer is to detect failures and direct traffic to another server, ensuring that incoming requests and outgoing responses keep flowing in a timely fashion.
Such resilient systems must:
- Eliminate single points of failure: This usually means building an infrastructure in which data is spread across multiple data centers in various regions. This can be done either by running multiple redundant servers across the network or by running redundant containerized services within the same network.
- Direct the traffic seamlessly: When one server fails, traffic must be redirected to another server seamlessly, without any service interruption.
- Monitor health of the redundant systems: The system must be able to determine when a service is failing so that traffic can be rerouted accurately and without delay.
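The three requirements above can be sketched as a tiny round-robin load balancer that skips unhealthy backends. This is a simplified illustration, not how any particular load balancer is implemented: `LoadBalancer`, the backend names, and the dictionary-based health check are hypothetical, and a real health check would probe each server over the network.

```python
class LoadBalancer:
    """Round-robin selection over redundant backends, skipping unhealthy ones."""

    def __init__(self, backends, health_check):
        self._backends = list(backends)
        self._health_check = health_check  # callable: backend -> bool
        self._next = 0                     # index to try first on the next pick

    def pick(self):
        """Return the next healthy backend, or raise if none is healthy."""
        count = len(self._backends)
        for offset in range(count):
            index = (self._next + offset) % count
            backend = self._backends[index]
            if self._health_check(backend):
                self._next = (index + 1) % count
                return backend
        raise RuntimeError("no healthy backends available")


# Hypothetical health state; in practice this would come from periodic probes.
health = {"app-1": True, "app-2": True, "app-3": True}
balancer = LoadBalancer(health.keys(), lambda name: health[name])

print(balancer.pick(), balancer.pick())  # traffic is spread across backends
health["app-3"] = False                  # simulate a server failure
print(balancer.pick())                   # the failed server is skipped
```

Note that the load balancer itself must also be redundant in a real deployment; otherwise it becomes the single point of failure it was meant to remove.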
3. Use distributed architecture for applications
In a microservices architecture, each service is self-contained and implements a single business capability. The application is structured as a collection of small, autonomous services modeled around a business domain. That means that if one service fails or experiences downtime, the failure does not take down the entire application.
There are a few concepts to keep in mind when developing a large-scale distributed architecture:
- Service-level agreement: Before developing an extensive system, organizations should define what a “healthy” system means. The most common way to measure the health of a system is with service-level agreements. A few SLAs to consider are availability, accuracy, latency, and capacity.
- Horizontal vs. vertical scaling: As usage grows, an app may place more load on the system than its capacity allows. In such cases, more capacity needs to be added. The two most common strategies are horizontal and vertical scaling. Horizontal scaling means adding more machines (or nodes) to the system to increase capacity, and it is the most popular way to scale distributed systems. Vertical scaling is basically “buying a bigger/stronger machine”: a (virtual) machine with more cores, more processing power, or more memory. In distributed systems, vertical scaling is usually less popular because it can be more costly than scaling horizontally.
- Data durability: Durability refers to the ability of the data to remain in existence even if any of the nodes go offline, fail, or get corrupted. To increase data durability, the data is stored on multiple nodes, so if one goes offline or faces downtime, the data is still available on another node.
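To make the durability idea concrete, the sketch below writes every key to more than one node so that a read still succeeds after a node goes offline. It is a toy model under simplifying assumptions: `ReplicatedStore` and its placement rule are invented for illustration, all nodes are assumed to be online at write time, and real systems add consistency handling and re-replication.

```python
class ReplicatedStore:
    """Toy key-value store that keeps each key on several nodes."""

    def __init__(self, node_names, replication_factor=2):
        if replication_factor > len(node_names):
            raise ValueError("replication factor exceeds number of nodes")
        self._order = list(node_names)
        self._data = {name: {} for name in self._order}
        self._online = set(self._order)
        self._replication_factor = replication_factor

    def replicas_for(self, key):
        """Pick consecutive nodes for a key using a deterministic toy rule."""
        start = sum(key.encode()) % len(self._order)
        return [
            self._order[(start + i) % len(self._order)]
            for i in range(self._replication_factor)
        ]

    def put(self, key, value):
        # Assumption: every replica is online when the write happens.
        for name in self.replicas_for(key):
            self._data[name][key] = value

    def get(self, key):
        """Read from the first replica that is online and holds the key."""
        for name in self.replicas_for(key):
            if name in self._online and key in self._data[name]:
                return self._data[name][key]
        raise KeyError(key)

    def take_offline(self, name):
        self._online.discard(name)


store = ReplicatedStore(["node-a", "node-b", "node-c"], replication_factor=2)
store.put("order-42", "paid")
store.take_offline(store.replicas_for("order-42")[0])  # lose one replica
print(store.get("order-42"))  # still readable from the surviving replica
```

With a replication factor of two, the data survives the loss of any single node but not the loss of both replicas, which is why production systems often use three or more copies.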
High costs of not having highly available apps
System downtime is one of the biggest threats to any organization or business; according to research by IDC and AppDynamics, it costs around $100,000 per hour. Although it is impossible to prevent system failures completely, organizations that adopt these preventive measures can anticipate failures earlier and reduce the time needed to fix them.