In a complex cloud-native world, monitoring has become more important than ever. Cloud-native applications are increasingly distributed and are powered by a vast range of open-source tooling. They’re hosted in multiple cloud locations and on-prem datacenters. Monitoring has changed from a few vendor-provided tools to toolchains of best-of-breed tools, because no single monitoring tool can do it all. All these factors have made monitoring cloud-native applications very different from before, yet more powerful than ever. Let’s look at how we can go beyond traditional metrics and logs to gain deeper observability and visibility in the cloud.
Cloud visibility basics: Metrics and logs
Since the dawn of computing, metrics and logs have been an integral part of managing applications. They’re not going away anytime soon. If you rely on an open-source tool like Nagios, or a vendor tool like Splunk, you know the value of metrics and logs. They’re essential to assessing the health of a system from day to day and are invaluable when performing troubleshooting.
Metrics are numbers used to measure the performance of a system. Metrics exist at every level of the applications stack — infrastructure, networking, and application performance. The most common metrics measure the uptime of servers, the time taken to complete network requests, and latency in applications. They’re best viewed in neat visual charts and dashboards and are a great starting point for any monitoring purpose.
A log is a record of an event that has occurred in a system. Unlike a metric, which is a number, a log contains multiple dimensions, such as a timestamp or a text error message. While metrics tell you what’s happening in a system, logs enable you to dive deeper and figure out the where and when.
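To make the difference concrete, here is a minimal sketch of the two shapes of data. The class names, field names, and sample values are all made up for illustration and do not come from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Metric:
    """A single numeric measurement (hypothetical structure)."""
    name: str
    value: float


@dataclass
class LogRecord:
    """An event with several dimensions (hypothetical structure)."""
    timestamp: datetime
    service: str
    level: str
    message: str


# The metric says *what* is happening: the error rate is elevated.
error_rate = Metric(name="http_error_rate", value=0.07)

# The logs say *where* and *when*: which service failed, and at what time.
logs = [
    LogRecord(datetime(2020, 1, 6, 9, 14, 2, tzinfo=timezone.utc),
              "checkout", "ERROR", "payment gateway timeout"),
    LogRecord(datetime(2020, 1, 6, 9, 14, 5, tzinfo=timezone.utc),
              "catalog", "INFO", "cache refreshed"),
]

errors = [r for r in logs if r.level == "ERROR"]
for r in errors:
    print(f"{r.timestamp.isoformat()} {r.service}: {r.message}")
```

The metric alone would only tell you that something is wrong. Filtering the log records is what points to the checkout service and the moment of failure.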
Modern log analysis has changed drastically to keep up with the increasing complexity in applications. The volume of logs being generated has skyrocketed. More logs don’t automatically mean more insight. In fact, log paralysis is an unfortunate side effect of application modernization. This is why modern logging tools focus on surfacing the signal from the noise. Since most log data is mundane and repetitive, the best tools separate themselves from the pack by being able to identify the needle in the haystack.
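One simple way to picture "surfacing the signal" is to collapse log lines into templates and flag the ones that rarely occur. The sketch below is a toy illustration of that idea, not how any specific logging product works. The function names and sample lines are hypothetical, and real tools use far more sophisticated pattern detection.

```python
import re
from collections import Counter


def template(line: str) -> str:
    """Collapse variable parts (here, just digits) so similar lines match."""
    return re.sub(r"\d+", "<n>", line)


def rare_lines(lines, max_count=1):
    """Return lines whose template occurs at most max_count times."""
    counts = Counter(template(line) for line in lines)
    return [line for line in lines if counts[template(line)] <= max_count]


sample = [
    "GET /health 200 in 3ms",
    "GET /health 200 in 4ms",
    "GET /health 200 in 2ms",
    "disk /dev/sda1 at 97% capacity",
    "GET /health 200 in 5ms",
]

print(rare_lines(sample))
```

The four health-check lines share one template and are treated as noise, so only the disk warning is returned.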
The need to go deeper with monitoring for cloud visibility
While metrics start with “what” and logs go deeper into “when” and “where,” the holy grail of cloud visibility is to uncover the “why,” that is, the root cause of an issue. There are nooks and crannies that plain vanilla metrics and logs can’t reach. Let’s look at some of the ways you can go deeper than metrics and logs when monitoring cloud-native applications.
Hybrid cloud and multicloud
Rather than a single datacenter, organizations today run applications across a datacenter and multiple cloud platforms. These hybrid cloud (datacenter + cloud) and multicloud (more than one cloud platform) setups bring complexity and make it more difficult to monitor applications, networking, and infrastructure.
Rise of the service mesh
Developing cloud-native applications involves embracing Kubernetes and its ecosystem of tools. This includes managing networking as a service mesh. Rather than simple point-to-point communication, a service mesh calls for many-to-many communication. This separates concerns within the network and gives services more autonomy. However, the communication patterns are complex and harder to monitor.
Solutions for improved cloud monitoring and visibility
The complexity of infrastructure, networking, and the application stack has led to new approaches to monitoring.
Time-series metrics with Prometheus
Traditional monitoring tools often report metrics at fixed, relatively coarse intervals. These metrics don’t provide the detail and flexibility required when monitoring cloud-native applications. Prometheus is an open-source time-series monitoring tool that scrapes metrics at a configurable interval and stores each sample with a millisecond-precision timestamp. The metrics are recorded as a continuous stream of timestamped data.
Prometheus pulls metrics from service endpoints rather than relying on an agent to push metrics. Each metric carries labels, which are key-value pairs such as a service name or status code, so that any metric can be viewed in more than one dimension. These advantages have made Prometheus one of the most widely used monitoring tools for Kubernetes.
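As a rough illustration of labels, the sketch below renders a few samples in the style of the Prometheus text exposition format and then sums them along one label. It uses only the Python standard library. The metric name, label names, and values are invented, and the code skips details such as escaping label values, so in practice you would reach for an official client library instead.

```python
def render_sample(name, labels, value):
    """Format one sample in the style of the Prometheus text format."""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


def total_by(samples, label):
    """Sum sample values grouped by the value of one label."""
    totals = {}
    for _name, labels, value in samples:
        key = labels.get(label, "")
        totals[key] = totals.get(key, 0) + value
    return totals


# Hypothetical samples: (metric name, labels, value).
samples = [
    ("http_requests_total", {"service": "checkout", "status": "200"}, 1027),
    ("http_requests_total", {"service": "checkout", "status": "500"}, 3),
    ("http_requests_total", {"service": "catalog", "status": "200"}, 5310),
]

# What a scrape of the endpoint might return, one line per sample.
print("\n".join(render_sample(*s) for s in samples))

# The same data viewed along two different dimensions.
print(total_by(samples, "service"))
print(total_by(samples, "status"))
```

The same three samples answer two different questions, requests per service and requests per status code, which is the "more than one dimension" that labels make possible.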
Switch to service mesh
Unlike with simple client-server applications, network communication for distributed microservices applications is complex. Rather than one-to-one communication, network requests are now many-to-many. This brings a lot of complexity to networking operations and monitoring. The solution is to use a modern service mesh solution like Istio, which acts as an abstraction layer above the TCP/IP layer of requests and gives you more fine-grained control over the requests.
A service mesh solution includes built-in logic for things like retries, timeouts, load balancing, and service discovery. Not only does this improve request management, but it also brings deeper visibility into each request. Service mesh tools also include an open plugin architecture to send network metrics and traces to external tools for robust monitoring. This is a sign of a mature ecosystem, where responsibilities are divided: one tool manages the requests, and another monitors them.
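To show the kind of logic a mesh takes off your hands, here is a small sketch of retries around a call that can time out, with a record kept of every attempt. It is not Istio's API or configuration. In a mesh this behavior lives in the proxy layer rather than in application code, and every name below (`UpstreamTimeout`, `call_with_retries`, `make_flaky`) is hypothetical.

```python
import time


class UpstreamTimeout(Exception):
    """Raised by the hypothetical upstream call when it exceeds its deadline."""


def call_with_retries(send, attempts=3, backoff_seconds=0.0):
    """Call send() up to `attempts` times, recording what happened each time."""
    history = []
    for attempt in range(1, attempts + 1):
        started = time.monotonic()
        try:
            result = send()
            history.append((attempt, "ok", time.monotonic() - started))
            return result, history
        except UpstreamTimeout:
            history.append((attempt, "timeout", time.monotonic() - started))
            if attempt < attempts:
                # Wait a little longer before each further attempt.
                time.sleep(backoff_seconds * attempt)
    raise UpstreamTimeout(f"gave up after {attempts} attempts")


def make_flaky(failures):
    """Return a fake upstream that times out `failures` times, then succeeds."""
    remaining = {"n": failures}

    def send():
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise UpstreamTimeout()
        return "200 OK"

    return send


result, history = call_with_retries(make_flaky(2))
print(result, [(attempt, outcome) for attempt, outcome, _ in history])
```

The per-attempt history is the interesting part for monitoring: because the retry logic sits in one place, every attempt can be observed and exported, not just the final outcome.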
Aggregation and routing of monitoring data
With the plethora of cloud locations and components to be monitored, monitoring data flows like traffic on a highway, and this traffic needs to be directed appropriately. This has led to the rise of aggregation and routing tools like Fluentd and Logstash. These tools can collect log data from multiple sources and route it to multiple destinations as required. A destination could be a dedicated logging tool, a monitoring tool, or a proprietary internal tool within an enterprise.
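The routing idea can be sketched in a few lines: records arrive with a tag, and each one is copied to every destination whose pattern matches that tag. This is only a toy model. It does not reproduce the configuration syntax or matching rules of Fluentd or Logstash, and the `Router` class, tags, and destination names are hypothetical.

```python
from fnmatch import fnmatchcase


class Router:
    """Route tagged records to every destination whose pattern matches."""

    def __init__(self):
        self.routes = []      # list of (pattern, destination name)
        self.delivered = {}   # destination name -> list of records

    def add_route(self, pattern, destination):
        self.routes.append((pattern, destination))
        self.delivered.setdefault(destination, [])

    def emit(self, tag, record):
        for pattern, destination in self.routes:
            if fnmatchcase(tag, pattern):
                self.delivered[destination].append({"tag": tag, **record})


router = Router()
router.add_route("app.*", "search_index")   # all application logs
router.add_route("*.error", "alerting")     # errors from any source
router.add_route("*", "archive")            # everything

router.emit("app.error", {"message": "payment gateway timeout"})
router.emit("infra.info", {"message": "node joined cluster"})
router.emit("app.info", {"message": "order placed"})

for destination, records in router.delivered.items():
    print(destination, len(records))
```

Here a single application error ends up in three places at once (search, alerting, and archive), while routine messages reach only the destinations that need them.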
Such middleware tools add layers of complexity to the overall architecture, but they are required when monitoring an equally layered application stack. Of course, the idea is not to go overboard with tools, but to use them with reason. It is a tradeoff between complexity and control. Despite the increasing complexity, organizations need control and visibility over cloud-native applications running in production.
Distributed tracing of network requests
Metrics and logs have been the staple of monitoring operations thus far. However, with cloud-native computing, the network has gained prominence in the form of a service mesh. This requires deeper visibility than metrics and logs can provide. Enter distributed tracing. A trace is a record of the path a single request takes through the system, along with how long each step takes. A single trace is made up of multiple timed “spans,” each representing one operation, such as a call to a particular service. As you follow each span, you can track the exact journey of each request as it travels across the network. You can see which services it touches, where latency occurs, and where timeouts are triggered. This is like using a magnifying glass to look deeper into any request.
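A trace and its spans can be modeled with a very small data structure. The sketch below is illustrative only and does not use any tracing library. The `Span` fields, service names, and timings are invented, and the self-time calculation assumes that a span's direct children run one after another without overlapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One timed operation within a trace (hypothetical structure)."""
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_time_ms(span, spans):
    """Time spent in the span itself, excluding its direct children."""
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


# One request: gateway -> checkout -> (payments, then inventory).
trace = [
    Span("a", None, "gateway", 0, 120),
    Span("b", "a", "checkout", 10, 110),
    Span("c", "b", "payments", 20, 95),
    Span("d", "b", "inventory", 96, 105),
]

for span in trace:
    print(span.service, span.duration_ms, self_time_ms(span, trace))

slowest = max(trace, key=lambda s: self_time_ms(s, trace))
print("most time spent in:", slowest.service)
```

The gateway span is the longest overall, but most of that time is spent waiting on its children. Subtracting child durations shows that the payments service is where the latency actually occurs.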
Jaeger is a leading open-source distributed tracing tool in the Kubernetes ecosystem. It builds on the model originally developed with Zipkin. It supports an open standard called OpenTelemetry, a CNCF project that is set to enjoy industry-wide adoption.
Cloud visibility: Yes, you need more than metrics and logs
If you venture into the world of cloud-native computing, you need to explore monitoring solutions beyond metrics and logs. Distributed applications bring new challenges to monitoring, but fortunately, there are multiple options available for monitoring these applications while still enjoying the control and visibility that are required. While each of these solutions, such as Prometheus, Istio, Fluentd, and Jaeger, is strong at what it does, they’re most powerful when used together. They enable cloud visibility that is real-time, high resolution, actionable, and production-ready.