Cloud visibility: Go beyond metrics and logs to see what’s going on

In a complex cloud-native world, monitoring has become more important than ever. Cloud-native applications are increasingly distributed and are powered by a vast range of open-source tooling. They’re hosted across multiple cloud locations and on-prem datacenters. Monitoring has changed from a few vendor-provided tools to toolchains built from best-of-breed tools, because no single monitoring tool can do it all. All these factors have made monitoring cloud-native applications very different from what it used to be, yet more powerful than ever before. Let’s look at how we can go beyond traditional metrics and logs to gain deeper observability and visibility in the cloud.

Cloud visibility basics: Metrics and logs

Since the dawn of computing, metrics and logs have been an integral part of managing applications. They’re not going away anytime soon. Whether you rely on an open-source tool like Nagios or a vendor tool like Splunk, you know the value of metrics and logs. They’re essential to assessing the health of a system from day to day and are invaluable when troubleshooting.

Metrics

Metrics are numbers used to measure the performance of a system. Metrics exist at every level of the applications stack — infrastructure, networking, and application performance. The most common metrics measure the uptime of servers, the time taken to complete network requests, and latency in applications. They’re best viewed in neat visual charts and dashboards and are a great starting point for any monitoring purpose.

Logs

A log is a record of an event that has occurred in a system. Unlike a metric, which is a number, a log contains multiple dimensions, such as a timestamp or a text error message. While metrics tell you what’s happening in a system, logs enable you to dive deeper and figure out the where and when.

Modern log analysis has changed drastically to keep up with the increasing complexity of applications. The volume of logs being generated has skyrocketed, but more logs don’t automatically mean more insight. In fact, log paralysis is an unfortunate side effect of application modernization. This is why modern logging tools focus on separating the signal from the noise. Since most log data is mundane and repetitive, the best tools set themselves apart by finding the needle in the haystack.
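
To make the idea of separating signal from noise concrete, here is a minimal Python sketch. It assumes a very simple approach: collapse the variable parts of each log line (numbers and hex IDs) into a template, count how often each template appears, and surface only the lines whose template is rare. The function names `template_of` and `rare_lines` and the sample log lines are made up for illustration; real logging tools use far more sophisticated techniques.

```python
import re
from collections import Counter


def template_of(line: str) -> str:
    """Collapse variable parts so that similar lines share one template."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)  # hex identifiers
    line = re.sub(r"\d+", "<NUM>", line)             # any remaining numbers
    return line


def rare_lines(lines, max_count=1):
    """Return the lines whose template occurs at most max_count times."""
    counts = Counter(template_of(line) for line in lines)
    return [line for line in lines if counts[template_of(line)] <= max_count]


# Hypothetical log lines: four routine health checks and one real problem.
logs = [
    "GET /health 200 in 3ms",
    "GET /health 200 in 4ms",
    "GET /health 200 in 2ms",
    "payment failed for order 8812: card declined",
    "GET /health 200 in 5ms",
]

print(rare_lines(logs))
```

The four health-check lines collapse into one frequent template and are filtered out, leaving the payment failure as the only line worth a human’s attention.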

The need to go deeper with monitoring for cloud visibility

While metrics start with “what” and logs go deeper into “when” and “where,” the holy grail of cloud visibility is to uncover the “why,” the root cause of an issue. There are spots, nooks, and crannies that plain vanilla metrics and logs can’t reach. Let’s look at some of the ways you can go deeper than metrics and logs when monitoring cloud-native applications.

Hybrid cloud and multicloud

Rather than relying on a single datacenter, organizations today run applications across a datacenter and multiple cloud platforms. These hybrid cloud (datacenter + cloud) and multicloud (more than one cloud platform) setups bring complexity and make it more difficult to monitor applications, networking, and infrastructure.

Rise of the service mesh

Developing cloud-native applications involves embracing Kubernetes and its ecosystem of tools. This includes managing networking as a service mesh. Rather than simple point-to-point communication, a service mesh calls for many-to-many communication. This separates concerns within the network and gives services more autonomy. However, the communication patterns are complex and harder to monitor.

Solutions for improved cloud monitoring and visibility

The complexity of infrastructure, networking, and the application stack has led to new approaches to monitoring.

Time-series metrics with Prometheus

Traditional monitoring tools typically report metrics at fixed per-minute or per-second intervals, which doesn’t provide the detail and flexibility needed when monitoring cloud-native applications. Prometheus is an open-source time-series monitoring tool. It records every metric as a continuous stream of samples, each stored with a millisecond-precision timestamp, at a scrape interval you configure.

Prometheus pulls (scrapes) metrics from service endpoints over HTTP rather than relying on an agent to push metrics. Each metric can carry labels, which are key-value pairs, so that any metric can be viewed along more than one dimension. These advantages have made Prometheus one of the most widely used monitoring tools for Kubernetes.
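
To show what labeled metrics look like on the wire, here is a rough Python sketch that renders samples in the Prometheus text exposition format, which is the plain-text payload a service exposes for Prometheus to scrape. The helper names `render_sample` and `render_metrics`, the metric `http_requests_total`, and its label values are assumptions chosen for illustration. In a real service you would normally let an official client library produce this output instead of formatting it by hand.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: dict, value) -> str:
    """Render one sample as: name{label="value",...} value"""
    if not labels:
        return f"{name} {value}"
    pairs = ",".join(
        f'{key}="{escape_label_value(str(val))}"'
        for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}"


def render_metrics(name: str, metric_type: str, series) -> str:
    """Render a TYPE line followed by one line per labeled series."""
    lines = [f"# TYPE {name} {metric_type}"]
    lines += [render_sample(name, labels, value) for labels, value in series]
    return "\n".join(lines) + "\n"


# Hypothetical request counter, split by HTTP method and status code.
series = [
    ({"method": "GET", "status": "200"}, 1027),
    ({"method": "POST", "status": "500"}, 3),
]

print(render_metrics("http_requests_total", "counter", series), end="")
```

Because every sample carries its labels, the same counter can be sliced by method, by status code, or by both when you query it later.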

Switch to service mesh

Unlike with simple client-server applications, network communication for distributed microservices applications is complex. Rather than one-to-one communication, network requests are now many-to-many. This brings a lot of complexity to networking operations and monitoring. The solution is to use a modern service mesh solution like Istio, which acts as an abstraction layer above the TCP/IP layer of requests and gives you more fine-grained control over the requests.

A service mesh solution includes built-in logic for things like retries, timeouts, load balancing, and service discovery. Not only does this improve request management, but it also brings deeper visibility into each request. Service mesh tools also include an open plugin architecture to send network metrics and traces to external tools for robust monitoring. This is a sign of a mature ecosystem, where responsibilities are divided: one tool manages the requests, and another monitors them.
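
As a rough illustration of the retry and timeout logic a mesh takes off your hands, here is a minimal Python sketch. It is not how Istio or any other mesh is implemented. The function `call_with_retries`, its parameters, and the `FlakyService` stand-in are hypothetical names, and the sketch assumes the callable being wrapped raises `TimeoutError` itself when a per-try timeout expires.

```python
import time


def call_with_retries(send, attempts=3, per_try_timeout=0.5, backoff=0.1,
                      sleep=time.sleep):
    """Call send(timeout) up to `attempts` times.

    Retries on timeouts and connection errors with exponential backoff.
    Returns (response, attempts_used) so the caller can see how many tries
    the request needed.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return send(per_try_timeout), attempt
        except (TimeoutError, ConnectionError) as err:
            last_error = err
            if attempt < attempts:
                sleep(backoff * 2 ** (attempt - 1))  # 0.1s, 0.2s, 0.4s, ...
    raise last_error


class FlakyService:
    """Hypothetical upstream that times out a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self, timeout):
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError(f"no reply within {timeout}s")
        return "200 OK"


service = FlakyService(failures=2)
# A no-op sleep keeps the demo instant; a real caller would use time.sleep.
response, attempts_used = call_with_retries(service, sleep=lambda seconds: None)
print(response, attempts_used)
```

Returning the attempt count alongside the response hints at why a mesh improves visibility: the layer that handles retries is also the natural place to record them.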

Aggregation and routing of monitoring data

With the plethora of cloud locations and components to be monitored, monitoring data flows like traffic on a highway, and this traffic needs to be directed appropriately. This has led to the rise of aggregation and routing tools like Fluentd and Logstash. These tools can collect log data from multiple sources and route it to multiple destinations as required. A destination could be a dedicated logging tool, a monitoring tool, or a proprietary internal tool within an enterprise.
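
The routing idea can be sketched in a few lines of Python. This is a simplified illustration, not Fluentd or Logstash configuration or behavior. It assumes each event carries a dotted tag, that routes are glob patterns over tags, and that an event is delivered to every route it matches. The `Router` class, the tags, and the two in-memory destinations are all made up for the example.

```python
from fnmatch import fnmatchcase


class Router:
    """Deliver each tagged event to every destination whose pattern matches."""

    def __init__(self):
        self.routes = []  # list of (tag_pattern, destination_callable)

    def add_route(self, pattern, destination):
        self.routes.append((pattern, destination))

    def emit(self, tag, record):
        """Send the record to all matching destinations; return how many matched."""
        delivered = 0
        for pattern, destination in self.routes:
            if fnmatchcase(tag, pattern):
                destination(tag, record)
                delivered += 1
        return delivered


# Two hypothetical destinations, modeled as plain lists.
search_index, archive = [], []

router = Router()
router.add_route("app.*", lambda tag, rec: search_index.append((tag, rec)))
router.add_route("*", lambda tag, rec: archive.append((tag, rec)))

router.emit("app.payments", {"level": "error", "msg": "card declined"})
router.emit("infra.node", {"level": "info", "msg": "disk at 40%"})

print(len(search_index), len(archive))
```

Application events land in both the search index and the archive, while infrastructure events go only to the archive.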

Such middleware tools add layers of complexity to the overall architecture, but they are required when monitoring an equally layered application stack. Of course, the idea is not to go overboard with tools, but to use them with reason. It is a tradeoff between complexity and control. Despite the increasing complexity, organizations need control and visibility over cloud-native applications running in production.

Distributed tracing of network requests

Metrics and logs have been the staple of monitoring operations thus far. However, with cloud-native computing, the network has gained prominence in the form of a service mesh. This requires deeper visibility than metrics and logs can provide. Enter distributed tracing. A trace is a record of the path a single request takes through the system, along with how long each step takes. A single trace is made up of multiple timed “spans,” each representing one operation in one service. By following the spans, you can reconstruct the exact journey of each request as it travels across the network. You can see which services it touches, where latency occurs, and where timeouts are triggered. This is like using a magnifying glass to look deeper into any request.
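
Here is a minimal Python sketch of how spans can pinpoint where latency occurs. It does not use Jaeger or any tracing library. The `Span` class, its field names, the service names, and the timings are assumptions for illustration, and the self-time calculation assumes a span’s child spans run one after another without overlapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_time_ms(span, spans):
    """Time spent in the span itself, excluding its direct children.

    Assumes the children do not overlap in time.
    """
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(child.duration_ms for child in children)


# One hypothetical trace: a checkout request fanning out to three services.
trace = [
    Span("a", None, "frontend", 0, 120),
    Span("b", "a", "cart", 10, 40),
    Span("c", "a", "payment", 45, 115),
    Span("d", "c", "bank-gateway", 50, 110),
]

slowest = max(trace, key=lambda span: self_time_ms(span, trace))
print(slowest.service, self_time_ms(slowest, trace))
```

The root span shows that the request took 120 ms overall, but the per-span breakdown shows that most of that time was spent in the bank gateway call. A single latency metric could not tell you that.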

Jaeger is a leading open-source distributed tracing tool in the Kubernetes ecosystem. It builds on the model originally developed with Zipkin. It supports OpenTelemetry, an open standard hosted by the CNCF that is set to enjoy industry-wide adoption.

Cloud visibility: Yes, you need more than metrics and logs

If you venture into the world of cloud-native computing, you need to explore monitoring solutions beyond metrics and logs. Distributed applications bring new challenges to monitoring, but fortunately, there are multiple options for monitoring these applications while keeping the control and visibility you need. While each solution, such as Prometheus, Istio, Fluentd, or Jaeger, is strong at what it does, they’re most powerful when used together. They enable cloud visibility that is real-time, high resolution, actionable, and production-ready.

Twain Taylor

My interests lie in DevOps, IoT, and cloud applications. I began my career in tech B2B marketing at Google India, after which I headed marketing for multiple startups. Today, I consult with companies in The Valley on their content marketing initiatives, and write for tech journals.
