With the success of Docker containers and the widespread adoption of microservice architecture, the demand for orchestration and monitoring led to the creation of the Cloud Native Computing Foundation (CNCF), whose first two member projects address the orchestration and monitoring/alerting needs respectively. Everyone knows about Kubernetes, the open source project based on Google's "Borg" that quickly became the enterprise's favorite orchestration tool. That success is without a doubt owed to the time, effort, and money Google had already invested in making containers work, long before anyone had even heard of Docker.
As with any new technology, it's the enterprises with the deepest pockets that get to leverage it to their advantage the quickest, as the success stories of Google, Facebook, Netflix, and Twitter attest, just to name a few. Twitter and Apple had the resources to bend Apache Mesos into a great cluster management tool, and Google developed its own orchestration tool called Borg, but what about everyone else? At the time, Docker shipped with only a standard client CLI that let you manage containers on a single local system.
The domino effect
With containers as popular as they are, it soon became obvious that existing orchestration tools just weren't going to cut it, and Kubernetes was the answer. When a core piece of technology like the very unit of computing changes, it creates a domino effect that carries all the way to the top of the stack. Orchestration was just the beginning; the enterprise quickly realized that traditional monitoring and alerting wasn't good enough to effectively watch a system built on microservice architecture at large scale. The challenge is volume and churn: since one host can run many containers, there are far more targets to monitor, and they are constantly changing. Traditional monitoring also doesn't let you drill down into the layers to find out what's going on when troubleshooting.
Borgmon for a Borg
Strangely enough, the answer again came from a tool with origins at Google, though here the ties run a bit deeper. The original "Borg" apparently had a companion called "Borgmon." Ex-Google engineers Matt Proud and Julius Volz were "missing" their old infrastructure and built Prometheus based on what they knew of Borgmon. Frustration with existing tools was the reason Google developed Borgmon in the first place, and the same frustration drove Proud and Volz: the tools of the day neither kept time-series data in a multi-dimensional format nor came with an easy-to-use query language akin to SQL.
Prometheus was built to be a full monitoring and trending system, with built-in active scraping, storage, querying, graphing, and alerting based on time-series data. It also carries knowledge of what the ideal environment should look like: which endpoints should exist and which time-series patterns mean trouble. Another distinctive feature is the ability to actively search for faults.
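To make the "multi-dimensional format" concrete, here is a toy sketch (not Prometheus internals) of a store that keys each time series by a metric name plus a set of labels, and a selector that filters by label matchers the way a PromQL selector like `http_requests_total{status="500"}` does. All metric and label names are invented for illustration.

```python
import time

class TinyTSDB:
    """A toy in-memory store: one series per (metric name, label set)."""

    def __init__(self):
        # (name, sorted label pairs) -> list of (timestamp, value) samples
        self.series = {}

    def append(self, name, labels, value, ts=None):
        key = (name, tuple(sorted(labels.items())))
        self.series.setdefault(key, []).append((ts or time.time(), value))

    def select(self, name, **matchers):
        """Return every series whose name and labels match, PromQL-style:
        select("http_requests_total", status="500")."""
        out = {}
        for (n, labels), samples in self.series.items():
            if n != name:
                continue
            labeldict = dict(labels)
            if all(labeldict.get(k) == v for k, v in matchers.items()):
                out[(n, labels)] = samples
        return out

db = TinyTSDB()
db.append("http_requests_total", {"method": "get", "status": "200"}, 10)
db.append("http_requests_total", {"method": "get", "status": "500"}, 3)
print(db.select("http_requests_total", status="500"))
```

The point of the label dimensions is exactly this kind of slicing: the same metric name can be filtered or aggregated along any label without pre-defining a hierarchy.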
Unless Facebook decides to open source Claspin, its much-raved-about custom monitoring tool (highly unlikely), Prometheus is pretty much the best new-age monitoring tool the enterprise has for dealing with modern microservice architecture. In terms of handling the sheer numbers containers produce, Prometheus actively pulls metrics from running applications over HTTP at scale: a single node with no external dependencies can query thousands of targets, handle millions of time series, and ingest 800k samples per second, while remaining easy to scale out.
Push vs. pull
A tool like Prometheus that pulls metrics over HTTP comes with quite a few advantages; for one, you can run your monitoring on your laptop while you develop changes. It also makes it easier to tell when a target is down, with the added ability to manually inspect its health from a web browser. A lot of people hold the misconception that "pull doesn't scale," which is probably largely due to comparisons with the other well-known pull-based tool, Nagios, which indeed has some scaling problems.
However, Prometheus takes a fundamentally different approach from Nagios or any other pull-based tool out there right now. Instead of executing check scripts, it collects time-series data only from a set of instrumented targets over the network: for each target, it simply pulls the current metric values over HTTP. That means Prometheus carries none of the execution overhead that check scripts impose on other pull-based systems.
Fire from the gods
While the seven-spoked Kubernetes logo is said to be an acknowledgement of the project's origins at Google (where it was nicknamed "Seven"), Prometheus refers to the fire of the gods, and its developers were probably thinking of bringing tools from the giant Google (the gods) down to mere mortals. The fact that Prometheus actively pulls metrics from your services rather than waiting for them to push fits well with a dynamic cloud native environment like Kubernetes. As you scale up a service, Prometheus automatically starts pulling metrics from the new replicas as well. Similarly, as nodes fail and pods are restarted elsewhere, Prometheus automatically discovers and scrapes them too.
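That automatic discovery is configured rather than coded: Prometheus's Kubernetes service discovery watches the API server for pods and feeds them in as scrape targets. A minimal sketch of such a scrape configuration, using the common (but conventional, not mandatory) `prometheus.io/scrape` pod annotation to opt pods in, might look like this:

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod          # discover every pod via the Kubernetes API
    relabel_configs:
      # Only keep pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
```

With a config like this, new replicas appear as targets as soon as the scheduler places them, and dead pods drop out on their own; no monitoring config change is needed when the cluster reshapes itself.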
There's probably nothing better than the gods themselves sitting up and taking notice, and according to Volz, Google has shown interest in Prometheus and is using it internally in some capacity. CoreOS has integrated Prometheus with etcd, its distributed key-value store for configuration data, and Docker has integrated it with its container tools. DigitalOcean, Boxever, KPMG, Outbrain, Ericsson, ShowMax, and the Financial Times are a few more examples of enterprises using Prometheus.
Version 2.0 expected soon
Version 1.6.1 of the monitoring system and alerting toolkit is now available. The latest version improves how memory is managed, and adds an experimental remote-read support feature, UI enhancements, Joyent Triton discovery, and new storage-, alerting-, and evaluation-related metrics. The CNCF is also giving us a sneak peek at Prometheus 2.0, with preview builds shipped as release tarballs and Docker containers.
Not just for high scalers
When the original founders first got SoundCloud to back the venture, they expected a mild response from maybe a few high scalers interested in the new architecture and the way it handles time-series data in a multi-dimensional format. The response they got, however, was anything but mild: enterprises both big and small instantly fell in love with its simplicity, scalability, and the compatibility it gains via "exporters." Exporters are, in essence, plugins that translate data from other tools into metrics that Prometheus understands and can work with.
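At its core, an exporter just reads another system's stats and rewrites them in the Prometheus text exposition format. Here is a hedged sketch of that translation step; the `redis_stats` dict, the metric names, and the blanket use of the gauge type are all invented for illustration (real exporters, like the official redis_exporter, do considerably more).

```python
def to_exposition(stats, prefix):
    """Render a flat dict of numeric stats as Prometheus text format."""
    lines = []
    for key, value in sorted(stats.items()):
        metric = f"{prefix}_{key}"
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {value}")
    return "\n".join(lines) + "\n"

# Pretend these numbers came from the tool being exported, e.g. via its API.
redis_stats = {"connected_clients": 7, "used_memory_bytes": 1048576}
print(to_exposition(redis_stats, "redis"))
```

Serve that string on an HTTP endpoint and any Prometheus server can scrape the third-party system as if it were natively instrumented, which is why exporters made Prometheus compatible with so much existing infrastructure so quickly.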
It's unlikely that Prometheus will replace existing tools like Nagios or New Relic, which still solve real problems. The great thing about Prometheus, however, is that it runs coherently alongside almost every other tool and makes a great addition to your stack. Since enterprises typically run more than 20 monitoring tools at a time, having a new-generation tool like Prometheus, designed with microservice architecture in mind, is indeed a godsend. Or, in other words, "a gift from the gods."
Photo credit: 20th Century Fox