Kubernetes has changed the way organizations run their workloads. In the last couple of years, it has emerged as one of the leading application development platforms, and organizations around the world rely on it in production. However, like everything else, Kubernetes has its flaws, and in Kubernetes' case, that flaw is complexity. Yes, K8s makes it easy for organizations to configure clusters and individual containers, but with a growing number of configurations comes a higher risk of defects and failures. And thanks to that complexity, locating these failures can feel like finding the proverbial needle in a haystack. Fixing them is a different ballgame altogether.
1. Avoiding unnecessary CPU limits
Buffer has been using Kubernetes since 2016, but in 2020 it hit a strange issue that led to higher latency. Usually, when trying to pinpoint the source of such issues, the common assumption is bad configuration. To Buffer's surprise, however, the issue turned out to be the CPU limits themselves. Organizations are generally advised to set CPU limits, which Kubernetes enforces through the Linux kernel's CFS quota mechanism (operating over 100ms periods by default), so that no single container can exhaust a node's CPU. Without CPU limits in place, a runaway container can starve the node, leaving the Kubelet process on that node unresponsive; the node is then marked "NotReady," its pods get rescheduled elsewhere in the cluster, and the cascading load can take the entire cluster down.
The principle behind CPU limits sounds helpful in theory. However, when Buffer actually set CPU limits on its containers, latency became a problem. After investigating for a while, Buffer realized that the CPU was being throttled even for containers whose CPU usage was nowhere near the limit. Some research revealed that this was due to a bug in the Linux kernel's CFS throttling code. Buffer worked around it by removing CPU limits from its latency-sensitive containers while keeping CPU requests in place, and by relying on the Horizontal Pod Autoscaler to scale out whenever resource usage spiked. The bug in the Linux kernel has since been resolved.
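A minimal sketch of that kind of setup, with hypothetical names and values: CPU requests are kept so the scheduler can still place pods sensibly, the CPU limit is dropped to sidestep CFS throttling, and a Horizontal Pod Autoscaler scales out on sustained CPU usage instead.

```yaml
# Hypothetical Deployment: CPU requests but no CPU limit, so the
# scheduler can place the pod correctly without CFS quota throttling.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: latency-sensitive-app   # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: latency-sensitive-app
  template:
    metadata:
      labels:
        app: latency-sensitive-app
    spec:
      containers:
      - name: app
        image: example/app:1.0   # hypothetical image
        resources:
          requests:
            cpu: "500m"          # guaranteed share for scheduling
            memory: "256Mi"
          limits:
            memory: "256Mi"      # memory limits stay; only the CPU limit is dropped
---
# HPA that adds replicas when CPU usage climbs, instead of throttling.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: latency-sensitive-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: latency-sensitive-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Whether dropping CPU limits is safe depends on the workload mix on each node; for latency-critical services it trades hard isolation for predictable response times.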
This K8s failure was caused by a bug inside the Linux kernel that could have gone unnoticed for a long time. What helps in these situations is good visibility into your workloads, along with staying up to date on Kubernetes by actively participating in K8s forums.
2. Increased latency may not always be Kubernetes’ fault
When European classifieds websites operator Adevinta tried to move one of its microservices from Amazon EC2 to Kubernetes, it noticed that latency had become 10 times what it was on EC2. This is concerning, especially for an organization just beginning its Kubernetes journey. To find what was causing the spike, Adevinta collected metrics along the request path and realized that upstream latency was the problem: around 10-20ms on EC2 versus around 200ms on Kubernetes. On further investigation, it was found that the credentials provided by KIAM expired after about 15 minutes, while the AWS Java SDK proactively refreshes credentials whenever less than 15 minutes of validity remain. The two defaults combined meant the SDK was re-fetching credentials on nearly every call, adding extra round-trips to each request. The issue was fixed by requesting KIAM credentials with a longer expiration time, which brought latency down to a level even better than EC2.
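A sketch of what the fix might look like in a KIAM deployment. The image tag and the `--session-duration` server flag are assumptions to verify against your KIAM version's documentation; the point is simply to issue credentials valid for well over the AWS Java SDK's 15-minute refresh threshold.

```yaml
# Fragment of a hypothetical kiam-server container spec.
# Credentials issued for 60 minutes stay above the SDK's
# 15-minute refresh window, so they are no longer re-fetched
# on nearly every request.
- name: kiam-server
  image: quay.io/uswitch/kiam:v4.0   # assumed tag
  args:
  - server
  - --session-duration=60m           # assumed flag; default is 15m
```

The general lesson applies beyond KIAM: whenever two systems each apply a sane default, check how those defaults interact on the hot path.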
Migrating to a new system can be challenging in many ways. Sometimes, what works in one system does not translate well to another. In this case, there was no way of anticipating that two independently reasonable default configurations would interact badly. Developers therefore need to understand their configurations properly and know what each default setting is doing at all times. The issue was not a Kubernetes failure per se, but it goes to show that you can't lift and shift an application as-is to K8s.
3. Probe the configurations of your workloads
Jetstack was upgrading the master nodes of a cluster when it ran into an issue that caused the whole cluster to fail. The upgrade was meant to involve no API downtime. However, once the upgrade pipeline ran, the process didn't finish before the Terraform timeout (set to 20 minutes). When the pipeline was restarted, the upgrade failed with an error because the master's status was unhealthy. The cause was an admission webhook that had become unresponsive after the first master instance was successfully upgraded. This led to a crash loop, as the Kubelet was unable to report node health, and triggered a chain reaction in which GKE auto-repair kept creating new nodes to address the error. The issue was resolved by finding and deleting the unresponsive webhook and configuring a new admission webhook for Open Policy Agent (OPA) that intercepts requests only in the specific namespaces covered by the policy. A liveness probe was also set up to continuously monitor the admission webhook and restart it whenever it became unresponsive.
In this case, the Kubernetes failure could have been avoided if Jetstack had monitored API response times after deploying OPA. The slower responses to CREATE and UPDATE requests would have alerted Jetstack's developers before they hit the error during the master upgrade. Deploying OPA via its Helm chart would also have helped: the chart ships with the required configurations, including a liveness probe, which would have prevented the failure of the entire cluster. The key to avoiding configuration-related issues is to probe all of your configurations and make sure a latency change introduced by one of them doesn't snowball into a bigger problem later.
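The two mitigations from this story can be sketched roughly as follows; the names, labels, and port are hypothetical, and OPA's `/health` endpoint should be confirmed against your deployment. The webhook is scoped to explicitly labeled namespaces and fails open, so an unresponsive OPA can no longer stall cluster-wide API calls such as node status updates.

```yaml
# Hypothetical OPA admission webhook, restricted by namespaceSelector.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: opa-validating-webhook
webhooks:
- name: validating-webhook.openpolicyagent.org
  namespaceSelector:
    matchLabels:
      opa-webhook: enabled       # only opted-in namespaces are checked
  failurePolicy: Ignore          # fail open instead of blocking the API
  rules:
  - apiGroups: ["*"]
    apiVersions: ["*"]
    operations: ["CREATE", "UPDATE"]
    resources: ["*"]
  clientConfig:
    service:
      name: opa                  # hypothetical service name
      namespace: opa
  admissionReviewVersions: ["v1"]
  sideEffects: None
---
# Container-level fragment for the OPA pod: a liveness probe so the
# Kubelet restarts OPA automatically if it stops answering.
# (Belongs inside the OPA Deployment's container spec.)
livenessProbe:
  httpGet:
    path: /health
    scheme: HTTPS
    port: 443
  initialDelaySeconds: 5
  periodSeconds: 10
```

`failurePolicy: Ignore` is itself a trade-off: it keeps the API available if the webhook dies, at the cost of temporarily skipping policy enforcement.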
4. DNS outage can sink the whole ship
Zalando Fashion Store's webpages suddenly started returning errors as one of the downstream services behind the aggregation layer began to time out. This led to a surge in requests as clients retried, which in turn caused a spike in DNS queries against the CoreDNS infrastructure. Handling all these queries required more memory than the CoreDNS pods were allowed, so they ran out of memory and were repeatedly OOM-killed, leading to a total DNS outage. The outage also caused the aggregation layer service to open its circuit breakers to downstream services, since it could no longer resolve hostnames and there was no DNS caching in place. Worse still, the internal monitoring systems were rendered useless, since they needed to reach external systems to provide alerting on the relevant metrics.
It took longer than usual to reach an on-call Kubernetes developer who could manually raise the memory limit from 100Mi to 2,000Mi. Once this was done, the CoreDNS pods stopped getting OOM-killed, and everything returned to normal.
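A sketch of the change, as it would appear in the CoreDNS Deployment's container spec (the request values here are illustrative assumptions; only the raised limit comes from the incident description):

```yaml
# Fragment of the CoreDNS container spec with the memory limit
# raised so a surge in DNS queries no longer OOM-kills the pods.
resources:
  requests:
    cpu: 100m        # illustrative value
    memory: 100Mi    # illustrative value
  limits:
    memory: 2000Mi   # was 100Mi; headroom for query spikes
```

Raising the limit buys headroom but doesn't remove the failure mode entirely; caching and capacity planning for worst-case query rates matter just as much.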
A DNS infrastructure can collapse in on itself if it isn't configured properly. To make it more resilient, developers must ensure that a single wave of DNS timeouts can't set off a domino effect, for example by caching DNS responses closer to the workloads. Another important lesson from this incident is that monitoring must be robust and independent. Because of the total DNS outage, the internal monitoring and alerting system shut down completely, which delayed recovery significantly since the problem wasn't reported soon enough.
There are usually reasons — and solutions — for Kubernetes failures
Kubernetes is complex. Sometimes Kubernetes failures stem from an inherent flaw in the K8s ecosystem, and sometimes from wrong configurations and a lack of proper monitoring. All of these things can contribute to failures that are hard to trace and may lead to unexpectedly long outages. The K8s community is huge, and developers can always use its forums to find solutions to the problems they face, either by going through existing issues and fixes or by contacting experts when needed.
Featured image: Shutterstock / TechGenix photo illustration