Over the past few years, Kubernetes has become the go-to solution for cloud-based software requirements. It has proved its worth for hosting applications of all sizes across private, public, and hybrid clouds. Recently, organizations have also started realizing Kubernetes’ ability to host Big Data applications. Kubernetes is replacing mature Big Data platforms such as Hadoop because of its unique traits: a flexible, scalable, microservice-based architecture.
A brief overview of Hadoop
Hadoop is an open-source framework for storing and analyzing large amounts of unstructured data. It supports distributed processing (via MapReduce) and distributed storage (the Hadoop Distributed File System, or HDFS). Hadoop can collect a wide range of data types: structured data (relational databases, etc.), semi-structured data (logs, emails, and more), and unstructured data (clickstreams, social media data, etc.). Since its first version, Hadoop 0.1.0, was released in April 2006, it has steadily grown in popularity. It capitalizes on two key strengths: providing enormous processing capacity on commodity hardware and offering a fail-safe data architecture that prevents data loss.
Hadoop is considered the leading Big Data platform when the requirement is to process huge volumes of data, but it is inefficient for smaller data sets. It excels at batch-processing terabytes of data stored in files, yet its performance falls short when businesses need interactive, iterative, or real-time analytics on a dataset. In addition, operating the Hadoop framework requires users to learn a new set of skills and concepts. Various components of the Hadoop ecosystem, including MapReduce, Hive, Pig, and Spark, each have their own methodology, which means a steep learning curve for users.
New trends picking up
Here are some trends indicating that Kubernetes is overtaking Hadoop as the preferred platform for hosting Big Data applications in a large number of businesses.
1. Multi-tenancy replacing IT silos
Hadoop was built during an era when network latency was one of the major issues faced by organizations. Organizations preferred to host their entire data inside an in-house data center to avoid moving large amounts of data for analytical purposes. Hadoop provided the optimal solution for this problem by moving the processing to where the data was stored and running it in parallel.
However, with the evolution of microservices and container technology, organizations are quickly realizing that hosting their data in cloud storage provides several additional advantages. Several components of the Big Data stack (such as Spark and Kafka) can be hosted and operated more efficiently in a cloud-based Kubernetes environment. Applications hosted in containers can be started on demand and shut down when no longer required.
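As a minimal sketch of that on-demand lifecycle (the deployment name and container image here are illustrative, not prescribed by any particular stack), starting and tearing down a containerized service is a pair of commands:

```shell
# Launch a containerized Kafka broker on demand
# (deployment name and image are illustrative)
kubectl create deployment kafka-demo --image=bitnami/kafka

# Shut it down again once the workload is finished
kubectl delete deployment kafka-demo
```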
Also, container-based clusters allow organizations to run data science tools such as Jupyter or PyTorch in the same cloud environment. With YARN, various Hadoop and Spark workloads often ran as siloed subsystems that needed additional resources and skills. Kubernetes also provides capabilities such as application-aware scheduling, even on legacy hardware. Kubernetes-based containers further standardize the common semantics of high availability, monitoring, and upgrades across the entire environment via APIs and additional toolsets, something Hadoop lacked.
2. Kubernetes replaces YARN
Kubernetes, the open-source container orchestration technology, allows users to run their workloads across private, public, and hybrid clouds. While most modern analytical applications and services are Python-based tools built on a microservices architecture, YARN restricts users to Hadoop- and Java-based tools and the HDFS platform. Organizations are also realizing that managing cloud-based storage systems and databases is easier than maintaining an on-premises data farm on the Hadoop file system.
According to the StackShare community, Kubernetes has more adoption and acceptance than YARN. While Kubernetes has 54.2K stars and 18.8K forks on GitHub, YARN has 36.1K stars and 2.21K forks.
In 2019, Cloudera, one of the flag bearers of Hadoop technology (which merged with another flag bearer, Hortonworks), released its new cloud-based offering, the Cloudera Data Platform (CDP), which runs on Kubernetes instead of YARN. At the same time, it also released a YARN-based offering, Cloudera Data Hub (CDH), which allows users to run traditional YARN-based MapReduce and Spark applications on popular cloud platforms such as AWS and Azure. In the same year, Google announced that it would replace YARN with Kubernetes for scheduling Apache Spark.
3. Spark running on Kubernetes
Apache Spark, the open-source unified analytics engine, is used to process huge amounts of data at a rapid pace. Since its initial release in 2014, it has been adopted by several global enterprises, including Netflix, eBay, and Yahoo, for processing terabytes and petabytes of data. It also quickly became one of the top open-source communities, with a high number of contributors. Spark can handle large chunks of data stored across multiple machines, processing multiple datasets in parallel. Spark can run in standalone mode with only the Apache Spark framework and a JVM installed on each machine. However, for more robust resource and cluster management, Spark requires an external cluster manager to coordinate these machines. Users can choose among several options: Apache Mesos, Docker Swarm, Hadoop YARN, or Kubernetes.
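The choice of cluster manager is expressed through the `--master` URL passed to `spark-submit`; a rough sketch of the options (hostnames, ports, and the application jar are placeholders):

```shell
# Spark's own standalone cluster manager
spark-submit --master spark://master-host:7077 app.jar

# Hadoop YARN (the cluster location is read from the Hadoop configuration)
spark-submit --master yarn --deploy-mode cluster app.jar

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 app.jar

# Kubernetes (points at the cluster's API server)
spark-submit --master k8s://https://apiserver-host:6443 app.jar
```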
Running Spark on Kubernetes provides several advantages over a Hadoop YARN-based environment. The Apache Spark framework provides user-friendly APIs to developers, which makes it a natural fit for Kubernetes. When Spark is deployed on Hadoop, it requires a dedicated Hadoop cluster for Spark processing. When Spark is deployed on Kubernetes, by contrast, users can run other workloads, such as Python or R code and web applications, in the same unified infrastructure environment.
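A hedged sketch of submitting a Spark job directly to a Kubernetes cluster; the API server address, container image, and example jar path are placeholders to adapt to your own environment:

```shell
spark-submit \
  --master k8s://https://apiserver-host:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=my-registry/spark:3.3.0 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar
```

Kubernetes launches the driver and executors as ordinary pods, so the job shares the cluster with any other containerized workloads rather than requiring a dedicated Hadoop cluster.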
Also, in the Hadoop environment, managing dependencies and updating the environment is a tedious process. With Kubernetes, each workload's dependencies are isolated in its own container, saving a lot of time and resources and making them easier to manage independently. On top of all this, Kubernetes clusters provide an elastic, fully flexible infrastructure: users can spin up new machines easily, scale the cluster on demand, and destroy the extra capacity when it is no longer required, giving them pay-as-you-go cost advantages.
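For example (the deployment name below is illustrative), scaling capacity up for a burst of work and back down afterward is a one-line operation, and can even be automated:

```shell
# Scale out for a heavy processing window
kubectl scale deployment analytics-workers --replicas=10

# Scale back in when demand drops
kubectl scale deployment analytics-workers --replicas=2

# Or let Kubernetes adjust the replica count automatically based on CPU usage
kubectl autoscale deployment analytics-workers --min=2 --max=10 --cpu-percent=80
```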
4. Kubernetes meets Big Data expectations
During its evolution phase, Hadoop provided three main functionalities that made it a Big Data-ready solution: a distributed computation mechanism (MapReduce), robust data storage (HDFS), and a resource manager (YARN/Mesos). But modern technologies now provide a better replacement for each of these three components: Kubernetes as an efficient resource manager, Amazon S3 for data storage, and Spark and Flink as distributed computation engines.
Furthermore, the surrounding ecosystem boosts this flourishing Kubernetes regime. Kafka, Cassandra, Redis, MongoDB, Elasticsearch, and several other data platforms now include Kubernetes-based installs as part of their fully managed cloud offerings. In addition, almost all major enterprise applications, including Apache Tomcat, Zeppelin, JupyterHub, Nginx, MySQL, and many more, are readily available as containerized images.
Therefore, it’s fair to say that Kubernetes has become the preferred platform for Big Data requirements. However, Hadoop will still play a role in many Big Data workloads where organizations have invested heavily in the Hadoop ecosystem. It’s costlier for these organizations to throw away their past investments and start from scratch with Kubernetes.
Kubernetes becoming the platform to host Big Data apps
Most Big Data applications require a scalable and extensible architecture to deliver their best performance, and for this reason Kubernetes has become a preferred deployment option for many software vendors offering databases or analytical services. But Kubernetes still faces some challenges before it can be called an end-to-end solution for deploying an entire Big Data stack. It needs better options for persisting data across different jobs, since its container architecture was designed for stateless, short-lived applications. It is also still maturing in terms of data security and networking. However, Kubernetes receives huge support from the open-source community, and with every passing day, new capabilities are being developed. Kubernetes has the full potential to soon become the ideal platform for hosting all Big Data applications.
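Persistent storage across jobs is typically bolted on through PersistentVolumeClaims; a minimal sketch (the claim name and size are illustrative) that requests durable storage a pod can mount across restarts:

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-scratch
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi
EOF
```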