Humans are producing more data than ever before, at an estimated 2.5 quintillion bytes every day. By some estimates, over 90 percent of the world’s data was created in the last two years, and with every computer, phone, CCTV camera, and IoT or smart device constantly producing data, the volumes are only going to climb. This is why we see several data platform startups receive serious funding as they roll out products that bridge the gap between the amount of data we generate and the amount of data we can use.
In the past, we had the opposite problem: there wasn’t enough data to train AI or ML models. Today, it’s like the data floodgates are open, and there’s a race to see who can level up enough to ingest and make sense of it all. Businesses, in particular, are being swamped with more data than ever before, with vast amounts of behavioral data accumulating alongside transactional and demographic data. The products designed to help users not just ingest data at the scale it’s produced today but also use it productively are called data platforms.
1. Databricks raises $1 billion
First on our list is an organization that just announced a billion-dollar Series G round led by Franklin Templeton and joined by new investors like AWS and Salesforce, putting its post-money valuation at a cool $28 billion. Databricks is known as the “Data and AI company,” and its data-engineering, or Big Data, platform goes by the same name. It’s a platform that facilitates end-to-end ETL (extract, transform, load) and allows users to process and manage massive amounts of data with the help of machine learning models.
Databricks was founded in 2013 by the creators of Apache Spark, Delta Lake, and MLflow, and more than 5,000 organizations worldwide, including 40 percent of the Fortune 500, now rely on its Unified Analytics Platform. The Apache Spark-based platform is built on Lambda Architecture, a design for processing massive amounts of data by computing arbitrary functions over it. It achieves this by splitting the work into three separate layers: a batch layer, a serving layer, and a speed layer, which is also called a stream layer.
New data is fed to the batch layer and the speed layer simultaneously. The batch layer periodically recomputes views over the complete dataset, while the speed layer covers only the most recent data that the batch layer has not yet processed because of its latency. The serving layer holds the outputs of the batch and speed layers so they can be queried with low latency as needed. Databricks on AWS launched in 2015, followed by Azure Databricks in 2018 and now Databricks on Google Cloud, letting the company claim the title of “only unified data platform across all the clouds.” In addition to Notebooks, which are collections of runnable commands, features include enterprise security, visualization, rich dashboards, data lake engines, collaboration, analytics, and more.
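To make the three layers concrete, here is a minimal sketch of the Lambda Architecture idea in plain Python. It is an illustration only, not how Databricks or Spark implements it: the class `LambdaPipeline`, its method names, and the event-counting use case are all assumptions made for this example.

```python
from collections import Counter


class LambdaPipeline:
    """Toy Lambda Architecture that counts events per key (hypothetical example)."""

    def __init__(self):
        self.master_log = []         # append-only record of every event (batch layer input)
        self.batch_view = Counter()  # view precomputed by the batch layer
        self.speed_view = Counter()  # incremental counts for events not yet in the batch view

    def ingest(self, key):
        # New data is dispatched to the batch layer and the speed layer at the same time.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # The batch layer recomputes its view from the complete log.
        # Once it has caught up, the speed layer's counts are no longer needed.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        # The serving layer answers queries by merging the batch and speed views.
        return self.batch_view[key] + self.speed_view[key]


pipeline = LambdaPipeline()
pipeline.ingest("click")
pipeline.ingest("click")
pipeline.run_batch()       # batch view now holds both clicks
pipeline.ingest("click")   # arrives after the batch run, so only the speed layer has it
total_clicks = pipeline.query("click")
```

In this sketch a query stays accurate even between batch runs, because the speed view fills in whatever the batch view has not yet absorbed.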
2. Tealium raises $96 million
Next on our list is Tealium, which just announced a $96 million Series G round led by Georgian and Silver Lake Waterman, bringing its valuation to $1.2 billion. Tealium’s star product is Tealium AudienceStream, a customer data platform (CDP) that uses data orchestration and enterprise tag management to break down silos and unify customer data into a single source of truth. Tealium iQ Tag Management takes care of client-side data collection by mapping all data collected during client interactions. For server-side collection, Tealium has an Event Data Framework called EventStream that helps leverage data in real time.
Today, real-time engagement is critical. Sellers, for example, want to engage customers while they’re shopping for their products, not a week later when they’ve already found and bought what they were looking for. Tealium achieves this by taking a “data-layer” approach that standardizes data, making it easier to use and respond to in real time. A properly defined data layer ensures efficient monitoring and tracking and solves the problem of data fragmentation. Tealium Predict ML takes things a step further by injecting machine learning-powered insights into customer profiles that make it possible to predict customer actions more accurately.
In addition to client-side and server-side data collection, transformation, and distribution, Tealium also provides security, privacy, and governance features. These features include encryption, content and privacy management, single-tenant solutions like private clouds, geo-based management, and data recovery. Tealium is also compliant with international privacy frameworks like the EU-U.S. Privacy Shield Framework and the Swiss-U.S. Privacy Shield Framework. Additionally, Tealium uses a “key-ring” principle to maintain consistent user profiles across multiple sessions on different platforms and devices. Other products include Tealium Data Access and Tealium for mobile.
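The article does not describe how the “key-ring” principle works internally, so the following is only a generic sketch of the underlying idea, identity stitching: several identifiers (a cookie, a device ID, an email) hang on one profile, and profiles merge when an event links their identifiers. The `ProfileStore` class, its methods, and the identifier formats are hypothetical and are not Tealium’s API.

```python
class ProfileStore:
    """Toy identity stitching: many identifiers map to one profile (hypothetical example)."""

    def __init__(self):
        self.key_to_profile = {}  # identifier -> profile id
        self.profiles = {}        # profile id -> set of identifiers
        self._next_id = 0

    def resolve(self, identifiers):
        """Return the profile id for these identifiers, merging profiles they link."""
        matched = {self.key_to_profile[k] for k in identifiers if k in self.key_to_profile}
        if matched:
            target = min(matched)  # keep the oldest matching profile
        else:
            target = self._next_id  # no identifier is known yet: start a new profile
            self._next_id += 1
            self.profiles[target] = set()
        # Fold every other matched profile into the target profile.
        for pid in matched - {target}:
            for key in self.profiles.pop(pid):
                self.key_to_profile[key] = target
                self.profiles[target].add(key)
        # Attach the identifiers seen in this event to the target profile.
        for key in identifiers:
            self.key_to_profile[key] = target
            self.profiles[target].add(key)
        return target


store = ProfileStore()
web = store.resolve({"cookie:abc"})                          # anonymous web session
app = store.resolve({"device:xyz"})                          # anonymous mobile session
store.resolve({"cookie:abc", "email:jo@example.com"})        # user logs in on the web
merged = store.resolve({"device:xyz", "email:jo@example.com"})  # same user logs in on mobile
```

After the second login, the web and mobile sessions resolve to the same profile, which is the behavior the key-ring description implies.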
3. Labelbox raises $40 million
While the first two data platforms on our list use AI to power data management, Labelbox manages data to train AI. Labelbox is a training data platform for enterprise ML applications that just announced a $40 million Series C round led by B Capital Group, bringing its total funding to $79 million. What makes training data tricky is that it all has to be labeled, usually by humans, before it is fed to a neural network. For example, it’s only after many pictures have been properly annotated as cats that the algorithm learns to recognize a picture of a cat, and this needs to be done repeatedly over millions of data points.
While labeling is often viewed as the most time-consuming aspect of training AI models, Labelbox claims to “reduce human effort” in labeling by up to 80 percent. The platform analyzes existing AI models to advise users on the appropriate labels to use while also allowing pre-labeled data to be uploaded in large quantities. This makes things a lot simpler for customers who already use labeling services, as Labelbox integrates with and allows collaboration across databases, back offices, and labeling services. Labelbox also integrates well with internal processes and allows for a complete workflow to organize, manage, and use data effectively.
The platform achieves this by providing users with a web service and an API that allows different teams of users to work together on a single dashboard. It also automates the process of high-quality labeling with minimum errors by training AI models for active learning. In addition to allowing tools to be customized for specific use cases, Labelbox also facilitates labeling directly on photos, documents, videos, and text. Additional features include the ability to browse and manage training data, labeling and performance metrics, a catalog of available labeling services, object analytics, and more.
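The article does not say which active learning strategy Labelbox uses. As an illustration of the general technique, here is a minimal sketch of uncertainty sampling, one common approach: the model’s least confident predictions are sent to human labelers first, so effort goes where it helps most. The function names and the sample predictions are assumptions made for this example.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Pick the `budget` items whose predictions are most uncertain.

    `predictions` maps an item id to the model's class probabilities for that item.
    """
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs for three unlabeled images: [P(cat), P(not cat)]
predictions = {
    "img_1": [0.98, 0.02],  # model is confident
    "img_2": [0.55, 0.45],  # model is unsure
    "img_3": [0.80, 0.20],
}
to_label = select_for_labeling(predictions, budget=2)
# to_label is ["img_2", "img_3"]
```

Confident predictions such as `img_1` can be accepted as pre-labels or spot-checked, which is one way a platform could cut the amount of manual labeling.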
Big Data management
Big Data management and analytics continues to be an increasingly attractive sector with investments pouring in, especially since it has so much to do with AI and ML. As we already mentioned, this is a new problem: it used to be a case of not having enough data to train models. As tools and platforms become more advanced to help enterprises with the ever-increasing amount of data coming in from everywhere, it’s going to be all about who can convert all that raw data into actionable insights that provide competitive advantages.
Featured image: Pictures of Money / Flickr