Delivering on the promise of AI has been an iterative and interdependent balance between processing capacity, math, and data. The rise of cloud computing and open source has largely mitigated capacity and math as bottlenecks. Compute power is essentially unlimited, and the algorithms behind AI are evolving to a point where they will become commoditized building blocks. This leaves data as the fundamental constraint to unlocking the full potential of AI.
In a sense, data is the equivalent of source code for AI environments. I believe the next step-function change in software development will come from the growth of data science. The concept of data-centric programming highlights the importance of managing and taming massive amounts of data for use within AI frameworks.
Raw data, while plentiful and in theory useful, cannot typically be used by an ML system without modification and preparation. Before being fed into an ML framework like PyTorch or TensorFlow, data has to be aggregated, transformed, cleaned, augmented, and – in most cases – labeled. This process consumes roughly 80% of resources in an average ML project, far exceeding other categories like algorithm development, model training, and deployment. Data prep, in other words, is the engine powering modern AI and ML.
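To make those preparation steps a bit more concrete, here is a minimal Python sketch of the aggregate, clean, transform, augment, and label stages. Everything in it is an assumption made for illustration: the record layout, the function names, and the toy "pixels" lists are hypothetical and do not reflect any particular framework or the Labelbox product.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]  # hypothetical record layout: {"id": str, "pixels": list or None}


def aggregate(*sources: List[Record]) -> List[Record]:
    """Combine records from several raw sources into one list."""
    return [record for source in sources for record in source]


def clean(records: List[Record]) -> List[Record]:
    """Drop records with missing pixel data and duplicate ids."""
    seen = set()
    kept = []
    for record in records:
        if record.get("pixels") is None or record["id"] in seen:
            continue
        seen.add(record["id"])
        kept.append(record)
    return kept


def transform(records: List[Record]) -> List[Record]:
    """Scale 0-255 pixel values into the 0-1 range."""
    return [{**r, "pixels": [p / 255 for p in r["pixels"]]} for r in records]


def augment(records: List[Record]) -> List[Record]:
    """Add a mirrored copy of each record, remembering which record it came from."""
    flipped = [
        {**r, "id": r["id"] + "_flip", "source_id": r["id"], "pixels": r["pixels"][::-1]}
        for r in records
    ]
    return records + flipped


def attach_labels(records: List[Record], labels: Dict[str, str]) -> List[Record]:
    """Attach a human-provided label; augmented copies inherit their source's label."""
    labeled = []
    for r in records:
        key = r.get("source_id", r["id"])
        if key in labels:
            labeled.append({**r, "label": labels[key]})
    return labeled


def prepare(sources: List[List[Record]], labels: Dict[str, str]) -> List[Record]:
    """Run the full toy pipeline from raw sources to labeled training records."""
    records = aggregate(*sources)
    records = clean(records)
    records = transform(records)
    records = augment(records)
    return attach_labels(records, labels)


# Toy inputs: two hypothetical cameras and a label table produced by human labelers.
camera_a = [{"id": "a1", "pixels": [0, 255]}, {"id": "a2", "pixels": None}]
camera_b = [{"id": "a1", "pixels": [0, 255]}, {"id": "b1", "pixels": [51, 102]}]
labels = {"a1": "weed", "b1": "crop"}

dataset = prepare([camera_a, camera_b], labels)
```

Even in this toy version, most of the code is preparation rather than modeling, and the label table is the one input that has to come from people.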
A new class of products is emerging to make this process more effective, easier to manage, and less costly. We call it “training data management.” Today, we’re proud to announce Labelbox as our first investment in this new category. If GitHub has become the platform for managing and developing software (code), then Labelbox has the potential to fill a similar role for data in the AI/ML world.
Labelbox is building a training data platform for the development of AI software. The company’s mission is to fill the critical role of interfacing between AI systems and the domain experts that make these systems function. To start, they focus on the problem of data labeling, an especially important part of the training data workflow.
Labeling – also known as annotation – encodes ordinary human intuition into machine-readable formats. It generates the information that machines actually “learn” in the machine learning process. For example, an AI model that identifies weeds in a field often needs to train on thousands of pictures of weeds. It also needs to know which plant in each image is a weed, versus a healthy crop or some unrelated object. That information is provided by a team of data labelers – people trained to recognize weeds and to edit images – working through the corpus, one data point at a time.
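As a rough illustration of what "machine-readable" can mean here, the Python sketch below shows one way a labeler's judgment about the weed example might be recorded. The schema is hypothetical: the class names, fields, and label values are assumptions chosen for this example, not Labelbox's actual export format.

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict, List


@dataclass
class BoundingBox:
    """One labeled region of an image (hypothetical schema)."""
    label: str   # e.g. "weed", "crop", or "other"
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int


@dataclass
class ImageAnnotation:
    """Everything one labeler recorded about one image."""
    image_id: str
    labeler: str
    boxes: List[BoundingBox]


def to_json(annotation: ImageAnnotation) -> str:
    """Serialize an annotation so a training pipeline can read it."""
    return json.dumps(asdict(annotation), sort_keys=True)


def count_by_label(annotation: ImageAnnotation) -> Dict[str, int]:
    """Count how many regions were given each label."""
    counts: Dict[str, int] = {}
    for box in annotation.boxes:
        counts[box.label] = counts.get(box.label, 0) + 1
    return counts


# A made-up annotation for a single field photo.
example = ImageAnnotation(
    image_id="field_0001.jpg",
    labeler="labeler_17",
    boxes=[
        BoundingBox(label="weed", x=40, y=60, width=32, height=48),
        BoundingBox(label="crop", x=120, y=30, width=64, height=90),
        BoundingBox(label="weed", x=200, y=150, width=28, height=35),
    ],
)

serialized = to_json(example)
```

Each box is a small piece of human judgment ("this plant is a weed") turned into coordinates and a class name that a training job can consume.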
Labeling is also necessary to identify tumors in medical images, defects on a manufacturing line, pedestrians in dash cam videos, buildings in satellite imagery, and many, many other applications. Some of these use cases can be handled by non-experts, while others require close analysis by highly skilled personnel (e.g. radiologists). The need for data labeling is not exclusive to visual data either – it’s equally strong for text and numeric data. Behind nearly every state-of-the-art ML model is a mountain of training data and a small army of data labelers.
Labelbox provides a flexible, cloud-hosted environment that equips data labelers to do their jobs. They have made this a truly enterprise-grade product, with a customizable labeling interface, deep API access, and strong security controls. Critically, the Labelbox platform also allows managers to coordinate any number of labeling teams, across both full-time and outsourced staff, all in one place. This unique feature gives Labelbox customers granular insight into the performance of their teams and frees them from dependence on any one vendor of labeling services. Labelbox acts as a single source of truth for defining, storing, and accessing training data across an entire organization.
In just two years in business, Labelbox has already established itself as the clear leader in this category. The company serves a long list of customers across industries, including healthcare, manufacturing, agriculture, transportation, retail, and financial services – an unusually diverse list! Most Labelbox customers find the company on their own and move through the sales process remarkably quickly – both strong signs of market pull and early product-market fit. It’s also quite rare for a Labelbox customer to leave the platform, which is a testament to the depth of the product and its central role in ML projects.
Most importantly, the Labelbox team is a tight-knit group of humble leaders, killer product visionaries, and relentless executors – not to mention several skilled airplane pilots. They have lived the problem in their previous roles and are among a small group of entrepreneurs leading the charge for better AI/ML infrastructure. We’re thrilled to partner with Manu, Brian, Dan, and the rest of the team to help build a foundational enterprise AI company.
I’d like to thank my partner Matt Bornstein for his work on this post and our investment in Labelbox.
***
Peter Levine is a General Partner at Andreessen Horowitz where he focuses on enterprise investing.