Jensen Huang recently claimed that “the ChatGPT moment for general robotics is just around the corner.” While this claim might be a bit optimistic, it is nonetheless undeniable that robotics researchers have begun massively scaling datasets for robot learning through teleoperation, simulation, and synthetic retargeting data — resulting in stronger, more generalizable robot models. 

GPT-3.5 was the result of years of research breakthroughs, but the general public was far more affected by the release of ChatGPT: one of the first broadly available, effective language models in a simple UI. The ChatGPT moment was also the pivot point between traditional model scaling and the beginnings of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). These new training paradigms required new forms of data, not to mention new types of annotations relating to safety and red-teaming for deployed systems. 

In robotics, we’re inching closer to generalizable, deployable robot models, but still struggling to find the right bounds for these deployed systems. As we continue to scale robot learning, we should start to think about what comes next for robot data once (or if!) we reach that ChatGPT moment in robotics. Stronger foundation models and generalizable, horizontal robotics platforms paint a picture of real-world, deployed robots in the near future — the right time to develop the next generation of robotics is now.

In this post, we lay out some of the challenges, as well as promising work to overcome them. We also want to introduce an open-source platform called ARES (explained in more detail here) to help roboticists curate better datasets, train better models, and create safer systems.

A brief history of ML datasets

Data labeling used to be a simple task. Early datasets like MNIST, CIFAR-10, and ImageNet classified images into discrete categories: handwritten digits or object nouns pulled from simple taxonomies. The annotations were created by the researchers themselves, or by low-cost labor provided through services like Amazon Mechanical Turk. 

However, the rise of deep learning led to models quickly saturating these benchmarks, prompting the emergence of specialized data annotation providers to develop more complex datasets to train stronger models. Scale AI, Labelbox, SuperAnnotate, and others developed “model-in-the-loop” annotation systems to assist human labelers in producing more complex data types (e.g., writing long essays on niche topics for supervised fine-tuning or grading model responses for reinforcement learning from human feedback). These new datasets led to model providers like OpenAI and Anthropic releasing stronger models and better products, like ChatGPT and Claude. 

As these products matured from research previews into deployed systems, so did the data powering the models. Whereas prior work focused on “on-policy” labels — data capturing the exact answers the model is supposed to provide — real-world systems demanded new, “off-policy” data initiatives focused on safety, red-teaming, benchmarks, and grading rubrics. This also prompted a shift in the labeling workforce, from a low-skill, low-cost global pool to increasingly high-credentialed, well-paid U.S.-based annotators. 

Right now, robotics data is still stuck in this “on-policy” phase: focused on collecting episodes detailing the exact trajectory required to solve a certain task. If it feels like general-purpose robots have always been right around the corner, that’s because generalizable robotics models have been harder to develop than their language-model cousins in cyberspace. However, newer datasets that pair robot trajectories with natural-language instructions are opening the door for traditional vision-language models (VLMs) to help out in the robot domain. 

Google’s RT-1 breakthrough

Google’s RT-1, released in late 2022, was one of the first models and datasets aiming to learn a generalizable policy across many tasks, scenes, distractor objects, and horizon lengths. But beyond being one of the first demonstrations of a robotics model succeeding in diverse setups, its real breakthrough was showing that a general ML model could solve downstream tasks “zero-shot,” without additional task-specific demonstrations. Underpinning this accomplishment was a large dataset of diverse tasks collected through human teleoperation of real robots. 

Google then released RT-2, which demonstrated the transfer of internet-scale pretraining to robotics data, delivering large advancements over robot-embodiment-specific models! This was one of the first vision-language-action (VLA) models, ushering in a new paradigm of adapting pre-trained multimodal models. 

With this work, the recipe for training a good robot model became clearer: transfer knowledge from general VLMs to embodied robot policies with behavior cloning. The pretrained backbones provide generalizability, while human teleoperation data provides the low-level control needed to actually execute tasks. As you add more data (through more environments, tasks, and objects), you get a better model. 
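To make that recipe concrete, here is a minimal behavior-cloning sketch in PyTorch. The `vlm` backbone, its `encode()` method, and the dataset fields are hypothetical placeholders for illustration, not any specific model’s API — real VLA training uses far more sophisticated action heads and objectives.

```python
# Minimal behavior-cloning sketch (assumptions: a pretrained VLM backbone `vlm`
# exposing .encode(images, instruction) -> features, and a teleop dataset of
# (images, instruction, action) batches; names are illustrative, not a real API).
import torch
import torch.nn as nn

class VLAPolicy(nn.Module):
    def __init__(self, vlm, feature_dim: int, action_dim: int):
        super().__init__()
        self.vlm = vlm                      # pretrained vision-language backbone
        self.action_head = nn.Sequential(   # small head mapping features -> actions
            nn.Linear(feature_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, images, instruction):
        features = self.vlm.encode(images, instruction)   # hypothetical encode()
        return self.action_head(features)

def bc_step(policy, optimizer, batch):
    """One behavior-cloning step: regress expert teleop actions with MSE."""
    pred = policy(batch["images"], batch["instruction"])
    loss = nn.functional.mse_loss(pred, batch["action"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```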

At this point, the floodgates opened: researchers at many academic institutions banded together to standardize robot learning and provide a new, open dataset for cross-embodiment robot learning: Open X-Embodiment. Each new axis of diversity — like environment, robot embodiment, lighting, or focus objects — yielded greater returns for the generalizable performance of the final policy.

Today’s best practices for robotics data

Today, researchers are finding various ways to expand the data used for robot training. These range from scaling teleoperation to new and inventive methods for adapting human actions through a process called retargeting, which maps human motions onto different robot embodiments. Students at Stanford developed UMI (Universal Manipulation Interface), a novel system for robot data collection involving just the end-effector, making it easier to scale embodiment-free data collection in the wild. Hindsight relabeling and “sketching” provide a simpler interface for humans and models to communicate trajectory demonstrations. Traditionally, researchers relied on in-house, first-party data collection, but modern scaling requirements have prompted new startups offering third-party “robot-data-collection-as-a-service,” in addition to open-source data-sharing efforts like Open X-Embodiment, UMI’s Data Hub, and LeRobot’s dataset hub.
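Hindsight relabeling, for instance, can be sketched in a few lines: replace an episode’s original instruction with a description of what the robot actually achieved, so even failed rollouts become valid demonstrations of something. The `Episode` structure and the `describe_final_state` captioner below are illustrative assumptions, not a specific dataset format or library API.

```python
# Hindsight relabeling sketch, in the spirit of hindsight experience replay:
# treat whatever the robot actually achieved as the goal it was "asked" to achieve.
# `describe_final_state` is a hypothetical captioner (human or VLM) that turns
# the final observation into a language goal; it is not a real library call.
from dataclasses import dataclass, replace
from typing import Any, List

@dataclass
class Episode:
    observations: List[Any]   # per-step camera frames / robot state
    actions: List[Any]        # per-step commands
    instruction: str          # original (possibly unmet) language goal

def hindsight_relabel(episode: Episode, describe_final_state) -> Episode:
    """Return a copy of the episode whose instruction matches what actually happened."""
    achieved_goal = describe_final_state(episode.observations[-1])
    return replace(episode, instruction=achieved_goal)

# Usage: every recorded rollout becomes a demonstration of *something*,
# even if the robot failed the originally commanded task.
# relabeled = [hindsight_relabel(ep, vlm_captioner) for ep in raw_episodes]
```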

In addition, the rise of simulation and synthetic data has played a large role in robot progress: simulation environments like MuJoCo and Isaac Lab have provided the groundwork for massively scaling simulated robot data collection, which is further augmented by advances in generative AI. Genesis, LucidSim, and GR00T represent efforts to bridge the sim-to-real gap by combining traditional simulators with new generative AI models, while other teams focus on world models and data augmentation through “semantically imagined experiences” or task language relabeling. For more thoughts on world models, simulation, and synthetic data for robots, see my colleague Peter Bowman-Davis’s post on World Models and the Sparks of Little Robotics.
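As a rough illustration of how cheap simulated data collection can be, here is a minimal rollout-collection loop using the MuJoCo Python bindings. The scene file `scene.xml` and the `scripted_policy` are placeholders; a real pipeline would randomize scenes, log camera frames, and parallelize across many environments.

```python
# Minimal simulated-rollout collection sketch with the MuJoCo Python bindings.
# Assumes a scene file and a scripted_policy(qpos, qvel) -> actuator commands;
# both are illustrative stand-ins.
import mujoco
import numpy as np

def collect_rollout(xml_path: str, scripted_policy, horizon: int = 500):
    model = mujoco.MjModel.from_xml_path(xml_path)
    data = mujoco.MjData(model)
    episode = []
    for _ in range(horizon):
        action = scripted_policy(data.qpos.copy(), data.qvel.copy())
        data.ctrl[:] = action                 # write actuator commands
        mujoco.mj_step(model, data)           # advance the physics
        episode.append({
            "qpos": data.qpos.copy(),
            "qvel": data.qvel.copy(),
            "action": np.asarray(action),
        })
    return episode
```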

Despite these advances, however, we’re still focusing on collecting “on-policy” data: data that directly teaches robots how to solve a given task. In order to prepare for training high-level, generalizable, cross-embodiment models, we should start thinking about what comes next for robot data. 

What comes next for robot data labeling?

Following prior trends in machine learning research, we should expect a shift from “on-policy” exact solutions to more difficult, higher-level “off-policy” labels focusing on the impact of deployed systems. We should also strive for head-to-head evaluations, like those provided by LMArena. But what do these tasks look like for robot models, and how can we build systems to create solutions for these questions?  

Benchmarks

Providing standardized evaluations for robot models is a difficult task, to say the least. Consider how results are reported today: most robot papers end with a series of evaluations and a graph comparing methods. While open reporting of success rates is valuable and a good way to measure performance, it is difficult to truly compare performance across models. Researchers with the best intentions may still accidentally cherry-pick a setup or scenario; exact reproduction of evaluation environments is difficult within a given lab, much less in an entirely different location. And while cleaning datasets of test examples, environments, and objects is a best practice, it can be difficult to truly know that your test environment is free from contamination. 
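One low-cost improvement is to report uncertainty alongside raw success rates. The sketch below uses a standard Wilson score interval (my choice of method, not something the post prescribes) to show why two success rates measured over a handful of trials often cannot be meaningfully distinguished.

```python
# Success rates from a handful of trials carry wide error bars; a Wilson score
# interval makes that uncertainty explicit when comparing two policies.
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score interval for a binomial success rate."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# 9/10 vs. 7/10 sounds decisive, but the intervals overlap heavily:
print(wilson_interval(9, 10))   # ~(0.60, 0.98)
print(wilson_interval(7, 10))   # ~(0.40, 0.89)
```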

The opposite is also true: in order to present clean evaluations, researchers go to great lengths to control their setups. Stanford professor Chelsea Finn, for example, notably mentioned how she and her colleagues on a previous project were “paranoid about varying lighting messing up the network, so we did all the experiments after sunset.” 

Part of the difficulty is hardware: running a released model on a different robot may introduce minute differences that degrade performance, and this problem gets worse with increasing task and robot complexity. Without standardized benchmarks (which require standardized environments), we’re left with lackluster evaluations. So it’s promising to see releases such as Physical Intelligence’s π model, which included demonstrations of zero-shot performance at novel labs executed by unrelated teams! Even better, the recent Gemini Robotics release showed a reimplementation of the π model in order to directly compare between models and environments. 

Open-source robotics efforts like Hugging Face’s LeRobot make inroads on this problem by specifying robot hardware, compute resources, and environment details. Researchers can use exactly the same hardware and environment, down to the mats, grippers, and manipulated objects, all sourced from pre-specified online merchants like Amazon or Alibaba. The COMPARE group (Collaborative Open-source Manipulation Performance Assessment for Robotics Enhancement) has also led much work in this space by publishing open-source benchmarking protocols, recommending common tools and testbeds, and building a repository of artifacts used in robot benchmarking. 

However, controlling for other environmental factors, like lighting, remains difficult without dedicated evaluation spaces shared across developers. Projects like LMArena work for LLMs because language models are effectively fungible across different compute sources, but something similar for robotics would require dedicated environments with support staff to set up, install, monitor, annotate, and judge real-world policy rollouts. 

A team at UC Berkeley recently released a useful example of real-world robot model benchmarking called AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World. The authors set up evaluation cells, which are real-world stations equipped with WidowX arms, and train success classifiers and reset policies. This enables outside researchers to submit policies for evaluation; once a policy has been run, the evaluation cell autonomously grades the result, posts the scores to Weights & Biases, and resets the environment. Although this is a great step forward, the setup is currently limited to just four tasks across two evaluation cells.
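For illustration, an evaluation cell of this kind might be organized roughly as follows. The `robot`, `success_classifier`, and `reset_policy` interfaces are hypothetical stand-ins, not the AutoEval authors’ actual code; the Weights & Biases logging calls are the library’s standard API.

```python
# Hypothetical outline of an autonomous evaluation cell: roll out a submitted
# policy, grade it with a learned success classifier, log the score, then run
# a reset policy so the next trial starts from a fresh scene.
import wandb

def evaluate_policy(policy, robot, success_classifier, reset_policy,
                    task: str, n_trials: int = 20, horizon: int = 300):
    run = wandb.init(project="robot-eval-cell", config={"task": task})
    successes = 0
    for trial in range(n_trials):
        obs = robot.reset_observation()
        for _ in range(horizon):
            obs = robot.step(policy.act(obs, task))       # roll out the policy
        success = success_classifier(obs)                 # learned binary grader
        successes += int(success)
        wandb.log({"trial": trial, "success": int(success)})
        for _ in range(horizon):                          # autonomous scene reset
            obs = robot.step(reset_policy.act(obs, task))
    wandb.log({"success_rate": successes / n_trials})
    run.finish()
```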

Diversity is the key to evaluation

As a caveat to dedicated evaluation spaces, testing robot policies across many different lab environments may provide exactly the kind of real-world randomness we’re looking for! We need a balanced approach that combines known, centralized evaluation spaces with noisy, decentralized deployments to capture edge cases, operator error, and distribution shift.

Surging LLM inference prices driven by test-time compute are also forcing researchers to rethink prior assumptions, bringing the hourly cost of AI-powered evaluation on par with the hourly cost of a human evaluator. Although it’s unlikely that we’ll have humanoids competing head-to-head in MMA, big-wave surfing, or cliff jumping any time soon, developing difficult — and cost-effective — real-world evaluations that force robots out of sterile labs and into dirty, dusty, wet environments will be critical. If inference costs keep rising for difficult evaluations, an hour of human labor to monitor robot policy evaluations might seem pretty cheap.

Testing robots on complex tasks

Some early, public efforts to test robots on complex real-world tasks are already underway.

Robot benchmarks will also need to consider success criteria. Right now, most robot tasks have seemingly simple goals: push the block, fold the shirt, pour the water into the cup. While these tasks have high-level binary goals, we will need to develop more granular, stringent criteria to differentiate model performance. Does folding a shirt include smoothing any wrinkles? Does pouring water into a cup count if the robot spills a drop? What about qualifying performance under different lighting, environments, or initial states?

We may also want to consider metrics beyond just task-completion percentage, such as energy use, time, FLOP efficiency, action smoothness, trajectory safety, and robot capability. Does achieving a task with a significantly less-capable robot count as stronger performance than achieving it with a better one?
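One way to operationalize this is a rubric-style rollout report that records sub-criteria and cost metrics alongside the binary outcome. The fields, criteria, and weights below are illustrative assumptions, not a proposed standard.

```python
# Rubric-style scoring sketch: each rollout gets graded against sub-criteria
# plus cost metrics, instead of a single binary success flag.
from dataclasses import dataclass
from typing import Dict

@dataclass
class RolloutReport:
    task: str
    criteria: Dict[str, bool]   # e.g. {"shirt_folded": True, "wrinkles_smoothed": False}
    wall_time_s: float
    energy_wh: float
    max_jerk: float             # proxy for action smoothness / trajectory safety

    def score(self, weights: Dict[str, float]) -> float:
        """Weighted fraction of rubric criteria met (0.0 to 1.0)."""
        total = sum(weights.values())
        met = sum(w for name, w in weights.items() if self.criteria.get(name, False))
        return met / total if total else 0.0

report = RolloutReport(
    task="fold_shirt",
    criteria={"shirt_folded": True, "wrinkles_smoothed": False, "placed_on_shelf": True},
    wall_time_s=42.0, energy_wh=3.1, max_jerk=0.8,
)
print(report.score({"shirt_folded": 0.6, "wrinkles_smoothed": 0.2, "placed_on_shelf": 0.2}))  # 0.8
```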

Safety and red-teaming

When deploying traditional ML systems, organizations conduct safety and red-teaming evaluations on their own models before release to verify that certain safety criteria are met. And progress has been good so far: beyond refusing responses that touch on certain topics, there is also promising work on ideas like semantic safety. However, trusting an LLM not to say bad words is one thing; trusting a robot not to hurt people or damage property in your own home is quite another. We need to guarantee that deployed VLA systems are just as safe as — if not safer than — 2D language and vision models. 

Thankfully, some simple solutions for these problems exist. We can, for example, engineer robots with high backdrivability, passive compliance, or torque limits to enable safer execution in a potentially unsafe environment. (Eric Jang wrote a great blog post about understanding motor physics and the potential problems of a robot trying to mimic human action.) We can define safety policies directly in software and control signals, constraining the robot’s actions so that it avoids objects and itself; in high-dimensional action spaces, however, this becomes much trickier. 
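As a concrete (if simplistic) example of a software-level safety policy, a wrapper can clamp commanded joint velocities and torques to conservative limits before they reach the low-level controller. The limits and interfaces here are illustrative; note that a clamp like this says nothing about collision geometry, which is where the high-dimensional difficulty lives.

```python
# Minimal software safety layer: element-wise clipping of velocity and torque
# commands into a conservative envelope before execution. Limits are illustrative.
import numpy as np

class SafetyClamp:
    def __init__(self, max_joint_vel: np.ndarray, max_torque: np.ndarray):
        self.max_joint_vel = max_joint_vel
        self.max_torque = max_torque

    def filter(self, vel_cmd: np.ndarray, torque_cmd: np.ndarray):
        """Clip commands element-wise into the allowed envelope."""
        safe_vel = np.clip(vel_cmd, -self.max_joint_vel, self.max_joint_vel)
        safe_torque = np.clip(torque_cmd, -self.max_torque, self.max_torque)
        return safe_vel, safe_torque

# Usage: wrap every policy output before sending it to the robot.
# safe_vel, safe_tau = clamp.filter(policy_vel, policy_tau)
```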

We can also equip robots with learned collision-avoidance policies, such as ARMOR from Apple, which focuses on enhancing spatial awareness. These systems become more important when you consider the capital required to purchase and maintain a robot. Most “robot-as-a-service” companies expect the CapEx of a robot to be repaid over years, which means they need years of damage-free operation, for both the robot itself and the environment it operates in.

Another important consideration is what jailbreaking and red-teaming look like for robot models. Some work is already being done in this field, for example by using existing jailbreaking techniques to attack the ML models controlling robots and force them to take harmful actions. While these attacks are geared toward traditional ML systems that happen to be deployed on a robot, rather than newer VLA policies, the authors also developed generalizable jailbreaking policies that could be applied to other models. The challenge here is that, with traditional ML systems, we can easily run evaluation suites against pre-defined test cases — but for real, embodied systems, we are limited in the resources and time available to certify safety. 


The real world

But benchmarking models themselves is just half the battle — to certify that robot systems are safe and robust, we need to test their capabilities in the real world. Gone are the days of walking around in a lab setting with a safety harness; today’s robots have to prove their mettle in complex, unstructured environments like running through a forest or walking up a waterfall. Useful robots will encounter dust, dirt, mud, sand, and more as they assist laborers in construction, mining, and other critical industries. 

Increasing the degrees of freedom, though, makes it more difficult to control and certify robot systems. We will also want to see certifications like ingress protection (IP) ratings put to the test. Systems certified to IP65 are dust-tight and resistant to low-pressure water jets, which may be enough for the home environment. Robot systems built for construction may need IP67 (protected against immersion up to 1 meter) or even IP69K (resistant to high-pressure, high-temperature water jets). 

Whatever the use case, we’ll need to ensure that robots can survive the actual environment. Construction and agricultural vehicles are a great analogy: After a full day’s use, you can power wash a skid steer loader to remove muck and grime, but that’s still a scary prospect for most robots. And if it’s true that robots can be “fried” from single-purpose demos, researchers will need to find cost-effective methods for intense real-world testing.

Other issues

There are countless other real-world issues that we need to solve in order to deploy robot systems at scale. Some examples: 

  • Long-context understanding is still a problem that plagues general ML models, despite many advancements in the space. Many robotics datasets, like the episodes provided in Open X-Embodiment, provide rollouts covering single-task execution on the order of seconds; we will need datasets and benchmarks covering minute- and hour-long multistep tasks to understand robot performance on real-world tasks. 
  • Inter-robot collaboration is also a difficult data problem. Collection, annotation, and recording of these episodes mirrors modern-day multi-agent collaboration problems. And recent collaboration demonstrations do not release much information on the training paradigm that enabled such cooperation. Simulation may provide an edge here to scale data collection without quadratically increasing costs. 
  • Data selection and curation remains a difficult problem for the robotics community. As we scale data collection, it will become increasingly important to intentionally select episodes and actions that reflect good performance. Existing robotics datasets are rife with noisy episodes containing sensor malfunctions or obstructions that hurt policy performance; a minimal filtering sketch follows this list. Our aforementioned ARES project is designed to assist in this process. 
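A minimal curation filter might simply drop episodes with obvious sensor faults. The thresholds and field names below are illustrative assumptions, not a real dataset schema; serious curation pipelines (ARES included) go well beyond heuristics like these.

```python
# Episode-filtering sketch for dataset curation: drop rollouts with NaNs,
# frozen sensor streams, or implausibly short horizons.
import numpy as np

def is_clean(episode, min_steps: int = 20, max_frozen_frac: float = 0.5) -> bool:
    states = np.asarray(episode["states"], dtype=float)   # per-step state vectors
    if len(states) < min_steps:
        return False                      # truncated / aborted recording
    if np.isnan(states).any():
        return False                      # sensor dropout
    diffs = np.abs(np.diff(states, axis=0)).sum(axis=1)
    frozen_frac = float(np.mean(diffs < 1e-8))
    if frozen_frac > max_frozen_frac:
        return False                      # camera/encoder likely frozen
    return True

# curated = [ep for ep in raw_dataset if is_clean(ep)]
```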

The robots are coming

I don’t mean to paint a dire picture; instead, we should be excited about the potential of solving these problems. In fact, stronger ML models that can reason about safety in text space may beget safer robot policies in the real world. 

At the same time, we should be investing in real-world efforts to evaluate models, share results, interpret behavior, and find robust methods for certifying the safety of robot policies. The coming world will have robots in every part of our society, from manufacturing to defense to your own home. We should work to ensure that the models controlling these systems are safe, secure, and robust. 

At a16z, we’re excited to invest in the forefront of reliable robotics in order to bring about new products, services, and solutions for modern problems. If you’re a researcher, engineer, or founder interested in solving these problems, please reach out! 

Thanks to Peter Bowman-Davis, Jacob Zietek, Ben Bolte, Alex Robey, Ted Xiao, Michael Equi, Lachy Groom, and Philipp Wu for reviewing drafts of this post.
