ARES: An Open Source Platform for Robot Data

Jacob Phillips

ML practitioners have always had a reputation for quick, messy “research” code, in contrast to the clean, performant “production” code found in deployed systems. While firing off a quick script and dumping results to disk feels productive in the moment, the overall research community could benefit from more robust systems aimed at speeding up research iteration cycles. 

Nowhere is this more true than in robotics. Adding the extra dimensions of reality (like spatial reasoning, depth, occlusions, and friction) makes every step of the research process — ingestion pipelines, annotation, curation, model evaluation, etc. — more difficult. On top of that, the nature of academic paper publication rewards point solutions: one-off repositories of code aimed at achieving a single novel contribution instead of building long-term development platforms.

The coming wave of generalizable robot models shows a clear need for better benchmarks, safety tests, and model understanding. We need to move beyond recordings and simulated environments executing expert policies and towards real-world data understanding to enable deployed robot systems.

To counter these problems, we’ve developed ARES: Automatic Robot Evaluation System. ARES is an all-in-one, open-source robot data platform aimed at helping robotics researchers build better models faster. ARES has an explicit focus on being simple, scalable, and ML-driven. It uses modern tools, databases, and cloud APIs to deliver high-quality insights on a budget. While other ML tools exist to assist the robotics developer with logging and visualization, ARES occupies a distinct niche as a meta-tool for enabling policy improvements, with a particular focus on curation and annotation for robotics research.

To name just a few applications, researchers can use ARES to: 

  • Perform structured extraction of information from raw robot episodes. 
  • Explore the task space of their entire dataset. 
  • Calculate success rates on curated slices of policy rollouts. 
  • Create and store composable annotations. 
  • Perform nearest-neighbor lookups over text or trajectory space. 

We have released data and databases on the Hugging Face Hub, including 5,000 ingested and annotated episodes from the Open X-Embodiment dataset, as well as scripts to quickly and easily set up the platform. We recognize the irony of yet another framework, but hope to see the robotics community adopt modern tools in order to develop better models.

What’s the problem with robot data?

By nature of the real world, robot data is messy, complex, and long-form. It’s rife with errors, sensor malfunctions, and edge cases. At the same time, traditional robot models are trained on far less data than modern machine learning models. Even new vision-language-action models (VLAs), which are built on top of VLMs, are finetuned on relatively small amounts of data. This comes despite Moravec’s Paradox arguing that real-world embodied tasks may be significantly more difficult than text-only reasoning. 

While there are many sources for robot data — including human videos, industrial warehouses, and even the cars on the road — there are many open questions about robot data curation. Curation is the process of filtering and cleaning datasets to find the right mixture of data to train the best model. Robot data curation is difficult because we don’t have the right tools to do robot data understanding; for example, using models to determine if a certain task was successful in a given video. Providing the right tools and systems for robot data ingestion and curation could be a huge help. 

Likewise, expanding open-source robotics datasets to include more diversity and variability could yield huge advancements. The Open X-Embodiment project was a great step in the right direction, consisting of 22 robots spanning 21 institutions and over 150,000 tasks. However, most of the data is still trapped in training-style datasets that don’t lend themselves to curation or retrieval. We also know that environment and object diversity is extremely important, but how can we know that massive robotics datasets are diverse, clean, and helpful for robot learning?

Current curation approaches

Training data curation is an incredibly important part of modern machine learning best practices. Researchers spend enormous amounts of dollars and FLOPs finding the right slice of training data to most effectively train their models. For example, the LLaMA project from Meta dedicates an entire team and several pages of its report to data curation, covering: 

  • Line-, document-, and URL-level deduplication
  • Heuristic filtering with n-grams and Kullback-Leibler divergence
  • Model-based quality assessment and filtering

These curation approaches are based on numerous experiments at small scale to prove out the best data mixes before scaling up model sizes. Meanwhile, back in the world of robot learning, a recent Figure release tells us that “data quality and consistency matter much more than data quantity…a model trained with curated, high quality demonstrations achieves 40% better throughput, despite being trained with ⅓ less data.” However, there’s extremely little information about how leading robotics labs or even general researchers conduct data curation; most public releases seem to filter for embodiments and end-effectors that match their setup or just “roughly categorize [tasks] into ‘more diverse’ or ‘less diverse’” and separate data based on that. Our best researchers are doing coarse, manual curation; we need better tools to facilitate research on data selection. 

Existing robot data tools 

Right now, there are two primary — and extremely useful — open-core tools for robot data understanding: Foxglove and Rerun. Foxglove markets itself as “visualization and observability for robotics developers,” while Rerun aims to be “the multimodal data stack.” Managed infrastructure for data pipelines, streaming, data lookup, and scalable deployment of real-world robots is an excellent approach for shortening robotics iteration cycles. Foxglove, in particular, has excellent tools for managing robot data at the edge and has built out a ton of ingestion, debugging, and visualization tools. 

Another tool, Roboflow, offers solutions for annotation and model training, but focuses less on robotics and more on general computer vision applications. These are great tools for enterprises, but researchers want simple, scalable, modern solutions built for machine learning workflows. Beyond just curation, we want ways to interact with our datasets and determine how to generate new annotations and datasets to make policy improvements.

ARES overview

ARES is an open-source (Apache 2.0) platform that aims to improve robotics models by simplifying ingestion, annotation, curation, and data understanding. Researchers can use it to understand the distribution of ground-truth datasets, evaluate the performance of policy rollouts, and analyze batches of robot data. 

On the technology front, ARES utilizes modern, scalable tools like cloud APIs and databases to structure and store robot data. Users can set up their own ingestion systems, annotation models, and curation approaches, or just use the ARES quick-start data and databases hosted on Hugging Face, set up via the release scripts.
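For example, pulling down the hosted quick-start data can be as simple as a snapshot download. Here is a minimal sketch, assuming the huggingface_hub client; the repo id below is a placeholder, not the actual release name — check the ARES release scripts for the real one.

```python
# Minimal quick-start sketch: fetch the hosted data locally before running the
# release scripts. The repo id is a placeholder, not the actual dataset name.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="your-org/ares-oxe-episodes",  # placeholder repo id
    repo_type="dataset",
    local_dir="data/ares",
)
print(f"Downloaded quick-start data to {local_path}")
```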

ARES is composed of three main parts — ingestion and annotation; curation and analysis; and training and export — each of which we explain below. 

Ingestion and annotation

During ingestion, we turn raw robot data into structured representations helpful for robotics researchers, and store them in useful formats. Ingestion uses: 

  • VLMs to extract relevant information such as task performance, focus objects, and environment descriptors, and format it into structured representations
  • A StructuredDatabase (SQLite wrapper) to store episodes, metadata, extracted information, and any other hard-coded information provided by the user
  • Embedding models to compress task instructions, episode descriptions, and state and action trajectories into a custom EmbeddingDatabase (FAISS index manager)

For annotation, we use modern, scalable solutions like cloud-based APIs and compute orchestration via Modal to run these models, resulting in highly parallel, asynchronous processing. By default, we run grounding detections with GroundingDINO and segmentation with Segment-Anything to identify any objects that could be relevant to the scene. All annotations are stored in the AnnotationDatabase, a wrapper around MongoDB. 
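To make the ingestion step concrete, here is a minimal sketch of the structured-extraction idea, assuming an OpenAI-compatible VLM endpoint; the EpisodeRecord schema, prompt, and table layout are illustrative stand-ins, not ARES’s actual internals.

```python
# Sketch of structured extraction: ask a VLM for a JSON summary of an episode,
# validate it against a schema, and persist it. All names here are hypothetical.
import sqlite3

from openai import OpenAI
from pydantic import BaseModel


class EpisodeRecord(BaseModel):  # hypothetical structured representation
    task_instruction: str
    task_success_estimate: float  # 0.0-1.0, judged by the VLM
    focus_objects: list[str]
    background_estimate: str


def extract_episode_record(episode_summary: str) -> EpisodeRecord:
    """Call the VLM (a text-only stand-in here) and parse the JSON it returns."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": "Summarize the robot episode as JSON with keys: task_instruction, "
                "task_success_estimate, focus_objects, background_estimate.",
            },
            {"role": "user", "content": episode_summary},
        ],
    )
    return EpisodeRecord.model_validate_json(response.choices[0].message.content)


def store_episode(db_path: str, episode_id: str, record: EpisodeRecord) -> None:
    """Persist the structured record in SQLite, mirroring the StructuredDatabase idea."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS episodes (id TEXT PRIMARY KEY, record TEXT)")
    conn.execute(
        "INSERT OR REPLACE INTO episodes VALUES (?, ?)",
        (episode_id, record.model_dump_json()),
    )
    conn.commit()
    conn.close()
```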

Curation and analysis

Transforming raw data into structured representations is just the first step. We also want the ability to interactively explore our data, add more annotations, develop curated slices, and draw learnings from data distributions. To accomplish this, ARES provides a simple frontend developed in Streamlit for easy local development and deployment. Researchers can use the frontend to perform structured data filtering, such as selecting episodes of a certain length, filtering by robot embodiment, or highlighting specific background surfaces like wood or plastic. 

Users can curate slices of the dataset by selecting robot embodiments, action spaces, episode length, or other hard-coded or inferred fields.

Users can also perform unstructured data filtering by exploring the latent embeddings of task instructions or episode descriptions, using the interactive tool to investigate clusters, summarize selected points, or identify regions of interest. Once the researcher has selected interesting groups of episodes, they can dive into more in-depth representations, viewing all the collected information and visualizing relevant annotations like object detections. 

Researchers can explore the latent space represented by task instructions or episode descriptions. Use the helpful ‘summarize selection’ tool to learn more about a given cluster.
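As a rough illustration of what this unstructured exploration looks like under the hood, here is a small sketch that embeds task instructions and clusters them; the embedding model and cluster count are arbitrary choices for the example, not what ARES ships with.

```python
# Sketch of unstructured filtering: embed task instructions, cluster them, and
# inspect each cluster. Model choice and cluster count are arbitrary here.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

instructions = [
    "pick up the mustard bottle",
    "move the ketchup bottle to the left bin",
    "fold the t-shirt",
    "stack the red block on the blue block",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(instructions, normalize_embeddings=True)

kmeans = KMeans(n_clusters=2, n_init="auto", random_state=0).fit(embeddings)
for cluster_id in range(2):
    members = [text for text, label in zip(instructions, kmeans.labels_) if label == cluster_id]
    print(f"cluster {cluster_id}: {members}")
```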

ARES also provides the ability to retrieve similar examples over the text space (covering task instructions and descriptions) or trajectory space (robot actions and states). This aids in understanding model performance, as comparing policy rollouts across similar episodes helps explain where and why a model may struggle with a given task.
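A minimal sketch of the retrieval idea, using a flat FAISS index over episode embeddings; the dimensions, index type, and random data are illustrative, not the internals of the EmbeddingDatabase.

```python
# Sketch of nearest-neighbor retrieval: index episode embeddings (text or
# trajectory) and look up the closest episodes to a query episode.
import faiss
import numpy as np

dim = 128                                   # embedding dimension (arbitrary)
episode_embeddings = np.random.rand(1000, dim).astype("float32")  # stand-in data

index = faiss.IndexFlatL2(dim)              # exact L2 search, fine at this scale
index.add(episode_embeddings)

query = episode_embeddings[0:1]             # "find episodes similar to episode 0"
distances, neighbor_ids = index.search(query, 5)
print("nearest episode ids:", neighbor_ids[0].tolist())
```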

The ARES Hero Plot displays all information about an episode, from details like the data collection method to VLM predictions like “background estimate.”

The Hero Plot also contains traditional annotation information like grounding box detections and segmentation masks, as well as more modern VLM predictions like success criteria and other descriptors. It also surfaces similar examples across text and state-action spaces.

Retrieving rollouts that have different task instructions (such as “move the mustard bottle” and “move the ketchup bottle”) but extremely similar trajectories may reveal a lack of diversity in the training environment. Likewise, similar tasks with extremely different trajectories may reveal unusual training environments, or even errors in data collection. Examining joint actions and state trajectories can also reveal out-of-distribution paths through the action or state space, explaining why a robot may fail at a given task. 

The Robot Display is helpful for understanding the motion of a robot during an episode. We display all the joint states and actions so users can find in- and out-of-distribution actions.

Training and Export

Once a researcher has selected a curated slice of their dataset, we want to make useful representations for further knowledge-sharing or experimentation. Users can export the dashboard as a graphical representation of their findings, such as demonstrating low performance on a certain set of tasks or strong performance over a variety of background surfaces. Further, researchers can export training artifacts like pre-processed dataframes, enabling the training of new robot models. 
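As a sketch of what a training export might look like, the snippet below filters the structured table down to a curated slice and writes a training-ready dataframe; the column names and paths are hypothetical, not the actual ARES schema.

```python
# Sketch of exporting a curated slice: filter the structured table and write a
# training-ready dataframe. Column names and paths are hypothetical.
import sqlite3

import pandas as pd

conn = sqlite3.connect("ares.db")  # placeholder database path
episodes = pd.read_sql_query("SELECT * FROM episodes", conn)
conn.close()

# Curated slice: one embodiment, reasonably short episodes, VLM-judged successes.
curated = episodes[
    (episodes["robot_embodiment"] == "franka")
    & (episodes["episode_length"] < 200)
    & (episodes["task_success_estimate"] > 0.8)
]
curated.to_parquet("exports/franka_successes.parquet", index=False)
```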

Case studies

Here are two case studies that give real-world examples of the ARES platform. 

Embodied Chain of Thought

Zawalski et al.’s “Robotic Control via Embodied Chain-of-Thought Reasoning” (ECoT) is one of my favorite recent robotics papers, showing how to compose annotations from purpose-built models (like detectors, segmenters, grounding models, and LLMs) into generalized plans to post-train VLMs for greater spatial reasoning. This paper is a great work of machine learning literature: simple, synthetic annotations leading to large advances in robot foundation models. 

There’s lots to love in the released codebase, but it’s also a great example of a point solution built for a research paper, as opposed to a durable robot data platform. For example, we can replace the custom Gemini LLM object with generalized ML APIs and prompt templates; we can use Docker, Kubernetes, and Modal to scale annotation models; we can deploy modern databases like MongoDB in place of dumping to disk; and we can use asynchronous, parallelized processing to massively speed up any labeling runs. This is not meant as a critique of the ECoT team, as building out all this infrastructure for a single paper would be unjustified; instead, researchers should have tools like ARES to begin with. 

Using ARES, we reimplemented a labeling effort similar to ECoT, but much more efficiently. We generated ECoT-style datasets in a fraction of the time at extremely low cost; creating 2,500 ECoT annotations (with grounding, success criteria, detection, and segmentation annotations combined into plans with subgoals) took about 10 minutes and cost about five dollars. See scripts/annotating/run_pseduo_ecot.py for more details. 
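For a sense of how the parallelism works, here is a stripped-down sketch of the fan-out pattern, assuming an OpenAI-compatible API; the prompt and plan format are simplified stand-ins for what the actual script produces.

```python
# Sketch of the parallel labeling pattern: fan out one VLM call per episode with
# asyncio and gather the resulting plans. Prompt and plan format are simplified.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set


async def label_episode(task: str, detections: list[str]) -> str:
    """Compose prior annotations (here, detections) into a subgoal plan."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": f"Task: {task}\nDetected objects: {detections}\n"
                "Write a short numbered plan of subgoals for the robot.",
            }
        ],
    )
    return response.choices[0].message.content


async def main() -> None:
    episodes = [("move the mustard bottle", ["mustard bottle", "table"])] * 8
    plans = await asyncio.gather(*(label_episode(task, dets) for task, dets in episodes))
    print(f"labeled {len(plans)} episodes")


asyncio.run(main())
```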

Physical Intelligence demos

As noted above, one issue with developing open-source robotics platforms is that there is not a lot of high-quality open-source robotics data available. One particularly interesting direction for future research is reward modeling, or using ML models to predict the quality of a particular episode. For example, we want to be able to distinguish between successful and unsuccessful policies at a given task, like T-shirt folding. Unfortunately, other work in this direction tends to rely on closed-source robotics datasets and does not release success or failure annotations. 

Recently, Physical Intelligence released a blog post with a series of policy rollout videos, showing successes and failures across a series of tasks. To demonstrate the ARES platform, we downloaded and ingested these demonstrations, using them both to evaluate reward modeling and to assess the quality of the data in the batch. We provided the available hard-coded information like dataset_name and left the robot embodiment and other hidden fields as null. (See scripts/pi_demo_ingestion.py for more details.) This enabled us to evaluate our models on their ability to distinguish successful from unsuccessful rollouts, and the results of that evaluation informed the system specifications for ingesting the rest of the rollouts. (See scripts/eval.py and notebooks/eval_nb.ipynb for more details.) 
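As a rough sketch of what a partially specified episode looks like at ingestion time, the field names below mirror the description above but are illustrative, not the actual schema.

```python
# Sketch of ingesting a rollout with partial metadata: fill in the hard-coded
# fields we know and leave hidden fields as None for the pipeline to infer.
pi_demo_episode = {
    "dataset_name": "pi_demos",
    "video_path": "data/pi_demos/tshirt_folding_success_01.mp4",  # placeholder path
    "task": "fold the t-shirt",
    "success_label": True,        # from the released success/failure video split
    "robot_embodiment": None,     # unknown, left null
    "action_space": None,         # unknown, left null
}

# Later, the VLM's success prediction can be scored against the known label.
predicted_success = True  # stand-in for the reward-modeling output
print("correct" if predicted_success == pi_demo_episode["success_label"] else "incorrect")
```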

Additionally, ingesting the Pi Demos dataset lets us run the typical structured extraction pipeline, providing data distributions over surfaces, focus objects, distractor objects, lighting, background, and other descriptors. 

A simple display of the ARES frontend after ingesting the Pi Demos dataset.

Conclusion

ARES is built to address a gap for robot researchers: how to build long-term, durable infrastructure to improve robotics research. We want to provide a simple, scalable solution to ingest and analyze robot data and improve model performance across a variety of embodiments and use cases. Current generation models are extremely helpful in pseudo-labeling robotics data, but we also acknowledge that model-based systems can introduce errors. 

We hope that future VLMs and annotating models will provide better, faster, and cheaper solutions to address these problems — and that robotics researchers and developers will adopt modern data infrastructure practices like cloud-scaling APIs and databases in order to develop stronger models. If you’re interested in collaborating or contributing to ARES, or just want to talk about ML infrastructure for robotics, please reach out to the a16z team! 

Thanks to Peter Bowman-Davis, Jacob Zietek, Ben Bolte, Alex Robey, Ted Xiao, Michael Equi, Lachy Groom, and Philipp Wu for reviewing drafts of this post.
