World Models and the Sparks of Little Robotics

Peter Bowman-Davis

In the last few months, you may have seen chatter about Minecraft on a Neural Network, Genie-2, Simulated DOOM, or CS:GO on a Diffusion Model. These recent demonstrations of world models — neural networks that compress the dynamics of visual, action-based, or other types of data — have taken X by storm, but they have also been the source of significant confusion.

In reality, though, these posts are far more than half-baked reconstructions of your favorite video games — we think they are a key part of the future of robotics and autonomy. The data moats that major players in the space have established will not evaporate overnight, but world models represent a credible promise for leveling the playing field, enabling “little robotics” to innovate and compete more effectively. 

A disordered world

Historically, the field of robotics has been plagued by the difficulty of getting robots to interact with disordered environments and their intractable dynamics — for example, training a robot to walk on wet leaves. As my colleague Oliver has previously written about the complexities of general-purpose robotics: demos are easy, but bringing robots to market has proven difficult.

One problem is training robots on data that mirrors the real world. Conventionally, a simulator — not unlike the game engines that drive your favorite video games — is used to train the robotic decision-making model before it is deployed. However, constructing these simulations using tools such as ROS, Unity, Unreal Engine, or Grand Theft Auto 5 is time-consuming and fails to generate realistic data — shortcomings that compound as robotic fleet sizes scale up.

Further complicating things for roboticists is that most modern approaches to autonomy are multimodal: cameras, radar, lidar, GNSS, IMUs, and other sensors act in unison. However, classical mathematical models of robotic motion typically capture only the dynamics of the robot itself, not how a camera output or lidar reading would change as a result of a given action. As a result, many “simple motion” scenarios — like driving a wheeled robot a desired distance — have long been achievable, but tasks that are more difficult to describe mathematically — such as folding a shirt with robotic arms — have only become possible in the last few years.

Recent advances in machine learning have begun to address these longstanding challenges by leveraging neural networks to better understand and predict complex, multimodal interactions within disordered environments. By focusing on methods that simulate not only robotic dynamics but also the sensory outcomes of those dynamics, researchers have opened the door to more robust and scalable solutions. These approaches aim to overcome the limitations of traditional simulators by producing data that closely mirrors real-world scenarios, offering a more efficient and comprehensive pathway for training and evaluating robotic systems. 

The rise of neural simulators

In its current state, neural simulation typically involves using an image or generative video model to create realistic visual data from a robot’s perspective — conditioned on previous camera footage, actions taken by the controller, sensor readings, and so on. The models at the heart of neural simulation are extremely similar to popular models such as Flux, Stable Diffusion, Midjourney, or DALL-E. The major difference is that instead of being prompted with a text description of a desired image, the model is prompted with prior video and other associated data.

Because the previous frames and actions contain information about the scene, the network is able to approximately predict the next frame in the video sequence. By feeding each generated frame back in as context for the one that follows, the model “rolls out” a video that attempts to predict the visual future of the scene. The result is a model that effectively serves as a replacement for traditional simulations, offering greater flexibility and performance at scale.
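To make the mechanics concrete, here is a minimal sketch of such an action-conditioned rollout in PyTorch. The `TinyWorldModel`, the zero-action placeholder policy, and all shapes are illustrative stand-ins rather than any production architecture; real systems use far larger video diffusion or transformer backbones.

```python
import torch

class TinyWorldModel(torch.nn.Module):
    """Stand-in next-frame predictor (illustrative only): conditions on the
    most recent frame and action and outputs a predicted next frame."""
    def __init__(self, channels=3, action_dim=2):
        super().__init__()
        self.action_proj = torch.nn.Linear(action_dim, channels)
        self.net = torch.nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1)

    def forward(self, frame, action):
        # frame: (B, C, H, W), action: (B, A) -> next frame: (B, C, H, W)
        a = self.action_proj(action)[:, :, None, None].expand_as(frame)
        return self.net(torch.cat([frame, a], dim=1))

def rollout(model, frame, policy, horizon=16):
    """Autoregressive rollout: each predicted frame becomes the context
    used to predict the frame after it."""
    frames = [frame]
    for _ in range(horizon):
        action = policy(frames[-1])            # the controller reacts to the imagined frame
        frames.append(model(frames[-1], action))
    return torch.stack(frames, dim=1)          # (B, horizon + 1, C, H, W)

# Roll out 16 imagined frames from a single 64x64 RGB observation.
model = TinyWorldModel()
policy = lambda obs: torch.zeros(obs.shape[0], 2)   # placeholder "do nothing" policy
video = rollout(model, torch.zeros(1, 3, 64, 64), policy)
```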

At a high level, neural simulation addresses three core problems for robotics companies (presented here in increasing order of complexity and present usage): error replay, synthetic data generation, and model-based reinforcement learning.

Neural simulators for error replay

Say you are the developer of an autonomous vehicle and you just witnessed the car blow through a red light, despite the model working at 99% of stoplights. How can you address this anomaly in the model? One way is through scene recreation: carefully reconstructing the conditions that led to the failure and then retraining the model until that scene (and others with comparable setups) no longer triggers a failure (or “disengagement,” in AV parlance). Scene recreation involves either setting up physical props or manually modeling the scene in a game engine to test the new model.

However, as the number of autonomous vehicles operated by a company like Waymo grows, the total number of errors made by the fleet per day also increases. Scaling the cost of reconstruction linearly with the number of vehicles in the fleet isn’t cost-effective and would require massive headcount. Additionally, many scenes simply cannot be physically reconstructed, or are impractical to model — lighting differences, exact scene details (such as the number and position of other cars on the road), and environmental dynamics such as driver behavior are difficult to reconstruct manually.

Neural simulations allow for more flexible approaches to error resolution through automatic scene reconstruction. One could feed in the last few seconds of video before a model failure occurred and then effectively test what would have happened if the model had taken a different action — essentially the same as a scene reconstruction, at a tiny fraction of the effort and cost. This would allow smaller companies, which lack the resources for traditional reconstruction techniques, to achieve more reliable, safer outcomes.
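As a rough sketch of what such a counterfactual replay might look like, the snippet below seeds a neural simulator with the logged frames preceding a disengagement and lets a candidate policy act inside the imagined rollout. The `world_model(frame, action)` and `failure_check(frames)` interfaces are assumptions for illustration, not any particular company’s API.

```python
import torch

def replay_disengagement(world_model, candidate_policy, logged_frames,
                         failure_check, horizon=32):
    """Counterfactual error replay: re-simulate the moments before a failure
    with a different policy and check whether the failure still occurs."""
    frame = logged_frames[:, -1]                     # (B, C, H, W): last real observation
    imagined = [frame]
    for _ in range(horizon):
        action = candidate_policy(imagined[-1])      # the new policy drives the rollout
        imagined.append(world_model(imagined[-1], action))
    video = torch.stack(imagined, dim=1)             # imagined continuation of the scene
    return not failure_check(video)                  # True if the failure is avoided
```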

Neural simulators as data engines

Synthetic data generation is the use of neural simulators as data engines, expanding a dataset at greater scale and diversity than traditional augmentation techniques allow. This is particularly powerful when a parameterization of the environment can be learned in addition to the unconditioned video dynamics (e.g., the model is given an input describing whether it is raining). By computing statistics over the data distribution, blind spots can be identified and then filled with synthetic data. In practice, this may be easier said than done for data attributes that are less visually obvious, such as — in the case of self-driving cars — the varying behavior patterns of other drivers.

Synthetic data generation is particularly useful for scene phenomena that have a fairly uniform visual effect across environments, but that may be less common in some situations than in others. In an autonomous vehicle dataset, for example, it would be rare to capture footage of rain in Las Vegas, but it wouldn’t be particularly difficult to generate a realistic video of rain in Las Vegas, provided the network was trained on many hours of driving in Seattle or London.
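Sketched below is what such a data engine could look like when the world model exposes an environment parameter. The `weather` keyword is a hypothetical conditioning input, and the interface mirrors the rollout sketch above rather than any real system.

```python
import torch

def generate_rare_condition_clips(world_model, seed_frames, policy,
                                  weather="rain", horizon=64):
    """Data engine sketch: re-render common driving contexts under an
    under-represented condition by conditioning each generation step on
    the desired environment flag."""
    clips = []
    for seed in seed_frames:                         # iterable of (C, H, W) seed observations
        frames = [seed.unsqueeze(0)]                 # add a batch dimension
        for _ in range(horizon):
            action = policy(frames[-1])
            frames.append(world_model(frames[-1], action, weather=weather))
        clips.append(torch.stack(frames, dim=1))
    return torch.cat(clips, dim=0)                   # synthetic videos for retraining
```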

Put simply: Edge cases cause engineers no shortage of headaches. And although some circumstances — such as a presidential motorcade — will simply never be in the training set for robotics companies, neural simulators as data engines hold the promise to augment existing datasets with uncommon events for more reliable performance in the field. 

World models for reinforcement learning

A nascent use case for neural simulators is placing learned world models at the heart of model-based reinforcement learning (RL) policies. The canonical problem statement of RL — learning to play games whose winning condition is rarely reached by chance, such as finding diamonds in Minecraft or beating a world champion in Go — maps well to autonomy challenges in the real world, and well-trained RL policies are one way to have a robot choose its actions autonomously.

But RL is hard. Training these models is arcane, and even the most advanced conventional algorithms, known as model-free techniques, struggle to converge on long-horizon tasks with sparse rewards — in Minecraft, for example, the policy may end up roaming around taking random actions, nowhere near on track to winning the game. Model-based RL is an emerging approach that alleviates some of these issues.

In model-based RL, a compressed representation of the world is learned, and the policy is then trained inside that model. This approach can be significantly faster and more likely to converge to a solution than model-free techniques such as deep Q-learning or PPO. Put simply, it integrates the world model directly into the agent and its training loop, rather than merely leveraging it as an external simulation tool. First demonstrated in World Models, and elaborated in Dreamer, DreamerV2, DreamerV3, and DayDreamer for robotics, this learned model-based architecture shows immense promise for long-horizon autonomy and sparse-reward tasks. Additionally, much of the architecture developed to train video models for replay and data engines can be adapted to serve as the model-learning component within Dreamer.
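A heavily simplified sketch of this “training in imagination” loop is shown below. The latent `world_model.step(state, action)` interface and reward output are illustrative assumptions in the spirit of Dreamer, not its actual implementation.

```python
import torch

def imagination_update(world_model, actor, optimizer, start_states,
                       horizon=15, gamma=0.99):
    """Model-based RL sketch: roll the policy forward inside the learned
    world model (no real environment steps) and ascend the discounted
    sum of predicted rewards."""
    state, imagined_return = start_states, 0.0
    for t in range(horizon):
        action = actor(state)                        # the policy acts on latent states
        state, reward = world_model.step(state, action)
        imagined_return = imagined_return + (gamma ** t) * reward
    loss = -imagined_return.mean()                   # maximize the imagined return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```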

The design space of world models

As world models for neural simulation become more widespread, many questions remain about their architectural and behavioral characteristics. Both spatially and temporally, it is unclear whether the future is autoregressive or diffusive. In the temporal domain, many generative video models have traditionally been diffusive, generating an entire, fixed-length video at once through progressive denoising over both space and time. This leads to more visually coherent videos, as the model can perform implicit planning through the generative process, ensuring that, for example, an object doesn’t magically disappear halfway through the video. 

However, diffusion-like processes over the temporal domain do not come without tradeoffs: videos currently must have a pre-specified length, and because the entire video must be held in memory at once, generation tends to require radically more computational resources than the alternatives. One such alternative is generating the video frame by frame — autoregressively in the temporal domain — as used by autonomous vehicle world models such as Wayve’s GAIA-1 or Comma’s CommaVQ. This is significantly less computationally costly and allows for more flexible generation lengths, at the expense of being less visually coherent.

Spatially, models can also be either diffusive or autoregressive, generating a given frame all at once or token by token with the help of a discrete autoencoder such as a VQ-VAE. The tradeoffs are similar to those in the temporal domain in terms of cohesion — diffusion tends to produce higher-quality images at scale (though this is still a highly contentious topic). The tradeoff in the spatial dimension is less pronounced, however, given that a video’s resolution is unlikely to change frame to frame, so the flexibility of autoregressive sampling matters less.
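To illustrate the token-by-token spatial route, here is a toy quantization step in the style of a VQ-VAE: encoder features for one frame are snapped to their nearest codebook entries, producing the grid of discrete tokens an autoregressive world model would then predict. The codebook size and grid shape are arbitrary placeholders, not any specific model’s configuration.

```python
import torch

def quantize_frame(features, codebook):
    """Map encoder features (B, H, W, D) to nearest-codebook token ids (B, H, W),
    the discrete representation an autoregressive model consumes token by token."""
    flat = features.reshape(-1, features.shape[-1])  # (B*H*W, D)
    dists = torch.cdist(flat, codebook)              # distance to every codebook entry
    tokens = dists.argmin(dim=-1)                    # index of the nearest code
    return tokens.reshape(features.shape[:-1])

codebook = torch.randn(1024, 64)              # 1,024 codes of dimension 64
features = torch.randn(1, 16, 16, 64)         # one frame encoded to a 16x16 feature grid
tokens = quantize_frame(features, codebook)   # (1, 16, 16) integer token ids
```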

Computationally, current generative video models used for neural simulation tend to be significantly lighter weight than large language models. Leading world models such as DIAMOND, GAIA-1, and Oasis are trained on comparatively little data and compute relative to the largest language models — and there is even evidence these models could be made lighter still. Transfer learning, which involves fine-tuning an off-the-shelf image or video model, also provides a compute-efficient approach to training world models and becomes increasingly effective as open-source models continue to improve.

What the future holds

Many aspects of world models for neural simulation are likely to experience rapid growth in the coming months and years. Some of the following questions will define the future of how neural models are applied in robotic systems:

  • Physical coherence: How can object permanence be enforced? What tools can we use to evaluate neural simulator performance? How can physical laws be implicitly or explicitly encoded into the model’s loss function? 
  • Context length / memory / forgetting: What is the best way to add memory to models? Recurrent Neural Networks? Something akin to retrieval-augmented generation?
  • Compute cost: What are the scaling laws of generative video models and their architectures (tokenized/transformer-based, DiTs, U-Nets)? What are more efficient ways to pretrain, post-train, distill, and transfer-learn world models? 
  • Data efficiency: How much data is required to train a physically coherent model, and one that generalizes well to out-of-distribution data (a complex topic in and of itself)? 
  • Overfitting: Is it even possible to train models that perform well far out of the training distribution? How well does the idea of zero-shot learning apply to video models? 
  • Controllability/parameterization: How can the environment be better conditioned in generations?
  • Multimodality: Video generation is currently possible, but how can we jointly predict other relevant variables (e.g., lidar maps in addition to video, etc.)? How will each be sequenced or combined in the generative process for specific embodiments? Is the future of vision modeling flat, or three-dimensional?

Neural simulators and little tech

The effect of better error replay, data engines, and long-term planning models on the economics of a new robotics company cannot be overstated: what were previously major blockers to new entrants are now being eroded by scalable neural simulation. As these challenges are explored and larger video foundation models are released, previously intractable data, reliability, and autonomy problems will become mere engineering challenges for smaller players, rather than economic impossibilities.

If you are an engineer or researcher working on world models, please reach out.
