AI Learned to Talk. Now it’s Learning to Build Reality

For decades, science fiction has imagined machines that could conjure entire realities on demand. From Star Trek’s Holodeck to Ready Player One’s Oasis, we’ve long dreamed of stepping into a persistent virtual universe—a world where we can explore, create, and interact with limitless freedom, where anything we can imagine is possible. These virtual worlds weren’t just static environments; they were alive—dynamic, responsive, and ever-evolving.

Today, we’re inching toward a real-world equivalent of the Holodeck: World Models. A world model is an AI model that generates virtual environments with an embedded understanding of the physical world. Just as ChatGPT generates text and Midjourney creates images, world models can generate entire spaces. But they go further: they don’t just create visuals, they simulate how objects move, how environments change, and how physical forces interact—allowing users to engage and play within these worlds much as they would in a video game.

The clip below is a generated video, not a game built on a game engine.

World models already promise real-world applications for anyone who works with space—whether physical or virtual. Anyone whose job involves a game engine today stands to benefit from world models:

  • Robotics engineers training AI in simulated environments
  • Film studios creating virtual production sets
  • Game developers building interactive worlds
  • XR creators crafting immersive experiences
  • Architects designing buildings and spaces
  • Urban planners simulating cityscapes and infrastructure
  • Interior designers visualizing spatial layouts

And this is just the tip of the iceberg: the near-term use cases we can envision today. The most exciting part is that world models could eventually open the door to entirely new experiences that we haven’t imagined yet.

Two Paths to Infinite Worlds: 3D vs. Interactive Video

While all world models share the goal of generating and simulating environments, there’s a fundamental distinction between native 3D and video-based approaches.

  • Native 3D world models are built around an innate 3D representation generated from text or image prompts. That representation yields structured, explorable environments with depth, persistence, and interactivity, making these models especially useful for applications that require spatial understanding—from game development to industrial design.
  • Video-based world models generate dynamic, unfolding sequences, using past frames and user inputs to predict future frames, creating a visually rich experience. These models are powerful for narrative, cinematic, and time-based applications, but struggle with long-term interactivity and persistence.

Each of these categories presents unique opportunities and challenges, which we’ll explore in depth.

Video Approaches

Video World models work by generating sequences of frames in response to user input. The model tracks what has happened in previous frames, processes user actions (like pressing W to move forward), and predicts what the next frame should look like. Think of it as an AI continuously painting the next moment of a scene based on what came before and what the user wants to do next.
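The loop described above can be sketched in a few lines. This is a hypothetical, minimal illustration: `predict_next_frame` stands in for a large learned neural network, and the frame records here are simple placeholders rather than pixels.

```python
from collections import deque

CONTEXT_LEN = 8  # how many past frames the model conditions on


def predict_next_frame(context, action):
    """Placeholder for the learned model: returns a 'frame' record
    derived from the previous frames and the user's action."""
    last = context[-1] if context else {"t": -1}
    return {"t": last["t"] + 1, "action": action}


def play(actions):
    context = deque(maxlen=CONTEXT_LEN)  # rolling window of past frames
    frames = []
    for action in actions:               # e.g. "W" = move forward
        frame = predict_next_frame(list(context), action)
        context.append(frame)            # new frame becomes conditioning input
        frames.append(frame)
    return frames


frames = play(["W", "W", "A"])
```

The key point is the feedback loop: each generated frame is appended to a bounded context window and fed back in, which is also why consistency degrades once events fall outside that window.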


In direct contrast to 3D World models, Video World models don’t rely on an explicit 3D representation to build an understanding of the world. Physical properties are learned through a data-driven approach, where the model has “seen” enough video clips to learn properties such as 3D consistency and physically accurate motions. 

The field of interactive Video World models sprang onto the scene when Google demonstrated GameNGen, a model trained on DOOM gameplay that allowed users to play through an AI-generated version of the classic game. Decart Labs followed with Oasis (which you can play here!), bringing similar capabilities to Minecraft—complete with tree-chopping, zombie-fighting, and construction—all available as a real-time web demo that saw one million users in its first three days. Earlier this year, Google pushed the boundaries further with Genie 2, a foundation model that generates playable environments from a single reference image. Users can explore these AI-generated worlds using standard keyboard and mouse controls, opening new possibilities for both human gameplay and AI agent training. Adding to this momentum, Odyssey has developed a new, yet-to-be-released real-time video world model, inspired by their earlier 3D world model Explorer, which generated production-quality 3D scenes from real-world pixels.



Video-based World models benefit from a crucial advantage: the abundance of high-quality video data for training. Unlike 3D models that struggle with limited datasets, video models can learn from the vast ocean of video content available online. That said, the copyright status of web-scale video data for AI training remains a murky gray area that industry and regulators are still working to clarify. Recent licensing deals, like the partnership between Lionsgate and Runway, may point toward a sustainable path forward for how creators and AI companies can collaborate.

Video is currently humanity’s primary way of consuming visual media—even 3D environments are ultimately converted to video sequences for viewing. However, video models face a fundamental trade-off: without an explicit 3D representation of the world, they must rely solely on their learned understanding of the world to maintain consistency across frames. 

While 3D methods guarantee spatial coherence through explicit geometry like Gaussian splats, video models struggle to keep track of all the elements they’ve generated—like GPT-2, world models forget quickly. In Oasis, for example, if you turn 180 degrees and then turn back two seconds later, the landscape will have changed before your eyes. Video-based world models will only become practical for production-quality video games or 3D worlds once they can maintain consistency across many interactions and hours of play. Imagine a game that loses track of where you are on the map, or forgets the hard-earned diamond pickaxe you forged several hours ago.

Current frontier approaches are considering how to implement RAG to keep track of past frames and RNNs to keep track of a game’s state—if a character picks up a diamond pickaxe in one scene, they should still have it 20 minutes later. Solving these memory challenges will be essential for interactive video models to unlock transformative storytelling methods that blend elements of games and film, and can tell stories on greater scales than ever before.
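One way to picture the RNN-style state-tracking idea is to pair the frame generator with an explicit, external game state that persists no matter how many frames have elapsed. The sketch below is purely illustrative—`WorldState` and `generate_frame` are hypothetical names, and a real system would learn this memory rather than hand-code it.

```python
class WorldState:
    """Explicit memory the frame generator is conditioned on,
    so facts survive far beyond the model's frame window."""

    def __init__(self):
        self.inventory = set()
        self.position = (0, 0)

    def apply(self, event):
        kind, payload = event
        if kind == "pickup":
            self.inventory.add(payload)
        elif kind == "move":
            dx, dy = payload
            x, y = self.position
            self.position = (x + dx, y + dy)


def generate_frame(state, action):
    # A real model would render pixels conditioned on `state`;
    # here we just echo the state so persistence is visible.
    return {"inventory": sorted(state.inventory), "position": state.position}


state = WorldState()
state.apply(("pickup", "diamond_pickaxe"))
for _ in range(1000):            # thousands of frames later...
    state.apply(("move", (1, 0)))
frame = generate_frame(state, "look")
```

Because the pickaxe lives in the external state rather than in a sliding window of frames, it is still present when the thousandth frame is generated—exactly the property the frame-only approach lacks.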

3D Approaches

Unlike video-based approaches, 3D world models generate explorable 3D environments that can be viewed from any angle, modified in real time, and integrated into existing workflows. From architectural visualization to game development—where 3D assets like meshes, point clouds, and character rigs are already standard tools—the applications are immediate and compelling.

Consider game development: even simple 3D assets can require days of painstaking work from artists, with costs running into thousands of dollars. World models promise to transform this process, allowing developers to generate entire 3D scenes with a single prompt. This capability doesn’t just accelerate prototyping—it opens the possibility of AI-generated scenes that meet production standards, not just accelerating the game development process, but moving us closer to the NeverEnding Game.

3D world models have made significant strides. World Labs’ groundbreaking technology transforms single images into explorable 3D environments in real time (which you can play around with here!), pointing to a future where creating virtual worlds becomes radically simpler.

However, current 3D world models face significant technical hurdles. The complexity of generating coherent 3D geometry—while maintaining physical accuracy and visual fidelity—remains an open research problem. At the heart of these limitations lies a critical constraint: the scarcity of high-quality 3D data at scale. Unlike image and video generation models that can train on billions of photographs and clips—web-scale data—3D models lack access to sufficient datasets of 3D scenes and objects, and the datasets we do have are often limited in size and quality. The 3D data that does exist is often not labeled to the same degree as images, making it harder to ingest into a model pipeline. One frontier approach to the 3D data problem actually comes from the world of video models: if we can generate sufficiently high-quality videos of any object on demand, we can potentially reconstruct that object in high-quality 3D and use it to train 3D models.
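That video-to-3D bootstrapping idea can be pictured as a three-stage pipeline: generate multi-view footage, reconstruct, collect. Every function below is a placeholder for a heavyweight model (text-to-video generation, then multi-view reconstruction such as photogrammetry or Gaussian splatting); the names and structure are assumptions for illustration only.

```python
def generate_video(prompt, n_views=12):
    # Stand-in for a text-to-video model rendering the object
    # from n_views different viewpoints.
    return [f"{prompt}_view_{i}" for i in range(n_views)]


def reconstruct_3d(frames):
    # Stand-in for multi-view 3D reconstruction; returns a mock asset record.
    return {"asset": frames[0].rsplit("_view_", 1)[0], "n_frames": len(frames)}


def build_dataset(prompts):
    # Each prompt yields one synthetic 3D training example.
    return [reconstruct_3d(generate_video(p)) for p in prompts]


dataset = build_dataset(["oak_chair", "brass_lamp"])
```

The appeal of this design is that the scarce resource (3D scans) is manufactured from the abundant one (video), at the cost of inheriting whatever geometric inconsistencies the video model produces.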

Beyond data limitations, 3D world models face significant interface and control challenges. Professionals consistently express the need for dynamic manipulation of environmental factors like lighting, time of day, and weather conditions, alongside precise camera positioning and perspective control. Perhaps most critical is the ability to independently modify individual objects post-generation—adjusting a specific building’s material, modifying a landscape element, or replacing furniture without regenerating the entire scene.

Interfaces and Ecosystem

Interface design for world models also represents a significant challenge. Current professional creative tools like Unity and Blender offer extreme precision but demand significant expertise, while text-to-world approaches provide accessibility through one-shot generations but offer very little control. The industry needs a new AI-native user interface built from the ground up—designed specifically for generative world models.


These new interfaces must be fundamentally adaptive, offering contextually aware controls that transform to match each user’s workflows. For an architect, this might mean precise spatial constraint management and regulatory compliance tools. A filmmaker would access dynamic lighting and environmental controls, while a game developer would have interfaces optimized for state tracking and interaction consistency. Industrial designers would need granular object manipulation capabilities that let them modify specific elements with surgical precision.

The core challenge is creating interfaces that can bridge high-level creative intent with meticulous execution, supporting multi-modal inputs like text, sketches, and existing 3D assets while integrating seamlessly with established professional workflows. The workflow of tomorrow might end up being a mix of the precision of game engines and the accessibility of vibe coding in Blender. However they end up looking, these technologies promise to close the gap between what exists in the mind’s eye and what can be realized in reality, empowering all creators to produce their artistic vision.

World Models: From Today’s Applications to Tomorrow’s Realities

The First Wave: AI-Generated Worlds in Action

Through conversations with prospective users of world models, we’ve learned that by and large people are excited to use them in two distinct ways: to learn about the world and to express their creative visions. This mirrors current behavior around other successful AI models:

  • Knowledge out, like ChatGPT – Just as we ask ChatGPT questions to gain knowledge, we can use world models to generate and explore virtual spaces to understand the real world. An architect might generate a building to test what the sunset behind the Golden Gate Bridge would look like in the 42nd floor boardroom they’re designing, or an industrial designer might generate doctors’ offices to understand the flow of patient movement from the operating room to the recovery room. A robotics engineer might first train a robot guide dog within a virtual facsimile of a building, then a city, before moving onto real world testing.
  • Creativity in, like media models – Just as we use Midjourney or Sora to express creative vision, world models allow us to create interactive spaces that bring our ideas to life. A game designer might describe a level and generate a playable space to prototype in, or a filmmaker might mock up scenes virtually to get a sense of the mood and feel, before building the physical set and directing a shot. A traveler planning a vacation might generate a landmark in XR to see what it looks like before adding a long trip to the itinerary.

Some professions will rely more on knowledge out, like urban planners modeling infrastructure. Others will rely on creativity in, like filmmakers or game developers. Most consumers will fall somewhere in between, using world models to both understand and express, depending on their use case. But the most interesting use-cases are the brand new ones that will emerge as the technology matures. We expect that world models will democratize 3D creation and drive initial adoption among consumers creating novel experiences, and as these experiences grow in number and capability, they’ll be embraced by professionals—architects, designers, game developers, filmmakers, and urban planners—who value the precision, consistent spatial relationships, and element-level control that the 3D approach provides.

New Experiences: Stepping Into Digital Reality

Across 3D and video, the research behind world models will need to keep improving what makes them so useful for knowledge-out and creativity-in tasks. Future “knowledge out” users of world models want to be able to identify objects within scenes and reason across them, and to run simulations with accurate physics, so as to close the gap between simulation and real-world events. Future creative users of world models want to be able to control their creations in ways native to their craft, such as precise camera, lighting, and environment control; the ability to generate worlds based on text, image, and 3D inputs they want to appear in scenes; and fine-grained manipulation of elements within generated experiences.

If world models reach their full potential, the boundary between the digital and physical world will dissolve in a tangible, experiential way. Instead of watching a film, you’ll step inside it. Instead of playing a game, you’ll explore a world that unfolds around you, one that reacts to your choices, remembers your past actions, and changes in real time. Stories won’t just be told—they’ll be lived. Much like how immersive theater productions like “Sleep No More” transformed stage plays into environments where audiences explore freely and shape their own narratives, world models will evolve storytelling from linear, prescribed narratives to rich, responsive worlds where each journey is unique. Just as immersive theater broke the fourth wall, world models will dissolve the screen between user and experience.

Work will no longer be about modeling things on screens but moving through ideas as if they already exist. Architects won’t just sketch buildings; they’ll walk through them before they’re built, adjusting designs in real time. Urban planners won’t theorize about how a city might function—they’ll simulate it at full scale, tweaking roads and infrastructure before breaking ground. Defense decision-makers won’t be guessing at outcomes—they’ll run reality-scale simulations of supply chains and policies, watching history unfold before making their choices. Engineers will test the limits of their creations in dynamic environments where materials, forces, and interactions behave as they do in the real world.

Technology itself will evolve from something we use through screens to something we naturally inhabit. Instead of typing prompts into an AI, we will walk through ideas, pulling them apart, manipulating them, and watching them take shape around us. The boundaries between AR, VR, and physical reality will blur as computing becomes less about interfaces and more about spaces we can explore.

Worlds-as-a-Service: Economies of Imagination

The business models for monetizing the dynamic experiences of world models remain undefined. World models promise to transform digital economies by shifting from ownership to experience. Imagine a generative digital theme park—like a real-world Oasis—that charges by the hour of exploration, with each moment dynamically crafted to individual desires. Projects like Disney’s immersive Star Wars hotel—Star Wars: Galactic Starcruiser—might become economically viable fixtures, giving any fan experiences in the worlds they want to inhabit.

Users might subscribe to platforms that create unique worlds tailored to personal preferences. A filmmaker could generate a custom narrative world, an urban planner could simulate infrastructure scenarios, or a traveler could explore a procedurally generated travel destination—all within moments.

The most profound economic challenge emerges with abundance: in a landscape of infinite worlds, discovery becomes paramount. How do users find meaningful experiences when every interaction could generate an entirely new universe? The most successful platforms will develop sophisticated recommendation systems that transcend traditional preferences, understanding the nuanced emotional and creative desires that drive human exploration.

These platforms won’t merely sell worlds—they’ll curate wonder, creating ecosystems where creators can design unique world generation recipes, users subscribe to generative experiences, and recommendation algorithms become as crucial as the worlds themselves.

Building the Future

While the technology of world models excites us, equally promising is the ecosystem emerging to make these capabilities accessible and useful. Just as game engines democratized 3D creation, a new generation of companies will build the tools, interfaces, and applications that will put world models in the hands of creators. From specialized design software for industrial designers to intuitive controls for filmmakers, these tools will bridge the gap between raw technical capability and practical creative use.

The first steps toward this future won’t come from models alone, but from the combination of ambitious models, thoughtful interfaces, and creative applications built on top of them. As these efforts take off, adoption will scale in lockstep. The most exciting developments may not be the models and products themselves, but the unexpected ways humans find to use them—just as the internet became transformative not through protocols and servers, but through the applications and experiences built on top of them. 

If you’re building in this space, reach out. We’d love to connect.

Special thanks to Po Ryan for research & writing support.