The Next Generation Pixar: How AI will Merge Film & Games

Jonathan Lai

Stories are at the core of the human experience – we make sense of the world, find meaning, and connect with others through stories. Over the last century, many of our most beloved stories were enabled by technology shifts. In the 1930s, Disney invented the multiplane camera and was the first to create sound-synchronized, full-color cartoons – eventually leading to the groundbreaking animated film Snow White and the Seven Dwarfs.

Marvel and DC Comics rose to prominence in the 1940s, dubbed the “golden age of comics,” enabled by the mass availability of the 4-color rotary letterpress and offset lithography for printing comics at scale. The technology’s limitations – low resolution, limited tonal range, dot-based printing on cheap newsprint – created the iconic “pulp” look we still recognize today.

Similarly, Pixar was uniquely positioned in the 1980s to leverage a new technology platform – computers and 3D graphics. Cofounder Edwin Catmull was an early researcher at NYIT’s Computer Graphics Lab and at Lucasfilm, where he pioneered foundational CGI concepts before going on to produce the first fully computer-generated feature film, Toy Story. Pixar’s storied rendering suite, RenderMan, has since been used in over 500 films to date.

In each of these technology waves, the early prototypes that started off as novelties became new formats for deep storytelling, led by a fresh generation of creators. Today, we believe we’re right around the corner from a next-generation Pixar. Generative AI is enabling a foundational shift in creative storytelling, empowering a new class of human creators to tell stories in ways that weren’t feasible before.

Specifically, we believe the Pixar of the next century won’t emerge through traditional film or animation, but rather through interactive video. This new storytelling format will blur the line between video games and television/film – fusing deep storytelling with viewer agency and “play,” opening up a vast new market.

Games: The Frontier of Modern Storytelling

There are two major waves happening today that could accelerate the creation of a new generational storytelling company:

  1. Consumer shift toward interactive media (over linear/passive media, i.e. TV/film)
  2. Technology advancement driven by generative AI

Over the past 30 years, we’ve seen a steady consumer shift where gaming / interactive media has become more popular with each new generation. For Gen Z and younger, games are now the #1 way they spend free time, beating out TV/film. In 2019, Netflix CEO Reed Hastings famously said in a shareholder letter: “we compete (and lose to) Fortnite more than HBO.” For most households today, the question is “what are we playing” vs “what are we watching.”

While TV/film/books still host compelling stories, many of the most innovative and successful new stories are being told in games today. Take Harry Potter for example. Open-world RPG Hogwarts Legacy let players step into the cloak of a new student at Hogwarts with a depth of immersion not seen before. The game was the best-selling title of 2023, grossing over $1B at launch and beating the box office of every Harry Potter movie except the finale Deathly Hallows Pt 2 ($1.3B).

Game IP has also seen smashing success recently as TV/film adaptations. Naughty Dog’s The Last of Us was HBO Max’s most-watched series in 2023, with 32M average viewers per episode. The Super Mario Bros. Movie grossed $1.4B with the biggest global opening weekend ever for an animated film. Then there’s the critically acclaimed Fallout show, Paramount’s less well-received Halo show, the Tom Holland Uncharted movie, Michael Bay’s Skibidi Toilet movie – the list goes on and on.

A key reason why interactive media is so powerful is because active participation helps create affinity toward a story or universe. An hour of gaming at 100% attention > an hour of passively watching TV. Many games are also social, with multiplayer mechanics built into the core design. The most memorable stories are the ones we create and share with those who are close to us.

Sustained engagement with an IP across multiple modalities – viewing, playing, creating, sharing – enables a story to become more than just entertainment; it becomes part of a person’s identity. The magic moment is when a person transitions from “I watch Harry Potter” to “I am a Potterhead.” The latter is much more durable, building identity and a multiplayer community around what might previously have been a single-player activity.

All in all, while some of the greatest stories in our history have been told in linear media, games and interactive media are where the stories of the future are being told – and thus where we believe the most important storytelling companies of the next century will be built.

Interactive Video: Storytelling Meets Play

Given the cultural dominance of games, we believe the next Pixar will arise through a media format that blends storytelling with play. One format we see potential for is interactive video.

First, what is interactive video, and how is it different from a video game? In a video game, a developer pre-loads a set of assets into a game engine. In Super Mario Bros, for example, an artist designs the Mario character, trees, and background. A programmer then codes Mario to jump exactly 50 pixels when the player presses the “A” button, and the jumping frames are rendered through a conventional graphics pipeline. The result is a highly deterministic architecture in which the developer is fully in control.

Interactive video, on the other hand, generates frames in real-time entirely from neural networks. No assets need to be created or uploaded beyond a set of creative prompts (which could be text or a representative image). A real-time AI image model receives the player’s input (e.g. the “Up” button) and probabilistically infers the next gameplay frame.
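To make the contrast concrete, below is a minimal Python-style sketch of the two loops. Every name in it (engine, video_model, predict_next_frame, and so on) is a hypothetical stand-in rather than a real engine or model API; the point is simply that the first loop renders pre-authored assets deterministically, while the second asks a neural network to infer each new frame from the frame history and the player’s input.

```python
# Illustrative sketch only: the engine, model, and method names below are
# hypothetical stand-ins, not any real game engine or video model API.

def conventional_game_loop(engine, mario):
    """Traditional game: pre-authored assets plus hand-written rules, rendered deterministically."""
    while engine.running:
        button = engine.poll_input()
        if button == "A":
            mario.y += 50                      # hand-authored rule: jump exactly 50 pixels
        frame = engine.render(mario, engine.level_assets)  # rasterize pre-made assets
        engine.display(frame)                  # same input always produces the same frame

def interactive_video_loop(video_model, prompt, poll_input, display):
    """Interactive video: each frame is inferred by a neural network from history plus input."""
    history = [video_model.generate_first_frame(prompt)]   # no assets, just a creative prompt
    while True:
        button = poll_input()                  # e.g. the "Up" button
        # The model probabilistically infers the next frame, conditioned on prior
        # frames and the player's input; the same input can yield different frames.
        next_frame = video_model.predict_next_frame(history, action=button)
        display(next_frame)
        history.append(next_frame)
```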

The promise of interactive video lies in blending the accessibility and narrative depth of TV/film with the dynamic, player-driven systems of video games. Everyone already knows how to watch TV and follow linear stories. By adding video generated in real-time from player input, we can create personalized, infinite gameplay – potentially enabling a media property to retain fans for thousands of hours, similar to the best player-driven games. Blizzard’s World of Warcraft is over 20 years old and still retains an estimated 7M subscribers today.

Interactive video also enables multiple consumption modalities – a viewer could lean back and consume content like a TV show, and at other times lean in and actively play on a mobile device or controller. Enabling fans to engage with their favorite IP universes in as many ways as possible is core to transmedia storytelling, which helps create stronger affinity toward an IP.

Over the past decade, many storytellers have pursued various attempts at the interactive video vision. An early breakout was Telltale’s The Walking Dead – a cinematic experience based on Robert Kirkman’s comic series, where players watch animated scenes play out but make choices at key moments via dialogue and quick-time events. These choices – for example, determining which character to save in a zombie attack – created story variants that personalized each playthrough. The Walking Dead launched in 2012 and was a resounding success – winning several Game of the Year awards and selling over 28M copies to date.

In 2017, Netflix also entered interactive video – starting with animations like Puss in Book and eventually releasing the critically acclaimed Black Mirror: Bandersnatch, a live action film where viewers make choices for a young programmer adapting a fantasy book into a video game. Bandersnatch was a holiday phenomenon that created a cult following of fans making flow charts to document every possible ending of the film.

Yet for all the positive reviews, both Bandersnatch and The Walking Dead faced an existential problem – it was time- and cost-prohibitive to manually create the myriad branching stories that defined the format. As Telltale scaled to multiple projects, it developed a reputation for crunch, with developers complaining about “churn and burn.” Narrative quality suffered – while The Walking Dead debuted to a Metacritic score of 89, four years later Telltale released one of its biggest IPs, Batman, to a disappointing 64. And in 2018, Telltale declared bankruptcy after failing to find a sustainable business model.

For Bandersnatch, the crew filmed 250 video segments comprising over 5 hours of footage to account for the film’s 5 endings. Budget and production times were reportedly double those of a standard Black Mirror episode, with the showrunners sharing that the project’s complexity was equivalent to making “4 episodes at the same time.” Eventually, in 2024, Netflix decided to mothball the entire Interactive Specials division – opting to create traditional games instead.

Until now, content costs for interactive video projects have scaled linearly with hours of gameplay – there was just no way around this. However, advances in generative AI models could be the unlock for making interactive video work at scale.

Generative Models will Soon be Fast Enough for Interactive Video

The recent advances in model distillation for image generation have been astounding. In 2023, the release of latent consistency models and SDXL Turbo drove huge improvements in image generation speed and efficiency – enabling high-resolution renders in a single step, down from 20-30 steps previously, and bringing costs down by >30x. The idea of generating video – a series of consistent images with frame-by-frame changes – suddenly became far more feasible.
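For a sense of what single-step generation looks like in practice, here is a minimal sketch using Hugging Face’s open-source diffusers library with the publicly released SDXL Turbo checkpoint; the prompt is our own, and exact arguments, speed, and output quality will vary by library version and hardware.

```python
# Minimal sketch: single-step image generation with a distilled model (SDXL Turbo)
# via Hugging Face diffusers. Assumes a CUDA GPU; arguments may vary by version.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

image = pipe(
    prompt="a hand-drawn animation still of a fox astronaut, warm studio lighting",
    num_inference_steps=1,   # distilled models can render in a single step
    guidance_scale=0.0,      # SDXL Turbo is designed to run without classifier-free guidance
).images[0]
image.save("frame_0001.png")
```

Generating a short video then amounts to producing a sequence of such frames while keeping them consistent with one another – which is exactly where dedicated video models come in.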

Earlier this year, OpenAI shocked the world by announcing Sora, a text-to-video model that could generate videos up to one minute long while maintaining visual consistency. A short while later, Luma AI released an even faster video model, Dream Machine, which could generate 120 frames in 120 seconds (~5 seconds of video). Luma recently shared that they’ve reached an astounding 10M users in just 7 weeks. Last month, Hedra Labs released Character-1, a multimodal video model focused on characters, which can generate 60 seconds of video with expressive human emotions and voice acting in 90 seconds. And Runway recently unveiled Gen-3 Turbo, a model that can render 10-second clips in only 15 seconds.

Today, an aspiring filmmaker can quickly generate several minutes of 720p HD video from a text prompt or reference image, which can be paired with a starting or ending keyframe for greater specificity. Runway has also developed a suite of editing tools that provide more fine-grained control over diffusion-generated video, including in-frame camera control, frame interpolation, and motion brush (animating sections of the video). Luma and Hedra are due to release their own creator tool suites soon as well.

While the production workflows are early, we’ve already met several content creators putting together stories using these tools. Resemblance AI created Nexus 1945, a stunning 3-minute alternate history of World War II told with Luma, Midjourney, and Eleven Labs. Indie filmmaker Uncanny Harry created a Cyberpunk short with Hedra. Creators have made music videos, show trailers, travel vlogs, and even a fast food burger commercial. Since 2022, Runway has hosted an annual AI Film Festival where they select the top 10 short films produced with the help of AI.

Yet there are limitations to acknowledge: there’s still a big gap in narrative quality and control between a 2-minute clip generated from a prompt and a 2-hour feature film crafted by a team of professionals. It can be hard to generate exactly what a creator wants from a prompt or image, and even experienced prompt engineers usually discard the majority of their generations. AI creator Abel Art reported a ratio of ~500 videos for 1 minute of coherent video. Image consistency also usually starts to fail after a minute or two of continuous video and requires manual editing – which is why most generations are capped at ~1 minute today.

To most professional Hollywood studios today, diffusion-generated video might be used for storyboards in pre-production to visualize what a scene or character might look like, but not as a replacement for on-set work. There is also an opportunity to use AI for audio or visual effects in post-production, but, as a whole, the AI creator tool suite is still early compared to traditional workflows that have seen decades of investment.

In the near-term, one of the largest opportunities for generative video lies in advancing new media formats like interactive video and shorts. Interactive video is already broken up into short 1-2 minute segments by player choices, and is often animated or stylized, allowing for lower-resolution footage. More importantly, creating these short videos via diffusion models is far more cost-effective than it ever was for Telltale or Bandersnatch – Abel Art estimated the cost of a 1-minute video from Luma at $125, equivalent to renting a cinema lens for one day.

And while generative video quality can be inconsistent today, the popularity of vertical shorts apps such as ReelShort and DramaBox has already proven that there is audience demand for lower-production-value, episodic short TV. With thousands of bite-sized mini series such as “Forbidden Desires: Alpha’s Love,” ReelShort has driven over 30M downloads and $10M+ in monthly revenue, despite critic complaints that the cinematography is amateur and the scripts formulaic.

The biggest remaining technical hurdle for interactive video is reaching frame generation speeds fast enough for content generation on the fly. Dream Machine currently generates ~1 frame per second. The minimum acceptable target for games to ship on modern consoles is a stable 30 FPS, with 60 FPS being the gold standard. With the help of advancements such as Pyramid Attention Broadcast (PAB), this could rise to 10-20 FPS on certain video types, but that is still not quite fast enough.
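To put that gap in perspective, here is a rough back-of-the-envelope frame-budget calculation, using only the figures cited above (with 15 FPS as an illustrative midpoint of the 10-20 FPS range):

```python
# Back-of-the-envelope frame budgets vs. cited generation speeds (illustrative numbers only).
rates_fps = {
    "Dream Machine today (cited)": 1,
    "PAB-style speedups (midpoint of cited 10-20 FPS)": 15,
    "console minimum target": 30,
    "gold standard": 60,
}

for name, fps in rates_fps.items():
    print(f"{name:50s} {fps:>3d} FPS  ->  {1000 / fps:6.1f} ms per frame")

# A 30 FPS target leaves ~33 ms to generate each frame (~17 ms at 60 FPS); at ~1 FPS
# today, each frame takes ~1,000 ms, i.e. roughly 30-60x slower than real-time targets.
```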

State of Play: the Interactive Video Landscape

Given the rate at which we’ve seen underlying hardware and model improvements, we estimate that we may be ~2 years out from commercially viable, fully generative interactive video.

Today, we’re seeing progress in research with players like Microsoft Research and OpenAI working toward end-to-end foundation models for interactive video. Microsoft’s model aims to generate fully “playable worlds” in 3D. OpenAI showed a Sora demo where the model was able to “zero-shot” simulate Minecraft: “Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity.”

In February 2024, Google DeepMind released its own foundation model for end-to-end interactive video, named Genie. Genie’s novel element is its latent action model, which infers the hidden action taken between each pair of video frames. Trained on 300,000 hours of platformer videos, Genie learned to distinguish character actions – e.g. how to jump over obstacles. These latent actions, together with the output of a video tokenizer, are fed to a dynamics model that predicts the next frame, piecing together an interactive video.
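A heavily simplified conceptual sketch of that loop is below; the component and method names are illustrative stand-ins rather than DeepMind’s actual code, but the structure mirrors the paper’s description of a video tokenizer, latent action model, and dynamics model.

```python
# Conceptual sketch of a Genie-style rollout (simplified; tokenizer, dynamics_model,
# and their methods are hypothetical stand-ins, not DeepMind's actual API).
from typing import Callable, List

def genie_style_rollout(
    tokenizer,                               # video tokenizer: frames <-> discrete tokens
    dynamics_model,                          # predicts next-frame tokens from history + action
    get_player_action: Callable[[], int],    # at play time, the player picks a discrete latent action
    first_frame,
    steps: int = 16,
) -> List:
    frame_tokens = [tokenizer.encode(first_frame)]
    frames = [first_frame]
    for _ in range(steps):
        # During training, the latent action model infers this action from each pair of
        # consecutive frames; at inference, the player supplies it directly.
        action = get_player_action()
        next_tokens = dynamics_model.predict_next(frame_tokens, action)
        frame_tokens.append(next_tokens)
        frames.append(tokenizer.decode(next_tokens))
    return frames
```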

On the application layer, we’re already seeing teams explore novel forms of interactive video experiences. Many companies are working on producing generative film or television, designing around the limitations of current models. We’ve also seen teams incorporate video elements inside AI-native game engines.

Latens by Ilumine is building a “lucid dream simulator” where users generate frames in real-time as they walk through a dream landscape. The slight lag helps create a surreal experience. Developers in the open-source community Deforum are creating real-world installations with immersive, interactive video. Dynamic is working on a simulation engine where users can control robots in first person using fully generated video.

In TV/film, Fable Studio is building Showrunner, an AI streaming service that enables fans to remix their own versions of popular shows. Fable’s proof of concept South Park AI debuted to 8M views last summer. Solo Twin and Uncanny Harry are both AI-focused filmmaking studios at the bleeding edge. The Alterverse built a D&D inspired interactive video RPG where the community decides what happens next. Late Night Labs is a new A-list film studio integrating AI into the creative process. Odyssey is building a visual storytelling platform powered by 4 generative models.

As the line between film and games blur, we’ll see AI-native game engines and tools emerge to provide creators with more control. Series AI has developed Rho Engine, an end-to-end platform for AI game creation, and is leveraging their platform to build original titles with major IP holders. We’re also seeing AI creation suites from Rosebud AI, Astrocade, and Videogame AI enable folks new to coding or art to quickly get started making interactive experiences.

These new AI creation suites will open up the market for storytelling – enabling a new class of citizen creators to bring their imagination to life using a combination of prompt engineering, visual sketching, and voice dictation.

Who will Build the Interactive Pixar?

Pixar was able to take advantage of a foundational technology shift in computers and 3D graphics to build an iconic company. There is a similar wave happening today in generative AI. However, it’s also important to remember that Pixar owes much of its success to Toy Story and the original animated films created by a world-class team of storytellers led by John Lasseter. Human creativity, leveraging new technology, produced the best stories.

Similarly, we believe the next Pixar will need to be both a world-class interactive storytelling studio as well as a top technology company. Given how quickly AI research is progressing, the creative team will need to be able to work hand-in-hand with the AI team to blend narrative and game design with technical innovations. Pixar had a unique team that merged art and technology, and also partnered with Disney. The opportunity today is for a new team to bridge the disciplines of games, film, and AI together.

To be clear, this will be challenging, and the hurdles aren’t only technical: this team will need to find new ways for human storytellers to work alongside AI tools in a way that empowers rather than detracts from their imaginations. There are also many legal and ethical hurdles to be solved – legal ownership and copyright protection of AI-generated creative works is unclear today unless a creator can prove ownership of all the data used to train a model. Compensation for the original writers, artists, and producers behind training data also still needs to be resolved.

Yet what’s also clear today is that there is immense demand for new interactive experiences. And long-term, the next Pixar could create not just interactive stories but entire virtual worlds. We previously wrote about the potential of never ending games – dynamic worlds that combine real-time level generation with personalized narratives and intelligent agents – similar to HBO’s Westworld vision. Interactive video addresses one of the greatest challenges with bringing Westworld to life – creating large amounts of personalized, high quality, interactive content on the fly.

One day, with the help of AI, we might start the creative process by crafting a storyworld – an IP universe we envision fully formed, with characters, narrative arcs, and visuals – and then generating the individual media products we want for a given audience or situation. This will be the final evolution of transmedia storytelling, fully blurring the lines between traditional forms of media.

Pixar, Disney, and Marvel were all able to create memorable worlds that became part of their fans’ core identity. The opportunity for the next Interactive Pixar is to leverage generative AI to do the same – to create new storyworlds that blur the lines between traditional storytelling formats, and in doing so, create universes unlike any we’ve seen before.

Special thanks to Neel Jain for research & writing support.