AI + a16z

Benchmarking AI Agents on Full-Stack Coding

Martin Casado and Sujay Jayakar

Posted March 28, 2025
In this episode of the AI + a16z podcast, a16z General Partner Martin Casado sits down with Sujay Jayakar, cofounder and Chief Scientist at Convex, to talk about his team's latest work benchmarking AI agents on full-stack coding tasks. From the design of Fullstack-Bench to the quirks of agent behavior, the two dig into what's actually hard about autonomous software development, and why robust evals and guardrails like type safety matter more than ever.

They also get tactical: Which models perform best for real-world app building? How should developers think about trajectory management and variance across runs? And what changes when you treat your toolchain like part of the prompt? Whether you're a hobbyist developer or building the next generation of AI-powered devtools, Sujay's systems-level insights are not to be missed.
Drawing on Sujay's work developing Fullstack-Bench, they cover:
  • Why full-stack coding is still a frontier task for autonomous agents
  • How type safety and other “guardrails” can significantly reduce variance and failure
  • What makes a good eval — and why evals might matter more than clever prompts
  • How different models perform on real-world app-building tasks (and what to watch out for)
  • Why your toolchain might be the most underrated part of the prompt
  • And what all of this means for devs — from hobbyists to infra teams building with AI in the loop
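As a concrete illustration of the type-safety point above, here is a minimal TypeScript sketch (the `Message` interface and `validateMessage` function are hypothetical examples, not part of Convex's actual API): a typed data contract lets the compiler catch an agent's misspelled or missing field before the code ever runs, which is one way a toolchain can cut down on failed runs.

```typescript
// Hypothetical example: a typed schema acting as a guardrail for
// agent-generated code. Names here are illustrative only.

interface Message {
  author: string;
  body: string;
  sentAt: number; // Unix timestamp in milliseconds
}

// Runtime check mirroring the compile-time contract.
function validateMessage(m: Message): boolean {
  return m.author.length > 0 && m.body.length > 0 && m.sentAt > 0;
}

const ok = validateMessage({
  author: "sujay",
  body: "hello",
  sentAt: Date.now(),
});
// If an agent wrote `sentat` instead of `sentAt`, the program would
// fail to compile, surfacing the error immediately instead of at runtime.
console.log(ok);
```

The design point is that the type checker becomes part of the feedback loop: an agent's mistake turns into an immediate, machine-readable compile error rather than a silent runtime failure many steps later.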

More About This Podcast

Artificial intelligence is changing everything from art to enterprise IT, and a16z is watching all of it with a close eye. This podcast features discussions with leading AI engineers, founders, and experts, as well as our general partners, about where the technology and industry are heading.

Learn More