AI + a16z

AI, Data Engineering, and the Modern Data Stack

Tristan Handy, Jennifer Li, and Matt Bornstein

Posted June 23, 2025

In this episode of AI + a16z, dbt Labs founder and CEO Tristan Handy sits down with a16z’s Jennifer Li and Matt Bornstein to explore the next chapter of data engineering — from the rise (and plateau) of the modern data stack to the growing role of AI in analytics and data engineering. As they sum up the impact of AI on data workflows: The interesting question here is human-in-the-loop versus human-not-in-the-loop. AI isn’t about replacing analysts — it’s about enabling self-service across the company. But without a human to verify the result, that’s a very scary thing.

Among other specific topics, they also discuss how automation, as well as tooling like SQL compilers, are reshaping how engineers work with data; dbt’s new Fusion Engine and what it means for developer workflows; and what to make of the spate of recent data-industry acquisitions and ambitious product launches.

Transcript

Tristan: I don’t believe in the idea that you’re going to do analytics by asking a model to write SQL. It’s not that interesting if you can write a well-formed SQL query. The hard part of analytics is that data analysts are socially constructing truth inside of an organization. There is no such thing as revenue in an abstract sense. It is just: what do we all agree is the way that we measure revenue?

A model just doesn’t have access to that unless you give it very specific instructions, and you would do that through metadata. In a best-case scenario, you would do it through something called a semantic layer. A semantic layer would actually give the model exactly the metadata required to construct the SQL query in a way that everybody in the organization agrees it should be constructed.
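The semantic-layer idea Tristan describes can be sketched in a few lines: a client (human or model) asks for a metric by name, and the layer, not the model, supplies the organization's agreed-upon SQL definition. This is a minimal illustrative sketch; the metric names and structure here are hypothetical, not dbt's actual Semantic Layer API.

```python
# The organization's single agreed definition of "revenue".
# The semantic layer owns this; the model never guesses it.
METRICS = {
    "revenue": {
        "table": "orders",
        "expression": "SUM(amount)",
        "filter": "status = 'completed'",  # e.g. refunds excluded by agreement
    }
}

def compile_metric(name: str, group_by: str) -> str:
    """Turn a governed metric definition into a concrete SQL query."""
    m = METRICS[name]
    return (
        f"SELECT {group_by}, {m['expression']} AS {name}\n"
        f"FROM {m['table']}\n"
        f"WHERE {m['filter']}\n"
        f"GROUP BY {group_by}"
    )

print(compile_metric("revenue", "order_month"))
```

The point of the design is that every consumer, whether a BI tool or an MCP-connected language model, gets the same SQL for the same question, because the definition lives in one governed place.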

We acquired a company called Transform. I think it was two and a half years ago. Now it’s integrated into the dbt platform, and we built an MCP server that exposes this functionality. When you go to any MCP-enabled language model and you ask it questions about your business data, it gives you correct answers. The funny thing is that there’s a bunch of people that play around with that, but it hasn’t crossed into the mainstream. There’s a ton of curiosity around this, but still people are using the BI tools generally that they have been using.

Jennifer: Let’s break it down to what the tasks are that an analyst is doing today, and which of those pieces these models actually have the capability to serve. I think even compared to a year ago, the capability of writing SQL is night and day. I recently asked ChatGPT to build a chart for me from a very complex Excel sheet. You need to do a couple of pivot tables, paint this chart, and take out a couple of rows and columns as well. It did a great job at painting this chart. I was very surprised, impressed.

I did a couple of spot checks to see if the data points were still correct, and they were. That gave me more hope of where AI can be applied, maybe in the final step of visualization. There’s also data cleaning work for analysts to do. And there’s this organizational social work to do, which I don’t believe will ever be fully automated, though maybe a few agents working together can gather some truth. But let’s speculate: which pieces are ready to be automated now, and which pieces still, I think, require a human to come in and do the work?

Matt: I can share just one or two things that we’ve seen from some of your fellow portfolio companies. I think you’re absolutely right that humans have a lot of work to do in data analysis. That’s very clear. So much of the work is gathering context, making definitions, almost negotiating with other stakeholders about what the definitions mean and which ones are correct. You obviously are kind of the world expert on this. You know this much better than we do. What’s interesting is there’s a parallel there to writing application code, I think.

When you’re just writing code for a new piece of software, there’s still a lot of context, both in terms of the code that has to be ingested (you have to understand the architecture of the whole system and all that kind of stuff) and the social context of working in a bigger team of engineers with all different opinions. We’re obviously seeing AI coding take off to a very large extent. I think the key there has been finding the right insertion point. I think this is exactly what you’re asking about, Jennifer, in analytics, where it’s very clear you don’t want to just completely replace an engineer with an AI coding system, in the same way that you wouldn’t want to completely replace an analyst. It almost just doesn’t make sense. It’s like somebody still has to press the button.

So as long as that’s true, it’s, like, okay, are they just pressing a button, or are they providing a specification? If so, what does the spec look like? And if so, shouldn’t they just be writing some code or kind of driving anyway? So there’s this kind of fundamental problem, I think, with full replacement of people in these jobs. What coding has gotten right is the models write very good code, actually, now. They can do some stuff sort of on their own if you give them proper direction, write a good spec.

And there are great tools, things like Cursor and Claude Code, that insert themselves in a way that engineers like, and I’m sort of curious to see if that comes to analysts. We’ve actually seen people use Cursor to write analytics queries, which is pretty interesting. One of our companies, Hex, has a pretty good AI product. I think you know this company, Tristan. It’s almost hard, once you’ve used their kind of magic features, to go back to not using them. But these are still relatively small and relatively incremental. So what’s interesting, I think, is kind of what happens next and what really is that right insertion point.

Tristan: I think you’re totally right. The interesting question here is human in the loop versus human not in the loop. And this is why… Typically, the way that people think about the “AI analyst” is not as a way to accelerate current analysts. It’s a way to do self-service inside of businesses. So, like, okay, I want to take this, and I want to give this to every single data user throughout my company, which is 10x, 100x as many analysts as there are. They’re the people who mostly are using Excel today. But those folks don’t have the ability to evaluate, is this code actually producing the correct result?

Jennifer: Correct. Yep. They have no way to verify it, which is a very scary thing.

Tristan: That’s a human out-of-the-loop process. The places where I think human in the loop is working really well, and Hex is a great example of this, typically you’re going to see users constructing these queries or notebooks who have the ability to read the code and say whether it is correct or not. And as a result, it becomes an accelerator for them as opposed to a replacement for them. The area where I think there’s even more room for this is in data engineering.

Data engineering is incredibly valuable, but also… If you, like, literally take all the tasks that a data engineer does every day and you write them on a list and you ask which of these are things that highly trained, highly paid human beings should be spending their time on, it’s, like, a lot of them they shouldn’t. This is an area where SQL generation is very valuable. Pipelines are incredibly valuable, and their performance matters a lot.

So in dbt, we have the ability to…like, using natural language help you build pipelines. Of course, you’ve got to actually validate them and work them through the CI/CD process, etc. That works great. One of the things that I think is the most time-sucky and produces very little value is debugging pipeline failures. Pipeline failures happen. Pipelines are more brittle than we’d like them to be. And yet, inevitably, the cause of those failures is kind of dumb. It’s not that interesting. And so you just need to look through enough log files and trace it upstream. There’s a process for going through this, but oftentimes it takes four hours for a human being to trace this. It turns out, and we haven’t productized this yet, but we’ve proven it to ourselves internally via prompt engineering.

Agents are quite good at identifying the problem and proposing a fix. And if you have the right tooling, you can then take that fix and run it through CI, and you can say, ah, this actually produces the output that I’m looking for. So, I expect to see a lot of automation of data engineering tasks over the coming 12 months.
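The debugging workflow Tristan describes (trace the failure upstream, have an agent propose a fix, then validate it through CI) could be sketched roughly like this. The "agent" here is a stub standing in for an LLM reading logs, and all names are hypothetical; this is an illustration of the loop, not dbt's implementation.

```python
def find_root_failure(failures, upstream):
    """Trace a failure upstream: the root cause is a failing node
    none of whose own upstream dependencies also failed."""
    for node in failures:
        if not any(dep in failures for dep in upstream.get(node, [])):
            return node
    return None

def triage(failures, upstream, propose_fix, run_ci):
    root = find_root_failure(failures, upstream)
    if root is None:
        return None
    fix = propose_fix(root)              # in practice: an LLM reading the logs
    return fix if run_ci(fix) else None  # a human still reviews the accepted fix

# Toy run: mart_revenue failed only because stg_orders failed upstream of it.
upstream = {"mart_revenue": ["stg_orders"], "stg_orders": ["raw_orders"]}
failures = {"mart_revenue", "stg_orders"}
fix = triage(failures, upstream,
             propose_fix=lambda node: f"patch for {node}",
             run_ci=lambda fix: True)
print(fix)  # -> patch for stg_orders
```

The CI gate is what makes this human-in-the-loop in spirit: the agent's proposal only survives if it demonstrably produces the expected output.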

Matt: And are these failures kind of within system boundaries or across system boundaries? Because I’ve found this is one of the big kind of questions for AI. If you have to interface with an external system, it’s a lot worse at that, versus if it’s, like, oh, there’s a mismatched schema, it’s actually pretty good at making a guess and trying to align them.

Tristan: You’re right. The things that I’m focused on are very much in the world of “the data has landed” all the way through to “I’ve got the data set ready for analysis that I want.” But I think that there is enough connective tissue in this space that you could turn around and do that same set of things to Fivetran pipelines or anything like that.

Matt: It’s very interesting. Because in the coding example, you can set your bug bot loose to try to find a bug. And if there’s some external dependency, it tends to just start making stuff up. It’s, like, oh, maybe that system went down, or maybe, you know, the function signature changed, and it’s, like, well, did it or did it not? Right. It’s very hard for it to tell. That’s actually very interesting. That’s sort of a point in favor of kind of…you know, to your point.

Tristan: Yeah. Pipeline failures in our world happen for a pretty defined set of reasons.

Matt: Very interesting.

Tristan: An upstream source changed the schema, and it broke something, or new data showed up that we didn’t anticipate, or these kinds of things.

Matt: That’s very interesting.
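The failure modes Tristan lists, an upstream schema change or new data nobody anticipated, are mechanical enough to check for directly, which is part of why agents handle them well. A minimal sketch, with hypothetical column sets rather than a real dbt test:

```python
def detect_drift(expected_cols, actual_cols, known_values, new_values):
    """Flag the two classic pipeline-breakers: schema drift and surprise data."""
    issues = []
    missing = expected_cols - actual_cols
    added = actual_cols - expected_cols
    if missing:
        issues.append(f"upstream dropped columns: {sorted(missing)}")
    if added:
        issues.append(f"upstream added columns: {sorted(added)}")
    unexpected = new_values - known_values
    if unexpected:
        issues.append(f"unanticipated values: {sorted(unexpected)}")
    return issues

issues = detect_drift(
    expected_cols={"id", "amount", "status"},
    actual_cols={"id", "amount", "state"},       # upstream renamed a column
    known_values={"completed", "pending"},
    new_values={"completed", "chargeback"},      # new data showed up
)
print(issues)
```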

Jennifer: I’m asking you this question, first, because you coined the term, and second, because you’re sort of a historian of the space as well. You wrote very popular blogs around this, and “The Analytics Engineering Podcast” also talked quite a bit about the modern data stack. Give us a bit more background: what is this modern data stack?

Tristan: Modern is always a tough term. Modern relative to what? I live in a mid-century modern house. It was built in 1979. That’s not particularly modern anymore, but we still call it that because it was in reaction to a thing that came before it. And I’m not an architecture snob, so I don’t really know the full history there. But “modern” here was in relation to two things that came before it: one was the kind of Hadoop world, and the other was the kind of on-prem data warehouse appliance world. And both of those either had already hit or were starting to hit some pretty serious headwinds by the time the cloud came for data.

I would put the start of the modern data stack at the launch of Redshift in 2013. And you could argue that maybe the 2013 version of Redshift didn’t have many of the characteristics that some of the data platforms later came to have. But it was the first time you could swipe a credit card and get access to really great analytic technology in the cloud. Before that, you had to spend, you know, 100 grand to procure servers. An ecosystem grew up around it in the early days: Looker and Mode and Periscope and Fivetran and Stitch and maybe a couple of others. And you could pretty quickly put together a set of products that was pretty mature, with a couple of credit card swipes, in an afternoon. And that was brand new.

And for people who had been stuck with not good tooling for a long time, it was really exciting. It allowed us to work in ways that were not at all possible. Now, I’m sure we’ll get into it, but I think that the arc of history has, I think, played out on the term modern data stack, mostly because it won. Like, the ideas in the modern data stack have kind of taken over the industry. And so then the question becomes, like, well, what’s next?

Jennifer: You also sort of dated the end of the modern data stack era to 2024. So, where are things at now? Where is the data stack at? Is it postmodern? I don’t know.

Matt: I mean, the impressionist data stack or deconstructionist data stack.

Tristan: I don’t remember who originally got me onto this, but I’ve become a big fan of the Carlota Perez framework. It introduces the concept of S-curve stacking. So every technology goes through an S-curve where it starts off where almost nobody uses it, and then very quickly a bunch of people, early adopters, and then middle and then late adopters. Eventually, everybody’s using this, and it kind of starts to level off. And the way that you get technological progress is you stack S-curves on top of one another. The way that I see the space right now is that really we had the S-curve right before what we’ve been talking about as the modern data stack was kind of the rise of public cloud and Hadoop.

And Hadoop was really enabled by the cloud. Most companies couldn’t really imagine running a Hadoop infrastructure on-prem. It’s just, like, not really how it’s built. That S-curve kind of came to an end, and then you have the rise of the modern data stack. And I would say that that S-curve came to an end in the same way that the S-curve around railroads came to an end. We got all the railroads, and we’re not in a deployment phase of railroads anymore, circa like 1925. And so now the big axis of innovation, I think, is in two places. One is in open standards, things like Delta and Iceberg, at the file format or table format level. And then the other one, obviously, is in AI.

And so AI is a much bigger topic than the world of data. For all the excitement that’s happened in data over the last 15 years, I don’t think anyone’s worried about data putting us all out of jobs or anything like that. The societal implications of AI are fascinating and well beyond what I’m an expert in. But there are very direct implications for AI on data and for data on AI. And so it’s that intersection that I’m particularly interested in.

Matt: One question I have for you, Tristan, is: are there things that the modern data stack never hit? Are you seeing a lot of workloads that are still kind of there, and they’ve been there forever, and even though people know the modern data stack is the right way to do things, they’re, like, oh, but this has some other thing, you know, and so we just haven’t touched it?

Tristan: The term modern data stack, if you move away from the technology part of it, there’s also a persona part of it. Who tends to work with the set of technologies? And I think the answer to that is the spectrum from data engineer to analytics engineer to data analyst. It’s, like, people that are firmly in the world of data. Software engineers sometimes dabble in that space, but they mostly don’t. And similarly, if you’re a business analyst, you might dabble, but you mostly don’t. Like, business analysts have been pretty resilient to the rise of the modern data stack and a lot of them still use tools like Tableau and Alteryx and Excel.

And software engineers often still… Like, we will run into people who, despite the fact that there are, you know, well over a million people authoring dbt workloads at 70,000 companies today, software engineers a lot of times just don’t have any contact with this tooling stack at all. In terms of workloads, I thought that we were going to do more in streaming, we, the collective kind of ecosystem, and that hasn’t turned out to be true, at least at the pace that I had anticipated. I think that ends up being more of a persona thing than a technology thing, because I think there are actually good answers to how to do SQL and Python on stream processing engines; it just tends to be different humans who need really low-latency data delivery.

Matt: Streaming, like cold fusion, is one of those things: always a good idea and always on the horizon. That’s a really interesting point you make about software engineers versus analysts or analytics engineers, for instance, because I think in a lot of ways, the history of the data stack, like you said, is sort of this diffusion from more engineers towards more analytics-type people, right? Like, you mentioned Hadoop, for instance. You know, Hadoop was sort of a very technical, highly engineered solution built by a bunch of engineers, right? They kind of looked at this data problem that existed at the time and said, okay, let’s do a distributed file system and this really complicated sort of programming model that only, you know, a Google engineer could invent, right? Called MapReduce. And so I think you saw a diffusion of this happen for a long time, right? Like, you can trace things like Hadoop into things like Redshift or Snowflake, where you’re having this distributed benefit but with an easier programming model, for instance.

I think Iceberg or Delta, which you mentioned, is another great example of that, where this was sort of a new table format, as you mentioned, sort of built by people at Netflix and Airbnb and Apple and places like that, but really is diffused much more broadly now as kind of mainstream enterprises want to apply this kind of independent storage layer. It’s not clear if that’s still happening right now. Like, are there kind of new things that are diffusing out of the engineers, you know, into kind of the analytics world, or maybe, to your point, maybe, like, those groups aren’t talking to each other as much these days, or maybe, you know, it’s just kind of the natural flow of the industry. It’s an interesting question, I think, that you bring up.

Tristan: Not that we represent the entire modern data stack by any stretch. I would never try to claim that. But I will say that, you know, if we’re at over a million developers and 70,000 companies…

Matt: Those are huge numbers, by the way. Sorry to interrupt. That’s, like, crazy to think about.

Tristan: It’s a decent slice. We still see those numbers growing pretty quickly. That is not because there are that many new humans getting minted every year. It’s because people are still joining this movement, this way of looking at the world. I think that will continue to be true for a long time. In terms of, like… We still have a lot to steal from software engineers. Like, you don’t… I’ve pretty consistently felt like software engineering tool stack was maybe two decades ahead of data. I think that maybe we’ve closed a little bit of that gap, but we’re still pretty far behind.

One of the things that is irritating to me is that in data, most of the processing engines that we use are proprietary, and they’re controlled by a vendor. And as a result, there’s no such thing as a local development environment, which is, like, kind of anathema to software engineers. Like, the idea that the only way I could possibly run my workload is in Amazon RDS, that’s not a thing, or, like, it was a thing 25 years ago. The other thing is that basically all software engineering ecosystems are fundamentally built on a compiler, or an interpreter in the case of interpreted languages like Python. That compiler defines, kind of, the ground truth for what works in an ecosystem. And then on top of that, you have libraries and package management, and, like, a whole ecosystem kind of builds up around it.

And because of these two things together, you end up having, like, a dysfunctional software environment where at every company you go to, you have to build everything from scratch all over again, because there aren’t good shared libraries, because at this company we use a different data platform, and the languages between these two data platforms are different enough that you can’t reuse the code across them, and blah, blah, blah. And so one of the things that we have been very focused on over the last six months: we acquired a company called SDF. SDF is fundamentally a compiler. The technology involved is a multi-dialect SQL compiler. And so it aims to abstract across all of the differences between all the different SQL dialects and then pull that down into a place where you can actually emulate that database with full, like, 100% fidelity on your local machine and give developers tooling that they can trust there.
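The multi-dialect problem Tristan describes can be illustrated in miniature: the same logical operation spells differently on each warehouse, so code can't be shared unless something normalizes it. Real dialect differences run far deeper than function names (types, semantics, null handling); this toy only shows the shape of the problem, and the dialect spellings are just the common documented forms.

```python
DIALECTS = {
    # How each engine commonly spells "concatenate two strings".
    "postgres":  lambda a, b: f"{a} || {b}",
    "bigquery":  lambda a, b: f"CONCAT({a}, {b})",
}

def render_concat(dialect, a, b):
    """Render one logical query in a target dialect."""
    return f"SELECT {DIALECTS[dialect](a, b)} AS full_name FROM users"

print(render_concat("postgres", "first", "last"))
print(render_concat("bigquery", "first", "last"))
```

A compiler that models every dialect at this level of detail is what lets one dbt project target many engines, and what makes faithful local emulation possible.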

Jennifer: What is the product work you’re doing now on dbt Fusion?

Tristan: The dbt Fusion engine comes directly from technology that we acquired from a company called SDF. This is a group of very smart humans who essentially rebuilt the engine at the heart of the dbt ecosystem in Rust and gave it a bunch of new capabilities. At its heart, it is a multi-dialect SQL compiler, and so it can do things like understand at the most granular level how a query will operate when it’s sent to a database, and it can emulate that locally. That allows us in this new world to do a bunch of neat things. It allows us to give developers local development environments. It allows us to give developers much better developer tooling in their IDE than they’ve ever had access to before: error handling, automatic refactoring, all of these kinds of things that you would expect in a modern software language. It also will allow us to do a bunch of neat things that are kind of new in the data engineering space.

So, the original technology for this came from when the CTO, Wolfram, was at Meta. He was hired at Meta in the wake of the Cambridge Analytica scandal. And the task was: we have over a million tables in our data warehouse, and we don’t know how PII flows through that data warehouse. And we’ve got eight different compute engines, and their SQL dialects are all a little bit different, and we need to make sure that everywhere the PII flows, we can track it. And so that is a capability of this engine. At the source level, you can tag all of your PII and PHI, and then it will perfectly track that for you through your entire data estate. It will also give you the ability to orchestrate your pipelines in a much more sophisticated way, so that it never does any work that it doesn’t have to, which has the potential…

Jennifer: Much more efficient.

Tristan: Much more efficient. It has the ability to reduce your overall kind of infrastructure costs by, like, meaningful double-digit percentages.
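The PII-tracking capability Tristan describes a moment earlier boils down to tagging columns at the source and propagating the tags through column-level lineage, so every downstream column derived from a tagged one inherits the tag. A sketch under stated assumptions: the lineage graph here is hand-written, whereas an engine like Fusion would derive it by compiling the SQL.

```python
from collections import deque

def propagate_tags(tagged_sources, lineage):
    """lineage maps a column to the downstream columns derived from it;
    breadth-first walk spreads the PII tag to everything derived."""
    tagged = set(tagged_sources)
    queue = deque(tagged_sources)
    while queue:
        col = queue.popleft()
        for downstream in lineage.get(col, []):
            if downstream not in tagged:
                tagged.add(downstream)
                queue.append(downstream)
    return tagged

lineage = {
    "raw_users.email": ["stg_users.email"],
    "stg_users.email": ["mart_signups.contact"],
    "raw_orders.amount": ["mart_revenue.total"],  # not derived from PII
}
pii = propagate_tags({"raw_users.email"}, lineage)
print(sorted(pii))
```

The tag reaches mart_signups.contact but not mart_revenue.total, which is exactly the guarantee you want across a million-table estate.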

Jennifer: Also thinking ahead to what happens when we have more AI analyst agents: the compute bill is probably going to stack up if we don’t have these more efficient workflow engines.

Tristan: Yeah. Jevons paradox is coming into effect pretty hard right now. I was just talking to Jordan Tigani at MotherDuck, and he’s seeing a lot of workloads move onto MotherDuck. But then, what do you know, he saves people a bunch of money, and they find a bunch of new workloads. And he shared a quote with me, and I forget who it’s attributed to, but it goes: analytics always expands to fill the available budget. So you want to continue to improve the price-to-performance ratio, not so that at the end of the day people can stop doing things, but so that they can do more things.

Jennifer: Right. And that’s one of the premises of why the modern data stack was popular: a lot of business analytics work that needed to be done in the past couldn’t be, with much more limited data sets. Now that you can store all the data you want to analyze in the cloud data warehouse in a much cheaper, more performant, easy-to-access way, we can answer a lot of questions that we were not able to answer before.

Matt: By the way, the SDF guys just deserve, like, a medal of honor for actually doing this work, right? Like, the idea that you can, like, interpret specific SQL dialects and, like, run local emulation of each of these engines. It’s, like, this kind of extremely detailed systems work that, like, is…it’s hard to do.

Tristan: Yeah, I kind of didn’t believe it at first. I asked them how many automated tests they had to write to guarantee that statement. And the answer is that, on top of the SDF database emulation stuff, there are single-digit millions of automated tests that run.

Matt: …is crazy.

Tristan: Which is why it has to all be written in Rust because it’s, like, a very serious build system required there.

Jennifer: What are some of the things you’re most excited about that haven’t yet been done, borrowing from practices of software engineering that could be applied to data and data engineering?

Tristan: I think that… So I just gave you two of them, local development environments and compilers. Where we go from there, I think it is, like, healthy, reusable ecosystems. When you build a website, you don’t start by writing HTML and CSS. You, like, typically would use React, and then on top of React, there’s a ton of components, and almost never are you going to build any of these basic components. Maybe you’ll, like, modify the CSS to make it look like your brand or something like this.

Matt: And then everything breaks, so you go, oh shoot, better change my CSS back.

Tristan: Right. Yeah. The point of good tooling is to multiply the impact of every individual professional, and that has always been my goal. You know, I started my career as a data analyst. Data analysts, especially back in 2003, did not have great career paths, didn’t make a ton of money. And the better tooling you can give somebody, the more, like, business value they create and the more you can afford to pay them. And so I think that if we can create really highly functional package ecosystems, we can stop the process of people reinventing the wheel over and over and over again.

Jennifer: Yeah, 100%. And also thinking in context of not just going to be humans analyzing and utilizing data, but there will be more and more AI agents that are coming too.

Tristan: And the more you can standardize, the better your agents will be able to interface with your data.

Jennifer: For sure. And reusing the components, reusing the libraries, being able to guarantee more accuracy through having these verified sort of components as well. I’d love to hear a bit more of your hot takes on the recent news. dbt has done a couple of acquisitions, most recently SDF, which you mentioned. At the Databricks Summit, people are talking about Lakebase, from the recent acquisition of Neon, and Snowflake acquired Crunchy Data. What’s happening with these, I’d say, analytics companies going first into more operational or transactional data workloads? And also, how do you generally think about the tooling stack being more compressed now compared to a few years ago?

Tristan: Compressed? You mean like consolidating?

Jennifer: Yes.

Tristan: Yeah. Yeah. Yeah.

Matt: That’s like the C word these days, consolidating.

Tristan: Yeah. One of the most boring things to do as a data engineer is to create pipelines that replicate data using CDC from your OLTP to your OLAP data stores. It is just, like, these database technologies for operational workloads and analytical workloads optimize for different things, and so I don’t believe they’re ever going to be the same. So you always have both of them, and you always need to get data back and forth between them. The idea that you would have the same vendor provide both seems, like, obviously a good idea. Now, I know that we’re recording this on the afternoon of the day Ali and Reynold went deep into Lakebase. That happened this morning, and it was super interesting to hear them talk about it. I think it’s based on a lot of good thinking, but they’re not the first ones to do this: let’s give people both access modes, OLTP and OLAP. I think it will help a bunch of Databricks and Snowflake users that their platforms now support that.

Matt: What do you think is really going on here? And maybe just for our listeners, we can do the quick explainer, which is OLTP means one row at a time. So if you’re checking out at Amazon, you insert into an OLTP or transactional database. OLAP means you’re going one column at a time. So if you want to look…summarize across all the rows…all the transactions that were done, so it’s more analytical. What do you think is really going on here? I just find it so interesting. It was sort of an OLTP world for decades. It was Oracle and SAP and even MySQL and Postgres. This was…when you said databases, this is what people thought of. It’s almost like OLAP kind of became the hot thing between Snowflake and Databricks. And I know I’m abusing the term OLAP a little bit now, but just sort of analytical workloads in general. But now it’s very funny, right? Because these companies are now getting into OLTP. It’s like this kind of, like, market pendulum kind of shifts back and forth, and the technology may not change dramatically, but the people kind of running…kind of, like, owning the customer and sort of owning the consolidation point may change. So I’m just so curious what you think is kind of going on, like, why that happens.
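Matt's explainer in miniature: a row store keeps each record together, which suits inserting or fetching one transaction, while a column store keeps each column together, which suits summarizing one field across all rows. A toy in-memory version of both layouts, purely illustrative:

```python
# OLTP-style row store: one dict per transaction, records kept whole.
row_store = [
    {"order_id": 1, "customer": "a", "amount": 30},
    {"order_id": 2, "customer": "b", "amount": 50},
]
# A checkout touches exactly one row.
row_store.append({"order_id": 3, "customer": "a", "amount": 20})

# OLAP-style column store: one list per column, values kept together.
col_store = {
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount":   [30, 50, 20],
}
# An analytical query scans a single column and never materializes full rows.
print(sum(col_store["amount"]))  # total revenue -> 100
```

The same data, two physical layouts: which one is fast depends entirely on whether you touch one row or one column, which is why the two engine families keep optimizing for different things.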

Tristan: We have needed databases to process transactions as long as we’ve had software. And you could certainly get people who know much more than me about the early days of that ecosystem, but you’d have to trace it back to whatever, like, the mainframe [crosstalk 00:32:03] Google database system.

Matt: Yeah, it’s like the airline booking systems.

Tristan: Yeah, right. You could probably draw some exponential curve of the number of software applications out there in the world. And almost every software application needs some way to store state, it needs a database. And so the growth of OLTP has been, I think, pretty consistent over time. And for a long time, if you were going to do analytics, you reused whatever system you were using for your transaction processing system. I mean, even I started my career, like, writing queries on top of Oracle’s OLTP database and MySQL and stuff like this. And it was bad, but as long as your data wasn’t huge, it wasn’t a giant problem. And so why did OLAP start to become a bigger thing? It’s just the rise of the internet. Like, the rise of the internet led to clickstream data, led to advertising data, and the data volumes went up. And so you developed more use cases for which you needed the capability to process larger sets of data.

I still think, and you folks probably have better market research on this than I do, but my guess is that the OLTP world, from a pure dollars perspective, is probably still significantly larger. But it’s also a little more stable. We’ve been doing this for a long time; the growth rate is probably pretty consistent. And so the novelty is in analytical databases. That’s why you see companies like Databricks and Snowflake kind of come from nothing: I think the folks who had done OLTP databases for a long time didn’t anticipate just how big an opportunity there was here.

Matt: Oh, that’s interesting. So it was a little bit overlooked by the OLTP guys.

Tristan: I think so. And now they’re kind of, like, backwards integrating.

Jennifer: And on the point of what is driving storage and compute workloads, my speculation on these acquisitions is also about what type of workloads these players want to see on top of their platforms. All the OLTP databases have, at this point, added vector search capabilities, and I think that’s a majority of the workload when you’re thinking about AI. That’s driving a ton of usage on top: people trying to leverage the data in the database to build applications. And OLAP has a role to play in that, but it’s still not as direct as these OLTP databases. There’s going to be a lot of synergy between the two, leveraging one for predictive, maybe more batch workloads, and the other for more of these forward-looking use cases.

More About This Podcast

Artificial intelligence is changing everything from art to enterprise IT, and a16z is watching all of it with a close eye. This podcast features discussions with leading AI engineers, founders, and experts, as well as our general partners, about where the technology and industry are heading.
