Data, data, data — it’s long been a buzzword in the industry, whether big data, streaming data, data analytics, data science, even AI & machine learning — but data alone is not enough: it takes an entire system of tools and technology to extract value from data.
A multibillion-dollar industry has emerged around data tools and technologies. With so much excitement and innovation in the space, how exactly do all these tools fit together?
This podcast — a hallway style conversation between Ali Ghodsi, CEO and founder of Databricks, and a16z general partner Martin Casado — explores the evolution of data architectures, including some quick history, where they’re going, and a surprising use case for streaming data, as well as Ali’s take on how he’d architect the picks and shovels that handle data end-to-end today.
Ali: It kind of started in the ’80s. Business leaders were flying blind, not knowing how the business was doing, waiting for finance to close the books. This data warehousing paradigm came about where they said, “Look, we have all this data in these operational data systems. Why don’t we just get all that data, take it out of all these systems, transform it into a central place, let’s call it a data warehouse, and then we can get business intelligence on that data?”
And it was just a major transformation because now you could have dashboards. You could know how your product was selling by region, by SKU, by geography. That has created a market of at least $20 billion that has been around for quite a few decades now.
But about 10 years ago, this technology started seeing some challenges. One, more and more data types, like video and audio, started coming about, and there’s no way you can store any of that in data warehouses.
Second, they were on-prem big boxes that you had to buy. And they coupled storage and compute, so it became really expensive to scale them up and down.
And third, people wanted to do more and more machine learning and AI on these data sets. They saw that we can ask future-looking questions. “Which of my customers are going to churn? Which of my products are going to sell? Which campaigns should I be offering to who?”
The data lake came about 10 years ago. And the idea was, “Here’s really cheap storage, dump all your data here, and you can get all those insights.” But it turns out that when you just dump all your data in a central location, it’s hard to make sense of the data that’s sitting there. As a result, what people are doing now is they’re taking subsets of that data and moving them into classic data warehouses in the cloud.
So, we’ve ended up with an architectural mess that’s inferior to what we had in the ’80s, where we have data in two places, in the data lake and in the data warehouse, and where the staleness and the recency are not great.
In the last two to three years, there’s some really interesting technological breakthroughs that are enabling a new kind of design pattern. We refer to it as the lakehouse. And the idea is: what if you could actually do BI directly on your data lake? And what if you could do your reporting directly on your data lake, and you could do your data science and your machine learning straight up on the data lake?
Martin: I would love to tease apart a few things that have led us here. There’s very clearly a large existing data warehouse market around BI and analytics, typified by people using SQL on structured data.
It seems like the ML and AI use case is a little bit different than the analytics use case. The analytics use case is normally human beings looking at dashboards and making decisions, whereas in the ML/AI use case, you’re creating these models and those models are actually put into production and are part of the product. They’re doing pricing, they’re doing fraud detection, they’re doing underwriting, etc.
The analytics market is an existing buying behavior and an existing customer. ML/AI is an emerging market. And so the core question is: are we actually seeing the emergence of multiple markets or is this one market?
Ali: There are big similarities, and there are big differences. Let’s start with the similarities. Roughly the same data is needed for both. There’s no doubt that, when it comes to AI and machine learning, a lot of the secret sauce to getting really great results or predictions comes from augmenting your data with additional metadata that you have.
In some sense, we have the same data, and you’re asking analytical questions. The only difference is one is backward-looking, one is future-looking. But other than that, you want to do the same kinds of things with the data. You want to prepare it. You want to have it so that you can make sense of it. If you have structural problems with your data, that also causes problems for machine learning.
The differences today are that it’s a line of business that’s typically doing AI and data science or hardcore R&D. Whereas data warehousing and BI oftentimes sit in IT. Users of the data warehouse and the BI tools are data analysts and business analysts. In the case of machine learning, we have data scientists and machine learning engineers. So, the personas are different and sit in a different place in the organization. Those people have different backgrounds, and they have different requirements for the products they’re using today.
Martin: If you talk to some folks that come from the traditional analyst side, they’ll say, “AI and ML is cool, but if you really look at what they’re doing, they’re just doing simple regressions. Why don’t we just use the traditional model of data warehouses with SQL, and then we’ll just extend SQL to do basic regressions, and we’ll cover 99% of the use cases?”
Ali: Yeah, that’s interesting that you ask because we actually tried that at UC Berkeley. There was a research project that looked at: Is there a way we can take an existing relational model and augment it with machine learning?
And after five years, they realized that it’s actually really hard to bolt machine learning and data science on top of these systems. The reason is a little bit technical — it just has to do with the fact that these are iterative, recursive algorithms that continue improving the statistical measure until it reaches a certain threshold and then they stop. That’s hard to implement on top of data warehousing.
If you look at the papers that were published out of that project, the conclusion was we have to really hack it hard, and it’s not going to be pretty. If you’re thinking of the relational Codd model with SQL on top of it, it’s not sufficient for doing things like deep learning and so on.
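As an illustration of the kind of algorithm Ali describes, here is a minimal sketch of an iterative fit that keeps improving a statistical measure and stops once the improvement falls below a threshold. It is not taken from the Berkeley project; the function name `fit_line`, the toy data, and the learning-rate and tolerance values are all assumptions chosen for the example. The point is the shape of the computation: a loop whose number of passes over the data depends on the data itself, which is awkward to express as a single declarative query.

```python
def fit_line(xs, ys, lr=0.05, tol=1e-12, max_iters=100_000):
    """Fit y = w*x + b by gradient descent on mean squared error.

    Stops as soon as one more pass over the data improves the loss
    by less than `tol`, or after `max_iters` passes.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    prev_loss = float("inf")
    for step in range(1, max_iters + 1):
        # One full pass over the data to measure the current error.
        errs = [(w * x + b) - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errs) / n
        # Data-dependent stopping condition: quit when progress stalls.
        if prev_loss - loss < tol:
            return w, b, step
        prev_loss = loss
        # Move the parameters a small step against the gradient.
        grad_w = 2 * sum(e * x for e, x in zip(errs, xs)) / n
        grad_b = 2 * sum(errs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, max_iters


# Toy data generated from y = 2x + 1 (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b, steps = fit_line(xs, ys)
print(f"w={w:.4f} b={b:.4f} after {steps} passes")
```

Each pass reads the whole data set and feeds its result into the next pass, so the work cannot be planned up front the way a single SQL query can.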
Martin: Is the same statement true about going from something architected for AI and ML and then having it support more of a traditional analyst relational model?
Ali: So, interestingly, I think the answer is no, because there is now a widespread data science API that has emerged as the lingua franca for the data scientists: data frames.
A data frame essentially is a way where you can take your data and turn it into tables and start doing queries on it. That sounds a lot like SQL, but it’s not, because it’s actually built with programming language support so you can do that in programming languages, like Python or R, which enables you to do data science.
So, now your data is in tables, and it turns out you can now also build SQL on top of data frames. You can get a marriage between the world of data science and machine learning and the world of BI and data analytics, using data frames.
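To make the "SQL on top of data frames" idea concrete, here is a toy sketch using only the Python standard library. The `Frame` class and its methods are hypothetical stand-ins for a real data frame library (which would be far richer), and SQLite is used only as a convenient SQL engine for the demo. It shows the same table being queried two ways: through method calls in the host language, and through a SQL statement.

```python
import sqlite3


class Frame:
    """A toy data frame: a table held as a list of dict rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, predicate):
        # Keep only the rows for which predicate(row) is true.
        return Frame(r for r in self.rows if predicate(r))

    def group_sum(self, key, value):
        # Sum `value` per distinct `key`, like GROUP BY key / SUM(value).
        totals = {}
        for r in self.rows:
            totals[r[key]] = totals.get(r[key], 0) + r[value]
        return totals

    def to_sqlite(self, conn, table):
        # Expose the same rows to a SQL engine as a table.
        cols = list(self.rows[0].keys())
        conn.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
        placeholders = ", ".join("?" for _ in cols)
        conn.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})",
            [tuple(r[c] for c in cols) for r in self.rows],
        )


# Hypothetical sales data.
sales = Frame([
    {"region": "EMEA", "sku": "A", "units": 10},
    {"region": "EMEA", "sku": "B", "units": 5},
    {"region": "APAC", "sku": "A", "units": 7},
    {"region": "APAC", "sku": "B", "units": 0},
])

# Data-frame style: ordinary Python calls, composable with any other code.
by_region_df = sales.filter(lambda r: r["units"] > 0).group_sum("region", "units")

# SQL style: the same question asked of the same table.
conn = sqlite3.connect(":memory:")
sales.to_sqlite(conn, "sales")
by_region_sql = dict(conn.execute(
    "SELECT region, SUM(units) FROM sales WHERE units > 0 GROUP BY region"
))

print(by_region_df, by_region_sql)
```

Because both interfaces sit on one tabular representation, an analyst writing SQL and a data scientist writing Python can work against the same data.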
Martin: I get what you’re saying about the data warehouse, but there’s a lot more than just the data sitting in the data warehouse. You still have this entire world of data and SQL and ETL. Is there a dissonance there or do they stay two worlds? What happens?
Ali: Every enterprise we talk to, they have the majority of their data in the data lake today, and a subset of it goes into the data warehouse.
There’s a two-step ETL that they do. The first ETL step is getting into the data lake, and then there’s a second ETL step that they use to move it to the warehouse. So, organizations are paying a hefty price for this architectural redundancy.
But the question is: do you really need two copies of it? And do you really have to maintain those two copies and keep them in sync? Are you going to have a world in which you have all your data in the data lake and then you do your machine learning and data science on it, and then subsets of it move again into a data warehouse, where you clean it up and put it in that structured form so you can do SQL and BI, or can we do it all in one place?
Martin: Let’s actually ask that specific question. Because even though the AI-ML is a large market with a lot of value, there’s a ton of existing workflow around BI.
You’ve got all the dashboarding and tools that are based on SQL for data warehouses, but then you also have folks that want to interact with the data very quickly and will use something like ClickHouse or Druid in order to do that in OLAP. OLAP stands for online analytical processing and is effectively a fast interface that supports fast queries. Then you’ve got more traditional batch processing, for which folks have normally thought about Spark. What you’re saying is that you can combine all of these things in the same data lake, including OLAP query loads?
Ali: Yes, I actually think you can get all the way there. The data lakes are a broad source. Big, large, cheap storage, but kind of data swamps.
It turns out there are some recent technological breakthroughs that show you how you can basically turn them into a structured relational storage system. The way you do that is you build transactionality into these data lakes.
Once you have that, you can now start adding things, like schemas, on top of them. Once you add schemas on top of them, you can add quality metrics. And once you have that, you can start reasoning about your data as structured data in tables instead of data that’s just files.
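The layering Ali describes (transactions first, then schemas on top) can be sketched roughly as follows. This is an in-memory toy under stated assumptions, not the design of any particular product: `LakeTable`, `SchemaError`, and the file-naming scheme are hypothetical. Data files are written first and only become visible when a single entry is appended to an ordered commit log, and every write is checked against a declared schema before it is committed.

```python
class SchemaError(ValueError):
    """Raised when rows do not match the table's declared schema."""


class LakeTable:
    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> expected Python type
        self.files = {}             # file name -> rows (stands in for object storage)
        self.log = []               # ordered commits: {"add": [...], "remove": [...]}

    def _validate(self, rows):
        for row in rows:
            if set(row) != set(self.schema):
                raise SchemaError(f"columns {sorted(row)} do not match schema")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise SchemaError(f"{col}={row[col]!r} is not {typ.__name__}")

    def append(self, rows):
        rows = list(rows)
        self._validate(rows)                 # reject bad data before anything is visible
        name = f"part-{len(self.files):05d}"
        self.files[name] = rows              # step 1: write the data file
        self.log.append({"add": [name], "remove": []})  # step 2: one log entry publishes it
        return len(self.log)                 # the new table version

    def snapshot(self, version=None):
        # Replay the log up to `version` to find which files are live.
        version = len(self.log) if version is None else version
        live = []
        for commit in self.log[:version]:
            live = [f for f in live if f not in commit["remove"]] + commit["add"]
        return [row for f in live for row in self.files[f]]


table = LakeTable({"user_id": int, "country": str})
v1 = table.append([{"user_id": 1, "country": "SE"}])
v2 = table.append([{"user_id": 2, "country": "US"}])

rejected = False
try:
    table.append([{"user_id": "3", "country": "DE"}])  # wrong type for user_id
except SchemaError:
    rejected = True

print(len(table.snapshot()), len(table.snapshot(v1)), rejected)
```

Readers only ever see whole commits, and older versions stay queryable by replaying a prefix of the log, which is what lets files in cheap storage be treated as a table.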
Martin: I get putting structure on top of a blob store, but you still need a query layer, right? Building a query engine that’s super-fast that can respond to analytical queries, there’s entire companies that do that.
Ali: Yeah, so it turns out there’s two APIs you need. One is the data frame API. That’ll enable all the data science and machine learning. Then you can build a SQL layer on top of it, and there’s nothing that really gets in the way of making this as performant as the state-of-the-art, fastest MPP engines out there. You can apply the same tricks now because you’re actually dealing with structured data.
Martin: It feels like, especially in data, there’s always kind of the trend du jour that everybody’s excited about, but they’re not ever really sure if the market’s real or not. People have been saying this a lot for real-time and streaming use cases.
It’s very clear that people want to process data at different times and speeds. Batch, we know, is a very large market, where you’ve got a bunch of data, you want to do a whole bunch of processing, and then it’s stored somewhere else and you do some queries.
More and more people are talking about streaming analytics, where as a stream comes in you do the queries before it hits disk.
I sit in pitches basically as a full-time job, and a lot of the things motivating the streaming use case seem a little contrived.
Ali: There’s the latency and the speed and how fast you can get this stuff. That’s one side of the equation, and that’s what everybody focuses on.
Oftentimes when we ask the business leader, “Hey, so what kind of latency would be okay with you?” they’ll say, “We want it to be superfast, like every 5 minutes, every 10 minutes.” And you can accomplish that with batch systems.
Then when you dig into, “Wouldn’t you want it to be even faster?” it turns out that in streaming systems, the weakest link will dictate the latency. There’ll be some upstream process that has nothing to do with the system that you’re putting in place. And if that upstream link, that one place where you’re loading the data in or something, is coming in every half an hour, then it doesn’t matter how fast the rest is.
I think this obsession with actual latency, with “We need it in less than 5 milliseconds,” is misplaced. For most use cases, you don’t have that requirement.
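The weakest-link argument can be checked with simple arithmetic. The sketch below assumes a simplified model that is not from the conversation: each stage publishes on a fixed interval, so a record may wait up to one full interval at every stage, and the worst-case age of data at the end of the pipeline is the sum of those intervals. The function name and the numbers are hypothetical.

```python
def worst_case_staleness_s(stage_intervals_s):
    """Upper bound on data age at the end of a pipeline, in seconds.

    Assumes each stage emits on a fixed interval, so a record can wait
    up to one full interval at every stage.
    """
    return sum(stage_intervals_s)


UPSTREAM_LOAD_S = 30 * 60  # hypothetical upstream feed that lands every half hour

# Downstream processing as a 5-minute batch job versus a 1-second streaming job.
batch = worst_case_staleness_s([UPSTREAM_LOAD_S, 5 * 60])
streaming = worst_case_staleness_s([UPSTREAM_LOAD_S, 1])

print(f"batch: {batch / 60:.1f} min, streaming: {streaming / 60:.1f} min")
```

Under this model, replacing the batch job with a streaming job takes the bound from 35 minutes to just over 30: the half-hour upstream load dominates either way.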
There’s another side of the equation, which people don’t focus on because it’s harder to understand or explain, but it might be the biggest benefit out of these streaming systems, which is, it takes care of all the data operations for you.
If you don’t have a real-time streaming system, you have to deal with things like, okay, so data arrives every day. I’m going to take it in here. I’m going to add it over there. Well, how do I reconcile? What if some of that data is late? I need to join two tables, but that table is not here. So, maybe I’ll wait a little bit, and I’ll rerun it again. And then maybe once a week, I rerun the whole thing from scratch just to make sure everything is consistent.
In some sense, all the ETL that people are doing today and all the data processing that they’re doing today could be simplified if you actually turn it into a streaming case, because the streaming engines take care of the operationalization for you. You don’t have to worry anymore: “did this data arrive late? Are we still waiting on it? Is the thing consistent?” They’ll take care of all of that.
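As a rough illustration of the bookkeeping a streaming engine takes over, here is a toy event-time aggregator. It is a sketch under simplifying assumptions (integer timestamps, a single stream, counting only), and `WindowedCounter` and its parameters are hypothetical rather than any engine's API. It updates per-window counts incrementally, folds in late records as long as their window is still open, and uses a watermark to decide when a window's result is final, so nobody has to rerun the job by hand to pick up stragglers.

```python
from collections import defaultdict


class WindowedCounter:
    """Counts events per fixed event-time window, tolerating late arrivals."""

    def __init__(self, window_s, allowed_lateness_s):
        self.window_s = window_s
        self.allowed_lateness_s = allowed_lateness_s
        self.counts = defaultdict(int)  # window start -> running count
        self.max_event_time = 0
        self.dropped = 0

    @property
    def watermark(self):
        # Event time before which we no longer expect records to show up.
        return self.max_event_time - self.allowed_lateness_s

    def on_event(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)
        window_start = event_time - event_time % self.window_s
        if window_start + self.window_s <= self.watermark:
            self.dropped += 1  # window already finalized: too late to include
            return
        self.counts[window_start] += 1  # incremental update, no full rerun

    def finalized(self):
        # Windows that ended at or before the watermark will not change again.
        return {
            start: count
            for start, count in sorted(self.counts.items())
            if start + self.window_s <= self.watermark
        }


counter = WindowedCounter(window_s=60, allowed_lateness_s=30)
# Event times in arrival order: 50 arrives late but in time; 10 arrives too late.
for t in [5, 20, 70, 50, 130, 10, 65]:
    counter.on_event(t)

print(dict(counter.counts), counter.finalized(), counter.dropped)
```

The questions Ali lists ("did this data arrive late? Are we still waiting on it?") become properties of the engine, here the watermark and the allowed lateness, instead of manual reruns.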
Martin: You think ultimately a large part of this becomes stream processing?
Ali: What I’m saying, provocatively, is that in some sense all of the batch data that’s out there is a potential use case for streaming.
I think that stream processing systems have been too complicated to use, but actually under the hood they take care of a lot of data ops that people are doing manually today.
Martin: I would love to talk through what you think a modern data stack looks like. We talked to a whole bunch of folks, and it seems there’s a best practices stack forming, but very, very few people know what it looks like.
Let’s say you get hired, Ali. You have a new job, VP of Data, and you were to build a data infrastructure that does both analytics and AI-ML, what product category — not specific products, but product categories — would you use end to end?
Ali: If I get hired into a big company, I’ll spend the next five years fighting political battles on who owns which part of the stack, and which technology I would need to get rid of. There’s a lot of org chart, and human, and process problems, but let’s say, I get in there and they say, he gets to have it his way.
Martin: He’s got all the juice, that’s right.
Ali: Obviously, trying to do something on-prem makes absolutely no sense at this point. And when you’re building that cloud-native architecture, don’t try to replicate what you had in the past on-prem. Don’t think of it as big clusters that are going to be shared by users.
One big change that happens in the cloud that on-prem vendors don’t think of often is that the networks in the cloud are invisible. Any two machines can communicate at full speed, and they can also communicate with the storage system, the data lake, at full speed. This was not the case on-prem, where things like Hadoop had to optimize where you put the data, and the computation had to be close to the data.
So, you move it into the cloud. Typically, you have data flowing in from some of your systems. Depending on what kind of business you’re in, you have IoT devices or you have something from your web apps. Sometimes it goes to streaming queuing systems, like Kafka. And from there, it lands in the data lake.
Martin: Into the data lake. So you’re saying the data goes directly into your data lake.
Ali: That’s the first landing place. If you don’t do that, you’re actually going to go back a decade or two in the evolution. Because if you don’t put it into the data lake, then you have to immediately decide what schema you’re going to have. And that’s hard to get right from the beginning. The good news with data lakes is you don’t have to decide the schema. Just dump it there.
Step number two, you need to build a structural transactional layer on top of it, so that you can actually make sense of it. There’s three or four of those technologies that appeared roughly at the same time, and they all enable you to take your data lake and turn it into a lakehouse.
Step number three. You need some kind of interactive data science environment where you can start interactively working on your data and getting insights from it.
Typically, people have Notebooks-based solutions, where they can iterate with Notebooks. They use things like Spark under the hood, and they’re interactively processing their data and getting insights from it.
And that’s really important because a lot of data science in organizations ends up not being advanced machine learning. It ends up being, okay, so we have this data coming in from our products or from our devices or whatever it is. We have to massage it, get it in a good form, and we need to get some basic insights out of it.
If you want to get into the predictive game, you need a machine learning platform. There are now these machine learning platforms that are emerging, but many of them are proprietary, built inside companies. You can read about them, but you can’t get your hands on one.
Martin: And this is for operational ML?
Ali: This actually goes from training the ML model (featurizing the data, creating a model that can do the predictions, tracking the results, making sure that you can make them reproducible and reason about them) to moving it into production, which is the hardest part. Moving it into production where you can actually serve it inside products. That’s the job of the machine learning platform.
Martin: And the people that use the machine learning platform in your world are the data scientists, the data engineers, or both?
Ali: It’s different organizations, today, unfortunately. The serving part, the production part sometimes is owned by IT, and the creating of the models happens by data scientists that sit in the line of business.
And there is friction in those organizations, because IT operates at a different wavelength from the data scientists, but the machine learning platform needs to span both. If it doesn’t, you’re not going to get the full value out of the machine learning work that you’re doing.
Martin: Can you talk a little bit about where the data pipeline and DAG tools fit into all this?
Ali: That would be the first step of this. I talked about training immediately. But the hardest part really is to take that data that’s now sitting in the data lake and build the pipelines that featurize it and get it in the right shape and form so that you can start doing machine learning on it. So, that’s step number one. Then, after that, you start training the models.
To orchestrate that automatically and make that workflow just happen, you need software that does that, so that’s definitely the first mile in the ML platform.
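A pipeline of this kind is usually modeled as a DAG: each step declares what it depends on, and an orchestrator runs the steps in an order that respects those dependencies. The sketch below uses Python's standard `graphlib` module (Python 3.9+) for the ordering. The `run_pipeline` helper, the step names, and the toy data are hypothetical, and real orchestration tools add scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, deps):
    """Run each task after its dependencies, passing their results as inputs."""
    order = list(TopologicalSorter(deps).static_order())
    results = {}
    for name in order:
        inputs = [results[d] for d in deps.get(name, ())]
        results[name] = tasks[name](*inputs)
    return order, results


def ingest():
    # Raw records as they might land in the data lake (hypothetical data).
    return [
        {"user": "a", "spend": "10"},
        {"user": "b", "spend": "oops"},
        {"user": "a", "spend": "5"},
    ]


def clean(rows):
    # Drop malformed records and convert spend to a number.
    return [
        {"user": r["user"], "spend": float(r["spend"])}
        for r in rows
        if r["spend"].isdigit()
    ]


def featurize(rows):
    # One feature per user: total spend.
    totals = {}
    for r in rows:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["spend"]
    return totals


def train(features):
    # Stand-in for model training: a mean-spend baseline.
    return {"baseline": sum(features.values()) / len(features)}


tasks = {"ingest": ingest, "clean": clean, "featurize": featurize, "train": train}
# Each step maps to the steps it depends on.
deps = {
    "ingest": (),
    "clean": ("ingest",),
    "featurize": ("clean",),
    "train": ("featurize",),
}

order, results = run_pipeline(tasks, deps)
print(order, results["featurize"], results["train"])
```

Declaring dependencies rather than hard-coding a sequence is what lets the workflow "just happen": the orchestrator derives the order, and adding a step means adding one entry to the graph.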
Martin: And if I want to take my traditional BI dashboard and attach it to this, where does that attach?
Ali: That’s the last mile. BI itself typically uses something like JDBC/ODBC. To make that really fast and snappy and work on top of the data, you need some capability that makes that possible.
In the past, your only option has been to put it in a data warehouse, and then attach your BI tool to it. I’m claiming that with the lakehouse pattern that we’re seeing, and with some of those technological breakthroughs I mentioned, you could connect your BI tool directly now on that data lake.
Martin: To where? To the transactional layer that’s built on top of it?
Ali: Yep, if you have something like Delta Lake or if you have something like Iceberg or Hive ACID, you could connect it to those directly.
Martin: If you didn’t have any legacy technology, it seems like doing a data lake makes a lot of sense. Is there a simple migration path to this?
Ali: I think it’s harder in the West. In Asia, it’s easier because there’s not lots of legacy. It’s harder in the West because the enterprises have 40 years of technology that they’ve bought and installed app data in and configured. They need to make that work with what we’re talking about.
Whereas if you’re building it from a clean slate, you can actually get it right more easily.
Martin: Are you actually seeing more usage of data lakes for companies that aren’t encumbered by legacy?
Ali: The companies that are really succeeding with this stuff… take an Uber. They’re doing predictions, and the predictions are a competitive advantage. You press a button, and within a second, it tells you what the price of the ride is. It basically simulated the ride. It knows what that meter is going to tell you after an hour ride with traffic conditions and everything. It gives you exactly the right price — can’t overprice, can’t underprice. It matches supply and demand of drivers with surge pricing. It can even put people in the same car to lower the cost.
All of these are machine learning use cases, and the companies behind those stacks are all about 10 years old. They didn’t exist before that. They don’t have lots of legacy data warehouses and legacy systems. They built it custom for this use case, and it’s a huge competitive advantage.
Martin: Is this the durable stack that lasts for the next decade, or is this converging on something that looks a little bit different than you can articulate from here?
Ali: I can’t predict the future, but I’ll tell you a few ingredients of it that just make sense long-term.
If I’m an enterprise and I’m sitting there as a CIO or someone that’s picking the data strategy, I would make sure that whatever I’m building is multi-cloud. There’s a lot of innovation happening between the different cloud vendors. They have deep pockets, and there’s sort of an arms race there, so make sure that you have something that’s multi-cloud.
The second thing I would do is, as much as possible, try to base it on open standards and open source technology. That gives you the most flexibility, so that if the space changes again, you can move things. Otherwise, you find yourself locked into a technology stack the way you were locked in to technologies from the ’80s and ’90s and 2000s.
Storing all your data, dumping it first in raw format into a data lake, is something that’s going to remain because there’s so much data that’s being collected. You don’t have time to figure out exact perfect schemas for it and what we’re going to do with it. So, either we dump it somewhere, or we throw it away, and no one wants to be that employee that threw away the data, especially when it’s so cheap to store it.
And the third thing I would do is I would make sure that the stack that you’re building, the way you’re laying it out, has machine learning and data science as first-class citizens. Machine learning platforms didn’t exist 15 years ago, so that probably will change quite a bit. I don’t think the exact shape of the machine learning platform will look the way it does today.
But many of the ingredients are right.
Martin: Perfect. Thank you so much. I don’t know if we’re on a race to see who speaks faster, but I think you win.
Ali: Thank you for having me.
Ali Ghodsi is a cofounder and CEO of Databricks, the AI infrastructure for the enterprise.
Martin Casado is a General Partner at Andreessen Horowitz, where he focuses on enterprise investing.
The a16z Podcast discusses the most important ideas within technology with the people building it. Each episode aims to put listeners ahead of the curve, covering topics like AI, energy, genomics, space, and more.