It’s well understood that artificial intelligence is advancing like mad right now. What’s less appreciated is the role that data and infrastructure continue to play in these advances — whether that’s adding new data sources to train better models, building the data infrastructure to support AI workloads, or taking advantage of more powerful hardware to do all sorts of new things. And, of course, lost in all the excitement around AI is the fact that good, old-fashioned data analysis is still a major enterprise workload and continues to see its own fair share of innovation.
We recently held our Data and AI Forum in New York City, featuring talks from a collection of our founders and other leaders in the space about where the world of data is heading. Here are some highlights (edited for readability) from the founders building products across the spectrum of use cases.
“Our most fulfilled, amazing days as humans are the days that we are spending doing creative and interesting work and not doing the tedious drudgery stuff. And I think AI is here to help us achieve that state of fulfillment.
“I’ve been working in data, data science, and data analytics my whole career. I am now the founder and CEO of a company that builds a data science and analytics tool, and our product is used by thousands of data practitioners every day. And we see them do some really creative, interesting stuff.
“I think data practitioners are creatives. I know it’s not the first thing that comes to mind — when I say ‘creatives,’ I think of artists or whatever — but think about what data scientists do in their day. They’re asking questions, they’re forming hypotheses, they’re testing new things, they’re building narratives, they’re taking risks, they’re telling stories. This is good data science, it’s good data analytics. And it’s what we expect from our data teams. It’s an art and a science and a great use of human time.
“But data work can also be really tedious. You spend a lot of time writing boilerplate and fixing dependencies and tracking down missing parentheses in a query. It can be more plumbing than science sometimes. This is where I think people wind up spending a lot of their time, and it really is a blocker to them doing their best work. So this really feels like a perfect opportunity to bring human-computer symbiosis into this creative profession.
“Now, most people, when they think of this, assume it means just replacing data teams with a ‘magic insights’ text box. Like, the next step is we’ll all buy solutions into which our stakeholders or executives will come in, they’ll write a question, and it’ll give them a magic response back. You know: properly formatted charts and well-reasoned explanations and full business context. But that doesn’t really work.
“And it doesn’t work, one, because these models aren’t perfect. They can hallucinate, they’re missing a lot of context, they don’t understand the full situation of things. But it also doesn’t work because humans want to be able to hear a story, and understand, and ask and answer questions of a human around these things.”
“Even though AI has this power to enable us to get more value out of our content, it’s really challenging to do that. There’s no such thing as a free lunch. And I think that there are four main challenges that prevent organizations and businesses from really being able to unlock this data right now.
“The first one is scale. When we think about unstructured text and visual data, it’s orders of magnitude larger than today’s big data. So to put that into perspective: If we had 10 million rows of tabular data, that’s around 40 megabytes. We can think of that as being like [the surface] area of Lake Tahoe in California, which is around 496 square kilometers.
“If we were to think about 10 million text documents, we go from 40 megabytes to 40 gigabytes. And now we have something that’s more on the scale of the Caspian Sea — 371,000 square kilometers of space. It’s three orders of magnitude more data, in terms of volume, than when we think about tabular data.
“And then when we think about visual data, if we had 10 million images, that would be 20 terabytes of data. That’s another three orders of magnitude bigger. That’s like the Pacific Ocean in terms of the sheer scale of data volume. . . .
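Those magnitude jumps are easy to verify with a quick back-of-the-envelope calculation. The per-item sizes below are the rough averages implied by the talk's figures, not measurements:

```python
# Back-of-the-envelope check on the scale jumps described above.
# Per-item byte sizes are the averages implied by the talk's figures.
ITEMS = 10_000_000

bytes_per_item = {
    "tabular rows":   4,          # 10M rows   -> ~40 MB (Lake Tahoe)
    "text documents": 4_000,      # 10M docs   -> ~40 GB (Caspian Sea)
    "images":         2_000_000,  # 10M images -> ~20 TB (Pacific Ocean)
}

for name, size in bytes_per_item.items():
    total_gb = ITEMS * size / 1e9
    print(f"{name:>14}: {total_gb:>12,.2f} GB")
```

Each step is three orders of magnitude: 40 MB to 40 GB to 20 TB, which is what makes today's "big data" tooling feel like a rowboat at the last step.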
“Right now, when we think about big data or data lakes, we have these tools and vehicles that can process that efficiently. But that’s kind of like having a rowboat or a canoe: It’ll get you across a lake, but I wouldn’t trust that if you’re trying to cross the Pacific Ocean.
“In order to actually be able to unlock the value from this richer, more contextual data that we get with content, we actually need to create tools and infrastructure to process that. It’s probably going to be a similar shape — in terms of how a seagoing boat looks somewhat similar to a rowboat — but the scale and the processing of it will have to be completely different. We’ll need to prepare ourselves for the sheer volume and scale that we’re thinking about when we move from a tabular view of the world to more of a content view of the world.”
“I think one really interesting thing that’s happening and is changing the way systems need to be architected is that what is considered big data is actually increasing. When Google came out with the MapReduce paper in 2004, there were a lot of workloads that you had to spread across multiple machines because machines were pretty small. Like, the first AWS instances had a gigabyte of RAM and one CPU.
“Now, [you can rent AWS instances with hundreds of processors and terabytes of RAM]. There are very few workloads that won’t fit into that amount of hardware. . . .
“I think there’s a bunch of things that have to be true in order for you to really need big data systems: You’ve got lots of data. You need it all. You do need it all at once. The amount you’re using doesn’t fit on a machine. You can’t get rid of that data and you can’t summarize it. OK, then you need some fancy scale-out system.
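Read as a checklist, those conditions only justify a distributed system when all of them hold at once. Here is a minimal sketch of that logic; the function and parameter names are mine, not the speaker's:

```python
def needs_scale_out(
    working_set_bytes: int,
    machine_bytes: int,
    need_all_of_it: bool,
    need_it_all_at_once: bool,
    can_drop_or_summarize: bool,
) -> bool:
    """Every condition must hold before a 'fancy scale-out system'
    is required; otherwise a single big machine will do."""
    return (
        working_set_bytes > machine_bytes
        and need_all_of_it
        and need_it_all_at_once
        and not can_drop_or_summarize
    )

# A 2 TB working set on a machine with 4 TB of RAM: scale-up is enough.
print(needs_scale_out(2 * 10**12, 4 * 10**12, True, True, False))  # False
```

The point of the conjunction is that failing any single condition — the data fits, or you can prune it, or you can summarize it — puts you back in single-machine territory.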
“So what does the world look like if data size isn’t the primary driver of your architecture? What are some things that you can do about it? One is: Don’t be afraid to scale up. Scale-up became a dirty word, I guess, once Google published the MapReduce paper. Everybody’s building these large-scale distributed systems but, actually, scale-up works really well if you clean up after your data. Just good data hygiene can get you pretty far.
“Another interesting one: If you have smaller data, you can push some of that out to the user. When we built BigQuery, one of the things we said was that, with large data, you want to move the compute to the data rather than the data to the compute. Laptops used to be synonymous with ‘underpowered,’ but, these days, M2 MacBooks are basically supercomputers. If you have smaller data sets, why not push the workloads out to them? . . . It’s a lot less expensive to do locally than it is to do in the cloud.”
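To make "push the workload out to the user" concrete: any in-process engine running on the laptop can answer analytical queries with no cloud round-trip. A minimal sketch using Python's built-in sqlite3 — my choice of engine for illustration, not one the talk prescribes:

```python
import sqlite3

# An in-process database: the query runs where the data already lives,
# with no round-trip to a cloud warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 9.99), (1, 4.50), (2, 20.00)],
)

# A typical small-data aggregation, computed locally and instantly.
rows = conn.execute(
    "SELECT user_id, SUM(amount) FROM events "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 14.49), (2, 20.0)]
conn.close()
```

For data sets that fit comfortably in a laptop's memory, this pattern sidesteps both network latency and per-query cloud cost.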
“There’s this Cambrian explosion of new data sources and new applications every single year . . . And what that creates, of course, is data silos. You now have your most valuable data in a variety of different database systems, and it creates a lot of vendor lock-in because many of these systems are proprietary in nature, which means you can only access that data through that particular system.
“So, this notion of centralizing your data, that model is much slower than it looks because you have to move all of the data out of all these different systems and get it into one place before you can do analysis. It limits your view to what is in that ‘enterprise data warehouse,’ which is never the complete truth about your business. You always have data in other places. And take it from me, having spent time at Teradata: not one of their customers had all of their data in Teradata — it’s just not possible.
“And, of course, there’s proprietary lock-in, and it can become very expensive. And that was really the challenge for many of these early databases: Oracle, Teradata, IBM DB2. They’re not bad databases by any stretch of the imagination. Even today, I would argue Teradata is a better database than Snowflake. But the market is moving away from them and that’s because it’s incredibly expensive and customers feel locked in.
“So, [the idea that you need to centralize] your data: not true, and also impossible. The truth is you need to optimize for decentralized data.”
“Most of the AI systems that are being trained today are trained on these public datasets, mostly data crawled from the web. And I think there’s actually still a decent amount of public data available. Even if we’re reaching the limits, say, of text, there are other modalities that folks are starting to explore — audio, video, images. I think there’s a lot of really rich data sources out there, still, on the web.
“There are also — I don’t know the exact magnitudes, but I imagine roughly a similar scale of — private datasets out there. And I think that’s going to be really important in certain applications. Imagine if you have a code-generation system, it’s great that it’s trained on all of public GitHub, but it might be even more useful if it’s trained on my own private code base. I think figuring out how to blend these public and private datasets is going to be really interesting. And I think it’s going to open up a whole bunch of new applications, too.
“From Character’s perspective, and I guess more generally, one of the things that we’re starting to see that is pretty exciting is this move from, you could call it, ‘static datasets’ — data that exists already out there, independent of AI systems. We’re moving now, I think, toward data sets that are being built with AI in the loop. And so you have what people often refer to as ‘data flywheels.’ You can imagine, say, for Character, we have all these rich interactions where a character is having a conversation with someone, and we get feedback on that conversation from the user, either explicitly or implicitly, and that’s really the perfect data to use to make that AI system better.
“And so we have these loops that I think are going to be really exciting and provide both richer and, perhaps, much larger data sources for the sort of next generation of systems.”
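A "data flywheel" of the sort described above can be sketched in a few lines: serve an interaction, attach user feedback, and keep the positively rated examples for the next training run. All names here are hypothetical — this is an illustration of the loop, not Character's actual system:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Interaction:
    prompt: str
    response: str
    feedback: Optional[int] = None  # +1 / -1 from the user, if given

@dataclass
class Flywheel:
    """AI-in-the-loop data collection: every served conversation is a
    candidate training example, filtered by user feedback."""
    log: list = field(default_factory=list)

    def record(self, prompt: str, response: str) -> Interaction:
        interaction = Interaction(prompt, response)
        self.log.append(interaction)
        return interaction

    def training_set(self) -> list:
        # Only positively rated interactions feed the next model.
        return [(i.prompt, i.response) for i in self.log if i.feedback == 1]

fw = Flywheel()
liked = fw.record("hi", "hey, good to see you")
liked.feedback = 1               # explicit thumbs-up from the user
fw.record("weather?", "banana")  # no positive signal: dropped
print(fw.training_set())  # [('hi', 'hey, good to see you')]
```

Implicit signals (did the user keep chatting, did they rephrase) can feed the same `feedback` field; the key property is that the dataset grows as a byproduct of serving the model.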
* * *
The views expressed here are those of the individual AH Capital Management, L.L.C. (“a16z”) personnel quoted and are not the views of a16z or its affiliates. Certain information contained in here has been obtained from third-party sources, including from portfolio companies of funds managed by a16z. While taken from sources believed to be reliable, a16z has not independently verified such information and makes no representations about the enduring accuracy of the information or its appropriateness for a given situation. In addition, this content may include third-party advertisements; a16z has not reviewed such advertisements and does not endorse any advertising content contained therein.
This content is provided for informational purposes only, and should not be relied upon as legal, business, investment, or tax advice. You should consult your own advisers as to those matters. References to any securities or digital assets are for illustrative purposes only, and do not constitute an investment recommendation or offer to provide investment advisory services. Furthermore, this content is not directed at nor intended for use by any investors or prospective investors, and may not under any circumstances be relied upon when making a decision to invest in any fund managed by a16z. (An offering to invest in an a16z fund will be made only by the private placement memorandum, subscription agreement, and other relevant documentation of any such fund and should be read in their entirety.) Any investments or portfolio companies mentioned, referred to, or described are not representative of all investments in vehicles managed by a16z, and there can be no assurance that the investments will be profitable or that other investments made in the future will have similar characteristics or results. A list of investments made by funds managed by Andreessen Horowitz (excluding investments for which the issuer has not provided permission for a16z to disclose publicly as well as unannounced investments in publicly traded digital assets) is available at https://a16z.com/investments/.
Charts and graphs provided within are for informational purposes solely and should not be relied upon when making any investment decision. Past performance is not indicative of future results. The content speaks only as of the date indicated. Any projections, estimates, forecasts, targets, prospects, and/or opinions expressed in these materials are subject to change without notice and may differ or be contrary to opinions expressed by others. Please see https://a16z.com/disclosures for additional important information.