Machine Learning + Big Data

Here at a16z, we treat “big data” and “machine learning” as connected activities. People have been talking about the need for more ‘analysis’ and insight in big data, which is obviously important, because we’ve been in the ‘collection’ phase with big data until now. But the innovation in the big data world that I’m most excited about is the ‘prediction’ phase — the ability to process the information we’ve collected, learn patterns, and predict unknowns based on what we’ve already seen.

Machine learning is to big data as human learning is to life experience: We interpolate and extrapolate from past experiences to deal with unfamiliar situations. Machine learning with big data will duplicate this behavior, at massive scale.

Where business intelligence before was about past aggregates (“How many red shoes have we sold in Kentucky?”), it will now demand predictive insights (“How many red shoes will we sell in Kentucky?”). An important implication of this is that machine learning will not be an activity in and of itself … it will be a property of every application. There won’t be a standalone function, “Hey, let’s use that tool to predict.”
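To make the difference between the two questions concrete, here is a minimal sketch in plain Python. The sales figures are made up, and the straight-line trend fit is only a stand-in for whatever model a real application would use; the function names `total_sold`, `fit_line`, and `forecast_next` are hypothetical.

```python
# Hypothetical monthly sales of red shoes in Kentucky (made-up numbers).
monthly_sales = [100, 110, 120, 130, 140, 150]

def total_sold(sales):
    """Past aggregate: how many red shoes have we sold?"""
    return sum(sales)

def fit_line(ys):
    """Least-squares fit of y = slope * x + intercept, with x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

def forecast_next(sales):
    """Predictive insight: how many red shoes will we sell next month?"""
    slope, intercept = fit_line(sales)
    return slope * len(sales) + intercept

print(total_sold(monthly_sales))     # the backward-looking aggregate
print(forecast_next(monthly_sales))  # the forward-looking estimate
```

The first function only summarizes what already happened; the second uses the same history to estimate a value nobody has observed yet, which is the shift described above.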

Take Salesforce, for example. Right now it just presents data, and the human user has to draw predictive insights in his or her own head. Yet most of us have been trained by Google, which uses information from millions of variables based on our own and others’ usage to tailor our user experience … why shouldn’t we expect the same here? Enterprise applications, in every use case imaginable, should and will become inherently more intelligent as the machine implicitly learns patterns in the data and derives insights. It will be like having an intelligent, experienced human assistant in everything we do.

The key here is more automated apps, where big data drives what the application does with no user intervention. (My colleague Frank Chen calls this the “big data inside” architecture for apps.)

But all of this forces, and benefits from, innovation at the infrastructure level.

Big Data needs Big Compute: Where Hadoop and Spark fit in the picture

Think of big data and machine learning as three steps (and phases of companies that have come out of this space): collect, analyze, and predict. These steps have been disconnected until now, because we’ve been building the ecosystem from the bottom up — experimenting with various architectural and tool choices — and building a set of practices around that.

The early Hadoop stack is an example of collecting and storing big data. It allows easier data processing across a large cluster of cheap commodity servers. But Hadoop MapReduce is a batch-oriented system, and doesn’t lend itself well to interactive applications, real-time operations like stream processing, and other, more sophisticated computations.

For predictive analytics, we need an infrastructure that’s much more responsive to human-scale interactivity: What’s happening today that may influence what happens tomorrow? A lot of iteration needs to occur on a continual basis for the system to get smart, for the machine to “learn”: explore the data, visualize it, build a model, ask a question, get an answer back, bring in other data, and repeat the process.

The more real-time and granular we can get, the more responsive, and more competitive, we can be.

Compare this to the old world of “small-data” business intelligence, where it was sufficient to have a small application engine that sat on top of a database. Now, we’re processing a thousand times more data, so to keep up the speed at that scale, we need a data engine that’s in-memory and parallel. And for big data to unlock the value of machine learning, we’re deploying it at the application layer. Which means “big data” needs “big compute”.

This is where Apache Spark comes in. Because it’s an in-memory, big-compute part of the stack, it can be up to a hundred times faster than Hadoop MapReduce on workloads that fit in memory. It also offers interactivity since it’s not limited to the batch model. Spark runs in many environments (including on Hadoop), and turns the big data processing environment into a real-time data capture and analytics environment.

* * *

We’ve invested in every level of the big data/big compute ecosystem, and this remains an exciting, active space for innovation, because big data computing is no longer the sole province of government agencies and big companies. Even though the early applications tend to show up in industries where data scientists have typically worked, machine learning as a property of all applications (especially when coupled with an accessible user interface) is democratizing who can do this kind of real-time computing and learning, on what, and where … and what great new companies can be built on top of it.

My belief is every application will be re-constituted to take advantage of this trend. And thanks to big data and big compute innovations, we finally have the ingredients to really make this happen. We’re at the threshold of a significant acceleration in machine intelligence that can benefit businesses and society at large.

– Peter Levine


Big Data is the collection of massive amounts of information, whether unstructured or structured.

Big Compute is the large-scale (often parallel) processing power required to extract value from Big Data.

Machine Learning is a branch of Computer Science that, instead of applying high-level algorithms to solve problems in explicit, imperative logic, applies low-level algorithms to discover patterns implicit in the data. (Think of this as the way the human brain learns from life experiences rather than from explicit instructions.) Generally, the more data, the more effective the learning, which is why machine learning and big data are intricately tied together.
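As a toy illustration of that contrast, the sketch below puts a hand-written rule next to a cutoff discovered from labeled examples. Everything here is hypothetical: the data is made up, the "learning" is a deliberately simple midpoint rule rather than a real machine learning algorithm, and it assumes the two groups of examples do not overlap.

```python
# Hypothetical labeled examples: (order amount, was it considered "large"?).
# The numbers are made up for illustration.
examples = [(20, False), (35, False), (50, False),
            (80, True), (95, True), (120, True)]

def is_large_explicit(amount):
    """Explicit, imperative logic: a person chose the cutoff of 100."""
    return amount > 100

def learn_threshold(labeled):
    """Discover a cutoff from the data: the midpoint between the largest
    negative example and the smallest positive example.
    Assumes the two groups do not overlap."""
    largest_negative = max(a for a, label in labeled if not label)
    smallest_positive = min(a for a, label in labeled if label)
    return (largest_negative + smallest_positive) / 2

threshold = learn_threshold(examples)

def is_large_learned(amount):
    """Classify using the cutoff that was derived from the examples."""
    return amount > threshold

print(threshold)
print(is_large_explicit(95), is_large_learned(95))
```

The hand-written rule disagrees with the examples for an amount of 95, while the learned cutoff follows whatever pattern the data contains. With more examples, the learned cutoff can be re-derived without anyone editing the rule.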

Predictive Analytics is using machine learning to predict future outcomes (extrapolation), or to infer unknown data points from known ones (interpolation).
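The two cases can be sketched with a few known points and a fitted straight line. This is a minimal illustration in plain Python, not a production technique: the data points are made up, and the helper names `fit_line` and `predict` are hypothetical.

```python
# Hypothetical known data points (x, y); x = 3 is missing from the record.
known = [(1, 3), (2, 5), (4, 9), (5, 11)]

def fit_line(points):
    """Least-squares fit of y = slope * x + intercept to the known points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(known)

def predict(x):
    """Estimate y at x using the fitted line."""
    return slope * x + intercept

interpolated = predict(3)  # inside the observed range (1..5): interpolation
extrapolated = predict(6)  # beyond the observed range: extrapolation

print(interpolated, extrapolated)
```

Both estimates come from the same model. What differs is whether the question falls inside or outside the range of data already seen, and estimates outside that range usually deserve more caution.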

— thanks to Christopher Nguyen



