Machine Learning + Big Data

Here at a16z, we treat “big data” and “machine learning” as connected activities. People have been talking about the need for more ‘analysis’ and insight in big data, which is obviously important, because we’ve been in the ‘collection’ phase with big data until now. But the innovation in the big data world that I’m most excited about is the ‘prediction’ phase — the ability to process the information we’ve collected, learn patterns, and predict unknowns based on what we’ve already seen.

Machine learning is to big data as human learning is to life experience: We interpolate and extrapolate from past experiences to deal with unfamiliar situations. Machine learning with big data will duplicate this behavior, at massive scale.

Where business intelligence before was about past aggregates (“How many red shoes have we sold in Kentucky?”), it will now demand predictive insights (“How many red shoes will we sell in Kentucky?”). An important implication of this is that machine learning will not be an activity in and of itself … it will be a property of every application. There won’t be a standalone function, “Hey, let’s use that tool to predict.”
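To make the contrast concrete, here is a minimal sketch in Python. The monthly sales figures are made up for illustration, and the straight-line trend fit is just one simple stand-in for whatever model a real application would use. The first function answers the "have we sold" question; the second answers the "will we sell" question.

```python
# Hypothetical monthly unit sales of red shoes in Kentucky (made-up numbers).
monthly_sales = [100, 110, 120, 130, 140, 150]


def past_aggregate(sales):
    """Classic BI question: how many have we sold so far?"""
    return sum(sales)


def fit_line(ys):
    """Least-squares fit of y = slope * x + intercept, with x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_next(sales):
    """Predictive question: how many will we sell next month?"""
    slope, intercept = fit_line(sales)
    # The next month sits at x = len(sales), one step past the observed data.
    return slope * len(sales) + intercept


print("Sold so far:", past_aggregate(monthly_sales))
print("Expected next month:", round(predict_next(monthly_sales)))
```

In an application with machine learning as a built-in property, the second number would simply appear next to the first, with no separate "prediction tool" for the user to open.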

Take Salesforce, for example. Right now it just presents data, and the human user has to draw predictive insights in his or her own head. Yet most of us have been trained by Google, which uses information from millions of variables based on our and others’ usage to tailor our user experience … why shouldn’t we expect the same here? Enterprise applications, in every use case imaginable, should and will become inherently more intelligent as the machine implicitly learns patterns in the data and derives insights. It will be like having an intelligent, experienced human assistant in everything we do.

The key here is more automated apps, where big data drives what the application does with no user intervention. (My colleague Frank Chen calls this the “big data inside” architecture for apps.)

But all of this forces, and benefits from, innovation at the infrastructure level.

Big Data needs Big Compute: Where Hadoop and Spark fit in the picture

Think of big data and machine learning as three steps (and phases of companies that have come out of this space): collect, analyze, and predict. These steps have been disconnected until now, because we’ve been building the ecosystem from the bottom up — experimenting with various architectural and tool choices — and building a set of practices around that.

The early Hadoop stack is an example of collecting and storing big data. It allows easier data processing across a large cluster of cheap commodity servers. But Hadoop MapReduce is a batch-oriented system, and doesn’t lend itself well to interactive applications; real-time operations like stream processing; and other, more sophisticated computations.
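A toy illustration of the batch model may help here. The sketch below runs in a single Python process and does not use Hadoop's actual API; the function names and the three input records are assumptions made up for this example. It only shows the shape of a map, shuffle, and reduce job: every record is mapped and grouped before any result comes back, which is why the model suits large offline jobs better than interactive questions.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_batch_job(records):
    """Run the whole job end to end; nothing is returned until every step finishes."""
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input records.
records = ["red shoes", "blue shoes", "red hats"]
counts = run_batch_job(records)
print(counts)
```

On a real cluster the map and reduce steps would be spread across many machines, but the all-or-nothing rhythm of the job is the same.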

For predictive analytics, we need an infrastructure that’s much more responsive to human-scale interactivity: What’s happening today that may influence what happens tomorrow? A lot of iteration needs to occur on a continual basis for the system to get smart, for the machine to “learn”: explore the data, visualize it, build a model, ask a question, get an answer back, bring in other data, and repeat the process.

The more real-time and granular we can get, the more responsive, and more competitive, we can be.

Compare this to the old world of “small-data” business intelligence, where it was sufficient to have a small application engine that sat on top of a database. Now, we’re processing a thousand times more data, so to keep up the speed at that scale, we need a data engine that’s in-memory and parallel. And for big data to unlock the value of machine learning, we’re deploying it at the application layer, which means “big data” needs “big compute”.

This is where Apache Spark comes in. Because it’s an in-memory, big-compute part of the stack, it can run up to a hundred times faster than Hadoop MapReduce on workloads that fit in memory. It also offers interactivity since it’s not limited to the batch model. Spark runs everywhere (including on Hadoop), and turns the big data processing environment into a real-time data capture and analytics environment.

* * *

We’ve invested in every level of the big data/big compute ecosystem, and this remains an exciting, active space for innovation, because big data computing is no longer the sole province of government agencies and big companies. Even though the early applications tend to show up in industries where data scientists have typically worked, machine learning as a property of all applications, especially when coupled with an accessible user interface, is democratizing who can do this kind of real-time computing and learning, what it can be applied to, and where it can happen … and what great new companies can be built on top of it.

My belief is that every application will be reconstituted to take advantage of this trend. And thanks to big data and big compute innovations, we finally have the ingredients to really make this happen. We’re at the threshold of a significant acceleration in machine intelligence that can benefit businesses and society at large.

– Peter Levine

Definitions

Big Data is the collection of massive amounts of information, whether unstructured or structured.

Big Compute is the large-scale (often parallel) processing power required to extract value from Big Data.

Machine Learning is a branch of computer science that, instead of applying high-level algorithms to solve problems in explicit, imperative logic, applies low-level algorithms to discover patterns implicit in the data. (Think of this as the way the human brain learns from life experiences rather than from explicit instructions.) The more data, the more effective the learning, which is why machine learning and big data are so closely tied together.

Predictive Analytics is using machine learning to predict future outcomes (extrapolation), or to infer unknown data points from known ones (interpolation).

— thanks to Christopher Nguyen
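
As a rough illustration of the Machine Learning definition above, the Python sketch below contrasts a rule a programmer wrote by hand with a rule discovered from labeled examples. Everything in it is hypothetical: the order amounts, the fraud labels, and the single-cutoff "model" were chosen only to keep the example small.

```python
# Hypothetical labeled examples: (order amount in dollars, was it fraudulent?)
examples = [
    (20, False), (35, False), (50, False),
    (400, True), (520, True), (610, True),
]


def explicit_rule(amount):
    """Explicit, imperative logic: a programmer picked the cutoff up front."""
    return amount > 1000


def learn_threshold(data):
    """Discover the cutoff that misclassifies the fewest labeled examples."""
    values = sorted(amount for amount, _ in data)
    # Candidate cutoffs are the midpoints between neighboring observed amounts.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(cutoff):
        return sum((amount > cutoff) != label for amount, label in data)

    return min(candidates, key=errors)


cutoff = learn_threshold(examples)


def learned_rule(amount):
    """The same kind of rule, but with the cutoff taken from the data."""
    return amount > cutoff


print("Learned cutoff:", cutoff)
print("Hand-written rule flags a $400 order:", explicit_rule(400))
print("Learned rule flags a $400 order:", learned_rule(400))
```

With only six examples the learned cutoff is crude; more data gives the learner more candidate cutoffs and more evidence for choosing among them, which is the tie to big data that the definition describes.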