AI Revolution

Digital Biology

Daphne Koller and Vijay Pande

This conversation is part of our AI Revolution series, which features some of the most impactful builders in the field of AI discussing and debating where we are, where we’re going, and the big open questions in AI. Find more content from our AI Revolution series on www.a16z.com/AIRevolution.

Daphne Koller is the founder and CEO of insitro, a company using AI and machine learning to engineer drug discovery. In this conversation with a16z’s Vijay Pande, she discusses how an LLM for cells could revolutionize drug discovery and how to bridge the technological and cultural divide between building with atoms and building with bits.

  • [00:32] Why life sciences?
  • [03:23] AI in the life sciences
  • [07:02] LLM for cells
  • [11:37] Engineering disease and drug discovery
  • [13:33] Bits vs. atoms
  • [17:36] The opportunity ahead

Why life sciences?

Vijay: Daphne is the OG’s OG in AI. She was a pioneer at Stanford in different areas of AI, especially in PGMs. She left Stanford to cofound Coursera with Andrew Ng and is now the founder and CEO of Insitro, a tech bio company using AI to develop drugs in life sciences. Daphne, given all the things you could be doing, why life sciences?

Daphne: I think there are 3 parts to the question: why life sciences, why now, and why me? I’m going to answer all 3 parts. Life sciences is one of the really hard and really important problems, and there are very few things that are as challenging and exciting as intervening in human health in a safe and effective way. It’s just a thing that absolutely needs to be done if we are going to use AI for good, which is one of the things I really strive to do.

The second part of the question is why now? What brought me back to this field back in 2016, post-Coursera, was the realization that we can now, finally, for the first time, measure biology at scale, both at the cellular level—sometimes at subcellular level—and at the organism level via ways of quantitating human biology. For the very first time, that gives us the ability to deploy machine learning in ways where it is truly meaningful because the data sets are large enough for really interesting machine learning methods to be deployed.

I am a big believer in leverage, or places where you can have a disproportionately large impact. Because I spent a large part of my Stanford career working in 2 spaces simultaneously, core machine learning and machine learning in service of biomedical data, I actually have the ability to bridge the chasm between these 2 very disparate disciplines.

When I was leaving Coursera in 2016 I saw, even at that time, when the field was tiny compared to where it is today, that while machine learning was changing the world, it wasn’t having much of an impact in the life sciences. I believe one of the main reasons is that there are so very few people who actually have the language of both disciplines and are able to bring them together. I felt like I could have an impact in AI across many things, but here I could have disproportionate impact.

AI in the life sciences

Vijay: You spoke about the why now. What’s your take on AI for life sciences? What’s the “why now” there? What’s different now from what we could do even just 5 years ago?

Daphne: I think it comes back to this ability to collect, but even more than collect, generate data at scale. One of the truly unique things that we have at Insitro is a data factory. We have put together the tools developed by people working with induced pluripotent stem cells: you take cells from you, or me, or anyone in this audience, and turn them back into a pluripotent state, from which you can make a Daphne neuron in a dish or a Daphne hepatocyte. Those cells are going to be different from the Vijay neuron and the Vijay hepatocyte because we have different genetics. That’s going to manifest in how these cells look and behave, and in different measurements.

We can engineer those to introduce a disease-causing mutation and ask, “What does that disease-causing mutation do to a Daphne neuron versus what does it do to a Vijay neuron? What does this mutation do versus that mutation?” We’re able to do data generation on spec. That is a truly unique capability, which, frankly, is not that easy to do, even in other areas where AI is being deployed. You don’t get to make your own data in many cases, but here we do. That creates really important discovery opportunities for life sciences, but also really cool and interesting machine learning problems. You could start doing active learning or do experimental design, and it’s a really exciting technical discipline at this point.

Vijay: Could you dive a little deeper and give an example? Your paper on the POSH approach came out on Archive. Could you double-click on that? Tell people what you did there, especially why AI in life science is a big deal. What could you hope to get?

Daphne: First of all, let me tell you a little bit about that platform, which is called POSH, or Pooled Optical Screening in Humans. You take a bunch of cells, they can be cancer cells or whatever, and you put them with a pool of CRISPR guides that edit them. Each cell gets a different guide. Now you have a bunch of cells, each carrying a different genetic edit, all sitting there in a pool. You can measure them with a microscope as they move around and do their thing. Then you can fix them and sequence the barcode that came with each guide. So now you can say, “This cell that got this guide behaved this way and this other cell behaved that way.”

I can tell you that one of the really challenging things about cells is that, because they’re alive, if you put different cells in different wells, they each have a slightly different environment, and you get subtle differences that are really hard to reconcile. When they’re all in a pool, you eliminate all of those artifacts, and all of a sudden you have the ability to run a genome-wide CRISPR screen. You have 20,000 genes in the genome, each modified in the same cellular background in the same dish with a different genetic intervention, and you’re measuring that on a genome-wide scale in 10 or 12 plates in 2 weeks.

Imagine doing that, rinse, repeat, and doing genome-wide scale on this genetic background or in this cell type. You can really start to decipher the genotype-phenotype connection and the way in which individual genetics makes a difference to cellular phenotypes, which we then translate into what we believe their clinical impact will be. That is the beginning of an understanding of what we want to modify to have meaningful therapeutic interventions. This is a truly engineered approach to discovery.
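
As an illustration of the pooled readout described above, here is a minimal Python sketch of the bookkeeping step: each imaged cell is matched to the guide it received via its sequenced barcode, and phenotype measurements are then averaged per targeted gene. This is not Insitro's pipeline. The cell IDs, barcodes, gene names, scores, and the function `mean_phenotype_by_gene` are all hypothetical, and a real screen would involve image segmentation, barcode calling, and far more cells.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data (assumed for illustration only).
# Phenotype score per cell, as might be derived from microscopy images.
cell_phenotypes = {"c1": 0.5, "c2": 1.5, "c3": 2.0, "c4": 3.0, "c5": 1.0}
# Barcode read from each cell after fixing and sequencing.
cell_barcodes = {"c1": "BC_A", "c2": "BC_A", "c3": "BC_B", "c4": "BC_B", "c5": "BC_C"}
# Which gene each barcoded guide targets.
barcode_to_gene = {"BC_A": "GENE1", "BC_B": "GENE2", "BC_C": "non-targeting"}


def mean_phenotype_by_gene(phenotypes, barcodes, barcode_map):
    """Group per-cell phenotype scores by targeted gene and average them."""
    grouped = defaultdict(list)
    for cell_id, score in phenotypes.items():
        gene = barcode_map.get(barcodes.get(cell_id))
        if gene is None:
            # Skip cells whose barcode was not read or is not in the library.
            continue
        grouped[gene].append(score)
    return {gene: mean(scores) for gene, scores in grouped.items()}


per_gene = mean_phenotype_by_gene(cell_phenotypes, cell_barcodes, barcode_to_gene)
for gene, avg in sorted(per_gene.items()):
    print(gene, avg)
```

Because all cells share one dish, the comparison between a targeted gene and the non-targeting control is not confounded by well-to-well differences, which is the point made above about pooling.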

Developing an LLM for cells

Vijay: The biology part is really critical because now you get the data, and we all know how important that is. One of the things I found intriguing is the creation of a latent space for human biology, and especially being able to tell the difference between disease and non-disease, or even different disease phenotype. How does that come about and how is AI driving that?

Daphne: I’m going to go back a step because you said one of the things we need to do is get the data. I should have mentioned that it’s impossible to run this instrument without AI being built into it because you can’t even segment the cells; you can’t call the barcodes. All of it is an AI-enabled architecture.

Every part of our technology stack is intrinsically AI-enabled. To your point, Vijay, now you have a whole bunch of cellular images, what do you do with them? The first thing we did was build this latent space. We built a language model for biology, and I used to have to explain this to people. No one knew what I was talking about. Now I’m just saying, “It’s just like GPT, but for cells.”

We have the language of cells: what cells look like, or the transcriptional, or gene expression, profiles of cells. You measure hundreds of millions of cells in different states. Then, because we have this latent space, just like with the large language models for natural language, you can take a much more limited amount of data and start asking, “How does a disease-causing gene move you from one place to the other? How does a treatment move you, hopefully, from the disease state back to the healthy state?” That’s super powerful. And it’s the gift that keeps on giving.
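
One way to picture "moving from one place to the other" in a latent space is as vector arithmetic on cell embeddings. The Python sketch below is a simplified illustration under stated assumptions, not the model discussed here: the embeddings are made-up 2-dimensional points, and `reversal_score` is a hypothetical measure of how much of the healthy-to-disease shift a treatment undoes along the disease direction.

```python
def centroid(vectors):
    """Average a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reversal_score(healthy, diseased, treated):
    """Fraction of the disease shift undone by treatment.

    1.0 means treated cells sit back at the healthy centroid along the
    disease direction; 0.0 means no movement away from the disease state.
    """
    h, d, t = centroid(healthy), centroid(diseased), centroid(treated)
    disease_shift = subtract(d, h)  # how the mutation moves cells
    recovery = subtract(d, t)       # how treatment moves them back
    norm_sq = dot(disease_shift, disease_shift)
    if norm_sq == 0:
        raise ValueError("healthy and diseased centroids coincide")
    return dot(recovery, disease_shift) / norm_sq


# Hypothetical 2-D embeddings (real latent spaces have many more dimensions).
healthy_cells = [[0, 0], [0, 2]]
diseased_cells = [[4, 1], [4, 1]]
treated_cells = [[1, 1], [1, 3]]

score = reversal_score(healthy_cells, diseased_cells, treated_cells)
print(score)  # 0.75
```

Projecting onto the disease direction is only one possible design choice; it ignores movement in other directions, which in practice could matter for off-target effects.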

Like other language models, it keeps getting better the more data you feed it. Over time, you end up with a stronger and stronger competitive moat, because understanding the core foundations of biology helps you better understand disease and health. This is not just for cellular data. The other source of data that we use is clinical data.

We do the same thing with histopathology. There’s so much more in histopathology than your pathologist typically looks at. In MRI data, your radiologist doesn’t see more than a small percentage of what’s there in your radiology images. It’s also not just imaging. There are also other modalities where there’s an equal amount of information left on the table. Over time, we’re learning the languages of different biological modalities and the ability to translate between them.

Engineering disease and drug discovery

Vijay: This concept of a foundation model for biology is particularly exciting because 10 years ago, you could have ML that was predictive; you just needed maybe 100 activities. The problem is if you have 100 examples of a drug that works, you don’t need to design a drug. These low-shot, zero-shot approaches that come from a foundation model are really night and day. How far does this go? The big problem in biology is that biology is hard.

Daphne: Biology is really hard. Sometimes I ask myself, “Why am I doing this?” I could go write an app for, like, a chat agent company.

Vijay: It would be a lot easier. So, why are you doing it? What is the big win? Where does this go by the end of the decade? What could you hope to do that we couldn’t do before?

Daphne: We want to do it in a different way and come up with a very systematic recipe for you to go from a decision that I want to work on ALS or fatty liver disease, through a sequence of steps toward something that results in a meaningful intervention in the right patient population.

The hope is by the end of this decade, we will have built this process, run through it a number of times, and delivered some medicines to patients in our first tranche of indications. Then we will have learned enough from that so we can now say, “Here’s how we’re going to do it here, and here, and here.”

It’s not only machine learning that moves forward over time, it’s also the biological tools that we’re relying on. It used to be that there wasn’t any CRISPR. There was just siRNA. Actually, there wasn’t even that. Then there’s CRISPR base editing and now there’s CRISPR prime that replaces entire regions of the genome. The tools that we’re building on also get better and better over time, which unlocks more and more diseases that we could tackle in a meaningful way.

Vijay: Let’s step back for a second because it may not be clear for everyone why biology is so hard. One of the biggest reasons is that we can do tons of experiments on mice. It’s a great time to be a rich mouse: you could be cured of any disease. All these diseases can be cured in mice, but it’s obviously unethical to experiment on people. That’s one of the big reasons why trials fail. When you go into a clinical trial, you spend all this money to get there. You’re spending hundreds of millions of dollars in the trial, and it turns out mice are different from people, and it fails. How can AI help that?

Daphne: First of all, this notion that we can cure lots of mice is something that really drove our discovery strategy at Insitro, which is that all of our work is done in human and human-derived systems. That incorporates at least some subset of human cells working together.

That’s one piece, and the nice thing about it is that it allows you to intervene in those systems and ask the “what if” questions. The counterfactuals like, “What if I had this person’s biology, but in a world where this gene was inactive versus active, or the other way around?” That’s great, but obviously you want to cure people, not cells or even organoids, so the other source of data we bring in is data from people, from clinical records.

Without machine learning, without AI, the space would be so complex and so high-dimensional that you couldn’t even make sense of it, far less build a bridge between those 2 different worlds.

Bridging the divide between bits and atoms

Vijay: That makes sense. Let’s change gears a bit and talk a bit about company building. One of the interesting things that you’ve done is you’ve brought together people who are biology experts with people who are ML/AI experts. How do you build that culture? What does that look like, especially since they’re from fairly different parts of the universe?

Daphne: First of all, it may not have been obvious to everybody, but the company name, Insitro, is actually the blend of “in silico” and “in vitro,” with “in silico” being in the computer and “in vitro” being in the lab. Those elements of bringing those 2 strands together are so deeply woven even into our logo.

How you build that is really hard. If you take your average machine learning scientist and your average life scientist, even if they’re very well-intentioned, and put them into the room together, they might as well be talking Thai and Swahili to each other. The languages are different, the ways in which they think are totally different. So how do you create a shared language, a shared vision?

There are a few tricks or approaches that we use. First of all, we hire some number of people—you can’t get enough of them, unfortunately—who are in the middle and can be translators for both sides and bring them together. The other really important part is that you create a culture and you hire very rigorously to that culture of people who are genuinely interested in engaging with the other side.

We have a list of company values. The final value, which is one that I hold particularly dear—it’s last, not because it’s least important, but because they’re ordered from what we do to how we do it—is that we engage with each other openly, constructively, and with respect. Each of those words matters. Engage means that we’re not siloed. All of our work is done in cross-functional project teams. “Openly” means an openness to asking really naive questions when you don’t understand and to accepting really naive suggestions from somebody else because sometimes, the best ideas come from an orthogonal mindset.

Vijay: Especially as AI gets into areas that are not just the world of bits, but in the world of atoms: any advice for how to bridge those gaps?

Daphne: Have an appreciation for the complexity of atoms. Especially when your atoms are part of living systems, they behave in unexpected, unpredictable, idiosyncratic ways that sometimes cause a lot of pain. When you do biological experiments, one of the strongest signals when you apply machine learning is: who was the technician who actually did the experiment? Technicians each behave a little bit differently, they pipette a little bit differently, they treat the cells a little bit differently. It’s amazing how hard it is to clean that up, which is one of the reasons why we spend so much of our time building robots. They do the same thing over and over again.

I think you need a lot of respect for atoms, but also an appreciation for the fact that the next frontier of the impact that AI can have is when AI starts to touch the physical world. We’ve all seen just how much harder that is. We’ve all seen how hard it is to build a self-driving car compared to building a chatbot. We’ve made so much progress on building chatbots, and self-driving cars are still blocking fire trucks in San Francisco. So have an appreciation for that complexity, but also an appreciation for the magnitude of the impact if you can actually nail it.

The opportunity ahead

Vijay: You’re talking about life sciences in terms of healthcare and drug design, but there’s a lot more to biology than just drugs. Where do you think this confluence between AI and life sciences goes from here?

Daphne: I actually think that there is this incredible opportunity at this intersection between the 2 fields. Think back on the history of science. At certain times in our history, there have been eras where a particular scientific discipline has made incredible amounts of progress in a relatively short amount of time because there was a click. We started to see the world in different ways or there was a tool that wasn’t available before.

If you think back to the late 1800s, that was chemistry where we suddenly realized we couldn’t really turn lead into gold. There was this thing called the periodic table and there were electrons. It really shifted chemistry. Then, in the early 1900s, obviously that discipline was physics. The connection between energy and matter and between space and time completely shifted our understanding of the universe.

In the 1950s, that discipline was computing. We got these machines that perform calculations that, up until that point, only a human was able to perform. Then in the 1990s, there was this interesting bifurcation. On the one side, there was data science, which ultimately drew on computers but also had elements of neuroscience, optimization, and statistics. That ultimately gave us modern-day machine learning and AI.

And then the other side was quantitative biology, which was the first time where we started to measure biology on a scale that was more than tracking 3 genes across an experiment that took 5 years. That was the first microarray data and the first human genome.

This time that we’re living in is when those last 2 disciplines are actually going to merge. They’re giving us an era of what I think of as digital biology, which is the ability to measure biology at unprecedented fidelity and scale, and to interpret the unbelievable masses of data across different biological scales and different systems using the tools of machine learning and data science. Then bring that back to engineer biology using tools like CRISPR and genome editing, so we can make biology do things that it would otherwise not want to do.

Vijay: Like what?

Daphne: There are applications in human health and agriculture. I don’t think we need to tell anybody anymore, although there are still some people who might need to hear it, about the impact of global warming and climate change on our world, and the fact that we need to have crops that are much more resistant to drought and severe weather.

Vijay: And to feed 10B people.

Daphne: To feed 10B people. There are opportunities in the environment to maybe do better carbon sequestration using plants or algae. There are biomaterials and so on. There are so many opportunities at this intersection, and I would encourage any of you in this audience who are looking for something truly aspirational and exciting to do to explore it. This convergence is a moment in time for us to make a really big difference in the world that we live in using tools that exist today that did not exist even 5 years ago.

Vijay: I think that’s the opportunity at hand. We’ll wrap up there. Let’s thank Daphne one more time.