We’ve come a long way from having to make the case for artificial intelligence (AI) in biology. Just a few years ago, I argued in The New York Times that fears around the “black box” of AI in medicine are often misplaced, especially given how much of a black box the doctor’s mind is, and described where the limits and opportunities may lie. Today, however, there is ample evidence that AI can revolutionize healthcare and the life sciences (not to mention other fields), even outperforming humans on a wide range of tasks once thought too complex for algorithms to tackle.
But even with this evidence now in hand, the reality, in practice, is that the potential of artificial intelligence in biology will be limited unless it gets a lot smarter. Current methodologies rely on a naïve, blank slate as the starting point. AI can be trained (much like a dog), but it cannot understand; it can play the game, but only with known rules; and it can’t really go beyond its training. Take for example the application of identifying small molecules that can bind to a disease-causing protein, where AI could accelerate and expand drug discovery beyond human capabilities. Today’s AI has to infer the laws of physics (e.g. how close atoms can pack), chemistry (e.g. the strength of different chemical bonds), and biology (e.g. the flexibility of the protein’s binding pocket) from the data it is trained on. And if that dataset is too limited in any direction, the model’s predictions will violate the basic rules of these fields, leading to pointless results.
To be clear, what we are discussing is not some human-like, sci-fi conception of AI (aka “general intelligence”), nor is this challenge of “dumb” AI unique to biology. But the implications of relying on these naïve algorithms will be most keenly felt in bio and healthcare, due to the amount of specialized domain expertise needed to understand both the root of the problem, and the scope of possible solutions.
If we’re going to make true strides in biology and healthcare by applying AI meaningfully, we need to be able to create AI algorithms with domain knowledge — more smart, less naïve. So how do we get there, and what does this mean for players in the space?
While in real estate the adage is “location, location, location,” in AI it will always be “data, data, data.” However, our existing datasets are poorly suited to AI for practical application in biology. Exploring these datasets can uncover corner and edge cases, but not generalizable biological insights. Existing datasets also lack key controls to verify that the AI is learning what it should be learning, and not just some artifact of the dataset. There are a number of questions and mindsets involved in assessing whether it’s even worth the time, effort, and money involved in adopting AI technology in biology (which we detail in this article) — there are too many pitfalls otherwise.
For AI to be more practically and more widely adopted in bio and healthcare, new datasets need to be generated via automation. This would improve things in many ways: it would be more systematized, reproducible, and free from the emotional constraints of tasks that are too laborious or tedious for humans.
But even more critical is how the experiments are designed. Too often, AI is shoehorned in at the end of an experiment. Experiments need to be designed at the outset to feed AI, thus ensuring higher quality data and also circumventing data artifacts.
In fact, AI can and should be used before any data is gathered, to assist in experimental design and chart the course of what experiments ought to be run. This is totally different from how many scientists are taught to design experiments — to test a specific hypothesis — because AI massively expands the space of possibilities, and can guide where to invest time, energy, and resources. AI inverts the way we’ve been trained to do science, in other words — it opens up to us a world where we directly embrace the reality that we don’t know what we don’t know.
Using our example of identifying a small molecule to bind a protein target (an application increasingly used in startups and pharmaceutical companies): AI can be particularly powerful across multiple drug development programs, where results and information can be pooled, further strengthening the dataset and therefore the derived learnings. Traditionally, a medicinal chemist would make a series of bets about which modifications would improve affinity and selectivity, often ruling out options that they “know” won’t work. In addition to running an astronomically larger number of experiments in silico, AI wouldn’t have to rule out any possible modifications, which expands the scope of exploration, and it could help choose which molecules the medicinal chemist should make and test. AI might choose some molecules that seem counterintuitive, but these can still inform the global model the AI is building.
This is the power of AI — going beyond what humans can do — when it is smart, not dumb.
In addition to better data, we need smarter algorithms, obviously. Since the problem here is lack of domain expertise — which means that training datasets must be enormous in depth and scope — the next generation of AI needs to have expertise embedded before tackling specific problems. Three methods for doing this right now include:
Learning to play the guitar is challenging, but it is much easier for people who already know how to play the piano (as they already know how to read music, operate a musical instrument, and have an ear for pitch and tone). You can think of learning piano as “pre-training” for learning guitar. In bio, pre-training could look like an algorithm for medical transcription that is trained with English language and grammar before training with medical terminology and taxonomy. Pre-training essentially gives the AI lots of practice and teaches it about the relationship between concepts, and it has tangible benefits, such as achieving higher accuracy levels faster and with fewer inputs.
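The benefit described above (higher accuracy with fewer inputs) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the "model" is a straight line fit by gradient descent, and the source and target tasks are two made-up, closely related rules. It is not a depiction of any real biomedical system, only of the pre-train-then-fine-tune pattern.

```python
def train(xs, ys, w, b, steps, lr=0.1):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the line (w, b) on the given points."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Data-rich "source" task (hypothetical rule): y = 2.0x + 1.0, 50 examples.
source_x = [i / 50 for i in range(50)]
source_y = [2.0 * x + 1.0 for x in source_x]

# Data-poor "target" task (hypothetical, related rule): only 5 examples.
target_x = [0.1, 0.3, 0.5, 0.7, 0.9]
target_y = [2.2 * x + 0.9 for x in target_x]

# Held-out target-task points, used only for evaluation.
test_x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
test_y = [2.2 * x + 0.9 for x in test_x]

FINE_TUNE_STEPS = 20

# Pre-train on the source task, then fine-tune briefly on the target task.
w_pre, b_pre = train(source_x, source_y, 0.0, 0.0, steps=2000)
w_ft, b_ft = train(target_x, target_y, w_pre, b_pre, steps=FINE_TUNE_STEPS)

# Baseline: the same small budget of target updates, from a blank slate.
w_scratch, b_scratch = train(target_x, target_y, 0.0, 0.0, steps=FINE_TUNE_STEPS)

pretrained_error = mse(test_x, test_y, w_ft, b_ft)
scratch_error = mse(test_x, test_y, w_scratch, b_scratch)
print(f"pre-trained: {pretrained_error:.4f}  from scratch: {scratch_error:.4f}")
```

With the same handful of target examples and the same number of updates, the pre-trained model starts close to the right answer, while the blank-slate model is still far from it. The gap in this toy comes entirely from how related the two made-up tasks are, which is also the main caveat in practice.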
The downside of pre-training is that it still relies on AI to discover and deduce known rules based on the data provided. An alternative approach is to encode domain expertise directly into the algorithm. The key here is in representing the data in a way that is sufficiently general that it can handle all the different permutations, while also being sufficiently specific to handle the task at hand. In natural language processing, for example, naïve AI is fed data in the form of pixels, which it then interprets into letters, and words, and sentences, and so on. With smarter encoding, you present text as letters, which allows one to use dramatically less training data, opening the door to data-poor environments and much more predictive algorithms. In bio, this could mean instead of feeding AI with data in the form of voxels (3D pixels) to describe a molecule, you would start from a graph — which contains information about the chemical bond and thus the larger chemical space.
Representing data so that it can include greater information about the subject is tricky and must be handled thoughtfully, as it can just as easily obscure as enlighten.
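To make the voxel-versus-graph contrast concrete, here is a minimal sketch. The molecule (ethanol, heavy atoms only), its approximate coordinates, and the helper names `voxelize`, `build_graph`, and `atom_features` are all assumptions chosen for illustration; a real pipeline would derive atoms and bonds with a cheminformatics toolkit rather than writing them by hand.

```python
import math

# Ethanol heavy atoms (C-C-O). Coordinates (angstroms) and bonds are
# hand-written, approximate values for illustration only.
ATOMS = ["C", "C", "O"]
COORDS = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0)]
BONDS = [(0, 1, 1), (1, 2, 1)]  # (atom index, atom index, bond order)


def voxelize(coords, spacing=1.0):
    """Voxel view: the set of occupied grid cells (element labels omitted)."""
    return {tuple(math.floor(c / spacing) for c in xyz) for xyz in coords}


def build_graph(atoms, bonds):
    """Graph view: atom index -> list of (neighbor index, bond order)."""
    graph = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        graph[i].append((j, order))
        graph[j].append((i, order))
    return graph


def atom_features(atoms, graph):
    """Per-atom features: (element, neighbor count, total bond order)."""
    return [
        (atoms[i], len(graph[i]), sum(order for _, order in graph[i]))
        for i in range(len(atoms))
    ]


def translate(coords, shift):
    """Move every atom by the same offset; the molecule itself is unchanged."""
    return [tuple(c + s for c, s in zip(xyz, shift)) for xyz in coords]


# The same molecule, nudged in space, lands in different voxels...
voxels_original = voxelize(COORDS)
voxels_shifted = voxelize(translate(COORDS, (0.7, 0.7, 0.7)))

# ...while the graph never looks at coordinates, so it cannot change.
graph = build_graph(ATOMS, BONDS)
features = atom_features(ATOMS, graph)

print(voxels_original == voxels_shifted)
print(features)
```

In this sketch the voxel grid gives a different input for the same molecule after a small shift, so a model must learn from data that the two are equivalent. The graph states the bonds outright and carries no position at all. That is also the trade-off noted above: a bond-only graph discards 3D geometry, which may matter for the task at hand.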
Algorithms will also become smarter and gain domain expertise when they start being designed from the outset for a specific biological application. A lot, if not most, of the AI technology used in bio was directly carried over from non-bio applications; algorithms in radiology are the same type of neural nets used for basic image recognition.
But now we’re starting to see algorithms and training protocols emerge that are designed with biological problems in mind, such as variants of self-supervised algorithms that start from generic counterparts but incorporate biological insights to aid in learning. For example, cell imaging algorithms that understand the natural features of the cell (chromatin, organelles, etc.) can allow us to use self-supervised methods even more naturally than in non-biological examples. This is because the data is more consistent (all of the same type, all cell imaging), and the elements within the image are well known without advanced machine learning (as we do understand the basic biology of cells). This too would lead to much better overall performance and would require considerably fewer data points to train.
Lastly, algorithms will be smarter and grow their applicable power when they successfully blend with domain-specific computational approaches.
A prime example here is molecular dynamics simulation, a powerful computational method that encodes many of the aspects of molecular physics and chemistry, but that still relies on parameters and training done in an ad hoc, biased, and human judgment-reliant way. By incorporating AI into these simulations, AI can make parameter selection more robust and reproducible, thus improving the method overall. Today, we are seeing this combination at the protein, cell, and organ level, but the future is simulating whole organisms, powered by AI.
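As a rough sketch of what systematic, reproducible parameter selection can look like, the example below fits the two parameters of a Lennard-Jones pair potential to reference energies by exhaustive search. The parameter values, the distance range, the candidate grid, and the function names are all assumptions for illustration, and a plain grid search stands in for whatever learned method would be used in practice. Real force fields have many more coupled parameters.

```python
def lj_energy(r, epsilon, sigma):
    """Lennard-Jones pair energy at separation r."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)


# Stand-in "reference" energies, generated here from known parameters.
# In practice these would come from experiment or quantum calculations.
TRUE_EPSILON, TRUE_SIGMA = 0.25, 3.4  # hypothetical values
distances = [3.2 + 0.2 * i for i in range(10)]
reference = [lj_energy(r, TRUE_EPSILON, TRUE_SIGMA) for r in distances]


def loss(epsilon, sigma):
    """Sum of squared differences between model and reference energies."""
    return sum(
        (lj_energy(r, epsilon, sigma) - e_ref) ** 2
        for r, e_ref in zip(distances, reference)
    )


def fit_parameters():
    """Score every candidate pair the same way and keep the best one."""
    candidates = [
        (i / 100, j / 10)          # epsilon 0.05..0.50, sigma 3.0..4.0
        for i in range(5, 55, 5)
        for j in range(30, 41)
    ]
    return min(candidates, key=lambda p: loss(*p))


best_epsilon, best_sigma = fit_parameters()
print(best_epsilon, best_sigma)
```

Every candidate is judged against the same data with the same objective, so rerunning the search gives the same answer, in contrast to hand-tuning by judgment.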
Put together, all of these lead to a sea change in the “intelligence” of AI — moving from simple training in a task-based manner (akin to training a dog on a specific trick) toward a more generalized intelligence that requires less training, extends beyond the training set more naturally (i.e. within the bounds of the scientific discipline), and leads to much more accurate predictions.
It is easy to riff off the old “Intel inside” motto to say everything will have AI inside — but what does that mean for startups and incumbents? Bio isn’t the first industry to have adjusted (or that is adjusting) to the big shift. What we’ve seen as everyone from Wall Street to Madison Avenue to Silicon Valley adapts to AI is that cultural hurdles are just as high as the technical ones.
For a bio company, more practically and broadly adopting AI means integrating AI — and people with AI competencies — into every team. This is in stark contrast to a siloed, separate group that gets called in (often only at the end) to make sense of what others have done. This could mean staffing with people who are “bilingual” in AI and the biological domain, as well as building a culture that values both sides: that is, biologists who crave the power of computation, as well as computational scientists who are deeply rooted in biology.
It’s notoriously difficult to change entrenched cultures. Startups have an obvious advantage here, in starting from scratch and in building these teams and mindsets natively. For incumbents, it’s similar to other innovations, where the ones that get ahead are the ones that can adapt their established ways; or that build new AI-centric teams and have those teams take on more and more responsibilities, disrupting themselves.
There’s also a new kind of talent coming: It used to be that “drug hunters” were medicinal chemists, with a gut instinct for creating the best molecules. But with the rise of contract research organizations that can help with rote laboratory work and molecular synthesis, who makes the molecules is now far less important than who designs them. We saw a similar shift in finance with the rise of the “quants” — those with quantitative skills rather than a heuristic knowledge of the domain. This shift will also happen in both the chemistry lab and the bio lab.
Until now, these bio quants have had to rely on big datasets to power their statistical methods, and such datasets are rare due to their expense and complexity. But the smart algorithms of the future will enable quants to apply their skills to small data, and therefore in all areas of a company. Big data was a problem of infrastructure and pipes; small data was always an intellectual problem, and it can now be addressed with smart algorithms, not just smart people.
It is this ability of smart algorithms to work with small data that will enable AI everywhere.
Finally, as with other big technology shifts before it (such as the move to SaaS), the transition from naïve to smart AI will reshape the entire organizational structure, not just the functions closest to it. Smart algorithms will be the domain not just of the CSO, but also of other areas such as legal and finance, all the way to the CEO’s office. Why? Because smarter AI can help answer the key business questions that were once solely the domain of savvy human judgment.
Decision making from bench to boardroom will be improved, even in small-data regimes, when AI is unleashed by smart intelligence, not dumb data.
Too many people view AI as the next incremental step in a long history of advances that have come to biopharma. It is tempting to view AI as yet another technical advance that has inspired a new generation of companies and moved us a step forward. Yet this is an overly narrow perspective, because unlike other tech, AI (and particularly these smart algorithms) is not just one tool for solving one problem, but a tool that can be applied to every problem. The real power comes in not just using it as a singular tool, but using it to amplify and integrate every tool and technology in the company. It’s not just another box on the benchtop, but our apprentice and ally in every role. With AI everywhere, AI will be smarter and so, collectively, will we.