Bharath Ramsundar, Peter Eastman, Pat Walters, and I wrote Deep Learning for the Life Sciences with one primary message in mind: you can now easily do world-class AI in Bio from your laptop. Bringing AI to the life sciences has never been more exciting, or more rewarding, with results ranging from a deeper understanding of genomics to more rapid and successful drug design. Given that, we wanted to create a hands-on practical guide for newly minted Deep Neural Net (DNN) practitioners and scientists on how to separate hype from reality in the space, which tools you will need for the highest impact, and what foundational concepts and applications to keep in mind. Below are the three biggest questions I believe the practitioner wanting to use AI in the life sciences should be thinking about.
Asking an ML-appropriate question requires understanding where applying ML makes sense, to which data sets, and to what type of data. Machine learning in general (and deep learning in particular) is not a panacea, but given recent advances, there are unquestionably opportunities to do new things, and to do them more easily, with just a few (10 to 50) lines of Python code directly from one’s laptop.
A natural place to start applying deep learning is to images. Image recognition has been one of the major triumphs of deep learning approaches, perhaps not surprisingly, given the original inspiration from parts of the brain such as the visual cortex! Given that so much of life science data are images, e.g. all the variants of microscopy, radiology, pathology, etc., images are both a natural representation (the form in which data enter the computer) and a rich source of labeled data, such as “this picture has a tumor in it at this location.”
While images are a natural representation for computers (after all, an image is just an orderly array of numbers), clearly not all life sciences data is in the form of an image. Some data sets are by their nature pretty close to images; for example, a DNA or protein sequence could be viewed as a very long one-dimensional image, and deep learning approaches for images naturally carry over. But other data sets are harder. How to represent a molecule, such as a small molecule drug, is much more challenging and remains an area of active research. The recent treatment of molecules as graphs, a departure from how one represents images as arrays of numbers, is making major inroads.
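To make the contrast between the two representations concrete, here is a minimal sketch in plain Python. The function names, the choice of ethanol, and the heavy-atoms-only convention are illustrative assumptions, not the API of any particular toolkit. A DNA sequence becomes a one-dimensional "image" with four channels, while a molecule becomes a list of atoms plus a table of which atoms are bonded.

```python
# Sketch: two ways of turning life-science objects into numbers.
# All names here are illustrative, not taken from any library.

BASES = "ACGT"

def one_hot_dna(seq):
    """Encode a DNA string as rows of 4 numbers: a 1-D 'image' with 4 channels."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]

# A molecule as a graph: atoms are nodes, bonds are edges.
# Ethanol (CH3-CH2-OH), heavy atoms only; hydrogens are left implicit.
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]  # pairs of atom indices

def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric adjacency matrix from a list of bonds."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj

encoded = one_hot_dna("GATTACA")
adj = adjacency_matrix(len(ethanol_atoms), ethanol_bonds)
```

The sequence encoding has a fixed, ordered layout, which is why image-style methods carry over. The graph has no natural "first pixel": renumbering the atoms gives a different matrix for the same molecule, which is part of what makes molecular representation harder.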
Finally, choosing ML-appropriate questions also critically relies on the existence of enough labeled data. Often in the life sciences, there is plenty of data, but the existence or accuracy of labels is a challenge. For example, if one gets an AUC of 0.99 (i.e. near-perfect separation of the classes) on a small data set, this is actually bad news, not good news, as there is likely some hidden “cheat” that lets the model score well without learning the real signal. A recent, soon-to-be-classic example in pathology prediction was a model with very high accuracy at predicting tumors from images; only later was it found that all of the images with tumors also showed rulers used to measure the size of the tumor, so in the end, the ML was truly a ruler detector, not a tumor detector! Understanding what’s possible is key to avoiding these sorts of mistakes, which don’t serve the user and in fact set the field back.
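One simple sanity check suggested by the ruler story is to ask how well a single suspicious feature predicts the label on its own. The sketch below uses synthetic data and hypothetical feature names (`signal`, `ruler`); it is an illustration of the idea, not a reconstruction of the actual study. AUC is computed directly from its definition as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve: the probability that a random positive
    outscores a random negative (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
labels = [1] * 20 + [0] * 20  # a small data set: 20 tumor, 20 non-tumor

# Hypothetical "biology" feature: only weakly related to the label.
signal = [rng.gauss(0.3 * y, 1.0) for y in labels]

# Hypothetical artifact: a ruler appears in every tumor image and no others.
ruler = [float(y) for y in labels]

auc_signal = auc(signal, labels)
auc_ruler = auc(ruler, labels)
```

Here the artifact alone separates the classes perfectly, while the weak real feature does not. When a feature unrelated to the biology scores this well by itself, the labels are probably leaking into the inputs.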
As an interesting side note, this all means “data cleaning” has become a major art in and of itself. More and more, a natural solution to both the quantity and quality challenges involves data generation de novo. This puts the ML engineer in a critically important role, not just in data analysis a posteriori, but at the very inception of the experimental design.
With the right question and appropriate data in hand, the next steps lie in applying the appropriate algorithms and understanding the basic framework of these algorithms. Since images have been such a success for deep learning more generally, it’s natural to bring algorithms used for image recognition in other areas directly to the life sciences. For example, Convolutional Neural Networks (CNNs), which build translational symmetry into learning, work particularly well on life science images (and on image-like objects, such as genomic sequences).
Interestingly, one can also bring over algorithms indirectly, meaning one can use algorithms that work in a similar, related way. For example, some of the magic of CNNs is their convolutional nature, i.e. if you’re looking to identify a cat in an image, it doesn’t matter where the cat is (center of the image, upper left, etc.); the same learned filters are applied at every position, so the network responds to the cat wherever it appears, which makes its prediction approximately translationally invariant. Similarly, graph-convolutional neural nets retain much of the spirit of CNNs in images, but on a graph of a molecule, where translational invariance becomes invariance to the location of a chemical group, i.e. the key chemical moiety to recognize could be anywhere on the molecule.
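The translational property can be seen in a few lines of plain Python. This is a toy one-dimensional sketch with a hand-picked filter and a hypothetical `motif`, not a trained network. Sliding one filter along the input makes the response move with the pattern, and taking the maximum over all positions afterwards gives an output that does not depend on where the pattern sits.

```python
def correlate1d(signal, kernel):
    """Slide the kernel along the signal (fully overlapping positions only)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

motif = [1, 2, 1]   # hypothetical pattern to detect (the "cat")
kernel = motif      # a filter matched to that pattern

def place(motif, position, length=12):
    """Return an all-zero signal with the motif written at `position`."""
    signal = [0] * length
    signal[position:position + len(motif)] = motif
    return signal

resp_left = correlate1d(place(motif, 1), kernel)
resp_right = correlate1d(place(motif, 7), kernel)

# The response peak moves with the motif.
peak_left = resp_left.index(max(resp_left))     # 1
peak_right = resp_right.index(max(resp_right))  # 7

# After global max pooling, the output no longer depends on position.
pooled_left = max(resp_left)    # 6
pooled_right = max(resp_right)  # 6
```

Graph convolutions apply the same idea to molecules: one shared update rule is applied at every atom, and a final pooling step over atoms makes the prediction independent of where in the molecule the chemical group occurs.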
The final frontier is the development of novel architectures for life science data, with entirely new kinds of representations. This is a very active frontier of research: the very general nature of DNN tools (such as the ability to rapidly compute on generalized tensor quantities) makes it much easier to push into these new frontiers by building on the significant infrastructure work already accomplished by the DNN field more broadly. Tools like TensorFlow or Torch are very general, much like a turbocharged mathematical library, and not limited to the specific network architectures used today.
Perhaps the most exciting result of this effort is the new opportunity to ask questions which before were either considerably more challenging or unaddressable. My favorite in this category is the ability to discern causal relationships. While “correlation does not mean causation” is a frequently quoted truism, it does not follow that computation cannot approach causality at all. Indeed, recent statistical theories of causality crack open the door to rooting out causal factors from time series data. In a nutshell, the idea is that by understanding the sequence of events in time, such as “eat breakfast, then healthy, then drink poison, then dead,” one can ferret out causal connections, most importantly in cases which are much more nuanced than my example here.
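As a toy illustration of how ordering in time carries information that plain correlation does not, the sketch below simulates two series in which `x` drives `y` one step later, then compares lagged correlations in both directions. This is a simplified, Granger-style predictive check chosen for illustration; it is an assumption on my part that it captures the flavor of the theories mentioned above, and the series names and coefficients are made up. On its own it does not establish causation, since a hidden common driver could produce the same asymmetry.

```python
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(42)  # fixed seed so the sketch is reproducible
n = 500

# Synthetic data: x is random noise, and y at time t depends on x at time t-1.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.0] + [0.8 * x[t - 1] + rng.gauss(0, 0.5) for t in range(1, n)]

# Does the past of one series track the future of the other?
x_leads_y = pearson(x[:-1], y[1:])  # past x vs. future y: strong
y_leads_x = pearson(y[:-1], x[1:])  # past y vs. future x: near zero
```

The asymmetry between the two numbers is the point: past values of `x` say a lot about future `y`, but not the other way around. Real analyses need to go much further, for example by controlling for each series' own history and for confounders.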
While this field is still in active development, I can imagine that in the not-so-distant future, if the right tools and data sets emerge, we might even begin to understand a clinical trial as a poor man’s surrogate for causality, second to truly data-based, statistically rigorous causality as determined from much more data than is ever included in a single clinical trial (or three, or ten).
We are also beginning to be able to ask questions of AI itself — and how it works. A second fairy tale associated with machine learning is that ML is a black box, opaque to human understanding. There are in fact now means to interrogate DNNs to understand how they arrive at their predictions (we dedicate a whole chapter to this in the book). DNNs ironically create a framework which can be directly probed and understood, with the right tools and approaches, much more than the original black box of human intelligence.
Part of the broader project of our book, and the toolkit we hope it provides, is in fact a much grander goal: to help foster open source biology. It once seemed quite unlikely that Linus Torvalds, a college student working on a whole new open source operating system called Linux, could create the main code that today drives a vast number of the computing devices in the world. But this is how major transformations can start. And while the challenges here are of course different, requiring “real world” experiments, labs and materials and scientific knowledge beyond the scope of a laptop, we believe that the age of open source biology, where a brilliant contributor can begin to change the world from a humble laptop (and perhaps in a dorm room!), is only beginning.