A Practical Toolkit for Deep Learning in the Life Sciences

Vijay Pande Posted June 6, 2019

Bharath Ramsundar, Peter Eastman, Pat Walters, and I wrote Deep Learning for the Life Sciences with one primary message in mind: you can now easily do world-class AI in bio from your laptop. Bringing AI to the life sciences has never been more exciting or more rewarding, with results ranging from a deeper understanding of genomics to more rapid and successful drug design. Given that, we wanted to create a hands-on, practical guide for newly minted deep neural network (DNN) practitioners and scientists: how to distinguish hype from reality in the space, which tools you will need for the highest impact, and what foundational concepts and applications to keep in mind. Below are the three biggest questions I believe a practitioner wanting to use AI in the life sciences should be thinking about.

#1: What are the right kinds of problems to use AI on?

Asking an ML-appropriate question requires understanding where applying ML makes sense, to which data sets, and to what types of data. Machine learning in general (and deep learning in particular) is not a panacea. But given all the recent advances, there are unquestionably opportunities to do new things, and to do them more easily, with just a few (10 to 50) lines of Python code run directly on one’s laptop.

A natural place to start applying deep learning is to images. Image recognition has been one of the major triumphs of deep learning approaches, perhaps not surprisingly, given the original inspiration from parts of the brain such as the visual cortex! Given that so much life science data consists of images (all the variants of microscopy, radiology, pathology, and so on), images are both a natural representation (the form in which data goes into the computer) and a rich source of labeled data, such as “this picture has a tumor in it at this location.”

While images are a natural representation for computers (after all, an image is just an orderly array of numbers), clearly not all life sciences data comes in the form of an image. Some data sets are by their nature pretty close to images; for example, a DNA or protein sequence can be viewed as a very long one-dimensional image, and deep learning approaches for images naturally carry over. But other data sets are harder. How to represent a molecule, such as a small-molecule drug, is much more challenging and remains an area of active research. Recent work treating molecules as graphs is making major inroads, a departure from representing images as arrays of numbers.
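To make the "sequence as a one-dimensional image" idea concrete, here is a minimal sketch of one-hot encoding, a common way to turn a DNA string into an array of numbers. The function name `one_hot_dna` and the channel order `ACGT` are choices made for this illustration, not something the book prescribes.

```python
# Toy sketch: one-hot encode a DNA sequence as a 4-row "one-dimensional image".
# BASES fixes the channel order; "ACGT" is an arbitrary choice for this example.
BASES = "ACGT"

def one_hot_dna(seq):
    """Return a 4 x len(seq) matrix (list of rows), one row per base.

    Characters outside ACGT (for example an ambiguous 'N') produce an
    all-zero column.
    """
    seq = seq.upper()
    return [[1 if base == channel else 0 for base in seq] for channel in BASES]

encoded = one_hot_dna("GATTACA")
for channel, row in zip(BASES, encoded):
    print(channel, row)
```

Each column has a single 1 marking which base sits at that position, so the result has the same "grid of numbers" shape that image models expect, just with a height of four and a width equal to the sequence length.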

Finally, choosing ML-appropriate questions also critically relies on the existence of enough labeled data. Often in the life sciences there is plenty of data, but the existence or accuracy of labels is a challenge. For example, if one gets an AUC of 0.99 (i.e. exceedingly high accuracy) on a small data set, this is actually bad news, not good news, as there is likely some hidden “cheat” that lets the model sidestep the real task. A recent, soon-to-be-classic example in pathology prediction was a model with very high accuracy at predicting tumors from images. Only later was it found that all of the images with tumors also showed rulers used to measure the size of the tumor, so in the end the ML was really a ruler detector, not a tumor detector! Understanding what’s possible is key to avoiding these sorts of mistakes, which don’t serve the user and in fact set the field back.
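One simple sanity check for this kind of hidden "cheat" is to compute the AUC of a suspicious piece of metadata on its own. The sketch below uses a handful of made-up records (the values and the field names `has_tumor`, `ruler_in_image`, and `tissue_score` are hypothetical, chosen only to mirror the ruler story) and a small pure-Python AUC function.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is the pairwise definition of ROC AUC,
    which is fine for small data sets like this toy one.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy records: (has_tumor, ruler_in_image, tissue_score)
records = [
    (1, 1, 0.62), (1, 1, 0.35), (1, 1, 0.71), (1, 1, 0.48),
    (0, 0, 0.40), (0, 0, 0.55), (0, 0, 0.30), (0, 0, 0.66),
]
labels = [r[0] for r in records]

auc_ruler = auc(labels, [r[1] for r in records])   # metadata alone
auc_tissue = auc(labels, [r[2] for r in records])  # the signal we care about

print("AUC from ruler flag alone:", auc_ruler)
print("AUC from tissue score:", auc_tissue)
```

If a feature that should carry no biological information separates the classes almost perfectly, a model trained on the full images can be expected to latch onto it, and the headline accuracy says little about the real task.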

As an interesting side note, this all means “data cleaning” has become a major art in and of itself. More and more, a natural solution to both the quantity and quality challenges involves data generation de novo. This puts the ML engineer in a critically important role, not just in data analysis a posteriori, but at the very inception of the experimental design.

#2: What are the right algorithms, where, and when?

With the right question and appropriate data in hand, the next steps lie in applying the appropriate algorithms and understanding the basic framework of these algorithms. Since images have been such a success for deep learning more generally, it’s natural to bring algorithms used for image recognition in other areas directly to the life sciences. For example, Convolutional Neural Networks (CNNs), which incorporate translational symmetries into learning, work particularly well on life science images (and on image-like objects, such as genomic sequences).
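The translational symmetry mentioned above can be sketched in a few lines. In this toy example (plain Python, with a hand-picked kernel rather than a learned one), the same filter slides over a one-dimensional signal, and a global max pool then reports the strongest response. The function names `conv1d` and `detect` are made up for this illustration.

```python
def conv1d(signal, kernel):
    """Valid-mode sliding dot product (cross-correlation, as in most DL libraries)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def detect(signal, kernel):
    """Convolve, then global max pool. Returns (best score, position of best score)."""
    feature_map = conv1d(signal, kernel)
    best = max(feature_map)
    return best, feature_map.index(best)

kernel = [1, 3, 1]  # hand-picked filter matching the pattern 1, 3, 1

# The same pattern placed near the start and near the end of the signal.
left = [0, 1, 3, 1, 0, 0, 0, 0, 0, 0]
right = [0, 0, 0, 0, 0, 0, 1, 3, 1, 0]

score_left, pos_left = detect(left, kernel)
score_right, pos_right = detect(right, kernel)

print(score_left, pos_left)
print(score_right, pos_right)
```

The feature map shifts along with the input, while the pooled score stays the same wherever the pattern sits. Because the same small kernel is reused at every position, the network does not need to relearn the pattern for each location.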

Interestingly, one can also bring algorithms over indirectly, meaning one can use algorithms that work in a similar, related way. For example, some of the magic of CNNs is their convolutional nature: if you’re looking to identify a cat in an image, it doesn’t matter where the cat is (center of the image, upper left, etc.); the convolutional neural net is translationally invariant, i.e. it will find the cat wherever it appears. Similarly, graph-convolutional neural nets retain much of the spirit of CNNs for images, but operate on the graph of a molecule, where translational invariance becomes invariance to the location of a chemical group, i.e. the key chemical moiety to recognize could be anywhere on the molecule.
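As a rough illustration of the graph view, the sketch below represents the heavy atoms of ethanol (C, C, O) as nodes with bonds as edges, runs one heavily simplified graph-convolution step (each atom adds its neighbors' features to its own, with no learned weights or nonlinearity), and sums over atoms to get a molecule-level vector. The feature scheme `[is_carbon, is_oxygen]` and the function names are assumptions made for this example; real graph-convolutional networks are considerably richer.

```python
def graph_conv(features, edges):
    """One simplified graph-convolution step over an undirected graph.

    Each atom's new feature vector is its own features plus the sum of its
    bonded neighbors' features.
    """
    out = [list(f) for f in features]
    for a, b in edges:
        for k in range(len(features[a])):
            out[a][k] += features[b][k]
            out[b][k] += features[a][k]
    return out

def readout(features):
    """Sum over atoms to get an order-independent molecule-level vector."""
    return [sum(column) for column in zip(*features)]

# Ethanol heavy atoms, features = [is_carbon, is_oxygen].
# Ordering A: CH3 (0), CH2 (1), OH (2)
feats_a = [[1, 0], [1, 0], [0, 1]]
edges_a = [(0, 1), (1, 2)]

# Ordering B: the same molecule with atoms listed as CH3 (0), OH (1), CH2 (2)
feats_b = [[1, 0], [0, 1], [1, 0]]
edges_b = [(0, 2), (2, 1)]

conv_a = graph_conv(feats_a, edges_a)
conv_b = graph_conv(feats_b, edges_b)

print(readout(conv_a))
print(readout(conv_b))
```

Both orderings give the same molecule-level vector, because the computation depends only on which atoms are bonded to which, not on where an atom happens to appear in the list. That is the graph analogue of a CNN not caring where in the image the cat is.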

The final frontier is the development of novel architectures for life science data, with entirely new kinds of representations. This is a very active area of research: the very general nature of DNN tools (such as the ability to rapidly compute on generalized tensor quantities) makes it much easier to push into these new frontiers, building on the significant infrastructure work already accomplished by the DNN field more broadly. Tools like TensorFlow or Torch are very general, much like a turbocharged mathematical library, and are not limited to the specific network architectures used today.

#3: What new questions does this mean we can now ask?

Perhaps the most exciting result of this effort is the new opportunity to ask questions that were previously either considerably more challenging or unaddressable. My favorite in this category is the ability to discern causal relationships. While “correlation does not mean causation” is a frequently quoted truism, it does not follow that computation cannot approach causality at all. Indeed, recent statistical theories of causality crack open the door to rooting out causal factors from time series data. In a nutshell, the idea is that by understanding the sequence of events in time, such as “eating breakfast, then healthy, then drink poison, then dead,” one can ferret out causal connections, most importantly in cases that are much more nuanced than my example here.
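The role that time ordering plays can be illustrated with a deliberately crude check on synthetic data. The text does not name the specific theories it has in mind, so the sketch below is not one of them: it is just a lagged-correlation comparison, in the spirit of Granger-style tests, on a toy series in which `y` at each step is driven by `x` at the previous step. All names and parameters here are assumptions made for the example.

```python
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(0)  # fixed seed so the toy data is reproducible
n = 500

# x is pure noise; y at time t is driven by x at time t - 1 plus a little noise.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.0] + [0.9 * x[t - 1] + rng.gauss(0, 0.1) for t in range(1, n)]

x_leads_y = pearson(x[:-1], y[1:])  # does past x predict present y?
y_leads_x = pearson(y[:-1], x[1:])  # does past y predict present x?

print("past x vs. present y:", round(x_leads_y, 3))
print("past y vs. present x:", round(y_leads_x, 3))
```

Past values of `x` predict `y` strongly, while past values of `y` say almost nothing about `x`, which matches how the data was generated. A check like this cannot rule out a hidden common cause, which is one reason the more rigorous statistical frameworks matter for real data.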

While this field is still in active development, I can imagine that in the not-so-distant future, if the right tools and data sets emerge, we might even begin to see a clinical trial as a poor man’s surrogate for causality, second to truly data-based, statistically rigorous causality determined from much more data than is ever included in a single clinical trial (or three, or ten).

We are also beginning to be able to ask questions of AI itself, and of how it works. Another myth associated with machine learning is that ML is a black box, opaque to human understanding. There are in fact now means to interrogate DNNs to understand how they arrive at their predictions (we dedicate a whole chapter to this in the book). Ironically, DNNs create a framework that can, with the right tools and approaches, be directly probed and understood much more readily than the original black box of human intelligence.
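One of the simplest ways to probe a model is occlusion: hide one piece of the input at a time and see how much the prediction changes. The sketch below applies this to a sequence input. The `toy_model` function is a hand-written stand-in for a trained network (it just checks for the motif `TATA`), and the masking character `N` is an assumption for this example; the same loop works with any function that maps an input to a score.

```python
def toy_model(seq):
    """Stand-in for a trained network: scores 1.0 if the motif 'TATA' is present."""
    return 1.0 if "TATA" in seq else 0.0

def occlusion_importance(model, seq, mask="N"):
    """Drop in the model's score when each position is masked in turn."""
    baseline = model(seq)
    return [baseline - model(seq[:i] + mask + seq[i + 1:])
            for i in range(len(seq))]

seq = "GGCTATAGC"
importance = occlusion_importance(toy_model, seq)

for base, score in zip(seq, importance):
    print(base, score)
```

Positions whose masking changes the score are the ones the model relies on. Here that is exactly the four bases of the motif, which is the kind of answer one hopes to get when asking a real network why it made a prediction.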

Part of the broader project of our book, and of the toolkit we hope it provides, is in fact a much grander goal: to help foster open source biology. It once seemed quite unlikely that Linus Torvalds, a college student working on a whole new open source operating system called Linux, could create the core code that today drives a vast number of the world’s computing devices. But this is how major transformations can start. And while the challenges here are of course different, requiring “real world” experiments, labs, materials, and scientific knowledge beyond the scope of a laptop, we believe that the age of open source biology, in which a brilliant contributor can begin to change the world from a humble laptop (and perhaps from a dorm room!), is only beginning.
