Evaluating AI in Bio: How to Know Whether it is Worth the Work

Andy Tran and Vijay Pande

It’s Time to Heal is a special package about engineering the future of bio and healthcare. See more at: https://a16z.com/time-to-heal/.

Given its immense potential utility, AI is now a part of every link on bio’s value chain, from drug discovery, to diagnostic development, to healthcare delivery technologies. There are so many new applications for AI in bio — with more coming seemingly every day — that it is increasingly difficult to delineate the signal from the noise. A common question that we get from leaders in biopharma and healthcare, as well as investors and operators, is: “how do I assess a new AI-driven technology and make sure it is worth my time/effort/money?” This is an important question and in this piece, we will provide principles to abide by, point out some common pitfalls, and share how we think about evaluating AI-driven bio technologies.

(0) Do you really need AI to solve this problem?

The very first question is not about the product, but about the problem you want to solve. AI is not a panacea, so start by thinking through whether this problem requires, or would significantly benefit from, an AI-based approach. AI uniquely shines in complex tasks or analyses that require dealing with lots of unstructured data, where the key features aren’t well defined or intuitive to humans. If you want software that simply predicts trends affected by some known (or knowable) variables, AI would be overkill (or even detrimental). Conversely, AI can help you sift through data like complex medical images or unstructured health record notes to help diagnose diseases that are caused by an extensive set of interacting or unclear factors. You also have to consider the data itself. Is there enough high quality non-noisy data — both for training and testing — so that AI can be effective? Would you need to begin first with a separate effort for data generation and curation? If, and only if, you have meticulously pressure tested your problem and your data and are convinced the bio problem is ripe for AI, is it time to assess the platform or product itself.

(1) Is it really AI or is it just marketing hype? 

It’s very common to confuse (and sometimes intentionally misuse) the term “AI” when really what’s meant is automated data analysis with pre-programmed software. When we talk about AI, we are referring to algorithms or platforms that autonomously uncover unique insights that would be extremely hard or even impossible for humans to deduce, at least within a reasonable time scale. Those insights then continue to be improved and optimized as the data scales over time — true AI systems are iterative and become increasingly autonomous.

On the other hand, automation uses rules-based systems to “predict” outcomes, but those predictions don’t adapt. Automation might allow for completion of repetitive tasks, but it cannot learn from those tasks in order to complete new tasks. Medical transcription software not powered by AI, for instance, can be great at understanding classical cardiovascular terminology, but if it encounters newer oncology research concepts or previously unexplored ontologies, it will have no way to adapt and learn. Beware of companies that claim to use AI but are actually just doing basic data analysis based on statistical analyses selected by humans. This could look like a model that estimates the length of hospital stays based on a regression analysis that utilizes features (disease severity, age, etc.) selected by physicians. This is not AI.

To identify real AI, it is important to dig into how the platform is trained. Are features learned autonomously or are they all pre-anticipated or pre-selected? Can it actually adjust on its own based on trial and error or is it bound by certain parameters? Do the accuracy and predictive power autonomously improve over time or are they flat? Does it create its own high-volume data exhaust? This data exhaust is a notable aspect of AI and it is used to iteratively improve the model. In sum, fake AI systems rely heavily on laborious input and human oversight and don’t adapt. True AI systems are capable of learning, identifying features independently, and improving over time. Once you’ve established that you are working with true AI, you can dig deeper into how the AI works and how it stacks up against competitors.

(2) Can the model actually achieve something differentiated?

The next step of evaluating any new AI driven technology is to determine whether it is differentiated from the competition. To understand the innovative nature of a given product, it is naturally important to have a deeper understanding of the application domain (medical transcription, drug design, biomarker discovery, clinical trial prediction, etc). The core question here is whether it actually enables something unprecedented in the field, be it a completely new use case or order of magnitude improvement in speed/efficiency/cost/etc. The logic here is the same for all new products, AI or not. 

When we think about differentiation, it all boils down to how hard it would be for a third party to replicate this technology, or improve on it (is there a moat?). Understanding where the datasets and even the AI algorithms themselves come from is extremely important. One of the amazing things about the democratization of AI is that high quality open-source packages and datasets are readily available off-the-shelf. Even introductory computer science students can now pull together a simple machine learning classifier. While this represents an incredible advance for the broader field, one must also discern whether a given platform can be replicated with off-the-shelf tools or if there is some fundamental advancement.

(3) Is it working? How do you know?

Once you have ascertained that the product in question is true AI and is differentiated from competitors, now it is time to look under the hood and understand how it performs quantitatively. To do this, it is critical to know your metrics for a given application. For instance, if you are dealing with a classification problem (e.g., classifying whether a tissue sample is cancer or not) you should aim to maximize accuracy. Knowing the AUC* value, sensitivity, specificity, etc. is very important. Alternatively, for a complex regression problem like predicting the molecular property values of a drug or the ideal dose of a treatment for a patient, you should strive to minimize error, with metrics like R2† or RMSE‡ being key.
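To make these metrics concrete, here is a minimal sketch in plain Python. The function names and the toy numbers are our own illustrative assumptions, not output from any real product; in practice you would likely use a well-tested statistics library rather than hand-rolled functions.

```python
import math

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (true positive rate) and specificity (true negative rate) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """AUC as the chance that a random positive scores above a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rmse(y_true, y_pred):
    """Root mean square error: lower is better."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """R2: 1 minus the ratio of residual variation to total variation."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy classification example (made-up labels and model scores).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
calls = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cutoff is an arbitrary choice
sens, spec = sensitivity_specificity(labels, calls)
model_auc = auc(labels, scores)

# Toy regression example (made-up measured vs. predicted values).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
model_rmse = rmse(measured, predicted)
model_r2 = r_squared(measured, predicted)
```

Note that sensitivity and specificity depend on the cutoff you choose, while AUC summarizes performance across all cutoffs, which is one reason to look at several metrics together.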

However, maximizing accuracy or minimizing error is not enough for guaranteed success in a real world setting. You have to know your thresholds for utility. As predictive modeling is relative and specific to the application at hand, maximum accuracy is normally not feasible (nor is it required). It is all based on context and current benchmarks for AI-driven algorithms applied to that problem. A 0.71 R2 might not seem impressive at face value, but can be astounding if there is no precedent for a particular application. For example, if you are predicting clinical trial outcomes, even an imperfect system that provides only a modest boost in predictive performance (perhaps allowing you to de-prioritize one extra program per year) can mean billions of dollars saved for an organization. Once you understand your performance, it is important to also compare it to how simpler methods perform. Knowing how performance changes if you replace your hyper-tuned sophisticated deep learning algorithm with a simpler random forest or logistic regression will show you how much of the result actually comes from your model’s sophistication. As AI finds its way to more corners of bio, new applications may arise that might not even have established benchmarks. In those cases, the most important aspect is understanding how AI improves the accuracy, speed or precision of a particular task compared to the standard methodology. (Though these uncharted situations may seem tricky, those greenfield opportunities are often the most exciting!)
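The comparison against simpler methods can be sketched as follows. This is only an illustration under stated assumptions: `candidate_model` is a hypothetical stand-in for whatever sophisticated model is being evaluated, the two baselines (predict the training mean, and a one-feature least-squares line) are our choice of simple comparators, and the data are made up.

```python
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def fit_mean_baseline(y_train):
    """Trivial baseline: always predict the average training value."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean

def fit_linear_baseline(x_train, y_train):
    """Simple baseline: ordinary least squares on a single feature."""
    n = len(x_train)
    mx, my = sum(x_train) / n, sum(y_train) / n
    sxx = sum((x - mx) ** 2 for x in x_train)
    sxy = sum((x - mx) * (y - my) for x, y in zip(x_train, y_train))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def compare_models(models, x_test, y_test):
    """Score every model on the same held-out data so the numbers are comparable."""
    return {name: rmse(y_test, [m(x) for x in x_test]) for name, m in models.items()}

# Made-up data: the true relationship is quadratic.
x_train, y_train = [0, 1, 2, 3], [0, 1, 4, 9]
x_test, y_test = [4, 5], [16, 25]

# Hypothetical stand-in for the sophisticated model under evaluation.
candidate_model = lambda x: x ** 2 + 0.5

results = compare_models(
    {
        "mean_baseline": fit_mean_baseline(y_train),
        "linear_baseline": fit_linear_baseline(x_train, y_train),
        "candidate": candidate_model,
    },
    x_test,
    y_test,
)
```

If the candidate barely beats the linear baseline on held-out data, the added complexity may not be buying much; a large gap, as in this toy case, is what you would hope to see.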

(4) Is it working… too well? 

At this point, perhaps you have the data, you know your metrics and benchmarks, you have your AI trained, and… it is spitting out .99 AUC! It seems like you have cracked the code and your platform is ready for prime time! But hold the champagne. Unfortunately, as investors and practitioners in the space, we have seen how this movie ends far too many times. Spoiler alert: This superhuman AI algorithm quickly falls flat once it is released into the wild and is exposed to real world data, giving you predictions on par with a coin flip. “But how can that be?” one may ask, especially after the months of training and validation, all while leveraging state-of-the-art AI tools. 

One possible explanation for this super high accuracy is that the answer may have already been hidden in the training data set, so the process was essentially corrupted from the start. Simply speaking, the answers to the test set were accidentally leaked into the training data set. Technically speaking, the data preparation and cross-validationº processes caused data leakage. A classic illustrative example is the development of a seemingly perfect-accuracy AI-driven tumor detector from tissue images. The system, however, failed completely when it was used on tumor images that came from a different hospital. Looking back at the data, the scientists realized that all the images with a tumor had a little white ruler in the picture to measure the size of the tumor! The ruler was a hidden cheat in the training set, turning the model into a well-trained ruler detector. The take home message here is to be mindful of cleansing your data of its “white rulers.” Knowledge of stats alone isn’t enough; you also need to understand how the data was generated.
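A few basic sanity checks for this kind of leakage can be sketched in code. Everything here is an illustrative assumption: the record fields (`patient_id`, `site`, `has_ruler`, `has_tumor`) are hypothetical, and real image data would need its own way of surfacing artifacts like the ruler. The idea is to hold out an entire hospital, confirm no patient appears on both sides of the split, and flag any single piece of metadata that predicts the label suspiciously well.

```python
def split_by_site(records, held_out_sites):
    """Hold out whole hospitals so the test set comes from sites the model never saw."""
    train = [r for r in records if r["site"] not in held_out_sites]
    test = [r for r in records if r["site"] in held_out_sites]
    return train, test

def leaked_ids(train, test, key="patient_id"):
    """IDs that appear on both sides of the split (should be empty)."""
    return {r[key] for r in train} & {r[key] for r in test}

def single_feature_agreement(records, feature, label="has_tumor"):
    """Fraction of records where one metadata flag alone matches the label."""
    matches = sum(1 for r in records if bool(r[feature]) == bool(r[label]))
    return matches / len(records)

# Made-up records: at hospital A, every tumor image happens to include a ruler.
records = [
    {"patient_id": "p1", "site": "A", "has_ruler": 1, "has_tumor": 1},
    {"patient_id": "p2", "site": "A", "has_ruler": 0, "has_tumor": 0},
    {"patient_id": "p3", "site": "A", "has_ruler": 1, "has_tumor": 1},
    {"patient_id": "p4", "site": "B", "has_ruler": 0, "has_tumor": 1},
    {"patient_id": "p5", "site": "B", "has_ruler": 0, "has_tumor": 0},
]

train, test = split_by_site(records, held_out_sites={"B"})
overlap = leaked_ids(train, test)
ruler_in_train = single_feature_agreement(train, "has_ruler")
ruler_in_test = single_feature_agreement(test, "has_ruler")
```

In this toy data the ruler flag matches the label perfectly at the training hospital but only half the time at the held-out one, which is the pattern that should prompt a closer look at how the images were collected.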

Sometimes the pitfall of an AI model is even more insidious and can’t be pinpointed to a particular feature. These are tougher to spot because they may not be as obvious or as binary, and hence may be the difference between an R2 of 0.6 and 0.78. A sneaky example that commonly plagues AI algorithms is the problem with time-series data. Take an AI-driven platform that strives to predict the probability of success (PoS) of a drug in clinical trials. At first blush, it seems natural to use the entirety of clinical trial information available. Upon testing, you also would be (mistakenly) impressed when your model confidently predicts the result of some pivotal trial from 2007. The fallacy here is that the AI model already had cues from the future incorporated into it, making the outcome much easier to predict. Despite the data being cleaned, deduplicated and devoid of any hidden cues, the clinical trial dataset with info up to 2020 has incorporated “cheats” from new biological and clinical learnings (e.g., from new dosage regimens, interactions with new modalities, trials with more refined patient subgroups, etc.) that it would not have had in 2007, and thus the model is not generalizable to future trials. With time-series data, we have to be careful not to introduce leakage by letting our model peek into the future.
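One common guard against peeking into the future is to split the data by time rather than at random. The sketch below is a minimal illustration, assuming hypothetical trial records with a `completion_year` field; the field names and the example trials are made up.

```python
def temporal_split(trials, cutoff_year):
    """Train only on trials completed before the cutoff; test on the rest."""
    train = [t for t in trials if t["completion_year"] < cutoff_year]
    test = [t for t in trials if t["completion_year"] >= cutoff_year]
    return train, test

def rolling_origin_splits(trials, test_years):
    """Yield (year, train, test): train on everything before `year`, test on trials from `year`."""
    for year in test_years:
        train = [t for t in trials if t["completion_year"] < year]
        test = [t for t in trials if t["completion_year"] == year]
        if train and test:
            yield year, train, test

# Made-up trial records.
trials = [
    {"trial_id": "T1", "completion_year": 2004, "succeeded": 0},
    {"trial_id": "T2", "completion_year": 2006, "succeeded": 1},
    {"trial_id": "T3", "completion_year": 2007, "succeeded": 1},
    {"trial_id": "T4", "completion_year": 2012, "succeeded": 0},
    {"trial_id": "T5", "completion_year": 2020, "succeeded": 1},
]

# To evaluate a prediction for a 2007 trial, train only on what was known before 2007.
train, test = temporal_split(trials, cutoff_year=2007)
latest_train_year = max(t["completion_year"] for t in train)
earliest_test_year = min(t["completion_year"] for t in test)

# Repeating the exercise at several cutoffs mimics how the model would be used over time.
splits = list(rolling_origin_splits(trials, test_years=[2007, 2012]))
```

A time-based split only controls which rows the model sees. Any feature derived from later knowledge (for example, an annotation added to an old trial years afterward) would also have to be rebuilt as of the cutoff date, which this sketch does not attempt.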

Ultimately, a good model (and thus a good product) ensures that the training data is truly representative of and generalizable to the real world prospective data that it would be given to analyze. 

(5) Did you run a prospective test, the gold standard of any validation?

Lastly, it goes without saying that the proof is usually in the pudding. Even if you have painstakingly followed all of the steps above (and have picked clear cut controls to establish the baseline, ensured no bias or cues from data leakage, checked that training data is generalizable), you have still only tested your AI platform using historical data with predetermined answers. Simply put, everything was retrospective. But for real world applications, you can only control so much and the unknown unknowns can trip you up, even if you were not intending to cheat!

When making a final decision on a given technology, nothing beats a carefully crafted randomized clinical trial-like prospective test to truly validate the AI platform. That is the holy grail of tests — a real life dry run. Realistically, this may sometimes be impractical in terms of time, resources, and cost for a new technology, so the next best test would be some form of a retrospective, blinded test. A classic benchmarking dataset can give you the opportunity to compare performance of competing technologies in a head-to-head study. 

In conclusion, as AI continues to seep into every corner of bio, we believe these guiding principles are of paramount importance for both practitioners and business partners alike. But these complex models — and their applications to complex biology — require a unique skillset to truly understand. We believe that companies must intertwine their AI experts with their subject matter experts. Only this synergistic combination will be able to capture the full — and massive — potential of AI in bio. But this framework can be a point of entry for those who were once standing on the sidelines, either with curiosity or skepticism, to begin to assess whether a given AI-driven product is worth the investment of their time and capital.


*Area under the curve (AUC) is a performance measurement for a classification problem, representing the degree or measure of separability. It tells how well a model can distinguish between classes. The higher the AUC, the better the model is at prediction. The theoretical maximum is 1. When AUC is 0.5, the model has no class separation capacity whatsoever.

†R2, or R-squared, is a measure of how well the data points align with the model. The ideal value for R2 is 1. The closer the value of R2 is to 1, the better the model fits the data.

‡RMSE: Root Mean Square Error (RMSE) is a measure of the error of a model in predicting quantitative data. The smaller the RMSE, the better.

ºCross-validation is primarily applied in AI to estimate the skill of a model on unseen data. First, the available dataset is split into three subsets: training, validation, and test data. The model is trained using the training set, with the goal of producing a model that scores highest on some metric, like accuracy. The model is then optimized by tuning certain parameters and judging its performance on the validation set. Lastly, the model’s success is measured by judging its performance on the test dataset.