- by Stefan Schroedl, Head of Machine Learning at Atomwise
Originally posted on Medium
Having applied machine learning to search and advertising for many years, two years ago I was ready for a transition to something new. In particular, the field of life sciences seemed to stand out as an opportunity to have a rewarding and positive impact. I was excited to join Atomwise, working on deep learning for drug discovery.
Deep neural networks started to become particularly popular around 2012, when researchers from the University of Toronto [Krizhevsky et al, 2012] won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). In recent years, this brand of machine learning techniques has revolutionized several fields of artificial intelligence, such as computer vision, natural language processing, and game playing. Will it be able to similarly transform chemistry, biology, and medicine? Experimental sciences give rise to a wealth of unstructured, noisy, and sometimes poorly understood data. One central attraction of deep neural nets is their ability to build complex and suitable featurizations without the need to handcraft them explicitly. Progress has undoubtedly been made, and compelling ideas have been proposed and are being developed. In the same year as that ImageNet result, Kaggle hosted the Merck Molecular Activity Challenge; it sparked a lot of interest as well, spurred research into life-science applications, and captured attention in the popular press. Still, it is fair to say that many obstacles and challenges remain to be conquered before an overall breakthrough for practical applications can be declared.
If you are like me, coming from a machine learning or software engineering background but without extensive exposure to medicine, biology, or chemistry, be encouraged: the algorithmic problem can often be formulated and separated cleanly from the (wet) hardware. With this blog post, I am going to try to give you a flavor of, and a starting point into, the wondrous and peculiar world of drug discovery. It is based mainly on osmosis from my brilliant co-workers, the many papers I have read, and the Khan Academy and YouTube videos I have watched. There are plenty of excellent online resources available, but be warned: it is really easy to disappear down the rabbit hole. I hope this post will help you with some high-level orientation. Needless to say, while I try to cover the breadth fairly, the selection is necessarily subjective. Familiarity with machine learning, particularly contemporary neural network approaches, is assumed; the focus will rather be on things that seem new and unusual from this perspective. In addition, I would also highly recommend this blog post, as it was quite a helpful introduction for me.
People say the practice of machine learning is (at least) 80% data processing and cleaning and 20% algorithms. This is no different for machine learning in chemistry and biology. I will briefly start with the underlying physical sciences and experimentation, touch on cheminformatics representations, and then see how the type of machine learning follows from the structure of the inputs and targets.
This post is organized in two parts (apologies, it grew longer than I originally planned …). This first one will recap the basics of drug mechanisms, the research pipeline, experimental databases, and evaluation metrics. This sets the stage for the second part, in which we delve into representation formalisms as well as the range of traditional machine learning and deep learning approaches in greater detail.
To skip to the conclusion, I found the transition into this field very rewarding. There are plenty of novel, cutting-edge problems that can be addressed by computer scientists and machine learning engineers even without a deep background in life sciences. So if you are curious and interested in getting involved, don’t hesitate!
So now, without further ado, let’s dive right in and recap the basic premises.
Many proteins in our body are enzymes, which means that they speed up the rate of a chemical reaction without being irreversibly changed themselves. Others are part of signaling pathways that control highly specific cell adaptations and reactions. We can imagine these proteins working via a lock-and-key principle: One or multiple smaller molecules (substrates, ligands) fit snugly into a “hole” (binding pocket, active site) of a protein, thereby facilitating a subtle structure change, which in turn can lead to a domino chain of reactions. Agonists (resp. antagonists) tend to speed up (resp. slow down or block) reactions. Because proteins ‘receive’ signals via other molecules, they are also called receptors. There are also more subtle ways to interact than this competitive inhibition; for example, in allosteric regulation, compounds bind in other places away from the active site, but indirectly affect the reaction through distant conformation changes in the protein.
Drug molecules (compounds) of this binding type are designed to be structurally similar to the native ligands (the "original" keys); they bind, but fail to induce a reaction, and they block it by displacing the native ligand. They are like a fake duplicate key: close enough to fit into the hole, but not of the right shape to turn the lock.
Most current drugs are of the type described above, called "small molecules", and they are what we will focus on here. They usually consist of roughly 10–100 atoms, compared to proteins, which are huge (thousands of atoms). Apart from that, other classes of medicines exist and are being developed, such as biologic drugs or therapeutic antibodies.
I am probably not telling you anything new when I say that the drug development cycle is slow and expensive: 15 years and 2.5 billion dollars, yada yada yada. There are many good online resources you can read about it, e.g. here. Just as a brief recap, the development stages are [Hughes et al, 2011]:
- target identification and validation
- hit discovery (high-throughput or virtual screening)
- hit-to-lead
- lead optimization
- preclinical studies
- clinical trials and regulatory approval
In this post, I will focus on the use of machine learning in the context of virtual high-throughput screening (VHTS) and hit-to-lead/lead optimization (for the latter, the technical term is QSAR, quantitative structure-activity relationship, i.e., models that predict binding affinity values from compound and possibly protein features). A few larger companies (e.g., IBM with Watson) and startups also work on supporting target selection, e.g., by mining vast amounts of clinical, genomic, expression, and text data from publications. However, I will not discuss these approaches further here.
What initially confused me was that researchers make a fine-grained distinction between hit discovery/hit-to-lead and lead optimization. From a purely machine learning standpoint, you would expect the goal and methods to be identical: find a compound with a high binding affinity (and maybe, as a bonus, good metabolic properties). Even so-called "docking" methods, which identify the spatial protein-ligand interaction, could be seen as one approach to this; but as we will see later, docking is treated as its own separate category of algorithms.
The varying degrees of (a) noise in the data and (b) computational budget, driven by the size and success rate of the candidate set, give rise to different approaches. ML for VHTS must be fast and have good ranking performance, while the precision of the score values matters less. QSAR applications, on the other hand, can be more elaborate and don't need to deal as well with non-binders, but they should predict binding affinity with high accuracy and correlation.
We could see this as a "chemical version of the uncertainty principle": with increasing complexity of experiments, from pure binding assays to multi-step chemical reactions to cell- and animal-based assays, the biological relevance with respect to the ultimate goal of developing a viable medicine increases; but so does the measurement noise. Even simple molecular properties can only be known to limited precision; this unavoidable level of label noise is what the term chemical accuracy (commonly taken to be about 1 kcal/mol for binding energies) refers to.
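To get a feel for the scale of this noise, here is a small back-of-the-envelope sketch (my own illustration, not taken from the cited literature): via the thermodynamic relation ΔG = RT ln Kd, an additive error of 1 kcal/mol in binding free energy shifts the measured dissociation constant by a multiplicative factor.

```python
import math

# Back-of-the-envelope sketch: how much does a 1 kcal/mol error in binding
# free energy move the dissociation constant Kd? Since Delta_G = R*T*ln(Kd),
# an additive shift dG multiplies Kd by exp(dG / (R*T)).
R = 1.987e-3   # gas constant, kcal/(mol*K)
T = 298.15     # room temperature, K
dG = 1.0       # "chemical accuracy", kcal/mol

factor = math.exp(dG / (R * T))
print(f"1 kcal/mol corresponds to a ~{factor:.1f}x change in Kd")  # ~5.4x
```

In other words, even a label accurate to within chemical accuracy is only pinned down to roughly a five-fold factor on the concentration scale.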
In addition to experimental error, the medical and biochemical literature is heavily subject to publication bias. Incentives exist to report positive results, but naturally less so for negative results or for repetitions of previous experiments. In fact, the entire drug discovery pipeline is usually geared towards "failing fast and cheap", i.e., eliminating unpromising candidates as early as possible (there is no point in spending money to find out exactly how inactive an inactive compound is). Consequently, negative data is scarce and of low accuracy.
ChEMBL is a public database containing millions of bioactive molecules and assay results. The data has been manually transcribed and curated from publications. ChEMBL is an invaluable resource, but it is not without errors; e.g., sometimes affinities are off by exactly 3 or 6 orders of magnitude due to wrongly transcribed units (micromolar instead of nanomolar).
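As a concrete illustration of the kind of sanity check this suggests, here is a minimal sketch that flags replicate measurements differing by about three (or six) orders of magnitude; the column names and toy data are made up for the example and do not reflect the actual ChEMBL schema.

```python
import numpy as np
import pandas as pd

# Minimal sketch of a unit-transcription sanity check. The dataframe layout
# ('compound_id', 'target_id', 'affinity_nM') is hypothetical, not ChEMBL's.
df = pd.DataFrame({
    "compound_id": ["C1", "C1", "C2", "C2"],
    "target_id":   ["T1", "T1", "T1", "T1"],
    "affinity_nM": [12.0, 12000.0, 50.0, 55.0],
})

# On a log10 scale, a uM/nM mix-up shows up as a difference of ~3 (or ~6).
df["log_aff"] = np.log10(df["affinity_nM"])
spread = df.groupby(["compound_id", "target_id"])["log_aff"].agg(
    lambda s: s.max() - s.min())

suspicious = spread[((spread - 3).abs() < 0.5) | ((spread - 6).abs() < 0.5)]
print(suspicious)   # -> flags compound C1 on target T1
```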
Patents come with their own particular issues. Obviously, they are under an incentive to be as broad as possible. So they often contain tables with one column for a molecule diagram and another column for the associated affinity value. But the diagram can contain one or more R-groups, each of which may be substituted by any of several groups listed in the text; clearly, not all of these (sometimes hundreds of) resulting molecules can have exactly the same affinity. On average, affinities from patents are higher than those from publications, since the pharmaceutical industry strives to optimize ligand potency as much as possible, while academic research is more focused on basic science and early-stage candidates.
Pharmaceutical companies maintain their own proprietary databases of experimental results, and assemble large libraries of molecules that are routinely used in screening. The scale and reproducibility of internal databases exceeds all publicly available data probably by an order of magnitude.
In the image recognition domain, very large benchmark datasets (e.g., ImageNet) exist, and researchers more or less agree on uncontroversial evaluation criteria. This allows the predictive accuracy of different techniques to be adequately quantified and compared. Unfortunately, this is currently not the case for virtual screening and QSAR; indeed, machine learning in drug discovery has been held back by the lack of such common standards and metrics.
It is not that efforts in this direction haven't been undertaken. The widely used DUD-E benchmark [Mysinger et al, 2014] chose challenging decoys (negative examples) by making them similar to binders (positive examples) in terms of 1-D properties such as molecular weight, number of rotatable bonds, etc., while at the same time being topologically dissimilar, in order to minimize the likelihood of actual binding. For the 'topologically dissimilar' part, 2-D fingerprints were used. Later, it was convenient for researchers to continue to publish their methods' accuracy on this benchmark, even for QSAR methods for which DUD-E had not been designed. Not too surprisingly, then, fingerprint-based methods turned out to excel at this task, and hence justified the publications.
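As a side note, the 2-D topological (dis)similarity used for decoy selection is easy to compute with standard cheminformatics tooling. Here is a minimal sketch using Morgan fingerprints in RDKit; this is my choice for illustration, and the exact fingerprint and cutoff used in building DUD-E may differ.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Topological similarity of two molecules via Morgan (ECFP-like) fingerprints.
# A low Tanimoto value is what "topologically dissimilar" means in practice.
aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
ibuprofen = Chem.MolFromSmiles("CC(C)Cc1ccc(cc1)C(C)C(=O)O")

fp1 = AllChem.GetMorganFingerprintAsBitVect(aspirin, 2, nBits=2048)
fp2 = AllChem.GetMorganFingerprintAsBitVect(ibuprofen, 2, nBits=2048)

print(f"Tanimoto similarity: {DataStructs.TanimotoSimilarity(fp1, fp2):.2f}")
```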
Another frequently used database is PDBbind, which contains protein-ligand co-crystal structures together with binding affinity values. Again, while certainly very valuable, PDBbind has some well-known data problems. Most authors have adhered to a data split with significant overlap or similarity between examples in the training and test sets, and some works have not distinguished between validation and test sets. Though there have been updated versions over the years, the 2007 release is still frequently used in order to maintain comparability with previous publications.
More broadly, the question of how to split existing data into training and test sets is actually quite tricky; I can attest to the fact that, by choosing a particular split scheme, almost any model can be made to look excellent or mediocre!
The classical machine learning literature pays little attention to this aspect. Most often, the underlying assumption is that training and test examples are drawn i.i.d. from the same distribution. However, this is not necessarily a good indicator of prospective performance in drug discovery. A random split is usually overly optimistic, since molecules and targets can be identical or very similar between training and test. Sometimes, when chemists plan which molecules to synthesize next, they intentionally choose ones that are dissimilar from those they already know, in order to explore a larger chemical space. One sensible approach is a time split, simulating inference on future data from the vantage point of a particular moment. Alternatively, [Martin et al, 2017] cluster the compounds for a given target and assign the smallest clusters to the test set (see the sketch below). We see that the topic of bias in drug discovery datasets is a thorny one [Wallach and Heifets, 2018] …
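To make this concrete, here is a minimal sketch of a scaffold-based split in the same spirit (though not the exact clustering procedure of [Martin et al, 2017]): compounds sharing a Bemis-Murcko scaffold stay on the same side of the split, and the rarest scaffolds are pushed into the test set so that test compounds look novel relative to training. The function name and details are my own.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_fraction=0.2):
    """Group compounds by Bemis-Murcko scaffold; rarest scaffolds go to test."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)].append(idx)

    train_idx, test_idx = [], []
    n_test_target = int(test_fraction * len(smiles_list))
    # Smallest scaffold groups first, so the test set is enriched in
    # "novel" chemistry the model has not seen during training.
    for _, members in sorted(groups.items(), key=lambda kv: len(kv[1])):
        (test_idx if len(test_idx) < n_test_target else train_idx).extend(members)
    return train_idx, test_idx
```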
We have mentioned above that negative examples (in this context sometimes called decoys) are hard to retrieve from publicly available sources; and those that do exist have noisier measurements and might in fact be false negatives. The way we choose negative examples exerts a strong influence on our models and outcomes. We could make the simplifying assumption that, since biological activity is a rare phenomenon, any molecule is presumed inactive until proved otherwise. Then we could proceed with an approach like PU learning (only positives and unknown labels are given, see e.g. [Han et al 2016]), or we could generate synthetic examples. But the prediction model will still be influenced by the chosen sampling distribution. And for QSAR, the problem is even harder, since we have to invent not only the class (positive/negative) but also an actual affinity value as the target.
Let us now turn to the question of how to choose appropriate evaluation criteria. A good metric should be [Lopes et al, 2017]: sensitive to algorithm variations, but insensitive to data set properties; statistically robust; have as few parameters as possible; have straightforward error bounds; have a common range of values, regardless of the data; and be easily understandable and interpretable.
In lead optimization, we want to obtain precise, continuous binding affinity predictions for a modest number of compounds, so a standard metric would be the root mean squared error. Measures intended to reduce the influence of outliers, such as the mean absolute error or the Huber loss, are reasonable choices as well.
The situation can be much tougher in virtual high-throughput screening. Let's work backward from the ultimate use case: Assume we have a set of 100,000 potential candidate molecules, out of which 500 are true hits. We run our latest newfangled model and rank all candidates by score. Now we can select the top 1,000 molecules and use a chemical assay to try to confirm our model's predictions. By chance alone, we would expect to find 5 hits. If the tests instead reveal, say, 20 hits in the list of top-ranked candidates, we would say our model has an enrichment factor of 4 (funny how this metric sounds like something chemical itself).
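Written out, the enrichment factor at a cutoff of the top n candidates is the hit rate among those n divided by the overall hit rate. Here is a minimal sketch of the computation, reconstructing the toy numbers above (the function name and data are my own):

```python
import numpy as np

def enrichment_factor(labels_ranked, top_n):
    """EF = hit rate in the top_n ranked candidates / overall hit rate."""
    labels_ranked = np.asarray(labels_ranked)   # 1 = hit, 0 = non-hit,
    return labels_ranked[:top_n].mean() / labels_ranked.mean()  # sorted by score

# 100,000 candidates, 500 true hits, 20 of which the model ranks in the top 1,000.
labels = np.zeros(100_000, dtype=int)
labels[:20] = 1              # 20 hits inside the top 1,000 ...
labels[50_000:50_480] = 1    # ... and the remaining 480 further down the list
print(enrichment_factor(labels, top_n=1_000))   # -> 4.0
```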
While the enrichment factor metric is intuitive and well understood by chemists, it is not great to operate with: it depends strongly on the ratio of positives and on the cutoff, and it can saturate. It is also hard to convert into something differentiable to use for gradient descent.
A host of other metrics have been proposed in the literature. The area under the ROC curve (AUC) is widely used in other domains as well; it is independent of the positive rate and has an intuitive probabilistic interpretation. Its drawback is that it is a classification measure; for screening, we should instead give more weight to the early positions in the ranking. The correlation coefficient implicitly assumes a linear relationship (its rank-based variant, a monotonic one). Another measure, the area under the accumulation curve (AUAC), was shown to be functionally related to AUC, and to the average rank of the positives as well. The power metric [Lopes et al, 2017], defined as TPR/(TPR+FPR) and validated by simulations, was proposed as an improvement. Robust initial enhancement (RIE) replaces the hard cutoff with an exponential weighting of the ranks of the positives; BEDROC [Truchon and Bayly, 2007] is derived from RIE by normalizing the range of possible values to [0,1]. Unfortunately, a drawback of many of these metrics, not shared by AUC, is that they depend on the overall positive rate, which is not related to algorithm performance.
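To make the exponential weighting behind RIE concrete, here is a small sketch computed directly from its definition in [Truchon and Bayly, 2007] (BEDROC is a further rescaling of this quantity to [0,1]); the function name and test data are my own.

```python
import numpy as np

def rie(labels_ranked, alpha=20.0):
    """Robust initial enhancement: exponentially weighted early recognition.

    Expects labels sorted by decreasing model score (1 = active, 0 = inactive).
    """
    labels_ranked = np.asarray(labels_ranked)
    N = len(labels_ranked)
    ranks = np.arange(1, N + 1)                 # 1-based ranks
    active_ranks = ranks[labels_ranked == 1]
    observed = np.exp(-alpha * active_ranks / N).sum()
    # Expected value of the same sum if actives were placed uniformly at random.
    expected = (len(active_ranks) / N) * np.exp(-alpha * ranks / N).sum()
    return observed / expected

labels = np.zeros(100_000, dtype=int)
labels[:500] = 1                 # perfect ranking: all actives on top
print(rie(labels))               # high RIE; a random ranking scores around 1
```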
An eye-opening paper on the pitfalls of benchmarks and metrics is [Gabel et al, 2014]. A random forest model was trained on PDBbind for affinity prediction from co-crystals; the features were distance-binned counts of protein-ligand atom pairs. This model strongly outperforms a standard force-field based scoring function on this dataset. But surprisingly, when the same model was applied to the task of finding the right poses (the relative positions of ligand and protein), or to the DUD-E dataset for virtual screening, it performed rather poorly. Through a number of ablation experiments, the authors show that the learned score in fact relies on atom-type counts and is almost independent of protein-ligand interactions.
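For intuition, here is a sketch of this style of featurization; the element list, bin edges, and array layout are my own illustrative choices, not those of [Gabel et al, 2014].

```python
import numpy as np

ELEMENTS = ["C", "N", "O", "S"]                 # illustrative element alphabet
BIN_EDGES = np.arange(0.0, 14.0, 2.0)           # 2 A distance bins up to 12 A

def pair_count_features(lig_xyz, lig_elems, prot_xyz, prot_elems):
    """Counts of protein-ligand atom pairs per element pair and distance bin."""
    lig_xyz, prot_xyz = np.asarray(lig_xyz), np.asarray(prot_xyz)
    n_bins = len(BIN_EDGES) - 1
    feats = np.zeros((len(ELEMENTS), len(ELEMENTS), n_bins))
    # All pairwise ligand-protein distances at once.
    dists = np.linalg.norm(lig_xyz[:, None, :] - prot_xyz[None, :, :], axis=-1)
    for i, le in enumerate(lig_elems):
        for j, pe in enumerate(prot_elems):
            if le not in ELEMENTS or pe not in ELEMENTS:
                continue
            b = np.digitize(dists[i, j], BIN_EDGES) - 1
            if 0 <= b < n_bins:
                feats[ELEMENTS.index(le), ELEMENTS.index(pe), b] += 1
    return feats.ravel()    # flat feature vector, e.g. for a random forest
```

A model trained on such counts can pick up how large and atom-rich a ligand is without ever learning anything about the specific protein-ligand interactions, which is exactly the failure mode the ablation experiments exposed.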
While there is a lot more material we could get into, I will stop at this point, but I hope to have given you a brief glimpse of the drug discovery problem, its data sets, benchmarks, and evaluation metrics. In the next installment, we will dive into the details of concrete machine learning approaches.