Laurianne David1,2, Josep Arús-Pous1,3, Johan Karlsson4, Ola Engkvist1, Esben Jannik Bjerrum1, Thierry Kogej1, Jan M Kriegl5, Bernd Beck5, Hongming Chen1,6.
Abstract
In recent years, the development of high-throughput screening (HTS) technologies and their establishment in an industrialized environment have given scientists the possibility to test millions of molecules and profile them against a multitude of biological targets in a short period of time, generating data at a much faster pace and with higher quality than before. Besides the structure-activity data from traditional bioassays, more complex assays such as transcriptomics profiling or imaging have also been established as routine profiling experiments thanks to the advancement of Next Generation Sequencing and automated microscopy technologies. In industrial pharmaceutical research, these technologies are typically established in conjunction with automated platforms to enable efficient handling of screening collections of thousands to millions of compounds. To exploit the ever-growing amount of data generated by these approaches, computational techniques are constantly evolving. In this regard, artificial intelligence technologies such as deep learning and machine learning methods play a key role in the cheminformatics and bio-image analytics fields, where they address activity prediction, scaffold hopping, de novo molecule design, reaction/retrosynthesis prediction, and high-content screening analysis. Herein we summarize the current state of analyzing large-scale compound data in industrial pharmaceutical research and describe the impact it has had on the drug discovery process over the last two decades, with a specific focus on deep-learning technologies.
Keywords: Artificial intelligence; Chemogenomics; Large-scale data; deep learning; pharmaceutical industry
Year: 2019 PMID: 31749705 PMCID: PMC6848277 DOI: 10.3389/fphar.2019.01303
Source DB: PubMed Journal: Front Pharmacol ISSN: 1663-9812 Impact factor: 5.810
Figure 1. Different categories of large-scale compound data in industrial pharmaceutical research.
Number of SAR data points in large pharmaceutical companies, as reported in the literature.
| Company | Number of SAR points | Date | Reference |
|---|---|---|---|
| AstraZeneca | 150 million single-shot SAR points, 14 million (a) CR SAR points | Up to 2008 | ( |
| Boehringer Ingelheim | 260 million single-shot SAR points, 7 million CR SAR points | Up to 2011 | ( |
| Pfizer | 0.6 million CR SAR points | Up to 2005 | ( |
| Johnson & Johnson | 30 million SAR points | Up to 2006 | ( |
a) This number includes external sources, up to 2012.
Figure 2. Illustration of applying HTS-FP to build multi-task learning models. A chemogenomic matrix represents the interactions between the compound collection and a panel of biological targets. Such a matrix is very often only sparsely filled with activities, and missing cells represent unknown activity for the compound/target pair. Combining machine learning with HTS-FP is one example of how unknown activities can be predicted.
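The idea in the caption can be made concrete with a small sketch. It is not the multi-task model of the figure: it swaps in a much simpler nearest-neighbour rule, and the matrix values and the function names `similarity` and `predict_missing` are hypothetical. Each row of the sparse matrix serves as the compound's HTS fingerprint, and an empty cell is filled with the outcome of the compound whose fingerprint agrees best on the assays both have been tested in.

```python
# Toy chemogenomic matrix: rows are compounds, columns are targets.
# 1 = active, 0 = inactive, None = not measured (values are made up).
MATRIX = [
    [1, 0, 1, None],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [None, 1, 0, 0],
]


def similarity(fp_a, fp_b, skip):
    """Fraction of agreeing outcomes over assays measured for both
    compounds, ignoring the column `skip` that is being predicted."""
    shared = [
        (a, b)
        for col, (a, b) in enumerate(zip(fp_a, fp_b))
        if col != skip and a is not None and b is not None
    ]
    if not shared:
        return 0.0
    return sum(a == b for a, b in shared) / len(shared)


def predict_missing(matrix, row, col):
    """Fill one empty cell with the outcome of the most similar compound
    that has a measurement for that target (None if there is none)."""
    candidates = [
        (similarity(matrix[row], other, col), other[col])
        for r, other in enumerate(matrix)
        if r != row and other[col] is not None
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]


print(predict_missing(MATRIX, 0, 3))  # compound 0 on target 3
print(predict_missing(MATRIX, 3, 0))  # compound 3 on target 0
```

A multi-task model would instead learn all target columns jointly from the fingerprints; the nearest-neighbour rule is used here only because it fits in a few lines.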
Figure 3. Typical neural network architecture for image classification. Alternating convolutional and max-pooling layers are followed by a number of fully connected layers and, finally, an output layer with either a sigmoid or a softmax function, depending on the task (Gawehn et al., 2016).
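Two of the building blocks named in the caption are easy to write down directly. The following is a minimal pure-Python sketch, not a full network: `max_pool_2x2`, `sigmoid`, and `softmax` are illustrative names, and the pooling assumes an input with even height and width. A sigmoid output suits binary or multi-label tasks, where each output is an independent probability, while softmax suits single-label multi-class tasks, where the outputs sum to one.

```python
import math


def max_pool_2x2(image):
    """Downsample a 2D list by taking the maximum of each 2x2 block
    (assumes even height and width)."""
    return [
        [
            max(image[r][c], image[r][c + 1],
                image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, len(image[0]), 2)
        ]
        for r in range(0, len(image), 2)
    ]


def sigmoid(x):
    """Map one logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Map a list of logits to class probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(max_pool_2x2(feature_map))  # [[6, 8], [14, 16]]
print(softmax([1.0, 2.0, 3.0]))   # three probabilities summing to 1
print(sigmoid(0.0))               # 0.5
```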
Performance comparison of traditional ML and DL in drug discovery.
| Ref. | Traditional ML performance | Deep learning performance |
|---|---|---|
| ( | RF: MCC = 0.89 | DNN: MCC = 0.91 |
| ( | RF: AUC = 0.78 | MT NN: AUC = 0.82 |
| ( | SVM: MCC = 0.50, BEDROC = 0.88 | DNN_MC: MCC = 0.57, BEDROC = 0.92 |
| | RF: MCC = 0.56, BEDROC = 0.82 | |
| ( | SVM: AUC = 0.71 | ST: AUC = 0.72 |
| | | MT: AUC = 0.75 |
| ( | RF: Pearson = 0.783 | GNN: Pearson = 0.822 |
| ( | LR: Acc = 0.86 (reaction prediction) | NN: Acc = 0.92 (reaction prediction) |
| | LR: Acc = 0.64 (retrosynthesis) | NN: Acc = 0.78 (retrosynthesis) |
| ( | SVM: AUC = 0.822 | GC: AUC = 0.829 |
| ( | SVM: AUC = 0.792 | Attentive FP: AUC = 0.832 |
| ( | RF: AUC = 0.619 | FFN: AUC = 0.788 |
| ( | RF: R² = 0.42 | DNN: R² = 0.49 |
| ( | RF: R² = 0.428 | ST: R² = 0.448 |
| | | MT: R² = 0.468 |
LR, ST, MT, GC, GNN, and FFN refer to Linear Regression, Single-Task, Multi-Task, Graph Convolution, Graph Neural Network, and Feedforward Neural Network, respectively. (1) Averaged performance on validation sets over 7 datasets. (2) Averaged performance on test sets over 19 datasets. (3) Performance on a test subset of the Tox21 dataset. (4) Performance on the HIV dataset. (5) Performance on the Tox21 dataset. (6) Averaged performance over 15 datasets. (7) Model performance on a test set.
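For readers less familiar with two of the classification metrics in the table, the sketch below computes the Matthews correlation coefficient (MCC) and the area under the ROC curve (AUC) from their standard definitions. The function names and the toy labels are illustrative assumptions and do not reproduce any of the cited studies.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_auc(y_true, scores):
    """AUC as the probability that a randomly chosen active is scored
    above a randomly chosen inactive (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
print(mcc(labels, [1, 0, 0, 0]))              # about 0.577
print(roc_auc(labels, [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

MCC uses all four cells of the confusion matrix, which makes it more informative than accuracy on the imbalanced active/inactive sets typical of screening data, while AUC scores the ranking of compounds independently of any decision threshold.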
Figure 4. Process of reaction prediction for an exemplary target molecule [lidocaine (Reilly, 2009)]. Machine-learning methods are applied first to predict the synthetic feasibility of the molecule and then to predict the chemical context leading to the best possible yield for the reaction.
Figure 5. Canonical (A) and randomized (B) SMILES representations of aspirin. Numbers represent the atom numbering assigned by the canonicalization algorithm (A) or at random (B). Green arrows indicate how the molecular graph is traversed. Both SMILES strings represent the same molecule but, because the atom numbering changes, the generated SMILES strings change too. Figure extracted with permission from Arús-Pous et al. (2019b).
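The dependence of the SMILES string on the atom numbering can be reproduced with a small depth-first writer. This is a sketch under stated assumptions: the aspirin graph is entered by hand in Kekulé form, `write_smiles` and `randomized_smiles` are hypothetical helpers that handle only single and double bonds and at most nine ring closures, and the fixed input numbering merely stands in for a canonical numbering (no canonicalization algorithm is implemented). A cheminformatics toolkit would normally be used for this.

```python
import random

# Aspirin in Kekule form: atom symbols and (atom, atom, bond order).
ATOMS = ["C", "C", "O", "O", "C", "C", "C", "C", "C", "C", "C", "O", "O"]
BONDS = [(0, 1, 1), (1, 2, 2), (1, 3, 1), (3, 4, 1), (4, 5, 2), (5, 6, 1),
         (6, 7, 2), (7, 8, 1), (8, 9, 2), (9, 4, 1), (9, 10, 1),
         (10, 11, 2), (10, 12, 1)]


def write_smiles(atoms, bonds, numbering):
    """Write SMILES by depth-first traversal; numbering[0] is the start
    atom and atoms earlier in `numbering` are visited first."""
    rank = {atom: pos for pos, atom in enumerate(numbering)}
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        neighbours[i].append((j, order))
        neighbours[j].append((i, order))
    for i in neighbours:
        neighbours[i].sort(key=lambda pair: rank[pair[0]])

    visited, closed = set(), set()
    children = {i: [] for i in neighbours}
    ring_marks = {i: "" for i in neighbours}

    def explore(atom, parent):
        # First pass: build the spanning tree and label ring closures.
        visited.add(atom)
        for other, order in neighbours[atom]:
            if other == parent:
                continue
            if other not in visited:
                children[atom].append((other, order))
                explore(other, atom)
            elif frozenset((atom, other)) not in closed:
                closed.add(frozenset((atom, other)))
                digit = str(len(closed))
                ring_marks[other] += digit
                ring_marks[atom] += ("=" if order == 2 else "") + digit

    def emit(atom):
        # Second pass: write atoms, ring digits, and bracketed branches.
        text = atoms[atom] + ring_marks[atom]
        for k, (child, order) in enumerate(children[atom]):
            piece = ("=" if order == 2 else "") + emit(child)
            is_last = k == len(children[atom]) - 1
            text += piece if is_last else "(" + piece + ")"
        return text

    explore(numbering[0], None)
    return emit(numbering[0])


def randomized_smiles(atoms, bonds, rng):
    """Shuffle the atom numbering, then write the SMILES string."""
    numbering = list(range(len(atoms)))
    rng.shuffle(numbering)
    return write_smiles(atoms, bonds, numbering)


print(write_smiles(ATOMS, BONDS, list(range(len(ATOMS)))))
# CC(=O)OC1=CC=CC=C1C(=O)O
rng = random.Random(0)
for _ in range(3):
    print(randomized_smiles(ATOMS, BONDS, rng))
```

Every string printed by the loop describes the same molecular graph; only the start atom and the order in which branches are taken differ.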
Figure 6. Sampling process of a pre-trained recurrent neural network. The generation process starts with a GO token, and at each step the model computes a probability distribution over all possible characters. The next character is then sampled from this distribution and fed back into the model to predict the following character. The internal memory in the long short-term memory (LSTM) cells enables the predictions to take previous characters into account when generating the next character.
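The sampling loop described in the caption can be sketched as follows. The trained LSTM is replaced here by a hypothetical hand-written lookup table, `TOY_MODEL`, that conditions only on the last character, whereas a real LSTM would carry an internal state summarizing the whole prefix. The token symbols `^` and `$`, the probabilities, and the function names are assumptions made for illustration.

```python
import random

GO, END = "^", "$"

# Hypothetical stand-in for a trained network: next-character
# probabilities given the previous character only.
TOY_MODEL = {
    "^": {"C": 0.8, "O": 0.2},
    "C": {"C": 0.5, "O": 0.2, "$": 0.3},
    "O": {"C": 0.6, "$": 0.4},
}


def next_char_distribution(history):
    """Return P(next character | history). An LSTM would use the whole
    history through its memory cells; this toy uses the last token."""
    return TOY_MODEL[history[-1]]


def sample_sequence(rng, max_len=20):
    """Start from GO, sample one character at a time, feed it back,
    and stop at the END token or after max_len characters."""
    history = [GO]
    while len(history) <= max_len:
        dist = next_char_distribution(history)
        chars = list(dist)
        weights = [dist[c] for c in chars]
        char = rng.choices(chars, weights=weights, k=1)[0]
        if char == END:
            break
        history.append(char)  # the sampled character becomes input
    return "".join(history[1:])


rng = random.Random(0)
for _ in range(3):
    print(sample_sequence(rng))
```

Because each character is drawn at random rather than chosen greedily, repeated calls yield different strings, which is what allows such a model to generate diverse sequences.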