Sangsoo Lim, Sangseon Lee, Yinhua Piao, MinGyu Choi, Dongmin Bang, Jeonghyeon Gu, Sun Kim.
Abstract
A large number of chemical compounds are available in databases such as PubChem and ZINC. However, the currently known compounds, though numerous, represent only a fraction of all possible compounds, the totality of which is known as chemical space. Many of the compounds in these databases are annotated with properties and assay data that can be used for drug discovery efforts. For this goal, a number of machine learning algorithms have been developed, and recent deep learning technologies can be effectively used to navigate chemical space, especially for unknown chemical compounds, in terms of drug-related tasks. In this article, we survey how deep learning technologies can model and utilize chemical compound information in a task-oriented way by exploiting annotated properties and assay data in chemical compound databases. We first compile the tasks that machine learning methods aim to accomplish. Then, we survey deep learning technologies to show their modeling power and current applications for accomplishing drug-related tasks. Next, we survey deep learning techniques that address the insufficiency of annotated data for more effective navigation of chemical space. Chemical compound information alone may not be powerful enough for drug-related tasks, so we survey what kinds of information, such as assay and gene expression data, can be used to improve the predictive power of deep learning models. Finally, we conclude this survey with four important newly developed technologies that are yet to be fully incorporated into computational analysis of chemical information.
Keywords: Chemical information modeling; Chemical space; Computer-aided drug discovery; Data augmentation; Deep learning
Year: 2022 PMID: 36051875 PMCID: PMC9399946 DOI: 10.1016/j.csbj.2022.07.049
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1. Overview of the present review on building a chemical space using deep learning methods. Section 2 introduces benchmark tasks in drug discovery. Section 3 introduces several deep learning methods together with selected state-of-the-art approaches. Section 4 discusses how to build an improved representation space in terms of self-supervised, generative, and mixup methods. Finally, Section 5 introduces features beyond structural formats, such as gene expression and physicochemical properties, that can provide additional information.
Chemical tasks in drug discovery. Data are taken from [12]. (Binary: binary classification; Reg: regression.)
| Task | Dataset | Size | ML Type | Reference |
|---|---|---|---|---|
| Absorption | Caco-2 (Cell Effective Permeability) | 910 | Reg | |
| | HIA (Human Intestinal Absorption) | 578 | Binary | |
| | Pgp (P-glycoprotein) inhibition | 1,218 | Binary | |
| | Bioavailability | 640 | Binary | |
| | Lipophilicity | 4,200 | Reg | |
| | Solubility | 9,982 | Reg | |
| | Hydration Free Energy | 642 | Reg | |
| | Subtotal | 16,558 | | |
| Distribution | BBBP (Blood–Brain Barrier Permeability) | 1,975 | Binary | |
| | PPBR (Plasma Protein Binding Rate) | 1,797 | Reg | |
| | VDss (Volume of Distribution at steady state) | 1,130 | Reg | |
| | Subtotal | 4,678 | | |
| Metabolism | CYP P450 2C19 Inhibition | 12,665 | Binary | |
| | CYP P450 2D6 Inhibition | 13,130 | Binary | |
| | CYP P450 3A4 Inhibition | 12,328 | Binary | |
| | CYP P450 1A2 Inhibition | 12,579 | Binary | |
| | CYP P450 2C9 Inhibition | 12,092 | Binary | |
| | CYP2C9 Substrate | 666 | Binary | |
| | CYP2D6 Substrate | 664 | Binary | |
| | CYP3A4 Substrate | 667 | Binary | |
| | Subtotal | 16,877 | | |
| Excretion | Half Life | 667 | Reg | |
| | Clearance (microsome) | 1,102 | Reg | |
| | Clearance (hepatocyte) | 1,020 | Reg | |
| | Subtotal | 1,592 | | |
| Toxicity | LD50 | 7,385 | Reg | |
| | hERG blockers | 648 | Binary | |
| | hERG Central | 306,893 | Binary/Reg | |
| | Ames Mutagenicity | 7,255 | Binary | |
| | DILI (Drug-Induced Liver Injury) | 475 | Binary | |
| | Skin reaction | 404 | Binary | |
| | Carcinogens | 278 | Binary | |
| | Tox21 | 7,831 | Binary | |
| | ToxCast | 8,576 | Binary | |
| | ClinTox | 1,484 | Binary | |
| | Subtotal | 327,133 | | |
| Total | | 349,036 | | |
Chemical tasks in toxicity. We report ROC-AUC scores for the Tox21 dataset. In the “Data” column, ‘S’ refers to SMILES string, ‘G’ to molecular graph, ‘F’ to molecular fingerprint, and ‘P’ to molecular properties. In the “Model” column, ‘M’ marks results reported from MoleculeNet; ‘T’ refers to transformer-based methods, ‘G’ to graph-based methods, ‘R’ to RNN-based methods, ‘C’ to CNN-based methods, and ‘S’ to shallow embedding methods.
| Type | Name | Performance | Data | Model | Year | Ref |
|---|---|---|---|---|---|---|
| Machine Learning in | Logistic Regression | 0.781 | S | M | 2018 | |
| | IRV | 0.796 | S | M | 2018 | |
| | XGBoost | 0.815 | S | M | 2018 | |
| Deep Learning in | Weave | 0.807 | G | M | 2018 | |
| | TextCNN | 0.838 | S | M | 2018 | |
| | GraphConv | 0.850 | G | M | 2018 | |
| Deep Learning | ChemBERTa | 0.728 | S | T | 2020 | |
| | MICRO-GRAPH | 0.770 | G | T | 2020 | |
| | MoCL | 0.780 | G | G | 2021 | |
| | MolCLR | 0.789 | G | T | 2021 | |
| | SMILES2Vec | 0.810 | S | RC | 2018 | |
| | KCL | 0.813 | G | T | 2021 | |
| | Transformer-CNN | 0.820 | S | CT | 2020 | |
| | DMPNN | 0.850 | G | G | 2019 | |
| | PotentialNet | 0.857 | G | G | 2018 | |
| | FraGAT | 0.860 | G | G | 2021 | |
| | TrimNet | 0.860 | G | G | 2021 | |
| | Mol2Context-Vec | 0.860 | FP | R | 2020 | |
| | CMPNN | 0.860 | G | G | 2020 | |
| | MPAD | 0.860 | SG | SG | 2020 | |
| | GAP | 0.880 | G | G | 2019 | |
| | FP2Vec | 0.880 | F | CT | 2019 | |
| | SA-MTL | 0.900 | SP | CT | 2021 | |
| | TOP | 0.950 | SP | R | 2020 | |
IRV: Influence Relevance Voter.
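The ROC-AUC metric reported in the table can be computed from predicted scores with the rank-sum (Mann-Whitney U) formulation. The following is a minimal sketch, not the evaluation code used by any of the listed methods. The function names `roc_auc` and `mean_roc_auc` are hypothetical, and averaging over tasks while skipping missing labels (encoded here as `None`) is an assumption about how a multi-task dataset such as Tox21 might be scored.

```python
from typing import List, Optional, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC via the rank-sum (Mann-Whitney U) formulation.

    Tied scores receive the average of the ranks they occupy.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Sort sample indices by score, then assign 1-based average ranks.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_stat = pos_rank_sum - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


Task = Tuple[Sequence[Optional[int]], Sequence[float]]


def mean_roc_auc(tasks: List[Task]) -> float:
    """Average ROC-AUC over tasks, ignoring samples whose label is None.

    Assumption: missing assay labels are encoded as None.
    """
    aucs = []
    for labels, scores in tasks:
        kept = [(y, s) for y, s in zip(labels, scores) if y is not None]
        aucs.append(roc_auc([y for y, _ in kept], [s for _, s in kept]))
    return sum(aucs) / len(aucs)


if __name__ == "__main__":
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation of positives from negatives, which is the scale on which the table's values should be read.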
Fig. 2. Approaches to address lack of data.
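Mixup is one of the approaches this review names for coping with scarce annotated data. As a rough illustration, the sketch below applies the standard mixup rule, a convex combination of two samples and of their labels with a weight drawn from a Beta(α, α) distribution, to plain feature vectors. The function name `mixup_pair`, the default `alpha=0.2`, and the choice of fixed-length vectors (for example fingerprints or learned embeddings) are assumptions for illustration; this section does not state which representation the surveyed methods mix.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mixup_pair(
    x_i: Sequence[float],
    y_i: Sequence[float],
    x_j: Sequence[float],
    y_j: Sequence[float],
    alpha: float = 0.2,  # assumed value; a tunable hyperparameter
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], List[float], float]:
    """Return a mixed sample, its mixed label, and the mixing weight.

    x_mix = lam * x_i + (1 - lam) * x_j
    y_mix = lam * y_i + (1 - lam) * y_j, with lam ~ Beta(alpha, alpha)
    """
    if len(x_i) != len(x_j) or len(y_i) != len(y_j):
        raise ValueError("paired inputs must have matching lengths")
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam


if __name__ == "__main__":
    # Two toy bit vectors with one-hot labels (hypothetical data).
    x_mix, y_mix, lam = mixup_pair(
        [1.0, 0.0, 1.0], [1.0, 0.0],
        [0.0, 1.0, 1.0], [0.0, 1.0],
        rng=random.Random(0),
    )
    print(lam, x_mix, y_mix)
```

Because the labels are mixed with the same weight as the inputs, one-hot labels become soft labels that still sum to one.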
Graph learning methods for building a chemical space.
| Data level | Year | Self-supervised framework: predictive | Self-supervised framework: contrastive |
|---|---|---|---|
| Node/Edge-level | | EdgePred | Infomax |
| | 2020 | GPT-GNN | GRACE |
| | 2021 | MGSSL | GraphCL |
| | 2022 | | |
| Subgraph-level | | | |
| | 2020 | Grover | GCC |
| | 2021 | MGSSL | GraphCL |
| | 2022 | | |
| Graph-level | | | |
| | 2020 | | InfoGraph |
| | 2021 | | GraphLoG |
| | 2022 | D-SLA | KCL |
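Contrastive self-supervised methods generally pull together embeddings of two views of the same molecule and push apart embeddings of different molecules. The sketch below shows a simplified, one-directional InfoNCE-style loss over a batch of paired embeddings. It is an illustration of the general idea, not the exact objective of any method in the table; the function name `info_nce_loss`, the cosine similarity, and the temperature of 0.5 are assumptions.

```python
import math
from typing import Sequence


def _cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def info_nce_loss(
    view_a: Sequence[Sequence[float]],
    view_b: Sequence[Sequence[float]],
    temperature: float = 0.5,  # assumed value; a tunable hyperparameter
) -> float:
    """Mean one-directional InfoNCE loss over a batch.

    view_a[i] and view_b[i] embed two views of the same molecule (the
    positive pair); view_b[j] for j != i serve as negatives for anchor i.
    """
    n = len(view_a)
    if n != len(view_b) or n < 2:
        raise ValueError("need two equally sized batches of at least 2 items")
    total = 0.0
    for i in range(n):
        logits = [_cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / n


if __name__ == "__main__":
    anchors = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce_loss(anchors, aligned))  # low: positives match
    print(info_nce_loss(anchors, swapped))  # high: positives mismatched
```

The loss is small when each anchor is most similar to its own positive and grows when a negative is more similar, which is the signal that shapes the learned representation space.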