Yunyi Wu, Guanyu Wang.
Abstract
Toxicity prediction is very important to public health. Among its many applications, toxicity prediction is essential for reducing the cost and labor of a drug's preclinical and clinical trials, because many drug evaluations (cellular, animal, and clinical) can be spared when toxicity is predicted in advance. In the era of Big Data and artificial intelligence, toxicity prediction can benefit from machine learning, which has been widely used with excellent performance in fields such as natural language processing, speech recognition, image recognition, computational chemistry, and bioinformatics. In this article, we review machine learning methods that have been applied to toxicity prediction, including deep learning, random forests, k-nearest neighbors, and support vector machines. We also discuss the input parameters to the machine learning algorithms, especially the shift from chemical structural descriptions alone to structural descriptions combined with human transcriptome data analysis, which can greatly enhance prediction accuracy.
Keywords: chemical structure; deep learning; machine learning; molecular fingerprint; molecular fragment; toxicity prediction; transcriptome
Year: 2018 PMID: 30103448 PMCID: PMC6121588 DOI: 10.3390/ijms19082358
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Chemical structural description of Sitaxentan and Sulfisoxazole. (a) The 166-bit molecular access system (MACCS) molecular fingerprints, where the differing values are indicated in yellow; (b) The undirected graphs with atoms as nodes and bonds as edges; (c) The molecular structures of Sitaxentan and Sulfisoxazole, where the cyan regions are their common molecular fragment identified by CNN training; (d) Other chemical properties.
The main types of traditional chemical descriptors [79].
| Descriptor Type | Descriptor Name | Description |
|---|---|---|
| Fingerprint-based | ECFP4 | atom type, extended connectivity fingerprint, maximum distance = 4 |
| | FCFP4 | functional-class-based, extended connectivity fingerprint, maximum distance = 4 |
| | MACCS | 166 predefined MDL keys (public set) |
| Connectivity-matrix-based | BCUT | atomic charges, polarizabilities, H-bond donor and acceptor abilities, and H-bonding modes of intermolecular interaction |
| Shape-based | rapid overlay of chemical structures (ROCS), combo Tanimoto (shape and electrostatic score) | shape-based molecular similarity method; molecules are described by smooth Gaussian function and pharmacophore points |
| | PMI | normalized principal moment-of-inertia ratios |
| Pharmacophore-based | GpiDAPH3 | graph-based 3-point pharmacophore, eight atom types computed from three atom properties (in pi system, donor, acceptor) |
| | TGD | typed graph distances, atom typing (donor, acceptor, polar, anion, cation, hydrophobe) |
| | TAD | typed atom distances, atom typing (donor, acceptor, polar, anion, cation, hydrophobe) |
| Bioactivity-based | Bayes affinity fingerprints | bioactivity model based on multicategory Bayes classifier trained on data from ChEMBL v. 14 |
| Physicochemical-property-based | prop2D | physicochemical properties (such as molecular weight, atom counts, partial charges, hydrophobicity, etc.) |
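Fingerprint-based descriptors such as MACCS and ECFP4 are usually compared with the Tanimoto coefficient, the ratio of shared on-bits to the union of on-bits. The following is a minimal sketch of that comparison, assuming each fingerprint is stored as a set of on-bit indices. The function name `tanimoto` and the two bit sets are hypothetical illustrations, not fingerprints of any real compound.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as on-bit index sets."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical on-bit indices within a 166-bit key set (illustrative values only).
fp_x = {3, 17, 42, 88, 120, 165}
fp_y = {3, 17, 42, 90, 120}

similarity = tanimoto(fp_x, fp_y)

# Bits set in exactly one of the two fingerprints, i.e. the positions that differ.
differing_bits = sorted(fp_x ^ fp_y)
```

Storing only the on-bits keeps the comparison short for sparse fingerprints; a fixed-length bit vector would give the same result.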
The mainstream data resources for chemical toxicity.
| Database | Database Description | Online Websites | Reference |
|---|---|---|---|
| TOXNET | A collection of toxicity databases. | | |
| ToxCast | High-throughput toxicity data on thousands of chemicals. | | |
| Tox21 | Chemical Effects in Biological Systems; Individual data and summaries from National Toxicology Program studies; The growth, survival, pathology and other toxicology data. | | |
| PubChem | Chemical structures; Identifiers; Chemical and physical properties; Biological activities; Toxicity data; Patents; Health and safety information. | | |
| DrugBank | Detailed drug data and corresponding drug target information. | | |
| ToxBank Data Warehouse | Data for systemic toxicity. | | |
| ECOTOX | Single chemical environmental toxicity data on aquatic life, terrestrial plants and wildlife. | | |
| SuperToxic | Toxic compound data from literature and web sources. | | |
Figure 2. Tox21 screening workflow in drug discovery (qHTS: quantitative high-throughput screening; NCGC: NIH Chemical Genomics Center) [105].
Comparison of area under the curve (AUC) scores among different combinations of molecular descriptors and machine learning models.
| Architecture | Molecular Descriptor | Model | AUC | Reference |
|---|---|---|---|---|
| Shallow architectures | Dragon descriptors (2489 descriptors) | RF | 0.81 | |
| | PubChem keys | SVM | 0.948 | |
| | MACCS fingerprints | RF | 0.947 | |
| Deep learning | Molecular fragments learned by CNN | DNN | 0.837 | |
| | Unidirectional graph learned by CNN | Graph CNN | 0.867 | |
| | LSTM graph | One-shot learning | 0.84 | |
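The AUC scores compared above can be read as the probability that a randomly chosen toxic compound receives a higher predicted score than a randomly chosen non-toxic one. The sketch below computes that quantity directly by pairwise comparison, counting ties as one half. The function name `roc_auc` and the labels and scores are made-up illustrations and are unrelated to the values in the table.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = toxic, 0 = non-toxic) and model scores, for illustration only.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.65, 0.3, 0.2, 0.65]
auc = roc_auc(labels, scores)
```

The pairwise loop is quadratic in the number of compounds, which is acceptable for a small illustration; a rank-based formulation would be preferable for large screening sets.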
Figure 3. An acute oral toxicity prediction. The prediction starts from a chemical molecular structure in the simplified molecular-input line-entry system (SMILES) format, which is the input to the MEG-CNN; the pink, purple, and cyan circles represent the first, second, and third iterations, respectively. During each iteration, the chemical structure is processed by the convolutional kernel according to the atom degree to obtain the corresponding pre-fingerprint. All of the pre-fingerprints are integrated to generate the fingerprint, which is further processed to generate the deep-mined fingerprint. The deep-mined fingerprint is then tested by the regression model (the blue circle) and the multiclass/multitask models (the green circles) [127].
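The iteration scheme in the Figure 3 caption can be illustrated with a deliberately simplified toy. It is not the MEG-CNN of [127]: it assumes one scalar feature per atom, a `tanh` nonlinearity, and hand-picked degree-specific weights, all of which are assumptions made for illustration. In each iteration, every atom pools its own feature with those of its bonded neighbors, the pooled value is scaled by a weight selected by atom degree, and the sum over atoms gives that iteration's pre-fingerprint. The pre-fingerprints are then added up.

```python
import math


def toy_graph_fingerprint(atom_features, bonds, degree_weights, n_iterations=3):
    """Toy degree-dependent graph convolution; returns (fingerprint, pre_fingerprints)."""
    n_atoms = len(atom_features)
    # Undirected graph: atoms are nodes, bonds are edges.
    neighbors = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    features = list(atom_features)
    pre_fingerprints = []
    for _ in range(n_iterations):
        updated = []
        for atom in range(n_atoms):
            # Pool the atom's own feature with its neighbors' features.
            pooled = features[atom] + sum(features[k] for k in neighbors[atom])
            # Select the weight by atom degree (number of bonded neighbors).
            weight = degree_weights[len(neighbors[atom])]
            updated.append(math.tanh(weight * pooled))
        features = updated
        # One pre-fingerprint per iteration: the sum over all atoms.
        pre_fingerprints.append(sum(features))
    # Integrate the pre-fingerprints into a single (scalar, in this toy) fingerprint.
    return sum(pre_fingerprints), pre_fingerprints


# Hypothetical three-atom chain (C-C-O heavy atoms) with made-up scalar features.
features = [0.6, 0.6, 0.8]
bonds = [(0, 1), (1, 2)]
degree_weights = {1: 0.5, 2: 0.3}  # assumed values, not learned parameters

fingerprint, pre_fps = toy_graph_fingerprint(features, bonds, degree_weights)
```

Because each iteration sums over atoms, the result does not depend on how the atoms are numbered, which is the property that makes graph-based fingerprints usable as model inputs.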
Figure 4. Toxicity prediction with gene expression data.
Databases of drug-induced gene expression.
| Database | Description | Websites | References |
|---|---|---|---|
| GEO database | Gene expression data of drug-treated samples in subsets. | | |
| Connectivity Map (CMap) | Genome-wide transcriptional expression data from cultured human cells treated with bioactive small molecules; Simple pattern-matching of functional connections between drugs, genes and diseases through the transitory feature of common gene-expression changes. | | |
| DSigDB | Drug and small molecule-related genes based on quantitative inhibition; Drug-induced gene expression changes data. | | |
| LINCS Canvas Browser (LCB) | Experiment data about the landmark gene expression changes in response to a drug; Both gene expression records before and after drug application. | | |
| Therapeutic target database (TTD) | Drug resistance mutations in drug-target genes; Drug resistance mutations in regulatory genes; Differential expression profiles of drug-targets in the disease-relevant drug-targeted tissues of different diseases; Expression profiles of drug-targets in the non-targeted tissues of healthy individuals; Target combinations of different drugs. | | |
| Comparative Toxicogenomics Database (CTD) | Cross-species chemical-gene/protein interactions data; Chemical- and gene-disease relationships. | | |
| Drug-Path | Drug-induced pathways. | | |
| CancerDR | Anticancer drugs and their effectiveness against cancer cell lines; Drug target gene information like function, structure, and gene sequences in respective cancer cell lines. | | |
| KEGG DRUG | Chemical structures and/or chemical components; The interaction network with target molecules, metabolizing enzymes, and other drugs; The chemical structure transformation network in the history of drug development. | | |
Figure 5. Toxicity prediction with RNA-seq data.