Maya Hirohara, Yutaka Saito, Yuki Koda, Kengo Sato, Yasubumi Sakakibara.
Abstract
BACKGROUND: Previous studies have suggested deep learning to be a highly effective approach for screening lead compounds for new drugs. Several deep learning models have been developed that use various kinds of fingerprints and graph convolution architectures. However, each of these methods has strengths and weaknesses, depending on whether it (1) can distinguish structural differences, including the chirality of compounds, and (2) can automatically discover effective features.
Keywords: Chemical compound; Convolutional neural network; SMILES; TOX 21 Challenge; Virtual screening
Year: 2018 PMID: 30598075 PMCID: PMC6311897 DOI: 10.1186/s12859-018-2523-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Chemical motif detection by CNN in comparison with sequence motif detection. a One-hot coding representation of four DNA nucleotides and a filter (kernel) with a one-dimensional convolution operation, which can be regarded as a position weight matrix representing a motif. b The same strategy applied to chemical compounds: a one-dimensional CNN is run over SMILES linear representations, and the learned filters are extracted to discover chemical motifs
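
The analogy in panel a can be made concrete with a minimal sketch in plain Python. It scans one filter over a one-hot DNA sequence and reads the per-window score as a position-weight-matrix match. The alphabet order, the example filter, and the function names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: a 1D convolution filter scanned over a one-hot sequence,
# read as a position weight matrix (PWM). Alphabet order, filter weights,
# and function names are illustrative assumptions.

ALPHABET = "ACGT"


def one_hot(seq):
    """Return one row per position and one column per nucleotide."""
    return [[1.0 if base == symbol else 0.0 for symbol in ALPHABET] for base in seq]


def conv1d(matrix, kernel):
    """Unpadded 1D convolution with stride 1 along the position axis."""
    width = len(kernel)
    scores = []
    for start in range(len(matrix) - width + 1):
        score = 0.0
        for offset in range(width):
            row = matrix[start + offset]
            score += sum(x * w for x, w in zip(row, kernel[offset]))
        scores.append(score)
    return scores


# Hypothetical filter that rewards the motif "TATA" (one row per motif position).
tata_filter = one_hot("TATA")

scores = conv1d(one_hot("GCTATAAG"), tata_filter)
best = max(range(len(scores)), key=scores.__getitem__)  # window start with the top score
```

Here the highest score marks the window that matches the motif exactly. A learned filter would hold real-valued weights instead of the 0/1 entries used above.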
Fig. 2 Overview of our CNN. The SMILES string of a compound is represented as a feature matrix. The CNN consists of two convolutional layers, each followed by a pooling layer, and a subsequent global pooling layer. Applied to the feature matrix, the CNN produces a low-dimensional feature representation (a 64-dimensional vector) termed the SCFP. Classification models are constructed by using the SCFP as input for subsequent fully connected layers
Features
| Feature | Description | Size |
|---|---|---|
| Atom (group total) | | 21 |
| Atom type | H, C, O, N, or others | 5 |
| NumHs | Total number of H atoms attached to it | 1 |
| Degree | Its degree of unsaturation | 1 |
| Charge | Its formal charge | 1 |
| Valence | Its total valence | 1 |
| Ring | Whether it is included in a ring | 1 |
| Aromaticity | Whether it is included in an aromatic structure | 1 |
| Chirality | R, S, or others | 3 |
| Hybridization | | 7 |
| SMILES original symbol (group total) | | 21 |
| ( | Branch start | 1 |
| ) | Branch end | 1 |
| [ | Atom or atom group start | 1 |
| ] | Atom or atom group end | 1 |
| . | Ionic bond | 1 |
| : | Aromatic bond | 1 |
| = | Double bond | 1 |
| # | Triple bond | 1 |
| \ | cis | 1 |
| / | trans | 1 |
| @ | Chirality (above or below) | 1 |
| + | Cation (positive ion) | 1 |
| - | Anion (negative ion) | 1 |
| Ion charge | Numbers show ionic charge (2-7) | 6 |
| Start | Numbers show ring start | 1 |
| End | Numbers show ring end | 1 |
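
As a rough illustration of the 21-slot "SMILES original symbol" group in the table above, the following sketch assigns one slot per symbol in table order. The slot order, the digit-handling rules, and every name are assumptions made for illustration. The atom-feature group is not computed here, and two-digit ring closures (`%nn`) are ignored.

```python
# Sketch of the 21-slot "SMILES original symbol" block from the table above.
# Slot order, digit-handling rules, and all names are assumptions made for
# illustration; the 21 atom-feature slots are not computed here.

SINGLE = ["(", ")", "[", "]", ".", ":", "=", "#", "\\", "/", "@", "+", "-"]
CHARGE_OFFSET = len(SINGLE)      # slots 13..18 hold ionic charge 2..7
RING_START = CHARGE_OFFSET + 6   # slot 19
RING_END = RING_START + 1        # slot 20
SIZE = RING_END + 1              # 21 slots in total


def encode_symbols(smiles):
    """Return (character, 21-slot vector) pairs for the non-atom symbols."""
    rows = []
    open_rings = set()
    in_bracket = False
    previous = ""
    for ch in smiles:
        vec = [0] * SIZE
        if ch in SINGLE:
            vec[SINGLE.index(ch)] = 1
            if ch == "[":
                in_bracket = True
            elif ch == "]":
                in_bracket = False
        elif ch.isdigit() and in_bracket:
            # Assumption: inside brackets, only a digit right after a sign is a charge.
            if previous in ("+", "-") and 2 <= int(ch) <= 7:
                vec[CHARGE_OFFSET + int(ch) - 2] = 1
        elif ch.isdigit():
            # Assumption: outside brackets, a digit opens a ring on first use
            # and closes it on the next use.
            if ch in open_rings:
                open_rings.remove(ch)
                vec[RING_END] = 1
            else:
                open_rings.add(ch)
                vec[RING_START] = 1
        if any(vec):
            rows.append((ch, vec))
        previous = ch
    return rows


benzene = encode_symbols("c1ccccc1")   # ring start, then ring end
iron_ion = encode_symbols("[Fe+2]")    # bracket, sign, charge, bracket
```

Atom characters are skipped because they would be described by the atom-feature group instead.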
Model hyperparameters
| Hyperparameter | Considered values |
|---|---|
| 1st convolution | |
| No. of filters | [1,1024] |
| Window size | [1,51] |
| Stride size | {1,3,5} |
| Padding | {None, Half of window size} |
| 1st pooling | |
| Type | {Max, Average} |
| Window size | [1,51] |
| Stride size | {1,3,5} |
| Padding | {None, Half of window size} |
| 2nd convolution | |
| No. of filters | [1,1024] |
| Window size | [1,51] |
| Stride size | {1,3,5} |
| Padding | {None, Half of window size} |
| 2nd pooling | |
| Type | {Max, Average} |
| Window size | [1,51] |
| Stride size | {1,3,5} |
| Padding | {None, Half of window size} |
| Global pooling | {None, Max pooling} |
| Output layer | {softmax, sigmoid} |
| Activation function | {ReLU, Leaky ReLU, Parametric ReLU} |
| Minibatch size | {32, 64, 128, 256, 512} |
| Batch normalization | {None, after conv.} |
| Dropout | {None, before output} |
| Optimizer | {Adam, AdaGrad} |
| Learning rate | {0.0001, 0.001, 0.01, 0.1} |
| Loss function | {Mean squared error, Cross entropy} |
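
The window, stride, and padding choices in the table above jointly determine how long the sequence axis is after each layer. The sketch below uses the standard output-length formula, assuming symmetric zero padding and floor division. The input length of 400 and the specific layer settings are hypothetical values picked from the listed ranges, not the settings reported by the authors.

```python
# Output length of a 1D convolution or pooling layer, assuming symmetric zero
# padding and floor division. The example input length and layer settings are
# hypothetical values chosen from the ranges in the table.


def output_length(length, window, stride, half_padding):
    """Return the number of output positions along the sequence axis."""
    padding = window // 2 if half_padding else 0
    if length + 2 * padding < window:
        return 0  # the window does not fit even once
    return (length + 2 * padding - window) // stride + 1


# (window, stride, half_padding) for conv1, pool1, conv2, pool2.
example_layers = [(11, 3, True), (5, 1, False), (11, 1, True), (5, 3, False)]

lengths = [400]  # hypothetical number of positions in the input feature matrix
for window, stride, half_padding in example_layers:
    lengths.append(output_length(lengths[-1], window, stride, half_padding))
```

A check of this kind can be used to discard hyperparameter combinations whose output length collapses to zero before training starts.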
Fig. 3 Detection of chemical motifs. Each dimension of SCFP is associated with the substructure of an input compound by tracing back through the CNN
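
One way to picture the trace-back is as receptive-field arithmetic. Given the window, stride, and padding of each layer, an output position maps back to a contiguous span of input positions. The sketch below returns that full span, which is an upper bound on the substructure. Following the argmax through each max-pooling window, as a real trace-back could, would give a narrower region. The layer settings and function name are assumptions for illustration.

```python
# Sketch: map one position in the last pooling layer back to the span of input
# positions that can influence it. Layer settings are hypothetical.


def trace_back(index, layers, input_length):
    """Return (first, last) input positions feeding output position `index`.

    `layers` lists (window, stride, padding) from the input side to the output side.
    """
    lo = hi = index
    for window, stride, padding in reversed(layers):
        lo = lo * stride - padding
        hi = hi * stride - padding + window - 1
    # Positions that fall in the zero padding are clipped away.
    return max(lo, 0), min(hi, input_length - 1)


# Hypothetical stack: convolution (window 3, stride 1, padding 1),
# then pooling (window 2, stride 2, no padding).
layers = [(3, 1, 1), (2, 2, 0)]

span = trace_back(4, layers, input_length=20)
edge = trace_back(0, layers, input_length=20)
```

Slicing the input SMILES positions with the returned span gives the candidate motif for that output position.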
TOX 21 assays
| Subdataset | qHTS assay target |
|---|---|
| NR-AR | Androgen receptor using the MDA cell line |
| NR-AR-LBD | Androgen receptor ligand binding domain |
| NR-ER | Estrogen receptor |
| NR-ER-LBD | Estrogen receptor ligand binding domain |
| NR-AhR | Aryl hydrocarbon receptor |
| NR-Aromatase | Aromatase enzyme |
| NR-PPAR-γ | Peroxisome proliferator-activated receptor γ |
| SR-ARE | Antioxidant response element |
| SR-ATAD5 | Luciferase-tagged ATAD5 in human embryonic kidney cells |
| SR-HSE | Heat shock response |
| SR-MMP | Mitochondrial membrane potential |
| SR-p53 | p53 response |
TOX 21 dataset
| Subdataset | Train (active) | Train (inactive) | Test (active) | Test (inactive) | Score (active) | Score (inactive) |
|---|---|---|---|---|---|---|
| NR-AR | 380 | 8982 | 3 | 289 | 12 | 574 |
| NR-AR-LBD | 303 | 8296 | 4 | 249 | 8 | 574 |
| NR-ER | 937 | 6760 | 27 | 238 | 51 | 465 |
| NR-ER-LBD | 446 | 8307 | 10 | 277 | 20 | 580 |
| NR-AhR | 950 | 7219 | 31 | 241 | 73 | 537 |
| NR-Aromatase | 360 | 6866 | 18 | 196 | 39 | 489 |
| NR-PPAR-γ | 222 | 7962 | 15 | 252 | 31 | 574 |
| SR-ARE | 1098 | 6069 | 48 | 186 | 93 | 462 |
| SR-ATAD5 | 338 | 8753 | 25 | 247 | 38 | 584 |
| SR-HSE | 248 | 7722 | 10 | 257 | 22 | 588 |
| SR-MMP | 1142 | 6178 | 38 | 200 | 60 | 483 |
| SR-p53 | 537 | 8097 | 28 | 241 | 41 | 575 |
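
The counts above show that active compounds are a small minority in every subdataset. As a small worked example, the sketch below computes the active fraction of the training split for three rows copied from the table. The helper name is an assumption.

```python
# Training-split counts copied from the TOX 21 dataset table: (active, inactive).
TRAIN_COUNTS = {
    "NR-AR": (380, 8982),
    "SR-ARE": (1098, 6069),
    "SR-MMP": (1142, 6178),
}


def active_fraction(active, inactive):
    """Share of compounds labelled active."""
    return active / (active + inactive)


fractions = {name: active_fraction(*counts) for name, counts in TRAIN_COUNTS.items()}
```

For these rows the active share ranges from roughly 4% (NR-AR) to roughly 16% (SR-MMP). This imbalance is one reason a ranking metric such as ROC-AUC is more informative here than plain accuracy.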
Summary of training statistics
| Subdataset | Time (s/epoch) | Memory (MiB) | Convergence (epoch) |
|---|---|---|---|
| NR-AR | 121.7 | 6551 | 15 |
| NR-AR-LBD | 12.9 | 6459 | 19 |
| NR-ER | 36.0 | 2763 | 17 |
| NR-ER-LBD | 37.7 | 2309 | 25 |
| NR-AhR | 13.3 | 1475 | 33 |
| NR-Aromatase | 15.2 | 6317 | 20 |
| NR-PPAR-γ | 2.7 | 4413 | 23 |
| SR-ARE | 16.7 | 1615 | 18 |
| SR-ATAD5 | 74.7 | 4581 | 21 |
| SR-HSE | 49.0 | 3047 | 15 |
| SR-MMP | 40.3 | 3427 | 9 |
| SR-p53 | 8.3 | 1211 | 11 |
The computation time was measured on a GPU server with an NVIDIA Tesla P100 SXM2 16GB
Fig. 4 ROC-AUC of our model compared with those reported by previous studies. (Left) ROC-AUC averaged over the 12 subdatasets, compared between our model (blue) and previous studies (gray). (Right) ROC-AUC of our model for each subdataset
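
For readers unfamiliar with the metric, ROC-AUC equals the probability that a randomly chosen active compound receives a higher score than a randomly chosen inactive one, with ties counted as one half. The sketch below computes it directly from that definition in plain Python. The labels and scores are made-up toy values, not data from the study.

```python
# ROC-AUC from its pairwise definition: the fraction of (active, inactive)
# pairs in which the active compound gets the higher score; ties count 0.5.
# Labels and scores below are toy values.


def roc_auc(labels, scores):
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one active and one inactive compound")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 5 of 6 pairs are ranked correctly
```

The quadratic pair loop is fine for illustration. A rank-based implementation would be preferable for datasets of the size listed in the tables.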
Comparison of our CNN and DeepTox (the winning model of the TOX 21 Challenge 2014)
| Input | Model | Ave. | AR | AR-LBD | ER | ER-LBD | AhR | Aromatase | PPAR-γ | ARE | ATAD5 | HSE | MMP | p53 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SMILES Matrix | CNN | 0.813 | 0.789 | 0.793 | 0.776 | 0.765 | 0.905 | 0.786 | 0.791 | 0.754 | 0.803 | 0.835 | 0.928 | 0.832 |
| ECFP | DNN | 0.768 | 0.850 | 0.690 | 0.840 | 0.760 | 0.660 | 0.720 | 0.700 | 0.730 | 0.860 | 0.810 | 0.820 | 0.780 |
| ECFP+DeepTox | DNN | 0.837 | 0.778 | 0.825 | 0.791 | 0.811 | 0.923 | 0.804 | 0.856 | 0.829 | 0.775 | 0.863 | 0.930 | 0.860 |
| ECFP+DeepTox | SVM | 0.832 | 0.882 | 0.748 | 0.799 | 0.798 | 0.919 | 0.819 | 0.856 | 0.818 | 0.781 | 0.848 | 0.946 | 0.854 |
| ECFP+DeepTox | RF | 0.820 | 0.776 | 0.812 | 0.770 | 0.746 | 0.917 | 0.806 | 0.827 | 0.810 | 0.786 | 0.826 | 0.945 | 0.835 |
| ECFP+DeepTox | ElNet | 0.803 | 0.788 | 0.692 | 0.765 | 0.805 | 0.897 | 0.763 | 0.805 | 0.778 | 0.768 | 0.844 | 0.924 | 0.818 |
Our CNN takes SMILES feature matrices as input, while DeepTox uses ECFP and its original features
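
The "Ave." column can be checked against the per-assay values. The sketch below recomputes the mean for the first two rows of the table. The numbers are copied from the table, and the helper name is an assumption.

```python
# Recompute the "Ave." column from the 12 per-assay ROC-AUC values.
# Values are copied from the comparison table: (reported average, per-assay values).
ROWS = {
    "SMILES Matrix / CNN": (0.813, [0.789, 0.793, 0.776, 0.765, 0.905, 0.786,
                                    0.791, 0.754, 0.803, 0.835, 0.928, 0.832]),
    "ECFP / DNN": (0.768, [0.850, 0.690, 0.840, 0.760, 0.660, 0.720,
                           0.700, 0.730, 0.860, 0.810, 0.820, 0.780]),
}


def mean(values):
    return sum(values) / len(values)


recomputed = {name: mean(values) for name, (_, values) in ROWS.items()}
```

Both recomputed means agree with the reported averages to three decimal places.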
Fig. 5 Chemical space analysis of the SR-MMP subdataset. SCFP (a) and ECFP (b) computed for all compounds in the dataset were plotted by MDS
Fig. 6 Examples of learned filters and chemical motifs for the NR-AR subdataset. a Filter 61 and corresponding chemical motifs on different compounds. b Filter 0 and corresponding chemical motifs on different compounds. c Filter 2 and corresponding chemical motifs on different compounds
Fig. 7 Filters representing similar chemical motifs. Each filter represents a similar but slightly different chemical motif