Yaqin Li, Yongjin Xu, Yi Yu.
Abstract
Molecular latent representations derived from autoencoders (AEs) have been widely used for drug and material discovery over the past few years. In particular, a variety of machine learning methods based on latent representations have shown excellent performance in quantitative structure-activity relationship (QSAR) modeling. However, the sequential nature of these representations has not been considered in most cases. In addition, data scarcity remains the main obstacle for deep learning strategies, especially for bioactivity datasets. In this study, we propose the convolutional recurrent neural network and transfer learning (CRNNTL) method, inspired by applications in polyphonic sound detection and electrocardiogram classification. Our model combines convolutional and recurrent neural networks for feature extraction with a data augmentation method. In QSAR modeling on 27 datasets, CRNNTL outperforms or is competitive with state-of-the-art methods for both drug and material properties. In addition, the results on an isomer-based dataset indicate that its excellent performance stems from an improved ability to extract global features while the ability to extract local features is maintained. The transfer learning results further show that CRNNTL can overcome data scarcity when relevant source datasets are chosen. Finally, the versatility of our model is demonstrated by using latent representations from other types of AEs as inputs.
Keywords: CNN; deep learning; QSAR; RNN; molecular autoencoders; transfer learning
Year: 2021 PMID: 34885843 PMCID: PMC8658888 DOI: 10.3390/molecules26237257
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. The architecture of the CRNN and the transfer learning method between large and small datasets.
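
The figure itself is not reproduced here, so the following is only a minimal sketch of the general idea of a convolutional recurrent network applied to a latent vector: a 1D convolution extracts local features, and a GRU then reads those features as a sequence. All sizes (`n_filters`, `kernel_size`, `hidden`), the single-channel input, the omission of bias terms in the GRU, and the random untrained weights are assumptions made for illustration, not the authors' architecture.

```python
import math
import random

def conv1d_relu(x, kernels, bias):
    """Valid 1D convolution of a single-channel sequence, followed by ReLU.
    Returns one list of channel activations per time step."""
    k = len(kernels[0])
    out = []
    for t in range(len(x) - k + 1):
        window = x[t:t + k]
        out.append([max(0.0, sum(w * v for w, v in zip(kern, window)) + b)
                    for kern, b in zip(kernels, bias)])
    return out

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def gru_step(x, h, p):
    """One GRU update (bias terms omitted for brevity)."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    n = [math.tanh(a + b) for a, b in zip(matvec(p["Wn"], x), matvec(p["Un"], rh))]
    return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def crnn_forward(latent, n_filters=4, kernel_size=3, hidden=5, seed=0):
    """Hypothetical forward pass: conv features -> GRU -> linear read-out."""
    rng = random.Random(seed)
    kernels = rand_matrix(n_filters, kernel_size, rng)
    bias = [0.0] * n_filters
    # W* matrices act on the conv features, U* matrices on the hidden state.
    params = {name: rand_matrix(hidden, n_filters if name[0] == "W" else hidden, rng)
              for name in ("Wz", "Uz", "Wr", "Ur", "Wn", "Un")}
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    features = conv1d_relu(latent, kernels, bias)
    h = [0.0] * hidden
    for step in features:  # the GRU consumes the conv output step by step
        h = gru_step(step, h, params)
    return sum(w * hi for w, hi in zip(w_out, h)), len(features)

# Toy stand-in for a latent representation of length 16 (not real data).
latent = [math.sin(0.3 * i) for i in range(16)]
prediction, n_steps = crnn_forward(latent)
```

In a transfer learning setting, the weights learned on a large source dataset would presumably initialize the same network before fine-tuning on the small target dataset; that step is not shown.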
Overview of the optimization settings.
| Settings | CNN | GRU |
|---|---|---|
| activation function | (Tanh, ReLU) | (Sigmoid, ReLU) |
| learning rate | (0.001, 0.0005, 0.0001) | (0.001, 0.0005, 0.0001) |
| number of layers | (3–5) | (1, 2) |
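
The settings in this table can be enumerated as a small search space. The sketch below assumes an exhaustive grid (the source does not say how the space was searched), reads the first CNN activation as Tanh, and interprets the layer range 3–5 as the integers 3, 4, and 5. The dictionary keys and the helper `enumerate_configs` are hypothetical names.

```python
from itertools import product

# Search spaces transcribed from the table; key names are illustrative.
cnn_space = {
    "activation": ["tanh", "relu"],
    "learning_rate": [0.001, 0.0005, 0.0001],
    "num_layers": [3, 4, 5],
}
gru_space = {
    "activation": ["sigmoid", "relu"],
    "learning_rate": [0.001, 0.0005, 0.0001],
    "num_layers": [1, 2],
}

def enumerate_configs(space):
    """Return every combination of the listed settings as a dict."""
    keys = list(space)
    return [dict(zip(keys, values))
            for values in product(*(space[k] for k in keys))]

cnn_configs = enumerate_configs(cnn_space)  # 2 * 3 * 3 combinations
gru_configs = enumerate_configs(gru_space)  # 2 * 3 * 2 combinations
```

Under these assumptions the CNN part has 18 candidate configurations and the GRU part has 12.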
Coefficient of determination (r²) for regression datasets of drug properties.
| Dataset a | CNN | CRNN | AugCRNN | SVM | RF b |
|---|---|---|---|---|---|
| EGFR | 0.67 | 0.70 | | 0.70 | 0.69 |
| EAR3 | 0.64 | 0.68 | | 0.65 | 0.53 |
| AUR3 | 0.55 | 0.57 | | 0.60 | 0.54 |
| FGFR1 | 0.63 | 0.68 | | 0.71 | 0.68 |
| MTOR | 0.64 | 0.68 | | | 0.66 |
| PI3 | 0.43 | 0.47 | 0.50 | | 0.45 |
| LogS | 0.91 | 0.92 | | 0.92 | 0.90 |
| Lipo | 0.63 | 0.67 | 0.70 | | 0.66 |
| BP | 0.95 | 0.96 | | 0.96 | 0.93 |
| MP | 0.47 | 0.46 | | 0.46 | 0.45 |
The standard mean errors are shown in Supplementary Table S2. Bold texts represent the best performance. a The information in detail for each dataset is summarized in Materials and Methods. b Calculated with ECFP representation.
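
For reference, the coefficient of determination reported in this table can be computed as follows. This is a minimal sketch assuming the common definition r² = 1 − SS_res / SS_tot; the source does not state which definition was used (some QSAR studies report the squared Pearson correlation instead), and the toy values are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values, not taken from any dataset in the paper.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
score = r_squared(y_true, y_pred)  # approximately 0.98
```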
The area under the receiver operating characteristic curve (ROC-AUC) for classification datasets of drug properties.
| Dataset a | CNN | CRNN | AugCRNN | SVM | RF b |
|---|---|---|---|---|---|
| HIV | 0.80 | 0.82 | | 0.76 | 0.78 |
| AMES | 0.86 | 0.87 | 0.88 | | |
| BACE | 0.88 | 0.89 | 0.90 | 0.90 | |
| HERG | 0.83 | 0.84 | | | 0.85 |
| BBBP | 0.88 | 0.89 | 0.91 | | 0.89 |
| BEETOX | 0.89 | 0.91 | | | 0.90 |
| JAK3 | 0.72 | 0.74 | | 0.76 | 0.76 |
| BioDeg | 0.75 | 0.77 | | 0.74 | 0.73 |
| TOX21 | 0.75 | 0.77 | | 0.74 | 0.73 |
| SIDER | 0.68 | 0.70 | | 0.70 | 0.68 |
The standard mean errors are shown in Supplementary Table S3. Bold texts represent the best performance. a The information in detail for each dataset is summarized in Materials and Methods. b Calculated with ECFP representation.
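
The ROC-AUC metric used in this table equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition, with ties counted as one half, is given below; the labels and scores are made-up toy values, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))

# Toy values, not taken from any dataset in the paper.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]
auc = roc_auc(labels, scores)  # 5 of 6 pairs are ranked correctly
```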
Coefficient of determination (r²) for regression datasets of material properties.
| Dataset a | CNN | CRNN | AugCRNN | SVM | RF b |
|---|---|---|---|---|---|
| Absmax | 0.75 | 0.87 | | 0.89 | 0.88 |
| Emmax | 0.72 | 0.82 | | | 0.81 |
| Logε | 0.52 | 0.73 | | 0.73 | 0.73 |
| σabs | 0.44 | 0.56 | | 0.52 | 0.55 |
| lifetime | 0.45 | 0.59 | | 0.59 | 0.58 |
| ET1 | 0.49 | 0.48 | | 0.48 | |
| PCE | 0.42 | 0.43 | | 0.42 | 0.43 |
The standard mean errors are shown in Supplementary Table S4. Bold texts represent the best performance. a The first five datasets represent the absorption peak position (Absmax), emission peak position (Emmax), extinction coefficient in logarithm (Logε), bandwidth in full width at half maximum (σabs), and the molecular lifetime, respectively; the sixth one is the triplet state energy (ET1) for the TADF molecules; the last one is the power conversion efficiency (PCE) from HOPV15. b Calculated with ECFP representation.
Names, SMILES, molecular structures, and melting points of isomers.
| Name | SMILES | Molecular Structure | Melting Point (°C) |
|---|---|---|---|
| 2-Hydroxypropanamide | CC(O)C(N)=O | | 78 |
| Alanine | CC(N)C(=O)O | | 292 a |
| 1,3-Dimethoxypropane | COCCCOC | | −82 |
| 1,5-Pentanediol | OCCCCCO | | −16 |
| Methyl benzoate | COC(=O)c1ccccc1 | | −12 |
| Phenylacetic acid | O=C(O)Cc1ccccc1 | | 77 |
a Alanine decomposes before melting; the value given here is its decomposition temperature.
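
To make the isomer comparison concrete, the sketch below checks that adjacent rows of the table share the same heavy-atom composition even though their melting points differ widely. Grouping the rows into three adjacent pairs is an assumption based on their order in the table. The helper `heavy_atom_counts` is a deliberately naive character counter that only handles the C, N, and O atoms appearing in these six SMILES strings; it ignores hydrogens and is not a general SMILES parser, so equal counts are a necessary but not sufficient condition for isomerism.

```python
import re
from collections import Counter

# (name, SMILES, melting point in °C), transcribed from the table.
isomer_pairs = [
    (("2-Hydroxypropanamide", "CC(O)C(N)=O", 78),
     ("Alanine", "CC(N)C(=O)O", 292)),  # 292 is a decomposition temperature
    (("1,3-Dimethoxypropane", "COCCCOC", -82),
     ("1,5-Pentanediol", "OCCCCCO", -16)),
    (("Methyl benzoate", "COC(=O)c1ccccc1", -12),
     ("Phenylacetic acid", "O=C(O)Cc1ccccc1", 77)),
]

def heavy_atom_counts(smiles):
    """Count C, N, and O atoms; lowercase (aromatic) symbols count too."""
    return Counter(ch.upper() for ch in re.findall(r"[CNOcno]", smiles))

comparison = []
for (name_a, smi_a, mp_a), (name_b, smi_b, mp_b) in isomer_pairs:
    same_atoms = heavy_atom_counts(smi_a) == heavy_atom_counts(smi_b)
    comparison.append((name_a, name_b, same_atoms, abs(mp_a - mp_b)))
```

Each pair contains the same atoms arranged differently, so a model has to capture how the atoms are connected across the whole molecule, not just which atoms are present, to separate the melting points.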
Figure 2. Transfer learning results for PI3 and AUR3 as target datasets and FGFR1, MTOR, and EGFR as source datasets, with learning from scratch as the baseline for evaluating the improvement.
Binding site similarities between different targets. SMAP p-values quantify the similarity: the lower the SMAP p-value, the more similar the binding sites of the two targets.
| Targets | PI3 | AUR3 |
|---|---|---|
| FGFR1 | 1.3 × 10⁻⁴ | 2.1 × 10⁻⁵ |
| MTOR | 7.8 × 10⁻⁶ | 9.4 × 10⁻³ |
| EGFR | 5.2 × 10⁻³ | 8.6 × 10⁻³ |
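
Since a lower SMAP p-value indicates a more similar binding site, the table can be used to rank candidate source datasets for each target. The sketch below does only that ranking; treating the lowest p-value as the preferred source is an illustrative rule, not a procedure stated in the source, and `rank_sources` is a hypothetical helper name.

```python
# SMAP p-values transcribed from the table: source -> target -> p-value.
smap_p_values = {
    "FGFR1": {"PI3": 1.3e-4, "AUR3": 2.1e-5},
    "MTOR": {"PI3": 7.8e-6, "AUR3": 9.4e-3},
    "EGFR": {"PI3": 5.2e-3, "AUR3": 8.6e-3},
}

def rank_sources(target):
    """Order source datasets from most to least similar to the target."""
    return sorted(smap_p_values, key=lambda src: smap_p_values[src][target])

pi3_ranking = rank_sources("PI3")
aur3_ranking = rank_sources("AUR3")
```

With the values above, MTOR ranks first for PI3 and FGFR1 ranks first for AUR3, while EGFR is not the closest source for either target.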
Overview of the datasets of drug properties (left for regression and right for classification).
| Acronym | Description | Size | Acronym | Description | Size |
|---|---|---|---|---|---|
| EGFR | Epidermal growth factor inhibition | 4113 | HIV | Inhibition of HIV replication | 41101 |
| EAR3 | Ephrin type-A receptor 3 | 587 | AMES | Mutagenicity | 6130 |
| AUR3 | Aurora kinase C | 1001 | BACE | Human β-secretase 1 inhibitors | 1483 |
| FGFR1 | Fibroblast growth factor receptor | 4177 | HERG | HERG inhibition | 3440 |
| MTOR | Rapamycin target protein | 6995 | BBBP | Blood–brain barrier penetration | 1879 |
| PI3 | PI3-kinase p110-gamma | 2995 | BEETOX | Toxicity in honeybees | 188 |
| LogS | Aqueous solubility | 1144 | JAK3 | Janus kinase 3 inhibitor | 868 |
| Lipo | Lipophilicity | 3817 | BioDeg | Biodegradability | 1698 |
| BP | Boiling point | 12451 | TOX21 | In-vitro toxicity | 7785 |
| MP | Melting point | 283 | SIDER | Side Effect Resource | 1412 |
Overview of the datasets of material properties.
| Acronym | Description | Size |
|---|---|---|
| Absmax | absorption peak position | 6433 |
| Emmax | emission peak position | 6412 |
| Logε | extinction coefficient in logarithm | 3848 |
| σabs | bandwidth in full width at half maximum | 1606 |
| lifetime | molecular lifetime | 2755 |
| ET1 | triplet state energy | 60 |
| PCE | power conversion efficiency | 249 |