Daniel Fernández-Llaneza, Silas Ulander, Dea Gogishvili, Eva Nittinger, Hongtao Zhao, Christian Tyrchan.
Abstract
Activity prediction plays an essential role in drug discovery by directing the search for drug candidates in the relevant chemical space. Despite being applied successfully to image recognition and semantic similarity, the Siamese neural network has rarely been explored in drug discovery, where modelling faces challenges such as insufficient data and class imbalance. Here, we present a Siamese recurrent neural network model (SiameseCHEM) based on a bidirectional long short-term memory architecture with a self-attention mechanism, which can automatically learn discriminative features from the SMILES representations of small molecules. Subsequently, it is used to categorize the bioactivity of small molecules via N-shot learning. Trained on random SMILES strings, it proves robust across five different datasets for the task of binary or categorical classification of bioactivity. Benchmarked against two baseline machine learning models that use the chemistry-rich ECFP fingerprints as input, the deep learning model outperforms them on three datasets and achieves comparable performance on the other two. The failure of both baseline methods on SMILES strings highlights that the deep learning model may learn task-specific chemistry features encoded in SMILES strings.
Year: 2021 PMID: 34056263 PMCID: PMC8153912 DOI: 10.1021/acsomega.1c01266
Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343
Figure 1. Siamese Recurrent Neural Network architecture.
Chemical Similarity of the Paired Compounds from the Five Datasets
| dataset | compounds (clusters) | training pairs | validation pairs | mean Tanimoto, training pairs | mean Tanimoto, validation pairs |
|---|---|---|---|---|---|
| BACE1 | 20,450 (2490) | 14,788 | 2954 | 0.14 (0.13) | 0.14 (0.13) |
| CCR5 | 4998 (506) | 3104 | 616 | 0.20 (0.13) | 0.22 (0.17) |
| DRD2 | 106,341 (43,493) | 98,282 | 19,570 | 0.16 (0.14) | 0.16 (0.14) |
| EGFR | 11,364 (2134) | 5932 | 1184 | 0.21 (0.20) | 0.21 (0.20) |
| NR1H2 | 2712 (821) | 1956 | 382 | 0.19 (0.16) | 0.20 (0.16) |
Number of compounds and clusters (in brackets).
Mean Tanimoto coefficient of paired compounds in the subset having similar or dissimilar (in brackets) bioactivity.
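The Tanimoto coefficient reported above is the ratio of shared to total "on" bits between two fingerprints. A minimal sketch in plain Python follows, assuming each fingerprint is represented as a set of on-bit indices. The bit sets are made-up illustrations rather than data from the paper, and the fingerprint type used for the similarity analysis is not stated in this excerpt.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def mean_tanimoto(pairs) -> float:
    """Mean coefficient over an iterable of (fingerprint, fingerprint) pairs."""
    scores = [tanimoto(a, b) for a, b in pairs]
    return sum(scores) / len(scores)


# Hypothetical fingerprints (illustrative only)
fp_1 = {1, 4, 9, 16, 25}
fp_2 = {4, 9, 36, 49}
score = tanimoto(fp_1, fp_2)  # 2 shared bits out of 7 distinct bits
```

Low mean values such as those in the table indicate that most paired compounds share few fingerprint bits, whether or not their bioactivity is similar.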
Figure 2. Distribution of Tanimoto coefficients of paired compounds from the dataset NR1H2. (Top) the training set and (bottom) the validation set.
Figure 3. Performance of the deep model SiameseCHEM on the validation set with data augmentation. The coefficients are the mean value from the last five epochs, and the error bar depicts the standard deviation.
Performance on the Validation Set with a Threshold of 5 or 7 in pXC50
| dataset | threshold | MCC | recall | precision |
|---|---|---|---|---|
| BACE1 | 5 | 0.77 | 0.92 | 0.84 |
| BACE1 | 7 | 0.55 | 0.61 | 0.87 |
| CCR5 | 5 | 0.83 | 0.93 | 0.91 |
| CCR5 | 7 | 0.61 | 0.83 | 0.75 |
| DRD2 | 5 | 0.78 | 0.94 | 0.84 |
| DRD2 | 7 | 0.56 | 0.90 | 0.59 |
| EGFR | 5 | 0.67 | 0.82 | 0.86 |
| EGFR | 7 | 0.38 | 0.48 | 0.79 |
| NR1H2 | 5 | 0.74 | 0.89 | 0.84 |
| NR1H2 | 7 | 0.43 | 0.79 | 0.54 |
The numbers are the mean values from the last five epochs.
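The three metrics in this table all derive from the same confusion counts. A minimal sketch of how they relate is given below, assuming binary labels (1 = positive class, 0 = negative class). The label vectors are invented for illustration, and returning 0.0 when a denominator vanishes is a convention chosen here, not something stated in the source.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for two equal-length sequences of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def precision_recall_mcc(y_true, y_pred):
    """Precision, recall and Matthews correlation coefficient (MCC)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


# Hypothetical labels (illustrative only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, mcc = precision_recall_mcc(y_true, y_pred)
```

Because MCC uses all four confusion counts, it can drop sharply when one class becomes rare, even if recall or precision alone stays high.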
Performance Comparison with Baseline Methods on the Validation Set
| dataset | threshold | SiameseCHEM | MLP (ECFP6) | RF (SMILES) | RF (ECFP6) | SVM (SMILES) | SVM (ECFP6) |
|---|---|---|---|---|---|---|---|
| BACE1 | 5 | 0.77 | 0.73 | 0.20 | 0.60 | –0.13 | 0.09 |
| BACE1 | 7 | 0.55 | 0.44 | –0.10 | 0.11 | –0..9 | –0.06 |
| CCR5 | 5 | 0.83 | 0.83 | 0.51 | 0.80 | –0.13 | 0.49 |
| CCR5 | 7 | 0.61 | 0.52 | 0.07 | 0.34 | –0.03 | –0.01 |
| DRD2 | 5 | 0.78 | 0.23 | –0.10 | –0.13 | –0.04 | –0.03 |
| DRD2 | 7 | 0.56 | –0.01 | –0.12 | –0.08 | –0.02 | –0.11 |
| EGFR | 5 | 0.67 | 0.71 | 0.41 | 0.67 | 0.25 | 0.52 |
| EGFR | 7 | 0.38 | 0.022 | –0.06 | 0.06 | –0.04 | –0.06 |
| NR1H2 | 5 | 0.74 | 0.60 | 0.14 | 0.55 | –0.15 | –0.05 |
| NR1H2 | 7 | 0.43 | 0.04 | 0.0 | –0.10 | 0.0 | 0.0 |
MCCs averaged from the 10-fold stratified cross-validation.
Figure 4. Performance of binary classification via N-shot learning with regard to the number of reference compounds in the support set (N). The error bars indicate the 95% CI bounds, evaluated by the 10 repeated predictions each with a different support set.
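One way to picture N-shot binary classification is as a vote over the support set: each labelled reference compound is compared with the query, and the pairwise "same activity class" score decides which label that reference supports. The sketch below is only an illustration under stated assumptions. `toy_similarity` is a hypothetical stand-in for a trained pair model, the SMILES strings and labels are invented, and averaging the votes is an assumed aggregation rule, since this excerpt does not say how the predictions are combined.

```python
def toy_similarity(smiles_a: str, smiles_b: str) -> float:
    """Hypothetical stand-in for a trained pair model; returns a score in [0, 1].

    Uses the Jaccard overlap of character sets purely for illustration.
    """
    a, b = set(smiles_a), set(smiles_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def n_shot_predict(query, support, similarity=toy_similarity):
    """Predict a binary label for `query` from N labelled reference compounds.

    Each reference votes for its own label with weight p (the predicted
    probability that query and reference share an activity class) and for
    the opposite label with weight 1 - p. Averaging is an assumption.
    """
    if not support:
        raise ValueError("support set must not be empty")
    active_score = 0.0
    for ref_smiles, ref_label in support:
        p_same = similarity(query, ref_smiles)
        active_score += p_same if ref_label == 1 else 1.0 - p_same
    active_score /= len(support)
    return (1 if active_score >= 0.5 else 0), active_score


# Hypothetical support set with N = 4 (1 = active, 0 = inactive)
support_set = [("CCO", 1), ("CCN", 1), ("c1ccccc1", 0), ("c1ccncc1", 0)]
label, score = n_shot_predict("CCCO", support_set)
```

Repeating the prediction with different randomly drawn support sets, as in the figure, gives a spread of scores from which a confidence interval can be estimated.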
Performance of the Categorical Classification via N-Shot Learning
| dataset | N | κ | τ |
|---|---|---|---|
| BACE1 | 32 | 0.69 | 0.74 |
| CCR5 | 16 | 0.68 | 0.73 |
| DRD2 | 32 | 0.79 | 0.83 |
| EGFR | 32 | 0.56 | 0.61 |
| NR1H2 | 8 | 0.57 | 0.65 |
The number of reference compounds in the support set.
Cohen’s weighted kappa (κ) measures the degree of absolute agreement between the ground truth and predictions, with the value ranging from −1 to 1. It treats all misclassifications equally.
Kendall’s correlation coefficient (τ) measures the ordinal association between the ground truth and predictions, with the value ranging from −1 to 1. It penalizes ordinal misclassification more heavily than the kappa statistic.
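The difference between the two statistics can be made concrete with a small pure-Python sketch. It computes the unweighted form of kappa, in which every disagreement counts the same, and the tau-b variant of Kendall's coefficient, which corrects for ties. Which variants the authors used is not stated in this excerpt, so both choices are assumptions, and the category labels are invented for illustration.

```python
import math
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa: chance-corrected exact agreement."""
    n = len(y_true)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


def kendall_tau_b(x, y):
    """Kendall's tau-b: rank agreement over all pairs, corrected for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both rankings: pair is ignored
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    total = concordant + discordant
    denom = math.sqrt((total + ties_x) * (total + ties_y))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical ordinal activity categories (0 < 1 < 2)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
kappa = cohen_kappa(y_true, y_pred)
tau = kendall_tau_b(y_true, y_pred)
```

In this toy case the single prediction that is off by two categories affects kappa no more than the one that is off by one, whereas it turns several pairs discordant and so lowers tau.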
Nonadditivity Analysis of the Five Datasets
| nonadditivity metrics | BACE1 | CCR5 | DRD2 | EGFR | NR1H2 |
|---|---|---|---|---|---|
| estimated uncertainty | 0.55 | 0.28 | 0.10 | 0.48 | 0.58 |
| % Cpds outside 95% CI | 6.71 | 6.66 | 0.17 | 4.81 | 2.77 |
| mispredictions | 493 | 102 | 441 | 265 | 82 |
| mispredictions with an outlier | 23 | 2 | 0 | 10 | 1 |
Number of total pairs whose similarity labels were wrongly predicted by the deep model SiameseCHEM.
Number of wrongly predicted pairs having at least one compound outside the 95% CI from the nonadditivity analysis.
Figure 5. Additivity shift per compound for the DRD2 dataset for illustration of the results summarized in Table . Shown is the average additivity shift per compound and the standard deviation of the shift. Black lines indicate the 95% CI for a perfectly additive dataset with an experimental uncertainty of σ = 0.1 log unit.