Rafael Mamede¹, Florbela Pereira¹, João Aires-de-Sousa².
Abstract
Machine learning (ML) algorithms were explored for the classification of the UV-Vis absorption spectrum of organic molecules based on molecular descriptors and fingerprints generated from 2D chemical structures. Training and test data (~75 k molecules and associated UV-Vis data) were assembled from a database with lists of experimental absorption maxima. Molecules were labeled as positive (a class related to photoreactive potential) if an absorption maximum is reported between 290 and 700 nm (UV/Vis) with a molar extinction coefficient (MEC) above 1000 L mol⁻¹ cm⁻¹, and as negative if no such peak is in the list. Random forests were selected among several algorithms. The models were validated with two external test sets, each comprising 998 organic molecules, obtaining a global accuracy up to 0.89, sensitivity of 0.90 and specificity of 0.88. The ML output (UV-Vis spectrum class) was explored as a predictor of the 3T3 NRU phototoxicity in vitro assay for a set of 43 molecules. Comparable results were observed with the classification directly based on experimental UV-Vis data in the same format.
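The labeling rule described in the abstract can be sketched as a small function. This is only an illustration: the `(wavelength, MEC)` tuple layout and the function name are hypothetical, the window bounds are treated as inclusive, and "above 1000" is read as a strict inequality, which is an assumption.

```python
from typing import Iterable, Tuple

# Thresholds taken from the abstract.
LAMBDA_MIN_NM = 290.0
LAMBDA_MAX_NM = 700.0
MEC_THRESHOLD = 1000.0  # L mol^-1 cm^-1

# Hypothetical record layout: (wavelength in nm, molar extinction coefficient).
Peak = Tuple[float, float]


def uv_vis_class(peaks: Iterable[Peak]) -> str:
    """Return "POS" if any listed maximum lies in 290-700 nm with MEC above 1000."""
    for wavelength_nm, mec in peaks:
        in_window = LAMBDA_MIN_NM <= wavelength_nm <= LAMBDA_MAX_NM
        # Assumption: "above 1000" is read as strictly greater than.
        if in_window and mec > MEC_THRESHOLD:
            return "POS"
    return "NEG"


# Illustrative peak lists, not taken from the database.
print(uv_vis_class([(254.0, 12000.0), (310.0, 4500.0)]))  # one strong peak inside the window
print(uv_vis_class([(254.0, 12000.0)]))                   # only a peak below 290 nm
```

A molecule whose only peaks inside the window are weak, or whose strong peaks all fall outside it, ends up in the negative class under this reading.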
Year: 2021 PMID: 34887473 PMCID: PMC8660842 DOI: 10.1038/s41598-021-03070-9
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Distribution of UV–Vis absorption features in the data sets (MEC values in L mol⁻¹ cm⁻¹).
| | Training set | Test set I | Test set II |
|---|---|---|---|
| 1000 ≤ MEC ≤ 5000 | 21.3 | 22.4 | 19.5 |
| 5000 ≤ MEC < 10,000 | 24.0 | 23.3 | 25 |
| MEC ≥ 10,000 | 54.7 | 54.3 | 55.5 |
| λ < 290 nm, MEC < 1000 | 10.4 | 10.7 | 10.5 |
| λ < 290 nm, MEC ≥ 1000 | 91.1 | 88.9 | 91.6 |
| λ > 700 nm, MEC < 1000 | 0.005 | 0 | 0 |
| λ > 700 nm, MEC ≥ 1000 | 0.07 | 0.20 | 0 |
| 290 ≤ λ ≤ 700 nm, MEC ≤ 900 | 6.5 | 8.0 | 5.6 |
| 290 ≤ λ ≤ 700 nm, MEC > 900 | 0.23 | 0.4 | 0.21 |
ᵃ Statistics concerning the peak with the highest MEC within the 290–700 nm window; ᵇ statistics concerning any listed peak.
Hyper-parameter settings of the best dMLP model.
| Hyper-parameter | Setting |
|---|---|
| Initializer | Random normal |
| Number of hidden layers | 4 |
| Number of neurons in the 1st and 2nd layers | 250 |
| Number of neurons in the 3rd layer | 8 |
| Number of neurons in the 4th layer | 4 |
| Number of neurons in the 5th layer | 1 |
| Activation 1st–4th layers | ReLU |
| Activation 5th layer | Sigmoid |
| Batch size | 36 |
| Optimizer | Adam |
| Loss | Binary crossentropy |
| Epochs | 100 |
| Learning rate | 0.001 |
| Decay | 10⁻⁶ |
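To give a sense of the size of the network in the table above, the following sketch counts the weights and biases of a fully connected network with the listed layer widths. The input width of 250 is an assumption, taken from the "250 selected RDKitMorganFP molecular attributes" mentioned for the algorithm comparison; the function name is hypothetical.

```python
# Layer widths from the hyper-parameter table: four hidden layers plus the output neuron.
LAYER_WIDTHS = [250, 250, 8, 4, 1]
N_INPUTS = 250  # assumption: one input per selected fingerprint attribute


def count_parameters(n_inputs, widths):
    """Count weights and biases of a fully connected (dense) network."""
    total = 0
    fan_in = n_inputs
    for width in widths:
        total += fan_in * width + width  # weight matrix plus one bias per neuron
        fan_in = width
    return total


print(count_parameters(N_INPUTS, LAYER_WIDTHS))  # 127549 under the stated assumption
```

Under this assumption almost all parameters sit in the first two 250-neuron layers; the narrow 8- and 4-neuron layers contribute only about two thousand.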
Evaluation of different molecular descriptors and fingerprints for the prediction of UV–Vis spectrum class using the RF algorithm.
| Descriptors | Q Trᵃ | Qᵇ | SPᶜ | SEᵈ | MCCᵉ | TPᶠ | TNᵍ | FPʰ | FNⁱ |
|---|---|---|---|---|---|---|---|---|---|
| RDKitMorganFP | 0.88 | 0.88 | 0.88 | 0.88 | 0.76 | 441 | 435 | 62 | 60 |
| ExtCDK | 0.89 | 0.87 | 0.86 | 0.9 | 0.75 | 448 | 425 | 72 | 53 |
| CDK | 0.89 | 0.87 | 0.85 | 0.9 | 0.74 | 449 | 421 | 76 | 52 |
| MACCS | 0.85 | 0.87 | 0.85 | 0.89 | 0.74 | 444 | 422 | 75 | 57 |
| Md | 0.87 | 0.87 | 0.85 | 0.89 | 0.74 | 448 | 422 | 75 | 53 |
| PubChem | 0.88 | 0.86 | 0.85 | 0.88 | 0.73 | 443 | 420 | 77 | 58 |
| RDKitFP | 0.87 | 0.87 | 0.85 | 0.88 | 0.73 | 442 | 425 | 73 | 58 |
| 1D&2D | 0.86 | 0.85 | 0.83 | 0.87 | 0.7 | 434 | 415 | 83 | 66 |
| SubC | 0.85 | 0.84 | 0.81 | 0.86 | 0.67 | 430 | 404 | 93 | 71 |
| Sub | 0.8 | 0.8 | 0.77 | 0.83 | 0.61 | 420 | 382 | 115 | 81 |
| MLQD | 0.77 | 0.75 | 0.72 | 0.79 | 0.51 | 391 | 362 | 138 | 107 |
ᵃ Overall predictive accuracy for the training set in OOB estimation. ᵇ Overall predictive accuracy (test set I). ᶜ Specificity (test set I). ᵈ Sensitivity (test set I). ᵉ MCC, Matthews correlation coefficient (test set I). ᶠ True positives (test set I). ᵍ True negatives (test set I). ʰ False positives (test set I). ⁱ False negatives (test set I).
Evaluation of the performance of combined descriptors for the prediction of UV–Vis spectrum class using the RF algorithm.
| Descriptors | Q Trᵃ | Qᵇ | SPᶜ | SEᵈ | MCCᵉ | TPᶠ | TNᵍ | FPʰ | FNⁱ |
|---|---|---|---|---|---|---|---|---|---|
| ExtCDK + MLQD | 0.89 | 0.88 | 0.86 | 0.90 | 0.76 | 452 | 428 | 69 | 49 |
| ExtCDK + Md | 0.89 | 0.88 | 0.86 | 0.89 | 0.75 | 446 | 429 | 68 | 53 |
| RDKitMorganFP + Md | 0.88 | 0.87 | 0.85 | 0.88 | 0.73 | 441 | 424 | 73 | 60 |
| RDKitMorganFP + MLQD | 0.87 | 0.86 | 0.84 | 0.88 | 0.73 | 443 | 418 | 79 | 58 |
| 1D&2D + MLQD | 0.87 | 0.86 | 0.83 | 0.88 | 0.72 | 442 | 414 | 83 | 59 |
| Md + MLQD | 0.87 | 0.86 | 0.84 | 0.87 | 0.71 | 436 | 418 | 79 | 65 |
| ExtCDK + 1D&2D | 0.87 | 0.86 | 0.84 | 0.87 | 0.71 | 438 | 416 | 81 | 63 |
| Md + 1D&2D | 0.87 | 0.86 | 0.85 | 0.86 | 0.71 | 432 | 422 | 75 | 69 |
| RDKitMorganFP + 1D&2D | 0.86 | 0.85 | 0.83 | 0.86 | 0.70 | 434 | 413 | 84 | 68 |
ᵃ Overall predictive accuracy for the training set in OOB estimation. ᵇ Overall predictive accuracy (test set I). ᶜ Specificity (test set I). ᵈ Sensitivity (test set I). ᵉ MCC, Matthews correlation coefficient (test set I). ᶠ True positives (test set I). ᵍ True negatives (test set I). ʰ False positives (test set I). ⁱ False negatives (test set I).
Evaluation of alternative ML algorithms for the prediction of UV–Vis absorption spectrum class using 250 selected RDKitMorganFP molecular attributes.
| Model | Q Trᵃ | Qᵇ | SPᶜ | SEᵈ | MCCᵉ | TPᶠ | TNᵍ | FPʰ | FNⁱ |
|---|---|---|---|---|---|---|---|---|---|
| RF | 0.88 | 0.87 | 0.87 | 0.88 | 0.75 | 440 | 431 | 66 | 61 |
| dMLP | 0.94 | 0.82 | 0.82 | 0.82 | 0.65 | 412 | 409 | 88 | 89 |
| SVM | 0.87 | 0.84 | 0.82 | 0.85 | 0.68 | 428 | 410 | 87 | 75 |
ᵃ Overall predictive accuracy for the training set in OOB estimation. ᵇ Overall predictive accuracy (test set I). ᶜ Specificity (test set I). ᵈ Sensitivity (test set I). ᵉ MCC, Matthews correlation coefficient (test set I). ᶠ True positives (test set I). ᵍ True negatives (test set I). ʰ False positives (test set I). ⁱ False negatives (test set I).
Figure 1. Receiver operating characteristic (ROC) curve obtained for test set I with the RF model trained with 250 RDKitMorganFP attributes.
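For readers unfamiliar with how such a curve is built, the sketch below sweeps a decision threshold over classifier scores and integrates the resulting curve with the trapezoidal rule. The labels and scores are toy values, not data from the paper; the scores stand in for a per-molecule probability of the POS class, which is an assumption about the model output, and the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Sort by decreasing score so that each step lowers the threshold.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for index, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        is_last = index == len(ranked) - 1
        if is_last or ranked[index + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data, not from the paper: 1 = POS class, 0 = NEG class.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.95, 0.80, 0.70, 0.60, 0.30, 0.10]
print(auc(roc_points(toy_labels, toy_scores)))
```

Each point on the curve corresponds to one threshold; the single sensitivity and specificity pair reported in the tables is one such point.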
Evaluation of RF models trained with circular fingerprints to predict the UV–Vis absorption spectrum of organic molecules in test set II.
| Model | Qᵃ | SPᵇ | SEᶜ | MCCᵈ | TPᵉ | TNᶠ | FPᵍ | FNʰ |
|---|---|---|---|---|---|---|---|---|
| RDKitMorganFP | 0.89 | 0.88 | 0.90 | 0.78 | 454 | 432 | 60 | 52 |
| ExtCDK | 0.89 | 0.87 | 0.91 | 0.78 | 460 | 428 | 64 | 46 |
| CDK | 0.88 | 0.86 | 0.90 | 0.77 | 458 | 423 | 68 | 49 |
ᵃ Overall predictive accuracy. ᵇ Specificity. ᶜ Sensitivity. ᵈ MCC, Matthews correlation coefficient. ᵉ True positives. ᶠ True negatives. ᵍ False positives. ʰ False negatives.
Evaluation of classification trees based on interpretable fingerprints and molecular descriptors for the classification of UV–Vis absorption spectra of organic molecules in test set I.
| Model | Qᵃ | SPᵇ | SEᶜ | MCCᵈ | TPᵉ | TNᶠ | FPᵍ | FNʰ |
|---|---|---|---|---|---|---|---|---|
| 1D&2D | 0.73 | 0.71 | 0.76 | 0.47 | 380 | 353 | 146 | 119 |
| MLQD | 0.72 | 0.65 | 0.80 | 0.45 | 398 | 325 | 174 | 101 |
| MACCS | 0.71 | 0.67 | 0.75 | 0.42 | 373 | 335 | 164 | 126 |
| PubChem | 0.71 | 0.67 | 0.76 | 0.42 | 377 | 332 | 167 | 122 |
| Sub | 0.69 | 0.68 | 0.70 | 0.38 | 349 | 339 | 160 | 150 |
ᵃ Overall predictive accuracy (test set I). ᵇ Specificity (test set I). ᶜ Sensitivity (test set I). ᵈ MCC, Matthews correlation coefficient (test set I). ᵉ True positives (test set I). ᶠ True negatives (test set I). ᵍ False positives (test set I). ʰ False negatives (test set I).
Figure 2. Classification tree based on PubChem fingerprints for the discrimination of molecules of the POS/NEG classes related to potential photoreactivity. A665, C–C=C–C=C; A672, O=C–C=C–C; A601, N–C:C:C–C; A336, C(~C)(~C)(~C)(~H); A383, C(~O)(:C)(:C); A438, C(–C)(–N)(=C).
Figure 3. The chemical structures of four FN (1–4) predicted with high probability and their most similar training set counterpart structures (5–8). Experimental data are included as retrieved from the database.
Confusion matrices relating RF-predicted and experimental UV–Vis spectrum class with the 3T3 NRU phototoxicity in vitro assay.
| 3T3 NRU assay | Predicted UV–Vis spectrum class: POS | Predicted UV–Vis spectrum class: NEG | Experimental UV–Vis spectrum class: POS | Experimental UV–Vis spectrum class: NEG |
|---|---|---|---|---|
| Toxic | 16 | 3 | 15 | 4 |
| Non-toxic | 15 | 9 | 12 | 12 |
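One way to read these confusion matrices is to treat the POS spectrum class as a prediction of "toxic" and score it against the assay outcome. The sketch below does this for both halves of the table; the function name and the choice of summary statistics are illustrative assumptions, not the authors' analysis.

```python
def assay_agreement(toxic_pos, toxic_neg, nontoxic_pos, nontoxic_neg):
    """Score the UV-Vis class (POS read as 'toxic') against the 3T3 NRU outcome."""
    total = toxic_pos + toxic_neg + nontoxic_pos + nontoxic_neg
    return {
        "sensitivity": toxic_pos / (toxic_pos + toxic_neg),
        "specificity": nontoxic_neg / (nontoxic_pos + nontoxic_neg),
        "accuracy": (toxic_pos + nontoxic_neg) / total,
    }


# Counts from the table (43 molecules in each half).
predicted = assay_agreement(toxic_pos=16, toxic_neg=3, nontoxic_pos=15, nontoxic_neg=9)
experimental = assay_agreement(toxic_pos=15, toxic_neg=4, nontoxic_pos=12, nontoxic_neg=12)

for name, result in (("RF-predicted", predicted), ("experimental", experimental)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

With these counts, both classifications flag most of the 19 phototoxic molecules (16 and 15 respectively), while a large share of the 24 non-toxic molecules are also POS (15 and 12).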