| Literature DB >> 33372637 |
Gabriel Idakwo, Sundar Thangapandian, Joseph Luttrell, Yan Li, Nan Wang, Zhaoxian Zhou, Huixiao Hong, Bei Yang, Chaoyang Zhang, Ping Gong.
Abstract
The specificity of toxicant-target biomolecule interactions lends to the very imbalanced nature of many toxicity datasets, causing poor performance in Structure-Activity Relationship (SAR)-based chemical classification. Undersampling and oversampling are representative techniques for handling such an imbalance challenge. However, removing inactive chemical compound instances from the majority class using an undersampling technique can result in information loss, whereas increasing active toxicant instances in the minority class by interpolation tends to introduce artificial minority instances that often cross into the majority class space, giving rise to class overlapping and a higher false prediction rate. In this study, in order to improve the prediction accuracy of imbalanced learning, we employed SMOTEENN, a combination of Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, to oversample the minority class by creating synthetic samples, followed by cleaning the mislabeled instances. We chose the highly imbalanced Tox21 dataset, which consisted of 12 in vitro bioassays for > 10,000 chemicals that were distributed unevenly between binary classes. With Random Forest (RF) as the base classifier and bagging as the ensemble strategy, we applied four hybrid learning methods, i.e., RF without imbalance handling (RF), RF with Random Undersampling (RUS), RF with SMOTE (SMO), and RF with SMOTEENN (SMN). The performance of the four learning methods was compared using nine evaluation metrics, among which F1 score, Matthews correlation coefficient and Brier score provided a more consistent assessment of the overall performance across the 12 datasets. The Friedman's aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that SMN significantly outperformed the other three methods. 
We also found that a strong negative correlation existed between the prediction accuracy and the imbalance ratio (IR), which is defined as the number of inactive compounds divided by the number of active compounds. SMN became less effective when IR exceeded a certain threshold (e.g., > 28). The ability to separate the few active compounds from the vast amounts of inactive ones is of great importance in computational toxicology. This work demonstrates that the performance of SAR-based, imbalanced chemical toxicity classification can be significantly improved through the use of data rebalancing.
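The two-step rebalancing described in the abstract (SMOTE interpolation followed by ENN cleaning) can be illustrated with a minimal, pure-Python sketch. This is not the authors' implementation (the study applied library implementations of SMOTE and ENN to molecular fingerprints); the toy 2-D clusters, function names, and parameters below are illustrative assumptions only.

```python
import random
from collections import Counter

def _dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def smote(minority, n_new, k=3, seed=0):
    """SMOTE step: synthesize n_new minority samples by interpolating
    between a minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        base = rng.choice(minority)
        pool = sorted((m for m in minority if m != base),
                      key=lambda m: _dist(base, m))[:k]
        lam = rng.random()  # interpolation factor in [0, 1)
        nb = rng.choice(pool)
        out.append(tuple(b + lam * (n - b) for b, n in zip(base, nb)))
    return out

def enn_clean(samples, labels, k=3):
    """ENN step: drop every sample whose label disagrees with the majority
    label of its k nearest neighbors (removes class-overlap noise)."""
    kept = []
    for i, (s, lab) in enumerate(zip(samples, labels)):
        others = [(t, l) for j, (t, l) in enumerate(zip(samples, labels)) if j != i]
        nn = sorted(others, key=lambda o: _dist(s, o[0]))[:k]
        majority = Counter(l for _, l in nn).most_common(1)[0][0]
        if majority == lab:
            kept.append((s, lab))
    return [s for s, _ in kept], [l for _, l in kept]

# Toy data: a tight "inactive" cluster, a small "active" cluster,
# and one mislabeled "active" point sitting inside the inactive cluster.
inactive = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.1), (0.2, 0.3)]
active = [(5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
samples = inactive + active + [(0.15, 0.15)]
labels = ["inactive"] * 6 + ["active"] * 4

samples += smote(active, n_new=3)  # oversample the minority class
labels += ["active"] * 3
cleaned_x, cleaned_y = enn_clean(samples, labels)  # ENN drops the mislabeled point
```

With this toy data the synthetic actives stay inside the active cluster and survive cleaning, while the mislabeled point inside the inactive cluster is removed, which is exactly the overlap-cleaning behavior the abstract attributes to SMOTEENN.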
Keywords: Bootstrap aggregation (bagging); Chemical classification; Class distribution imbalance; Edited nearest neighbor (ENN); Ensemble learning; Molecular fingerprints; Random forest (RF); Random undersampling (RUS); Resampling; Structure–activity relationship (SAR); Synthetic minority over-sampling technique (SMOTE)
Year: 2020 PMID: 33372637 PMCID: PMC7592558 DOI: 10.1186/s13321-020-00468-x
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Workflow of structure–activity relationship (SAR)-based chemical classification with imbalanced data processing designed for this study
Class distribution and imbalance ratio (IR) of the preprocessed training and test chemical datasets from Tox21 Data Challenge
| In vitro qHTS assay ID | Total number of chemicals | Training set: Inactive | Training set: Active | Training set: IR | Test set: Inactive | Test set: Active | Test set: IR |
|---|---|---|---|---|---|---|---|
| NR-AR | 6436 | 5698 | 166 | 34.3 | 560 | 12 | 46.7 |
| NR-AR-LBD | 5931 | 5223 | 143 | 36.5 | 557 | 8 | **69.6** |
| NR-AhR | 5596 | 4445 | 561 | 7.9 | 520 | 70 | 7.4 |
| NR-Aromatase | 4901 | 4193 | 193 | 21.7 | 478 | 37 | 12.9 |
| NR-ER | 5171 | 4167 | 500 | 8.3 | 455 | 49 | 9.3 |
| NR-ER-LBD | 6043 | 5239 | 221 | 23.7 | 563 | 20 | 28.2 |
| NR-PPAR-γ | 5712 | 5005 | 120 | **41.7** | 558 | 29 | 19.2 |
| SR-ARE | 4808 | 3669 | 603 | 6.1 | 448 | 88 | **5.1** |
| SR-ATAD5 | 6320 | 5515 | 203 | 27.2 | 568 | 34 | 16.7 |
| SR-HSE | 5529 | 4733 | 206 | 23.0 | 573 | 17 | 33.7 |
| SR-MMP | 4955 | 3763 | 666 | **5.7** | 472 | 54 | 8.7 |
| SR-p53 | 6009 | 5110 | 303 | 16.9 | 558 | 38 | 14.7 |
The highest and lowest IRs for the training and test sets are in bold
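The IR column above is simply the inactive-to-active count ratio defined in the abstract; a quick sanity check against the NR-AR row of the table:

```python
def imbalance_ratio(n_inactive, n_active):
    """Imbalance ratio as defined in the paper: inactive count / active count."""
    return n_inactive / n_active

# NR-AR counts from the table (training set: 5698 inactive, 166 active;
# test set: 560 inactive, 12 active)
train_ir = round(imbalance_ratio(5698, 166), 1)  # reported as 34.3
test_ir = round(imbalance_ratio(560, 12), 1)     # reported as 46.7
```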
Fig. 2 A spot check of six popular machine learning algorithms: performance of classifiers trained using the preprocessed Tox21 training datasets as evaluated using F1 score
Fig. 3 The relationship between model performance and the number of classifiers in the RF base classifier
Fig. 4 Performance metrics of SMN models measured as the number of nearest neighbors (k) varied in the ENN
Nine metrics for evaluating the performance of four classification methods (RF, RUS, SMO and SMN) with twelve Tox21 qHTS assay datasets
| Metrics | Classifier | NR-AR | NR-AR-LBD | NR-AhR | NR-Aromatase | NR-ER | NR-ER-LBD | NR-PPAR-γ | SR-ARE | SR-ATAD5 | SR-HSE | SR-MMP | SR-p53 | Mean | CVa (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| F1 score | RF | 0.1538 | 0.0000 | 0.4340 | 0.2326 | 0.2727 | 0.2400 | 0.0606 | 0.3359 | 0.2500 | 0.5106 | 0.1364 | 0.2397 | 60 | |
| | RUS | 0.1176 | 0.4507 | 0.2222 | 0.2605 | 0.1849 | 0.4185 | 0.2063 | 0.1058 | 0.2527 | 0.2815 | 53 |
| | SMO | 0.0000 | 0.3883 | 0.1905 | 0.3692 | 0.2857 | 0.1765 | 0.2927 | 0.2439 | 0.1905 | 0.3902 | 0.1395 | 0.2431 | 47 |
| | SMN | 0.1951 | 0.1111 | 0.3929 | 0.2400 | 0.5850 |
| MCC | RF | − 0.0050 | 0.4101 | 0.3202 | 0.2726 | 0.2891 | 0.0767 | 0.2770 | 0.4701 | 0.1801 | 0.2647 | 49 | |||
| | RUS | 0.1056 | 0.4209 | 0.1914 | 0.1816 | 0.1908 | 0.2950 | 0.2049 | 0.1190 | 0.2769 | 0.2568 | 53 |
| | SMO | 0.2805 | − 0.0071 | 0.3669 | 0.2792 | 0.3990 | 0.3018 | 0.2355 | 0.2498 | 0.3091 | 0.2327 | 0.3662 | 0.2019 | 0.2679 |
| | SMN | 0.1886 | 0.0975 | 0.3627 | 0.3261 | 0.2226 | 0.5492 | 42 |
| AUROC | RF | 0.7963 | 0.9063 | 0.7356 | 0.7601 | 0.6963 | 0.6640 | 0.7867 | 0.7827 | 0.7610 | 0.9194 | 0.7443 | 0.7813 | 10 | |
| | RUS | 0.6785 | 0.8852 | 0.7627 | 0.7174 | 0.7619 | 0.7698 | 0.7791 | 0.7065 | 0.9295 | 0.8168 | 0.7929 | 10 |
| | SMO | 0.7780 | 0.7509 | 0.8936 | 0.8112 | 0.7296 | 0.8072 | 0.7872 | 0.7714 | 0.7983 | 0.8893 | 0.8510 | 0.8069 |
| | SMN | 0.6810 | 0.7969 | 0.7713 | 0.8093 | 8 |
| AUPRC | RF | 0.0565 | 0.2825 | 0.3203 | 0.1887 | 0.1120 | 0.4224 | 0.2881 | 0.1608 | 0.1881 | 0.2933 | 57 | |||
| | RUS | 0.1444 | 0.4836 | 0.2043 | 0.2420 | 0.1545 | 0.4140 | 0.2423 | 0.0622 | 0.5237 | 0.2295 | 0.2762 | 59 |
| | SMO | 0.3290 | 0.0821 | 0.5065 | 0.3504 | 0.3895 | 0.2806 | 0.4052 | 0.4928 | 0.2913 | 0.3273 |
| | SMN | 0.0685 | 0.0639 | 0.5660 | 0.2018 | 0.3736 | 0.2422 | 0.1134 | 0.5234 | 60 |
| Balanced accuracy (BA) | RF | 0.5417 | 0.4991 | 0.6518 | 0.5665 | 0.5830 | 0.5732 | 0.5146 | 0.6016 | 0.5726 | 0.5847 | 0.7053 | 0.5368 | 0.5776 | 10 |
| | RUS | 0.5929 | 0.8129 | 0.6828 | 0.6513 | 0.6977 | 0.7085 | 11 |
| | SMO | 0.5815 | 0.4982 | 0.6304 | 0.5530 | 0.6181 | 0.5964 | 0.5499 | 0.5833 | 0.5718 | 0.5571 | 0.6354 | 0.5377 | 0.5761 |
| | SMN | 0.5544 | 0.6858 | 0.6753 | 0.7018 | 0.6529 | 0.8452 | 0.6812 | 13 |
| Precision | RF | 0.0000 | 0.5294 | 0.2500 | 0.5116 | 0.4286 | 0.5000 | 48 | |||||||
| | RUS | 0.0769 | 0.1250 | 0.2991 | 0.1302 | 0.1604 | 0.1111 | 0.3200 | 0.2869 | 0.1193 | 0.0576 | 0.4583 | 0.1464 | 0.1909 | 64 |
| | SMO | 0.5000 | 0.0000 | 0.6061 | 0.8000 | 0.5000 | 0.5143 | 0.7143 | 0.5714 | 0.5547 |
| | SMN | 0.1379 | 0.1000 | 0.4775 | 0.5294 | 0.5849 | 0.3333 | 0.4074 | 0.2963 | 0.1818 | 0.4624 | 0.4545 | 0.3784 | 44 |
| Recall or Sensitivity | RF | 0.0833 | 0.0000 | 0.3286 | 0.1351 | 0.1837 | 0.1500 | 0.0345 | 0.2500 | 0.1471 | 0.1765 | 0.4444 | 0.0789 | 0.1677 | 75 |
| | RUS | 0.2500 | 0.7727 |
| | SMO | 0.1667 | 0.0000 | 0.2857 | 0.1081 | 0.2449 | 0.2000 | 0.1034 | 0.2045 | 0.1471 | 0.1176 | 0.2963 | 0.0789 | 0.1628 | 54 |
| | SMN | 0.1250 | 0.7571 | 0.4865 | 0.6327 | 0.4000 | 0.3793 | 0.4706 | 0.3529 | 0.7963 | 0.3947 | 0.4965 | 43 |
| Brier score (BS) | RF | 0.5425 | 0.3404 | 0.3997 | 0.3883 | 0.4163 | 0.3961 | 0.3725 | 0.3947 | 0.4257 | 0.3215 | 0.3810 | 0.3967 | 14 | |
| | RUS | 0.4461 | 0.3104 | 0.3724 | 0.3793 | 0.4299 | 0.3735 | 0.3829 | 0.4871 | 0.3892 | 0.3936 | 0.3894 |
| | SMO | 0.4263 | 0.6739 | 0.3281 | 0.3379 | 0.4205 | 0.4067 | 0.4138 | 0.3881 | 0.3924 | 0.4146 | 0.3467 | 0.3814 | 0.4109 | 22 |
| | SMN | 0.4303 | 0.4156 | 0.3503 | 18 |
| Sensitivity–specificity gap (SSG)b | RF | 0.9167 | 0.9982 | 0.6464 | 0.8628 | 0.7987 | 0.8464 | 0.9601 | 0.7031 | 0.8511 | 0.8165 | 0.5217 | 0.9157 | 0.8198 | 17 |
| | RUS | 0.6857 | 0.2028 | 0.1499 | 87 |
| | SMO | 0.8297 | 0.9964 | 0.6893 | 0.8898 | 0.7463 | 0.7929 | 0.8930 | 0.7576 | 0.8494 | 0.8789 | 0.6783 | 0.9175 | 0.8266 |
| | SMN | 0.8588 | 0.4800 | 0.3189 | 0.5716 | 0.5920 | 0.4625 | 0.6000 | 0.0978 | 0.5730 | 0.4465 | 55 |
| Averagec | RF | − 0.0215 | 0.3297 | 0.2048 | 0.1928 | 0.1638 | 0.0396 | 0.2344 | 0.2184 | 0.1535 | 0.3744 | 0.1187 | 0.1854 | 59 | |
| | RUS | 0.0927 | 0.4171 | 0.2700 | 0.2714 | 0.2140 | 0.3479 | 0.4728 | 0.2788 |
| | SMO | 0.1811 | − 0.0385 | 0.2956 | 0.2072 | 0.2593 | 0.1953 | 0.1585 | 0.2084 | 0.2105 | 0.1447 | 0.2907 | 0.1557 | 0.1890 | 46 |
| | SMN | 0.1329 | 0.0638 | 0.2689 | 0.2671 | 0.1848 | 0.2966 | 47 |
The metrics were calculated using the test datasets (see Table 1). The best performer among the four classifiers is highlighted in bold for each assay and each evaluation metric. The highest value indicates the best performer for all metrics except the Brier score and the sensitivity–specificity gap, for which lower values are better. See Additional file 1: Table S1 for the specificity values
aCoefficient of variation (CV) = standard deviation/mean of 12 assays
bSSG = |Specificity − Sensitivity|
cAverage (of 9 metrics) = (F1 + MCC + AUROC + AUPRC + BA + Precision + Recall − BS − SSG)/9. The values of BS and SSG are subtracted from (instead of added to) the sum because BS and SSG are negatively correlated with model performance
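The footnote definitions above can be written out directly. A small sketch follows; the SSG check reuses the RF/NR-AR sensitivity of 0.0833 from the table and assumes the corresponding specificity was 1.0 (an assumption consistent with the reported gap of 0.9167, though the actual specificities are in the Additional file), and the composite-average inputs are made-up values:

```python
from statistics import mean, stdev

def ssg(sensitivity, specificity):
    """Sensitivity-specificity gap: |specificity - sensitivity|."""
    return abs(specificity - sensitivity)

def composite_average(f1, mcc, auroc, auprc, ba, precision, recall, bs, gap):
    """Average of 9 metrics; BS and SSG enter with a minus sign
    because lower values of those two mean better performance."""
    return (f1 + mcc + auroc + auprc + ba + precision + recall - bs - gap) / 9

def cv_percent(values):
    """Coefficient of variation across assays: stdev / mean, in percent."""
    return 100 * stdev(values) / mean(values)

# Assumed specificity of 1.0 reproduces the reported RF/NR-AR SSG of 0.9167
rf_nr_ar_ssg = ssg(sensitivity=0.0833, specificity=1.0)
```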
Correlation coefficients (CCs) between log2IR and six performance metrics plus the average of nine metrics in Table 2 for all four classification algorithms
| Metrics | RF | RUS | SMO | SMN |
|---|---|---|---|---|
| F1 score | − 0.7217 | − 0.7394 | − 0.6941 | − 0.9817 |
| MCC | − 0.5778 | − 0.6180 | − 0.6419 | − 0.9761 |
| BA | − 0.6539 | − 0.6274 | − 0.6227 | − 0.9461 |
| AUPRC | − 0.7034 | − 0.7148 | − 0.8418 | − 0.9628 |
| AUROC | − | − | − | − 0.7417 |
| SSG | 0.7158 | 0.7072 | 0.7006 | 0.9195 |
| Average | − 0.6536 | − 0.8421 | − 0.7725 | − 0.9822 |
Insignificant CCs are highlighted in bold and are those whose absolute values are smaller than 0.5760, the critical value at the α = 0.05 significance level for df = 10 degrees of freedom (i.e., n − 2, where n = 12 assays)
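The 0.5760 cutoff in this note comes from the standard significance test for Pearson's r with df = n − 2. A self-contained sketch with toy numbers (not the study's data) shows both the correlation computation and the cutoff check:

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def significant(r, critical=0.5760):
    """Two-sided test at alpha = 0.05 for df = 10 (n = 12 assays), as in the note."""
    return abs(r) > critical

# Toy series: a metric falling linearly as log2(IR) rises -> r = -1
r = pearson_r([1.0, 2.0, 3.0, 4.0], [0.9, 0.7, 0.5, 0.3])
```

For example, the SMN/F1 coefficient of −0.9817 reported in the table clears the cutoff, whereas any |CC| below 0.5760 would not.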
Fig. 5 The relationship between imbalance ratio (log2IR) and prediction performance metrics calculated for four classification methods (SMN, SMO, RUS and RF): a F1 score, b MCC, c SSG, and d the average of 9 metrics
Fig. 6 Average Friedman ranks of the four classification methods (RF, RUS, SMO and SMN) with respect to five metrics (F1 score, AUPRC, AUROC, MCC and BA). Error bars represent standard errors. See Table 4 for statistical significance in the difference between classifiers
Friedman’s aligned ranks test and Bergmann-Hommel post hoc analysis results showing corrected p-values for the overall comparison of all four classifiers and for pairwise comparisons between SMN and each of the other three classifiers
| Comparisons | F1 score | AUPRC | AUROC | MCC | BA | Precision | Recall | Brier score | SSG |
|---|---|---|---|---|---|---|---|---|---|
| All four classifiers | 0.0005 | 0.0462 | 0.0111 | 5.4e−06 | 9.0e−05 | 1.8e−06 | 0.0017 | 2.0e−06 | |
| SMN vs RF | 0.0003 | 0.0168 | 0.0088 | 0.0001 | 0.0278 | 0.0013 | 0.0009 | 0.0010 | |
| SMN vs RUS | 0.0051 | 0.0062 | 0.0022 | 0.0274 | |||||
| SMN vs SMO | 0.0003 | 0.0088 | 0.0001 | 0.0278 | 0.013 | 0.0007 | 8.4e−04 |
Insignificant statistics (p > 0.05) are highlighted in bold
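The average ranks plotted in Fig. 6 are the basic ingredient of Friedman-type tests such as the one summarized above. A simplified sketch follows: plain per-dataset ranking without the "aligned" adjustment or tie handling that the study's test uses, with made-up scores for the four methods:

```python
def average_ranks(scores_by_dataset):
    """Mean rank of each method across datasets (rank 1 = best).
    Higher score is assumed better; ties are not handled in this sketch."""
    methods = list(scores_by_dataset[0])
    totals = dict.fromkeys(methods, 0)
    for scores in scores_by_dataset:
        for rank, m in enumerate(sorted(methods, key=lambda m: -scores[m]), 1):
            totals[m] += rank
    return {m: totals[m] / len(scores_by_dataset) for m in methods}

# Two hypothetical datasets scored by the four methods
ranks = average_ranks([
    {"RF": 0.15, "RUS": 0.12, "SMO": 0.20, "SMN": 0.25},
    {"RF": 0.10, "RUS": 0.18, "SMO": 0.16, "SMN": 0.30},
])
```

A method that ranks first on every dataset ends with an average rank of 1.0, which is the pattern Fig. 6 shows for SMN on most metrics.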
Comparison between this study and Tox21 Data Challenge winners in terms of the classification performance metrics AUROC and balanced accuracy
| Assay ID | This study: best AUROC | This study: best classifier (AUROC) | Dmlab: AUROC | Microsomes: AUROC | Challenge winner: AUROC | This study: best BA | This study: best classifier (BA) | Dmlab: BA | Microsomes: BA | Challenge winner: BA | Best classifier / challenge winner: AUROC | Best classifier / challenge winner: BA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NR-AR | 0.823 | RF | N/A | 0.828 | 0.644 | SMN | 0.610 | N/A | 0.736 | 0.994 | 0.875 | |
| NR-AR-LBD | RUS | 0.820 | N/A | 0.879 | 0.612 | RUS | 0.490 | N/A | 0.650 | 1.039 | 0.942 | |
| NR-AhR | 0.920 | SMN | 0.780 | 0.901 | 0.928 | 0.823 | SMN | 0.560 | 0.698 | 0.853 | 0.991 | 0.965 |
| NR-Aromatase | SMN | N/A | 0.838 | 0.727 | SMN | 0.560 | N/A | 0.737 | 1.014 | 0.986 | ||
| NR-ER | SMN | 0.770 | 0.783 | 0.810 | SMN | 0.660 | 0.621 | 0.749 | 1.065 | 1.057 | ||
| NR-ER-LBD | 0.823 | SMN | 0.770 | 0.827 | 0.697 | RUS | 0.590 | 0.550 | 0.715 | 0.995 | 0.975 | |
| NR-PPAR-γ | 0.794 | RUS | 0.830 | 0.718 | 0.861 | 0.745 | RUS | 0.550 | N/A | 0.785 | 0.922 | 0.949 |
| SR-ARE | SMN | 0.770 | 0.804 | 0.840 | SMN | 0.520 | 0.605 | 0.729 | 1.061 | 1.173 | ||
| SR-ATAD5 | 0.815 | SMO | 0.800 | 0.812 | 0.828 | 0.713 | RUS | 0.610 | 0.539 | 0.741 | 0.984 | 0.962 |
| SR-HSE | 0.848 | SMN | 0.860 | N/A | 0.865 | 0.667 | RUS | 0.560 | N/A | 0.799 | 0.980 | 0.835 |
| SR-MMP | 0.930 | RUS | 0.950 | N/A | 0.950 | 0.852 | RUS | 0.690 | N/A | 0.904 | 0.978 | 0.942 |
| SR-p53 | 0.879 | SMN | 0.826 | 0.880 | RUS | 0.58 | 0.523 | 0.765 | 0.998 | 1.017 | ||
| Average | 0.830 | 0.810 | 0.861 | 0.742 | 0.58 | 0.589 | 0.764 | 1.002 | 0.973 | |||
The values in italics are the highest among all the classifiers (both this study and Tox21 Data Challenge) whereas the values in bold font are the best among the Tox21 Data Challenge participating teams [34]