Sankalp Jain1, Eleni Kotsampasakou1,2, Gerhard F Ecker3.
Abstract
Cheminformatics datasets used in classification problems, especially those related to biological or physicochemical properties, are often imbalanced. This presents a major challenge in development of in silico prediction models, as the traditional machine learning algorithms are known to work best on balanced datasets. The class imbalance introduces a bias in the performance of these algorithms due to their preference towards the majority class. Here, we present a comparison of the performance of seven different meta-classifiers for their ability to handle imbalanced datasets, whereby Random Forest is used as base-classifier. Four different datasets that are directly (cholestasis) or indirectly (via inhibition of organic anion transporting polypeptide 1B1 and 1B3) related to liver toxicity were chosen for this purpose. The imbalance ratio in these datasets ranges between 4:1 and 20:1 for negative and positive classes, respectively. Three different sets of molecular descriptors for model development were used, and their performance was assessed in 10-fold cross-validation and on an independent validation set. Stratified bagging, MetaCost and CostSensitiveClassifier were found to be the best performing among all the methods. While MetaCost and CostSensitiveClassifier provided better sensitivity values, Stratified Bagging resulted in high balanced accuracies.
Keywords: Classification model; Cost sensitive classifier; Imbalanced datasets; Machine learning; Meta-classifiers; Stratified bagging
Year: 2018 PMID: 29626291 PMCID: PMC5919997 DOI: 10.1007/s10822-018-0116-z
Source DB: PubMed Journal: J Comput Aided Mol Des ISSN: 0920-654X Impact factor: 4.179
An overview of the training and test datasets used in this study
| Dataset name | Total number of compounds | Number of positives | Number of negatives | Imbalance ratio (negatives:positives) | Source |
|---|---|---|---|---|---|
| OATP1B1 inhibition training | 1708 | 190 | 1518 | 8:1 | Kotsampasakou et al. |
| OATP1B1 inhibition testing | 201 | 64 | 137 | 2:1 | Kotsampasakou et al. |
| OATP1B3 inhibition training | 1725 | 124 | 1601 | 13:1 | Kotsampasakou et al. |
| OATP1B3 inhibition testing | 209 | 40 | 169 | 4:1 | Kotsampasakou et al. |
| Cholestasis human training | 1766 | 347 | 1419 | 4:1 | Mulliner et al. |
| Cholestasis human testing | 231 | 53 | 178 | 3:1 | Kotsampasakou et al. |
| Cholestasis animal training | 1578 | 75 | 1503 | 20:1 | Mulliner et al. |
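
The imbalance ratios in the table can be reproduced from the positive and negative counts. The following is a minimal sketch, assuming the ratio is simply negatives divided by positives, rounded to the nearest integer; the function name `imbalance_ratio` and the dictionary layout are illustrative choices, not part of the original study.

```python
# Counts (positives, negatives) copied from the table above.
datasets = {
    "OATP1B1 inhibition training": (190, 1518),
    "OATP1B1 inhibition testing": (64, 137),
    "OATP1B3 inhibition training": (124, 1601),
    "OATP1B3 inhibition testing": (40, 169),
    "Cholestasis human training": (347, 1419),
    "Cholestasis human testing": (53, 178),
    "Cholestasis animal training": (75, 1503),
}


def imbalance_ratio(positives, negatives):
    """Negatives per positive, rounded to the nearest integer (assumed convention)."""
    return round(negatives / positives)


# Map each dataset name to its rounded negatives-per-positive ratio.
ratios = {name: imbalance_ratio(p, n) for name, (p, n) in datasets.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}:1")
```

Under this assumption the training sets span 4:1 to 20:1, which matches the range stated in the abstract, while the test sets are less skewed (2:1 to 4:1).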
Fig. 1 Comparison of the performance of different meta-classifiers on the test sets: (a) OATP1B1 inhibition, (b) OATP1B3 inhibition, (c) human cholestasis. The x-axis shows sensitivity and the y-axis shows specificity. Squares correspond to MOE descriptors, triangles to ECFP6 fingerprints and circles to MACCS fingerprints. Each classifier is depicted in a different color: red for RF standalone, green for Bagging, blue for Stratified Bagging, dark pink for CostSensitiveClassifier, cyan for MetaCost, yellow for ThresholdSelector, orange for SMOTE and dark violet for ClassBalancer. Please note that the scaling of the two axes is different
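
To illustrate why sensitivity, specificity and balanced accuracy are reported rather than plain accuracy, the sketch below evaluates a trivial predictor that labels every compound as negative on the OATP1B3 test set (40 positives, 169 negatives, from the table above). It assumes the standard definitions, with balanced accuracy taken as the mean of sensitivity and specificity; the exact formulas used in the study are not shown in this record, and the helper name `rates` is hypothetical.

```python
def rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy, balanced accuracy)
    from confusion-matrix counts, using the standard definitions."""
    sensitivity = tp / (tp + fn)          # fraction of positives recovered
    specificity = tn / (tn + fp)          # fraction of negatives recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, accuracy, balanced_accuracy


# Class counts of the OATP1B3 inhibition test set, taken from the table.
positives, negatives = 40, 169

# Hypothetical majority-class predictor: every compound is called negative,
# so all positives are missed and all negatives are correct.
sens, spec, acc, bal_acc = rates(tp=0, fn=positives, tn=negatives, fp=0)

print(f"sensitivity={sens:.2f} specificity={spec:.2f} "
      f"accuracy={acc:.2f} balanced accuracy={bal_acc:.2f}")
```

Such a predictor is right for 169 of 209 compounds (about 81% accuracy) while detecting no inhibitors at all; its balanced accuracy of 0.5 exposes this, which is the kind of majority-class bias the abstract describes.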