Anita Rácz, Dávid Bajusz, Károly Héberger.
Abstract
Machine learning classification algorithms are widely used for the prediction and classification of different properties of molecules, such as toxicity or biological activity. The prediction of toxic vs. non-toxic molecules is important because it can reduce testing on living animals, which has ethical as well as cost drawbacks. The quality of classification models can be determined with several performance parameters, which often give conflicting results. In this study, we performed a multi-level comparison with the use of different performance metrics and machine learning classification methods. Well-established and standardized protocols for the machine learning tasks were used in each case. The comparison was applied to three datasets (acute and aquatic toxicities) and the robust, yet sensitive, sum of ranking differences (SRD) and analysis of variance (ANOVA) were applied for evaluation. The effect of dataset composition (balanced vs. imbalanced) and 2-class vs. multiclass classification scenarios was also studied. Most of the performance metrics are sensitive to dataset composition, especially in 2-class classification problems. The optimal machine learning algorithm also depends significantly on the composition of the dataset.
Keywords: ANOVA; ROC; classifiers; machine learning; performance metrics; ranking; toxicity prediction
Year: 2019 PMID: 31374986 PMCID: PMC6695655 DOI: 10.3390/molecules24152811
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1 Workflow of the comparative study. Briefly, after descriptor generation and reduction, eleven machine learning methods are applied for model building (for each combination of 2-class/multiclass and balanced/imbalanced cases). After the calculation of the performance parameters, statistical analysis of the results is carried out with sum of ranking differences (SRD) and factorial analysis of variance (ANOVA). The complete process is carried out on three datasets.
Figure 2 (A) Summary of the number of molecules for the three datasets with specific conditions. (B) Illustration of merged datasets for the SRD analyses. Datasets 1, 2 and 3 contain the performance parameters of the calculated models. (CV is short for cross-validation.)
Figure 3 SRD analysis (for the balanced 2-class version). Normalized SRD values are plotted on the X and left Y axes (to make the ordering visually illustrative); the smaller, the better. The abbreviations of the performance metrics can be found in Section 3.3. The cumulative relative frequencies (right Y axis) correspond to the randomization test (see Section 3.4). Here, the diagnostic odds ratio (DOR) is closest to the reference (smallest SRD value), while AUAC (area under the accumulation curve) overlaps with the cumulative frequency curve and is therefore statistically indistinguishable from random ranking.
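To make the SRD idea concrete: each column (e.g. a performance metric's scores for a set of models) is converted to a ranking, and SRD is the sum of absolute rank differences against a reference ranking (the paper uses, e.g., row averages as the reference). The sketch below is a minimal illustration of this idea, not the authors' implementation; the function name `srd` and the toy numbers are assumptions.

```python
# Minimal sketch of sum of ranking differences (SRD); illustrative only.
def srd(values, reference):
    """Sum of absolute rank differences between the ordering induced by
    `values` and the ordering induced by `reference`."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, idx in enumerate(order):
            r[idx] = rank + 1  # ranks start at 1
        return r
    rv, rr = ranks(values), ranks(reference)
    return sum(abs(a - b) for a, b in zip(rv, rr))

# Toy example: three models scored by one metric vs. a reference column
# (e.g. the row-average of all metrics). Identical orderings give SRD = 0.
metric = [0.90, 0.75, 0.80]
reference = [0.85, 0.70, 0.95]
print(srd(metric, reference))  # rank difference accumulates to 2 here
```

In the paper, the normalized SRD value (scaled between 0 and 100) is then compared against the distribution of SRD values for random rankings (the randomization test mentioned above).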
Figure 4 (A) SRD values are, on average, higher for 2-class classification scenarios (farther from the reference), meaning that there is a greater degree of disagreement between the performance metrics in this case, highlighting the importance of their informed selection and application during model evaluation. The difference is even more pronounced if the dataset is imbalanced. (B) Most of the performance metrics are quite robust in multiclass scenarios, while in 2-class cases, the balanced or imbalanced datasets have a much greater effect on model ranking. (Normalized SRD values are always shown on the Y axis. The markers denote average values, and the vertical lines denote 95% confidence intervals.)
Figure 5 ANOVA results for the performance parameters based on the SRD (%) values. Arbitrary dotted lines denote the classification of the performance parameters into good, medium and not recommended categories.
Figure 6 Results of the SRD-COVAT method: 2-class classification with balanced (A) and imbalanced (B) classes; and multiclass classification with balanced (C) and imbalanced (D) classes. Clusters of similarly behaving performance parameters are separated with black lines (squares) on the plot, based on visual inspection.
Figure 7 SRD analysis (for the Daphnia magna dataset in the multiclass case, with imbalanced classes). Normalized SRD values are plotted on the X and left Y axes. The abbreviations of the classifiers can be found in Section 3.2. The cumulative relative frequencies (black curve, right Y axis) correspond to the randomization test. The “T” suffix indicates externally test-validated predictions; its absence indicates cross-validated predictions.
Figure 8 (A) Normalized SRD values for the eleven classifiers. Error bars denote 95% confidence intervals. Recommended classifiers are below the dotted line; the dotted circle shows an intermediate one. (B) Decomposition of the classifiers according to dataset composition (balanced vs. imbalanced classes). Normalized SRD [%] was scaled between 0 and 100.
Dataset compositions for the three case studies.
| Dataset | Composition | Set | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 |
|---|---|---|---|---|---|---|---|---|
| Dataset 1 | Balanced | Training | 116 | 116 | 116 | 116 | | |
| | | Test | 29 | 50 | 48 | 37 | | |
| | Imbalanced | Training | 116 | 166 | 213 | 164 | | |
| | | Test | 29 | 50 | 48 | 37 | | |
| Dataset 2 | Balanced | Training | 28 | 28 | 28 | 28 | | |
| | | Test | 8 | 8 | 24 | 15 | | |
| | Imbalanced | Training | 48 | 65 | 58 | 84 | | |
| | | Test | 8 | 8 | 24 | 15 | | |
| Dataset 3 | Balanced | Training | 199 | 199 | 199 | 199 | 199 | 199 |
| | | Test | 58 | 132 | 267 | 587 | 291 | 14 |
| | Imbalanced | Training | 199 | 557 | 1053 | 2325 | 1178 | 619 |
| | | Test | 58 | 132 | 267 | 587 | 291 | 14 |
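The balanced training sets in the table contain equally many molecules per class. One common way to produce such a set is random undersampling to the size of the smallest class; the sketch below assumes this scheme (the authors' exact sampling procedure may differ, and the function name `undersample` is illustrative).

```python
import random

# Hedged sketch: balance a training set by random undersampling to the
# smallest class, as in the "balanced" rows of the table above.
def undersample(samples, labels, seed=42):
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    n_min = min(len(members) for members in by_class.values())
    out = []
    for y, members in sorted(by_class.items()):
        for s in rng.sample(members, n_min):  # keep n_min per class
            out.append((s, y))
    return out

# Toy example: 6 / 3 / 1 samples in classes 1 / 2 / 3 -> 1 per class.
balanced = undersample(list(range(10)), [1, 1, 1, 1, 1, 1, 2, 2, 2, 3])
```

Note that only the training sets are balanced; the test sets in the table keep their original (imbalanced) composition, so external validation reflects the real class distribution.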
Summary of the machine learning algorithms. Abbreviations are the names found in the WEKA package. Classification schemes are more general categories (types) of the algorithms. (* Func. is short for “Function”.)
| Name (Abbreviation) | Class. Scheme | Details |
|---|---|---|
| Naïve Bayes (NaiB) | Bayes | This algorithm is based on the Bayes theorem and the assumption that all attributes are independent. The samples are examined separately, and the individual probability of belonging to each particular class is calculated. Standard options were used in the WEKA NaiveBayes node [ |
| FilteredClassifier (Fil) | Meta | The algorithm runs an arbitrary classifier on data that has been passed through an arbitrary filter. An attribute selection filter was used with CfsSubset evaluation and the best-first search method [ |
| IBk (k-nearest neighbors) | Lazy | One of the simplest algorithms, where the class membership is assigned based on the majority vote of the k nearest neighbors. |
| HyperPipes (Hip) | Misc | A fast and simple algorithm, which works well with many attributes. The basic idea of the method is the construction of a pipe for each class, recording the patterns of attribute values observed for that class. Each sample is assigned to the class whose pipe best contains its attribute values [ |
| MultiboostAB (Mboo) | Meta | This algorithm is a modified version of the AdaBoost technique with wagging. The idea of wagging is to assign random weights to the cases in each training set based on the Poisson distribution. |
| libSVM, library SVM (SVM) | Func.* | Support vector machine can define hyperplane(s) in a higher dimensional space to separate the classes of samples distinctly. The plane should have the maximum margin between data points. Support vectors (points) can maximize the margin of the classifier. Different kernel functions and optimization parameters can be used for the classification task with SVM [ |
| oneR, based on 1-rule, (OneR) | Rule | This algorithm ranks the attributes based on the error rate (on the training set). The basic concept is connected to 1-rules algorithms, where the samples are classified based on a single attribute [ |
| Bagging (Bag) | Meta | The basic concept of bagging is the creation of different models based on the bootstrapped training sets. The average (or vote) of these multiple versions are used for the prediction of class memberships for each sample [ |
| Ensemble Selection (EnS) | Meta | It combines several classifier algorithms in the ensemble. The average prediction of the models in the ensemble is applied for the class membership determination. The selection of the models is based on an error metric (in our case, RMSE). Forward selection was used for the optimization of the ensemble. Iterations (here, 100) are also carried out, as in the case of Bagging. |
| Decorate (Dec) | Meta | It is also an ensemble-type algorithm, where the ensembles are constructed directly as diverse hypotheses, with additional artificially constructed training examples added to the original ones. The classifier works on the union of the original training data and the artificial data (diversity data). New classifiers are added to the ensemble if the training error does not increase [ |
| Random Forest (RF) | Trees | Random forest is a tree-based method, which can be used for classification and regression problems alike. The basic idea is that it builds many trees, each of which predicts a classification. The final classification is made by a majority vote over the trees. The individual trees are weak predictors, but together they form an ensemble; with the vote of each tree, the method can make good predictions [ |
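To make one of the simpler entries concrete, here is a minimal, self-contained sketch of the 1-rule (OneR) idea from the table: for each attribute, map every observed value to its majority class, then keep the single attribute whose rule makes the fewest training errors. The function name and the toy data are illustrative only, not the WEKA implementation.

```python
from collections import Counter, defaultdict

# Hedged sketch of the OneR (1-rule) classifier described in the table.
def one_r(X, y):
    best = None  # (error_count, attribute_index, rule_dict)
    for a in range(len(X[0])):
        votes = defaultdict(Counter)
        for row, label in zip(X, y):
            votes[row[a]][label] += 1  # tally classes per attribute value
        # Rule: each attribute value maps to its majority class.
        rule = {v: c.most_common(1)[0][0] for v, c in votes.items()}
        errors = sum(rule[row[a]] != label for row, label in zip(X, y))
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    return best

# Toy data: attribute 0 separates the classes perfectly, attribute 1 does not.
X = [("hi", "x"), ("hi", "y"), ("lo", "x"), ("lo", "y")]
y = ["toxic", "toxic", "safe", "safe"]
errors, attr, rule = one_r(X, y)  # picks attribute 0 with zero errors
```

OneR is a useful baseline precisely because it is so constrained: if a complex model barely beats the best single-attribute rule, the added complexity is hard to justify.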
Confusion matrix of observations in 2-class classification.
| | Predicted + (PP) | Predicted − (PN) |
|---|---|---|
| Actual + (P) | True positive (TP) | False negative (FN) |
| Actual − (N) | False positive (FP) | True negative (TN) |
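The four cells of the confusion matrix can be tallied directly from paired label lists; a minimal sketch (the helper name `confusion` is an assumption for illustration):

```python
# Hedged sketch: tallying the 2-class confusion matrix of the table above.
def confusion(actual, predicted, positive=1):
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # predicted +, actually +
            else:
                fp += 1   # predicted +, actually -
        else:
            if a == positive:
                fn += 1   # predicted -, actually +
            else:
                tn += 1   # predicted -, actually -
    return tp, fn, fp, tn

actual    = [1, 1, 1, 0, 0, 0]
predicted = [1, 0, 1, 0, 1, 0]
tp, fn, fp, tn = confusion(actual, predicted)  # -> 2, 1, 1, 2
```

All of the local performance metrics in the following tables are functions of these four counts.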
Figure 9 Mock datasets to showcase common classification scenarios. (A) In structure-based virtual screening, a docking score is commonly used as a rough estimator of the free energy of binding between ligand and protein (the smaller, the better). Predicting a ligand to be active/positive requires setting a threshold value of the docking score (T): each ligand with a better score will be considered a predicted active/positive. (B) In 2-class classification, machine learning methods typically output probability values for each sample, for belonging to the positive class. A probability value of 0.5 or higher is a natural choice to assign the samples into the positive class. Naturally, other choices can be applied as well: in the above example, setting the threshold value to either 0.6 or 0.4 would reduce the number of misclassified samples by one. (C) In multiclass classification, the most straightforward option is to assign each sample to the class with the highest predicted probability. (Green: correct; red: incorrect classification.)
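The threshold effect described in panel (B) is easy to reproduce on toy data: sweeping the probability cutoff changes which samples are misclassified. The numbers below are illustrative assumptions, not the figure's actual values.

```python
# Hedged sketch: counting misclassifications as the probability
# threshold moves (toy data chosen so that moving the cutoff away
# from 0.5 reduces the error count by one, as in the caption).
def misclassified(probs, actual, threshold):
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(p != a for p, a in zip(preds, actual))

probs  = [0.95, 0.80, 0.55, 0.45, 0.30, 0.10]
actual = [1,    1,    0,    1,    0,    0]
for t in (0.4, 0.5, 0.6):
    print(t, misclassified(probs, actual, t))
```

Because the threshold is a free parameter, the "local" metrics of the next tables all depend on its choice; the "global" metrics further below integrate over all thresholds instead.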
Local performance metrics for 2-class classification—One-sided.
| Name | Alternative Names | Formula | Complementary Metric | Complementary Formula |
|---|---|---|---|---|
| True positive rate (TPR) | Sensitivity, recall, hit rate | TPR = TP/(TP + FN) | False negative rate (FNR), miss rate | FNR = FN/(TP + FN) = 1 − TPR |
| True negative rate (TNR) | Specificity, selectivity | TNR = TN/(TN + FP) | False positive rate (FPR), fall-out | FPR = FP/(TN + FP) = 1 − TNR |
| Positive predictive value (PPV) | Precision | PPV = TP/(TP + FP) | False discovery rate (FDR) | FDR = FP/(TP + FP) = 1 − PPV |
| Negative predictive value (NPV) | | NPV = TN/(TN + FN) | False omission rate (FOR) | FOR = FN/(TN + FN) = 1 − NPV |
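The one-sided metrics and their complements can be computed in a few lines from the confusion-matrix counts; a hedged sketch with arbitrary example counts (the function name `one_sided` is illustrative):

```python
# Hedged sketch: one-sided metrics from raw confusion-matrix counts.
def one_sided(tp, fn, fp, tn):
    tpr = tp / (tp + fn)   # sensitivity / recall
    tnr = tn / (tn + fp)   # specificity
    ppv = tp / (tp + fp)   # precision
    npv = tn / (tn + fn)
    # Each complementary metric is 1 minus its counterpart.
    return {"TPR": tpr, "TNR": tnr, "PPV": ppv, "NPV": npv,
            "FNR": 1 - tpr, "FPR": 1 - tnr,
            "FDR": 1 - ppv, "FOR": 1 - npv}

m = one_sided(tp=8, fn=2, fp=4, tn=6)  # arbitrary example matrix
```

Note the asymmetry that makes these metrics "one-sided": TPR and FNR are computed on the actual positives only, TNR and FPR on the actual negatives only, while PPV/FDR and NPV/FOR condition on the predicted labels instead.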
Local performance metrics for 2-class classification—Two-sided. (n: total number of samples, k: total number of classes).
| Name | Formula | Description |
|---|---|---|
| Accuracy (ACC), or correct classification rate (CC) | ACC = (TP + TN)/n | Readily generalized to multiple classes. |
| Balanced accuracy (BACC) | BACC = (TPR + TNR)/2 | Alternative to accuracy for imbalanced datasets. |
| F1 score (F1), or F measure | F1 = 2·PPV·TPR/(PPV + TPR) | Harmonic mean of precision and recall. |
| Matthews correlation coefficient (MCC) | MCC = (TP·TN − FP·FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)) | Readily generalized to multiple classes. |
| Bookmaker informedness (BM), or informedness | BM = TPR + TNR − 1 | |
| Markedness (MK) | MK = PPV + NPV − 1 | |
| Positive likelihood ratio (LR+) | LR+ = TPR/FPR | |
| Negative likelihood ratio (LR−) | LR− = FNR/TNR | |
| Diagnostic odds ratio (DOR) | DOR = LR+/LR− | |
| Enrichment factor (EF) | EF(x%) = (TP in top x% / samples in top x%)/(P/n) | Ratio of true positives in the top x% of the predictions, divided by the ratio of positives in the whole dataset. |
| ROC enrichment (ROC_EF) | ROC_EF(x) = TPR/FPR at FPR = x | Ratio of TPR and FPR at a fixed FPR value (x). Independent of dataset composition. |
| Cohen’s kappa (κ) | κ = (p₀ − pₑ)/(1 − pₑ), where p₀ is the observed and pₑ the chance agreement | Readily generalized to multiple classes. |
| Jaccard score (J) | J = TP/(TP + FP + FN) | Jaccard–Tanimoto similarity between the sets of predicted and actual (true) labels for the complete set of samples. |
| Brier score loss (B) | B = (1/n)·Σᵢ(pᵢ − oᵢ)², with predicted probability pᵢ and actual outcome oᵢ | Readily generalized to multiple classes. |
| Robust initial enhancement (RIE) | RIE = Σᵢ e^(−α·rᵢ/n), normalized by the average of the same sum over random orderings | rᵢ is the rank of the i-th positive sample; α is a tuning parameter. |
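Several of the two-sided metrics follow directly from the same four confusion-matrix counts; a sketch with arbitrary example values (the function name `two_sided` is illustrative):

```python
import math

# Hedged sketch: a selection of the two-sided metrics above,
# computed from raw confusion-matrix counts.
def two_sided(tp, fn, fp, tn):
    n = tp + fn + fp + tn
    tpr, tnr = tp / (tp + fn), tn / (tn + fp)
    ppv, npv = tp / (tp + fp), tn / (tn + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / n,
        "BACC": (tpr + tnr) / 2,
        "F1": 2 * ppv * tpr / (ppv + tpr),
        "MCC": mcc,
        "BM": tpr + tnr - 1,                          # informedness
        "MK": ppv + npv - 1,                          # markedness
        "DOR": (tpr / (1 - tnr)) / ((1 - tpr) / tnr), # LR+ / LR-
        "J": tp / (tp + fp + fn),                     # Jaccard score
    }

m = two_sided(tp=8, fn=2, fp=4, tn=6)  # arbitrary example matrix
```

The same example makes the paper's point tangible: with these counts, ACC (0.70) and F1 (about 0.73) look moderate while MCC (about 0.41) is noticeably stricter, because MCC weights both classes symmetrically.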
Global performance metrics for 2-class classification.
| Name | Formula | Description |
|---|---|---|
| Area under the ROC curve (AUC) | Area under the TPR–FPR curve | Probability that a randomly selected positive sample will be ranked before a randomly selected negative one. |
| Area under the accumulation curve (AUAC) | Area under the TPR–score (or TPR–rank) curve | If the ranks are normalized, then 0 ≤ AUAC ≤ 1. |
| Average precision (AP) | Area under the precision–recall (PPV–TPR) curve | |
| Boltzmann-enhanced discrimination of receiver operating characteristic (BEDROC) | RIE scaled to the interval [0, 1] | See the definition of RIE above. |
| Average rank (position) of actives (positives) | Mean of the ranks rᵢ of the positive samples | The smaller, the better, in a score-ordered list. |
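The rank-based global metrics can be illustrated without any library: AUC as the probability that a random positive outscores a random negative, and a top-x% enrichment factor. Function names and toy data below are illustrative assumptions, not reference implementations.

```python
# Hedged sketch of two global, threshold-free metrics.
def auc(scores, labels):
    """Probability that a random positive is scored above a random
    negative (higher score = more likely positive); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def enrichment_factor(scores, labels, fraction=0.1):
    """Hit rate in the top `fraction` of predictions divided by the
    hit rate of the whole dataset."""
    n = len(scores)
    top = sorted(zip(scores, labels), reverse=True)[:max(1, int(n * fraction))]
    hit_rate_top = sum(y for _, y in top) / len(top)
    hit_rate_all = sum(labels) / n
    return hit_rate_top / hit_rate_all

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]
```

Unlike accuracy or F1, these metrics need no classification threshold: they evaluate the entire ranking of samples, which is why the paper groups them as "global" performance parameters.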