Stephen R Piccolo, Avery Mecham, Nathan P Golightly, Jérémie L Johnson, Dustin B Miller.
Abstract
By classifying patients into subgroups, clinicians can provide more effective care than by applying a uniform approach to all patients. Such subgroups might include patients with a particular disease subtype, patients with a good (or poor) prognosis, or patients most (or least) likely to respond to a particular therapy. Transcriptomic measurements reflect the downstream effects of genomic and epigenomic variations. However, high-throughput technologies generate thousands of measurements per patient, and complex dependencies exist among genes, so it may be infeasible to classify patients using traditional statistical models. Machine-learning classification algorithms can help with this problem. However, hundreds of classification algorithms exist, and most support diverse hyperparameters, so it is difficult for researchers to know which are optimal for gene-expression biomarkers. We performed a benchmark comparison, applying 52 classification algorithms to 50 gene-expression datasets (143 class variables). We evaluated algorithms that represent diverse machine-learning methodologies and have been implemented in general-purpose, open-source machine-learning libraries. When available, we combined clinical predictors with gene-expression data. Additionally, we evaluated the effects of performing hyperparameter optimization and feature selection using nested cross-validation. Kernel- and ensemble-based algorithms consistently outperformed other types of classification algorithms; however, even the top-performing algorithms performed poorly in some cases. Hyperparameter optimization and feature selection typically improved predictive performance, and univariate feature-selection algorithms typically outperformed more sophisticated methods. Together, our findings illustrate that algorithm performance varies considerably when other factors are held constant and thus that algorithm selection is a critical step in biomarker studies.
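The nested cross-validation design mentioned in the abstract can be sketched with scikit-learn. This is a minimal illustration, not the study's actual configuration: the synthetic dataset, the logistic-regression hyperparameter grid, and the split counts are all placeholder assumptions.

```python
# Hedged sketch: Monte Carlo cross-validation with nested hyperparameter tuning.
# All data and settings here are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, ShuffleSplit

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

outer = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)  # Monte Carlo iterations
aurocs = []
for train_idx, test_idx in outer.split(X):
    # The inner cross-validation tunes hyperparameters on the training set only,
    # so the held-out test set never influences model selection.
    inner = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        scoring="roc_auc",
        cv=3,
    )
    inner.fit(X[train_idx], y[train_idx])
    probs = inner.predict_proba(X[test_idx])[:, 1]
    aurocs.append(roc_auc_score(y[test_idx], probs))

print(round(float(np.median(aurocs)), 3))
```

The key property is that every modeling decision (here, the choice of `C`) is made inside each training split, which keeps the outer AUROC estimates honest.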
Year: 2022 PMID: 35275931 PMCID: PMC8942277 DOI: 10.1371/journal.pcbi.1009926
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1Overview of analysis scenarios.
This study consisted of five separate but related analyses. This diagram indicates which data types were used in each analysis and whether we attempted to improve predictive performance via hyperparameter optimization or feature selection.
Fig 2Comparison of ranks for classification algorithms across performance metrics.
We calculated 14 performance metrics for each classification task. This graph shows results for Analysis 1 (using only gene-expression predictors). For each combination of dataset and class variable, we averaged the metric scores across all Monte Carlo cross-validation iterations. For some metrics (such as Accuracy), a relatively high value is desirable, whereas the opposite is true for other metrics (such as FDR). We ranked the classification algorithms such that relatively low ranks indicated more desirable performance for the metrics and averaged these ranks across the dataset/class combinations. This graph illustrates that the best-performing algorithms for some metrics do not necessarily perform optimally according to other metrics. AUROC = area under the receiver operating characteristic curve. AUPRC = area under the precision-recall curve. FDR = false discovery rate. FNR = false negative rate. FPR = false positive rate. MCC = Matthews correlation coefficient. MMCE = mean misclassification error. NPV = negative predictive value. PPV = positive predictive value.
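The rank-averaging procedure described in the Fig 2 caption can be illustrated with a small example. The dataset/class labels and AUROC values below are made up for demonstration; only the ranking logic mirrors the caption.

```python
# Hedged sketch of the Fig 2 ranking scheme: rank algorithms per dataset/class
# combination, then average the ranks per algorithm. Values are fabricated.
import pandas as pd

# Rows: hypothetical dataset/class combinations; columns: mean AUROC per algorithm.
scores = pd.DataFrame(
    {"svm": [0.90, 0.75, 0.80],
     "random_forest": [0.85, 0.80, 0.82],
     "knn": [0.70, 0.65, 0.78]},
    index=["datasetA/subtype", "datasetB/prognosis", "datasetC/response"],
)

# AUROC is a "higher is better" metric, so rank descending (rank 1 = best).
ranks = scores.rank(axis=1, ascending=False)
mean_ranks = ranks.mean(axis=0).sort_values()
print(mean_ranks)
```

For a "lower is better" metric such as FDR, the same code would use `ascending=True`, which is why the best algorithm can differ from metric to metric.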
Fig 3Tradeoff between execution time and predictive performance for classification algorithms.
When using gene-expression predictors only (Analysis 1), we calculated the median area under the receiver operating characteristic curve (AUROC) across 50 iterations of Monte Carlo cross validation for each combination of dataset, class variable, and classification algorithm. Simultaneously, we measured the median execution time (in seconds) for each algorithm across these scenarios. sklearn/logistic_regression attained the top predictive performance and was the 4th fastest algorithm (median = 5.3 seconds). The coordinates for the y-axis have been transformed to a log-10 scale. We used arbitrary AUROC thresholds to categorize the algorithms based on low, moderate, and high predictive ability.
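The time-versus-performance measurement behind Fig 3 amounts to timing each algorithm's fit-and-predict cycle alongside its AUROC. The sketch below uses synthetic data and two stand-in algorithms; it is not the study's benchmarking harness.

```python
# Hedged sketch: record wall-clock fit+predict time alongside AUROC per algorithm.
# The dataset and the two algorithms are illustrative placeholders.
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for name, model in [("logistic_regression", LogisticRegression(max_iter=1000)),
                    ("decision_tree", DecisionTreeClassifier(random_state=0))]:
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    probs = model.predict_proba(X_te)[:, 1]
    elapsed = time.perf_counter() - start          # seconds for fit + predict
    results[name] = (elapsed, roc_auc_score(y_te, probs))

for name, (seconds, auroc) in results.items():
    print(f"{name}: {seconds:.3f}s, AUROC={auroc:.3f}")
```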
Fig 4Relative predictive performance when training on gene-expression predictors alone vs. using clinical predictors alone or gene-expression predictors in combination with clinical predictors.
In both A and B, we used as a baseline the predictive performance that we attained using gene-expression predictors alone (Analysis 1). We quantified predictive performance using the area under the receiver operating characteristic curve (AUROC). In A, we show the relative increase or decrease in performance when using clinical predictors alone (Analysis 2). In most cases, AUROC values decreased; however, in a few cases, AUROC values increased (by as much as 0.42). In B, we show the relative change in performance when using gene-expression predictors in combination with clinical predictors (Analysis 3). For 82/109 (75%) of dataset/class combinations, including clinical predictors had no effect on performance. However, for the remaining combinations, the AUROC improved by as much as 0.15 and decreased by as much as 0.09.
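Combining gene-expression and clinical predictors, as in Analysis 3, reduces to column-wise concatenation of the two feature matrices so that each patient contributes one row. The matrix sizes and clinical variables below are made-up assumptions.

```python
# Hedged sketch of Analysis 3's input construction: stack gene-expression and
# clinical predictors column-wise. All values are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
expression = rng.normal(size=(5, 1000))          # 5 patients x 1000 genes
clinical = np.column_stack([
    rng.integers(40, 80, size=5),                # hypothetical age variable
    rng.integers(0, 2, size=5),                  # hypothetical sex variable (0/1)
])

combined = np.hstack([expression, clinical])     # one row per patient
print(combined.shape)
```

In practice, categorical clinical variables would need encoding and the blocks may warrant separate scaling, but the core operation is this concatenation.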
Fig 5Relative predictive performance when using default algorithm hyperparameters and all features vs. tuning hyperparameters or selecting features.
In both A and B, we used as a baseline the predictive performance that we attained using default hyperparameters for the classification algorithms (Analysis 3). We quantified predictive performance using the area under the receiver operating characteristic curve (AUROC). In A, we show the relative increase or decrease in performance when tuning hyperparameters within each training set (Analysis 4). In most cases, AUROC values increased. In B, we show the relative change in performance when performing feature selection within each training set (Analysis 5). AUROC increased for most dataset/class-variable combinations. The horizontal dashed lines indicate the median improvement across all dataset/class-variable combinations.
Summary of feature-selection algorithms.
We evaluated 14 feature-selection algorithms. The abbreviation for each algorithm contains a prefix that indicates which machine-learning library implemented the algorithm (mlr = Machine learning in R, sklearn = scikit-learn, weka = WEKA: The workbench for machine learning). For each algorithm, we provide a brief description of the algorithmic approach; we extracted these descriptions from the libraries that implemented the algorithms. In addition, we assigned high-level categories that indicate whether the algorithms evaluate a single feature (univariate) or multiple features (multivariate) at a time. In some cases, the individual machine-learning libraries aggregated algorithm implementations from third-party packages. In these cases, we cite the machine-learning library and the third-party package. When available, we also cite papers that describe the algorithmic methodologies used.
| Abbreviation | Description | Category |
|---|---|---|
| mlr/cforest.importance | Uses the permutation principle (based on Random Forests) to calculate standard and conditional importance of features | Multivariate |
| mlr/kruskal.test | Uses the Kruskal-Wallis rank sum test | Univariate |
| mlr/randomForestSRC.rfsrc | Uses the error rate for trees grown with and without a given feature | Multivariate |
| mlr/randomForestSRC.var.select | Selects variables using minimal depth (Random Forests) | Multivariate |
| sklearn/mutual_info | Calculates the mutual information between two feature clusterings | Univariate |
| sklearn/random_forest_rfe | Recursively eliminates features based on Random Forests classification | Multivariate |
| sklearn/svm_rfe | Recursively eliminates features based on support vector classification | Multivariate |
| weka/Correlation | Calculates Pearson’s correlation coefficient between each feature and the class | Univariate |
| weka/GainRatio | Measures the gain ratio of a feature with respect to the class | Univariate |
| weka/InfoGain | Measures the information gain of a feature with respect to the class | Univariate |
| weka/OneR | Evaluates the worth of a feature using the OneR classifier | Univariate |
| weka/ReliefF | Repeatedly samples an instance and considers the value of a given attribute for the nearest instance of the same and different class | Multivariate |
| weka/SVMRFE | Recursively eliminates features based on support vector classification | Multivariate |
| weka/SymmetricalUncertainty | Measures the symmetrical uncertainty of a feature with respect to the class | Univariate |
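A univariate algorithm like those in the table scores each feature against the class independently and keeps the top-scoring ones. The sketch below uses scikit-learn's ANOVA F-test scorer as a stand-in (it is not one of the 14 algorithms listed), with synthetic data.

```python
# Hedged illustration of univariate feature selection: score each feature
# independently against the class, then keep the top k. The f_classif scorer
# is a stand-in for the table's univariate methods; data are synthetic.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=100, n_features=200, n_informative=10,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=20)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)
```

Because each feature is scored in isolation, such methods are fast and, per the abstract, often competitive with multivariate approaches on gene-expression data.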
Fig 6Relative performance of classification algorithms using gene-expression and clinical predictors and performing feature selection.
We predicted patient states using gene-expression and clinical predictors with feature selection (Analysis 5). We used nested cross validation to estimate which features would be optimal for each algorithm in each training set. For each combination of dataset, class variable, and classification algorithm, we calculated the arithmetic mean of area under the receiver operating characteristic curve (AUROC) values across 5 iterations of Monte Carlo cross-validation. Next, we sorted the algorithms based on the average rank across all dataset/class combinations. Each data point that overlays the box plots represents a particular dataset/class combination.
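The nested design in the Fig 6 caption, where feature selection is re-estimated within each training set, is commonly achieved by placing the selector inside a pipeline so it is re-fit on every split. The selector, classifier, and data below are illustrative assumptions, not the study's actual components.

```python
# Hedged sketch of the Fig 6 setup: feature selection sits inside a Pipeline,
# so it is re-fit within every training split and never sees test data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=150, n_features=100, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),   # placeholder selector
    ("clf", LogisticRegression(max_iter=1000)),            # placeholder classifier
])

# 5 Monte Carlo iterations scored by AUROC, echoing the caption's setup.
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
aurocs = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(round(float(aurocs.mean()), 3))
```

Selecting features outside the cross-validation loop would leak test-set information into the model and inflate the AUROC estimates, which is the bias this design avoids.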
Fig 7Relative classification performance per combination of feature-selection and classification algorithm.
For each combination of dataset and class variable, we averaged area under receiver operating characteristic curve (AUROC) values across all Monte Carlo cross-validation iterations. Then for each classification algorithm, we ranked the feature-selection algorithms based on AUROC scores across all datasets and class variables. Lower ranks indicate better performance. Dark-red boxes indicate cases where a particular feature-selection algorithm was especially effective for a particular classification algorithm. The opposite was true for dark-blue boxes.
Summary of classification algorithms.
We compared the predictive ability of 52 classification algorithms that were available in ShinyLearner and had been implemented across 4 open-source machine-learning libraries. The abbreviation for each algorithm contains a prefix indicating which machine-learning library implemented the algorithm (mlr = Machine learning in R, sklearn = scikit-learn, weka = WEKA: The workbench for machine learning, keras = Keras). For each algorithm, we provide a brief description of the algorithmic approach; we extracted these descriptions from the libraries that implemented the algorithms. In addition, we assigned high-level categories that characterize the algorithmic methodology used by each algorithm. In some cases, the individual machine-learning libraries aggregated algorithm implementations from third-party packages. In these cases, we cite the machine-learning library and the third-party package. When available, we also cite papers that describe the algorithmic methodologies used. Finally, for each algorithm, we indicate the number of unique hyperparameter combinations evaluated in Analysis 4.
| Abbreviation | Description | Category | Combos |
|---|---|---|---|
| keras/dnn | Multi-layer neural network with Exponential Linear Unit activation | Artificial neural network | 54 |
| keras/snn | Multi-layer neural network with Scaled Exponential Linear Unit activation | Artificial neural network | 54 |
| mlr/C50 | C5.0 Decision Trees | Tree- or rule-based | 32 |
| mlr/ctree | Conditional Inference Trees | Tree- or rule-based | 4 |
| mlr/earth | Multivariate Adaptive Regression Splines | Linear discriminant | 36 |
| mlr/gausspr | Gaussian Processes | Kernel-based | 3 |
| mlr/glmnet | Generalized Linear Models with Lasso or Elasticnet Regularization | Linear discriminant | 3 |
| mlr/h2o.deeplearning | Deep Neural Networks | Artificial neural network | 32 |
| mlr/h2o.gbm | Gradient Boosting Machines | Ensemble | 16 |
| mlr/h2o.randomForest | Random Forests | Ensemble | 12 |
| mlr/kknn | k-Nearest Neighbor | Miscellaneous | 6 |
| mlr/ksvm | Support Vector Machines | Kernel-based | 40 |
| mlr/mlp | Multi-Layer Perceptron | Artificial neural network | 14 |
| mlr/naiveBayes | Naive Bayes | Miscellaneous | 2 |
| mlr/randomForest | Breiman and Cutler’s Random Forests | Ensemble | 12 |
| mlr/randomForestSRC | Fast Unified Random Forests for Survival, Regression, and Classification | Ensemble | 108 |
| mlr/ranger | A Fast Implementation of Random Forests | Ensemble | 12 |
| mlr/rpart | Recursive Partitioning and Regression Trees | Tree- or rule-based | 1 |
| mlr/RRF | Regularized Random Forests | Ensemble | 24 |
| mlr/sda | Shrinkage Discriminant Analysis | Linear discriminant | 2 |
| mlr/svm | Support Vector Machines | Kernel-based | 28 |
| mlr/xgboost | eXtreme Gradient Boosting | Ensemble | 3 |
| sklearn/adaboost | AdaBoost | Ensemble | 8 |
| sklearn/decision_tree | A decision tree classifier | Tree- or rule-based | 96 |
| sklearn/extra_trees | An extra-trees classifier | Ensemble | 24 |
| sklearn/gradient_boosting | Gradient Boosting for classification | Ensemble | 6 |
| sklearn/knn | k-nearest neighbors vote | Miscellaneous | 12 |
| sklearn/lda | Linear Discriminant Analysis | Linear discriminant | 3 |
| sklearn/logistic_regression | Logistic Regression | Kernel-based | 32 |
| sklearn/multilayer_perceptron | Multi-layer Perceptron | Artificial neural network | 24 |
| sklearn/random_forest | Random Forests | Ensemble | 24 |
| sklearn/sgd | Linear classifiers with stochastic gradient descent training | Linear discriminant | 36 |
| sklearn/svm | C-Support Vector Classification | Kernel-based | 32 |
| weka/Bagging | Bagging a classifier to reduce variance | Ensemble | 32 |
| weka/BayesNet | Bayes Network learning using various search algorithms and quality measures | Miscellaneous | 2 |
| weka/DecisionTable | Simple decision table majority classifier | Tree- or rule-based | 6 |
| weka/HoeffdingTree | Hoeffding tree | Tree- or rule-based | 32 |
| weka/HyperPipes | HyperPipe classifier | Miscellaneous | 1 |
| weka/J48 | Pruned or unpruned C4.5 decision tree | Tree- or rule-based | 96 |
| weka/JRip | Repeated Incremental Pruning to Produce Error Reduction | Tree- or rule-based | 12 |
| weka/LibLINEAR | LIBLINEAR—A Library for Large Linear Classification | Kernel-based | 16 |
| weka/LibSVM | Support vector machines | Kernel-based | 32 |
| weka/NaiveBayes | A Naive Bayes classifier using estimator classes | Miscellaneous | 3 |
| weka/OneR | 1R (1 rule) classifier | Tree- or rule-based | 3 |
| weka/RandomForest | Forest of random trees | Ensemble | 18 |
| weka/RandomTree | Tree that considers K randomly chosen attributes at each node | Tree- or rule-based | 2 |
| weka/RBFNetwork | Normalized Gaussian radial basis function network | Miscellaneous | 18 |
| weka/REPTree | Fast decision tree learner (reduced-error pruning with backfitting) | Tree- or rule-based | 16 |
| weka/SimpleLogistic | Linear logistic regression models | Linear discriminant | 5 |
| weka/SMO | Sequential minimal optimization for a support vector classifier | Kernel-based | 20 |
| weka/VFI | Voting feature intervals | Miscellaneous | 6 |
| weka/ZeroR | 0-R classifier (predicts the mean for a numeric class or the mode for a nominal class) | Baseline | 1 |