| Literature DB >> 31510862 |
Max Westphal, Werner Brannath
Abstract
Model selection and performance assessment for prediction models are important tasks in machine learning, e.g. for the development of medical diagnosis or prognosis rules based on complex data. A common approach is to select the best model via cross-validation and to evaluate this final model on an independent dataset. In this work, we propose to instead evaluate several models simultaneously. These may result from varied hyperparameters or completely different learning algorithms. Our main goal is to increase the probability of correctly identifying a model that performs sufficiently well. In this case, adjusting for multiplicity is necessary in the evaluation stage to avoid an inflation of the family-wise error rate. We apply the so-called maxT-approach, which is based on the joint distribution of the test statistics and suitable to (approximately) control the family-wise error rate for a wide variety of performance measures. We conclude that evaluating only a single final model is suboptimal. Instead, several promising models should be evaluated simultaneously, e.g. all models within one standard error of the best validation model. In extensive simulation studies, this strategy increased both the probability of correctly identifying a good model and the final model performance.
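The core idea from the abstract, i.e. testing several models against a performance benchmark on one evaluation set with a common maxT-type critical value derived from the estimated joint distribution of the test statistics, can be sketched as follows. This is an illustrative implementation of the general idea, not the authors' exact procedure; the function name, the accuracy-based test statistics, and the Monte Carlo quantile approximation are assumptions made for the sketch.

```python
import numpy as np

def max_t_evaluation(correct, theta_0=0.75, alpha=0.05, n_mc=100_000, seed=0):
    """Simultaneous evaluation of M models on one evaluation set with a
    maxT-type critical value (sketch of the general idea, not the authors'
    exact procedure).

    correct : (n, M) binary array, correct[i, m] = 1 if model m classifies
              evaluation sample i correctly (accuracy as performance measure).
    theta_0 : benchmark performance; hypotheses H0_m: accuracy_m <= theta_0.
    """
    n, M = correct.shape
    acc = correct.mean(axis=0)                            # point estimate per model
    se = np.maximum(np.sqrt(acc * (1 - acc) / n), 1e-8)   # Wald standard errors
    t_stat = (acc - theta_0) / se                         # one-sided test statistics

    # All models are tested on the same samples, so the statistics are
    # correlated; estimate their correlation from the per-sample results.
    R = np.corrcoef(correct, rowvar=False)
    R = np.nan_to_num(R)
    np.fill_diagonal(R, 1.0)

    # Monte Carlo approximation of the (1 - alpha) quantile of max_m Z_m with
    # Z ~ N(0, R): the common critical value that (approximately) controls
    # the family-wise error rate across all M hypotheses.
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(R + 1e-9 * np.eye(M))
    z_max = (rng.standard_normal((n_mc, M)) @ L.T).max(axis=1)
    crit = np.quantile(z_max, 1 - alpha)

    return acc, t_stat, crit, t_stat > crit

# Hypothetical pre-selection step from the abstract: carry forward every model
# whose validation performance lies within one standard error of the best one
# (val_acc and val_se are assumed arrays of validation estimates and SEs).
# selected = np.where(val_acc >= val_acc.max() - val_se[np.argmax(val_acc)])[0]
```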
Keywords: Artificial intelligence; diagnosis; diagnostic accuracy; machine learning; model evaluation; multiple testing; prognosis
Mesh:
Year: 2019 PMID: 31510862 PMCID: PMC7270727 DOI: 10.1177/0962280219854487
Source DB: PubMed Journal: Stat Methods Med Res ISSN: 0962-2802 Impact factor: 3.021
Figure 1. Schematic representation of the machine learning and evaluation process.
Figure 2. Illustration of the optimal and median performance of the M = 100 candidate models over N = 5000 simulation runs, stratified by prediction task. The diamond symbols indicate the performance of the data-generating model.
Figure 3. Rejection rate for the null hypothesis of the final model, stratified by scenario (top: A, bottom: B) and evaluation sample size (columns, from left to right).
Figure 4. Relative deviation of the naive (top) and corrected (bottom) point estimates from the true final model performance, for scenario B and different evaluation sample sizes. The diamond symbols indicate the sample means.
Figure 5. Distribution of final model performance relative to the optimal performance for learning task B. Results are stratified by evaluation sample size (columns) and learning sample size (rows).
Figure 6. Properties of selection rules when the learning and evaluation populations differ (measured via the KL divergence from the learning distribution to the evaluation distribution), for prediction task A. (a) Rejection rate under the global null. (b) Rejection rate under the alternative. (c) Relative final model performance.
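Figure 6 quantifies the shift between learning and evaluation populations via a KL divergence. As a rough illustration of that quantity only (the concrete distributions used in the paper are not reproduced here, and the example prevalences below are hypothetical), a minimal discrete-case sketch:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D(P || Q) for discrete distributions P (learning) and Q (evaluation)."""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical example: class prevalence 30% in the learning population
# versus 45% in the evaluation population.
print(kl_divergence([0.3, 0.7], [0.45, 0.55]))  # approx. 0.047
```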