Weijie Chen, Frank W Samuelson, Brandon D Gallas, Le Kang, Berkman Sahiner, Nicholas Petrick.
Abstract
BACKGROUND: The surge in biomarker development calls for research on statistical evaluation methodology to rigorously assess emerging biomarkers and classification models. Recently, several authors reported the puzzling observation that, in assessing the added value of new biomarkers to existing ones in a logistic regression model, statistical significance of new predictor variables does not necessarily translate into a statistically significant increase in the area under the ROC curve (AUC). Vickers et al. concluded that this inconsistency is because AUC "has vastly inferior statistical properties," i.e., it is extremely conservative. This statement is based on simulations that misuse the DeLong et al. method. Our purpose is to provide a fair comparison of the likelihood ratio (LR) test and the Wald test versus diagnostic accuracy (AUC) tests.
DISCUSSION: We present a test to compare ideal AUCs of nested linear discriminant functions via an F test. We compare it with the LR test and the Wald test for the logistic regression model. The null hypotheses of these three tests are equivalent; however, the F test is an exact test whereas the LR test and the Wald test are asymptotic tests. Our simulation shows that the F test has the nominal type I error even with a small sample size. Our results also indicate that the LR test and the Wald test have inflated type I errors when the sample size is small, while the type I error converges to the nominal value asymptotically with increasing sample size as expected. We further show that the DeLong et al. method tests a different hypothesis and has the nominal type I error when it is used within its designed scope. Finally, we summarize the pros and cons of all four methods we consider in this paper.
Year: 2013 PMID: 23895587 PMCID: PMC3733611 DOI: 10.1186/1471-2288-13-98
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
Figure 1. “Antler” plot for the logistic regression model. Fifteen simulated biomarkers are assumed to follow a pair of normal distributions for the two classes. At each training sample size, the AUC performance is estimated in one Monte Carlo (MC) trial with (#1) resubstitution, (#2) a small independent test set (60 observations per class), and (#3) a large independent test set (10,000 observations per class). The MC trial is repeated independently 1,000 times and the sample mean and the sample standard deviation (SD) of the estimated AUC values are calculated for each estimator. The figure plots the theoretically ideal AUC and the sample mean AUC (±1 SD) at training sample sizes 60, 120, 240, 360, and 480 (note that the plot is shifted slightly along the horizontal axis to avoid overlap between error bars).
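The contrast between resubstitution and independent-test AUC described in the caption can be illustrated with a small Monte Carlo sketch. Several assumptions are made here to keep it dependency-free: the logistic regression model is swapped for a simpler linear score (the difference of class sample means), the positive-class means are taken to be the first 15 entries of the parameter table, only the 60-per-class training size and the small test set are simulated, and 200 trials are run instead of 1,000. The function names and the seed are hypothetical.

```python
import random
import statistics

# Assumption: first 15 positive-class means from the parameter table; identity covariance.
MEANS = [0.7, 0.6, 0.6, 0.5, 0.5, 0.3, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.0, 0.0, 0.0]


def draw(rng, n, means):
    """Draw n observations with independent unit-variance normal features."""
    return [[rng.gauss(m, 1.0) for m in means] for _ in range(n)]


def empirical_auc(neg_scores, pos_scores):
    """Mann-Whitney estimate of the AUC; ties count one half."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def train_weights(neg, pos):
    """Difference of class sample means (a stand-in for the fitted model)."""
    dim = len(neg[0])
    return [
        statistics.fmean(x[k] for x in pos) - statistics.fmean(x[k] for x in neg)
        for k in range(dim)
    ]


def score(weights, data):
    """Linear score of every observation in data."""
    return [sum(w * v for w, v in zip(weights, x)) for x in data]


def run_trials(n_train=60, n_test=60, trials=200, seed=0):
    """Return the mean resubstitution AUC and the mean independent-test AUC."""
    rng = random.Random(seed)
    zeros = [0.0] * len(MEANS)
    resub, indep = [], []
    for _ in range(trials):
        neg, pos = draw(rng, n_train, zeros), draw(rng, n_train, MEANS)
        w = train_weights(neg, pos)
        # Resubstitution: score the same cases that were used for training.
        resub.append(empirical_auc(score(w, neg), score(w, pos)))
        # Independent test: score freshly drawn cases.
        test_neg, test_pos = draw(rng, n_test, zeros), draw(rng, n_test, MEANS)
        indep.append(empirical_auc(score(w, test_neg), score(w, test_pos)))
    return statistics.fmean(resub), statistics.fmean(indep)


mean_resub, mean_indep = run_trials()
print(f"resubstitution: {mean_resub:.3f}, independent test: {mean_indep:.3f}")
```

Under these assumptions the mean resubstitution AUC should sit above the mean independent-test AUC, which is the optimistic bias of resubstitution that the plot is meant to show.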
User-selected mean and covariance matrix parameters for the normal distributions in simulating the joint distributions of 16 biomarkers
| | | Null-hypothesis experiment | Alternative-hypothesis experiment |
| Negative class | Mean | [0,…,0] (1×16) | [0,…,0] (1×16) |
| | Cov | Identity | Identity |
| Positive class | Mean | [0.7, 0.6, 0.6, 0.5, 0.5, 0.3, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0, 0, 0, 0] | [0.7, 0.6, 0.6, 0.5, 0.5, 0.3, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0, 0, 0, 0.6] |
| | Cov | Identity | Identity |
| Ideal AUCs for LDF | | 0.8413 (15 biomarkers) vs. 0.8413 (16 biomarkers) | 0.8413 (15 biomarkers) vs. 0.8613 (16 biomarkers) |
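The ideal AUCs in this table can be checked from the mean vectors alone. For two normal classes with a common covariance, the ideal linear discriminant has AUC = Φ(Δ/√2), where Δ is the Mahalanobis distance between the class means; with identity covariance, Δ is simply the Euclidean norm of the mean difference. The sketch below applies that standard result; the function name `ideal_auc` is hypothetical.

```python
import math


def ideal_auc(mean_diff):
    """AUC of the ideal LDF for two normal classes with identity covariance."""
    delta = math.sqrt(sum(d * d for d in mean_diff))  # Mahalanobis distance
    # Phi(delta / sqrt(2)), using Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    return 0.5 * (1.0 + math.erf(delta / 2.0))


# Positive-class means of the first 15 biomarkers (negative-class means are all 0).
first_15 = [0.7, 0.6, 0.6, 0.5, 0.5, 0.3, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.0, 0.0, 0.0]

auc_15 = ideal_auc(first_15)
auc_16_null = ideal_auc(first_15 + [0.0])  # 16th biomarker carries no information
auc_16_alt = ideal_auc(first_15 + [0.6])   # 16th biomarker has mean shift 0.6

print(round(auc_15, 4), round(auc_16_null, 4), round(auc_16_alt, 4))
# 0.8413 0.8413 0.8613
```

The squared mean differences of the first 15 biomarkers sum to 2, so Δ/√2 = 1 and the 15-biomarker AUC is Φ(1); adding a 16th biomarker with mean 0 leaves Δ unchanged, which is what makes the first column a null-hypothesis setting.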
Comparison of different statistical tests in assessing whether a new biomarker has added value
| Null hypothesis (type I error) | |||||
| LR test | 0.1033 | 0.0668 | 0.0585 | 0.0600 | 0.0546 |
| Wald test | 0.0781 | 0.0608 | 0.0546 | 0.0571 | 0.0538 |
| F test of ideal AUC | 0.0495 | 0.0482 | 0.0481 | 0.0522 | 0.0515 |
| Alt. hypothesis (power) | |||||
| LR test | 0.5956 | 0.8569 | 0.9539 | 0.9896 | 1.0000 |
| Wald test | 0.5394 | 0.8453 | 0.9506 | 0.9889 | 1.0000 |
| F test of ideal AUC | 0.5196 | 0.8514 | 0.9538 | 0.9915 | 1.0000 |
Fraction of significant findings in 20,000 Monte Carlo trials at a statistical significance cut-off of 0.05. The top rows give the results of the null-hypothesis experiment, where the fraction of significant findings is the observed type I error rate. The bottom rows give the results of the alternative-hypothesis experiment, where the fraction of significant findings is the statistical power.
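One way to judge whether an observed type I error rate is compatible with the nominal 0.05 is to compare it with the Monte Carlo standard error of a proportion estimated from 20,000 trials. The sketch below does this for the top (null-hypothesis) rows of the table, taking the values in column order; the normal approximation and the variable names are assumptions of this illustration, not part of the original analysis.

```python
import math

N_TRIALS = 20_000
ALPHA = 0.05

# Binomial standard error of an observed rejection rate when the true rate is ALPHA.
SE = math.sqrt(ALPHA * (1 - ALPHA) / N_TRIALS)

# Observed type I error rates copied from the null-hypothesis rows, in column order.
observed = {
    "LR test": [0.1033, 0.0668, 0.0585, 0.0600, 0.0546],
    "Wald test": [0.0781, 0.0608, 0.0546, 0.0571, 0.0538],
    "F test of ideal AUC": [0.0495, 0.0482, 0.0481, 0.0522, 0.0515],
}

# Distance of each observed rate from the nominal level, in standard errors.
z_scores = {
    name: [(rate - ALPHA) / SE for rate in rates]
    for name, rates in observed.items()
}

print(f"Monte Carlo SE at alpha = {ALPHA}: {SE:.5f}")
for name, zs in z_scores.items():
    print(name, [round(z, 1) for z in zs])
```

Under this approximation the standard error is about 0.0015, so the F test rates all lie within roughly 1.5 standard errors of 0.05, whereas the LR test rate in the first column is more than 30 standard errors above it.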
Simulation results demonstrating the application of the DeLong et al. method [3] and the U-statistics based method [9] for comparing two fixed models under the null hypothesis
| 0.0531/0.0547 | 0.0501/0.0507 | 0.0515/0.0521 |
| 0.0515/0.0527 | 0.0542/0.0558 | 0.0503/0.0511 |
| 0.0482/0.0498 | 0.0503/0.0512 | 0.0496/0.0500 |
| 0.0507/0.0530 | 0.0488/0.0505 | 0.0515/0.0526 |
| 0.0457/0.0493 | 0.0501/0.0519 | 0.0485/0.0500 |
| 0.0453/0.0486 | 0.0444/0.0471 | 0.0501/0.0517 |
Fraction of significant findings in 20,000 Monte Carlo trials at a statistical significance cut-off of 0.05. Each cell gives the value for the DeLong et al. method first and the value for the U-statistics based method second.
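For readers unfamiliar with the DeLong et al. approach, the following is a minimal sketch of its commonly stated form: a z test on the difference of two empirical AUCs, with the variance built from per-case structural components. It assumes both models are fixed (already trained) and score the same test cases. The toy scores and all function names are hypothetical, and this is not the simulation code behind the table.

```python
from math import erfc, sqrt


def psi(x, y):
    """Mann-Whitney kernel: 1 if the positive case scores higher, 0.5 on ties."""
    return 1.0 if x > y else 0.5 if x == y else 0.0


def components(pos, neg):
    """Empirical AUC plus the per-case structural components V10 and V01."""
    v10 = [sum(psi(x, y) for y in neg) / len(neg) for x in pos]
    v01 = [sum(psi(x, y) for x in pos) / len(pos) for y in neg]
    return sum(v10) / len(v10), v10, v01


def cov(a, b):
    """Sample covariance with an (n - 1) denominator."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)


def delong_test(pos_a, neg_a, pos_b, neg_b):
    """Two-sided z test for AUC_A - AUC_B; both models score the same cases."""
    auc_a, v10_a, v01_a = components(pos_a, neg_a)
    auc_b, v10_b, v01_b = components(pos_b, neg_b)
    m, n = len(pos_a), len(neg_a)
    # Variance of the AUC difference from the positive- and negative-case components.
    var = ((cov(v10_a, v10_a) + cov(v10_b, v10_b) - 2 * cov(v10_a, v10_b)) / m
           + (cov(v01_a, v01_a) + cov(v01_b, v01_b) - 2 * cov(v01_a, v01_b)) / n)
    z = (auc_a - auc_b) / sqrt(var)
    p_value = erfc(abs(z) / sqrt(2))  # two-sided normal p-value
    return auc_a, auc_b, z, p_value


# Hypothetical scores from two fixed models on the same 4 positive and 4 negative cases.
pos_a, neg_a = [0.9, 0.8, 0.6, 0.4], [0.7, 0.5, 0.3, 0.2]
pos_b, neg_b = [0.8, 0.5, 0.7, 0.3], [0.6, 0.4, 0.45, 0.1]

auc_a, auc_b, z, p_value = delong_test(pos_a, neg_a, pos_b, neg_b)
print(f"AUC_A = {auc_a:.4f}, AUC_B = {auc_b:.4f}, z = {z:.3f}, p = {p_value:.3f}")
```

The scores at index i of `pos_a` and `pos_b` belong to the same case, and this pairing is what lets the covariance terms account for the correlation between the two AUC estimates. The sketch does not handle the degenerate case where the estimated variance is zero, for example when both models give identical rankings.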