Kevin K Dobbin, Richard M Simon.
Abstract
BACKGROUND: We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set. How does this proportion impact the mean squared error (MSE) of the prediction accuracy estimate?
Year: 2011 PMID: 21477282 PMCID: PMC3090739 DOI: 10.1186/1755-8794-4-31
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Figure 1. Conceptual Diagram. Diagram of the mean squared error decomposition.
Table 1. Optimal allocations of the samples to the training sets

| Optimal number to training set | | | |
|---|---|---|---|
| 170 | 70+ | 30+ | 20+ |
| 150 | 130 | 100 | 60+ |
| 10 | 150 | 120 | 80 |
| 70 | 80 | 30+ | 20+ |
| 10 | 80 | 70 | 40+ |
| 10 | 40 | 80 | 70 |
| 10 | 40 | 30+ | 20+ |
| 10 | 40 | 40 | 40 |
| 10 | 10 | 30 | 40 |
Entries in the table are t (Acc), where t is the optimal number for the training set and Acc is the average accuracy for a training set of size t; the total sample size is n. "DEG" is the number of independent differentially expressed genes. "Effect" is the standardized fold change for informative genes (difference in mean expression divided by standard deviation). Notation such as "50+" indicates that the MSE was flat, achieving a minimum at t = 50 and remaining at that minimum for t > 50. (Here, "flat" is defined as having a range of MSE values less than 0.0001.) Data were generated with dimension P = 22,000. Each table entry is based on 1,000 Monte Carlo simulations. The two classes have equal prevalence.
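The Monte Carlo procedure behind entries like these can be sketched in a few lines. The following is a minimal illustration, not the authors' code: it estimates the MSE of the split-sample accuracy estimate for one training fraction, under a toy two-class model with independent Gaussian genes shifted by a constant effect, using a nearest-centroid classifier. The dimension, classifier, and sample sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_mse(n=200, p=50, delta=0.5, frac=0.5, reps=200):
    """Monte Carlo estimate of the MSE of the split-sample accuracy estimate.

    Toy model (an assumption, not the paper's full setup): two equally
    likely classes, p independent Gaussian genes, class 1 shifted by
    `delta` in every gene; nearest-centroid classifier.
    """
    t = int(n * frac)                       # samples allocated to training
    errs = []
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = rng.integers(0, 2, size=n)
        X[y == 1] += delta
        Xtr, ytr, Xte, yte = X[:t], y[:t], X[t:], y[t:]
        mu0 = Xtr[ytr == 0].mean(axis=0)    # class centroids from training set
        mu1 = Xtr[ytr == 1].mean(axis=0)

        def predict(Z):
            d0 = ((Z - mu0) ** 2).sum(axis=1)
            d1 = ((Z - mu1) ** 2).sum(axis=1)
            return (d1 < d0).astype(int)

        acc_hat = (predict(Xte) == yte).mean()   # test-set accuracy estimate
        # approximate this classifier's true conditional accuracy
        # on a large fresh sample
        Xp = rng.normal(size=(2000, p))
        yp = rng.integers(0, 2, size=2000)
        Xp[yp == 1] += delta
        acc_true = (predict(Xp) == yp).mean()
        errs.append((acc_hat - acc_true) ** 2)
    return float(np.mean(errs))
```

Sweeping `frac` over a grid and plotting `simulate_mse` traces out the MSE curve whose minimizer is the optimal allocation.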
Figure 2. Example of MSE decomposition. Example figure showing the relative contributions of the three sources of variation to the mean squared error. This is a scenario from one entry in Table 1. Plots for all other scenarios associated with Table 1 are in [Additional file 1: Supplemental Table S1]. Here there is m = 1 informative gene, n = 200 total samples available for study, and the standardized fold change for the informative gene is 2δ/σ = 1.0.
Figure 3. Comparing two rules of thumb. Comparison of two common rules of thumb: 1/2 of the samples to the training set and 2/3rds of the samples to the training set. X-axis is the average accuracy (%) for training sets of size n. "Excess error" on the y-axis is the difference between the root mean squared error (RMSE) and the optimal RMSE. Each point corresponds to a cell in Table 1. Gray shading indicates scenarios where the mean accuracy for the full dataset size is below 60%.
Empirically estimated effects and covariance
| p | Bayes Acc. | n | Prev. | %t | Full data accuracy | Opt. vs. t = 2/3 | Opt. vs. t = 1/2 |
|---|---|---|---|---|---|---|---|
| 0.9 | 0.962 | 240 | 50% | 58.3 | 0.961 | 0.001 | 0.002 |
| 0.6 | 0.861 | 240 | 50% | 54.2 | 0.860 | 0.003 | 0.002 |
Simulation results based on empirical estimates of the covariance matrix and effect sizes. Columns are: p is the weight on a diagonal matrix, Bayes Acc. is the optimal accuracy possible, n is the total sample size, Prev. is the prevalence of the most prevalent group, %t is the optimal allocation proportion to training, Full data accuracy is the mean accuracy when n = 240, Opt. vs. t = 2/3 is the root mean squared difference (RMSD) between the optimal rule and the 2/3rds-to-training rule, and Opt. vs. t = 1/2 is the RMSD between the optimal rule and the 1/2-to-training rule. Sample covariance matrix S calculated from [12]. Effect sizes are estimated by the Empirical Bayes method of [10], with effect sizes shrunk to 80% of the empirical size. We followed methods similar to those previously proposed ([16], [17], [18]) to obtain non-singular covariance matrix estimates, namely Σ̂ = (1 − p)·S + p·diag(S), where diag(S) is the matrix whose off-diagonal entries are zero and whose diagonal entries are those of S. Bayes accuracy is the optimal accuracy for a linear classifier in the population, which is Φ(√(δ′Σ⁻¹δ)) (e.g., [13]), where δ is the vector of half-distances between the class means. The number of informative genes was selected to achieve realistic Bayes (optimal) accuracies; all other gene effects were set to zero. Genes with the largest standardized fold changes were selected as informative.
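The two quantities described in this note can be written compactly. The sketch below assumes the standard forms Σ̂ = (1 − p)·S + p·diag(S) for the non-singular covariance estimate and Φ(√(δ′Σ⁻¹δ)) for the Bayes accuracy of a linear classifier with half-distance vector δ; the function names are illustrative, not the authors' code.

```python
import numpy as np
from math import erf, sqrt

def shrunk_cov(S, p):
    """Non-singular covariance estimate: weight p on diag(S), 1 - p on S
    (assumed form of the shrinkage described in the table note)."""
    return (1 - p) * S + p * np.diag(np.diag(S))

def bayes_accuracy(delta, Sigma):
    """Optimal accuracy of a linear classifier, Phi(sqrt(delta' Sigma^-1 delta)),
    where delta is the vector of half-distances between the class means."""
    q = sqrt(float(delta @ np.linalg.solve(Sigma, delta)))
    return 0.5 * (1 + erf(q / sqrt(2)))   # standard normal CDF Phi(q)
```

With Sigma the identity and a single informative gene of half-distance 1, this reduces to Phi(1) ≈ 0.84, the familiar one-dimensional case.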
Applications to real datasets
| Dataset | n | Prevalence | %t | Full dataset accuracy | Optimal vs. t = 2/3 | Optimal vs. t = 1/2 |
|---|---|---|---|---|---|---|
| Rosenwald | 240 | 52% | 63% | 0.96 | 0.001 | 0.002 |
| Boer | 152 | 53% | 53% | 0.98 | 0.004 | 2e-4 |
| Golub | 72 | 65% | 56% | 0.95 | 0.002 | 0.004 |
| Sun | 131 | 62% | 31% | 0.83 | 0.022 | 0.008 |
| van't Veer | 117 | 67% | 26% | 0.78 | 0.004 | 0.001 |
Results of the nonparametric bootstrap with smoothing spline (or isotonic regression) learning-curve method [Additional file 1]. n is the total number of samples from the two classes, and "Prevalence" is the prevalence of the majority class. %t is the percent of samples allocated to the training set under optimal allocation, t/n · 100%. "Full dataset accuracy" is the estimated mean accuracy on the full dataset of size n. "Optimal vs. t = 2/3" is the difference between the root mean squared error for an optimal training set allocation and for the "2/3rds to training set" allocation rule. The rightmost column is for the "1/2 to training set" allocation rule. Classes for the datasets are: Germinal Center B-cell-like lymphoma versus other (Rosenwald et al., 2002), renal clear cell carcinoma primary tumor versus control normal kidney tissue (Boer et al., 2001), acute myelogenous leukemia versus acute lymphoblastic leukemia (Golub et al., 1999), glioblastoma versus oligodendroglioma (Sun et al., 2006), and grade 1/2 versus grade 3 breast cancer (van't Veer et al., 2002).
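The monotone learning-curve fit mentioned in this note can be computed with the pool-adjacent-violators algorithm. Below is a minimal plain-Python sketch of isotonic regression, a nondecreasing least-squares fit to accuracy-versus-training-size points; it is an illustration of the technique, not the authors' bootstrap procedure.

```python
def pava(y, w=None):
    """Pool-adjacent-violators: least-squares nondecreasing fit to y.

    y: observed values (e.g., mean accuracies at increasing training sizes).
    w: optional positive weights, one per observation.
    Returns the fitted nondecreasing sequence, same length as y.
    """
    w = [1.0] * len(y) if w is None else list(w)
    # Each block holds [weighted mean, total weight, count of pooled points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            y2, w2, c2 = blocks.pop()
            y1, w1, c1 = blocks.pop()
            tot = w1 + w2
            blocks.append([(y1 * w1 + y2 * w2) / tot, tot, c1 + c2])
    out = []
    for ym, _, c in blocks:
        out.extend([ym] * c)
    return out
```

Feeding bootstrap accuracy estimates at each candidate training size through such a fit smooths out sampling noise while preserving the expected "more training data never hurts on average" shape of a learning curve.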