Carina M Rubingh, Sabina Bijlsma, Eduard P P A Derks, Ivana Bobeldijk, Elwin R Verheij, Sunil Kochhar, Age K Smilde.
Abstract
Statistical model validation tools such as cross-validation, jack-knifing model parameters and permutation tests are meant to obtain an objective assessment of the performance and stability of a statistical model. However, little is known about the performance of these tools for megavariate data sets, in which, for instance, the number of variables exceeds 10 times the number of subjects. The performance is assessed for megavariate metabolomics data, but the conclusions also carry over to proteomics, transcriptomics and many other research areas. Partial least squares discriminant analysis (PLS-DA) models were built for several LC-MS lipidomic training data sets of various numbers of lean and obese subjects. The training data sets were compared on their modelling performance and their predictability using a 10-fold cross-validation, a permutation test, and test data sets. A wide range of cross-validation error rates was found (from 7.5% to 16.3% for the largest training set and from 0% to 60% for the smallest training set) and the error rate increased when the number of subjects decreased. The test error rates varied from 5% to 50%. The smaller the number of subjects compared to the number of variables, the less the outcome of validation tools such as cross-validation, jack-knifing model parameters and permutation tests can be trusted. The result depends crucially on the specific sample of subjects that is used for modelling. The validation tools cannot be used as a warning mechanism for problems due to sample size or to the representativeness of the sampling.
Keywords: PLS-DA; cross-validation; jack-knife; megavariate data; metabolomics; permutation test; predictability
Year: 2006 PMID: 24489531 PMCID: PMC3906710 DOI: 10.1007/s11306-006-0022-6
Source DB: PubMed Journal: Metabolomics ISSN: 1573-3882 Impact factor: 4.290
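The permutation test named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' procedure: a nearest-centroid classifier stands in for PLS-DA, the data are synthetic, and the function names (`nearest_centroid_predict`, `cv_error_rate`, `permutation_test`), the fold assignment, and the number of permutations are all assumptions made for illustration.

```python
import random


def nearest_centroid_predict(train_X, train_y, x):
    """Assign x to the class whose training centroid is closest (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cv_error_rate(X, y, folds):
    """Fraction of held-out samples that are misclassified across all folds."""
    errors = 0
    for fold in folds:
        held_out = set(fold)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        errors += sum(
            nearest_centroid_predict(train_X, train_y, X[i]) != y[i] for i in fold
        )
    return errors / len(X)


def permutation_test(X, y, n_folds=10, n_perm=200, seed=0):
    """Compare the observed CV error with CV errors obtained after shuffling labels."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]  # same folds for every run
    observed = cv_error_rate(X, y, folds)
    perm_errors = []
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # break the link between samples and class labels
        perm_errors.append(cv_error_rate(X, y_perm, folds))
    # Share of permuted runs that do at least as well as the real labels
    p_value = (1 + sum(e <= observed for e in perm_errors)) / (n_perm + 1)
    return observed, perm_errors, p_value


# Synthetic, well-separated two-class data (hypothetical, 10 samples per class)
data_rng = random.Random(1)
lean = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(10)]
obese = [[data_rng.gauss(3, 1) for _ in range(5)] for _ in range(10)]
X = lean + obese
y = ["lean"] * 10 + ["obese"] * 10

observed, perm_errors, p_value = permutation_test(X, y)
print(f"observed CV error: {observed:.2f}, permutation p-value: {p_value:.3f}")
```

A small p-value only says that the observed error is unlikely under shuffled labels for this particular sample. As the abstract stresses, it cannot by itself reveal that the sample is too small or unrepresentative.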
Figure 1. Illustration of the procedure that was followed to obtain the data sets.
Figure 2. Visual evaluation of the permutation test.
Figure 3. PLS-DA results for data50:50: Cross-validation error rate (a), Prediction based on cross-validation (b; o = lean, * = obese), Prediction based on fit (c; o = lean, * = obese), and Jack-knife (d).
Summary of PLS-DA results based on all training sets (ER = cross-validation error rate in %, LV = number of latent variables, P = evaluation of the permutation test with e = excellent, g = good, m = moderate, b = bad)
| Model | 40:40 ER | 40:40 LV | 40:40 P | 30:30 ER | 30:30 LV | 30:30 P | 20:20 ER | 20:20 LV | 20:20 P | 10:10 ER | 10:10 LV | 10:10 P | 05:05 ER | 05:05 LV | 05:05 P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 12.5 | 6 | e | 20.0 | 7 | e | 17.5 | 2 | g | 25.0 | 4 | m | 10.0 | 4 | m |
| 2 | 16.3 | 7 | e | 15.0 | 6 | e | 25.0 | 6 | g | 10.0 | 3 | g | 20.0 | 4 | m |
| 3 | 13.8 | 7 | e | 18.3 | 7 | e | 5.0 | 11 | g | 10.0 | 8 | g | 20.0 | 4 | b |
| 4 | 12.5 | 7 | e | 18.3 | 7 | e | 22.5 | 6 | g | 20.0 | 3 | m | 60.0 | 3 | b |
| 5 | 11.3 | 6 | e | 18.3 | 7 | e | 12.5 | 10 | g | 10.0 | 2 | g | 50.0 | 1 | b |
| 6 | 15.0 | 7 | e | 8.3 | 7 | e | 20.0 | 6 | e | 10.0 | 2 | g | 40.0 | 1 | b |
| 7 | 15.0 | 7 | e | 15.0 | 7 | e | 20.0 | 8 | g | 15.0 | 3 | m | 20.0 | 1 | m |
| 8 | 11.3 | 6 | e | 16.7 | 7 | e | 17.5 | 6 | g | 25.0 | 1 | m | 40.0 | 1 | b |
| 9 | 16.3 | 7 | e | 16.7 | 7 | e | 35.0 | 6 | g | 50.0 | 5 | b | 40.0 | 2 | b |
| 10 | 7.5 | 8 | e | 13.3 | 6 | e | 10.0 | 9 | g | 20.0 | 5 | m | 0.0 | 1 | g |
| Mean | 13.1 | 7.0 | | 16.0 | 7.0 | | 18.5 | 7.0 | | 19.5 | 4.0 | | 30.0 | 2.0 | |
| SD | 2.7 | | | 3.4 | | | 8.3 | | | 12.3 | | | 18.9 | | |
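The Mean and SD rows for the cross-validation error rate can be recomputed from the ten models per training set. The sketch below is only an arithmetic check on the tabulated values. The dictionary keys follow the data05:05 naming used in the figure captions, and the use of the sample standard deviation (n - 1 denominator) is an assumption, although it appears to agree with the reported SD row up to rounding.

```python
from statistics import mean, stdev

# Cross-validation error rates (%) for models 1-10, copied from the table above
cv_error = {
    "40:40": [12.5, 16.3, 13.8, 12.5, 11.3, 15.0, 15.0, 11.3, 16.3, 7.5],
    "30:30": [20.0, 15.0, 18.3, 18.3, 18.3, 8.3, 15.0, 16.7, 16.7, 13.3],
    "20:20": [17.5, 25.0, 5.0, 22.5, 12.5, 20.0, 20.0, 17.5, 35.0, 10.0],
    "10:10": [25.0, 10.0, 10.0, 20.0, 10.0, 10.0, 15.0, 25.0, 50.0, 20.0],
    "05:05": [10.0, 20.0, 20.0, 60.0, 50.0, 40.0, 20.0, 40.0, 40.0, 0.0],
}

# Mean and sample standard deviation per training set
summary = {name: (mean(values), stdev(values)) for name, values in cv_error.items()}

for name, (m, s) in summary.items():
    print(f"{name}: mean ER = {m:.2f}%, SD = {s:.2f}")
```

Both the mean error rate and its spread across the ten selections grow as the training sets shrink, which is the pattern the abstract describes.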
Figure 4. PLS-DA results for the 4th selection (A) and the 10th selection (B) of data05:05: Cross-validation error rate (a), Prediction based on cross-validation (b; o = lean, * = obese), Prediction based on fit (c; o = lean, * = obese), and Jack-knife (d).
Summary of PLS-DA results based on the projection of all test data sets (number of LVs based on corresponding training data sets).
| Model | Train 40:40, test 10:10 | Train 40:40, test 10:10 | Train 30:30, test 20:20 | Train 30:30, test 10:10 | Train 20:20, test 30:30 | Train 20:20, test 10:10 | Train 10:10, test 40:40 | Train 10:10, test 10:10 | Train 05:05, test 45:45 | Train 05:05, test 10:10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 15.0 | 15.0 | 35.0 | 45.0 | 30.0 | 35.0 | 40.0 | 45.0 | 50.0 | 50.0 |
| 2 | 10.0 | 10.0 | 30.0 | 30.0 | 41.7 | 35.0 | 30.0 | 10.0 | 38.9 | 40.0 |
| 3 | 30.0 | 30.0 | 27.5 | 25.0 | 50.0 | 50.0 | 46.3 | 55.0 | 50.0 | 50.0 |
| 4 | 30.0 | 30.0 | 42.5 | 40.0 | 18.3 | 10.0 | 48.8 | 45.0 | 46.7 | 50.0 |
| 5 | 30.0 | 30.0 | 17.5 | 10.0 | 13.3 | 15.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| 6 | 25.0 | 25.0 | 22.5 | 30.0 | 48.3 | 50.0 | 43.8 | 45.0 | 50.0 | 50.0 |
| 7 | 15.0 | 15.0 | 17.5 | 10.0 | 33.3 | 40.0 | 38.8 | 55.0 | 50.0 | 50.0 |
| 8 | 20.0 | 20.0 | 20.0 | 25.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| 9 | 5.0 | 5.0 | 35.0 | 30.0 | 35.0 | 35.0 | 50.0 | 50.0 | 37.8 | 35.0 |
| 10 | 25.0 | 25.0 | 35.0 | 40.0 | 30.0 | 20.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| Mean | 20.5 | 20.5 | 28.3 | 28.5 | 35.0 | 34.0 | 44.8 | 45.5 | 47.3 | 47.5 |
| SD | 9.0 | 9.0 | 8.7 | 11.8 | 12.8 | 14.7 | 6.7 | 13.0 | 4.9 | 5.4 |
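To make the comparison between the two tables concrete, the sketch below subtracts the mean cross-validation error rate of each training set from the mean of its first test-set column. It assumes that the test-set table reports error rates in %, in line with the test error rates quoted in the abstract. The variable names are illustrative only.

```python
# Mean cross-validation error rate (%) per training set, from the first table
mean_cv_error = {"40:40": 13.1, "30:30": 16.0, "20:20": 18.5, "10:10": 19.5, "05:05": 30.0}

# Mean test error (%) per training set, first test-set column of the second table
# (assumed to be an error rate in %)
mean_test_error = {"40:40": 20.5, "30:30": 28.3, "20:20": 35.0, "10:10": 44.8, "05:05": 47.3}

# Positive gap = cross-validation was more optimistic than the test set
gap = {
    name: round(mean_test_error[name] - mean_cv_error[name], 1)
    for name in mean_cv_error
}

for name, g in gap.items():
    print(f"{name}: CV {mean_cv_error[name]:.1f}%, test {mean_test_error[name]:.1f}%, gap {g:+.1f}")
```

Under that assumption, the mean test error exceeds the mean cross-validation error for every training-set size, and for the two smallest training sets it approaches 50%, which is the chance level for a balanced two-class problem.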