M Slawski, M Daumer, A-L Boulesteix.
Abstract
BACKGROUND: For the last eight years, microarray-based classification has been a major topic in statistics, bioinformatics and biomedicine research. Traditional methods often yield unsatisfactory results or may even be inapplicable in the so-called "p >> n" setting, where the number of predictors p by far exceeds the number of observations n, hence the term "ill-posed problem". Careful model selection and evaluation satisfying accepted good-practice standards is a very complex task for statisticians without experience in this area or for scientists with limited statistical background. The multiplicity of available methods for class prediction based on high-dimensional data is an additional practical challenge for inexperienced researchers.
Year: 2008 PMID: 18925941 PMCID: PMC2646186 DOI: 10.1186/1471-2105-9-439
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Evaluation schemes. The top panel illustrates the splitting into learning and test data sets. The whole sample S is split into a learning set ℒ and a test set 𝒯. The classifier f(·) is constructed using the learning set ℒ and subsequently applied to the test set 𝒯. The bottom panel displays schematically k-fold cross-validation (left), Monte-Carlo cross-validation with n = 5 and ntrain = 3 (middle), and bootstrap sampling (with replacement) with n = 5 and ntrain = 3 (right).
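The three resampling schemes in Figure 1 can be sketched in a few lines. The following is an illustrative Python sketch (CMA itself is implemented in R; the function names kfold, monte_carlo_cv and bootstrap are hypothetical, chosen only for this example):

```python
import random

def kfold(n, k, seed=0):
    """k-fold CV: every observation serves exactly once as test data."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [(sorted(set(idx) - set(idx[i::k])), sorted(idx[i::k])) for i in range(k)]

def monte_carlo_cv(n, ntrain, niter, seed=0):
    """Monte-Carlo CV: repeatedly draw ntrain indices without replacement;
    the remaining n - ntrain observations form the test set."""
    rng = random.Random(seed)
    out = []
    for _ in range(niter):
        learn = sorted(rng.sample(range(n), ntrain))
        out.append((learn, [i for i in range(n) if i not in learn]))
    return out

def bootstrap(n, ntrain, niter, seed=0):
    """Bootstrap: draw ntrain indices *with* replacement; observations
    never drawn (out-of-bag) form the test set."""
    rng = random.Random(seed)
    out = []
    for _ in range(niter):
        learn = sorted(rng.randrange(n) for _ in range(ntrain))
        out.append((learn, [i for i in range(n) if i not in set(learn)]))
    return out
```

With n = 5 and ntrain = 3 as in the figure, monte_carlo_cv(5, 3, 10) and bootstrap(5, 3, 10) reproduce the middle and right schemes; note that a bootstrap learning set may contain duplicates, whereas a Monte-Carlo learning set may not.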
Figure 2. Hyperparameter tuning. Schematic display of nested cross-validation. In the procedure displayed above, k-fold cross-validation is used for evaluation purposes, whereas tuning is performed within each iteration using an inner (l-fold) cross-validation.
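The nested procedure of Figure 2 can be outlined as follows. This is a language-agnostic sketch in Python rather than CMA's R code; fit and error are hypothetical user-supplied callbacks (train a model for a given hyperparameter value; score a model on held-out data):

```python
import random

def folds(n, k, seed=0):
    """Partition indices 0..n-1 into k disjoint (learning, test) splits."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [(sorted(set(idx) - set(idx[i::k])), sorted(idx[i::k])) for i in range(k)]

def nested_cv(X, y, grid, fit, error, k=5, l=3):
    """Outer k-fold loop estimates error; an inner l-fold loop run on each
    outer learning set picks the hyperparameter, so the outer test fold is
    never touched during tuning."""
    outer = []
    for learn, test in folds(len(y), k):
        best_p, best_e = None, float("inf")
        for p in grid:
            e = 0.0
            for tr, te in folds(len(learn), l):
                m = fit([X[learn[i]] for i in tr], [y[learn[i]] for i in tr], p)
                e += error(m, [X[learn[i]] for i in te], [y[learn[i]] for i in te])
            if e < best_e:
                best_p, best_e = p, e
        m = fit([X[i] for i in learn], [y[i] for i in learn], best_p)
        outer.append(error(m, [X[i] for i in test], [y[i] for i in test]))
    return sum(outer) / len(outer)
```

The essential point the figure makes is visible in the code: the hyperparameter chosen for an outer iteration depends only on that iteration's learning set, which keeps the outer error estimate unbiased by tuning.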
Overview of the classification methods in CMA.
| Method | CMA function | Package |
| Componentwise boosting | compBoostCMA | CMA |
| Diagonal discriminant analysis | dldaCMA | CMA |
| Elastic net | ElasticNetCMA | 'glmpath' |
| Fisher's discriminant analysis | fdaCMA | CMA |
| Flexible discriminant analysis | flexdaCMA | 'mgcv' |
| Tree-based boosting | gbmCMA | 'gbm' |
| Nearest neighbours | knnCMA | 'class' |
| Linear discriminant analysis | ldaCMA | 'MASS' |
| Lasso | LassoCMA | 'glmpath' |
| Feed-forward neural networks | nnetCMA | 'nnet' |
| Probabilistic nearest neighbours | pknnCMA | CMA |
| Penalized logistic regression | plrCMA | CMA |
| Partial Least Squares + linear discriminant analysis | pls_ldaCMA | 'plsgenomics' |
| Partial Least Squares + logistic regression | pls_lrCMA | 'plsgenomics' |
| Partial Least Squares + random forest | pls_rfCMA | 'plsgenomics' |
| Probabilistic neural networks | pnnCMA | CMA |
| Quadratic discriminant analysis | qdaCMA | 'MASS' |
| Random forest | rfCMA | 'randomForest' |
| Shrunken centroids discriminant analysis (PAM) | scdaCMA | CMA |
| Shrinkage discriminant analysis | shrinkldaCMA | CMA |
| Support vector machines | svmCMA | 'e1071' |
The first column gives the method name, and the second the name of the corresponding function in the CMA package. For each classifier, CMA either uses its own code or borrows code from the package given in the third column.
Overview of hyperparameter tuning in CMA.
| Function | Hyperparameter | Range | Meaning |
| gbmCMA | n.trees | 1, 2, ... | number of base learners (decision trees) |
| LassoCMA | norm.fraction | [0; 1] | relative bound imposed on the ℓ1 norm of the weight vector |
| knnCMA | k | 1, 2, ..., |ℒ| | number of nearest neighbours |
| nnetCMA | size | 1, 2, ... | number of units in the hidden layer |
| scdaCMA | delta | ℝ+ | shrinkage towards zero applied to the centroids |
| svmCMA | cost | ℝ+ | penalty for violations of the margin of the hyperplane |
| svmCMA | gamma | ℝ+ | width of the Gaussian kernel (if used) |
The first column gives the function name, the second the name of the hyperparameter in the CMA package, the third the range of the parameter and the fourth its meaning.
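To make the role of delta in scdaCMA (PAM / shrunken centroids) concrete: each class centroid is moved toward the overall centroid by soft-thresholding, and genes whose difference is wiped out no longer contribute to the classification rule. A minimal sketch of that operator (the actual PAM method additionally standardizes each genewise difference by a within-class standard-deviation term, which is omitted here):

```python
def soft_threshold(d, delta):
    """Shrink d toward zero by delta; values with |d| <= delta become 0."""
    sign = 1.0 if d >= 0 else -1.0
    return sign * max(abs(d) - delta, 0.0)

def shrink_centroid(class_centroid, overall_centroid, delta):
    """Soft-threshold each genewise difference between a class centroid
    and the overall centroid; unchanged genes (difference shrunk to 0)
    are effectively deselected."""
    return [oc + soft_threshold(cc - oc, delta)
            for cc, oc in zip(class_centroid, overall_centroid)]
```

Larger delta values thus select fewer genes, which is why delta is the natural tuning parameter for this classifier.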
Figure 3. Classification accuracy for Khan's SRBCT data. Boxplots representing the misclassification rate (top), the Brier score (middle), and the average probability of correct classification (bottom) for Khan's SRBCT data, using seven classifiers: diagonal linear discriminant analysis, linear discriminant analysis, quadratic discriminant analysis, shrunken centroids discriminant analysis (PAM), PLS followed by linear discriminant analysis, SVM without tuning, and SVM with tuning.
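The three performance measures shown in Figure 3 have simple definitions. The following Python functions are an illustrative sketch of these standard measures, not the CMA implementation (probs is assumed to be one vector of predicted class probabilities per test observation):

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of test observations assigned to the wrong class."""
    return sum(int(t != p) for t, p in zip(y_true, y_pred)) / len(y_true)

def brier_score(y_true, probs, n_classes):
    """Multiclass Brier score: mean squared distance between the vector of
    predicted class probabilities and the 0/1 indicator of the true class.
    Lower is better; a perfect probabilistic classifier scores 0."""
    s = 0.0
    for t, p in zip(y_true, probs):
        s += sum((p[c] - (1.0 if c == t else 0.0)) ** 2 for c in range(n_classes))
    return s / len(y_true)

def average_probability(y_true, probs):
    """Average probability the classifier assigns to the true class.
    Higher is better; 1.0 means full confidence in the correct class."""
    return sum(p[t] for t, p in zip(y_true, probs)) / len(y_true)
```

Unlike the misclassification rate, the last two measures use the full probability output, so they can distinguish a confident correct classifier from one that is barely right.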
Running times.
| Method | Running time per learning set |
| Multiclass F-test | 3.1 s |
| Kruskal-Wallis test | 3.5 s |
| Limma* | 0.16 s |
| Random forest†,* | 4.1 s |

| Method | # variables | Running time per learning set |
| DLDA | all (2308) | 2.7 s |
| LDA | 10 | 1.4 s |
| QDA | 2 | 1.0 s |
| Partial Least Squares | all (2308) | 4.2 s |
| Shrunken centroids | all (2308) | 2.8 s |
| SVM* | all (2308) | 88 s |
Running times of the variable selection methods (top) and classification methods (bottom) used in the real-life example. †: 500 bootstrap trees per run.
Top 10 gene indices selected by each of the four ranking criteria: F-test (f.test), Kruskal-Wallis test (kru.test), limma (lim.test) and random forest variable importance (rf.imp).
| Rank | f.test | kru.test | lim.test | rf.imp |
| top.1 | 1954 | 1194 | 22 | 545 |
| top.2 | 1389 | 545 | 26 | 1954 |
| top.3 | 1003 | 1389 | 723 | 2050 |
| top.4 | 129 | 2050 | 1897 | 1003 |
| top.5 | 1955 | 1954 | 148 | 246 |
| top.6 | 246 | 246 | 428 | 1389 |
| top.7 | 1194 | 1003 | 1065 | 187 |
| top.8 | 2050 | 554 | 11 | 554 |
| top.9 | 2046 | 1708 | 735 | 2046 |
| top.10 | 545 | 1158 | 62 | 1896 |
Frequency with which each gene index appears among the four top-10 lists above.
| Gene index | 11 | 22 | 26 | 62 | 129 | 148 | 187 | 246 | 428 | 545 | 554 | 723 | 735 | 1003 | 1065 | 1158 |
| Frequency | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 3 | 2 | 1 | 1 | 3 | 1 | 1 |
| Gene index | 1194 | 1389 | 1708 | 1896 | 1897 | 1954 | 1955 | 2046 | 2050 |
| Frequency | 2 | 3 | 1 | 1 | 1 | 3 | 1 | 2 | 3 |
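The frequency counts above can be reproduced directly from the four top-10 lists. A short Python sketch (the data are copied from the ranking table; rankings and stable are names introduced only for this example):

```python
from collections import Counter

# Top-10 gene indices per ranking criterion, copied from the table above.
rankings = {
    "f.test":   [1954, 1389, 1003, 129, 1955, 246, 1194, 2050, 2046, 545],
    "kru.test": [1194, 545, 1389, 2050, 1954, 246, 1003, 554, 1708, 1158],
    "lim.test": [22, 26, 723, 1897, 148, 428, 1065, 11, 735, 62],
    "rf.imp":   [545, 1954, 2050, 1003, 246, 1389, 187, 554, 2046, 1896],
}

# Count how often each gene appears across the four top-10 lists.
freq = Counter(g for genes in rankings.values() for g in genes)

# Genes selected by three of the four criteria; note that none of the
# limma top-10 genes is shared with any other criterion.
stable = sorted(g for g, c in freq.items() if c >= 3)
```

Six genes (indices 246, 545, 1003, 1389, 1954 and 2050) are picked by three of the four criteria, illustrating both the partial agreement among the rank-based methods and the divergence of limma.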
| Method | misclassification | brier.score | average.probability |
| DLDA | 0.06807692 | 0.13420913 | 0.9310332 |
| LDA | 0.04269231 | 0.07254283 | 0.9556106 |
| QDA | 0.24000000 | 0.34247861 | 0.7362778 |
| scDA | 0.01910256 | 0.03264544 | 0.9754012 |
| pls_lda | 0.01743590 | 0.02608426 | 0.9819003 |
| svm | 0.06076923 | 0.12077855 | 0.7872984 |
| svm2 | 0.04461538 | 0.10296755 | 0.8014135 |
Here svm denotes the support vector machine without hyperparameter tuning and svm2 the tuned version (cf. Figure 3).