Tahir Mehmood, Harald Martens, Solve Sæbø, Jonas Warringer, Lars Snipen.
BACKGROUND: In genomics, a commonly encountered problem is to extract a subset of variables out of a large set of explanatory variables associated with one or several quantitative or qualitative response variables. An example is to identify associations between codon-usage and phylogeny based definitions of taxonomic groups at different taxonomic levels. Maximum understandability with the smallest number of selected variables, consistency of the selected variables, as well as variation of model performance on test data, are issues to be addressed for such problems.Entities:
Year: 2011 PMID: 22142365 PMCID: PMC3287970 DOI: 10.1186/1748-7188-6-27
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Figure 1. Flow chart. The flow chart illustrates the proposed algorithm for variable selection.
Figure 2. An overview of the testing/training procedure used in this study. The rectangles illustrate the predictor matrix. At level 1 we split the data into a test set and a training set (25/75) to be used by all four methods listed on the right. This was repeated 100 times. Inside our suggested method, the stepwise elimination, there are two levels of cross-validation: first, a 10-fold cross-validation was used to optimize the selection parameters f and d, and at level 3 leave-one-out cross-validation was used to optimize the regularized CPPLS method.
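The three-level scheme described in this caption can be sketched in code. The helpers below are an illustrative assumption, not the authors' implementation: they only generate the index splits for level 1 (100 repeated 75/25 splits) and level 2 (10-fold cross-validation on the training set); the model slot that the paper fills with the regularized CPPLS classifier, tuned at level 3 by leave-one-out cross-validation, is omitted.

```python
# Sketch (an assumption, not the paper's code) of the nested evaluation
# scheme in Figure 2, using only NumPy.
import numpy as np

def outer_splits(n_samples, n_repeats=100, test_frac=0.25, seed=0):
    """Yield (train_idx, test_idx) for repeated random 75/25 splits (level 1)."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_frac * n_samples))
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)
        yield perm[n_test:], perm[:n_test]

def kfold_indices(n_samples, k=10, seed=0):
    """Partition indices 0..n_samples-1 into k folds (level-2 10-fold CV)."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

# Example: split sizes for a hypothetical data set of 120 genomes.
train, test = next(outer_splits(120))
folds = kfold_indices(len(train))
```

Each of the 100 outer splits would then run the full selection procedure on its training part only, so that the reported performance on the 25% test part is untouched by any tuning.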
Figure 3. A typical elimination. A typical elimination is shown based on the data for phylum Actinobacteria. Each dot in the figure indicates one iteration. The procedure starts on the left-hand side, with the full model. After some iterations performance (P), which reflects the percentage of correctly classified samples, has increased and reaches a maximum. Further elimination reduces performance, but only marginally. When elimination becomes too severe, the performance drops substantially. Finally, the selected model is the smallest model whose performance is not significantly worse than the maximum.
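The stopping rule in this caption can be stated compactly. The sketch below is a simplification: a fixed performance tolerance stands in for the paper's "not significantly worse than the maximum" test, which is a statistical comparison.

```python
# Sketch of the model-selection rule in Figure 3. A fixed tolerance `tol`
# approximates "not significantly worse"; the paper uses a significance test.
def select_smallest_model(trace, tol=0.01):
    """trace: list of (n_variables, performance) pairs recorded during
    elimination, from the full model downward. Returns the pair with the
    fewest variables whose performance is within `tol` of the maximum."""
    best = max(p for _, p in trace)
    candidates = [(n, p) for n, p in trace if p >= best - tol]
    return min(candidates, key=lambda pair: pair[0])

# Synthetic elimination trace: performance rises, plateaus, then collapses.
trace = [(4160, 0.85), (2000, 0.90), (500, 0.93), (100, 0.925), (20, 0.70)]
selected = select_smallest_model(trace)  # → (100, 0.925)
```

With the default tolerance the 100-variable model is preferred over the 500-variable optimum, mirroring the "smallest model not significantly worse than the maximum" choice in the figure.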
Figure 4. The distribution of selected variables. The distribution of the number of variables selected by the optimum model and the selected model for loading weights, VIP and regression coefficients is presented in the upper panels, while the lower panels display the same for Forward, Lasso and ST-PLS. The horizontal axes show the number of retained variables as a percentage of the full model (with 4160 variables). All results are based on 100 random samples from the full data set, where 75% of the objects are used as training data and 25% as test data in each sample.
Figure 5. Performance comparison. The left panels present the distribution of performance in the full model, optimum model and selected model on test and training data sets for loading weights, VIP and regression coefficients, while the right panels display the same for Forward, Lasso and ST-PLS. All results are based on 100 random samples from the full data set, where 75% of the objects are used as training data and 25% as test data in each sample.
Figure 6. Selectivity score. The selectivity score is sorted in descending order for each criterion (loading weights, regression-coefficient significance and VIP) in the left panels, while the right panels display the same for Forward, Lasso and ST-PLS. Only the first 500 values (out of 4160) are shown.
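A common way to compute such a stability score is sketched below, under the assumption that the selectivity score reflects how often each variable is chosen across the 100 random training sets; the paper's exact definition may weight the selections differently.

```python
# Sketch (an assumed definition) of a selectivity score as per-variable
# selection frequency over repeated runs, sorted descending as in Figure 6.
import numpy as np

def selectivity_score(selections, n_variables):
    """selections: one array of selected variable indices per run.
    Returns (order, sorted_freq): variable indices sorted by decreasing
    selection frequency, and the frequencies in that order."""
    counts = np.zeros(n_variables)
    for sel in selections:
        counts[np.asarray(sel)] += 1
    freq = counts / len(selections)
    order = np.argsort(-freq)
    return order, freq[order]

# Toy example: variable 0 is selected in all three runs.
order, freq = selectivity_score([[0, 1], [0, 2], [0, 1]], n_variables=4)
```

Variables that survive the elimination in most of the 100 resampled training sets then appear at the left end of the curves in Figure 6.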
Selectivity score based selected codons

| Phylum | Gen. | Perf. | Positive and negative impact codons |
|---|---|---|---|
|  | 42 | 90.6 |  |
|  | 16 | 96.3 |  |
|  | 16 | 96.5 |  |
|  | 17 | 97.1 |  |
|  | 31 | 93.3 |  |
|  | 89 | 80.3 |  |
|  | 70 | 85.9 |  |
|  | 42 | 90.8 |  |
|  | 92 | 81.2 |  |
|  | 18 | 96.0 |  |
|  | 12 | 96.9 |  |
Results obtained for each phylum using the VIP criterion. Gen. is the number of genomes for that phylum in the data set; Perf. is the average test-set performance, i.e. the percentage of correctly classified samples, when classifying the corresponding phylum. This is synonymous with the true positive rate. Positive impact variables are variables with a selectivity score above 0.01 and positive regression coefficients, while negative impact variables are defined analogously, with negative regression coefficients.