Mari van Reenen, Carolus J Reinecke, Johan A Westerhuis, J Hendrik Venter.
Abstract
BACKGROUND: Metabolomics datasets are often high-dimensional, though only a limited number of variables are expected to be informative given a specific research question. The important task of selecting informative variables can therefore become complex. In this paper we look at discriminating between two groups. Two tasks need to be performed: (i) finding variables which differ between the two groups; and (ii) determining how the selected variables can be used to classify new subjects. We introduce an approach using minimum classification error rates as test statistics to find discriminatory and therefore informative variables. The thresholds resulting in the minimum error rates can be used to classify new subjects. This approach transforms error rates into p-values and is referred to as ERp.
Year: 2016 PMID: 26763892 PMCID: PMC4712617 DOI: 10.1186/s12859-015-0867-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Algorithm to simulate the null cumulative distribution functions
| • Generate N0 + N1 values from a continuous distribution; under the null hypothesis of no group difference, the distribution of the minimum error rate depends only on the group sizes and weights |
| • Assign the first N0 values to group 0 and the remaining N1 values to group 1 |
| • Minimize the weighted error rate over all thresholds and record the minimum |
| • Repeat these steps a large number of times |
| • If the number of repetitions is large, the empirical distribution of the recorded minima approximates the null CDF |
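The simulation steps above can be sketched in Python. This is a minimal sketch, not the authors' implementation: the function names, the midpoint threshold grid, and the standard-normal draws are illustrative assumptions (for continuous data the null distribution of the minimum error rate depends only on the group sizes and weights, so the choice of generating distribution should not matter).

```python
import numpy as np

def min_error_rate(values, labels, w0=0.5, w1=0.5):
    """Minimum weighted classification error over all thresholds.

    For each candidate threshold (midpoints of sorted values), the "up"
    rule classifies values above the threshold into group 1 and the
    "down" rule classifies values below it into group 1; the smaller of
    the two weighted error rates is kept.
    """
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    n0, n1 = np.sum(y == 0), np.sum(y == 1)
    best = min(w0, w1)  # degenerate threshold: everyone in one group
    for t in (v[:-1] + v[1:]) / 2.0:
        above = y[v > t]
        # "up" rule errors: group-0 subjects above t, group-1 subjects below t
        e_up = (w0 * np.sum(above == 0) / n0
                + w1 * (n1 - np.sum(above == 1)) / n1)
        best = min(best, e_up, w0 + w1 - e_up)  # "down" rule mirrors "up"
    return best

def simulate_null_cdf(n0=21, n1=12, reps=5000, w0=0.5, w1=0.5, seed=0):
    """Empirical null CDF of the minimum error rate under random data."""
    rng = np.random.default_rng(seed)
    labels = np.r_[np.zeros(n0, int), np.ones(n1, int)]
    stats = np.sort([min_error_rate(rng.standard_normal(n0 + n1),
                                    labels, w0, w1)
                     for _ in range(reps)])
    # CDF evaluated by counting recorded minima at or below e
    return lambda e: np.searchsorted(stats, e, side="right") / reps
```

The returned function plays the role of the null CDF: evaluating it at an observed minimum error rate gives a simulated p-value for that variable.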
Fig. 1 The null cumulative distribution functions. The graphs show the log10-transformed null CDFs of (black lines) and (red line), for group sizes N0 = 21 and N1 = 12 using weight sets 1 (solid lines) and 2 (dashed lines). The dark (α = 0.001) and light (α = 0.05) blue lines represent points of reference discussed in the text
Fig. 2 Simulation comparison of the different error rate test statistics. Figures a (weight set 1) and b (weight set 2) depict the average p-values associated with (red lines), (blue lines) and the MW test statistic (black lines). Figures c (weight set 1) and d (weight set 2) depict the proportions of repetitions in which the p-values of (blue lines) and MW (black lines) were below the p-values of . The dotted red line represents the 50 % cut-off. The dashed blue lines represent points of reference as discussed in the text
Significant variables based on weight set 1 and 2
The first column provides the variable names ordered according to increasing error rates, which are shown in the second column. The third column provides the threshold estimates which can be used to classify new subjects by employing the "up" or "down" rule as indicated by the direction in the fourth column. The fifth column provides the p-values associated with the error rates, expressed as percentages. The significance of these values can be determined through comparison to the BH adjusted critical level. The last three columns provide these levels for three different FWERs, namely 1, 5 and 10 %. The red, green and purple blocks encapsulate the variables that were significantly shifted at 1, 5 and 10 % FWER, respectively. For groups of variables with the same error rates and therefore the same p-values, the most conservative BH level is applied. These groups are indicated in alternating blocks of white and grey
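The comparison of p-values to BH adjusted critical levels can be sketched as follows; `bh_critical_levels` and `bh_significant` are hypothetical helper names, and the code implements the standard Benjamini-Hochberg step-up rule rather than a transcription of the table's exact layout.

```python
def bh_critical_levels(p_values, alpha):
    """Benjamini-Hochberg critical levels alpha*k/m for ranks k = 1..m."""
    m = len(p_values)
    return [alpha * (k + 1) / m for k in range(m)]

def bh_significant(p_values, alpha):
    """Return one significance flag per variable (in input order).

    Step-up rule: every p-value up to and including the largest one that
    falls at or below its rank's critical level is declared significant.
    """
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    crit = bh_critical_levels(p_values, alpha)
    cutoff = -1
    for rank, i in enumerate(order):
        if p_values[i] <= crit[rank]:
            cutoff = rank
    flags = [False] * len(p_values)
    for rank in range(cutoff + 1):
        flags[order[rank]] = True
    return flags
```

For tied p-values, applying the smallest (most conservative) critical level within a tied group, as the table does, only makes the rule stricter; the sketch above treats ties by rank.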
Fig. 3 CART variable importance. The CART method provides a measure of importance for each variable. These values (grey bars), along with the normalized values (blue bars), are depicted here in the form of a bar chart for each variable (y-axis) in order of decreasing importance. Figures a and b represent weight sets 1 and 2 respectively. The vertical black and red dashed lines represent values at which large drops in Normalized Importance occur and therefore possible cut-off choices for variable selection. The dashed red lines represent the cut-offs chosen for comparison with ERp, while the dashed black lines represent alternative choices
Group classification and outlier detection using significant variables based on weight set 1 and 2
The body of the table shows the classification result due to each significantly shifted variable for each subject, where 0 indicates the subject was classified into the control group and 1 indicates the subject was classified into the experimental group. Misclassifications are indicated in red. The last three rows (i) provide the final classification based on the majority vote; (ii) flag subjects that were misclassified; and (iii) flag potential outlying subjects based on the number of variables that misclassified them compared to the remaining subjects
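The last three rows of the table can be sketched as follows; `classify_and_flag` and the factor-of-two outlier rule are illustrative assumptions, not the paper's exact criterion, and vote ties fall to the control group here.

```python
def classify_and_flag(votes, true_labels, outlier_factor=2.0):
    """votes: dict subject -> list of 0/1 calls, one per significant variable.

    Returns, per subject: (final class by majority vote, misclassified
    flag, outlier flag). A subject is flagged as a potential outlier when
    the number of variables misclassifying it is well above the average
    across subjects; the factor of two is an illustrative choice.
    """
    n_wrong = {s: sum(v != true_labels[s] for v in calls)
               for s, calls in votes.items()}
    mean_wrong = sum(n_wrong.values()) / len(n_wrong)
    out = {}
    for s, calls in votes.items():
        # Majority vote over per-variable calls; ties go to group 0
        final = 1 if sum(calls) * 2 > len(calls) else 0
        out[s] = (final,
                  final != true_labels[s],
                  n_wrong[s] > outlier_factor * mean_wrong)
    return out
```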