Sandra L. Taylor, Kyoungmi Kim.
Abstract
With technological advances now allowing measurement of thousands of genes, proteins and metabolites, researchers are using this information to develop diagnostic and prognostic tests and discern the biological pathways underlying diseases. Often, an investigator's objective is to develop a classification rule that predicts group membership of unknown samples based on a small set of features and that could ultimately be used in a clinical setting. While common classification methods such as random forest and support vector machines are effective at separating groups, they do not directly translate into a clinically applicable classification rule based on a small number of features. We present a simple feature selection and classification method for biomarker detection that is intuitively understandable and can be directly extended for application to a clinical setting. We first use a jackknife procedure to identify important features and then, for classification, we use voting classifiers, which are simple and easy to implement. We compared our method to random forest and support vector machines using three benchmark cancer 'omics data sets with different characteristics. We found our jackknife procedure and voting classifier to perform comparably to these two methods in terms of accuracy. Further, the jackknife procedure yielded stable feature sets. Voting classifiers, in combination with a robust feature selection method such as our jackknife procedure, offer an effective, simple and intuitive approach to feature selection and classification with a clear extension to clinical applications.
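The jackknife feature-selection step described in the abstract can be sketched as follows. This is a minimal illustration, assuming leave-one-out jackknife samples and Welch two-sample t-statistics; the function name and implementation details are hypothetical, not the authors' code.

```python
import numpy as np

def jackknife_feature_ranks(X, y, top_frac=0.01):
    """Rank features by how often they fall among the top `top_frac`
    fraction of features (largest absolute two-sample t-statistic)
    across leave-one-out jackknife samples.

    X: (n_samples, n_features) matrix; y: binary labels in {0, 1}.
    Hypothetical helper; not the authors' code.
    """
    n, p = X.shape
    k = max(1, int(round(top_frac * p)))
    counts = np.zeros(p)
    for left_out in range(n):
        keep = np.arange(n) != left_out          # drop one sample
        Xs, ys = X[keep], y[keep]
        m1, m0 = Xs[ys == 1].mean(0), Xs[ys == 0].mean(0)
        v1, v0 = Xs[ys == 1].var(0, ddof=1), Xs[ys == 0].var(0, ddof=1)
        n1, n0 = (ys == 1).sum(), (ys == 0).sum()
        t = (m1 - m0) / np.sqrt(v1 / n1 + v0 / n0 + 1e-12)
        counts[np.argsort(-np.abs(t))[:k]] += 1  # tally top-k appearances
    order = np.argsort(-counts)                  # most frequently selected first
    return order, counts
```

Features that appear in the top fraction in every jackknife sample are the most stable candidates for the final classifier.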
Keywords: classification; feature selection; gene expression; jackknife; voting classifier
Year: 2011 PMID: 21584263 PMCID: PMC3091410 DOI: 10.4137/CIN.S7111
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Characteristics of data sets.
| Data set | Data type | No. of features | Training set: cases | Training set: controls | Independent validation set |
| Leukemia | Gene expression | 3,051 | 11 | 27 | No |
| Lung cancer | Gene expression | 12,533 | 16 | 16 | 149 (15 controls, 134 cases) |
| Prostate cancer | Proteomics | 15,154 | 30 | 30 | 262 (223 controls, 39 cases) |
Note: Patients with acute myeloid leukemia were considered “cases” and those with acute lymphoblastic leukemia were used as “controls”.
Figure 1. Multiple random validation results for voting classifiers. Mean accuracy for voting classifiers (unweighted and weighted) with varying numbers of features included in the classifier based on 1,000 random training:test set partitions of two gene expression data sets (leukemia, lung cancer) and a proteomics data set (prostate cancer). Features to include in the classifiers were identified through a jackknife procedure through which features were ranked according to their frequency of occurrence in the top 1% or 5% most significant features based on t-statistics across all jackknife samples.
Figure 2. Leave-one-out cross validation (LOOCV) results for voting classifiers. Accuracy for voting classifiers (unweighted and weighted) with varying numbers of features included in the classifier based on LOOCV of two gene expression data sets (leukemia, lung cancer) and a proteomics data set (prostate cancer). Features to include in the classifiers were identified through a jackknife procedure through which features were ranked according to their frequency of occurrence in the top 1% or 5% most significant features based on t-statistics across all jackknife samples.
Figure 3. Comparison of voting classifiers, random forest and SVM. Accuracy (mean ± SE) for unweighted (Unwgt) and weighted (Wgt) voting classifiers, random forest (RF) and support vector machines (SVM) based on 1,000 random training:test set partitions of two gene expression data sets (leukemia, lung cancer) and a proteomics data set (prostate cancer). Features to include in the classifiers were identified through a jackknife procedure through which features were ranked according to their frequency of occurrence in the top 1% or 5% most significant features based on t-statistics across all jackknife samples. Horizontal bars show LOOCV results. Results presented for weighted and unweighted voting classifiers are based on the number of features yielding the highest mean accuracy. For the leukemia data set, 49 or 51 features yielded the highest accuracy for the voting classifiers in the MRV procedure, while for LOOCV the best numbers of features for the unweighted voting classifier were 17 and 11 using the top 1% and 5% of features, respectively, and 13 and 51, respectively, for the weighted voting classifier. For the lung cancer data set, 3 and 5 features were best with LOOCV for the weighted and unweighted classifiers. Under MRV, 51 features yielded the highest accuracy for the weighted voting classifier, while 19 or 39 features were needed for the unweighted voting classifier based on the top 1% and 5% of features, respectively. With the prostate cancer data set, the unweighted voting classifier used 31 and 49 features with MRV and 35 and 17 features with LOOCV based on the top 1% and 5% of features, respectively. For the weighted voting classifier, these numbers were 49, 51, 31 and 3, respectively. The number of features used in random forest and SVM varied across the training:test set partitions. Depending on the validation strategy and percentage of features retained in the jackknife procedure, the number of features ranged from 67 to 377 for the leukemia data set, from 233 to 2,692 for the prostate cancer data set and from 247 to 1,498 for the lung cancer data set.
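The unweighted and weighted voting classifiers compared in Figure 3 can be sketched roughly as follows. In this hedged reading, each selected feature votes for the class whose training mean its value is closer to, and the weighted variant weights votes by the feature's t-statistic magnitude; the helper names (`fit_votes`, `vote_predict`) and these rule details are assumptions, not the paper's exact formulation.

```python
import numpy as np

def fit_votes(Xtr, ytr):
    """Per-feature class means plus t-statistic-style weights (assumed form)."""
    m1, m0 = Xtr[ytr == 1].mean(0), Xtr[ytr == 0].mean(0)
    v1, v0 = Xtr[ytr == 1].var(0, ddof=1), Xtr[ytr == 0].var(0, ddof=1)
    n1, n0 = (ytr == 1).sum(), (ytr == 0).sum()
    w = np.abs(m1 - m0) / np.sqrt(v1 / n1 + v0 / n0 + 1e-12)
    return m0, m1, w

def vote_predict(Xte, m0, m1, w=None):
    """Each feature votes for the class whose training mean is nearer;
    passing weights `w` gives the weighted voting classifier."""
    votes = (np.abs(Xte - m1) < np.abs(Xte - m0)).astype(float)  # 1 = class 1
    if w is None:
        score = votes.mean(axis=1)      # unweighted: simple majority vote
    else:
        score = votes @ w / w.sum()     # weighted: votes scaled by |t|
    return (score > 0.5).astype(int)
```

The appeal for clinical use is that the fitted rule is just a short list of features, each with a cut-point and (optionally) a weight.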
Accuracy of classifiers applied to the prostate cancer data set excluding benign prostate hyperplasia samples.
| | Unweighted voting | Weighted voting | Random forest | SVM |
| LOOCV | | | | |
| Top 1% | 88.3 | 93.3 | 93.3 | 95.0 |
| Top 5% | 98.3 | 96.7 | 91.7 | 95.0 |
| 60:40 partitions | | | | |
| Top 1% | 84.8 ± 10.2 | 87.5 ± 8.6 | 89.6 ± 7.3 | 91.7 ± 6.8 |
| Top 5% | 76.2 ± 11.4 | 81.3 ± 10.3 | 89.5 ± 7.3 | 91.5 ± 6.4 |
Notes: Accuracy of voting classifiers (unweighted and weighted), random forest and SVM applied to the prostate cancer data set excluding benign prostate hyperplasia samples from the control group. Features to include in the classifiers were derived using the top 1% or 5% of features based on t-statistics through a jackknife procedure using training sets in leave-one-out cross validation (LOOCV) or multiple random validation (60:40 partitions). Mean ± SD accuracy reported for 1,000 60:40 random partitions.
Highest accuracy achieved with 7 features in classifier;
Highest accuracy achieved with 9 features in classifier;
Highest accuracy achieved with 13 features in classifier;
Highest accuracy achieved with 21 features in classifier;
Highest accuracy achieved with 47 features in classifier;
Highest accuracy achieved with 23 features in classifier;
Highest accuracy achieved with 51 features in classifier. The number of features used in random forest and SVM varied across the training:test set partitions. The ranges were:
265–340 features;
1,194–1,268 features;
212–533 features;
1,412–1,970 features.
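The 60:40 multiple random validation (MRV) scheme used throughout these tables amounts to repeatedly re-partitioning the data and averaging test-set accuracy. A minimal sketch, assuming plain unstratified random splits (the paper's exact partitioning details may differ):

```python
import numpy as np

def multiple_random_validation(X, y, classify, n_splits=1000,
                               train_frac=0.6, seed=0):
    """Mean and SD of test-set accuracy over random train:test partitions.

    `classify(Xtr, ytr, Xte)` must return predicted labels for Xte.
    Unstratified splitting is a simplifying assumption.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(train_frac * n))
    accs = []
    for _ in range(n_splits):
        idx = rng.permutation(n)                 # shuffle sample indices
        tr, te = idx[:n_train], idx[n_train:]    # 60:40 split
        yhat = classify(X[tr], y[tr], X[te])
        accs.append(np.mean(yhat == y[te]))
    accs = np.asarray(accs)
    return accs.mean(), accs.std(ddof=1)
```

Any classifier with the assumed `classify(Xtr, ytr, Xte)` signature, including the voting classifiers above, can be plugged into this loop.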
Performance of classifiers applied to the independent validation set of the lung cancer data set.
Accuracy (%)
| | Unweighted voting | Weighted voting | Random forest | SVM |
| LOOCV | | | | |
| Top 1% | 99.3 | 100 | 98.7 | 100 |
| Top 5% | 99.3 | 100 | 98.7 | 100 |
| 60:40 partitions | | | | |
| Top 1% | 98.7 | 100 | 99.3 | 94.6 |
| Top 5% | 98.7 | 100 | 100 | 84.6 |
Sensitivity (%)
| | Unweighted voting | Weighted voting | Random forest | SVM |
| LOOCV | | | | |
| Top 1% | 100 | 100 | 99.2 | 100 |
| Top 5% | 100 | 100 | 99.2 | 100 |
| 60:40 partitions | | | | |
| Top 1% | 99.2 | 100 | 100 | 94.0 |
| Top 5% | 99.2 | 100 | 100 | 82.8 |
Positive predictive value (%)
| | Unweighted voting | Weighted voting | Random forest | SVM |
| LOOCV | | | | |
| Top 1% | 99.3 | 100 | 99.2 | 100 |
| Top 5% | 99.3 | 100 | 99.2 | 100 |
| 60:40 partitions | | | | |
| Top 1% | 99.2 | 100 | 99.3 | 100 |
| Top 5% | 99.2 | 100 | 100 | 100 |
Notes: Accuracy, sensitivity, and positive predictive value of voting classifiers (unweighted and weighted), random forest and SVM applied to independent data sets from the lung cancer data set. Features to include in the classifiers were derived using the top 1% or 5% of features based on t-statistics through a jackknife procedure using training sets in leave-one-out cross validation (LOOCV) or multiple random validation (60:40 partitions).
Highest accuracy achieved with 37 features in classifier;
Highest accuracy achieved with 23 features in classifier;
Highest accuracy achieved with 15 features in classifier;
Highest accuracy achieved with 49 features in classifier. The numbers of features used in developing the SVM and random forest classifiers were:
452 features;
1,791 features;
4,172 features;
9,628 features.
Performance of classifiers applied to the independent validation set of the prostate cancer data set.
Accuracy (%)
| | Unweighted voting | Weighted voting | Random forest | SVM |
| LOOCV | | | | |
| Top 1% | 68.5 | 76.5 | 92.7 | 93.5 |
| Top 5% | 74.5 | 81.9 | 91.6 | 92.4 |
| 60:40 partitions | | | | |
| Top 1% | 86.3 | 88.2 | 91.6 | 89.7 |
| Top 5% | 86.7 | 89.9 | 90.5 | 86.6 |
Sensitivity (%)
| | Unweighted voting | Weighted voting | Random forest | SVM |
| LOOCV | | | | |
| Top 1% | 74.4 | 76.9 | 87.2 | 84.6 |
| Top 5% | 89.7 | 79.5 | 87.2 | 74.4 |
| 60:40 partitions | | | | |
| Top 1% | 74.4 | 76.9 | 82.0 | 65.0 |
| Top 5% | 69.2 | 74.4 | 84.6 | 64.1 |
Positive predictive value (%)
| | Unweighted voting | Weighted voting | Random forest | SVM |
| LOOCV | | | | |
| Top 1% | 43.9 | 53.6 | 70.8 | 75.0 |
| Top 5% | 50.7 | 62.0 | 66.7 | 74.4 |
| 60:40 partitions | | | | |
| Top 1% | 52.7 | 57.7 | 68.1 | 66.7 |
| Top 5% | 54.0 | 61.7 | 63.5 | 54.3 |
Notes: Accuracy, sensitivity, and positive predictive value of voting classifiers (unweighted and weighted), random forest and SVM applied to independent data sets from the prostate cancer data set. Features to include in the classifiers were derived using the top 1% or 5% of features based on t-statistics through a jackknife procedure using training sets in leave-one-out cross validation (LOOCV) or multiple random validation (60:40 partitions).
Highest accuracy achieved with 37 features in classifier;
Highest accuracy achieved with 43 features in classifier;
Highest accuracy achieved with 45 features in classifier;
Highest accuracy achieved with 49 features in classifier;
Highest accuracy achieved with 47 features in classifier;
Highest accuracy achieved with 27 features in classifier. The numbers of features used in developing the SVM and random forest classifiers were:
685 features;
2,553 features;
9,890 features;
14,843 features.
Figure 4. Mean BSS/WSS of features in voting classifiers. Mean BSS/WSS of features included in the voting classifiers constructed from varying numbers of features for the leukemia and prostate cancer data sets. Mean values were calculated using the training sets from 1,000 random training:test set partitions. Features to include in the classifiers were identified through a jackknife procedure through which features were ranked according to their frequency of occurrence in the top 1% or 5% most significant features based on t-statistics across all jackknife samples.
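The BSS/WSS statistic plotted in Figure 4 is the per-feature ratio of between-group to within-group sums of squares, a standard univariate measure of class separation. A minimal sketch of its computation (the function name is hypothetical):

```python
import numpy as np

def bss_wss(X, y):
    """Per-feature ratio of between-group to within-group sums of squares."""
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for g in np.unique(y):
        Xg = X[y == g]
        mg = Xg.mean(axis=0)
        bss += len(Xg) * (mg - overall) ** 2   # between-group spread
        wss += ((Xg - mg) ** 2).sum(axis=0)    # within-group spread
    return bss / (wss + 1e-12)
```

Features with large BSS/WSS separate the groups well, which is why mean BSS/WSS of the selected features tracks classifier accuracy in Figures 4 and 5.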
Figure 5. Mean accuracy of weighted voting classifier versus mean BSS/WSS. Mean accuracy of the weighted voting classifier using three features versus the mean BSS/WSS of these features for two gene expression data sets (leukemia, lung cancer) and a proteomics data set (prostate cancer). Mean values were calculated across 1,000 random training:test set partitions. Features to include in the classifiers were identified through a jackknife procedure through which features were ranked according to their frequency of occurrence in the top 1% most significant features based on t-statistics across all jackknife samples. Mean BSS/WSS was calculated separately using the training and test set portions of each random partition.
Figure 6. Frequency of occurrence of features in voting classifiers. Frequency of occurrence of features used in voting classifiers containing 51 features across 1,000 random training:test set partitions of two gene expression data sets (leukemia, lung cancer) and a proteomics data set (prostate cancer). Features to include in the classifiers were identified through a jackknife procedure through which features were ranked according to their frequency of occurrence in the top 1% or 5% most significant features based on t-statistics across all jackknife samples.