Stephen O'Hara, Kun Wang, Richard A Slayden, Alan R Schenkel, Greg Huber, Corey S O'Hern, Mark D Shattuck, Michael Kirby.
Abstract
BACKGROUND: We introduce Iterative Feature Removal (IFR) as an unbiased approach for selecting features with diagnostic capacity from large data sets. The algorithm is based on recently developed tools in machine learning that are driven by sparse feature selection goals. When applied to genomic data, our method is designed to identify genes that can provide deeper insight into complex interactions while remaining directly connected to diagnostic utility. We contrast this approach with the search for a minimal best set of discriminative genes, which can provide only an incomplete picture of the biological complexity.
Year: 2013 PMID: 24274115 PMCID: PMC3879090 DOI: 10.1186/1471-2164-14-832
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Overview of data sets
| Data set | Training samples | Test samples | Features | Classes |
| Influenza 14-16 | 51 | 57 | 12023 | Asymptomatic, Symptomatic |
| Influenza 11-14 | 68 | 75 | 12023 | Asymptomatic, Symptomatic |
| Lung Cancer | 32 | 149 | 12533 | Mesothelioma, Adenocarcinoma |
| Prostate Cancer | 102 | 34 | 12600 | Tumor, Normal |
| BCell Lymphoma | 47 | – | 4026 | Germinal, Activated |
Influenza 14-16 and 11-14 represent different temporal intervals of the data from [10]. The former is used to compare with other published results, and the latter is used because it provides more training samples for the automated analysis. All data sets except the BCell lymphoma have defined training and test partitions.
Figure 1. Iterative feature removal on influenza and lung cancer data. Iterative Feature Removal is shown using two data sets, influenza (top) and lung cancer (bottom). In each row, the left figure shows the accuracy at each iteration and the right figure shows the number of features selected per iteration. At each iteration, the model is trained without access to any of the genes selected in any of the previous iterations. For the influenza data set there are about 40 sets that are approximately equally predictive, together identifying approximately 1200 genes. For the lung cancer data there are about 30 sets, or some 900 genes, that exhibit predictive properties. The red line represents the rolling average accuracy, illustrating the trend in the data. Figure best viewed in color.
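The removal loop described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `iterative_feature_removal` and `select_top_k` are hypothetical, and the selector (which ranks features by the absolute difference of class means) is only a simple stand-in for the sparse SVM used in the paper. The toy matrix is invented for the example.

```python
from statistics import mean

def select_top_k(X, y, available, k):
    """Stand-in selector (assumption): rank the still-available features
    by the absolute difference between the two class means."""
    def score(j):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        return abs(mean(pos) - mean(neg))
    return sorted(available, key=score, reverse=True)[:k]

def iterative_feature_removal(X, y, n_iterations, k):
    """Repeatedly select k features, then remove them from the pool so that
    later iterations never see features chosen earlier."""
    available = list(range(len(X[0])))
    selected_sets = []
    for _ in range(n_iterations):
        if not available:
            break
        chosen = select_top_k(X, y, available, k)
        selected_sets.append(chosen)
        chosen_set = set(chosen)
        available = [j for j in available if j not in chosen_set]
    return selected_sets

# Toy data (invented): 4 samples, 4 features, labels 0/1.
X = [[0, 0, 0, 5],
     [0, 1, 0, 5],
     [3, 2, 1, 5],
     [3, 3, 1, 5]]
y = [0, 0, 1, 1]
feature_sets = iterative_feature_removal(X, y, n_iterations=2, k=2)
print(feature_sets)  # [[0, 1], [2, 3]]
```

In a fuller version, each iteration would also record the accuracy of the model trained on the remaining features, which is the quantity plotted in the left-hand panels.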
Selected pathways from the first 40 iterations of IFR on the influenza data
| Pathway | Acc | Genes (iteration discovered) |
| Interferon Stimulated Genes | 87.7 | AIM2 (4), DDX60 (3), GBP1 (1), HERC5 (8), HERC6 (17), IFI27 (2), IFI30 (21), IFI35 (18), IFI44 (1), IFI44L (2), IFI6 (5), IFIH1 (16), IFIT1 (2), IFIT2 (25), IFIT3 (4), IFIT5 (10), IFITM1 (2), IFITM2 (30), IFITM3 (6), IL15 (17), IL15RA (27), IRF7 (13), IRF9 (12), ISG15 (3), ISG20 (34), MX1 (12), OAS1 (1), OAS2 (10), OAS3 (8), OASL (9), PSME1 (5), PSME2 (3), RSAD2 (3), STAT1 (7), STAT2 (27), STAT5B (30), TRIM22 (5), XAF1 (9) |
| Antigen Recognition Genes | 93.0 | CD1C (17), HLA-B (27), HLA-DOB (6), HLA-DPA1 (32), HLA-DQA1 (3), HLA-DQB1 (6), HLA-E (26), MICA (22), TAP1 (7), TAP2 (30) |
| TNF Super Family | 89.5 | TNF (28), TNFAIP1 (11), TNFAIP3 (29), TNFAIP6 (22), TNFRSF10B (5), TNFRSF14 (31), TNFRSF4 (11), TNFSF10 (11) |
| IL-1 Beta Receptor Family | 86.0 | IL1B (14), IL1F5 (12), IL1R1 (10), IL1RAP (6), IL1RL2 (15), IL33 (24) |
| B Cell Maturation and Activation | 91.2 | CD19 (36), CD200 (4), CD22 (10), CD24 (7), CD38 (31), CD40 (23), CD72 (28), CD79A (12), CD86 (16), CD9 (7), IGHD (12), IGHM (5), IGHV3-23 (15) |
| Cell Cycle Related | 89.5 | CDC20 (13), CDC45L (8), CDCA3 (14), CDCA8 (7), CDK5 (1), CDK5R2 (6), CDKAL1 (5), CDKL5 (27), CDKN1C (13) |
| Programmed Cell Death | 84.2 | CASP10 (33), CASP4 (29), CASP5 (24), CASP7 (6), PCDHA3 (14), PCDHGA11 (5), PDCD1LG2 (2), PDCD4 (14) |
| Chemokines | 87.7 | CCL11 (19), CCL5 (36), CCR1 (10), CCR10 (4), CCR3 (29), CCR6 (27), CCRL2 (6), CX3CR1 (35), CXCL10 (30), CXCL11 (29), CXCL6 (15), CXCR5 (37), DARC (39) |
| Cell Adhesion Molecules | 87.7 | ICAM3 (3), ICAM4 (15), ICAM5 (29), MADCAM1 (25) |
| Cytokine-Cytokine Receptor Signaling | 82.5 | IL16 (10), IL17RC (26), IL18RAP (1), IL22 (11), IL9 (31), SOCS1 (33), SOCS3 (29), SOCS6 (9) |
| Complement Pathway | 91.2 | C1QA (2), C1QB (3), C2 (19), C3AR1 (26), CR2 (12) |
| Other Immune Response | 93.0 | CD8A (22), FCER1G (7), FCER2 (28), FCRL2 (14), KIR2DL3 (32), LY6E (1), LY9 (8), MARCO (26), TLR5 (8) |
Selected pathways relating to genes taken from the first 40 iterations of IFR on the influenza data. The iteration in which each gene was discovered is given in parentheses. Acc indicates the classification accuracy (percent) using only the genes in each list.
Figure 2. Distribution of selected pathways over IFR iterations on influenza data. The Interferon Stimulated Genes comprise the longest list and have representation in half of the iterations. Genes from the other pathways require more iterations to discover. Not all iterations cover the same biological pathways.
Accuracies can improve when combining pathways
| Acc | Pathway pair |
| 100.0 | B Cell Maturation and Activation + Cell Adhesion Molecules |
| 98.2 | Antigen Recognition Genes + Cell Adhesion Molecules |
| 96.5 | IL-1 Beta Receptor Family + Cell Adhesion Molecules |
| 96.5 | B Cell Maturation and Activation + Complement Pathway |
| 94.7 | IL-1 Beta Receptor Family + B Cell Maturation and Activation |
| 94.7 | Cell Cycle Related + Complement Pathway |
| 94.7 | Cell Cycle Related + Cell Adhesion Molecules |
| 94.7 | Antigen Recognition Genes + B Cell Maturation and Activation |
| 93.0 | IL-1 Beta Receptor Family + Complement Pathway |
| 93.0 | Complement Pathway + Other Immune Response |
| 93.0 | Cell Adhesion Molecules + Other Immune Response |
| 93.0 | Cell Adhesion Molecules + Complement Pathway |
| 93.0 | B Cell Maturation and Activation + Chemokines |
| 93.0 | B Cell Maturation and Activation + Cell Cycle Related |
| 93.0 | Antigen Recognition Genes + Other Immune Response |
| 93.0 | Antigen Recognition Genes + Complement Pathway |
Classification accuracies can improve when combining pathways. We computed the classification accuracies when using all pairs of the pathway gene lists presented in Table 2. The top sixteen results are shown.
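The pairwise search can be illustrated as follows. This is a sketch under stated assumptions: `rank_pathway_pairs` and `placeholder_score` are hypothetical names, and the placeholder simply counts genes so that the example runs on its own. A real evaluator would train a classifier on the pooled genes and return its test accuracy. The three gene lists are copied from the pathway table.

```python
from itertools import combinations
from math import comb

def rank_pathway_pairs(pathways, evaluate):
    """Score every unordered pair of pathways on the union of their genes."""
    results = []
    for (name_a, genes_a), (name_b, genes_b) in combinations(pathways.items(), 2):
        genes = sorted(set(genes_a) | set(genes_b))
        results.append((evaluate(genes), f"{name_a} + {name_b}"))
    # Highest score first.
    results.sort(key=lambda item: item[0], reverse=True)
    return results

def placeholder_score(genes):
    """Hypothetical stand-in for 'train and test a classifier on these genes'."""
    return len(genes)

pathways = {
    "Cell Adhesion Molecules": ["ICAM3", "ICAM4", "ICAM5", "MADCAM1"],
    "Complement Pathway": ["C1QA", "C1QB", "C2", "C3AR1", "CR2"],
    "IL-1 Beta Receptor Family": ["IL1B", "IL1F5", "IL1R1", "IL1RAP", "IL1RL2", "IL33"],
}
ranked = rank_pathway_pairs(pathways, placeholder_score)
for score, pair in ranked:
    print(score, pair)

# Twelve pathway lists give this many candidate pairs:
print(comb(12, 2))  # 66
```

With the twelve pathway lists in the table, the search covers 66 pairs, of which the sixteen best are reported.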
Pathway classification accuracy on influenza compared to other published results
| Acc | Classifier |
| 93.0 | Bayesian Elastic Net |
| 91.2 | Bayesian Lasso |
| 93.0 | Elastic Net |
| 91.2 | Lasso |
| 91.2 | Relevance Vector Machine |
| 93.0 | SVM-RFE |
In this table, the accuracies of our best single-pathway and best pathway-pair classifiers are compared to those of the best classifiers reported by Chen et al. [10]. The same protocol is followed: classifiers are trained using H3N2 data from time intervals 14-16 and tested on H1N1 from the same time period.
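A quick arithmetic check helps in reading these percentages. Assuming the H1N1 test set for the 14-16 interval contains 57 samples (the second count listed for Influenza 14-16 in the data set overview), every accuracy reported for the influenza classifiers corresponds to a whole number of correct predictions, and neighbouring values differ by a single sample. The variable names below are illustrative.

```python
n_test = 57  # assumed H1N1 test-set size, taken from the data set overview

# Accuracies (percent) that appear in the influenza tables and captions.
reported = ["100.0", "98.2", "96.5", "94.7", "93.0", "91.2",
            "89.5", "87.7", "86.0", "84.2", "82.5"]

# Map each achievable accuracy, rounded to one decimal, to its count of
# correct predictions.
achievable = {f"{100 * correct / n_test:.1f}": correct
              for correct in range(n_test + 1)}

for acc in reported:
    print(f"{acc}% -> {achievable[acc]} of {n_test} correct")
```

Under this assumption, the gap between 93.0% and 91.2% is one test sample (53 versus 52 correct), which is worth keeping in mind when comparing classifiers.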
T-test ranking of genes in the BCell+CAM classifier
| 831 | 1238 | 5659 | 4620 | 508 |
| 2198 | 2406 | 1048 | 2462 | 564 | 4250 | 8481 |
The BCell+CAM classifier is 100% accurate on the test data, while consisting of genes that are relatively low ranking in terms of univariate separability. The table shows gene names and the t-test rankings within the total of 12,023 genes (bold values for those within the first 500). A blank cell separates the 13 BCell genes from the 4 CAM genes.
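A univariate ranking of this kind can be sketched as below. The source does not say which t-test variant was used, so Welch's unequal-variance statistic is an assumption here, as are the function names, the gene labels `g1` to `g3`, and the toy expression values.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic (assumed variant)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def t_test_ranks(X, y, gene_names):
    """Rank genes by |t| between the two classes; rank 1 is most separable."""
    scores = {}
    for j, name in enumerate(gene_names):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores[name] = abs(welch_t(pos, neg))
    ordered = sorted(gene_names, key=lambda g: scores[g], reverse=True)
    return {g: rank for rank, g in enumerate(ordered, start=1)}

# Toy expression matrix (invented): rows are subjects, columns are genes.
X = [[5.0, 2.0, 1.0],
     [5.1, 3.0, 2.0],
     [4.9, 4.0, 3.0],
     [1.0, 1.0, 1.0],
     [1.1, 2.0, 2.5],
     [0.9, 3.0, 2.5]]
y = [1, 1, 1, 0, 0, 0]
ranks = t_test_ranks(X, y, ["g1", "g2", "g3"])
print(ranks)  # {'g1': 1, 'g2': 2, 'g3': 3}
```

A gene with a poor rank in such a list can still contribute to an accurate multivariate classifier, which is the point the table makes.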
Figure 3. Univariate separability of discriminative genes. Boxplots show the gene expression levels (as z-scores) between symptomatic and asymptomatic subjects for selected genes. The first figure shows the expression level for the 17 genes identified by the best pathway pairs classifier (BCell + CAM). The second figure shows the top 17 genes ranked according to univariate t-test scores. The t-test score is provided at the top of the figures for each gene. The genes from the best classifier are far less discriminative in the univariate sense than the top ranked genes, but as a group, they are more discriminative (100% vs. 89.5% classification accuracy). More than two-thirds of the genes from the best classifier fall outside of the top 500 genes ranked according to t-test.
Figure 4. Gene ontology annotations per iteration on lung cancer data. The x-axis shows the iteration number, each of which identifies a subset of approximately 30 genes. The y-axis is a set of labels from the GATHER interface to the Gene Ontology, selecting labels at depth 5 in the ontology structure. The intensity of the color at an (x,y) location indicates the number of genes in the subset associated with that label. Comparing one column with another suggests that different iterations contain genes associated with different biological processes. Comparing one row with another shows the distribution of genes over iterations relating to a specific process.
Pathway and pathway pair classifiers
| Lung | 30 | 95.5/7.0 | 99.3 | 96.9 | 98.8/3.7 | 98.0 | 98.8 | 99.8/1.7 | 98.0 | 96.8 |
| Influenza 11-14 | 30 | 98.6/2.9 | 96.0 | 95.5 | 99.0/2.3 | 97.3 | 97.6 | 98.9/2.3 | 92.0 | 92.8 |
| Prostate | 30 | 78.9/14.2 | 79.4 | 77.6 | 81.8/14.0 | 79.4 | 78.2 | 91.0/5.7 | 79.4 | 80.6 |
| BCell Lymph. | 20 | 89.5/12.3 | 83.3 | 75.0 | 91.8/12.4 | 83.3 | 78.3 | 87.8/12.4 | 83.3 | 80.0 |
Cross-validated classification results compared using features selected from discriminative pathways and pathway-pairs, and the “optimal” set of features selected by sparse SVM (SSVM). The latter is equivalent to the first iteration of IFR. The mean and standard deviation of 50 cross-validated trials are shown, followed by the accuracy of the best model from the validation trials applied to withheld test data. Also shown is the average accuracy of the top 5 models on the withheld test data.
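The three kinds of numbers in each group of columns can be computed from per-trial results as in the sketch below. The function name `summarize_trials`, the use of the sample standard deviation, and the toy accuracies are assumptions made for illustration; the source does not specify these details.

```python
from statistics import mean, stdev

def summarize_trials(trials, top_n=5):
    """trials: list of (validation_accuracy, test_accuracy), one per model."""
    validation = [v for v, _ in trials]
    # Order models by validation accuracy, best first.
    by_validation = sorted(trials, key=lambda t: t[0], reverse=True)
    return {
        "cv_mean": mean(validation),
        "cv_std": stdev(validation),  # sample standard deviation (assumed)
        "best_test": by_validation[0][1],
        "top_n_test": mean([t for _, t in by_validation[:top_n]]),
    }

# Toy results (invented), in percent: (validation, withheld test).
trials = [(95.0, 90.0), (99.0, 97.0), (90.0, 88.0),
          (97.0, 95.0), (93.0, 91.0), (96.0, 94.0)]
summary = summarize_trials(trials)
print(summary["cv_mean"], summary["best_test"], summary["top_n_test"])
```

Reporting the top-5 average alongside the single best model gives some protection against a lucky validation split.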
Figure 5. T-test gene ranking of top classifiers. The genes from the best model, selected from 50-trial cross-validation, are plotted according to t-test gene ranking using box plots to show the distributions. As with most parsimonious machine learning methods, the sparse SVM (SSVM) classifier tends to select genes that are more univariately discriminative than either the pathway or pathway pair classifiers. Deeper mining of the features using IFR can help identify non-obvious sets of discriminative genes and elucidate discriminative pathways.
Figure 6. SSVM weights for first IFR iteration. Magnitude of the top 200 SSVM weights for the first IFR iteration on the influenza data. The cutoff below which weights are set to zero is clearly defined by the steep drop in magnitude.
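One way to locate such a cutoff automatically is to sort the weight magnitudes and cut at the largest ratio between consecutive values. This rule is an assumption chosen for illustration, not a procedure stated in the source; the function name `select_by_weight_drop`, the `eps` guard, and the toy weights are likewise hypothetical.

```python
def select_by_weight_drop(weights, eps=1e-12):
    """Return indices of the weights that lie before the steepest drop
    in sorted magnitude."""
    order = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
    mags = [abs(weights[j]) for j in order]
    if len(mags) < 2:
        return order
    # Ratio of each magnitude to the next one; the largest ratio marks the drop.
    ratios = [mags[i] / max(mags[i + 1], eps) for i in range(len(mags) - 1)]
    cut = ratios.index(max(ratios)) + 1
    return sorted(order[:cut])

# Toy weight vector (invented): three large weights, two numerically tiny ones.
weights = [0.9, -1e-9, 0.7, 2e-9, -0.8]
selected = select_by_weight_drop(weights)
print(selected)  # [0, 2, 4]
```

Features before the drop are treated as selected for the iteration, and the rest are treated as having zero weight.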