Amira Djebbari, Aurélie Labbe.
Abstract
BACKGROUND: In high-density arrays, identifying the genes relevant to disease classification is complicated not only by the curse of dimensionality but also by the highly correlated nature of the array data. In this paper, we address the question of how many and which genes should be selected for disease class prediction. Our work consists of a Bayesian supervised statistical learning approach that refines gene signatures using a regularization that penalizes correlation between the selected variables.
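To illustrate what penalizing correlation between selected variables can look like, here is a minimal sketch. It is not the Bayesian criterion used in the paper: the scoring function `penalized_score`, the use of absolute Pearson correlation for both relevance and redundancy, and the weight `lam` are all assumptions made for illustration.

```python
import numpy as np

def penalized_score(X, y, subset, lam):
    """Hypothetical subset score: relevance minus lam times redundancy.

    relevance  = sum of |corr(gene, class)| over the selected genes
    redundancy = sum of |corr(gene_j, gene_k)| over selected pairs
    """
    subset = list(subset)
    relevance = sum(abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset)
    redundancy = sum(
        abs(np.corrcoef(X[:, j], X[:, k])[0, 1])
        for i, j in enumerate(subset)
        for k in subset[i + 1:]
    )
    return relevance - lam * redundancy

# Toy data (made up): 4 samples, 3 genes.
y = np.array([0, 0, 1, 1], dtype=float)
X = np.column_stack([
    [0, 0, 1, 1],  # gene 0: tracks the class exactly
    [0, 0, 1, 1],  # gene 1: exact duplicate of gene 0
    [0, 1, 1, 1],  # gene 2: partially informative, less redundant
])

no_penalty = penalized_score(X, y, [0, 1], lam=0.0)
duplicates = penalized_score(X, y, [0, 1], lam=2.0)
diverse = penalized_score(X, y, [0, 2], lam=2.0)
```

Without a penalty the duplicated pair looks best because both genes are individually predictive. With `lam=2.0` the pair of genes 0 and 2 scores higher, since the duplicate adds no new information.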
Year: 2009 PMID: 20003289 PMCID: PMC2804684 DOI: 10.1186/1471-2105-10-410
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Networks used for simulations.
Summary of simulated datasets
| Dataset | Total # variables | # variables predicting the class | # instances in training dataset |
|---|---|---|---|
| Dataset A100 | 5 | 2 | 100 |
| Dataset A500 | 5 | 2 | 500 |
| Dataset C100 | 10 | 5 | 100 |
| Dataset C500 | 10 | 5 | 500 |
Number of times (out of 100) the correct variables are found with each method for network A
| Dataset | Boullé | Wrapper | Corr1 | Corr100 |
|---|---|---|---|---|
| A100-sparse | 55 | 46 | 59 | 97 |
| A100-half | 43 | 40 | 47 | 98 |
| A100-full | 45 | 35 | 47 | 92 |
| A500-sparse | 64 | 37 | 64 | 97 |
| A500-half | 50 | 33 | 50 | 93 |
| A500-full | 48 | 30 | 55 | 96 |
Number of times (out of 100) the correct variables are found with each method for network C
| Dataset | Boullé | Wrapper | Corr1 | Corr100 |
|---|---|---|---|---|
| C100-sparse | 22 | 3 | 35 | 45 |
| C100-half | 5 | 2 | 5 | 12 |
| C100-full | 1 | 2 | 3 | 9 |
| C500-sparse | 35 | 13 | 52 | 98 |
| C500-half | 32 | 8 | 51 | 75 |
| C500-full | 34 | 5 | 52 | 84 |
Number of times (out of 100) the correct variables are found with each method for network C-sparse
| Dataset | Boullé | Wrapper | Corr1 | Corr100 |
|---|---|---|---|---|
| C100-noise0 | 22 | 3 | 35 | 45 |
| C100-noise5 | 16 | 3 | 22 | 27 |
| C100-noise10 | 7 | 3 | 19 | 32 |
| C100-noise20 | 6 | 3 | 12 | 17 |
| C500-noise0 | 35 | 13 | 52 | 98 |
| C500-noise5 | 32 | 11 | 48 | 97 |
| C500-noise10 | 22 | 12 | 35 | 87 |
| C500-noise20 | 15 | 10 | 27 | 77 |
Figure 2. Number of times genes were selected in 10-fold nested CV for the van't Veer breast cancer dataset.
Classification performance of the naïve Bayes algorithm using genes selected on the van't Veer breast cancer dataset (training) and evaluated on an independent test set of 234 samples.
| | Boullé | Wrapper | Corr1 | Corr10 | Amsterdam signature |
|---|---|---|---|---|---|
| Average # of genes | 6 | 8 | 10 | 11 | 70 |
| ACC (%) | 66.24 | 60.68 | 61.54 | 63.25 | 61.54 |
| SENS (%) | 69.57 | 65.22 | 73.91 | 81.16 | 86.96 |
| SPEC (%) | 64.85 | 58.79 | 56.36 | 55.76 | 50.91 |
| PPV (%) | 45.28 | 39.82 | 41.46 | 43.41 | 42.55 |
| NPV (%) | 83.59 | 80.17 | 83.78 | 87.62 | 90.32 |
| AUC | 0.6721 | 0.6200 | 0.6514 | 0.6846 | 0.6893 |
The average number of genes represents the average over the nested 10 fold CV.
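The metrics in this table all follow from a single confusion matrix. The sketch below shows the standard definitions and checks them against the Boullé column. The counts used (48 true positives, 21 false negatives, 107 true negatives, 58 false positives) are not reported in the table. They are back-calculated from the published percentages under the assumption that the 234 test samples split into 69 positive and 165 negative cases.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return ACC, SENS, SPEC, PPV and NPV as percentages."""
    total = tp + fn + tn + fp
    return {
        "ACC": 100.0 * (tp + tn) / total,   # overall accuracy
        "SENS": 100.0 * tp / (tp + fn),     # sensitivity (recall)
        "SPEC": 100.0 * tn / (tn + fp),     # specificity
        "PPV": 100.0 * tp / (tp + fp),      # positive predictive value
        "NPV": 100.0 * tn / (tn + fn),      # negative predictive value
    }

# Assumed counts, inferred from the reported percentages (see text above).
boulle = confusion_metrics(tp=48, fn=21, tn=107, fp=58)

for name, value in boulle.items():
    print(f"{name}: {value:.2f}%")
```

Under these assumed counts the five percentages agree with the Boullé column to two decimal places. AUC cannot be recovered this way because it depends on the ranking of predicted probabilities, not on a single threshold.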
Classification performance of naïve Bayes algorithm with nested 10 fold CV obtained by different methods on Pomeroy medulloblastoma outcome dataset.
| | Pomeroy signature | Wrapper | Boullé | Corr1 |
|---|---|---|---|---|
| Average # of genes | 100 | 11.5 | 4.8 | 11 |
| ACC (%) | 73.33 | 71.67 | 61.67 | 75.00 |
| SENS (%) | 76.92 | 74.36 | 76.92 | 82.05 |
| SPEC (%) | 66.67 | 66.67 | 33.33 | 61.90 |
| PPV (%) | 81.08 | 80.56 | 68.18 | 80.00 |
| NPV (%) | 60.87 | 58.33 | 43.75 | 65.00 |
| AUC | 0.8168 | 0.8180 | 0.7410 | 0.7800 |
The average number of genes represents the average over the nested 10 fold CV.
Figure 3. Number of times genes were selected in 10-fold CV for the Pomeroy medulloblastoma outcome dataset.
Classification performance of naïve Bayes algorithm with nested 10 fold CV obtained by different methods on Ramaswamy metastases dataset
| | Ramaswamy signature | Wrapper | Boullé | Corr1 | Corr100 |
|---|---|---|---|---|---|
| # of genes | 128 | 5.9 | 3 | 8.4 | 8 |
| ACC (%) | 90.79 | 89.47 | 77.63 | 89.47 | 90.79 |
| SENS (%) | 92.19 | 95.31 | 87.50 | 95.31 | 95.31 |
| SPEC (%) | 83.33 | 58.33 | 25.00 | 58.33 | 66.67 |
| PPV (%) | 96.72 | 92.42 | 86.15 | 92.42 | 93.85 |
| NPV (%) | 66.67 | 70.00 | 27.27 | 70.00 | 72.73 |
| AUC | 0.9297 | 0.8650 | 0.7410 | 0.8740 | 0.9210 |
Figure 4. Number of times genes were selected in 10-fold CV for the Ramaswamy metastases dataset.