| Literature DB >> 25983540 |
Vincent Gardeux, Rachid Chelouah, Maria F Barbosa Wanderley, Patrick Siarry, Antônio P Braga, Fabien Reyal, Roman Rouzier, Lajos Pusztai, René Natowicz.
Abstract
BACKGROUND: Filter feature selection methods compute molecular signatures by selecting subsets of genes in the ranking of a valuation function. The motivations for the choice of valuation function are almost always clearly stated, but those for selecting the genes according to their ranking are hardly ever made explicit.
Keywords: bi-objective optimization; breast cancer; feature selection; filter method; molecular signatures
Year: 2015 PMID: 25983540 PMCID: PMC4426938 DOI: 10.4137/CIN.S21111
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Description of the seven publicly available cancer datasets used in this study.
| DATASET | DATA TYPE | ARTICLE | # CASES | # PROBESETS | SOURCE |
|---|---|---|---|---|---|
| Dataset I | Colon | [29] | 62 | 2000 | (1) |
| Dataset II | Lymphoma | [30] | 77 | 5469 | (2) |
| Dataset III | Leukemia | [31] | 72 | 7129 | (3) |
| Dataset IV | Prostate | [32] | 102 | 10509 | (2) |
| Dataset V | Brain | [33] | 60 | 7129 | (3) |
| Dataset VI | Breast | [8] | 133 | 22283 | (4) |
| Dataset VII | Breast | [16] | 91 | 22283 | GSE20271 |
Notes: (1) http://genomics-pubs.princeton.edu/oncology/ (2) http://www.gems-system.org/ (3) http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi (4) http://bioinformatics.mdanderson.org/main/Public_Datasets
Average performances and size of the signatures predicted by bi-objective function optimization; three-fold cross-validation. We computed the δ-filtered signatures 100 times for different random training/testing subsets on Datasets I–V (Methods: 3-fold CV). Average performances across the runs are reported along with their standard deviation.
| | COLON | LYMPHOMA | LEUKEMIA | PROSTATE | BRAIN |
|---|---|---|---|---|---|
| #Probesets | 8.420 ± 5.459 | 4.700 ± 2.816 | 3.100 ± 1.187 | 3.340 ± 0.764 | 10.420 ± 5.668 |
| Accuracy | 0.825 ± 0.018 | 0.872 ± 0.017 | 0.961 ± 0.008 | 0.909 ± 0.009 | 0.672 ± 0.038 |
| Sensitivity | 0.867 ± 0.029 | 0.872 ± 0.036 | 0.937 ± 0.012 | 0.904 ± 0.008 | 0.584 ± 0.084 |
| Specificity | 0.747 ± 0.013 | 0.889 ± 0.015 | 0.974 ± 0.010 | 0.913 ± 0.015 | 0.718 ± 0.022 |
| PPV | 0.865 ± 0.006 | 0.702 ± 0.035 | 0.950 ± 0.020 | 0.524 ± 0.040 | 0.650 ± 0.040 |
| NPV | 0.752 ± 0.043 | 0.939 ± 0.012 | 0.967 ± 0.006 | 0.910 ± 0.007 | 0.766 ± 0.045 |
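The protocol behind this table (repeated randomized 3-fold splits, with metrics averaged across the runs) can be sketched as follows. The classifier below is a plain nearest-centroid stand-in, not the authors' full δ-filter + DLDA pipeline, and all names and data are illustrative:

```python
import numpy as np

def three_fold_runs(X, y, n_runs=100, seed=0):
    """Repeat a randomized 3-fold split n_runs times and return the mean and
    standard deviation of the per-fold test accuracy (cf. the paper's
    'average performances +/- standard deviation' reporting)."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_runs):
        idx = rng.permutation(len(y))
        folds = np.array_split(idx, 3)
        for k in range(3):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(3) if j != k])
            # nearest-centroid stand-in for the trained predictor
            mu0 = X[train][y[train] == 0].mean(axis=0)
            mu1 = X[train][y[train] == 1].mean(axis=0)
            d0 = ((X[test] - mu0) ** 2).sum(axis=1)
            d1 = ((X[test] - mu1) ** 2).sum(axis=1)
            pred = (d1 < d0).astype(int)
            accs.append((pred == y[test]).mean())
    return float(np.mean(accs)), float(np.std(accs))

# synthetic two-class expression data (90 samples, 20 features)
rng = np.random.default_rng(42)
X = rng.normal(size=(90, 20))
y = np.array([0] * 45 + [1] * 45)
X[y == 1] += 1.0                     # separate the classes
mean_acc, std_acc = three_fold_runs(X, y)
```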
Average performances and size of the signatures predicted by bi-objective function optimization; leave-one-out cross-validation. We computed the δ-filtered signatures for the different training/testing splits of Datasets I–V (Methods: Leave-one-out CV). Performances are computed from the summary of the runs.
| | COLON | LYMPHOMA | LEUKEMIA | PROSTATE | BRAIN |
|---|---|---|---|---|---|
| #Probesets | 9.048 | 4.156 | 2.972 | 4.000 | 8.267 |
| Accuracy | 0.855 | 0.883 | 0.986 | 0.941 | 0.683 |
| Sensitivity | 0.925 | 1.000 | 0.960 | 0.960 | 0.619 |
| Specificity | 0.727 | 0.845 | 1.000 | 0.923 | 0.718 |
| PPV | 0.860 | 0.679 | 1.000 | 0.923 | 0.542 |
| NPV | 0.842 | 1.000 | 0.979 | 0.960 | 0.778 |
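The leave-one-out protocol for this table refits the predictor once per sample, holding that sample out for testing. A minimal sketch, again with a nearest-centroid stand-in for the trained predictor (names and data are illustrative):

```python
import numpy as np

def loocv_accuracy(X, y):
    """Leave-one-out CV: each sample is held out once and classified by a
    predictor refit on the remaining n-1 samples."""
    n = len(y)
    correct = 0
    for i in range(n):
        mask = np.ones(n, dtype=bool)
        mask[i] = False
        Xt, yt = X[mask], y[mask]
        mu0 = Xt[yt == 0].mean(axis=0)
        mu1 = Xt[yt == 1].mean(axis=0)
        pred = int(((X[i] - mu1) ** 2).sum() < ((X[i] - mu0) ** 2).sum())
        correct += (pred == y[i])
    return correct / n

rng = np.random.default_rng(7)
X = rng.normal(size=(40, 10))
y = np.array([0] * 20 + [1] * 20)
X[y == 1] += 1.5                    # separate the classes
acc = loocv_accuracy(X, y)
```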
Comparison of average performances and signature sizes reported in the literature. This table reports the mean accuracies (Ac., in percent) and mean numbers of probesets of non-biased results (Methods: Experimental Protocol (Non-biased)) published for the five benchmark datasets (Datasets I–V). The results of our predictive modeling are reported on the last line.
| ARTICLES | COLON AC. | COLON # | LYMPHOMA AC. | LYMPHOMA # | LEUKEMIA AC. | LEUKEMIA # | PROSTATE AC. | PROSTATE # | BRAIN AC. | BRAIN # |
|---|---|---|---|---|---|---|---|---|---|---|
| [32] | – | – | – | – | – | – | 86.00 | 29 | – | – |
| [34] | – | – | – | – | – | – | – | – | 60.00 | 21 |
| [35] | 85.83 | 20 | – | – | – | – | – | – | – | – |
| [36] | – | – | 83.33 | 6 | – | – | – | – | – | – |
| [37] | 82.33 | 20 | – | – | – | – | – | – | – | – |
| [38] | 82.03 | (*) | – | – | 94.40 | (*) | 91.22 | (*) | – | – |
| [39] | – | – | – | – | – | – | 94.12 | 22 | – | – |
| [40] | 85.71 | 30 | – | – | – | – | 94.11 | 20 | – | – |
| [13] F-test | 84.05 | 15.1 | – | – | – | – | 91.18 | 126.4 | – | – |
| [13] | 76.70 | 35.1 | – | – | – | – | 94.60 | 756.6 | – | – |
| [13] | 78.60 | 43.3 | – | – | – | – | 94.70 | 573.3 | – | – |
| [13] | 80.30 | 31.8 | – | – | – | – | 94.80 | 95.5 | – | – |
| [13] SVM-RFE | 85.48 | 26.4 | – | – | – | – | 94.18 | 43.2 | – | – |
| [13] GLMPath | 81.91 | 1.3 | – | – | – | – | 94.09 | 1.6 | – | – |
| [13] Random Forest | 89.40 | 49.8 | – | – | – | – | 94.10 | 81 | – | – |
Note: (*) Number of probesets not reported in the article.
Comparison of the performances of signatures predicted on breast cancer Dataset VI (training set = 82 samples). The first column of this table (δ-DLDA-30) corresponds to the result of our predictive modeling with a fixed size of 30 probesets (for direct comparison with other methods). The second column (δ-DLDA-11) contains the results of our predictive modeling obtained without fixing the number of probesets in the signature; the optimal non-singular predictor found by our method contained 11 probesets. The third (DLDA-30 [8]) and fourth (Bi-Majority-30 [14]) columns report results found in the literature with the same data and protocol.
| | δ-DLDA-30 | δ-DLDA-11 | DLDA-30 | Bi-Majority-30 |
|---|---|---|---|---|
| Accuracy | 0.863 | 0.882 | 0.765 | 0.863 |
| Sensitivity | 0.846 | 0.923 | 0.923 | 0.923 |
| Specificity | 0.868 | 0.868 | 0.711 | 0.842 |
| PPV | 0.688 | 0.706 | 0.522 | 0.667 |
| NPV | 0.943 | 0.971 | 0.964 | 0.970 |
Figure 1. Heatmaps of the four signatures on testing data (51 patients).
Notes: The four heatmaps represent, for each of the 51 patients of the breast cancer testing set (in columns), the different genes (in rows) of each of the four molecular signatures detailed in this paper. Green colors represent down-regulated genes, and red colors represent up-regulated genes. In parentheses are the names of the corresponding probesets in the Affymetrix microarray. Panel A corresponds to the 11 genes of the δ-DLDA-11 signature. Panel B corresponds to the 30 genes of the δ-DLDA-30 signature. Panel C corresponds to the 30 genes of the DLDA-30 signature [8]. Panel D corresponds to the 30 genes of the Bi-Majority-30 signature [14]. At the bottom of each heatmap two variables are represented: “Predicted” is the predicted response for each patient using the DLDA classifier, and “Response” is the true response class (PCR: Pathologic Complete Response, or NoPCR: residual disease). In the subpanel, the red bars represent the misclassified patients (False Positives or False Negatives).
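The DLDA classifier used for prediction here is diagonal linear discriminant analysis: linear discriminant analysis with the covariance matrix restricted to its diagonal, i.e. per-feature variances pooled across classes. A minimal numpy sketch under that reading (the pooled-variance estimate below is a simplification, not necessarily the authors' exact implementation, and all data are synthetic):

```python
import numpy as np

def dlda_fit(X, y):
    """Fit a diagonal LDA: per-class means plus one pooled variance per feature."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # simplified pooling: average of the per-class per-feature variances
    var = np.mean([X[y == c].var(axis=0, ddof=1) for c in classes], axis=0)
    return classes, means, var

def dlda_predict(model, X):
    """Assign each sample to the class with the smallest variance-standardized
    squared distance to its mean (diagonal-covariance discriminant)."""
    classes, means, var = model
    d = ((X[:, None, :] - means[None, :, :]) ** 2 / var).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

# synthetic 11-feature data, echoing the 11-probeset signature size
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 11)),
               rng.normal(1.5, 1.0, size=(40, 11))])
y = np.array([0] * 40 + [1] * 40)
model = dlda_fit(X, y)
pred = dlda_predict(model, X)
acc = (pred == y).mean()
```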
Detailed characteristics of the δ-DLDA-11 signature. This table contains descriptions of the 11 probesets of highest contributions to the interclass distance unveiled by our predictive modeling on Dataset VI (rank of the probesets following our prioritization by δ scores, names of the targeted genes, Affymetrix references of the probesets, values of their contributions to the interclass distance, and P-values of a t-test). In bold are the genes of the 11-probeset δ-signature that were members of neither the DLDA-30 nor the Bi-Majority-30 signature.
| RANK | GENE | PROBESET | δ | P-VALUE |
|---|---|---|---|---|
| 1 | | 205225_at | 0.102 | 5.261E-6 |
| 2 | BTG3 | 213134_x_at | 0.090 | 2.956E-5 |
| 3 | BTG3 | 205548_s_at | 0.088 | 3.307E-5 |
| 4 | MELK | 204825_at | 0.083 | 1.224E-4 |
| 5 | METRN | 219051_x_at | 0.076 | 1.705E-6 |
| 6 | GAMT | 205354_at | 0.075 | 2.768E-7 |
| 7 | MAPT | 203929_s_at | 0.074 | 2.312E-8 |
| 8 | | 209173_at | 0.073 | 5.451E-7 |
| 9 | | 204913_s_at | 0.073 | 8.000E-3 |
| 10 | | 212956_at | 0.072 | 2.037E-5 |
| 11 | SCUBE2 | 219197_s_at | 0.071 | 4.736E-5 |
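The δ scores in this table rank probesets by their contribution to the interclass distance. A sketch of one plausible reading of that score (per-probeset squared difference of the class centroids, normalized to sum to 1; the normalization is our assumption, not the authors' published formula, and the data are synthetic):

```python
import numpy as np

def delta_contributions(X, y):
    """Per-probeset contribution to the squared distance between the two
    class centroids, normalized so that the contributions sum to 1."""
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    sq = (mu1 - mu0) ** 2
    return sq / sq.sum()

# synthetic data: 100 samples, 200 mock probesets
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 200))
y = np.array([0] * 50 + [1] * 50)
X[y == 1, :5] += 3.0                       # make the first five probesets informative
delta = delta_contributions(X, y)
ranking = np.argsort(delta)[::-1]          # probesets ordered by decreasing delta
top5 = set(int(i) for i in ranking[:5])
```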
Figure 2. Boxplots of the expressions of the 11 probesets of the δ-DLDA-11 signature.
Notes: The boxplots represent the expression values of the 11 probesets selected by our predictive modeling for breast cancer prediction of the response to preoperative chemotherapy.
Abbreviations: PCR, pathologic complete response; No-PCR, residual disease.
Performances of the predictors trained on breast cancer Dataset VI (82 training samples) and applied to Dataset VII (91 test samples). In the first column (δ-DLDA-30) are the performances of our predictive modeling with the 30-probeset signature predicted on Dataset VI (for direct comparison with other methods). In the second column (δ-DLDA-11) are the results of our predictive modeling obtained with the optimal non-singular DLDA predictor (11-probeset signature) predicted on Dataset VI. The third (DLDA-30 [8]) and fourth (Bi-Majority-30 [14]) columns report the performances of two predictors whose signatures were the 30 probesets of smallest P-values of the Student t-test (DLDA-30) and the 30 probesets of highest bi-informativeness (Bi-Majority-30).
| | δ-DLDA-30 | δ-DLDA-11 | DLDA-30 | Bi-Majority-30 |
|---|---|---|---|---|
| Accuracy | 0.670 | 0.659 | 0.725 | 0.681 |
| Sensitivity | 0.632 | 0.632 | 0.632 | 0.579 |
| Specificity | 0.681 | 0.667 | 0.750 | 0.708 |
| PPV | 0.343 | 0.333 | 0.400 | 0.343 |
| NPV | 0.875 | 0.873 | 0.885 | 0.864 |
Three-fold cross-validation average performances of the δ-DLDA-11 signature, applied to the two test datasets. The table contains the results of our predictive modeling obtained with the 11-probeset signature (δ-DLDA-11) predicted on Dataset VI (82 training samples). The predictor was applied to both test datasets following a 3-fold cross-validation experimental protocol (Methods: 3-fold cross-validation). Average performances and standard deviations are reported.
| | DATASET VI (51 TEST SAMPLES) | DATASET VII (91 TEST SAMPLES) |
|---|---|---|
| Accuracy | 0.84 ± 0.12 | 0.61 ± 0.15 |
| Sensitivity | 0.92 ± 0.18 | 0.72 ± 0.30 |
| Specificity | 0.81 ± 0.16 | 0.58 ± 0.17 |
| PPV | 0.68 ± 0.22 | 0.31 ± 0.15 |
| NPV | 0.97 ± 0.06 | 0.90 ± 0.11 |
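The five metrics reported throughout these tables follow the standard confusion-matrix definitions, with PCR as the positive class. A minimal sketch (the function name and toy labels are ours, not from the paper):

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics as reported in the tables (positive class = 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true-positive rate among responders
        "specificity": tn / (tn + fp),   # true-negative rate among non-responders
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }

# toy example: 4 true responders (1) and 6 non-responders (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
m = binary_metrics(y_true, y_pred)
```

Note how PPV can be low even when accuracy is high: with a minority positive class (as for PCR), a handful of false positives sharply dilutes the positive predictions, which is visible in the Dataset VII columns above.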
Figure 3. Scatterplot of the probesets.
Notes: X-axis: probesets’ contributions (δ values) to the interclass distance (PCR and No-PCR classes). Y-axis: probesets’ P-values of the Student t-test. Both quantities were computed on the 133 tumor samples of Dataset VI.
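The two per-probeset quantities plotted in Figure 3 can be computed as follows, taking δ to be the normalized squared centroid difference (our assumed reading of the score) and using scipy's two-sample t-test. The data here are synthetic; 133 samples echoes Dataset VI, while the 500 probesets and class proportions are illustrative:

```python
import numpy as np
from scipy import stats

# synthetic expression matrix: 133 samples x 500 mock probesets,
# with a minority positive (PCR) class and ten informative probesets
rng = np.random.default_rng(3)
X = rng.normal(size=(133, 500))
y = np.array([1] * 33 + [0] * 100)
X[y == 1, :10] += 2.0

# X-axis of Figure 3: delta contribution of each probeset to the
# interclass (centroid) distance, normalized to sum to 1 (assumption)
mu_diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
delta = mu_diff ** 2 / (mu_diff ** 2).sum()

# Y-axis of Figure 3: two-sided Student t-test P-value per probeset
pvals = np.array([stats.ttest_ind(X[y == 1, j], X[y == 0, j]).pvalue
                  for j in range(X.shape[1])])
```

Informative probesets end up in the low-P-value, high-δ corner of the scatterplot, which is the region the δ-filter prioritizes.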