Vincenzo Lagani, George Kortas, Ioannis Tsamardinos.
Abstract
Biomarker signature identification in "omics" data is a complex challenge that requires specialized feature selection algorithms. The objective of these algorithms is to select the smallest set(s) of molecular quantities able to predict a given outcome (target) with maximal predictive performance. This task is even more challenging when the outcome comprises multiple classes; for example, one may be interested in identifying the genes whose expression allows discrimination among different types of cancer (nominal outcome) or among different stages of the same cancer, e.g., Stages 1, 2, 3 and 4 of lung adenocarcinoma (ordinal outcome). In this work, we consider a particular type of successful feature selection method, namely constraint-based, local causal discovery algorithms. These algorithms rely on a series of conditional independence tests. We extend them to problems with continuous predictors and multi-class outcomes by developing and equipping them with a conditional independence test procedure suited to both nominal and ordinal multi-class targets. The test is based on multinomial logistic regression and employs the log-likelihood ratio test for model selection. We present a comparative experimental evaluation on seven real-world, high-dimensional gene-expression datasets. Within the scope of our analysis, the results indicate that the new conditional independence test allows the identification of smaller and better-performing signatures for multi-class outcome datasets than the current alternatives for performing the independence tests.
Keywords: Biomarker Signature Identification; Constraint-based Methods; Graphical Models; High Dimensional Data; Multiple Outcomes Studies; “Omics” Data
Year: 2013 PMID: 24688712 PMCID: PMC3962136 DOI: 10.5936/csbj.201303004
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
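The test described in the abstract can be illustrated with a minimal sketch. It compares a multinomial logistic regression of the target on the conditioning variables alone against one that also includes the candidate predictor, and refers twice the log-likelihood difference to a chi-square distribution. Everything below is an assumption made for illustration and is not the authors' implementation: the function names (`chi2_sf`, `max_loglik`, `lrt_ci_test`, `demo`), the plain gradient-ascent fit, the synthetic data, and the restriction to a nominal target (the ordinal variant is not covered).

```python
import math
import random


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0.0:
        return 1.0
    t = x / 2.0
    # Closed-form starting points: Q(1, t) for even df, Q(1/2, t) for odd df.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-t)
    else:
        a, q = 0.5, math.erfc(math.sqrt(t))
    # Recurrence Q(a + 1, t) = Q(a, t) + t^a e^(-t) / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(a * math.log(t) - t - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def class_probs(weights, x):
    """Softmax probabilities; class 0 is the reference with logit 0."""
    logits = [0.0] + [sum(w * v for w, v in zip(row, x)) for row in weights]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_loglik(rows, labels, n_classes, steps=200, lr=0.3):
    """Fit multinomial logistic regression by plain gradient ascent and
    return the log-likelihood reached (an approximation of the maximum)."""
    feats = [[1.0] + list(r) for r in rows]  # prepend an intercept
    n, p = len(feats), len(feats[0])
    weights = [[0.0] * p for _ in range(n_classes - 1)]
    for _ in range(steps):
        grad = [[0.0] * p for _ in range(n_classes - 1)]
        for x, y in zip(feats, labels):
            probs = class_probs(weights, x)
            for k in range(1, n_classes):
                err = (1.0 if y == k else 0.0) - probs[k]
                for j in range(p):
                    grad[k - 1][j] += err * x[j]
        for k in range(n_classes - 1):
            for j in range(p):
                weights[k][j] += lr * grad[k][j] / n
    return sum(math.log(class_probs(weights, x)[y])
               for x, y in zip(feats, labels))


def lrt_ci_test(x, labels, cond, n_classes):
    """p-value for H0: x is independent of the target given the columns of cond."""
    reduced = max_loglik(cond, labels, n_classes)
    full = max_loglik([list(c) + [v] for c, v in zip(cond, x)],
                      labels, n_classes)
    stat = max(0.0, 2.0 * (full - reduced))
    # One extra coefficient per non-reference class.
    return chi2_sf(stat, n_classes - 1)


def demo(seed=0, n_per_class=25):
    """Synthetic 3-class data (hypothetical): one informative, one noise predictor."""
    rng = random.Random(seed)
    labels = [k for k in range(3) for _ in range(n_per_class)]
    cond = [[rng.gauss(0.0, 1.0)] for _ in labels]  # conditioning variable
    informative = [rng.gauss(2.0 * (y - 1), 1.0) for y in labels]
    noise = [rng.gauss(0.0, 1.0) for _ in labels]
    return (lrt_ci_test(informative, labels, cond, 3),
            lrt_ci_test(noise, labels, cond, 3))


p_informative, p_noise = demo()
print(f"informative predictor: p = {p_informative:.3g}")
print(f"noise predictor:       p = {p_noise:.3g}")
```

In a constraint-based algorithm, a small p-value would lead to rejecting conditional independence and keeping the predictor as a candidate. A production version would more likely rely on an established regression routine than on the fixed-step optimizer used here.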
Table 1. Characteristics of the datasets employed in the experiments. For each dataset the following information is reported: Gene Expression Omnibus (GEO) ID, disease investigated in the study, type of outcome (categorical or ordinal), and number of classes, samples and variables.
| GEO ID | Disease | Outcome | #Classes | #Samples | #Variables |
|---|---|---|---|---|---|
| GDS1329 | Breast Cancer | Categorical | 3 | 49 | 22215 |
| GDS1962 | Glioma | Ordinal | 4 | 180 | 54613 |
| GDS2373 | Squamous cell cancer | Ordinal | 3 | 130 | 22284 |
| GDS2547 | Prostate cancer | Categorical | 4 | 164 | 12646 |
| GDS2855 | Muscle diseases | Categorical | 3 | 71 | 22645 |
| GDS3233 | Cervical cancer | Categorical | 3 | 61 | 22283 |
| GDS3257 | Adenocarcinoma | Ordinal | 3 | 101 | 22288 |
Table 2. Nested cross-validation results. Performance is reported as mean accuracy (averaged over the outer loop of the nested cross-validation) ± standard deviation. Values in parentheses are Binomial test p-values comparing the respective accuracy against the corresponding MC-CIT performance. Columns “MC-CIT”, “Fisher Z test” and “G2 test” report the best accuracies obtained by the MMPC algorithm coupled with the respective conditional independence test. Column “Lasso” reports the best accuracies obtained by the Lasso feature selection method, while the last column reports the baseline accuracy obtained by always predicting the most frequent class (“Trivial classifier”). All methods performed similarly to the Trivial classifier on dataset GDS2373, so its results are not shown. The last row reports the accuracies obtained by pooling together the predictions over all datasets.
| Dataset | MC-CIT | Fisher Z test | G2 test | Lasso | Trivial classifier |
|---|---|---|---|---|---|
| GDS1329 | 0.959 ± 0.057 | 0.958 ± 0.065 (1.000) | 0.958 ± 0.065 (1.000) | 1.000 ± 0.000 (0.500) | 0.551 ± 0.047 (0.000) |
| GDS1962 | 0.651 ± 0.061 | 0.650 ± 0.102 (1.000) | 0.648 ± 0.092 (1.000) | 0.699 ± 0.079 (0.108) | 0.451 ± 0.024 (0.000) |
| GDS2547 | 0.656 ± 0.092 | 0.634 ± 0.116 (0.678) | 0.712 ± 0.147 (0.389) | 0.724 ± 0.084 (0.228) | 0.391 ± 0.023 (0.000) |
| GDS2855 | 0.861 ± 0.112 | 0.652 ± 0.222 (0.003) | 0.766 ± 0.159 (0.189) | 0.861 ± 0.090 (1.000) | 0.408 ± 0.035 (0.000) |
| GDS3233 | 0.984 ± 0.048 | 0.890 ± 0.114 (0.031) | 0.981 ± 0.056 (1.000) | 0.968 ± 0.065 (1.000) | 0.460 ± 0.038 (0.000) |
| GDS3257 | 0.644 ± 0.096 | 0.607 ± 0.147 (0.523) | 0.657 ± 0.152 (1.000) | 0.642 ± 0.100 (1.000) | 0.446 ± 0.042 (0.005) |
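The source does not spell out how the Binomial test on accuracies was set up. One common way to compare two classifiers evaluated on the same samples is an exact, McNemar-style sign test on the discordant predictions, sketched below. The choice of this particular test, the function names (`discordant_counts`, `exact_binomial_p`) and the toy labels are assumptions for illustration only and are not taken from the paper.

```python
from math import comb


def discordant_counts(truth, pred_a, pred_b):
    """Count samples on which exactly one of the two classifiers is correct."""
    only_a = only_b = 0
    for t, a, b in zip(truth, pred_a, pred_b):
        if a == t and b != t:
            only_a += 1
        elif b == t and a != t:
            only_b += 1
    return only_a, only_b


def exact_binomial_p(only_a, only_b):
    """Two-sided exact p-value under H0: each discordant sample favours
    either classifier with probability 0.5."""
    n = only_a + only_b
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical out-of-fold predictions for eight samples.
truth = [0, 1, 2, 0, 1, 2, 0, 1]
pred_a = [0, 1, 2, 0, 1, 2, 0, 1]  # classifier A: all correct
pred_b = [0, 1, 2, 0, 1, 1, 0, 2]  # classifier B: two errors

only_a, only_b = discordant_counts(truth, pred_a, pred_b)
print(only_a, only_b, exact_binomial_p(only_a, only_b))  # prints: 2 0 0.5
```

Because only discordant samples carry information under this scheme, two methods with very similar accuracies tend to yield p-values close to 1, while a handful of one-sided disagreements is enough to produce a small p-value.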
Table 3. Number of features selected in the outer loop of the nested cross-validation procedure. Results are reported as mean ± standard deviation. Values in parentheses are p-values from a two-tailed t-test comparing MC-CIT against each of the other methods. Column names follow the same schema as in Table 2. The last row reports the averages calculated over all datasets.
| Dataset | MC-CIT | Fisher Z test | G2 test | Lasso |
|---|---|---|---|---|
| GDS1329 | 2.000 ± 0.000 | 3.167 ± 0.408 (0.000) | 35.833 ± 16.940 (0.001) | 15.833 ± 4.446 (0.000) |
| GDS1962 | 6.100 ± 0.568 | 6.600 ± 0.699 (0.096) | 8.200 ± 1.476 (0.001) | 40.200 ± 35.888 (0.008) |
| GDS2547 | 14.200 ± 2.741 | 7.800 ± 1.033 (0.000) | 6.100 ± 0.568 (0.000) | 53.000 ± 27.897 (0.000) |
| GDS2855 | 3.700 ± 0.949 | 7.300 ± 1.252 (0.000) | 28.500 ± 3.375 (0.000) | 30.400 ± 10.233 (0.000) |
| GDS3233 | 2.000 ± 0.000 | 5.778 ± 1.202 (0.000) | 34.222 ± 5.518 (0.000) | 18.000 ± 5.196 (0.000) |
| GDS3257 | 4.900 ± 0.876 | 4.200 ± 1.549 (0.229) | 4.700 ± 1.160 (0.669) | 26.200 ± 20.531 (0.004) |