Vincent Guillemot, Arthur Tenenhaus, Laurent Le Brusquet, Vincent Frouin.
Abstract
Integrating gene regulatory networks (GRNs) into the classification of DNA microarrays is an important issue in bioinformatics, both because this information is of genuine biological interest and because it helps in interpreting the final classifier. We present a method called graph-constrained discriminant analysis (gCDA), which integrates the information contained in one or several GRNs into a classification procedure. We show that when the integrated graph includes erroneous information, gCDA's performance degrades only slightly, which indicates robustness to misspecification of the given GRNs. The gCDA framework also allows the classification process to take into account as many a priori graphs as there are classes in the dataset. The gCDA procedure was applied to simulated data and to three publicly available microarray datasets. gCDA performs competitively when compared to state-of-the-art classification methods. The software package gcda, along with the real datasets used in this study, is available online: http://biodev.cea.fr/gcda/.
Year: 2011 PMID: 22022543 PMCID: PMC3195079 DOI: 10.1371/journal.pone.0026146
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Graph used to generate simulated data: an Erdős-Rényi graph.
Results using simulated datasets.
| Setting | RDA | SVM | LP-SVM | NB-SVM | gCDA |
| | 66.12 (13.79) | 80.32 (6.55) | 69.97 (10.04) | 70.24 (10.54) | 88.74 (5.07) |
| | 76.00 (21.37) | 92.59 (3.58) | 70.91 (11.90) | 74.76 (9.70) | 96.56 (2.81) |
| | 65.26 (19.36) | 81.24 (7.21) | 70.56 (13.10) | 67.06 (8.79) | 93.38 (4.13) |
| | | | | | |
| | 71.44 (12.90) | 77.50 (6.43) | 71.97 (9.09) | 70.94 (9.06) | 80.29 (6.24) |
| | 70.59 (18.73) | 84.47 (5.76) | 71.59 (9.97) | 70.47 (9.79) | 86.65 (5.92) |
| | 72.35 (21.70) | 87.50 (5.44) | 73.65 (12.57) | 73.74 (11.77) | 92.56 (4.66) |
Mean percentage of correct classifications (standard deviation) over 100 Monte Carlo cross-validation (MCCV) iterations on simulated datasets. Settings are indexed by the number of variables; the number of individuals is fixed. Either the linear or the quadratic version of gCDA was used, depending on the setting.
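The "mean (standard deviation) over 100 MCCV iterations" reported in these tables can be illustrated with a minimal sketch of Monte Carlo cross-validation. This is not the authors' code: the nearest-centroid classifier is only a stand-in for gCDA, and the function names, the 30% test fraction, and the toy data are assumptions made for illustration.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier (NOT gCDA): assign each test point to the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every test point to every class centroid.
    dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dists.argmin(axis=1)]

def mccv_accuracy(X, y, n_iter=100, test_fraction=0.3, seed=0):
    """Return mean and standard deviation of the percentage of correct
    classifications over n_iter random train/test splits.
    test_fraction=0.3 is an assumed value, not taken from the source."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(1, int(round(test_fraction * n)))
    scores = []
    for _ in range(n_iter):
        perm = rng.permutation(n)  # fresh random split at every iteration
        test_idx, train_idx = perm[:n_test], perm[n_test:]
        y_pred = nearest_centroid_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(100.0 * np.mean(y_pred == y[test_idx]))
    scores = np.asarray(scores)
    return scores.mean(), scores.std(ddof=1)

# Toy two-class Gaussian data (hypothetical, unrelated to the paper's simulations).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (40, 5)), rng.normal(1.5, 1.0, (40, 5))])
y = np.repeat([0, 1], 40)
mean_acc, sd_acc = mccv_accuracy(X, y)
print(f"{mean_acc:.2f} ({sd_acc:.2f})")  # same "mean (sd)" layout as the tables
```

Unlike k-fold cross-validation, the random splits in MCCV may overlap from one iteration to the next, which is why the spread over iterations is reported alongside the mean.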
Figure 2. Histogram of the optimal values of the tuning parameter.
These values were selected by 10-fold cross-validation on simulated data (linear setting).
Figure 3. Plot of the classification performance as a function of the Hamming distance between the real graph and the graph integrated in gCDA.
For this part of the simulation study, the number of variables and the number of individuals are both held fixed.
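As a sketch of the quantity on the horizontal axis of Figure 3, the Hamming distance between two undirected graphs can be computed as the number of node pairs whose edge status differs. The definition used here and the helper `flip_random_edges` (a hypothetical way to produce a misspecified graph) are assumptions; the source does not state how the authors perturbed the graph.

```python
import numpy as np

def graph_hamming_distance(A, B):
    """Number of node pairs that are an edge in exactly one of the two
    undirected graphs, given as square adjacency matrices."""
    A = np.asarray(A, dtype=bool)
    B = np.asarray(B, dtype=bool)
    if A.shape != B.shape or A.ndim != 2 or A.shape[0] != A.shape[1]:
        raise ValueError("A and B must be square matrices of the same shape")
    iu = np.triu_indices(A.shape[0], k=1)  # each unordered pair counted once
    return int(np.sum(A[iu] != B[iu]))

def flip_random_edges(A, n_flips, seed=0):
    """Hypothetical perturbation: toggle n_flips distinct node pairs
    (add the edge if absent, remove it if present)."""
    rng = np.random.default_rng(seed)
    A = np.array(A, dtype=bool)  # work on a copy
    rows, cols = np.triu_indices(A.shape[0], k=1)
    chosen = rng.choice(len(rows), size=n_flips, replace=False)
    A[rows[chosen], cols[chosen]] ^= True
    A[cols[chosen], rows[chosen]] ^= True  # keep the matrix symmetric
    return A

A = np.zeros((5, 5), dtype=bool)
A[0, 1] = A[1, 0] = True
B = flip_random_edges(A, n_flips=3)
print(graph_hamming_distance(A, B))  # 3, since three distinct pairs were toggled
```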
Characteristics of the datasets.
| Outcome | Individuals per class | | Disease | Reference | Network inferred on |
| control/tumor | 30:12 | 97 | colon cancer | | The rest of the original dataset |
| control/tumor | 50:52 | 282 | prostate cancer | | Another dataset |
| relapse/no relapse | 69:69 | 325 | lung cancer | | Another dataset (GSE8332) |
Summary of the characteristics of each dataset. The colon-separated counts give the number of individuals in each class. The last column indicates whether the networks are inferred on an independent part of the dataset or on another dataset. In both cases, the dataset used to compute the networks is never used in the classification process.
Test on the covariance matrices.
| | Colon | Lung | Prostate |
| p-value | 0.26 | 0.65 | |
We tested each dataset to determine whether the covariance matrices are statistically similar. The test we chose is robust enough to handle cases in which the number of variables is of the same order as the number of individuals. The null hypothesis is that the covariance matrices are equal, and we rejected it when the p-value was below the threshold of 0.05.
Comparison of gCDA's performance with the performance of three other classification methods.
| | gCDA | RDA | SVM | NB-SVM |
| Colon | 79.36 (9.63) | 69.50 (13.62) | 75.07 (9.87) | 54.57 (22.83) |
| Lung | 55.93 (6.00) | 49.13 (6.68) | 55.02 (6.12) | 50.41 (6.09) |
| Prostate | 87.10 (5.59) | 64.88 (12.1) | 88.62 (5.38) | 56.12 (13.2) |
Comparison of the performance of gCDA with that of RDA, SVM and NB-SVM. For NB-SVM and gCDA, we chose to integrate the GRNs inferred with ARACNE. The table reports the mean (standard deviation) of the correct classification rate over 100 MCCV iterations.
Performance of the considered classification methods on three gene expression microarray datasets.
| | ridge.net | ARACNE | KEGG |
| Colon | 67.857 (11.77) | 70.357 (11.37) | 66.143 (12.17) |
| Lung | 59.413 (5.88) | 56.457 (6.31) | 56.37 (5.83) |
| Prostate | 87.441 (6.09) | 87.029 (5.40) | 84.353 (6.78) |
The graphs integrated in the classification methods NB-SVM and gCDA were either inferred with one of two methods, ridge.net and ARACNE, or extracted from KEGG. The table reports the mean (standard deviation) of the correct classification rate over 100 MCCV iterations.
Comparison of the integrated graphs.
| | Intersection | Intersection | Intersection | Union | Union | Union |
| Colon | 35 | 2 | 6 | 315 | 158 | 344 |
| Lung | 263 | 62 | 18 | 3204 | 3311 | 1680 |
| Prostate | 69 | 4 | 19 | 1099 | 1300 | 1979 |
Comparison of the structure of the integrated graphs obtained with ridge.net, ARACNE or KEGG. The table contains the number of edges in the intersection and the union. When two graphs were inferred, they were simply merged into a single graph.
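The edge counts in this comparison can be reproduced in principle with plain set operations. The sketch below assumes each graph is available as a list of undirected edges between gene identifiers; the function names and the gene pairs are illustrative only and are not taken from the paper's networks.

```python
def edge_set(edges):
    """Normalise undirected edges so that (a, b) and (b, a) are the same edge;
    self-loops are dropped."""
    return {frozenset(e) for e in edges if e[0] != e[1]}

def compare_graphs(edges_a, edges_b):
    """Return (number of shared edges, number of edges in the union)."""
    a, b = edge_set(edges_a), edge_set(edges_b)
    return len(a & b), len(a | b)

# Illustrative edge lists (hypothetical, not the inferred or KEGG networks).
graph_1 = [("TP53", "MDM2"), ("MYC", "MAX"), ("KRAS", "BRAF")]
graph_2 = [("MDM2", "TP53"), ("KRAS", "BRAF"), ("EGFR", "KRAS")]

print(compare_graphs(graph_1, graph_2))  # (2, 4)
```

Storing each edge as a `frozenset` makes the comparison independent of the order in which the two endpoints are listed, which matters when the graphs come from different tools.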
Linear gCDA applied on high dimensional microarray datasets.
| | SVM | RDA | linear gCDA |
| Lung | 59.74 (6.65) | 49.30 (6.93) | 60.44 (7.32) |
| Prostate | 84.68 (5.69) | 71.59 (10.35) | 85.06 (5.81) |
Application of gCDA to more than 1000 variables. Comparison of SVM, RDA and linear gCDA on the lung and prostate cancer datasets: mean (standard deviation) of the correct classification rate over 100 MCCV iterations.