Shiqian Ma, Daniel Johnson, Cody Ashby, Donghai Xiong, Carole L Cramer, Jason H Moore, Shuzhong Zhang, Xiuzhen Huang.
Abstract
It is challenging to cluster cancer patients of a certain histopathological type into molecular subtypes of clinical importance and identify gene signatures directly relevant to the subtypes. Current clustering approaches have inherent limitations, which prevent them from gauging the subtle heterogeneity of the molecular subtypes. In this paper we present a new framework: SPARCoC (Sparse-CoClust), which is based on a novel Common-background and Sparse-foreground Decomposition (CSD) model and the Maximum Block Improvement (MBI) co-clustering technique. SPARCoC has clear advantages compared with widely-used alternative approaches: hierarchical clustering (Hclust) and nonnegative matrix factorization (NMF). We apply SPARCoC to the study of lung adenocarcinoma (ADCA), an extremely heterogeneous histological type, and a significant challenge for molecular subtyping. For testing and verification, we use high quality gene expression profiling data of lung ADCA patients, and identify prognostic gene signatures which could cluster patients into subgroups that are significantly different in their overall survival (with p-values < 0.05). Our results are only based on gene expression profiling data analysis, without incorporating any other feature selection or clinical information; we are able to replicate our findings with completely independent datasets. SPARCoC is broadly applicable to large-scale genomic data to empower pattern discovery and cancer gene identification.
Year: 2015 PMID: 25768286 PMCID: PMC4359112 DOI: 10.1371/journal.pone.0117135
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Performance of NMF (with k = 2 sample clusters) on the original matrices M or the normalized matrices L, compared with its performance on the CSD-decomposed matrices M_Y or L_Y.
| Datasets | NMF run (10 runs per dataset) | p-value (5-year overall survival) | Statistically valid (1: p-value < 0.05; 0: otherwise) |
|---|---|---|---|
| | 1 | 0.0483 | 1 |
| | 2 | 0.0483 | 1 |
| | 3 | 0.0483 | 1 |
| | 4 | 0.0483 | 1 |
| | 5 | 0.0483 | 1 |
| | 6 | 0.0483 | 1 |
| | 7 | 0.0483 | 1 |
| | 8 | 0.0483 | 1 |
| | 9 | 0.0483 | 1 |
| | 10 | 0.0483 | 1 |
| | 1 | 0.0195 | 1 |
| | 2 | 0.0302 | 1 |
| | 3 | 0.0447 | 1 |
| | 4 | 0.0446 | 1 |
| | 5 | 0.0196 | 1 |
| | 6 | 0.0447 | 1 |
| | 7 | 0.0447 | 1 |
| | 8 | 0.0447 | 1 |
| | 9 | 0.0301 | 1 |
| | 10 | 0.0195 | 1 |
| | 1 | 0.5361 | 0 |
| | 2 | 0.5361 | 0 |
| | 3 | 0.5361 | 0 |
| | 4 | 0.5361 | 0 |
| | 5 | 0.5361 | 0 |
| | 6 | 0.5361 | 0 |
| | 7 | 0.5361 | 0 |
| | 8 | 0.5361 | 0 |
| | 9 | 0.5361 | 0 |
| | 10 | 0.5361 | 0 |
| | 1 | 0.1201 | 0 |
| | 2 | 0.1326 | 0 |
| | 3 | 0.1121 | 0 |
| | 4 | 0.1135 | 0 |
| | 5 | 0.1381 | 0 |
| | 6 | 0.1201 | 0 |
| | 7 | 0.1201 | 0 |
| | 8 | 0.1326 | 0 |
| | 9 | 0.1291 | 0 |
| | 10 | 0.1201 | 0 |
| | 1 | 0.0019 | 1 |
| | 2 | 0.0041 | 1 |
| | 3 | 0.0061 | 1 |
| | 4 | 0.0031 | 1 |
| | 5 | 0.0041 | 1 |
| | 6 | 0.0031 | 1 |
| | 7 | 0.0031 | 1 |
| | 8 | 0.0052 | 1 |
| | 9 | 0.0019 | 1 |
| | 10 | 0.0041 | 1 |
Note: because NMF starts from random seeds, 10 runs were performed for each dataset, and the run with the lowest p-value (by survival log-rank analysis) was highlighted. On the ACC stage1 dataset, NMF clustering performs better (i.e., yields smaller p-values) on M_Y than on M, where M is the original gene expression matrix and M_Y is the sparse matrix from the CSD decomposition. On the Jacob stage1 dataset, NMF clustering performs much better on L_Y than on L or on M, where M is the original Jacob gene expression matrix, L is the normalized matrix, and L_Y is the sparse matrix from the CSD decomposition.
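The p-values in this table come from a log-rank test comparing the overall survival of the two patient clusters. The sketch below is only an illustration of that evaluation step, not the authors' code: it implements a standard two-group log-rank test in plain Python, censors follow-up at an assumed 5-year (60-month) horizon, and keeps the lowest p-value across several candidate clusterings. The function names `censor_at`, `logrank_p_value`, and `best_run`, the 60-month horizon, and the toy survival data are all assumptions made for this example.

```python
import math


def censor_at(times, events, horizon):
    """Cap follow-up at `horizon`; events after the horizon become censored."""
    capped_times = [min(t, horizon) for t in times]
    capped_events = [e if t <= horizon else 0 for t, e in zip(times, events)]
    return capped_times, capped_events


def logrank_p_value(times, events, groups):
    """Two-group log-rank test; returns the p-value (chi-square, 1 df).

    times  : follow-up time per patient
    events : 1 if death was observed, 0 if censored
    groups : cluster label per patient (0 or 1)
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        # Patients still at risk just before time t (all, and in cluster 1).
        n = sum(1 for x in times if x >= t)
        n1 = sum(1 for x, g in zip(times, groups) if x >= t and g == 1)
        # Deaths observed exactly at time t (all, and in cluster 1).
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        d1 = sum(1 for x, e, g in zip(times, events, groups)
                 if x == t and e == 1 and g == 1)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of d1 given the margins.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def best_run(times, events, labelings, alpha=0.05):
    """Return the lowest p-value over several clusterings and its validity flag."""
    p_values = [logrank_p_value(times, events, g) for g in labelings]
    best = min(p_values)
    return best, int(best < alpha)


# Toy data (synthetic, in months): five early deaths, five long-term survivors.
raw_times = [6, 8, 10, 12, 14, 40, 50, 60, 70, 80]
raw_events = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
times, events = censor_at(raw_times, raw_events, horizon=60)

good = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # separates early deaths from survivors
mixed = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # ignores the survival pattern

p_good = logrank_p_value(times, events, good)
p_mixed = logrank_p_value(times, events, mixed)
best_p, valid = best_run(times, events, [mixed, good])
print(p_good, p_mixed, best_p, valid)
```

Here each entry of `labelings` stands in for the cluster assignment produced by one randomly seeded run; in practice a survival library would normally replace the hand-written test.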
Performance comparison of NMF (k = 2, i.e., 2 sample clusters) and MBI (k1 = 100, k2 = 2) on different stage I lung ADCA datasets.
| Datasets | NMF clusters: p-value (5-year overall survival) | NMF clusters: statistically valid (1: p-value < 0.05; 0: otherwise) | MBI clusters: p-value (5-year overall survival) | MBI clusters: statistically valid (1: p-value < 0.05; 0: otherwise) |
|---|---|---|---|---|
| | 0.0483 | 1 | 0.0195 | 1 |
| | 0.0195 | 1 | 0.0164 | 1 |
| | 0.5361 | 0 | 0.0045 | 1 |
| | 0.1201 | 0 | 0.0011 | 1 |
| | 0.0019 | 1 | 0.0018 | 1 |
Note: because NMF and MBI both start from random seeds, 10 runs were performed for each dataset, and the lowest p-value (by survival log-rank analysis) among the 10 runs was selected. MBI is more robust than NMF: it yields statistically valid clusters on every dataset, and when both methods achieve statistically valid clusters, the MBI clusters have smaller log-rank p-values, which suggests that MBI is the better clustering model.
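The comparison in this note can be checked directly against the table. The short sketch below copies the best-of-10 p-values row by row, recomputes the "statistically valid" flags at the 0.05 threshold, and tests whether MBI has the smaller p-value whenever both methods are valid. The variable names are hypothetical and chosen only for this example.

```python
ALPHA = 0.05

# Best-of-10 p-values copied from the table, one entry per dataset row.
nmf_p = [0.0483, 0.0195, 0.5361, 0.1201, 0.0019]
mbi_p = [0.0195, 0.0164, 0.0045, 0.0011, 0.0018]

# 1 if the clustering separates survival at the 0.05 level, else 0.
nmf_valid = [int(p < ALPHA) for p in nmf_p]
mbi_valid = [int(p < ALPHA) for p in mbi_p]

# Rows where both methods produced statistically valid clusters.
both_valid = [(n, m) for n, m in zip(nmf_p, mbi_p) if n < ALPHA and m < ALPHA]
mbi_smaller_when_both_valid = all(m < n for n, m in both_valid)

print(nmf_valid)                    # [1, 1, 0, 0, 1]
print(mbi_valid)                    # [1, 1, 1, 1, 1]
print(mbi_smaller_when_both_valid)  # True
```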