Chao Yang, Yu-Tian Wang, Chun-Hou Zheng.
Abstract
Availability of diverse types of high-throughput data increases the opportunities for researchers to develop computational methods that provide a more comprehensive view of the mechanisms and therapy of cancer. One fundamental goal in oncology is to divide patients into subtypes with clinical and biological significance. Cluster ensemble methods fit this task well: they can improve the performance and robustness of clustering by combining multiple basic clustering results. However, many existing cluster ensemble methods use a co-association matrix to summarize the instance-cluster co-occurrence statistics, so relationships in the integration are captured only at a coarse level. Moreover, the relationship among clusters is completely ignored. Finding these missing associations could greatly expand the ability of cluster ensemble methods for cancer subtyping. In this paper, we propose RWCE (Random Walk based Cluster Ensemble), which takes similarity among clusters into account. We first obtain a refined similarity between clusters by using a random walk and a scaled exponential similarity kernel. Then, modeled as a bipartite graph, a more informative instance-cluster association matrix filled with this cluster similarity is fed into a spectral clustering algorithm to get the final clustering result. We applied our method to six cancer types from The Cancer Genome Atlas (TCGA) and to breast cancer from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC). Experimental results show that our method is competitive with existing methods. A further case study demonstrates that our method has the potential to find subtypes with clinical and biological significance.
Keywords: cancer subtypes; cluster ensemble; random walk; refined similarity
Year: 2019 PMID: 30669418 PMCID: PMC6356971 DOI: 10.3390/genes10010066
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
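The co-association matrix that the abstract attributes to existing cluster ensemble methods can be illustrated with a minimal sketch. This is the standard textbook construction (the fraction of basic clusterings in which two instances share a cluster), not the RWCE refinement itself; the function name and the toy label lists are hypothetical.

```python
def co_association(base_clusterings):
    """Return an n x n matrix whose (i, j) entry is the fraction of
    base clusterings that place instances i and j in the same cluster."""
    n = len(base_clusterings[0])
    m = len(base_clusterings)
    counts = [[0] * n for _ in range(n)]
    for labels in base_clusterings:
        if len(labels) != n:
            raise ValueError("every clustering must label all n instances")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    # Divide once at the end to avoid accumulating rounding error.
    return [[c / m for c in row] for row in counts]


# Hypothetical toy input: three basic clusterings of four instances.
toy_clusterings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
matrix = co_association(toy_clusterings)
```

In this toy case instances 0 and 1 always co-occur, while instances 0 and 3 never do. The matrix records only pairwise co-occurrence of instances and carries no information about how similar two different clusters are, which is the gap the abstract says RWCE targets.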
Figure 1. Schematic diagram of the Random Walk based Cluster Ensemble (RWCE) pipeline: (A) a traditional clustering algorithm (K-means here) was applied to each molecular data type to obtain M basic clusterings. For each basic clustering, the cluster number was randomly chosen from 2 to √n; (B) each data type’s M clusterings were fused into one consensus clustering by RWCE refinement; (C) all data types’ consensus clusterings were fused into one final clustering by applying RWCE refinement again.
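Step (A) of the caption draws a cluster number between 2 and √n for each basic clustering. A minimal sketch of that sampling step follows. It assumes uniform sampling, inclusive bounds, and rounding √n down, none of which the caption specifies; the function name and the `seed` parameter are hypothetical.

```python
import math
import random


def sample_cluster_counts(n_samples, n_clusterings, seed=None):
    """Draw one cluster count per basic clustering from [2, floor(sqrt(n))].

    Assumptions (not stated in the source): uniform sampling,
    inclusive bounds, and flooring of sqrt(n).
    """
    rng = random.Random(seed)
    upper = max(2, math.isqrt(n_samples))  # guard against very small n
    return [rng.randint(2, upper) for _ in range(n_clusterings)]


# Hypothetical example: 100 patients, M = 20 basic clusterings.
cluster_counts = sample_cluster_counts(100, 20, seed=0)
```

Each sampled value would then be passed as the number of clusters to one K-means run, so that the M basic clusterings differ in granularity.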
Figure 2. Stacked histogram displaying, for each clustering method (PINS: perturbation clustering for data integration and disease subtyping; ECC: entropy-based consensus clustering; LCE: link-based cluster ensemble; CC: consensus clustering; RWCE: random walk based cluster ensemble), the number of times it passed the significance test (p-value < 0.05) of survival analysis on several molecular data types: mRNA expression data (mRNA), DNA methylation data (Methy), miRNA expression data (miRNA) and an integration of all three data types (integration).
Performance of RWCE on three molecular data types and their integration across six cancer types from The Cancer Genome Atlas (TCGA).
| Cancer type | mRNA | Methylation | miRNA | Integration |
|---|---|---|---|---|
| KIRC | | 0.79397(3) | 0.52883(2) | |
| GBM | 0.19041(2) | | 0.96568(2) | |
| LAML | | 0.58721(2) | | |
| LUSC | 0.40747(3) | | | |
| BRCA | | 0.58412(2) | 0.15534(2) | |
| COAD | | 0.68703(2) | 0.81886(6) | |
KIRC (kidney renal clear cell carcinoma); GBM (glioblastoma multiforme); LAML (acute myeloid leukemia); LUSC (lung squamous cell carcinoma); BRCA (breast invasive carcinoma); COAD (colon adenocarcinoma). p < 0.05 is highlighted in bold.
Figure 3. Heatmap of the silhouette values obtained by the different methods on the six TCGA datasets. KIRC-mRNA indicates that the mRNA expression data of KIRC was used; the other labels follow the same pattern.
Figure 4. Survival curves for the TCGA glioblastoma multiforme (GBM) subtypes generated by RWCE.
Figure 5. (A–C) Survival analysis of GBM patients treated with temozolomide (TMZ) in the different subtypes generated by RWCE; (D) age distribution of the GBM subtypes generated by RWCE.
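For readers unfamiliar with the silhouette value shown in the heatmap, the following is a minimal sketch of its standard definition: for each point, s = (b − a) / max(a, b), where a is the mean distance to the other points of its own cluster and b is the smallest mean distance to any other cluster. Euclidean distance and the toy points are assumptions made for illustration; the source does not state which distance was used.

```python
import math


def silhouette_values(points, labels):
    """Per-point silhouette values under Euclidean distance (assumed)."""
    n = len(points)
    values = []
    for i in range(n):
        dist_sum, count = {}, {}
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            dist_sum[labels[j]] = dist_sum.get(labels[j], 0.0) + d
            count[labels[j]] = count.get(labels[j], 0) + 1
        own = labels[i]
        if own not in count:
            values.append(0.0)  # singleton cluster: silhouette is 0 by convention
            continue
        a = dist_sum[own] / count[own]
        others = [dist_sum[c] / count[c] for c in count if c != own]
        if not others:
            raise ValueError("silhouette needs at least two clusters")
        b = min(others)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Hypothetical toy data: two well-separated pairs of points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_values(toy_points, [0, 0, 1, 1])
bad = silhouette_values(toy_points, [0, 1, 0, 1])
```

Values close to 1 (as in `good`) indicate compact, well-separated clusters, while negative values (as in `bad`) indicate points that sit closer to another cluster than to their own.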
Cox p-value and concordance index (CI) of subtypes discovered by PAM50, perturbation clustering for data integration and disease subtyping (PINS), consensus clustering (CC), entropy-based consensus clustering (ECC), link-based cluster ensemble (LCE), and our method on METABRIC data. For each discovery and validation cohort, we calculated the p-value and CI with respect to disease-free survival (DFS) and overall survival of the patients. For each row, the best p-value (most significant) and the best CI (highest) are in red. The numbers of clusters in the discovery and validation cohorts are shown after the name of each clustering method.
| Cohort | Metric | Endpoint | PAM50 (5, 5) | PINS (14, 7) | CC (10, 8) | ECC (10, 10) | LCE (10, 8) | RWCE (6, 6) |
|---|---|---|---|---|---|---|---|---|
| Discovery | Cox p-value | DFS | 3.00 × 10⁻¹¹ | 6.50 × 10⁻¹⁰ | 2.50 × 10⁻⁵ | 1.39 × 10⁻¹ | 9.50 × 10⁻¹ | 1.69 × 10⁻⁹ |
| Discovery | Cox p-value | Overall | 8.50 × 10⁻⁵ | 1.90 × 10⁻⁶ | 8.10 × 10⁻⁶ | 5.59 × 10⁻² | 4.42 × 10⁻¹ | 4.16 × 10⁻¹² |
| Discovery | CI | DFS | 0.620 | 0.634 | 0.598 | 0.521 | 0.506 | 0.594 |
| Discovery | CI | Overall | 0.578 | 0.598 | 0.572 | 0.529 | 0.508 | 0.641 |
| Validation | Cox p-value | DFS | 3.10 × 10⁻⁹ | 4.30 × 10⁻⁵ | 1.20 × 10⁻² | 2.61 × 10⁻¹ | 8.44 × 10⁻² | 9.12 × 10⁻⁵ |
| Validation | Cox p-value | Overall | 2.90 × 10⁻⁵ | 3.80 × 10⁻³ | 7.90 × 10⁻³ | 1.66 × 10⁻¹ | 3.53 × 10⁻² | 9.13 × 10⁻⁷ |
| Validation | CI | DFS | 0.636 | 0.589 | 0.572 | 0.521 | 0.520 | 0.560 |
| Validation | CI | Overall | 0.561 | 0.545 | 0.538 | 0.519 | 0.514 | 0.607 |
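The concordance index reported in the table can be read as the fraction of comparable patient pairs whose predicted risks are ordered consistently with their observed survival. The sketch below implements Harrell's C, a common definition; whether the authors used this exact variant, and how they derived risk scores from subtype labels, is not stated in the source, so the function and the toy data are illustrative assumptions.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when patient i had an observed event
    and a strictly shorter time than patient j. Tied risks count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: survival times, event flags (1 = event, 0 = censored),
# and made-up risk scores.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.6, 0.7, 0.1]
ci = concordance_index(toy_times, toy_events, toy_risks)  # 4 of 5 pairs -> 0.8
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is why the table treats the highest CI in each row as the best.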