Shuguang Ge, Xuesong Wang, Yuhu Cheng, Jian Liu.
Abstract
Integrating multigenomic data to recognize cancer subtypes is an important task in bioinformatics. In recent years, several multiview clustering algorithms have been proposed and applied to cancer subtype identification. However, these algorithms ignore the fact that each data type contributes differently to the clustering result during fusion, and they require an additional clustering step to generate the final labels. In this paper, a new one-step method for cancer subtype recognition based on a graph learning framework is designed, called Laplacian Rank Constrained Multiview Clustering (LRCMC). LRCMC first forms a graph for each type of biological data to reveal the relationships between data points and uses an affinity matrix to encode the graph structure. It then adds weights to measure the contribution of each graph and merges the individual graphs into a consensus graph. In addition, LRCMC constructs adaptive neighbors to adjust the similarity of sample points, and it uses a rank constraint on the Laplacian matrix to ensure that each graph structure has the same connected components. Experiments on several benchmark datasets and The Cancer Genome Atlas (TCGA) datasets demonstrate the effectiveness of the proposed algorithm compared with state-of-the-art methods.
Keywords: Laplacian Rank Constrained; cancer subtype recognition; graph learning; multiview clustering
Year: 2021 PMID: 33916856 PMCID: PMC8065670 DOI: 10.3390/genes12040526
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. The flow chart of Laplacian Rank Constrained Multiview Clustering.
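The pipeline summarized in the abstract (one affinity graph per data type, a weighted merge into a consensus graph, and a constraint on the number of connected components) can be illustrated on toy data. The following is a minimal sketch and not the authors' optimization procedure. It assumes a commonly used closed-form adaptive-neighbor affinity, fixes the view weights by hand instead of learning them, and uses hypothetical names (`adaptive_neighbor_graph`, `fuse`, `count_components`) and made-up data. It relies on the standard fact that the Laplacian of a graph with n nodes has rank n − c exactly when the graph has c connected components.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def adaptive_neighbor_graph(points, k):
    """Affinity matrix with at most k nonzero neighbors per row; rows sum to 1.

    Assumed closed form: s_ij = (d_{i,k+1} - d_ij) / (k * d_{i,k+1} - sum of
    the k smallest distances), for the k nearest neighbors j of sample i.
    Requires len(points) >= k + 2.
    """
    n = len(points)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # distances from sample i to every other sample, nearest first
        order = sorted((sq_dist(points[i], points[j]), j)
                       for j in range(n) if j != i)
        d_k1 = order[k][0]  # distance to the (k+1)-th nearest neighbor
        denom = k * d_k1 - sum(d for d, _ in order[:k])
        for d, j in order[:k]:
            S[i][j] = (d_k1 - d) / denom if denom > 0 else 1.0 / k
    return S

def fuse(graphs, weights):
    """Weighted average of symmetrized view graphs (weights are hand-picked here)."""
    n = len(graphs[0])
    total = sum(weights)
    U = [[0.0] * n for _ in range(n)]
    for S, w in zip(graphs, weights):
        for i in range(n):
            for j in range(n):
                U[i][j] += (w / total) * (S[i][j] + S[j][i]) / 2
    return U

def count_components(U, tol=1e-12):
    """Connected components of the graph whose edges are entries U[i][j] > tol."""
    n = len(U)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if U[i][j] > tol:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

# Hypothetical toy data: six samples, two views, two well-separated groups.
view1 = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
view2 = [(0.0,), (0.2,), (0.1,), (5.0,), (5.3,), (5.1,)]

graphs = [adaptive_neighbor_graph(v, k=2) for v in (view1, view2)]
consensus = fuse(graphs, weights=[0.6, 0.4])
n_components = count_components(consensus)
laplacian_rank = len(consensus) - n_components  # rank(L) = n - c
print(n_components, laplacian_rank)
```

In this sketch the component count is only checked after fusion. A rank-constrained method would instead adjust the consensus graph during optimization until it has the desired number of components, which is why no separate clustering step is needed.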
Overview of four benchmark datasets.
| Dataset | Samples | Views | Clusters | View 1 dim. | View 2 dim. | View 3 dim. | View 4 dim. | View 5 dim. | View 6 dim. |
|---|---|---|---|---|---|---|---|---|---|
| 3-source | 169 | 3 | 6 | 3560 | 3631 | 3638 | - | - | - |
| Calt-7 | 1474 | 6 | 7 | 48 | 40 | 254 | 1984 | 512 | 928 |
| MSRC | 210 | 5 | 7 | 48 | 100 | 256 | 1302 | 512 | - |
| WebKB | 203 | 3 | 4 | 1703 | 230 | 230 | - | - | - |
The clustering performance comparison in terms of ACC, NMI and Purity on the four real datasets.
| Datasets | Methods | ACC | NMI | Purity |
|---|---|---|---|---|
| 3-source | ANF | 0.4970 (0.0000) | 0.2804 (0.0000) | 0.5325 (0.0000) |
| 3-source | SNF | 0.7811 (0.0000) | 0.6942 (0.0000) | 0.8166 (0.0000) |
| 3-source | PFA | 0.4562 (0.0761) | 0.2247 (0.0713) | 0.7160 (0.0578) |
| 3-source | MVCMO | 0.4221 (0.0123) | 0.3035 (0.0128) | 0.5266 (0.0118) |
| 3-source | LRCMC | | | |
| Calt-7 | ANF | 0.6696 (0.0000) | 0.6203 (0.0000) | 0.8684 (0.0000) |
| Calt-7 | SNF | 0.6601 (0.0000) | 0.5637 (0.0000) | 0.8562 (0.0000) |
| Calt-7 | PFA | - | - | - |
| Calt-7 | MVCMO | 0.6654 (0.0100) | 0.5179 (0.0355) | 0.8464 (0.0083) |
| Calt-7 | LRCMC | | | |
| MSRC | ANF | 0.8048 (0.0000) | 0.7297 (0.0000) | 0.8143 (0.0000) |
| MSRC | SNF | 0.8429 (0.0000) | 0.7514 (0.0000) | 0.8429 (0.0000) |
| MSRC | PFA | - | - | - |
| MSRC | MVCMO | 0.7800 (0.0544) | 0.6711 (0.0628) | 0.7838 (0.0462) |
| MSRC | LRCMC | | | |
| WebKB | ANF | 0.6798 (0.0000) | 0.1718 (0.0000) | 0.6946 (0.0000) |
| WebKB | SNF | 0.7044 (0.0000) | 0.2407 (0.0000) | 0.7192 (0.0000) |
| WebKB | PFA | 0.7143 (0.0000) | 0.3191 (0.0000) | 0.8128 (0.0000) |
| WebKB | MVCMO | 0.7652 (0.0346) | 0.3548 (0.0448) | 0.7833 (0.0323) |
| WebKB | LRCMC | | | |
"-" means that the metric could not be calculated; the best results are highlighted in bold.
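For readers unfamiliar with the three scores, the sketch below computes ACC, NMI and Purity from a pair of label vectors using their usual textbook definitions. It is an illustration under stated assumptions, not the evaluation code behind the table: the NMI normalization (geometric mean of the two entropies) is one of several common conventions and the source does not say which was used, ACC is found by brute force over cluster-to-class mappings (adequate for a handful of clusters; the Hungarian algorithm is the usual choice at scale), and the function names and labels are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def purity(y_true, y_pred):
    """Fraction of samples that fall in the majority true class of their cluster."""
    pairs = Counter(zip(y_pred, y_true))
    best = {}
    for (cluster, _label), count in pairs.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(y_true)

def accuracy(y_true, y_pred):
    """Best agreement over one-to-one mappings from clusters to classes."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    size = max(len(classes), len(clusters))
    # Pad with dummy classes (None) that match no sample when there are
    # more clusters than classes.
    padded = classes + [None] * (size - len(classes))
    pairs = Counter(zip(y_pred, y_true))
    best = 0
    for perm in permutations(padded, len(clusters)):
        hits = sum(pairs.get((c, t), 0) for c, t in zip(clusters, perm))
        best = max(best, hits)
    return best / len(y_true)

def nmi(y_true, y_pred):
    """Mutual information divided by sqrt(H(true) * H(pred)) (assumed convention)."""
    n = len(y_true)
    p_true, p_pred = Counter(y_true), Counter(y_pred)
    joint = Counter(zip(y_true, y_pred))
    mi = sum(m / n * math.log(m * n / (p_true[t] * p_pred[c]))
             for (t, c), m in joint.items())
    h_true = -sum(m / n * math.log(m / n) for m in p_true.values())
    h_pred = -sum(m / n * math.log(m / n) for m in p_pred.values())
    if h_true == 0 or h_pred == 0:
        return 0.0
    return mi / math.sqrt(h_true * h_pred)

# Hypothetical labels: nine samples, three classes, one sample misassigned.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 0, 0, 2, 2, 2, 2]
print(accuracy(y_true, y_pred), nmi(y_true, y_pred), purity(y_true, y_pred))
```

All three scores are invariant to how the cluster labels are numbered, which is why a clustering that merely permutes the class labels still scores 1.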
Overview of the TCGA datasets.
| Datasets | Samples | mRNA Expression | DNA Methylation | miRNA Expression |
|---|---|---|---|---|
| GBM | 213 | 12,042 | 1305 | 534 |
| BIC | 105 | 17,814 | 23,094 | 354 |
| LSCC | 106 | 12,042 | 23,074 | 352 |
| COAD | 92 | 17,814 | 23,088 | 312 |
p values of survival analysis in Cox log-rank model for different clustering methods of four cancers on The Cancer Genome Atlas (TCGA) datasets.
| Methods | GBM | BIC | LSCC | COAD |
|---|---|---|---|---|
| ANF | 5.8 × 10⁻⁴ | 3.6 × 10⁻⁴ | 8.9 × 10⁻³ | 9.0 × 10⁻³ |
| SNF | 5.0 × 10⁻⁵ | 6.9 × 10⁻⁴ | 7.8 × 10⁻³ | 1.6 × 10⁻³ |
| PFA | 1.8 × 10⁻⁴ | 3.1 × 10⁻⁴ | 1.1 × 10⁻² | 2.4 × 10⁻² |
| MVCMO | 1.4 × 10⁻³ | 3.5 × 10⁻⁴ | 9.1 × 10⁻³ | 8.5 × 10⁻³ |
| LRCMC | | | | |
The best results have been highlighted in bold.
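The p values in this table measure how well the identified clusters separate patient survival. As a rough illustration of the underlying idea, the sketch below implements the classical two-group log-rank test with the standard library only. It is a simplification under stated assumptions: the table concerns several clusters per cancer, whereas this sketch compares just two groups; the function name `logrank_two_groups` and the survival times are hypothetical; and it is not the authors' analysis code.

```python
import math

def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p value).

    times_*:  follow-up time of each patient
    events_*: 1 if death was observed at that time, 0 if the patient was censored
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})

    obs_minus_exp = 0.0  # observed minus expected deaths in group A
    variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in data if u >= t)                 # at risk, both groups
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)    # at risk, group A
        d = sum(1 for u, e, _ in data if u == t and e == 1)      # deaths at t
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            # hypergeometric variance of the number of deaths in group A
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical survival times (months); every death is observed.
short_survivors = [1, 2, 3, 4, 5]
long_survivors = [10, 11, 12, 13, 14]
chi2, p = logrank_two_groups(short_survivors, [1] * 5, long_survivors, [1] * 5)
print(chi2, p)
```

A smaller p value indicates a clearer difference between the survival curves of the groups, which is the sense in which the methods in the table are compared.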
Figure 2. The Kaplan–Meier survival curves of (a) Glioblastoma Multiforme (GBM), (b) Breast Invasive Carcinoma (BIC), (c) Lung Squamous Cell Carcinoma (LSCC) and (d) Colon Adenocarcinoma (COAD).
The identified clusters are compared with mRNA-expression-based subtypes and methylation-based subtypes.
| Our Cluster | mRNA: Mesenchymal | mRNA: Classical | mRNA: Neural | mRNA: Proneural | Methylation: G-CIMP | Methylation: Non-G-CIMP |
|---|---|---|---|---|---|---|
| cluster 1 | 46 | 54 | 27 | 30 | 0 | 155 |
| cluster 2 | 1 | 0 | 1 | 19 | 20 | 1 |
| cluster 3 | 12 | 11 | 7 | 7 | 0 | 37 |
The values represent the number of patients counted.
Figure 3. Boxplot of diagnosis age for the identified clusters, showing the distribution of diagnosis age in each cluster. The black bar marks the median of each cluster.
Distribution of genetic variant signatures for the identified clusters.
| Our Cluster | | | | | | |
|---|---|---|---|---|---|---|
| cluster 1 | 84 (56.4%) | 84 (56.4%) | 80 (53.7%) | 57 (38.3%) | 70 (47.0%) | 0 (0%) |
| cluster 2 | 6 (28.6%) | 6 (28.6%) | 6 (28.6%) | 5 (23.8%) | 0 (0%) | 10 (66.7%) |
| cluster 3 | 24 (68.9%) | 23 (62.2%) | 24 (68.9%) | 21 (56.8%) | 19 (51.4%) | 0 (0%) |
The values indicate the number of variations, and the values in parentheses indicate the frequencies of variations after removing missing values. ‘ampl.’: amplification, ‘del.’: deletion.
Figure 4. The Kaplan–Meier survival curves of Temozolomide (TMZ) response for the identified clusters: (a) cluster 1, (b) cluster 2 and (c) cluster 3. “Untreated” denotes the group that did not receive TMZ treatment and “Treated” denotes the group that received it.
Figure 5. Heatmaps of differentially expressed genes in (a) mRNA expression data and (b) DNA methylation data for the identified clusters.
GO:BP, KEGG pathway and DO enriched terms for the identified clusters.
| Enrichment Analysis | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| GO:BP enriched terms | 1. Epithelial cell differentiation | 1. Protein targeting to ER | 1. SRP-dependent cotranslational protein targeting to membrane |
| KEGG enriched pathway terms | 1. Cell adhesion molecules (CAMs) | 1. Ribosome | 1. Ribosome |
| DO enriched terms | 1. alpha-Thalassemia | 1. Diamond–Blackfan anemia | 1. Diamond–Blackfan anemia |
The table lists the GO:BP terms ranked in the top 5, and the KEGG pathway and DO terms with a p-value below 1.00E-4.
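The p-value threshold above comes from an enrichment test. The source does not state which test was used; a common choice for GO, KEGG and DO over-representation analysis is the one-sided hypergeometric test, sketched below as an assumption. The function name `enrichment_pvalue` and all gene counts are hypothetical.

```python
from math import comb

def enrichment_pvalue(n_background, n_in_term, n_selected, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric(n_background, n_in_term, n_selected).

    n_background: genes in the background set
    n_in_term:    background genes annotated with the term
    n_selected:   differentially expressed genes in the cluster
    n_overlap:    selected genes that carry the term
    """
    upper = min(n_in_term, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_term, x) * comb(n_background - n_in_term, n_selected - x)
        for x in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20,000 background genes, 200 annotated with the term,
# 100 differentially expressed genes, 12 of which carry the term.
p_term = enrichment_pvalue(20000, 200, 100, 12)
print(p_term, p_term < 1.00e-4)
```

With only one gene expected to overlap by chance in this toy setting, an overlap of 12 falls well below the 1.00E-4 cutoff. In practice, tools that run this kind of analysis usually also correct for testing many terms at once.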