Conghai Lu, Juan Wang, Jinxing Liu, Chunhou Zheng, Xiangzhen Kong, Xiaofeng Zhang.
Abstract
Clustering cancer samples is an important approach to cancer classification and is central to cancer research. For high-dimensional gene expression data, selecting characteristic genes that discriminate well between samples is an active research area in bioinformatics. In this paper, we propose a novel integrated framework for cancer clustering, the non-negative symmetric low-rank representation with graph regularization based on score function (NSLRG-S). First, a lowest-rank matrix is obtained by NSLRG decomposition; it preserves both the local manifold information and the global structure of the gene expression data. Second, we construct a Score function based on the lowest-rank matrix to weight all features of the gene expression data and compute a score for each feature. Third, we rank the features by score and select the feature genes for cancer sample clustering. Finally, we cluster the cancer samples with the K-means method using the selected feature genes. The experiments are conducted on The Cancer Genome Atlas (TCGA) data. Comparative experiments demonstrate that the NSLRG-S framework can significantly improve clustering performance.
Keywords: cancer gene expression data; clustering; feature selection; low-rank representation; score function
Year: 2020; PMID: 32038712; PMCID: PMC6987458; DOI: 10.3389/fgene.2019.01353
Source DB: PubMed; Journal: Front Genet; ISSN: 1664-8021; Impact factor: 4.599
Figure 1. The matrix Z with the symmetry constraint.
Figure 2. Framework of NSLRG-S for clustering gene expression data.
Figure 3. The tp, fp, and fn of the clustering result.
Figure 4. (A and B) The convergence analysis of different methods in 100 iterations.
The clustering results of compared methods and NSLRG-S method on synthetic data.
| Method | Acc (%) | F1 (%) | RI (%) |
|---|---|---|---|
| GNMF | 72.44 | 68.42 | 93.01 |
| RPCA | 80.68 | 78.82 | 95.57 |
| SPCA | 70.42 | 67.60 | 91.07 |
| GLPCA | 67.28 | 64.45 | 89.84 |
| LS | 80.62 | 78.37 | 96.12 |
| LLRR | 81.04 | 78.67 | 96.12 |
| NSLRG-S | | | |
Acc, clustering accuracy rate; F1, F1 measurement; RI, Rand Index; GNMF, Graph Regularized Nonnegative Matrix Factorization; RPCA, Robust Principal Component Analysis; SPCA, Sparse Principal Component Analysis; GLPCA, Graph-Laplacian PCA; LS, Laplacian Score; LLRR, Laplacian regularized Low-Rank Representation; NSLRG-S, non-negative symmetric low-rank representation with graph regularization based on score function. Bold values indicate results that are better than the others.
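The three measures can be made concrete with pair counting. The sketch below is a minimal illustration, assuming that F1 and RI are computed over sample pairs (tp: same class and same cluster; fp: same cluster only; fn: same class only) and that Acc is the best match rate over one-to-one relabelings of the clusters. This record does not spell out the formulas, and the function names are hypothetical.

```python
from itertools import combinations, permutations


def pair_counts(truth, pred):
    """Count sample pairs as (tp, fp, fn, tn) under the pair-counting assumption."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        if same_class and same_cluster:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(truth, pred):
    """Fraction of sample pairs on which truth and prediction agree."""
    tp, fp, fn, tn = pair_counts(truth, pred)
    return (tp + tn) / (tp + fp + fn + tn)


def pairwise_f1(truth, pred):
    """Harmonic mean of pairwise precision and recall (assumes tp > 0)."""
    tp, fp, fn, _ = pair_counts(truth, pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def clustering_accuracy(truth, pred):
    """Best accuracy over all one-to-one cluster-to-class mappings (brute force)."""
    classes = sorted(set(truth))
    clusters = sorted(set(pred))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for t, p in zip(truth, pred))
        best = max(best, hits)
    return best / len(truth)


truth = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]  # cluster ids are arbitrary; one sample is misplaced
print(pair_counts(truth, pred))                     # (4, 3, 2, 6)
print(round(rand_index(truth, pred), 4))            # 0.6667
print(round(pairwise_f1(truth, pred), 4))           # 0.6154
print(round(clustering_accuracy(truth, pred), 4))   # 0.8333
```

The brute-force mapping is only practical for a handful of clusters, which matches the two- and three-class datasets described here; a larger number of clusters would call for an assignment solver instead.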
The distribution of five gene expression datasets.
| Dataset | Cancer tissue samples | Normal tissue samples | Total samples | Total genes |
|---|---|---|---|---|
| PAAD | 176 | 4 | 180 | 20502 |
| HNSC | 398 | 20 | 418 | 20502 |
| ESCA | 183 | 9 | 192 | 20502 |
| COAD | 262 | 19 | 281 | 20502 |
| CHOL | 36 | 9 | 45 | 20502 |
The distribution of mixed datasets.
| Dataset | Cancer tissue samples by source | Total samples |
|---|---|---|
| HN-PA | 398 from HNSC; 176 from PAAD | 574 |
| ES-PA | 183 from ESCA; 176 from PAAD | 359 |
| CO-ES | 262 from COAD; 183 from ESCA | 445 |
| HN-CH | 398 from HNSC; 36 from CHOL | 434 |
| HN-PA-CH | 398 from HNSC; 176 from PAAD; 36 from CHOL | 610 |
| ES-PA-CH | 183 from ESCA; 176 from PAAD; 36 from CHOL | 395 |
| CO-PA-CH | 262 from COAD; 176 from PAAD; 36 from CHOL | 474 |
ESCA, esophageal carcinoma; HNSC, head and neck squamous cell carcinoma; CHOL, cholangiocarcinoma; COAD, colon adenocarcinoma; and PAAD, pancreatic adenocarcinoma.
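The mixed datasets are built from the cancer tissue samples only (for example, 176 PAAD samples rather than all 180). As a small check of that arithmetic, the sketch below recomputes each total from the per-dataset cancer sample counts listed in this record; the variable names are illustrative.

```python
# Cancer tissue sample counts per single-cancer dataset (from the table of five datasets).
cancer_samples = {"PAAD": 176, "HNSC": 398, "ESCA": 183, "COAD": 262, "CHOL": 36}

# Composition of each mixed dataset (from the table of mixed datasets).
mixed = {
    "HN-PA": ["HNSC", "PAAD"],
    "ES-PA": ["ESCA", "PAAD"],
    "CO-ES": ["COAD", "ESCA"],
    "HN-CH": ["HNSC", "CHOL"],
    "HN-PA-CH": ["HNSC", "PAAD", "CHOL"],
    "ES-PA-CH": ["ESCA", "PAAD", "CHOL"],
    "CO-PA-CH": ["COAD", "PAAD", "CHOL"],
}

# Total samples in a mixed dataset = sum of cancer samples of its sources.
totals = {name: sum(cancer_samples[src] for src in parts) for name, parts in mixed.items()}

for name, total in totals.items():
    print(f"{name}: {total}")
```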
The categories of the experimental datasets.
| Category | I | II | III |
|---|---|---|---|
| Dataset | PAAD | HN-PA | HN-PA-CH |
| | HNSC | ES-PA | ES-PA-CH |
| | ESCA | CO-ES | CO-PA-CH |
| | COAD | HN-CH | / |
| | CHOL | / | / |
The parameter selection.
| Dataset | | | |
|---|---|---|---|
| PAAD | 10^-5 | 10^-2 | 10^-3 |
| HNSC | 10^-3 | 10^-4 | 10^-3 |
| ESCA | 10^4 | 10^-1 | 10^-3 |
| COAD | 10^4 | 10^0 | 10^-3 |
| CHOL | 10^-1 | 10^-1 | 10^-3 |
| HN-PA | 10^-4 | 10^1 | 10^-3 |
| ES-PA | 10^-2 | 10^-1 | 10^-3 |
| CO-ES | 10^2 | 10^5 | 10^-3 |
| HN-CH | 10^-1 | 10^5 | 10^-3 |
| HN-PA-CH | 10^-5 | 10^-2 | 10^-3 |
| ES-PA-CH | 10^-4 | 10^0 | 10^-3 |
| CO-PA-CH | 10^1 | 10^-2 | 10^-3 |
The results of the comparison experiments.
| Category | Dataset | Measure | K-means | GNMF | RPCA | SPCA | GLPCA | LS | LLRR | NSLRG-S |
|---|---|---|---|---|---|---|---|---|---|---|
| I | PAAD | Acc | 69.50% | 74.67% | 63.49% | 56.47% | 76.53% | | 81.46% | 97.22% |
| | | F1 | 43.28% | 46.69% | 41.42% | 40.31% | 45.53% | | 48.45% | 49.30% |
| | | RI | 63.77% | 61.96% | 55.23% | 50.58% | 64.45% | | 69.73% | 94.57% |
| | HNSC | Acc | 69.50% | 81.72% | 64.52% | 62.20% | 90.71% | 93.54% | 81.44% | |
| | | F1 | 46.78% | 44.97% | 47.34% | 46.59% | 68.51% | 48.33% | 48.43% | |
| | | RI | 59.44% | 70.05% | 54.19% | 52.86% | 83.68% | 87.89% | 69.69% | |
| | ESCA | Acc | 62.01% | 54.69% | 53.65% | 53.97% | 84.90% | 94.79% | 67.47% | |
| | | F1 | 43.97% | 40.00% | 40.22% | 41.15% | 46.74% | 48.66% | 46.97% | |
| | | RI | 58.34% | 50.18% | 50.01% | 50.06% | 76.19% | 90.07% | 56.41% | |
| | COAD | Acc | 74.71% | | 86.39% | 81.28% | 84.42% | 87.09% | 88.20% | |
| | | F1 | 60.02% | | 71.08% | 65.41% | 68.68% | 47.54% | 73.40% | |
| | | RI | 65.22% | | 76.45% | 69.48% | 73.60% | 78.08% | 79.15% | |
| | CHOL | Acc | 85.72% | 97.78% | | | | 63.82% | | |
| | | F1 | 66.16% | 96.66% | | | | 44.81% | | |
| | | RI | 75.03% | 95.56% | | | | 53.36% | | |
| II | HN-PA | Acc | 97.66% | 99.83% | 99.48% | 99.30% | 98.95% | 68.95% | 99.65% | |
| | | F1 | 95.99% | 99.80% | 99.39% | 99.19% | 98.78% | 41.77% | 99.59% | |
| | | RI | 96.38% | 99.65% | 98.96% | 98.61% | 97.93% | 57.11% | 99.30% | |
| | HN-CH | Acc | 85.42% | 98.39% | 82.56% | 89.59% | 92.06% | 90.12% | 94.14% | |
| | | F1 | 73.89% | 94.18% | 71.16% | 77.83% | 81.62% | 47.40% | 86.08% | |
| | | RI | 76.94% | 96.82% | 72.33% | 81.36% | 85.37% | 82.15% | 89.46% | |
| | ES-PA | Acc | 96.41% | 97.21% | 98.25% | 99.16% | 99.16% | 50.86% | 99.16% | |
| | | F1 | 73.89% | 97.21% | 97.95% | 99.16% | 99.16% | 34.37% | 99.16% | |
| | | RI | 95.44% | 94.57% | 97.37% | 98.34% | 98.34% | 49.89% | 98.34% | |
| | CO-ES | Acc | 96.58% | 80.67% | 97.53% | 96.85% | 96.18% | 59.10% | 97.30% | |
| | | F1 | 96.07% | 77.59% | 97.45% | 96.75% | 96.06% | 37.65% | 97.21% | |
| | | RI | 93.95% | 68.75% | 95.17% | 93.89% | 92.63% | 51.55% | 94.74% | |
| III | HN-PA-CH | Acc | 81.01% | | 77.20% | 78.83% | 80.13% | 65.25% | 87.71% | |
| | | F1 | 62.79% | 63.16% | 61.82% | 63.15% | 65.25% | 26.69% | | 63.36% |
| | | RI | 84.14% | | 81.99% | 81.85% | 81.76% | 51.20% | 87.74% | 89.98% |
| | ES-PA-CH | Acc | 81.14% | 68.86% | 73.91% | 72.78% | 72.52% | 46.51% | 86.03% | |
| | | F1 | 65.98% | 52.42% | 63.41% | 66.55% | 66.13% | 22.30% | 69.23% | |
| | | RI | 86.29% | 77.41% | 82.73% | 80.33% | 80.29% | 42.64% | 85.98% | |
| | CO-PA-CH | Acc | 80.24% | | 74.04% | 74.63% | 75.40% | 55.59% | 85.57% | 83.74% |
| | | F1 | 68.56% | 63.60% | 61.77% | 63.27% | 64.27% | 26.89% | 70.44% | |
| | | RI | 84.22% | 84.00% | 82.27% | 84.02% | 83.65% | 45.84% | 84.53% | |
The mean metrics for all methods on the Category I, II, and III datasets.
| Metrics | Category | K-means | GNMF | RPCA | SPCA | GLPCA | LS | LLRR | NSLRG-S |
|---|---|---|---|---|---|---|---|---|---|
| Acc | I | 72.29% | 81.63% | 73.61% | 70.78% | 87.31% | 87.40% | 83.71% | |
| | II | 94.02% | 94.03% | 94.45% | 96.23% | 96.59% | 67.26% | 97.56% | |
| | III | 80.80% | 83.70% | 75.05% | 75.42% | 76.02% | 55.78% | 86.44% | |
| F1 | I | 52.04% | 65.13% | 60.01% | 58.69% | 65.89% | 51.09% | 63.45% | |
| | II | 84.96% | 92.20% | 91.49% | 93.23% | 93.91% | 40.30% | 95.51% | |
| | III | 65.78% | 59.73% | 62.34% | 64.32% | 65.21% | 25.29% | | 69.67% |
| RI | I | 64.36% | 75.27% | 67.18% | 64.60% | 79.58% | 81.01% | 75.00% | |
| | II | 90.68% | 89.95% | 90.96% | 93.05% | 93.57% | 60.17% | 95.46% | |
| | III | 84.88% | 85.40% | 82.33% | 82.07% | 81.90% | 46.56% | 86.08% | |
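Each mean in this table is the average of the per-dataset values within one category. As a worked check, the sketch below recomputes the K-means accuracy means from the per-dataset K-means accuracies reported in the comparison table; the dictionary layout is only an illustration.

```python
# Per-dataset K-means accuracy (%) grouped by category, copied from the comparison table.
kmeans_acc = {
    "I": [69.50, 69.50, 62.01, 74.71, 85.72],   # PAAD, HNSC, ESCA, COAD, CHOL
    "II": [97.66, 85.42, 96.41, 96.58],         # HN-PA, HN-CH, ES-PA, CO-ES
    "III": [81.01, 81.14, 80.24],               # HN-PA-CH, ES-PA-CH, CO-PA-CH
}

# Category mean = arithmetic mean over the datasets in that category.
means = {cat: round(sum(vals) / len(vals), 2) for cat, vals in kmeans_acc.items()}

print(means)  # {'I': 72.29, 'II': 94.02, 'III': 80.8}
```

The results agree with the K-means column of the mean-metrics table (72.29%, 94.02%, 80.80%).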
Figure 5. The mean metrics of experimental result for Category I, II, and III. (A) Accuracy-Category; (B) F1-Category; (C) Rand Index-Category.
Figure 6. The mean metrics of experimental result for all methods. (A) Accuracy-Method; (B) F1-Method; (C) Rand Index-Method.
The NSLRG method.
| Steps |
|---|
| 1. Update [...] |
| 2. Update [...] |
| 3. Update [...] |
| 4. Update Lagrangian multipliers [...] |
| 5. Update [...] where [...] |
| if ‖[...] max{[...] |
Framework of NSLRG-S for clustering gene expression data.
| Steps |
|---|
| 1) Learn a lowest-rank matrix by NSLRG decomposition; |
| 2) Obtain the ranked feature genes with the Score function; |
| 3) Obtain the selected feature genes; |
| 4) Cluster the cancer samples with the K-means method, using the selected feature genes. |
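Steps 2 to 4 of the framework can be sketched on toy data. This record does not give the paper's Score function or the NSLRG solver, so the sketch makes two loudly labeled assumptions: the learned lowest-rank matrix is replaced by a hand-built symmetric non-negative affinity matrix `W`, and the Score function is replaced by the classical Laplacian Score (lower is better), which is one of the compared methods rather than the authors' own score. All names and data values are hypothetical.

```python
def laplacian_scores(X, W):
    """Laplacian Score per feature (stand-in for the paper's Score function).

    X: samples x features (list of lists); W: symmetric non-negative affinity.
    """
    n, m = len(X), len(X[0])
    degree = [sum(row) for row in W]
    total = sum(degree)
    scores = []
    for r in range(m):
        f = [X[i][r] for i in range(n)]
        mean = sum(fi * di for fi, di in zip(f, degree)) / total  # degree-weighted mean
        f = [fi - mean for fi in f]
        # f^T L f with L = D - W equals 0.5 * sum_ij W_ij (f_i - f_j)^2 for symmetric W
        smooth = 0.5 * sum(W[i][j] * (f[i] - f[j]) ** 2
                           for i in range(n) for j in range(n))
        spread = sum(di * fi * fi for fi, di in zip(f, degree))   # f^T D f
        scores.append(smooth / spread if spread > 0 else float("inf"))
    return scores


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        centers.append(list(max(points, key=lambda p: min(sqdist(p, c) for c in centers))))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy expression matrix: 6 samples x 3 genes; gene 1 is noise (hypothetical values).
X = [[0.0, 5.0, 1.0], [0.2, 1.0, 1.1], [0.1, 9.0, 0.9],
     [10.0, 4.0, 3.0], [10.1, 8.0, 3.2], [9.9, 2.0, 2.9]]

# Assumed stand-in for the learned matrix: strong within-group, weak cross-group affinity.
groups = [0, 0, 0, 1, 1, 1]
W = [[0.0 if i == j else (1.0 if groups[i] == groups[j] else 0.1)
      for j in range(6)] for i in range(6)]

scores = laplacian_scores(X, W)                                    # step 2: score genes
selected = sorted(range(len(scores)), key=scores.__getitem__)[:2]  # step 3: keep best 2
reduced = [[row[r] for r in selected] for row in X]
labels = kmeans(reduced, k=2)                                      # step 4: cluster samples
print(selected, labels)
```

On this toy input the noise gene receives the worst score and is dropped, and K-means on the two remaining genes separates the first three samples from the last three. In the actual framework the affinity would come from the NSLRG decomposition rather than from known group labels.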