Juan Wang, Cong-Hai Lu, Jin-Xing Liu, Ling-Yun Dai, Xiang-Zhen Kong.
Abstract
BACKGROUND: Identifying different types of cancer from gene expression data has become a hotspot in bioinformatics research. Clustering gene expression data from multiple cancers into their respective classes is an important approach. However, the high dimensionality and small sample size of gene expression data, together with the noise in the data, make data mining and analysis difficult. Although many effective and feasible methods address this problem, these methods may still be flawed.
Keywords: Affinity matrix; Gene expression data; Graph regularization; Low-rank representation; Spectral clustering; Symmetric constraint
Year: 2019 PMID: 31888442 PMCID: PMC6936083 DOI: 10.1186/s12859-019-3231-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The LRR method with clean data
Fig. 2 The LRR method with real data
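The clean-data case named in the Fig. 1 caption can be illustrated with a minimal sketch. It assumes the standard noise-free low-rank representation problem, minimizing the nuclear norm of Z subject to X = XZ, whose minimizer is known to be Z = VVᵀ, where V holds the right singular vectors of X. The function name `lrr_clean`, the toy data, and the tolerance are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def lrr_clean(X, tol=1e-10):
    """Closed-form solution of min ||Z||_* s.t. X = XZ (noise-free case).

    X has one sample per column. Hypothetical helper for illustration only.
    """
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    rank = int(np.sum(s > tol * s[0]))   # numerical rank of X
    V = Vt[:rank].T                      # right singular vectors (n x rank)
    return V @ V.T                       # projection onto the row space of X

# Toy data (assumption): two independent 2-dimensional subspaces in R^10,
# five samples drawn from each.
rng = np.random.default_rng(0)
basis_a = rng.standard_normal((10, 2))
basis_b = rng.standard_normal((10, 2))
X = np.hstack([basis_a @ rng.standard_normal((2, 5)),
               basis_b @ rng.standard_normal((2, 5))])

Z = lrr_clean(X)
# Largest coefficient linking a sample of one subspace to a sample of the other.
cross_block = np.abs(Z[:5, 5:]).max()
print("max cross-subspace coefficient:", cross_block)
```

For clean samples from independent subspaces, the coefficients linking samples of different subspaces vanish up to rounding, so Z is block-diagonal. With real, noisy data this structure is only approximate.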
The distribution of the samples in the five datasets
| Gene Expression Dataset | Cancer Samples | Normal Samples | Total Number of Samples |
|---|---|---|---|
| COAD | 262 | 19 | 281 |
| ESCA | 183 | 9 | 192 |
| CHOL | 36 | 9 | 45 |
| PAAD | 176 | 4 | 180 |
| HNSC | 398 | 20 | 418 |
Note: The Gene Expression Datasets represent the different cancer sample data: COAD colon adenocarcinoma, ESCA esophagus cancer, CHOL cholangiocarcinoma, PAAD pancreatic adenocarcinoma, HNSC head and neck squamous cell carcinoma
The distribution of the six datasets
| Dataset | Number of samples of each type of cancer | Total number of samples | Number of subspaces |
|---|---|---|---|
| CO-CH | 262–36 | 298 | 2 |
| PA-ES | 176–183 | 359 | 2 |
| CH-HN-CO | 36–398–262 | 696 | 3 |
| ES-CH-HN | 183–36–398 | 617 | 3 |
| CO-CH-ES-HN | 262–36–183–398 | 879 | 4 |
| ES-CO-PA-HN | 183–262–176–398 | 1019 | 4 |
Note: The datasets represent different integrated datasets. The characteristics of each dataset are described in the previous passage
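The arithmetic behind the integrated datasets can be checked with a short sketch. Each total in the table equals the sum of the cancer-sample counts of its constituent cohorts (normal samples are not counted), and the subspace number equals the number of cohorts. The mapping from two-letter codes to cohort names is inferred from the dataset names and should be treated as an assumption.

```python
# Cancer-sample counts per cohort, copied from the first table.
cancer_samples = {"COAD": 262, "ESCA": 183, "CHOL": 36, "PAAD": 176, "HNSC": 398}

# Two-letter codes used in the integrated dataset names (assumed mapping).
code_to_cohort = {"CO": "COAD", "ES": "ESCA", "CH": "CHOL", "PA": "PAAD", "HN": "HNSC"}

integrated = ["CO-CH", "PA-ES", "CH-HN-CO", "ES-CH-HN", "CO-CH-ES-HN", "ES-CO-PA-HN"]

def describe(name):
    """Return (per-cohort counts, total samples, number of subspaces)."""
    counts = [cancer_samples[code_to_cohort[code]] for code in name.split("-")]
    return counts, sum(counts), len(counts)

summary = {name: describe(name) for name in integrated}
for name, (counts, total, n_subspaces) in summary.items():
    print(f"{name}: counts={counts} total={total} subspaces={n_subspaces}")
```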
Fig. 3 The items TP, FP, TN and FN
Fig. 4 Confusion matrix and the items TP, FP, TN and FN for multi-cancer dataset clustering
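Since the figures themselves are not reproduced here, the following is a minimal sketch of one common way to obtain TP, FP, TN and FN for a clustering result: counting pairs of samples. Under this convention, a pair is a true positive when both the reference labels and the predicted clusters put the two samples together. The pair-counting convention, the function names, and the toy labels are assumptions; the paper may derive these items from the confusion matrix differently.

```python
from itertools import combinations
from math import sqrt

def pair_counts(true_labels, pred_labels):
    """Count sample pairs as TP, FP, TN, FN (pair-counting convention)."""
    tp = fp = tn = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_true:
            tp += 1      # grouped together, and they belong together
        elif same_pred:
            fp += 1      # grouped together, but they belong apart
        elif same_true:
            fn += 1      # separated, but they belong together
        else:
            tn += 1      # separated, and they belong apart
    return tp, fp, tn, fn

def rand_index(tp, fp, tn, fn):
    """Fraction of pairs on which the clustering agrees with the reference."""
    return (tp + tn) / (tp + fp + tn + fn)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient computed from the four counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy example (assumption): six samples, two classes, one sample misassigned.
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1]
counts = pair_counts(true_labels, pred_labels)
print("TP, FP, TN, FN =", counts)
print("RI  =", round(rand_index(*counts), 4))
print("MCC =", round(mcc(*counts), 4))
```

The tables on this page report such measures as percentages, so a Rand index of 0.6667 from this sketch would correspond to 66.67.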
The clustering results of all methods on the different integrated datasets
| Dataset | Measure | K-means | T-SNE | LLE | NMF | PCA | LRR | LLRR | MLLRR | sgLRR |
|---|---|---|---|---|---|---|---|---|---|---|
| CO-CH | ACC | 95.40 | 89.31 | 71.14 | 93.14 | 93.63 | 97.99 | 98.99 | 98.66 | |
| | MCC | 80.06 | 70.21 | 88.58 | 75.53 | 72.99 | 46.76 | 91.06 | 90.78 | |
| | RI | 91.43 | 49.85 | 63.66 | 87.18 | 88.07 | 96.04 | 98.00 | 97.34 | |
| | NMI | 66.46 | 51.01 | 75.71 | 54.12 | 54.47 | 41.92 | 80.51 | 77.04 | |
| PA-ES | ACC | 98.25 | 91.84 | 77.26 | 96.38 | | | | | |
| | MCC | 84.73 | 83.74 | 62.04 | 77.98 | 96.71 | 96.71 | | | |
| | RI | 97.37 | 84.97 | 66.46 | 98.33 | 93.00 | 98.34 | 98.34 | 98.34 | |
| | NMI | 81.06 | 59.49 | 41.43 | 65.43 | 89.39 | 89.39 | | | |
| CH-HN-CO | ACC | 89.22 | 83.03 | 80.99 | 76.99 | 85.77 | 95.86 | 97.99 | 84.77 | |
| | MCC | 67.91 | 65.69 | 60.77 | 82.25 | 69.17 | 61.23 | 68.31 | 80.96 | |
| | RI | 90.00 | 87.16 | 82.85 | 80.19 | 87.76 | 94.70 | 96.67 | 84.66 | |
| | NMI | 73.55 | 78.43 | 69.07 | 76.22 | 72.59 | 68.25 | 73.81 | 77.83 | |
| ES-CH-HN | ACC | 85.56 | 52.35 | 61.65 | 84.52 | 80.03 | 82.17 | 93.19 | 94.32 | |
| | MCC | 66.26 | 32.62 | 42.01 | 67.44 | 66.08 | 43.04 | 61.18 | 66.49 | |
| | RI | 82.67 | 60.97 | 63.89 | 80.25 | 78.15 | 72.36 | 89.07 | 90.30 | |
| | NMI | 56.77 | 30.77 | 33.26 | 47.35 | 72.59 | 36.92 | 52.38 | 57.20 | |
| CO-CH-ES-HN | ACC | 86.89 | 60.52 | 63.04 | 82.31 | 81.32 | 79.24 | 92.48 | 87.94 | |
| | MCC | 70.00 | 65.30 | 79.30 | 51.78 | 71.30 | 52.44 | 78.05 | 73.03 | |
| | RI | 89.43 | 74.59 | 71.98 | 85.07 | 86.57 | 81.58 | 91.42 | 88.95 | |
| | NMI | 71.04 | 48.43 | 52.00 | 57.67 | 69.12 | 54.76 | 74.42 | 69.82 | |
| ES-CO-PA-HN | ACC | 86.89 | 85.83 | 67.76 | 82.31 | 81.32 | 79.24 | 92.48 | 87.94 | |
| | MCC | 79.51 | 78.52 | 77.40 | 84.23 | 59.21 | 88.38 | 83.63 | 85.49 | |
| | RI | 89.43 | 91.45 | 78.06 | 85.07 | 86.57 | 81.58 | 91.42 | 88.95 | |
| | NMI | 76.15 | 81.42 | 55.93 | 72.97 | 79.53 | 61.82 | 76.58 | 76.23 | |
Note: The best clustering results are highlighted in bold
Fig. 5 Visualization of the dimensionality-reduced data obtained by the T-SNE, LLE and sgLRR methods for the six integrated datasets
Fig. 6 The heat maps intuitively compare the grouping effects of matrices Z∗ and H. (1-a) and (1-b) are the heat maps based on the matrices Z∗ and H for the CO-CH dataset, respectively; (2-a) and (2-b) are the heat maps based on the matrices Z∗ and H for the CH-HN-CO dataset, respectively; and (3-a) and (3-b) are the heat maps based on the matrices Z∗ and H for the CO-CH-ES-HN dataset, respectively