Wei Dai, Wenhao Yue, Wei Peng, Xiaodong Fu, Li Liu, Lijun Liu.
Abstract
Cancer subtype classification helps us understand the pathogenesis of cancer and develop new drugs and treatments from which patients would benefit most. Most previous studies detect cancer subtypes by extracting features from individual samples and ignore the associations between samples. We believe that interactions between cancer samples can help identify cancer subtypes. This work proposes a cancer subtype classification method based on a residual graph convolutional network and a sample similarity network. First, we constructed a sample similarity network from cancer gene co-expression patterns. Then, the gene expression profiles of the cancer samples, used as initial features, and the sample similarity network were passed into a two-layer graph convolutional network (GCN) model. We reintroduced the initial features into the GCN model to avoid over-smoothing during training. Finally, the cancer subtype classification was obtained through a softmax activation function. Our model was applied to breast invasive carcinoma (BRCA), glioblastoma multiforme (GBM) and lung cancer (LUNG) datasets. Its accuracy reached 82.58%, 85.13% and 79.18% on BRCA, GBM and LUNG, respectively, outperforming the existing methods. Survival analysis of our results shows that the cancer subtypes identified by our model have significant clinical features. Moreover, our model can be leveraged to detect the essential genes enriched in gene ontology (GO) terms and the biological pathways related to a cancer subtype.
Keywords: cancer subtype classification; deep learning; residual graph convolutional network; sample interaction
Year: 2021 PMID: 35052405 PMCID: PMC8774659 DOI: 10.3390/genes13010065
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Framework of the ERGCN. (a) Calculate the sample similarity matrix from the Pearson correlation coefficients of the samples' gene expression data and construct the sample adjacency matrix. (b) Input the sample features and the sample adjacency matrix into the residual graph convolution model to obtain the category prediction.
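The two steps in the Figure 1 caption can be sketched in plain Python. This is only an illustrative sketch, not the authors' implementation: the threshold rule (`correlation > threshold`), the symmetric normalization with self-loops, and the way the initial features are added back (a hypothetical linear projection `w_res` summed with the second layer's output) are all assumptions, since the record does not specify them. The toy data and weights are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def similarity_adjacency(expr, threshold):
    """Step (a): link two samples when their correlation exceeds `threshold`."""
    n = len(expr)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if pearson(expr[i], expr[j]) > threshold:
                adj[i][j] = adj[j][i] = 1.0
    return adj


def normalize(adj):
    """Add self-loops, then scale to D^-1/2 (A + I) D^-1/2 (assumed here)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for stability."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def residual_gcn_forward(a_norm, x, w0, w1, w_res):
    """Step (b): two graph convolutions plus an assumed initial-feature skip."""
    h1 = [[max(0.0, v) for v in row] for row in matmul(matmul(a_norm, x), w0)]
    h2 = matmul(matmul(a_norm, h1), w1)
    skip = matmul(x, w_res)  # hypothetical projection of the initial features
    logits = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(h2, skip)]
    return softmax_rows(logits)


# Toy data: 4 samples x 4 genes; samples 0/1 and 2/3 are co-expressed.
expr = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.5],
        [4.0, 3.0, 2.0, 1.0], [8.0, 6.0, 4.0, 2.5]]
adj = similarity_adjacency(expr, threshold=0.5)
w0 = [[0.1, -0.2], [0.0, 0.3], [0.2, 0.1], [-0.1, 0.2]]  # genes -> hidden
w1 = [[0.5, -0.5], [-0.3, 0.3]]                           # hidden -> classes
w_res = [[0.05, -0.05]] * 4                               # genes -> classes
probs = residual_gcn_forward(normalize(adj), expr, w0, w1, w_res)
```

Each row of `probs` is a probability distribution over the (here two) subtypes for one sample. Raising `threshold` removes edges, so each sample aggregates information from fewer neighbours.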
Figure 2. Performance comparison with respect to different correlation coefficient thresholds.
The external evaluation metrics of every model on the BRCA dataset.
| Methods | Precision | Recall | F1 Score | Accuracy | ARI | MCC |
|---|---|---|---|---|---|---|
| SAE+SVM | 0.46688 | 0.47171 | 0.41178 | 0.568 | 0.23869 | 0.36318 |
| SAE+Gcforest | 0.47257 | 0.47837 | 0.43852 | 0.64067 | 0.30144 | 0.43627 |
| Deeptype | 0.60228 | 0.62621 | 0.59160 | 0.753 | 0.57466 | 0.64430 |
| VAE+SVM | 0.66438 | 0.65065 | 0.63436 | 0.74114 | 0.48724 | 0.60771 |
| VAE+Gcforest | 0.64519 | 0.64151 | 0.61159 | 0.77448 | 0.54275 | 0.65777 |
| SVM | 0.42288 | 0.51925 | 0.45485 | 0.72076 | 0.43940 | 0.57969 |
| Gcforest | 0.64539 | 0.65442 | 0.62575 | 0.79638 | 0.57676 | 0.69289 |
| Random Forest | 0.66267 | 0.66451 | 0.63839 | 0.78952 | 0.57012 | 0.68508 |
| GCN+PPI | 0.64554 | 0.62277 | 0.61171 | 0.75005 | 0.49454 | 0.62303 |
| ERGCN | | | | | | |
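For readers unfamiliar with the six external metrics in these tables, the sketch below computes them from true and predicted labels. It assumes macro averaging for precision, recall and F1 (the record does not say which averaging was used) and uses the standard multi-class forms of the Matthews correlation coefficient (MCC) and the adjusted Rand index (ARI). The function names and toy labels are illustrative only.

```python
import math
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Accuracy, macro precision/recall/F1 and multi-class MCC."""
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    true_count = Counter(y_true)
    pred_count = Counter(y_pred)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    correct = sum(hits.values())

    precisions, recalls, f1s = [], [], []
    for k in labels:
        prec = hits[k] / pred_count[k] if pred_count[k] else 0.0
        rec = hits[k] / true_count[k] if true_count[k] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    # Multi-class MCC from the counts of correct, predicted and true labels.
    num = correct * n - sum(pred_count[k] * true_count[k] for k in labels)
    den = math.sqrt((n * n - sum(pred_count[k] ** 2 for k in labels))
                    * (n * n - sum(true_count[k] ** 2 for k in labels)))
    return {
        "precision": sum(precisions) / len(labels),
        "recall": sum(recalls) / len(labels),
        "f1": sum(f1s) / len(labels),
        "accuracy": correct / n,
        "mcc": num / den if den else 0.0,
    }


def comb2(m):
    """Number of unordered pairs among m items."""
    return m * (m - 1) / 2


def adjusted_rand_index(y_true, y_pred):
    """ARI from the contingency counts of (true, predicted) label pairs."""
    n = len(y_true)
    index = sum(comb2(c) for c in Counter(zip(y_true, y_pred)).values())
    a = sum(comb2(c) for c in Counter(y_true).values())
    b = sum(comb2(c) for c in Counter(y_pred).values())
    expected = a * b / comb2(n)
    max_index = (a + b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (index - expected) / (max_index - expected)


# Toy labels: three subtypes, one sample of subtype 1 predicted as subtype 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
scores = classification_metrics(y_true, y_pred)
scores["ari"] = adjusted_rand_index(y_true, y_pred)
```

All six values are at most 1, with 1 meaning perfect agreement. MCC and ARI can fall below 0 when agreement is worse than chance, which is why they sit lower than accuracy in the tables.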
The external evaluation metrics of every model on the GBM dataset.
| Methods | Precision | Recall | F1 Score | Accuracy | ARI | MCC |
|---|---|---|---|---|---|---|
| SAE+SVM | 0.79338 | 0.79440 | 0.78250 | 0.79355 | 0.52781 | 0.72323 |
| SAE+Gcforest | 0.79924 | 0.78409 | 0.77825 | 0.78831 | 0.51550 | 0.71606 |
| Deeptype | 0.78081 | 0.75975 | 0.74804 | 0.77300 | 0.50885 | 0.79063 |
| VAE+SVM | 0.80097 | 0.78549 | 0.78055 | 0.79761 | 0.52952 | 0.72786 |
| VAE+Gcforest | 0.77588 | 0.76140 | 0.75264 | 0.77575 | 0.51302 | 0.70079 |
| SVM | 0.83716 | 0.81796 | 0.81292 | 0.82083 | 0.56993 | 0.76267 |
| Gcforest | | 0.82310 | 0.82187 | 0.83661 | 0.61136 | 0.78249 |
| Random Forest | 0.85179 | 0.81486 | 0.81643 | 0.83511 | 0.61546 | 0.78059 |
| GCN+PPI | 0.81717 | 0.79781 | 0.79759 | 0.80755 | 0.55226 | 0.74441 |
| ERGCN | 0.85109 | | | | | |
The external evaluation metrics of every model on the LUNG dataset.
| Methods | Precision | Recall | F1 Score | Accuracy | ARI | MCC |
|---|---|---|---|---|---|---|
| SAE+SVM | 0.62703 | 0.64589 | 0.60029 | 0.70706 | 0.40495 | 0.59873 |
| SAE+Gcforest | 0.50461 | 0.53870 | 0.48131 | 0.63412 | 0.30443 | 0.49150 |
| Deeptype | 0.65217 | 0.66711 | 0.62727 | 0.736 | 0.53235 | 0.64140 |
| VAE+SVM | 0.71101 | 0.68801 | 0.67435 | 0.75177 | 0.48261 | 0.65223 |
| VAE+Gcforest | 0.70152 | 0.67114 | 0.64492 | 0.74588 | 0.49020 | 0.65056 |
| SVM | 0.46486 | 0.53482 | 0.46342 | 0.67176 | 0.44398 | 0.55509 |
| Gcforest | 0.68092 | 0.68020 | 0.64116 | 0.76823 | | 0.69718 |
| Random Forest | 0.66950 | 0.68130 | 0.63430 | 0.76235 | 0.56768 | 0.68308 |
| GCN+PPI | 0.59129 | 0.568 | 0.55040 | 0.65412 | 0.30853 | 0.51357 |
| ERGCN | | | | | | |
The internal evaluation metrics of every model.
| Methods | BRCA DBI | BRCA Silhouette Width | GBM DBI | GBM Silhouette Width | LUNG DBI | LUNG Silhouette Width |
|---|---|---|---|---|---|---|
| SAE+SVM | 2.0001 | −0.0056 | 2.5358 | 0.0402 | 1.9491 | −0.0005 |
| SAE+Gcforest | 1.8179 | 0.0335 | 2.4135 | 0.0465 | 2.0028 | 0.0222 |
| DeepType | 0.39641 | 0.62221 | 0.75048 | 0.42000 | 0.57735 | 0.48204 |
| VAE+SVM | 2.1105 | −0.0132 | 2.9650 | −0.0376 | 1.8451 | −0.0270 |
| VAE+Gcforest | 1.9178 | 0.0444 | 2.8630 | −0.0455 | 1.7715 | −0.0147 |
| SVM | 2.15145 | 0.11750 | 2.77210 | −0.00830 | 2.66726 | 0.00047 |
| Gcforest | 1.96480 | 0.06851 | 2.80126 | 0.00025 | 2.30813 | −0.00803 |
| Random Forest | 1.98764 | 0.05645 | 2.81110 | −0.00069 | 2.28595 | −0.00269 |
| GCN+PPI | 2.02747 | 0.03644 | 2.91481 | 0.00961 | 2.25382 | −0.0148 |
| ERGCN | | | | | | |
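The two internal metrics in the table above can be computed as in the following sketch. It uses the textbook definitions of the Davies-Bouldin index (DBI, lower is better) and the mean silhouette width (higher is better) with Euclidean distance. Which feature space and distance the paper evaluated on is not stated in this record, so those choices, the function names and the toy points are assumptions.

```python
import math


def euclid(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def group_by_label(points, labels):
    """Map each cluster label to the list of its points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters


def silhouette_width(points, labels):
    """Mean of (b - a) / max(a, b) over all samples; needs >= 2 clusters."""
    clusters = group_by_label(points, labels)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the sample's own cluster.
        a = sum(euclid(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(euclid(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def davies_bouldin_index(points, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / distance_ij.

    Assumes at least two clusters with distinct centroids.
    """
    clusters = group_by_label(points, labels)
    centroids = {k: [sum(c) / len(v) for c in zip(*v)]
                 for k, v in clusters.items()}
    scatter = {k: sum(euclid(p, centroids[k]) for p in v) / len(v)
               for k, v in clusters.items()}
    worst = [max((scatter[i] + scatter[j]) / euclid(centroids[i], centroids[j])
                 for j in clusters if j != i)
             for i in clusters]
    return sum(worst) / len(worst)


# Toy data: two tight, well-separated clusters.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
labels = [0, 0, 1, 1]
dbi = davies_bouldin_index(points, labels)
sil = silhouette_width(points, labels)
```

On this toy data the clusters are compact and far apart, so the DBI is small and the silhouette width is close to 1. Values near 0 or below for the silhouette width, as in several rows of the table, indicate overlapping clusters.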
Figure 3. Visualization results of t-SNE. In each panel, the first picture shows the original features and the second shows the latent features learned by ERGCN. (a) BRCA. (b) GBM. (c) LUNG.
The experimental results for a new sample on the BRCA dataset.
| Methods | Precision | Recall | F1 Score | Accuracy | ARI | MCC |
|---|---|---|---|---|---|---|
| SAE+SVM | 0.75693 | 0.54044 | 0.52594 | 0.72549 | 0.44104 | 0.57753 |
| SAE+Gcforest | 0.62501 | 0.55515 | 0.53656 | 0.73529 | 0.48863 | 0.59049 |
| VAE+SVM | 0.70438 | 0.68683 | 0.69412 | 0.75490 | 0.49052 | 0.62384 |
| VAE+Gcforest | 0.66973 | 0.65040 | 0.65643 | 0.76471 | 0.57447 | 0.63682 |
| SVM | 0.63776 | 0.51471 | 0.46261 | 0.72549 | 0.44240 | 0.59148 |
| Gcforest | | 0.64338 | 0.64064 | 0.79411 | 0.56589 | 0.68464 |
| Random Forest | 0.82441 | 0.62868 | 0.62397 | 0.78431 | 0.56030 | 0.66815 |
| GCN+PPI | 0.76280 | 0.68873 | 0.70813 | 0.79808 | 0.57397 | 0.69049 |
| ERGCN | 0.74755 | | | | | |
The experimental results for a new sample on the GBM dataset.
| Methods | Precision | Recall | F1 Score | Accuracy | ARI | MCC |
|---|---|---|---|---|---|---|
| SAE+SVM | 0.81642 | 0.81625 | 0.81538 | 0.81221 | 0.55023 | 0.74629 |
| SAE+Gcforest | 0.82595 | 0.82933 | 0.82708 | 0.82629 | 0.58206 | 0.76532 |
| VAE+SVM | 0.78944 | 0.78534 | 0.78682 | 0.79343 | 0.52562 | 0.72020 |
| VAE+Gcforest | 0.76416 | 0.75092 | 0.75601 | 0.76526 | 0.47159 | 0.68126 |
| SVM | 0.83590 | 0.82663 | 0.82985 | 0.83098 | 0.59225 | 0.77148 |
| Gcforest | | 0.80886 | 0.82051 | 0.83568 | 0.61514 | 0.77803 |
| Random Forest | 0.83445 | 0.80664 | 0.81572 | 0.82629 | 0.59218 | 0.76426 |
| GCN+PPI | 0.81691 | 0.81481 | 0.81547 | 0.82160 | 0.57898 | 0.75841 |
| ERGCN | 0.84325 | | | | | |
The experimental results for a new sample on the LUNG dataset.
| Methods | Precision | Recall | F1 Score | Accuracy | ARI | MCC |
|---|---|---|---|---|---|---|
| SAE+SVM | 0.53594 | 0.53283 | 0.49398 | 0.68235 | 0.41481 | 0.55207 |
| SAE+Gcforest | 0.65871 | 0.53268 | 0.52984 | 0.65882 | 0.34749 | 0.50912 |
| VAE+SVM | 0.78690 | 0.74056 | 0.75152 | 0.81176 | 0.63649 | 0.73533 |
| VAE+Gcforest | 0.63186 | 0.61147 | 0.60847 | 0.71764 | 0.52313 | 0.59664 |
| SVM | 0.58994 | 0.55804 | 0.52567 | 0.70588 | 0.46454 | 0.59018 |
| Gcforest | | 0.65167 | 0.63457 | 0.77647 | 0.58815 | 0.69397 |
| Random Forest | 0.58994 | 0.68130 | 0.63430 | 0.76235 | 0.56768 | 0.68308 |
| GCN+PPI | 0.61656 | 0.54185 | 0.55348 | 0.61176 | 0.23225 | 0.43827 |
| ERGCN | 0.79367 | | | | | |
Figure 4. Survival time under different subtypes: (a) the survival curve of the BRCA dataset; (b) the survival curve of the GBM dataset; (c) the survival curve of the LUNG dataset.
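Survival curves like those in Figure 4 are commonly drawn with the Kaplan-Meier estimator. The record does not state which estimator the authors used, so the sketch below is an assumption offered only to show how such a curve is computed for one subtype. The function name and the toy follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    `times` are follow-up times; `events` are 1 for an observed death and
    0 for a censored sample.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect every sample that shares this follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored samples both leave the risk set
    return curve


# Toy follow-up data for one hypothetical subtype (time unit is arbitrary).
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Computing one such curve per predicted subtype and comparing them (for example with a log-rank test) is the usual way to check whether the subtypes differ in clinical outcome.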
The experimental results for the ablation study.
| Metric | BRCA MLP | BRCA GCN | BRCA ERGCN | GBM MLP | GBM GCN | GBM ERGCN | LUNG MLP | LUNG GCN | LUNG ERGCN |
|---|---|---|---|---|---|---|---|---|---|
| Precision | 0.74126 | 0.76061 | | 0.84129 | 0.84435 | | | 0.74748 | 0.754 |
| Recall | 0.75557 | 0.7641 | | 0.84113 | 0.84397 | | | 0.73711 | 0.74699 |
| F1 Score | 0.72517 | 0.73677 | | 0.83316 | 0.83556 | | | 0.71772 | 0.72242 |
| Accuracy | 0.80095 | 0.80904 | | 0.84285 | 0.84525 | | 0.78941 | 0.78941 | |
| ARI | 0.60001 | 0.60768 | | 0.62204 | 0.6292 | | 0.55106 | 0.56351 | |
| MCC | 0.70687 | 0.71806 | | 0.78966 | 0.79278 | | 0.70979 | 0.70839 | |
Figure 5. GO enrichment and KEGG enrichment analysis results. (A) Biological processes. (B) Cellular components. (C) Molecular functions. (D) KEGG enrichment.