Shuo Wang1,2, Hao Zhang1,2, Zhen Liu1,2,3, Yuanning Liu1,2.
Abstract
Lung cancer is the leading cause of cancer deaths, so predicting the survival status of lung cancer patients is of great value. However, existing methods mainly depend on statistical machine learning (ML) algorithms, which are not well suited to high-dimensional genomics data. Deep learning (DL), with its strong capability for learning from high-dimensional data, can instead be used to predict lung cancer survival from genomics data. The Cancer Genome Atlas (TCGA) is a large database containing many kinds of genomics data for 33 cancer types; with this enormous amount of data, researchers can analyze key factors related to cancer therapy. This paper proposes a novel method to predict lung cancer long-term survival using gene expression data from TCGA. First, we select the genes most relevant to the target problem with a supervised feature-selection method, the mutual information selector. Second, we propose a method to convert gene expression data into two kinds of images incorporating KEGG BRITE and KEGG Pathway data, so that a convolutional neural network (CNN) can learn high-level features from them. We then design a CNN-based DL model and add two kinds of clinical data to improve performance, yielding a multimodal DL model. Generalized experimental results indicate that our method performs much better than the ML models and the unimodal DL models. Furthermore, survival analysis shows that our model better divides samples into high-risk and low-risk groups.
Keywords: CNN; cancer precision medicine; cancer survival prediction; deep learning; multimodal; optimal threshold selection; survival analysis
Year: 2022 PMID: 35368657 PMCID: PMC8964372 DOI: 10.3389/fgene.2022.800853
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
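The abstract's first step, supervised gene selection with a mutual information selector, can be sketched as below. The data are random placeholders rather than TCGA expression values, and the top-k size is an illustrative assumption, not the paper's setting:

```python
# Minimal sketch of supervised feature selection by mutual information:
# rank "genes" by mutual information with the survival label, keep the top k.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))                       # 100 samples x 500 genes
y = (X[:, 0] + rng.normal(size=100) > 0).astype(int)  # toy label tied to gene 0

selector = SelectKBest(mutual_info_classif, k=50)     # keep the 50 top-ranked genes
X_sel = selector.fit_transform(X, y)
print(X_sel.shape)  # (100, 50)
```

In practice the label would be the 5-year overall-survival status and `X` the normalized gene expression matrix.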
FIGURE 1 Overall process of the lung cancer long-term survival prediction: (A) the process of generating the gene expression image with KEGG BRITE data; (B) the DL model we propose for the prediction task; (C) the process of generating the gene expression image with KEGG Pathway data; (D) the detailed architecture of the convolution module we design for learning representations from the gene expression image.
General statistics for the data sets analyzed. Stages I to IV are the tumor stages defined by the AJCC staging system (Edge and Compton, 2010).
| | TCGA lung cancer data set | GSE37745 data set |
|---|---|---|
| Number of samples included | 471 | 195 |
| Median age | 68 | 65 |
| Median age survived after 5 years | 68 | 63 |
| Median age dead after 5 years | 68 | 66 |
| Number of samples with stage I or stage IA | 88 | 40 |
| Number of samples with stage IB | 129 | 89 |
| Number of samples with stage II or stage IIA | 48 | 6 |
| Number of samples with stage IIB | 84 | 29 |
| Number of samples with stage III or stage IIIA | 84 | 21 |
| Number of samples with stage IIIB | 17 | 6 |
| Number of samples with stage IV | 21 | 4 |
| Percentage of over 5 year OS | 26.1% | 41.5% |
| Percentage of failed 5 year OS | 73.9% | 58.5% |
The hyperparameter search space of the DL models, searched with Bayesian optimization.
| Hyperparameter | Options for searching |
|---|---|
| Conv-BRITE-filters-1 | 32, 40, 48, 56, 64 |
| Conv-BRITE-filters-2 | 80, 96, 112, 128 |
| Dense-BRITE-units | 128, 144, 192, 256 |
| Dropout-rate-BRITE | 0.1, 0.2, 0.3 |
| Conv-pathway-filters-1 | 32, 40, 48, 56, 64 |
| Conv-pathway-filters-2 | 80, 96, 112, 128 |
| Dense-pathway-units | 128, 144, 192, 256 |
| Dropout-rate-pathway | 0.1, 0.2, 0.3 |
| Dense-1-units | 64, 128, 144, 192, 256 |
| Dropout-rate-1 | 0.3, 0.4, 0.5 |
| Dense-2-units | 32, 64, 128 |
| Dropout-rate-2 | 0.3, 0.4, 0.5 |
| Learning-rate | 0.001, 0.002, 0.003 |
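The discrete grid above can be expressed directly in code. The paper tunes with Bayesian optimization; as a dependency-light sketch we reproduce the search space and draw candidate configurations with a random sampler, which a Bayesian optimizer (e.g. keras-tuner or scikit-optimize) would replace in practice. The key names below are illustrative snake_case renderings of the table rows:

```python
# Reproduce the table's search space and sample candidate configurations.
from sklearn.model_selection import ParameterSampler

search_space = {
    "conv_brite_filters_1":   [32, 40, 48, 56, 64],
    "conv_brite_filters_2":   [80, 96, 112, 128],
    "dense_brite_units":      [128, 144, 192, 256],
    "dropout_rate_brite":     [0.1, 0.2, 0.3],
    "conv_pathway_filters_1": [32, 40, 48, 56, 64],
    "conv_pathway_filters_2": [80, 96, 112, 128],
    "dense_pathway_units":    [128, 144, 192, 256],
    "dropout_rate_pathway":   [0.1, 0.2, 0.3],
    "dense_1_units":          [64, 128, 144, 192, 256],
    "dropout_rate_1":         [0.3, 0.4, 0.5],
    "dense_2_units":          [32, 64, 128],
    "dropout_rate_2":         [0.3, 0.4, 0.5],
    "learning_rate":          [0.001, 0.002, 0.003],
}

# Draw 5 candidate configurations; each is a dict of hyperparameter values.
candidates = list(ParameterSampler(search_space, n_iter=5, random_state=0))
print(len(candidates))  # 5
```

Each candidate dict would parameterize one build-and-evaluate trial of the DL model.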
Results of the five average metric scores from 50 different train–test split experiments (mean ± SD) on the TCGA lung cancer data set. The accuracy, precision, recall, and F1-score were calculated with the optimal threshold selected using Youden’s J statistic.
| Models | AUC | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| DL-four-inputs | | | | | |
| DL-three-inputs-age | 65.68 ± 4% | 64.42 ± 8% | 62.34 ± 15% | 86.39 ± 4% | 71.00 ± 10% |
| DL-three-inputs-stage | 70.69 ± 4% | 68.95 ± 7% | 68.29 ± 14% | 87.54 ± 4% | 75.64 ± 8% |
| DL-two-inputs | 65.16 ± 4% | 62.82 ± 9% | 59.31 ± 17% | 87.22 ± 5% | 68.72 ± 11% |
| DL-one-input-BRITE | 63.58 ± 4% | 62.74 ± 9% | 61.03 ± 17% | 85.13 ± 4% | 69.32 ± 11% |
| DL-one-input-pathway | 64.69 ± 4% | 63.31 ± 8% | 60.97 ± 17% | 86.32 ± 5% | 69.62 ± 11% |
| KNN | 53.63 ± 5% | 57.22 ± 11% | 52.51 ± 23% | 85.47 ± 6% | 61.54 ± 16% |
| SVM | 54.77 ± 5% | 56.11 ± 11% | 52.69 ± 23% | 84.17 ± 7% | 60.58 ± 18% |
| Random-forest | 57.41 ± 6% | 57.33 ± 12% | 53.40 ± 24% | 85.09 ± 7% | 61.68 ± 18% |
| Logistic-regression | 50.81 ± 5% | 55.41 ± 15% | 53.91 ± 29% | 82.50 ± 8% | 58.67 ± 25% |
| MLP | 55.06 ± 5% | 54.61 ± 11% | 49.14 ± 21% | 83.91 ± 5% | 58.75 ± 17% |
The bold values are the highest among all the models.
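The optimal threshold named in the table caption, Youden's J statistic, maximizes sensitivity + specificity − 1 over the ROC curve. A minimal sketch with toy scores (not model outputs) follows:

```python
# Select the classification threshold that maximizes Youden's J = TPR - FPR.
import numpy as np
from sklearn.metrics import roc_curve

y_true  = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.55])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
j = tpr - fpr                       # Youden's J at each candidate threshold
best = thresholds[np.argmax(j)]     # threshold maximizing J
y_pred = (y_score >= best).astype(int)
print(best)  # 0.55 for these toy scores
```

Accuracy, precision, recall, and F1 in the table are then computed from `y_pred` at this threshold rather than at a fixed 0.5.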
FIGURE 6 The Kaplan–Meier curves of the predicted high-risk and low-risk samples for our best DL model (without clinical data) and the five ML models on the TCGA lung cancer data set. The p-values were computed using the log-rank test.
Hazard ratio of each model calculated with a univariate proportional hazards model.
| Models | HR (95% CI) | p-value |
|---|---|---|
| DL-Two-Inputs | | <0.01 |
| KNN | 2.22 | <0.20 |
| SVM | | <0.20 |
| Random-Forest | 2.31 | <0.10 |
| Logistic-Regression | 3.60 | <0.01 |
| MLP | 2.77 | <0.07 |
The bold values are the highest among all the models.
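The Kaplan–Meier curves and hazard ratios above would typically be produced with a survival library such as lifelines (`KaplanMeierFitter`, `logrank_test`, `CoxPHFitter`). For a dependency-free sketch, the product-limit estimator behind the curves can be written out directly; the times and event flags below are toy right-censored data, not the TCGA cohort:

```python
# Hand-rolled Kaplan-Meier product-limit estimator on toy censored data.
import numpy as np

def kaplan_meier(times, events):
    """Return (event_times, S(t)) for right-censored data; events: 1=death, 0=censored."""
    order = np.argsort(times)
    times, events = times[order], events[order]
    at_risk = len(times)
    event_times, probs, s = [], [], 1.0
    for t in np.unique(times):
        mask = times == t
        d = events[mask].sum()            # deaths at time t
        if d > 0:
            s *= 1.0 - d / at_risk        # product-limit update
            event_times.append(float(t))
            probs.append(s)
        at_risk -= mask.sum()             # remove deaths and censored from risk set
    return event_times, probs

t = np.array([2.0, 3.0, 3.0, 5.0, 8.0])
e = np.array([1, 1, 0, 1, 0])             # 1 = death observed, 0 = censored
print(kaplan_meier(t, e))                 # ([2.0, 3.0, 5.0], [0.8, 0.6, 0.3])
```

Fitting one curve per predicted risk group and comparing them with a log-rank test reproduces the structure of Figure 6.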
Results of the five average metric scores from 50 different train–test split experiments (mean ± SD) on the GEO GSE37745 data set. The accuracy, precision, recall, and F1-score were calculated with the optimal threshold selected using Youden’s J statistic.
| Models | AUC | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| DL-four-inputs | | | | 79.26 ± 7% | |
| DL-three-inputs-age | 70.77 ± 5% | 71.03 ± 5% | 68.96 ± 17% | 81.26 ± 7% | 72.60 ± 9% |
| DL-three-inputs-stage | 72.36 ± 6% | 72.46 ± 6% | 71.04 ± 16% | 81.32 ± 7% | 74.39 ± 8% |
| DL-two-inputs | 69.74 ± 6% | 69.74 ± 6% | 65.30 ± 17% | 82.33 ± 9% | 70.58 ± 10% |
| DL-one-input-BRITE | 68.88 ± 5% | 70.56 ± 5% | 70.52 ± 14% | 79.10 ± 8% | 73.16 ± 7% |
| DL-one-input-pathway | 67.37 ± 5% | 68.05 ± 5% | 62.70 ± 15% | 80.91 ± 9% | 68.89 ± 8% |
| KNN | 55.76 ± 8% | 63.85 ± 9% | 56.35 ± 26% | | 60.84 ± 20% |
| SVM | 54.32 ± 8% | 61.33 ± 6% | 63.13 ± 23% | 72.04 ± 10% | 63.28 ± 15% |
| Random-forest | 55.59 ± 8% | 60.72 ± 7% | 52.78 ± 23% | 77.37 ± 11% | 58.21 ± 17% |
| Logistic-regression | 54.08 ± 8% | 58.51 ± 7% | 49.83 ± 24% | 75.82 ± 11% | 55.07 ± 17% |
| MLP | 54.69 ± 8% | 59.03 ± 7% | 49.04 ± 24% | 75.89 ± 9% | 55.56 ± 15% |
The bold values are the highest among all the models.
The interpretation of TP, FP, TN, and FN. TP is the number of dead samples correctly predicted as dead, TN is the number of survived samples correctly predicted as survived, FP is the number of survived samples wrongly predicted as dead, and FN is the number of dead samples wrongly predicted as survived.
| Ground truth \ Prediction | P | N |
|---|---|---|
| P | TP | FN |
| N | FP | TN |
FIGURE 2 Radar plot comparing the DL models on the TCGA lung cancer data set.
FIGURE 3 Radar plot comparing the two-input DL model with the ML models on the TCGA lung cancer data set.
FIGURE 4 Box plot of the distribution of 50 AUCs for each model on the TCGA lung cancer data set.
FIGURE 5 Box plot of the distribution of 50 thresholds for each model on the TCGA lung cancer data set.