Bin Baek, Hyunju Lee.
Abstract
Predicting the prognosis of pancreatic cancer is important because of the very low survival rates of patients with this particular cancer. Although several studies have used microRNA and gene expression profiles and clinical data, as well as images of tissues and cells, to predict cancer survival and recurrence, the accuracies of these approaches in the prediction of high-risk pancreatic adenocarcinoma (PAAD) still need to be improved. Accordingly, in this study, we proposed two biological features based on multi-omics datasets to predict survival and recurrence among patients with PAAD. First, the clonal expansion of cancer cells with somatic mutations was used to predict prognosis. Using whole-exome sequencing data from 134 patients with PAAD from The Cancer Genome Atlas (TCGA), we found five candidate genes that were mutated in the early stages of tumorigenesis with high cellular prevalence (CP). CDKN2A, TP53, TTN, KCNJ18, and KRAS had the highest CP values among the patients with PAAD, and survival and recurrence rates were significantly different between the patients harboring mutations in these candidate genes and those harboring mutations in other genes (p = 2.39E-03, p = 8.47E-04, respectively). Second, we generated an autoencoder to integrate the RNA sequencing, microRNA sequencing, and DNA methylation data from 134 patients with PAAD from TCGA. The autoencoder robustly reduced the dimensions of these multi-omics data, and the K-means clustering method was then used to cluster the patients into two subgroups. The subgroups of patients had significant differences in survival and recurrence (p = 1.41E-03, p = 4.43E-04, respectively). Finally, we developed a prediction model for prognosis using these two biological features and clinical data. 
When support vector machines, random forest, logistic regression, and L2-regularized logistic regression were compared as prediction models, logistic regression generally performed best for both disease-free survival (DFS) and overall survival (OS) (accuracy [ACC] = 0.762 and area under the curve [AUC] = 0.795 for DFS; ACC = 0.776 and AUC = 0.769 for OS). Thus, we could identify patients with a high probability of recurrence and a high risk of poor outcomes. Our study provides insights into new personalized therapies on the basis of mutation status and multi-omics data.
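The subgrouping step described in the abstract (K-means with two clusters applied to the autoencoder's low-dimensional output) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the function name `kmeans_two_groups`, the 2-D latent vectors, and the deterministic farthest-point initialization are not taken from the study, which does not state these details here.

```python
import math


def kmeans_two_groups(points, n_iter=100):
    """Split points into two clusters with Lloyd's K-means (k = 2)."""
    # Deterministic initialization (an assumption made for reproducibility):
    # the first point, plus the point farthest from it.
    first = points[0]
    far = max(points, key=lambda p: math.dist(p, first))
    centroids = [tuple(first), tuple(far)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            0 if math.dist(p, centroids[0]) <= math.dist(p, centroids[1]) else 1
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical 2-D latent vectors for six patients (not data from the study).
latent = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans_two_groups(latent)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In practice the input would be the autoencoder's bottleneck activations for each patient, and the resulting label would serve as the subgroup feature.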
Year: 2020 PMID: 33144687 PMCID: PMC7609582 DOI: 10.1038/s41598-020-76025-1
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Workflow of the approach. Graphical summary of the prediction of survival and recurrence in patients with pancreatic cancer. (a) Omics datasets used to construct the prediction models. (b) Data preprocessing and the process for obtaining features. (c) Final nine features (including seven clinical features). (d) Machine learning models used for the prediction.
Figure 2. Kaplan–Meier OS and DFS curves for the two groups of patients with PAAD. Kaplan–Meier survival curves for the two groups of patients with PAAD defined by CP values: OS (a) and DFS (b). OS (c) and DFS (d) for patients with mutations in frequently mutated genes versus all other patients.
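For readers unfamiliar with how such curves are built, the Kaplan–Meier estimator multiplies, at each observed event time, the fraction of at-risk patients who survive that time. The sketch below is a minimal standard-library version; the function name `kaplan_meier` and the toy follow-up data are hypothetical and are not drawn from the study.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up time per patient
    events : 1 if the event (death or recurrence) was observed, 0 if censored
    """
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for ti in times if ti >= t)
        # Events occurring exactly at time t.
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up data for five patients (months, event flag).
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Computing one such curve per patient group and comparing them (for example with a log-rank test) gives plots of the kind shown in the figure.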
Statistical analysis of the TCGA-PAAD omics datasets.
| | mRNA | miRNA | DNA methylation |
|---|---|---|---|
| Initial number | 56,716 | 1,450 | 374,146 |
| After preprocessing | 280 | 413 | 18,707 |
Figure 3. Kaplan–Meier OS and DFS curves for the two groups of patients identified by K-means clustering. Kaplan–Meier survival curves for the two subgroups of patients showing OS and DFS. DFS for G1 and G2 (a), OS for G1 and G2 (b). Kaplan–Meier survival curves for the two subgroups analyzed by PCA showing OS and DFS. DFS for G1 and G2 (c), OS for G1 and G2 (d).
Figure 4. Predictive performance for DFS and OS using various features. The performance of machine learning models for predicting (a) OS and (b) DFS was measured using various feature sets. The y-axis represents the accuracy or AUC values. Clinical, nine clinical data; KG, known cancer driver genes (KRAS, CDKN2A, TP53, and SMAD4); HR, a high-risk group of patients harboring mutations in five genes with high CP values; Sub, a feature representing subgroups generated by integrating mRNA, miRNA, and DNA methylation subtypes; AUC, area under the curve values from fivefold cross-validation.
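The two metrics plotted here can be computed from predicted probabilities as sketched below. This is an illustrative standard-library version, not the authors' evaluation code: the helper names (`accuracy`, `roc_auc`, `five_fold_indices`), the 0.5 decision threshold, the interleaved fold assignment, and the toy labels and scores are all assumptions.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of patients whose thresholded score matches the true label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def five_fold_indices(n, k=5):
    """Assign sample indices 0..n-1 to k interleaved test folds."""
    return [list(range(i, n, k)) for i in range(k)]


# Hypothetical labels (1 = event) and model scores for four patients.
y_true = [1, 1, 0, 0]
y_score = [0.9, 0.4, 0.5, 0.1]
print(accuracy(y_true, y_score))  # 0.5
print(roc_auc(y_true, y_score))   # 0.75
print(five_fold_indices(10))      # [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]
```

In a fivefold setup, each fold in turn serves as the test set while the model is fitted on the other four, and the per-fold metrics are then averaged.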
Predictive performance for DFS and OS using various features based on C-index and IBS.
| Features | C-index (OS) | C-index (DFS) | IBS (OS) | IBS (DFS) |
|---|---|---|---|---|
| Clinical | 0.7955±0.04 | 0.7894±0.04 | 0.338 | 0.319 |
| KG+Clinical | 0.8019±0.04 | 0.7937±0.04 | 0.349 | 0.317 |
| KG+Sub+Clinical | 0.8057±0.04 | 0.8363±0.03 | 0.326 | 0.283 |
| HR+Clinical | 0.8152±0.04 | 0.8125±0.04 | 0.329 | 0.308 |
| Sub+Clinical | 0.8026±0.04 | 0.8361±0.03 | 0.329 | 0.286 |
| HR+Sub+Clinical | | | 0.318 | 0.265 |
Logistic regression was used as a prediction model. Clinical, nine clinical data; KG, known cancer driver genes (KRAS, CDKN2A, TP53, and SMAD4); HR, a high-risk group of patients harboring mutations in five genes with high CP values; Sub, a feature representing subgroups generated by integrating mRNA, miRNA, and DNA methylation subtypes.
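The C-index reported in the table measures how often a model ranks patients correctly: among comparable pairs, the patient with the earlier observed event should have the higher predicted risk. The sketch below assumes Harrell's formulation, which the text here does not confirm; the function name `concordance_index` and the toy data are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data (assumed variant).

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's follow-up time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # earlier event, higher risk: correct
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied risk scores count as half
    return concordant / comparable


# Hypothetical follow-up times, event flags, and predicted risks.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.3, 0.2, 0.4]
print(concordance_index(times, events, risks))  # 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, whereas for the IBS (integrated Brier score) lower values indicate better predictions.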