Satoshi Takahashi, Ken Asada, Ken Takasawa, Ryo Shimoyama, Akira Sakai, Amina Bolatkan, Norio Shinkai, Kazuma Kobayashi, Masaaki Komatsu, Syuzo Kaneko, Jun Sese, Ryuji Hamamoto.
Abstract
Mortality attributed to lung cancer accounts for a large fraction of cancer deaths worldwide. With increasing mortality figures, the accurate prediction of prognosis has become essential. In recent years, multi-omics analysis has emerged as a useful survival prediction tool. However, the methodology relevant to multi-omics analysis has not yet been fully established and further improvements are required for clinical applications. In this study, we developed a novel method to accurately predict the survival of patients with lung cancer using multi-omics data. With unsupervised learning techniques, survival-associated subtypes in non-small cell lung cancer were first detected using the multi-omics datasets from six categories in The Cancer Genome Atlas (TCGA). The new subtypes, referred to as integration survival subtypes, clearly divided patients into longer and shorter-surviving groups (log-rank test: p = 0.003) and we confirmed that this is independent of histopathological classification (Chi-square test of independence: p = 0.94). Next, an attempt was made to detect the integration survival subtypes using only one categorical dataset. Our machine learning model that was only trained on the reverse phase protein array (RPPA) could accurately predict the integration survival subtypes (AUC = 0.99). The predicted subtypes could also distinguish between high and low risk patients (log-rank test: p = 0.012). Overall, this study explores novel potentials of multi-omics analysis to accurately predict the prognosis of patients with lung cancer.
Keywords: deep learning and machine learning; lung cancer; multi-omics analysis
Year: 2020 PMID: 33086649 PMCID: PMC7603376 DOI: 10.3390/biom10101460
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
Figure 1. Overall workflow of the study. (a) Detection of integration survival subtypes in non-small cell lung cancer (NSCLC) from six categories of multi-omics data in The Cancer Genome Atlas (TCGA). An autoencoder and an unsupervised learning technique were used. (b) Prediction of integration survival subtypes using data from only one category, and validation of the model using the uncommon data.
Summary of the common and uncommon datasets: number of samples of each data type.
| Data Name | LUAD | LUSC | Total |
|---|---|---|---|
| Common | 278 | 205 | 483 |
| Clinical_uncommon | 197 | 262 | 459 |
| mRNA_uncommon | 190 | 262 | 452 |
| miRNA_uncommon | 125 | 103 | 228 |
| RPPA_uncommon | 54 | 93 | 147 |
| CNV_uncommon | 190 | 259 | 449 |
| Somatic mutation_uncommon | 193 | 249 | 442 |
| Methylation_uncommon | 135 | 131 | 266 |
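The split in the table above can be illustrated with plain set operations. This is a minimal sketch, assuming that "common" means the samples present in every data type and that each "uncommon" set holds the remaining samples of that data type; the record does not define these terms, and the sample IDs and the function name `split_common_uncommon` are hypothetical.

```python
from typing import Dict, Set, Tuple


def split_common_uncommon(
    samples_by_type: Dict[str, Set[str]]
) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Split sample IDs into a shared set and per-type leftovers.

    Assumption: "common" = IDs present in every data type,
    "uncommon" = IDs of one data type that are not in the common set.
    """
    common = set.intersection(*samples_by_type.values())
    uncommon = {name: ids - common for name, ids in samples_by_type.items()}
    return common, uncommon


# Toy sample IDs (hypothetical, not TCGA barcodes).
samples_by_type = {
    "mRNA": {"s1", "s2", "s3", "s4"},
    "RPPA": {"s1", "s2", "s5"},
    "Methylation": {"s1", "s2", "s3"},
}
common, uncommon = split_common_uncommon(samples_by_type)
print(sorted(common))
print({name: sorted(ids) for name, ids in uncommon.items()})
```

Under this reading, the common set is used to derive the subtypes and each uncommon set is left over for validating a single-category model.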
Summary of the data used: number of features at each step.
| Data Type | Before Compression | After Compression by Autoencoder | After Feature Selection by Cox-PH |
|---|---|---|---|
| mRNA | 13,049 | 100 | 12 |
| miRNA | 217 | 100 | 3 |
| RPPA | 150 | 100 | 3 |
| CNV | 14,786 | 100 | 5 |
| Somatic mutation | 18,977 | 100 | 3 |
| Methylation | 19,899 | 100 | 3 |
Figure 2. Prediction of the cluster number and k-means clustering. (a) Result of the elbow method. The x-axis shows the number of clusters; the y-axis shows the distortion score. (b) Result of the Calinski-Harabasz index and Silhouette Coefficient. The x-axis shows the number of clusters; the y-axis shows the Silhouette score or Calinski-Harabasz score. (c) Visualization of the k-means clustering by t-SNE. (d) Kaplan-Meier survival curves of integration survival subtypes.
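Two of the scores named in this caption can be computed directly from a data matrix and a cluster assignment. The sketch below is illustrative only: it assumes NumPy is available, takes the distortion score to be the sum of squared distances from each point to its cluster centroid, and uses a hypothetical four-point dataset rather than the study's features.

```python
import numpy as np


def distortion(X: np.ndarray, labels: np.ndarray) -> float:
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)


def calinski_harabasz(X: np.ndarray, labels: np.ndarray) -> float:
    """Ratio of between-cluster to within-cluster dispersion,
    each divided by its degrees of freedom (k - 1 and n - k)."""
    n = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between = 0.0
    for c in clusters:
        members = X[labels == c]
        between += len(members) * ((members.mean(axis=0) - overall_mean) ** 2).sum()
    within = distortion(X, labels)
    return float((between / (k - 1)) / (within / (n - k)))


# Hypothetical toy data: two tight, well-separated groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
one_cluster = np.zeros(4, dtype=int)
two_clusters = np.array([0, 0, 1, 1])

print(distortion(X, one_cluster), distortion(X, two_clusters))
print(calinski_harabasz(X, two_clusters))
```

In the elbow method, the distortion is plotted against the number of clusters and the bend in the curve suggests a cluster count; a higher Calinski-Harabasz score indicates better-separated clusters.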
Figure 3. 3D scatter plots of the compressed common data for individual categories. Each axis represents the data values and the color shows the Cluster ID. (a) Methylation common data. (b) Reverse phase protein array (RPPA) common data. (c) Somatic mutation common data. (d) miRNA common data. The clusters are not separated in (a,c,d), whereas in (b) they are clearly separated.
Area under the curve (AUC) of logistic regression models for predicting the survival subtypes using compressed data.
| Data Type | AUC |
|---|---|
| mRNA | 0.57 ± 0.05 |
| miRNA | 0.61 ± 0.07 |
| RPPA | 0.99 ± 0.00 |
| CNV | 0.43 ± 0.04 |
| Somatic mutation | 0.50 ± 0.07 |
| Methylation | 0.55 ± 0.05 |
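To make the AUC values above easier to read, the sketch below computes AUC from its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, so 0.5 corresponds to chance and 1.0 to perfect separation. The folds and scores are hypothetical, and treating the "±" as a standard deviation across evaluation folds is an assumption; the table does not state how the spread was obtained.

```python
from statistics import mean, stdev
from typing import List, Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical held-out folds: (true subtype labels, predicted probabilities).
folds = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7]),
    ([1, 0, 1, 0], [0.6, 0.5, 0.4, 0.1]),
]
fold_aucs: List[float] = [roc_auc(y, s) for y, s in folds]
print(f"AUC = {mean(fold_aucs):.2f} ± {stdev(fold_aucs):.2f}")
```

Read this way, the RPPA row is near perfect separation, while the CNV and somatic mutation rows are at or below chance level.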
Figure 4. Kaplan-Meier survival curve of the RPPA uncommon dataset using the integration survival subtypes.
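Survival curves of two groups, such as those in this figure, are usually compared with a log-rank test. The following is a minimal two-group sketch in pure Python, not the authors' implementation: at each event time it compares the observed number of events in one group with the number expected if both groups shared the same hazard. The function name `logrank_test` and the toy follow-up times are hypothetical.

```python
import math
from typing import Sequence, Tuple


def logrank_test(
    times_a: Sequence[float], events_a: Sequence[int],
    times_b: Sequence[float], events_b: Sequence[int],
) -> Tuple[float, float]:
    """Two-group log-rank test. events are 1 (death observed) or 0 (censored).

    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Patients still at risk just before time t, per group.
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)
        # Events occurring exactly at time t.
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:  # hypergeometric variance is undefined for n == 1
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))

    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical follow-up times (arbitrary units); all events observed.
short_survivors = [1, 2, 3, 4, 5]
long_survivors = [6, 7, 8, 9, 10]
chi2, p = logrank_test(short_survivors, [1] * 5, long_survivors, [1] * 5)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

A small p-value indicates that the two survival curves differ more than chance alone would explain, which is how the log-rank p-values quoted in the abstract should be read.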
Figure 5. Receiver operating characteristic (ROC) analysis for evaluation of the machine learning models that predict the integration survival subtypes using uncompressed RPPA common datasets. ROC curves of XGBoost (a) and LightGBM (b).
Figure 6. SHapley Additive exPlanations (SHAP) summary plot. (a) The plot shows the magnitudes of the XGBoost SHAP values across all samples. The color represents the feature values (red represents high and blue represents low). (b) The plot shows the sum of the SHAP values for LightGBM.
Figure 7. Relationship between Cluster ID and NKX2-1 expression levels. (a) Relationship between NKX2-1 RPPA expression levels and integration survival subtypes. The x-axis shows the integration survival subtype and the y-axis shows the NKX2-1 RPPA expression levels, standardized per row (sample ID). (b) Relationship between NKX2-1 mRNA expression levels and integration survival subtypes. The x-axis shows the integration survival subtype and the y-axis shows the NKX2-1 mRNA expression levels, standardized per row (sample ID).