Wang-Ren Qiu1, Bei-Bei Qi1, Wei-Zhong Lin1, Shou-Hua Zhang2, Wang-Ke Yu1, Shun-Fa Huang3.
Abstract
The early symptoms of lung adenocarcinoma are inapparent, and clinical diagnosis relies primarily on X-ray examination and pathological section examination. With the development of bioinformatics technology, the discovery of biomarkers offers another direction for diagnosis. However, diagnosis is not accurate or trustworthy when it relies on omics data with high-dimension, low-sample-size (HDLSS) features or on biomarkers derived from a single type of omics data. To address these problems, feature selection methods from biological analysis are used to reduce the dimension of gene expression data (GSE19188) and DNA methylation data (GSE139032, GSE49996). In addition, the Cartesian product method is used to expand the sample set and to integrate the gene expression and DNA methylation data. The classifier is built with a deep neural network and evaluated by K-fold cross-validation. Moreover, gene ontology analysis and literature retrieval are used to analyze the biological relevance of the selected genes, and the TCGA database is used for survival analysis of these candidate genes through Kaplan-Meier estimates, to explore the detailed molecular mechanism of lung adenocarcinoma. Survival analysis shows that COL5A2 and SERPINB5 are significant for identifying lung adenocarcinoma and are considered biomarkers of lung adenocarcinoma.
Keywords: deep neural network; feature selection; lung adenocarcinoma biomarkers; multi-omics data; survival analysis
Year: 2022 PMID: 35846148 PMCID: PMC9280023 DOI: 10.3389/fgene.2022.926927
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
FIGURE 1. Workflow of lung adenocarcinoma prediction and biomarker recognition.
Benchmark dataset.
| Dataset | Gene expression | DNA methylation | |
|---|---|---|---|
| GEO ID | GSE19188 | GSE139032 | GSE49996 |
| Normal samples | 65 | 77 | 41 |
| LUAD samples | 91 | 77 | 39 |
| Features | 23,489 | 24,025 | 24,025 |
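The abstract says a Cartesian product is used to expand the sample set and integrate the two data types, but it does not spell out the pairing rule. The following is a minimal sketch under one plausible assumption: every gene expression sample is paired with every DNA methylation sample of the same class, and the two feature vectors are concatenated. The function name `cartesian_integrate` and the toy values are hypothetical, not the authors' code.

```python
from itertools import product


def cartesian_integrate(expr_samples, meth_samples):
    """Pair each expression sample with each methylation sample of the same class.

    Both inputs are lists of (label, feature_list) tuples. The result is a list
    of (label, concatenated_feature_list) tuples.
    """
    merged = []
    for (e_label, e_feat), (m_label, m_feat) in product(expr_samples, meth_samples):
        # Assumption: only same-class pairs are kept.
        if e_label == m_label:
            merged.append((e_label, list(e_feat) + list(m_feat)))
    return merged


# Toy data: 2 normal + 1 LUAD expression samples, 1 normal + 2 LUAD methylation samples.
expr = [("normal", [0.1, 0.2]), ("normal", [0.3, 0.1]), ("LUAD", [0.9, 0.8])]
meth = [("normal", [0.5]), ("LUAD", [0.7]), ("LUAD", [0.6])]

combined = cartesian_integrate(expr, meth)
print(len(combined))  # 2*1 normal pairs + 1*2 LUAD pairs
```

If this same-class pairing were applied to GSE19188 and GSE139032 with the counts in the table, it would give 65 × 77 = 5,005 normal pairs and 91 × 77 = 7,007 LUAD pairs. Whether the authors paired the datasets exactly this way is not stated here.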
FIGURE 2. PCA for all samples in the (A) GSE19188, (B) GSE139032, and (C) GSE49996 datasets indicated two different groups. Pearson correlation matrix among all samples in (D) GSE19188, (E) GSE139032, and (F) GSE49996.
FIGURE 3. (A) Volcano map of DEGs in gene expression data. (B) Volcano map of DMPs in DNA methylation data.
Parameter setting.
| Methods | Parameter setting |
|---|---|
| DNN | learning rate = 0.02, dropout = 0.85 |
| RF | criterion = 'entropy', n_estimators = 100, n_jobs = -1, max_depth = 6 |
| KNN | n_neighbors = 10 |
| NB | default parameters |
Dimensions of different feature extraction algorithms in each fold data.
| 5-Fold CV | Num of genes (MI t-SNE DEG) | Num of CpGs (MI t-SNE DMP) | Num of genes and CpGs (MI t-SNE DEG + DMP) |
|---|---|---|---|
| K = 1 | 47 | 109 | 156 |
| K = 2 | 52 | 120 | 172 |
| K = 3 | 55 | 121 | 176 |
| K = 4 | 57 | 131 | 188 |
| K = 5 | 54 | 121 | 175 |
| Avg | 53 | 120.4 | 173.4 |
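The feature counts differ from fold to fold, which is consistent with feature selection being rerun on each training split, although the source does not say so explicitly. A minimal sketch of how a 5-fold partition could be generated follows. The helper `k_fold_indices` is hypothetical, and it uses contiguous, unshuffled folds for simplicity.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds of near-equal size."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(12, k=5))
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    # Feature selection would be fitted on train_idx only, then applied to test_idx.
    print(fold_number, len(train_idx), len(test_idx))
```

Fitting the selection step inside each fold keeps test samples from influencing which genes or CpGs are chosen.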
The Accuracy and AUROC results of different prediction algorithms.
| Method | Metric | Gene expression | | | Methylation expression | | | Gene expression and DNA methylation | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | MI | t-SNE | DEG | MI | t-SNE | DMP | MI | t-SNE | DEG + DMP |
| RF | ACC | 0.8387 | 0.5448 | 0.9435 | 0.5044 | 0.5766 | 0.9465 | 0.5000 | 0.5726 | 0.9659 |
| | AUROC | 0.8653 | 0.5489 | 0.9407 | 0.5053 | 0.5889 | 0.9377 | 0.5000 | 0.6209 | 0.9694 |
| KNN | ACC | 0.9293 | 0.5325 | 0.9435 | 0.5641 | 0.5171 | 0.9571 | 0.6132 | 0.4974 | |
| | AUROC | 0.9290 | 0.4516 | 0.9469 | 0.5667 | 0.5559 | 0.9459 | 0.6065 | 0.5171 | |
| NB | ACC | 0.6714 | 0.5065 | 0.9354 | 0.5044 | 0.6196 | 0.9659 | 0.5000 | 0.6099 | 0.9750 |
| | AUROC | 0.6264 | 0.4542 | 0.9317 | 0.5000 | 0.6192 | 0.9579 | 0.5000 | 0.5964 | 0.9652 |
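For readers less familiar with the two metrics in the table, here is a minimal pure-Python sketch of how accuracy and AUROC can be computed for a binary classifier. AUROC is written in its rank form: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The labels and scores are toy values, not data from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(y_true, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A positive scoring above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
y_pred = [int(s >= 0.5) for s in scores]  # threshold of 0.5 is an assumption

print(accuracy(y_true, y_pred))
print(auroc(y_true, scores))
print(auroc(y_true, [0.5] * 6))  # constant scores carry no ranking information
```

Under this definition, a model that assigns the same score to every sample has an AUROC of exactly 0.5. That is one way values of 0.5000 can arise, though the source does not say why they appear in the table.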
5-fold performance comparison of different feature selection algorithms in the deep learning-based prediction model.
| Fold | MI | | | t-SNE | | | The proposed model | | |
|---|---|---|---|---|---|---|---|---|---|
| | Cost | Accuracy | AUROC | Cost | Accuracy | AUROC | Cost | Accuracy | AUROC |
| 1 | 1.9900 | 0.6156 | 0.4309 | 0.6508 | 0.7143 | 0.5000 | 0.0004 | 0.9999 | 0.9998 |
| 2 | 3.3951 | 0.6230 | 0.5000 | 0.6574 | 0.6230 | 0.5000 | 0.0171 | 0.9959 | 0.9967 |
| 3 | 6.7629 | 0.4298 | 0.5000 | 0.8742 | 0.4551 | 0.5077 | 0.1822 | 0.9849 | 0.9808 |
| 4 | 1.4862 | 0.5076 | 0.5028 | 0.9059 | 0.4702 | 0.4677 | 0.1213 | 0.9764 | 0.9772 |
| 5 | 3.2534 | 0.6534 | 0.5000 | 0.5965 | 0.6754 | 0.6193 | 0.0454 | 0.9945 | 0.9958 |
| Average | 3.3775 | 0.5659 | 0.4867 | 0.7370 | 0.5876 | 0.5189 | | | |
The average Cost, Accuracy, and AUROC of the DEG + DMP feature selection algorithm when used in the deep learning-based prediction model.
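The averages for the proposed model are not legible in this copy of the table, but they can be recomputed from the five per-fold values that are listed. The sketch below does that arithmetic, and first checks the same calculation against the MI cost and accuracy columns, whose averages are given. The results are computed here from the listed fold values; they are not figures quoted from the authors.

```python
from statistics import mean

# Per-fold values for the proposed model, copied from the table above.
proposed = {
    "cost": [0.0004, 0.0171, 0.1822, 0.1213, 0.0454],
    "accuracy": [0.9999, 0.9959, 0.9849, 0.9764, 0.9945],
    "auroc": [0.9998, 0.9967, 0.9808, 0.9772, 0.9958],
}

# Sanity check: the MI columns should reproduce their listed averages.
mi_cost_avg = round(mean([1.9900, 3.3951, 6.7629, 1.4862, 3.2534]), 4)
mi_acc_avg = round(mean([0.6156, 0.6230, 0.4298, 0.5076, 0.6534]), 4)

averages = {name: round(mean(values), 4) for name, values in proposed.items()}
print(mi_cost_avg, mi_acc_avg)
print(averages)
```

Rounded to four decimals, this gives roughly 0.0733 for cost, 0.9903 for accuracy, and 0.9901 for AUROC.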
FIGURE 4. Average accuracy of all comparisons and the proposed model.
FIGURE 5. Average AUROC of all comparisons and the proposed model.
FIGURE 6. Average F1-score of all comparisons and the proposed model.
Selected genes from gene expression and DNA methylation and comparison with LUAD database.
| Each fold | K = 1 | K = 2 | K = 3 | K = 4 | K = 5 |
|---|---|---|---|---|---|
| Selected genes | ABCA3 AIM2 CA3 CDKN2A COL1A1 COL5A2 CYYR1 FOXF1 GREM1 HLF MAGEA6 PCSK1 PROK2 SCNN1B SERPINB5 SLC7A11 SOSTDC1 SOX17 SOX7 STXBP6 TWIST1 | ABCA3 AGTR1 AIM2 AZGP1 CA3 COL1A1 CLIC3 COL5A2 CYYR1 FOXF1 FOXF2 GDF10 GREM1 HLF MAGEA6 MAL MUC1 PROK2 S100A2 SERPINB5 SLIT3 SOSTDC1 SPARCL1 STXBP6 TRHDE TWIST1 | ABCA3 C1orf116 COL1A1 COL5A2 COX7A1 CYYR1 FOXF1 GREM1 HIST1H2BH HK3 HLF MAGEA6 MAL MUC1 S100A2 SCNN1B SERPINB5 SLC7A11 SOSTDC1 SOX7 SPARCL1 STXBP6 TWIST1 ZBED2 | ABCA3 AGTR1 C1orf116 CDKN2A COL1A1 COL5A2 COX7A1 CYYR1 EFEMP1 FOXF1 GDF10 GREM1 HLF IL6 MAL MMP13 PKP1 SERPINB5 SLC7A11 SOSTDC1 SOX7 SPARCL1 STXBP6 TWIST1 | ABCA3 AZGP1 C1orf116 COL1A1 COL5A2 COX7A1 CYYR1 EFEMP1 FOXF1 GREM1 HLF MAGEA6 MMP13 PCSK1 PROK2 S100A2 SERPINB5 SLC7A11 SOX7 SPARCL1 STXBP6 TWIST1 ZBED2 |
| Union of the selected genes across 5 folds | ABCA3 COL1A1 COL5A2 CYYR1 SLC7A11 GREM1 HLF SERPINB5 SOX7 SPARCL1 STXBP6 TWIST1 |
GO analysis of selected genes.
| Category | Term | p-value | Gene |
|---|---|---|---|
| GOTERM_BP_DIRECT | GO:0030198∼extracellular matrix organization | 0.0001 | COL1A1, FOXF2, FOXF1, COL5A2, SERPINB5, JAM2 |
| GOTERM_BP_DIRECT | GO:0050900∼leukocyte migration | 0.0042 | COL1A1, SLC7A11, COL5A2, JAM2 |
| GOTERM_BP_DIRECT | GO:0001558∼regulation of cell growth | 0.0194 | SOCS2, FAM107A, AGTR1 |
| GOTERM_MF_DIRECT | GO:0001228∼transcriptional activator activity, RNA polymerase II transcription regulatory region sequence-specific binding | 0.0287 | SOX17, SERPINB5, FOXF1 |
| GOTERM_MF_DIRECT | GO:0046982∼protein heterodimerization activity | 0.0399 | TWIST1, COL5A2, JAM2 |
FIGURE 7. Overall survival analysis in LUAD based on the TCGA data as determined by Kaplan-Meier estimates. (A) COL5A2 and (B) SERPINB5 significantly affect the prognosis of LUAD in overall survival (p < 0.05).
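To make the Kaplan-Meier estimate behind these survival curves concrete, here is a minimal pure-Python sketch of the product-limit estimator. At each time with at least one observed death, the running survival probability is multiplied by (1 − deaths / number at risk); censored patients leave the risk set without lowering the curve. The follow-up times and event flags are toy values, not TCGA data, and `kaplan_meier` is a hypothetical helper rather than the authors' code.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time per patient.
    events: 1 if death was observed at that time, 0 if the patient was censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        censored = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            censored += 1 - data[i][1]
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after time t.
        at_risk -= deaths + censored
    return curve


durations = [5, 8, 8, 12, 15]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(durations, events)
for time, prob in curve:
    print(time, round(prob, 4))
```

In a biomarker analysis, one such curve would typically be estimated for each group (for example, high versus low expression of a gene) and the groups compared with a log-rank test. That test is not shown here.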