Saurav Mallik, Soumita Seth, Tapas Bhadra, Zhongming Zhao.
Abstract
DNA methylation changes have been useful for cancer biomarker discovery, classification, and potential treatment development. So far, existing methods use either differentially methylated CpG sites or combined CpG sites (differentially methylated regions) that can be mapped to genes. However, such methylation signal mapping has limitations. To address these limitations, in this study we introduced a combinatorial framework that uses linear regression, differential expression analysis, and a deep learning method for accurate biological interpretation of DNA methylation by integrating DNA methylation data with the corresponding TCGA gene expression data. We demonstrated it for uterine cervical cancer. First, we pre-filtered outliers from the data set and then predicted gene expression values from the pre-filtered methylation data through linear regression. We identified differentially expressed genes (DEGs) with the empirical Bayes test in Limma. We then applied a deep learning method, "nnet", to classify the cervical cancer labels from those DEGs and computed the classification metrics, including accuracy and area under the curve (AUC), through 10-fold cross-validation. We applied our approach to a uterine cervical cancer DNA methylation dataset (NCBI accession ID: GSE30760; 27,578 features covering 63 tumor and 152 matched normal samples). After linear regression and differential expression analysis, we obtained 6287 DEGs with a false discovery rate (FDR) < 0.001. After deep learning analysis, we obtained an average classification accuracy of 90.69% (±1.97%) for the uterine cervical cancer labels. This performance is better than that of other peer methods. We performed in-degree and out-degree hub gene network analysis using Cytoscape and reported the five top in-degree genes (PAIP2, GRWD1, VPS4B, CRADD and LLPH) and the five top out-degree genes (MRPL35, FAM177A1, STAT4, ASPSCR1 and FABP7). After that, we performed KEGG pathway and Gene Ontology enrichment analysis of the DEGs using the WebGestalt tool (WEB-based Gene SeT AnaLysis Toolkit). In summary, our proposed framework, which integrates linear regression, differential expression analysis, and deep learning, provides a robust approach to better interpret DNA methylation and gene expression data in disease studies.
Keywords: DNA methylation; Linear regression; deep learning; differentially expressed genes; uterine cervical cancer
Year: 2020 PMID: 32806782 PMCID: PMC7465138 DOI: 10.3390/genes11080931
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
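The abstract describes predicting gene expression from methylation through linear regression. As a minimal sketch of that step, assuming one methylation predictor per gene and ordinary least squares (the source does not give the model details), the fit could look like the following. The function names and the toy values are hypothetical and are not taken from the study.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has zero variance; the slope is undefined")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(intercept, slope, x):
    """Apply the fitted line to new methylation values."""
    return [intercept + slope * xi for xi in x]


# Hypothetical toy data for a single gene (not from GSE30760):
# methylation beta values and matched expression values.
methylation = [0.10, 0.25, 0.40, 0.55, 0.70]
expression = [9.0, 8.0, 7.0, 6.0, 5.0]

intercept, slope = fit_simple_linear_regression(methylation, expression)
predicted = predict(intercept, slope, [0.30])
```

In this toy example the slope is negative, so higher methylation maps to lower predicted expression. The predicted values, not the raw methylation values, would then be passed to the differential expression step.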
Figure 1. Flowchart of the proposed framework.
List of differentially expressed genes (false discovery rate (FDR) sorted).
| Gene Symbol | FDR | | |
|---|---|---|---|
| | 45.22 | | |
| | 32.50 | | |
| | 32.09 | | |
| | 30.24 | | |
| | −29.40 | | |
| | 29.02 | | |
| | −28.71 | | |
| | 28.19 | | |
| | 27.69 | | |
| | 26.49 | | |
| | 26.25 | | |
| | 26.16 | | |
| | 25.94 | | |
| | 25.67 | | |
| | 25.45 | | |
| | 25.41 | | |
| | 25.40 | | |
| | 25.40 | | |
| | 25.23 | | |
| | 24.89 | | |
| | 24.86 | | |
| | 24.74 | | |
| | 24.71 | | |
| | 24.63 | | |
| | −24.51 | | |
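The DEG list is sorted by FDR, but the source does not say which multiple-testing adjustment produced those values. As a hedged illustration, the sketch below assumes the Benjamini-Hochberg procedure and shows how raw p-values could be turned into FDR-adjusted values. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over this rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes (not from the study).
raw_p = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw_p)

# Genes passing an FDR cutoff, here the 0.001 threshold named in the abstract.
significant = [i for i, q in enumerate(fdr) if q < 0.001]
```

With these toy inputs no gene passes the 0.001 cutoff, which shows how strict that threshold is compared with the more common 0.05.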
Values of disease classification metrics by proposed method.
| Metrics | Average Value (std *) |
|---|---|
| Average accuracy | 90.69% (±1.97%) |
| Average sensitivity | |
| Average specificity | |
| Average precision | |
| Average overall error rate | |
| Area under curve (AUC) | 0.858 |
* std: standard deviation.
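The metrics in the table above are averages with standard deviations over cross-validation folds. The sketch below shows one way such values could be computed from per-fold confusion-matrix counts. It assumes the sample standard deviation and the usual definitions of each metric, neither of which the source states. All names and counts are hypothetical.

```python
from statistics import mean, stdev


def classification_metrics(tp, fp, tn, fn):
    """Compute the table's metrics from one fold's confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. the fold contains both
    classes and at least one positive prediction.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),
        "error_rate": 1.0 - accuracy,
    }


def summarize_folds(fold_counts):
    """Return {metric: (mean, sample std)} over folds; needs at least two folds."""
    per_fold = [classification_metrics(*counts) for counts in fold_counts]
    summary = {}
    for name in per_fold[0]:
        values = [fold[name] for fold in per_fold]
        summary[name] = (mean(values), stdev(values))
    return summary


# Hypothetical (tp, fp, tn, fn) counts for three folds (not from the study).
folds = [(9, 1, 9, 1), (8, 2, 8, 2), (10, 0, 10, 0)]
summary = summarize_folds(folds)
```

The study used 10 folds; three are shown here only to keep the example short.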
Figure 2. ROC plots of all classification metrics for the proposed method.
Figure 3. Comparative bar plot: proposed method vs. state-of-the-art method (RSNNS).
Top 10 hub genes according to the in-degree centrality from our proposed method.
| Gene Symbol | In-Degree | Out-Degree | Average Shortest Path Length | Betweenness Centrality | Closeness Centrality | Clustering Coefficient |
|---|---|---|---|---|---|---|
| | 439 | 32 | 3.587 | 0.802 | 0.279 | 0.188 |
| | 425 | 66 | 3.435 | 11.001 | 0.291 | 0.178 |
| | 406 | 68 | 3.460 | 2.276 | 0.289 | 0.191 |
| | 406 | 178 | 3.087 | 11.003 | 0.324 | 0.152 |
| | 403 | 40 | 3.545 | 2.313 | 0.282 | 0.182 |
| | 390 | 89 | 3.556 | 1.927 | 0.281 | 0.168 |
| | 372 | 111 | 3.294 | 4.661 | 0.304 | 0.175 |
| | 372 | 88 | 3.364 | 1.434 | 0.297 | 0.200 |
| | 365 | 43 | 3.515 | 0.734 | 0.284 | 0.214 |
| | 348 | 39 | 3.546 | 4.124 | 0.282 | 0.193 |
Top 10 hub genes according to the out-degree centrality from our proposed method.
| Gene Symbol | In-Degree | Out-Degree | Average Shortest Path Length | Betweenness Centrality | Closeness Centrality | Clustering Coefficient |
|---|---|---|---|---|---|---|
| | 239 | 376 | 2.765 | 9.354 | 0.362 | 0.141 |
| | 21 | 339 | 3.002 | 0.263 | 0.333 | 0.225 |
| | 94 | 332 | 2.872 | 2.744 | 0.348 | 0.211 |
| | 68 | 329 | 2.888 | 1.132 | 0.346 | 0.212 |
| | 204 | 315 | 2.779 | 3.008 | 0.360 | 0.171 |
| | 65 | 311 | 3.010 | 1.230 | 0.332 | 0.191 |
| | 18 | 299 | 2.887 | 0.364 | 0.346 | 0.249 |
| | 86 | 283 | 2.993 | 1.385 | 0.334 | 0.218 |
| | 40 | 282 | 3.119 | 0.477 | 0.321 | 0.221 |
| | 52 | 274 | 3.005 | 0.526 | 0.333 | 0.243 |
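The two hub-gene tables rank genes by in-degree and out-degree in a directed gene network. The source obtained these values with Cytoscape. The sketch below is only an illustration of how the two degree counts and a top-k ranking could be derived from a directed edge list. The node labels and edges are hypothetical placeholders, not genes from the study.

```python
from collections import Counter


def degree_tables(edges):
    """Count in-degree and out-degree for each node of a directed graph.

    `edges` is an iterable of (source, target) pairs; duplicate edges are
    counted once.
    """
    in_degree, out_degree = Counter(), Counter()
    for source, target in set(edges):
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree


def top_hubs(degree, k):
    """Return the k nodes with the highest degree, ties broken by name."""
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical directed edges between placeholder nodes.
edges = [
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "C"), ("D", "C"),
    ("A", "B"),  # duplicate edge, ignored
]
in_degree, out_degree = degree_tables(edges)
top_in = top_hubs(in_degree, 1)
top_out = top_hubs(out_degree, 1)
```

Here node C is the top in-degree hub and node A the top out-degree hub, mirroring how the two tables can surface different genes from the same network.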
Top significant KEGG Pathways (FDR sorted).
| KEGG Pathway Name * | #Genes | Enriched | FDR |
|---|---|---|---|
| | 198 | | |
| | 139 | | |
| | 255 | | |
| | 199 | | |
| | 524 | | |
| | 206 | | |
| | 144 | | |
| | 123 | | |
| | 146 | | |
| | 97 | | |
* See Supplementary Table S3 for details.
Top significant GO-BP term enriched (FDR sorted).
| GO-BP Term Name * | #Genes | Enriched | FDR |
|---|---|---|---|
| | 1986 | 0 | 0 |
| | 1967 | 0 | 0 |
| | 1949 | 0 | 0 |
| | 1942 | 0 | 0 |
| | 1919 | 0 | 0 |
| | 1919 | 0 | 0 |
| | 1911 | 0 | 0 |
| | 1911 | 0 | 0 |
| | 1908 | 0 | 0 |
| | 1860 | 0 | 0 |
* See Supplementary Table S4 for details.
Top significant GO-CC term enriched (FDR sorted).
| GO-CC Term Name * | #Genes | Enriched | FDR |
|---|---|---|---|
| | 1861 | 0 | 0 |
| | 1690 | 0 | 0 |
| | 1673 | 0 | 0 |
| | 1661 | 0 | 0 |
| | 1630 | 0 | 0 |
| | 1596 | 0 | 0 |
| | 1516 | 0 | 0 |
| | 1462 | 0 | 0 |
| | 1425 | 0 | 0 |
| | 1425 | 0 | 0 |
* See Supplementary Table S5 for details.
Top significant GO-MF term enriched (FDR sorted).
| GO-MF Term Name * | #Genes | Enriched | FDR |
|---|---|---|---|
| | 1696 | 0 | 0 |
| | 1538 | 0 | 0 |
| | 684 | 0 | 0 |
| | 896 | | |
| | 898 | | |
| | 915 | | |
| | 638 | | |
| | 845 | | |
| | 823 | | |
| | 781 | | |
* See Supplementary Table S6 for details.
Figure 4. Volcano plot of the normalized enrichment scores of the FDR-significant KEGG pathways from GSEA of the DEGs.