Joungmin Choi, Heejoon Chae.
Abstract
BACKGROUND: DNA methylation has recently drawn great attention because of its strong correlation with abnormal gene activity and its informative representation of cancer status. As a growing number of studies focus on DNA methylation signatures in cancer, demand for publicly available methylome datasets has increased. To meet this demand, large-scale projects were launched to uncover biological insights into cancer, providing collections of such datasets. However, public cancer data, especially for certain cancer types, remain too limited for use in research. Several simulation tools for producing epigenetic datasets have been introduced to alleviate this issue, but to date no method for generating a dataset for a user-specified cancer type has been proposed.
Keywords: Cancer; Conditional variational autoencoder; DNA methylation; Generator; Simulator
Year: 2020 PMID: 32393170 PMCID: PMC7216580 DOI: 10.1186/s12859-020-3516-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Description of trained models and datasets for simulation evaluation
| TCGA data split | Model | Description |
|---|---|---|
| 70% | methCancer-gen | CVAE-based DNN model |
| 70% | Benchmark | Estimating the beta distribution of beta values for each CpG |
| 30% | 5 different ML-based classifiers | Classifying a dataset with 100 generated samples for each cancer type |
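
The benchmark row describes estimating a beta distribution of beta values for each CpG. A minimal sketch of that idea follows, assuming a method-of-moments fit and independent sampling per CpG site; the source does not say how the parameters were estimated, so the function names (`fit_beta_moments`, `generate_samples`) and the toy values are illustrative only.

```python
import random
from statistics import mean, variance


def fit_beta_moments(values):
    """Method-of-moments estimate of Beta(alpha, beta) for values in (0, 1).

    Assumption: the source does not specify the estimator; method of
    moments is used here only because it is simple to show.
    """
    m = mean(values)
    v = variance(values)  # sample variance
    if v <= 0 or v >= m * (1 - m):
        raise ValueError("moment estimates are undefined for this sample")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def generate_samples(beta_matrix, n_samples, seed=0):
    """Fit one beta distribution per CpG column, then sample each CpG independently.

    beta_matrix: list of samples, each a list of beta values (one per CpG).
    """
    rng = random.Random(seed)
    n_cpgs = len(beta_matrix[0])
    params = [
        fit_beta_moments([row[j] for row in beta_matrix]) for j in range(n_cpgs)
    ]
    return [[rng.betavariate(a, b) for a, b in params] for _ in range(n_samples)]


# Toy input: 4 samples x 2 CpG sites (made-up beta values, not from the paper).
toy = [[0.10, 0.80], [0.15, 0.85], [0.20, 0.75], [0.12, 0.90]]
generated = generate_samples(toy, n_samples=100)
```

Sampling each CpG independently ignores correlations between sites, which is one reason such a per-site model serves as a baseline rather than as the proposed generator.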
25 cancer types and the number of samples used for training generators and classifiers
| Cancer type | Samples for training generators | Samples for training classifiers |
|---|---|---|
| BLCA | 292 | 126 |
| BRCA | 555 | 238 |
| CESC | 214 | 93 |
| COAD | 219 | 94 |
| ESCA | 129 | 56 |
| GBM | 98 | 42 |
| HNSC | 369 | 159 |
| KIRC | 226 | 98 |
| KIRP | 192 | 83 |
| LGG | 361 | 155 |
| LIHC | 263 | 114 |
| LUAD | 331 | 142 |
| LUSC | 259 | 111 |
| MESO | 60 | 27 |
| PAAD | 128 | 56 |
| PCPG | 125 | 54 |
| PRAD | 351 | 151 |
| READ | 68 | 30 |
| SARC | 182 | 79 |
| SKCM | 72 | 32 |
| STAD | 276 | 119 |
| TGCT | 105 | 45 |
| THCA | 354 | 153 |
| THYM | 86 | 38 |
| UCEC | 306 | 132 |
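
The counts in this table are consistent with a 70/30 split applied separately to each cancer type, with the generator share rounded down (for example, BLCA has 292 + 126 = 418 samples, and 70% of 418 rounded down is 292). The sketch below reproduces that arithmetic. The source does not describe how individual samples were assigned, so the random shuffle, the seed, and the placeholder sample IDs are assumptions.

```python
import random


def split_by_type(sample_ids_by_type, seed=0):
    """Return {cancer_type: (generator_ids, classifier_ids)} using a 70/30 split."""
    rng = random.Random(seed)
    splits = {}
    for cancer_type, ids in sample_ids_by_type.items():
        shuffled = list(ids)
        rng.shuffle(shuffled)  # assumption: random assignment within each type
        n_gen = len(shuffled) * 7 // 10  # floor of 70%, in integer arithmetic
        splits[cancer_type] = (shuffled[:n_gen], shuffled[n_gen:])
    return splits


# Totals are the sums of the two table columns; the IDs are placeholders.
totals = {"BLCA": 292 + 126, "GBM": 98 + 42, "MESO": 60 + 27}
ids = {t: [f"{t}-{i}" for i in range(n)] for t, n in totals.items()}
splits = split_by_type(ids)
sizes = {t: (len(gen), len(clf)) for t, (gen, clf) in splits.items()}
```

Splitting within each cancer type keeps the class proportions the same in both partitions, which matters for small classes such as MESO.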
Fig. 1 Evaluation of methCancer-gen against the benchmark method for accurate data generation. Each boxplot contains the accuracies for all cancer types in each experiment. Accuracy was measured with 5 ML algorithms classifying the 25 cancer types of the generated dataset
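
Each boxplot in Fig. 1 summarizes one accuracy value per cancer type. A minimal sketch of how such per-type accuracies could be computed from classifier output is shown here, assuming the true label of a generated sample is the cancer type it was generated for; the function name `per_type_accuracy` and the toy labels are hypothetical.

```python
from collections import defaultdict


def per_type_accuracy(true_labels, predicted_labels):
    """Fraction of correctly classified samples, computed per true cancer type."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, predicted in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


# Toy labels (made up): one BLCA sample is misclassified as BRCA.
true = ["BLCA", "BLCA", "BRCA", "BRCA", "BRCA", "GBM"]
pred = ["BLCA", "BRCA", "BRCA", "BRCA", "BRCA", "GBM"]
acc = per_type_accuracy(true, pred)
```

The resulting dictionary holds one value per cancer type, which is the kind of input a boxplot per experiment would need.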
Description of trained models and datasets for usability evaluation
| Training dataset | Testing dataset |
|---|---|
| 30% of TCGA (854 samples) | 8 types of cancer dataset from methCNA (1,038 samples) |
| 30% of TCGA & 100-500 generated samples for each cancer type | 8 types of cancer dataset from methCNA (1,038 samples) |
Comparison of cancer type prediction accuracy for SVM classifiers trained on different datasets
| Number of testing samples | Accuracy, by number of generated samples for each cancer type | | | | | |
|---|---|---|---|---|---|---|
| 313 | 0.796 | 0.796 | 0.796 | 0.796 | 0.799 | 0.802 |
| 102 | 0.922 | 0.922 | 0.922 | 0.922 | 0.931 | 0.951 |
| 71 | 0.972 | 0.972 | 0.972 | 0.972 | | |
| 45 | 0.733 | 0.733 | 0.733 | 0.733 | | |
| 162 | 0.969 | 1.000 | 1.000 | 1.000 | 1.000 | |
| 166 | 0.139 | 0.168 | 0.168 | 0.175 | 0.398 | 0.434 |
| 20 | 0.700 | 0.700 | 0.700 | 0.700 | 0.700 | 0.700 |
| 159 | 0.868 | 0.887 | 0.887 | 0.887 | 0.887 | 0.887 |
| | 0.751 | 0.761 | 0.761 | 0.762 | 0.799 | 0.809 |
Fig. 2 Workflow of the proposed methCancer-gen model based on CVAE using DNA methylation data. It consists of two main phases: (1) preprocessing to minimize bias caused by a high frequency of missing values, and (2) generation of a DNA methylation dataset for a specified cancer type by the CVAE neural network model
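
The two phases in the caption can be illustrated with a small sketch: filtering CpG sites by their frequency of missing values, then building the label-conditioned input vector that a conditional VAE encoder receives. The caption does not give the filtering rule or the encoding, so the 20% threshold, the use of `None` for missing values, the one-hot label encoding, and all function names are assumptions made for illustration.

```python
def filter_cpgs(samples, max_missing_fraction=0.2):
    """Keep CpG columns whose fraction of missing values (None) is within the threshold.

    Assumption: the threshold is hypothetical; the source gives no value.
    Returns the kept column indices and the filtered sample rows.
    """
    n_samples = len(samples)
    n_cpgs = len(samples[0])
    keep = [
        j
        for j in range(n_cpgs)
        if sum(row[j] is None for row in samples) / n_samples <= max_missing_fraction
    ]
    return keep, [[row[j] for j in keep] for row in samples]


def one_hot(label, labels):
    """One-hot encode a cancer type against a fixed list of known types."""
    return [1.0 if label == known else 0.0 for known in labels]


def conditioned_input(beta_values, label, labels):
    """Concatenate beta values with the one-hot cancer type (the CVAE condition)."""
    return list(beta_values) + one_hot(label, labels)


# Toy data (made up): 5 samples x 3 CpG sites; the middle site is mostly missing.
samples = [
    [0.1, None, 0.9],
    [0.2, None, 0.8],
    [0.3, 0.5, 0.7],
    [0.4, None, 0.6],
    [0.5, 0.4, 0.5],
]
kept, filtered = filter_cpgs(samples)
x = conditioned_input(filtered[0], "GBM", ["BLCA", "GBM"])
```

In a conditional VAE the same label vector is also concatenated to the latent vector before decoding, which is what allows sampling for a user-specified cancer type. Values still missing after filtering would need imputation before training; that step is not shown here.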