Yue Wang, Jennifer M Franks, Michael L Whitfield, Chao Cheng.
Abstract
MOTIVATION: The accumulation of publicly available DNA methylation datasets has created a need for tools that interpret specific cellular phenotypes in bulk tissue data. Current approaches use either single differentially methylated CpG sites or differentially methylated regions that map to genes. However, because gene length varies, these approaches may bias downstream biological interpretation. Approaches that interpret DNA methylation effectively are lacking. We have therefore developed computational models that provide biological interpretation of relevant gene sets from DNA methylation data in the context of The Cancer Genome Atlas.
Year: 2019 PMID: 30799505 PMCID: PMC6761945 DOI: 10.1093/bioinformatics/btz137
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Simply translating CpGs to genes confuses downstream results because of gene length diversity. (A) The number of genes in each range of CpG site counts (left y axis). The fraction above each bar shows the percentage of all genes in the genome that fall in that CpG range. The right y axis shows the distribution of gene lengths in each CpG range. (B) The accumulation of CpG sites with gene length. (C) Barplot showing the average Jaccard scores from 10 000 simulations. Error bars show the standard deviation of the Jaccard scores. Blue uses only promoter CpG sites and red uses all CpG sites. (D) Boxplot showing the distribution of gene lengths for each simulation and for all genes. The ANOVA P-value is shown. (E) The distributions of covered CpG sites for common genes in simulations and for all genes. (F) Boxplots comparing gene length and number of covered CpG sites between genes (subset of common genes) in significant pathways and all genes in the genome. Wilcoxon rank-sum test P-values are shown.
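The caption for panel (C) reports Jaccard scores from repeated simulations but does not say how those simulations were set up. The following is therefore only an illustrative sketch, not the authors' procedure: it computes the Jaccard score |A ∩ B| / |A ∪ B| between gene sets obtained from two random draws of CpG sites, using a made-up toy annotation (`cpg_to_gene`, `LONG_GENE`, `SHORT_*` and the helper names are all hypothetical) in which one gene carries many CpG sites.

```python
import random


def jaccard(a, b):
    """Jaccard score |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def genes_hit(cpg_to_gene, n_sites, rng):
    """Draw n_sites CpG sites at random and return the set of genes they map to."""
    sites = rng.sample(sorted(cpg_to_gene), n_sites)
    return {cpg_to_gene[s] for s in sites}


# Hypothetical toy annotation: one long gene with 50 CpG sites
# and 50 short genes with a single CpG site each.
cpg_to_gene = {f"cg_long_{i}": "LONG_GENE" for i in range(50)}
cpg_to_gene.update({f"cg_short_{i}": f"SHORT_{i}" for i in range(50)})

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
n_sim = 1000
scores = []
long_hits = 0
for _ in range(n_sim):
    first = genes_hit(cpg_to_gene, 10, rng)
    second = genes_hit(cpg_to_gene, 10, rng)
    scores.append(jaccard(first, second))
    long_hits += "LONG_GENE" in first  # count draws that include the long gene

mean_score = sum(scores) / n_sim
print(f"mean Jaccard over {n_sim} draws: {mean_score:.3f}")
print(f"fraction of draws that include LONG_GENE: {long_hits / n_sim:.3f}")
```

In this toy setting the long gene appears in almost every draw even though the CpG sites are chosen at random, which is the kind of gene-length effect the figure title refers to.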
Fig. 2. Workflow of our computational framework. (A) Validation of models. Using 10-fold cross-validation, BioMethyl trains models and calculates a new gene expression matrix for the samples. To validate our models, we compare the inferred gene expression matrix with RNA-seq data in terms of gene differences and involved pathways. (B) Application of models. Using the complete DNA methylation and RNA-seq profiles, we build models for each cancer. By further integrating GSEA, we develop the R package BioMethyl to reveal the relevant pathways in a new DNA methylation dataset of interest.
Fig. 3. Validation of BioMethyl in the context of breast cancer. (A) Density plot of the SCC of genes, comparing gene expression inferred by BioMethyl with RNA-seq data. (B) Scatter plot of t scores (ER+ samples versus ER− samples) for genes, comparing gene expression inferred by BioMethyl with RNA-seq data. GSEA pathway enrichment results are shown for (C) RNA-seq data and (D) gene expression inferred by BioMethyl, comparing ER+ with ER− samples. For pathways enriched in ER+ samples, −log10(FDR) is shown (red); orange pathways are those shared by the two results for ER+ samples. For pathways enriched in ER− samples, log10(FDR) is shown (green); green pathways are the shared pathways.
Fig. 4. Validation of BioMethyl using Fisher’s exact test. Venn diagrams for (A) differentially expressed genes selected by hypermethylated CpG sites in ER+ and ER− samples; (B) differentially expressed genes in ER+ samples selected by hypermethylated CpG sites, RNA-seq and BioMethyl; (C) differentially expressed genes in ER− samples selected by hypermethylated CpG sites, RNA-seq and BioMethyl; (D) pathways enriched in ER+ samples between RNA-seq data and BioMethyl; (E) pathways enriched in ER− samples between RNA-seq data and BioMethyl.
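The caption names Fisher’s exact test but does not give the contingency tables behind the Venn diagrams. As a hedged sketch of how an overlap between two gene or pathway sets is commonly assessed, the one-sided test below sums the hypergeometric tail for an overlap at least as large as the one observed. The function name and the example counts are hypothetical and are not taken from the paper.

```python
from math import comb


def fisher_overlap_pvalue(n_universe, n_a, n_b, n_overlap):
    """One-sided Fisher's exact test for the overlap of two sets.

    Returns P(overlap >= n_overlap) when a set of size n_b is drawn at random
    from a universe of n_universe items, n_a of which belong to the other set.
    """
    upper = min(n_a, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_universe - n_a, n_b - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_universe, n_b)


# Hypothetical counts: 20 000 genes, two lists of 300 and 250 genes sharing 40.
p = fisher_overlap_pvalue(20_000, 300, 250, 40)
print(f"one-sided P-value for the overlap: {p:.3g}")
```

Exact integer arithmetic with `math.comb` avoids floating-point underflow for very small tail probabilities; only the final division produces a float.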
Brief introduction to the functions in the BioMethyl R package
| Function | Application | Function examples |
|---|---|---|
| filterMethyData() | Pre-process methylation data | mydat <- filterMethyData(RawData) |
| calExpr() | Calculation of gene expression based on methylation data | myexpr <- calExpr(MethyData, CancerType, Example=FALSE, SaveOut=FALSE, OutFile) |
| calDEG() | Identification of differentially expressed genes | myDEG <- calDEG(ExprData, Sample_1, Sample_2, SaveOut=FALSE, OutFile) |
| calGSEA() | GSEA pathway enrichment | mypath <- calGSEA(ExprData, DEG, DEGthr=c(0, 0.01), Sample_1, Sample_2, OutFile, GeneSet='C2') |
| referCancerType() | Recommendation of cancer type | myType <- referCancerType(MethyData) |