Le Li, Yuwei Gao, Qiong Wu, Alfred S L Cheng, Kevin Y Yip.
Abstract
Many DNA methylome profiling methods cannot distinguish between 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC). Because 5mC typically acts as a repressive mark whereas 5hmC is an intermediate form during active demethylation, the inability to separate their signals could lead to incorrect interpretation of the data. Is the extra information contained in 5hmC signals worth the additional experimental and computational costs? Here we combine whole-genome bisulfite sequencing (WGBS) and oxidative WGBS (oxWGBS) data in various human tissues to investigate the quantitative relationships between gene expression and the two forms of DNA methylation at promoters, transcript bodies, and immediate downstream regions. We find that 5mC and 5hmC signals correlate with gene expression in the same direction in most samples. Considering both types of signals increases the accuracy of expression levels inferred from methylation data by a median of 18.2% as compared to having only WGBS data, showing that the two forms of methylation provide complementary information about gene expression. Differential analysis between matched tumor and normal pairs is particularly affected by the superposition of 5mC and 5hmC signals in WGBS data, with at least 25%-40% of the differentially methylated regions (DMRs) identified from 5mC signals not detected from WGBS data. Our results also confirm a previous finding that methylation signals at transcript bodies are more indicative of gene expression levels than promoter methylation signals. Overall, our study provides data for evaluating the cost-effectiveness of some experimental and analysis options in the study of DNA methylation in normal and cancer samples.
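To make the superposition of signals concrete: bisulfite sequencing reads both 5mC and 5hmC as methylated, whereas oxidative bisulfite sequencing reads only 5mC as methylated. A naive per-site estimate therefore takes the oxWGBS β value as the 5mC level and the WGBS minus oxWGBS difference as the 5hmC level. The sketch below illustrates only that subtraction; it is not necessarily the estimation procedure used in the study, and the function name and the clipping of negative differences to zero are assumptions made for illustration.

```python
def naive_5mc_5hmc(beta_wgbs, beta_oxwgbs):
    """Naive per-site split of a WGBS signal into 5mC and 5hmC levels.

    beta_wgbs   : fraction of reads called methylated in WGBS (5mC + 5hmC)
    beta_oxwgbs : fraction of reads called methylated in oxWGBS (5mC only)

    Assumption (illustrative): sampling noise can make the difference
    negative, so it is clipped at zero here.
    """
    beta_5mc = beta_oxwgbs
    beta_5hmc = max(0.0, beta_wgbs - beta_oxwgbs)
    return beta_5mc, beta_5hmc


# Example: a site with 80% methylated reads in WGBS and 50% in oxWGBS
five_mc, five_hmc = naive_5mc_5hmc(0.8, 0.5)
```

With low coverage the difference of two noisy proportions is itself noisy, which is one reason a statistical estimator may be preferred over plain subtraction in practice.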
Year: 2019 PMID: 30782641 PMCID: PMC6442395 DOI: 10.1101/gr.240036.118
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Figure 1. Definition of the regions associated with each transcript and the resulting data set. (A) Genomic regions defined for each transcript at which the β values or PDR values were used to infer expression level of the transcript. Both the upstream (Up) and downstream (Down) regions were divided into five 400-bp bins (Up1–Up5 and Down1–Down5). The transcript body (Body) was divided into first exon (FirstEx), first intron (FirstIn), internal exons (IntEx), internal introns (IntIn), last exon (LastEx), and last intron (LastIn). (B) A heat map of the resulting large data set for sample Liver T1. Each row represents a transcript, and the transcripts are sorted in ascending order according to their expression levels. The four blocks of columns represent β values based on WGBS, oxWGBS, 5mC, and 5hmC, respectively. Within each block, the different columns are, respectively, Up5–Up1, FirstEx, FirstIn, IntEx, IntIn, LastEx, LastIn, and Down1–Down5. After the four methylation blocks, the last two columns show the log expression level and expression class, respectively. (C) Hierarchical clustering of the samples based on all their methylation features in the large data set using Ward's method.
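The flanking bins described in Figure 1A can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes 0-based half-open coordinates, that Up1 and Down1 are the bins adjacent to the transcript (consistent with the Up5–Up1 column order in panel B), and that "upstream" follows the transcript's strand. The function name `flanking_bins` is hypothetical.

```python
def flanking_bins(tx_start, tx_end, strand, n_bins=5, bin_size=400):
    """Return {name: (start, end)} for Up1..UpN and Down1..DownN.

    Coordinates are 0-based, half-open genomic intervals (an assumption).
    Bin 1 is adjacent to the transcript; higher numbers lie farther away.
    """
    bins = {}
    for i in range(1, n_bins + 1):
        near, far = (i - 1) * bin_size, i * bin_size
        left = (tx_start - far, tx_start - near)   # lower-coordinate side
        right = (tx_end + near, tx_end + far)      # higher-coordinate side
        # On the minus strand, upstream is the higher-coordinate side.
        up, down = (left, right) if strand == "+" else (right, left)
        # Clip at the chromosome start so intervals never go negative.
        bins[f"Up{i}"] = (max(0, up[0]), max(0, up[1]))
        bins[f"Down{i}"] = (max(0, down[0]), max(0, down[1]))
    return bins


# Example: a plus-strand transcript spanning [10000, 20000)
example = flanking_bins(10000, 20000, "+")
```

Each flank covers 5 × 400 = 2000 bp, so the modeled region extends 2 kb beyond either end of the transcript.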
Figure 2. Accuracy of the models for inferring expression classes based on the large data set. (A–C) Each bar represents the distribution of AUROC values across the three expression classes of the three samples in each sample group. (A) Comparison of models involving different combinations of methylation features from all associated genomic regions of the transcripts. (B) Comparison of models involving both 5mC and 5hmC levels at different combinations of genomic regions associated with each transcript. (C) Comparison of several knowledge-driven models. (D) The most useful methylation feature blocks for inferring gene expression level based on the forward-search procedure of feature selection. For each sample, the top feature block was given a score of eight, the second given a score of seven, and so on, for the top eight feature blocks. The total score of each feature block across all 12 samples is shown as a percentage of the maximum possible score of 8 × 12 = 96.
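The scoring scheme in panel D can be expressed as a short calculation. The sketch below assumes each sample's forward search yields an ordered list of feature blocks, best first. The function name and input format are illustrative and not taken from the study.

```python
from collections import defaultdict


def block_scores(rankings, top_k=8):
    """Aggregate per-sample feature-block rankings into percentage scores.

    rankings : list of lists, one per sample, ordered best block first
    top_k    : number of ranked blocks scored per sample; the top block
               gets top_k points, the next top_k - 1, and so on down to 1
    Returns {block: percentage of the maximum possible total score}.
    """
    totals = defaultdict(int)
    for ranked_blocks in rankings:
        for position, block in enumerate(ranked_blocks[:top_k]):
            totals[block] += top_k - position
    max_score = top_k * len(rankings)  # e.g. 8 * 12 = 96 in Figure 2D
    return {block: 100.0 * score / max_score for block, score in totals.items()}


# Toy example with two samples and two blocks (labels are placeholders)
toy = block_scores([["A", "B"], ["B", "A"]], top_k=2)
```

A block ranked first in all 12 samples would reach 96 of 96 points, that is, 100%.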
Figure 3. Generality of the models for predicting expression classes with all methylation features based on the large data set. Each row corresponds to a sample from which the model was trained, and each column represents a sample to which the model was applied and on which the evaluation measure was computed. The training and testing transcripts were disjoint regardless of whether the testing sample was the same as or different from the training sample.
Figure 4. Relationship between methylation and differential expression in cancer. (A) Accuracy of the models for inferring differential expression class, involving all β-value features, based on the large data set with an interclass gap percentage of 80%. Each bar represents the AUROC values of the three pairs of samples in the group. (B) Overlap of DMRs identified using only WGBS data, only oxWGBS data, only 5mC levels, or only 5hmC levels, for liver (left) and lung (right) samples using metilene. The lower plots are zoomed-in views of the bottom parts of the upper plots.