Yuta Suzuki, Yunhao Wang, Kin Fai Au, Shinichi Morishita.
Abstract
We address the problem of observing personal diploid methylomes, CpG methylome pairs of homologous chromosomes that are distinguishable with respect to phased heterozygous variants (PHVs), which is challenging due to the scarcity of PHVs in personal genomes. Single molecule real-time (SMRT) sequencing is promising as it outputs long reads with CpG methylation information, but a serious concern is whether reliable PHVs are available in erroneous SMRT reads with an error rate of ∼15%. To overcome the issue, we propose a statistical model that reduces the error rate of phasing CpG sites to 1%, thereby calling CpG hypomethylation in each haplotype with >90% precision and sensitivity. Using our statistical model, we examined the GNAS complex locus, known for a combination of maternally, paternally, or biallelically expressed isoforms, and observed allele-specific methylation patterns that almost perfectly reflect their respective allele-specific expression status, demonstrating the merit of elucidating comprehensive personal diploid methylomes and transcriptomes.
Keywords: DNA methylation; allele-specific analysis; gene expression; single molecule real-time sequencing; statistical methods
Year: 2018 PMID: 30235838 PMCID: PMC6162384 DOI: 10.3390/genes9090460
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Outline of the proposed method to detect allele-specific methylation (ASM). (a) In our method, we assume that haplotype information for the genome is available, i.e., phased heterozygous variants (PHVs) exist (horizontal line in the middle with letters indicating PHVs), which serve as sites of interest (sites shaded in green) for the allele assignment process. Available heterozygous variants found in reads are indicated by letters (A, C, G, or T) at the shaded sites. Other mismatches and insertions/deletions (indels) (shown as letters and yellow blocks) in reads can be assumed to be sequencing errors and therefore ignored. After assigning PacBio reads to either allele (haplotype) by PHVs on the reads, the methylation status of each allele is predicted using (average) kinetics data obtained in the PacBio sequencing process (shaded in blue). Of note, if ASM is not present in the region, wrong assignment of reads does not affect the accuracy of the methylation call; (b) Outline of the detection of allele-specific expression (ASE). Only exonic PHVs (two of three in this figure) can be used to distinguish two alleles. Next, ASE can be detected as an imbalance of alleles observed in reads. (c) The probability of allele assignment error depends on the number of available PHVs in the reads (x-axis). The probability was calculated using an equation presented in the text and is shown in logarithmic scale on the y-axis; (d) prediction performance (sensitivity and precision) of the method for assessing the perturbed inter-pulse duration (IPD) ratio (purple line). For comparison, typical performance statistics for the original IPD are shown (green line). “” indicates that the IPD was perturbed to simulate 1% read assignment error, as described in the text; (e) Example of a region exhibiting ASM in diploid methylomes and transcriptomes. On the right: two CpG islands (CGIs) are shown in the middle; one allele (labeled A) is methylated and the other (B) unmethylated.
Each CGI overlaps with the promoter regions of distinct isoforms of the known imprinted gene ZNF331. Bisulfite sequencing data in the bottom track exhibited intermediate-level methylation for the two CGIs showing ASM. From top to bottom, the panel shows the following features: gene structure; alignments of long RNA-seq (Iso-seq) reads; RNA-seq read counts for the two alleles, which indicate ASE; sites of PHVs available in this personal genome (black marks), which were used to determine the allelic origins of the sequencing reads; annotated CGIs (green rectangles); methylation levels of the CpG sites of the two alleles that were predicted using single-molecule real-time (SMRT) reads (black and gray bars, respectively; bars pointing in the positive and negative directions indicate methylated and unmethylated sites, respectively); and publicly available data on methylation levels via bisulfite sequencing (orange bars).
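Panel (c) relates the allele assignment error to the number of PHVs on a read. The equation used by the authors is not reproduced in this record, so the following is only an illustrative sketch under a simple majority-vote assumption: each PHV on a read is misread as the other allele independently with probability e, and the read is misassigned when more than half of its PHVs are wrong (a tie counts as half an error). The function name `assignment_error` and the default e = 0.15 (the raw SMRT error rate quoted in the abstract, used here as a pessimistic upper bound on allele flips) are assumptions, not the authors' model.

```python
from math import comb


def assignment_error(n_phv: int, e: float = 0.15) -> float:
    """Probability that a majority vote over n_phv sites picks the wrong allele.

    Assumption (not from the paper): sites err independently with probability e.
    """
    if n_phv < 1:
        raise ValueError("need at least one PHV on the read")
    total = 0.0
    for k in range(n_phv + 1):  # k = number of PHVs read as the wrong allele
        p_k = comb(n_phv, k) * e**k * (1 - e) ** (n_phv - k)
        if 2 * k > n_phv:       # wrong allele wins the vote
            total += p_k
        elif 2 * k == n_phv:    # tie: broken at random
            total += 0.5 * p_k
    return total


if __name__ == "__main__":
    for n in (1, 3, 5, 9):
        print(n, assignment_error(n))
```

Under these assumptions the error falls roughly geometrically as more PHVs become available, which is consistent with the logarithmic y-axis described for panel (c).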
Figure 2. (a,b) Proportions of CGIs located within a distance in the x-axis from the nearest genomic feature, common single nucleotide polymorphisms (SNPs) (green) and heterozygous single nucleotide variants (hetSNVs, or PHVs) (purple), in each genome of (a) AK1 and (b) HG002. Common SNPs and PHVs were distributed differently in both personal genomes. PHVs were essential in determining the proportions of CGIs; (c) distribution of PHVs with respect to exons. The left pie chart shows the proportion of exons containing PHVs for which the ASE status can be assessed directly. The right pie chart shows the ratios of PHVs in exonic (blue), intronic (pale blue), or intergenic (gray) regions, thus classifying PHVs into three categories; (d) example showing personal diploid methylomes and transcriptomes in the AK1 genome. The CGI in the bidirectional promoter region (area shaded in blue) of the ZNF597 and NAA60 genes showed ASM. The RNA-seq reads (both long and short) support that transcription was only derived from allele A, which is the unmethylated allele in the region; (e) Personal diploid methylomes around the GNAS complex locus in the AK1 genome. The four regions are colored to show their known transcriptional pattern: maternally expressed (blue), paternally expressed (green), or expressed from both alleles (purple). Correspondingly, these regions shaded with different colors exhibited distinct methylation patterns. Of note, the ASM regions exhibited an intermediate level of methylation according to bisulfite sequencing (bottom). RNA-seq reads suggested the expression of Gs from both alleles.
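The captions describe ASE as an imbalance of alleles observed in reads covering exonic PHVs. This record does not say which test the authors applied, so the sketch below only assumes one common choice: an exact two-sided binomial test against a 50:50 allele ratio. The function name `ase_binomial_pvalue` and the read counts in the example are hypothetical.

```python
from math import comb


def ase_binomial_pvalue(count_a: int, count_b: int) -> float:
    """Exact two-sided binomial p-value for H0: both alleles equally expressed.

    Assumption (not from the paper): reads are independent draws with p = 0.5.
    """
    n = count_a + count_b
    if n == 0:
        return 1.0  # no reads over the PHV, nothing to test
    probs = [comb(n, k) * 0.5**n for k in range(n + 1)]
    observed = probs[count_a]
    # Sum the probabilities of all outcomes at least as unlikely as the observed one.
    tail = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, tail)


if __name__ == "__main__":
    # Hypothetical counts: 10 reads carry allele A, none carry allele B.
    print(ase_binomial_pvalue(10, 0))
```

A small p-value would flag a site such as the ZNF597/NAA60 example in panel (d), where reads derive from one allele only; in practice, a correction for multiple testing across PHVs would also be needed.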
Figure 3. (a) Summary of the methylation scores (for each allele) for CGIs in personal diploid methylomes in HG002. Each CGI is shown as a circle. On the two opposite corners (top left and bottom right), CGIs with the top 1% absolute differences in methylation levels between the two alleles were provisionally classified as ASM CGIs. Red circles: The corresponding CGIs were separated from the nearest PHV by 1000 bp or more. Blue circles: Separation <1000 bp; (b) distribution of each type of CGI, ASM (black bar) or non-ASM (white bar), with respect to the functional annotation of genomic regions; (c) example showing personal diploid methylomes in the MEST gene-coding region of the HG002 (Ashkenazim Trio Son) genome. Although the upstream CGI (with 66 CpG sites) was unmethylated in both alleles, the larger downstream CGI (with 184 CpG sites) exhibited ASM. The CGIs corresponded to the promoter regions of different isoforms of the gene; (d) another example of ASM around the imprinted gene PEG13 (paternally expressed gene 13).
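The provisional classification in panel (a), which takes the CGIs with the top 1% absolute differences in methylation level between the two alleles, can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `flag_asm_cgis`, the rounding rule for the cutoff, and the toy scores are assumptions.

```python
def flag_asm_cgis(scores_a, scores_b, top_fraction=0.01):
    """Return indices of CGIs whose |score_a - score_b| is in the top fraction.

    scores_a / scores_b: per-CGI methylation scores for alleles A and B.
    Assumption: at least one CGI is flagged whenever the input is non-empty.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both alleles must have one score per CGI")
    diffs = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    if not diffs:
        return []
    n_top = max(1, int(len(diffs) * top_fraction))
    # Rank CGIs by decreasing allelic difference and keep the n_top largest.
    ranked = sorted(range(len(diffs)), key=lambda i: diffs[i], reverse=True)
    return sorted(ranked[:n_top])


if __name__ == "__main__":
    # Toy data: 99 CGIs with matching alleles and one with opposite scores.
    a = [0.2] * 99 + [1.0]
    b = [0.2] * 99 + [-1.0]
    print(flag_asm_cgis(a, b))
```

A rank-based cutoff like this always flags a fixed share of CGIs, which is why the caption calls the classification provisional.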