
RLM: Fast and simplified extraction of Read-Level Methylation metrics from bisulfite sequencing data.

Sara Hetzel, Pay Giesselmann, Knut Reinert, Alexander Meissner, Helene Kretzmer.

Abstract

Bisulfite sequencing data offer value beyond straightforward methylation assessment when single-read patterns are analyzed. Over the past years, various informative metrics have been established to explore this information. However, limited compatibility with alignment tools, reference genomes or the measurements they provide presents a bottleneck that keeps most groups from including this information in standard analyses. To address this, we developed RLM, a fast and scalable tool for the computation of frequently used Read-Level Methylation statistics. RLM supports several common alignment tools, works independently of the reference genome and handles all frequently used sequencing experiment designs. RLM can process large input files with a billion reads in just a few hours on common workstations.
AVAILABILITY: https://github.com/sarahet/RLM.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2021. Published by Oxford University Press.


Year:  2021        PMID: 34601556      PMCID: PMC8686677          DOI: 10.1093/bioinformatics/btab663

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Bisulfite sequencing experiments are the gold standard for measuring DNA methylation at single-CpG resolution (Frommer et al., 1992). Besides the average CpG methylation level across a cell population, each read contains information about the methylation of a single molecule present within a cell (Scherer et al.). This information can be used to analyze the heterogeneity of a cell population or to compare samples from different conditions, such as healthy and tumor tissue. Different metrics have been established to quantify heterogeneity within and across cells based on single-read methylation patterns. Measurements of population heterogeneity include methylation entropy and epipolymorphism. Both are based on so-called epialleles, the patterns of methylated and unmethylated CpGs that can occur in a 4-mer of consecutive CpGs spanned by the same reads (2^4 = 16 epialleles are possible for a single 4-mer) (Landan et al., 2012; Xie et al., 2011). Additionally, a single read can be classified as concordant (all CpGs on the read share the same methylation state) or discordant (the read carries both methylated and unmethylated CpGs). This classification can then be aggregated across reads as the percent of discordant reads (PDR), thus again measuring the heterogeneity across different cells (Landau et al.). Similarly, the number of transitions between methylated and unmethylated CpGs on a read can be aggregated per CpG to determine the level of heterogeneity (read transition score or RTS) (Charlton et al., 2018). So far, studies that have analyzed read-level DNA methylation heterogeneity have often relied on custom scripts for this purpose (Landan et al., 2012; Landau et al.; Xie et al., 2011). Only a few tools implement one or more of these metrics, and their usability is limited because they require a specific alignment tool, a specific reference genome or additional input such as the exact positions of the CpGs of interest (He et al.; Scherer et al.; Scott et al., 2020). Here, we present RLM, a fast and scalable tool that implements established and frequently used inter- and intramolecular metrics of DNA methylation at the read level from bisulfite sequencing experiments.
RLM is applicable to any reference genome and a wide range of library protocols, and it works with input alignment files from multiple commonly used alignment tools. Additionally, it automatically accounts for potential errors and biases caused by sequencing artifacts, mapping quality and overlapping read pairs.
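
The epiallele-based metrics mentioned above can be made concrete with a small sketch. It is not taken from RLM. It assumes the commonly used definitions: methylation entropy as the Shannon entropy of the epiallele frequencies divided by the number of CpGs in the window (four), and epipolymorphism as one minus the sum of squared epiallele frequencies. The function name epiallele_metrics and the '0'/'1' string encoding are hypothetical, and the exact normalization used by RLM should be checked in its documentation.

```python
import math
from collections import Counter


def epiallele_metrics(patterns):
    """Return (entropy, epipolymorphism) for one 4-mer of consecutive CpGs.

    patterns: one 4-character string per read spanning the 4-mer, using a
    hypothetical encoding where '1' = methylated and '0' = unmethylated.
    """
    patterns = list(patterns)
    if not patterns:
        raise ValueError("at least one read is required")
    for p in patterns:
        if len(p) != 4 or set(p) - {"0", "1"}:
            raise ValueError(f"invalid epiallele pattern: {p!r}")

    # Frequency of each observed epiallele (at most 2^4 = 16 distinct ones).
    counts = Counter(patterns)
    total = len(patterns)
    freqs = [n / total for n in counts.values()]

    # Assumed definition: Shannon entropy in bits, divided by the 4 CpGs
    # of the window so that the value lies between 0 and 1.
    entropy = sum(-p * math.log2(p) for p in freqs) / 4

    # Assumed definition: probability that two randomly drawn reads
    # carry different epialleles.
    epipolymorphism = 1.0 - sum(p * p for p in freqs)
    return entropy, epipolymorphism
```

Under these assumed definitions, a 4-mer covered only by identical patterns yields 0 for both metrics, whereas a uniform mixture of all 16 epialleles yields the maximum of each.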

2 Implementation and features

RLM is a standalone C++ tool implemented using SeqAn (Reinert et al., 2017). As input, RLM accepts BAM or SAM files from the common bisulfite alignment tools BSMAP, Bismark, segemehl and GEM (Krueger and Andrews, 2011; Marco-Sola et al., 2012; Otto et al., 2012; Xi and Li, 2009). RLM can run with single- or paired-end sequencing data from different protocols such as whole genome bisulfite sequencing, target enrichment approaches and reduced representation bisulfite sequencing (RRBS). To account for the artificially introduced nucleotides in RRBS experiments, CpGs at the end of the first read (and the beginning of the second read) can be omitted from the analysis. Generally, reads that do not represent a primary alignment, are polymerase chain reaction (PCR) duplicates, fail quality control or do not pass a user-defined mapping quality filter are discarded. For single-end input, RLM streams across the records of the input file and extracts the methylation status of each CpG based on the corresponding reference genome and the BAM file tag that represents the origin and mapping orientation of each read (e.g. the ‘ZS’ tag for BSMAP). Reads with a sequencing error at the position of a CpG, reads containing indels and reads that span fewer than three CpGs are discarded.
For paired-end sequencing experiments, reads are filtered analogously to single-end experiments. Both mates are processed independently unless they overlap. Overlapping mates are merged and processed as a single, contiguous read, which maximizes the information that can be extracted by downstream measurements that depend on four consecutive CpGs, such as entropy. Merging also avoids counting the same genomic fragment twice, which would bias the analysis of cell population heterogeneity. To enable this, each read is kept in memory until its mate has been read and is removed from memory afterwards. If one mate has to be excluded from the analysis, the other mate is processed independently to improve coverage.
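
The read-level filters described above can be sketched as a single predicate over fields of a SAM record. This is a minimal illustration, not RLM's implementation: the function keep_read, its parameters and the default mapping quality threshold of 30 are hypothetical. The flag bits are those defined in the SAM specification. Treating supplementary alignments as non-primary is an assumption, and the check for sequencing errors at CpG positions is omitted because it requires the reference sequence.

```python
# Flag bits as defined in the SAM specification.
FLAG_SECONDARY = 0x100
FLAG_QC_FAIL = 0x200
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800  # assumption: also counted as non-primary

EXCLUDED_FLAGS = (
    FLAG_SECONDARY | FLAG_QC_FAIL | FLAG_DUPLICATE | FLAG_SUPPLEMENTARY
)


def keep_read(flag, mapq, cigar, n_cpgs, min_mapq=30, min_cpgs=3):
    """Return True if a read passes the filters sketched here.

    flag, mapq, cigar: the FLAG, MAPQ and CIGAR fields of a SAM record.
    n_cpgs: number of CpGs spanned by the read (computed elsewhere).
    min_mapq: hypothetical default for the user-defined quality filter.
    """
    if flag & EXCLUDED_FLAGS:
        return False  # non-primary, duplicate or failed quality control
    if mapq < min_mapq:
        return False  # below the user-defined mapping quality
    if "I" in cigar or "D" in cigar:
        return False  # read contains an insertion or deletion
    return n_cpgs >= min_cpgs  # needs at least three CpGs
```

For example, keep_read(0, 60, "100M", 5) accepts a clean primary alignment, while the same read with the duplicate bit set (flag 0x400) is rejected.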
Depending on the research question of interest, RLM offers multiple outputs that can be requested separately or all at once by the user.
Single-read information: For every read with at least three CpGs, the methylation status of each CpG, the average methylation, the transition score and the discordance are reported. The transition score is defined as the number of transitions between methylated and unmethylated CpGs divided by the number of possible transitions. This file is a mandatory output because the information collected here is used for the other output files.
RTS and PDR: For every CpG spanned by a user-defined minimum number of reads, the RTS and PDR across all reads spanning the CpG are reported. Additionally, the corresponding methylation rate based on the reads considered for the read-level measurements is provided.
Entropy and epipolymorphism: For every 4-mer of consecutive CpGs spanned by a user-defined minimum number of reads, the entropy and epipolymorphism are calculated and reported together with the average methylation rate of the complete 4-mer based on the reads considered for read-level metrics. Additionally, the frequencies of all epialleles underlying the respective metrics are provided.
To complement the reported scores, RLM ships a standalone R script that provides users with a report including summary statistics and figures. Additionally, the documentation contains guidelines for its interpretation and use cases. The runtime of RLM scales linearly with the input size in both paired- and single-end modes. Large datasets with up to one billion reads can be processed in a few hours with modest memory requirements (a detailed performance analysis and a comparison with other existing tools are given in the Supplementary Data).
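
The single-read and per-CpG outputs described above can be illustrated with a short sketch. It is not RLM's code. The function names, the data layout and the default of 10 reads are hypothetical. The transition score follows the definition given in the text, and a read is treated as discordant when it carries both methylated and unmethylated CpGs. How the RTS is aggregated per CpG is an assumption here: the sketch takes the mean transition score of all reads spanning the CpG.

```python
from collections import defaultdict


def read_stats(calls):
    """Per-read metrics for a list of booleans (True = methylated CpG)."""
    if len(calls) < 3:
        raise ValueError("a read must span at least three CpGs")
    n_meth = sum(calls)
    # Count changes of methylation state between neighboring CpGs.
    transitions = sum(a != b for a, b in zip(calls, calls[1:]))
    return {
        "mean_methylation": n_meth / len(calls),
        "transition_score": transitions / (len(calls) - 1),
        "discordant": 0 < n_meth < len(calls),
    }


def per_cpg_scores(reads, min_reads=10):
    """Aggregate read metrics per CpG position.

    reads: iterable of (positions, calls) pairs, one per read, where
    positions are CpG coordinates and calls the matching booleans.
    min_reads: hypothetical default for the user-defined coverage cutoff.
    """
    # Per position: [reads, discordant reads, transition score sum, methylated]
    acc = defaultdict(lambda: [0, 0, 0.0, 0])
    for positions, calls in reads:
        stats = read_stats(calls)
        for pos, call in zip(positions, calls):
            entry = acc[pos]
            entry[0] += 1
            entry[1] += stats["discordant"]
            entry[2] += stats["transition_score"]
            entry[3] += call
    return {
        pos: {
            "pdr": disc / n,
            "rts": ts_sum / n,  # assumed aggregation: mean over reads
            "methylation": meth / n,
        }
        for pos, (n, disc, ts_sum, meth) in acc.items()
        if n >= min_reads
    }
```

Computing the methylation rate from the same reads that enter the PDR and RTS mirrors the statement above that the reported rate is based on the reads considered for the read-level measurements.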
References (13 in total)

1.  A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands.

Authors:  M Frommer; L E McDonald; D S Millar; C M Collis; F Watt; G W Grigg; P L Molloy; C L Paul
Journal:  Proc Natl Acad Sci U S A       Date:  1992-03-01       Impact factor: 11.205

2.  Fast and sensitive mapping of bisulfite-treated sequencing data.

Authors:  Christian Otto; Peter F Stadler; Steve Hoffmann
Journal:  Bioinformatics       Date:  2012-05-10       Impact factor: 6.937

3.  The GEM mapper: fast, accurate and versatile alignment by filtration.

Authors:  Santiago Marco-Sola; Michael Sammeth; Roderic Guigó; Paolo Ribeca
Journal:  Nat Methods       Date:  2012-10-28       Impact factor: 28.547

4.  The SeqAn C++ template library for efficient sequence analysis: A resource for programmers.

Authors:  Knut Reinert; Temesgen Hailemariam Dadi; Marcel Ehrhardt; Hannes Hauswedell; Svenja Mehringer; René Rahn; Jongkyu Kim; Christopher Pockrandt; Jörg Winkler; Enrico Siragusa; Gianvito Urgese; David Weese
Journal:  J Biotechnol       Date:  2017-09-06       Impact factor: 3.307

5.  Epigenetic polymorphism and the stochastic formation of differentially methylated regions in normal and cancerous tissues.

Authors:  Gilad Landan; Netta Mendelson Cohen; Zohar Mukamel; Amir Bar; Alina Molchadsky; Ran Brosh; Shirley Horn-Saban; Daniela Amann Zalcenstein; Naomi Goldfinger; Adi Zundelevich; Einav Nili Gal-Yam; Varda Rotter; Amos Tanay
Journal:  Nat Genet       Date:  2012-10-14       Impact factor: 38.330

6.  Genome-wide quantitative assessment of variation in DNA methylation patterns.

Authors:  Hehuang Xie; Min Wang; Alexandre de Andrade; Maria de F Bonaldo; Vasil Galat; Kelly Arndt; Veena Rajaram; Stewart Goldman; Tadanori Tomita; Marcelo B Soares
Journal:  Nucleic Acids Res       Date:  2011-01-28       Impact factor: 16.971

7.  Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications.

Authors:  Felix Krueger; Simon R Andrews
Journal:  Bioinformatics       Date:  2011-04-14       Impact factor: 6.937

8.  Global delay in nascent strand DNA methylation.

Authors:  Jocelyn Charlton; Timothy L Downing; Zachary D Smith; Hongcang Gu; Kendell Clement; Ramona Pop; Veronika Akopian; Sven Klages; David P Santos; Alexander M Tsankov; Bernd Timmermann; Michael J Ziller; Evangelos Kiskinis; Andreas Gnirke; Alexander Meissner
Journal:  Nat Struct Mol Biol       Date:  2018-03-12       Impact factor: 15.369

9.  Identification of cell type-specific methylation signals in bulk whole genome bisulfite sequencing data.

Authors:  C Anthony Scott; Jack D Duryea; Harry MacKay; Maria S Baker; Eleonora Laritsky; Chathura J Gunasekara; Cristian Coarfa; Robert A Waterland
Journal:  Genome Biol       Date:  2020-07-01       Impact factor: 13.583

10.  BSMAP: whole genome bisulfite sequence MAPping program.

Authors:  Yuanxin Xi; Wei Li
Journal:  BMC Bioinformatics       Date:  2009-07-27       Impact factor: 3.169

