Literature DB >> 22155865

B-SOLANA: an approach for the analysis of two-base encoding bisulfite sequencing data.

Benjamin Kreck1, George Marnellos, Julia Richter, Felix Krueger, Reiner Siebert, Andre Franke.   

Abstract

SUMMARY: Bisulfite sequencing, a combination of bisulfite treatment and high-throughput sequencing, has proved to be a valuable method for measuring DNA methylation at single base resolution. Here, we present B-SOLANA, an approach for the analysis of two-base encoding (colorspace) bisulfite sequencing data on the SOLiD platform of Life Technologies. It includes the alignment of bisulfite sequences and the determination of methylation levels in CpG as well as non-CpG sequence contexts. B-SOLANA enables a fast and accurate analysis of large raw sequence datasets.
AVAILABILITY AND IMPLEMENTATION: The source code, released under the GNU GPLv3 licence, is freely available at http://code.google.com/p/bsolana/. CONTACT: b.kreck@ikmb.uni-kiel.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities:  

Mesh:

Substances:

Year:  2011        PMID: 22155865      PMCID: PMC3268236          DOI: 10.1093/bioinformatics/btr660

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Methylation at position 5 of cytosines is a major epigenetic modification, which has an important impact on transcriptional and regulatory processes of DNA (Holliday, 1975). It is a stable modification of the genome which can be inherited from one generation to the next, even though it can also be dynamically changed by environmental influences. There are several methods based on high-throughput sequencing, such as methylated DNA immunoprecipitation sequencing (MeDIP-seq), methylated DNA capture by affinity purification (MethylCap-seq) and BS-Seq, which can provide good-quality genome-wide DNA methylation data (Bock, 2010). Methods that currently provide genome-wide methylation patterns at single base resolution make use of bisulfite conversion and high-throughput sequencing. The treatment of DNA with sodium bisulfite has no effect on methylated cytosines, but it specifically converts unmethylated cytosines to uracils, which are converted to thymines during subsequent polymerase chain reaction amplification. As a result of bisulfite conversion, the Watson and Crick strands of bisulfite-treated DNA are no longer complementary to each other, they become essentially different genomes. This fact leads to an enlarged alignment reference space. The prevalence of T's that have replaced C's leads to reduced complexity in bisulfite sequences, which increases the bioinformatics challenge of BS-Seq analysis. Bioinformatics tools for BS-Seq have generally fallen into two categories: (i) methylation-aware alignment tools, which consider cytosines and thymines as potential matches to genomic cytosine positions and (ii) tools which convert any residual cytosines in bisulfite sequences and all cytosines of the reference genomes into thymines.

2 COLORSPACE BISULFITE SEQUENCING

Due to the two-base encoding of SOLiD sequencing, in silico conversions of any residual bisulfite read cytosines into thymines, which can be carried out in basespace data to avoid bisulfite-mismatches during alignment, cannot be performed on bisulfite colorspace sequences, because sequencing errors would lead to the incorrect translation of colorspace to basespace (Supplementary Fig. 1). There are ways to align bisulfite colorspace sequences with methylation-aware alignment approaches, which convert bisulfite colorspace sequences to basespace and index all theoretically possible alignments by creating a hash table. Such an approach is implemented in SOCS-B, which is based on the iterative version of the Rabin–Karp algorithm (Ondov, 2010). Even though SOCS-B turns out to be an accurate tool for the analysis of colorspace BS-Seq datasets, it becomes very computationally intensive for complex genomes such as the human genome (~ 150 000 CPU hours for the analysis of 500 Million sequences). Therefore, it is not efficient for huge datasets like those produced in genome-wide methylation analyses with average coverage depths ≥10X and genome size ≥1000 MB. Here, we present B-SOLANA, a tool which performs sequence alignment and methylation calling for colorspace bisulfite sequencing. It is based on the established short-read aligner Bowtie (Langmead, 2009) and SAMtools utilities for manipulating alignments (Li, 2009). B-SOLANA is divided into four individual steps: (i) indexing, (ii) mapping, (iii) determination of best alignment and (iv) methylation calling. The idea of B-SOLANA is to use Bowtie to uniquely align bisulfite sequences to two different conversions of the reference genome and determine best alignments from the combined set of results. The analysis of whole methylomes of 23 eukaryotic organisms shows a variable percentage of methylation at CpG dinucleotides, whereas the percentage of methylated CHG and CHH is always lower (Pelizzola, 2010). The approach of B-SOLANA reduces the number of bisulfite-induced mismatches by considering the prevalence of methylated cytosines in their different sequence contexts. In order to identify CpG and non-CpG methylation sites, B-SOLANA aligns bisulfite sequences to two in-silico conversions of the reference genome (Supplementary Fig. 2). In the first modified reference genome, all cytosines in a non-CpG context are converted to thymines (Conversion I). In the second, all cytosines, irrespective of their sequence context, are converted to thymines (Conversion II). After alignment to these converted genomes, B-SOLANA determines the best alignment for each bisulfite sequence in the following way: bisulfite sequences that are aligned to different genomic positions in Conversions I and II are assigned to the position with the lowest number of mismatches. Reads with the same number of mismatches at different positions are ignored. In its final step, B-SOLANA determines methylation levels. B-SOLANA is compatible with 50 bp directional single-end libraries and allows a simple adjustment for the upcoming read lengths. B-SOLANA was designed to generate accurate results for methylomes with a low percentage of methylation in non-CpG sites (<5%). This includes most eukaryotic organisms, with mammalian genomes typically having methylation levels of <3% in CHG and <1% in CHH context (Pelizzola, 2010). A detailed test of B-SOLANA was performed for genomic DNA extracted from a human lymphoma cell line. The library was prepared using a protocol for single-end SOLiD BS-Seq (Bormann Chung, 2010) and sequencing of the bisulfite-converted DNA was performed using SOLiD versions 3 Plus and 4. This generated 2.79 billion bisulfite reads of which ~ 52% were mapped uniquely (genome build hg19/NCBI 37). The methylation results obtained by B-SOLANA were then compared to the Illumina Infinium HumanMethylation450 BeadChip (450 k) assay, an established methylation analysis method, as a quality control (Supplementary Fig. 3). We observed high concordance between the results of the two independent methods (99% of 450 k sites were also represented in the B-SOLANA results) and the methylation levels of cytosines, which were assayed by both methods displayed a very high correlation (Pearson correlation r > 0.93). As a proof of principle, we also generated different simulated datasets, reflecting varying CpG and non-CpG methylation levels. Encouragingly, the results generated by B-SOLANA closely match the expected methylation degrees (Supplementary Table 1). The 485 990 920 SOLiD BS-Seq reads (50 bp), taken from SRR204024 (Hansen, 2011), were analyzed with B-SOLANA and MethylCoder (one mismatch allowed) B-SOLANA exhibits a high correlation with the results of Hansen et al. aIncluding post-processing quality control. bOnly cytosine and thymine base calls are included. A further approach for the analysis of colorspace BS-Seq was published with the tool MethylCoder (Pedersen, 2011). MethylCoder applies a conversion of any residual bisulfite read cytosines into thymines, which leads to erroneous alignments, as discussed above. Therefore, we compared B-SOLANA and MethylCoder (one mismatch allowed) by analyzing 485 990 920 SOLiD BS-Seq reads (50 bp), taken from SRR204024 (Hansen, 2011). We found a high concordance between methylation calls of Hansen et al., analyzed by their yet unpublished and unavailable approach, and B-SOLANA. Moreover, B-SOLANA turns out to have a significantly higher mapping efficiency. As a platform-independent benchmark, we demonstrate that the analysis of colorspace BS-Seq data of the fibroblast cell line IMR90 is comparable to methylome data published by Lister ), who used a BS-Seq approach on the Illumina platform (Supplementary Information 1).

3 CONCLUSIONS

We present an efficient tool for the analysis of large colorspace BS-Seq data. B-SOLANA provides a fast and accurate all-in-one approach, including alignment and methylation calling. It is easy to use and generates an intuitive output, which can be used for genome-wide DNA methylation analysis.
Table 1.

The 485 990 920 SOLiD BS-Seq reads (50 bp), taken from SRR204024 (Hansen, 2011), were analyzed with B-SOLANA and MethylCoder (one mismatch allowed) B-SOLANA exhibits a high correlation with the results of Hansen et al.

Hansen et al.aB−SOLANAMethylCoderb
Uniquely mapped reads (%)37.8349.8419.23
CpG positions: % C69.8472.8367.07
CpG positions: % T30.0326.9732.93
Non-CpG positions: % C0.200.220.69
Non-CpG positions: % T99.7699.7099.31

aIncluding post-processing quality control.

bOnly cytosine and thymine base calls are included.

  10 in total

1.  DNA modification mechanisms and gene activity during development.

Authors:  R Holliday; J E Pugh
Journal:  Science       Date:  1975-01-24       Impact factor: 47.728

2.  MethylCoder: software pipeline for bisulfite-treated sequences.

Authors:  Brent Pedersen; Tzung-Fu Hsieh; Christian Ibarra; Robert L Fischer
Journal:  Bioinformatics       Date:  2011-06-30       Impact factor: 6.937

Review 3.  The DNA methylome.

Authors:  Mattia Pelizzola; Joseph R Ecker
Journal:  FEBS Lett       Date:  2010-11-05       Impact factor: 4.124

4.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

5.  Whole methylome analysis by ultra-deep sequencing using two-base encoding.

Authors:  Christina A Bormann Chung; Victoria L Boyd; Kevin J McKernan; Yutao Fu; Cinna Monighetti; Heather E Peckham; Melissa Barker
Journal:  PLoS One       Date:  2010-02-22       Impact factor: 3.240

6.  An alignment algorithm for bisulfite sequencing using the Applied Biosystems SOLiD System.

Authors:  Brian D Ondov; Charles Cochran; Mark Landers; Gavin D Meredith; Miroslav Dudas; Nicholas H Bergman
Journal:  Bioinformatics       Date:  2010-06-18       Impact factor: 6.937

7.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.

Authors:  Ben Langmead; Cole Trapnell; Mihai Pop; Steven L Salzberg
Journal:  Genome Biol       Date:  2009-03-04       Impact factor: 13.583

8.  Human DNA methylomes at base resolution show widespread epigenomic differences.

Authors:  Ryan Lister; Mattia Pelizzola; Robert H Dowen; R David Hawkins; Gary Hon; Julian Tonti-Filippini; Joseph R Nery; Leonard Lee; Zhen Ye; Que-Minh Ngo; Lee Edsall; Jessica Antosiewicz-Bourget; Ron Stewart; Victor Ruotti; A Harvey Millar; James A Thomson; Bing Ren; Joseph R Ecker
Journal:  Nature       Date:  2009-10-14       Impact factor: 49.962

9.  Quantitative comparison of genome-wide DNA methylation mapping technologies.

Authors:  Christoph Bock; Eleni M Tomazou; Arie B Brinkman; Fabian Müller; Femke Simmer; Hongcang Gu; Natalie Jäger; Andreas Gnirke; Hendrik G Stunnenberg; Alexander Meissner
Journal:  Nat Biotechnol       Date:  2010-09-19       Impact factor: 54.908

10.  Increased methylation variation in epigenetic domains across cancer types.

Authors:  Kasper Daniel Hansen; Winston Timp; Héctor Corrada Bravo; Sarven Sabunciyan; Benjamin Langmead; Oliver G McDonald; Bo Wen; Hao Wu; Yun Liu; Dinh Diep; Eirikur Briem; Kun Zhang; Rafael A Irizarry; Andrew P Feinberg
Journal:  Nat Genet       Date:  2011-06-26       Impact factor: 38.330

  10 in total
  11 in total

Review 1.  Analysing and interpreting DNA methylation data.

Authors:  Christoph Bock
Journal:  Nat Rev Genet       Date:  2012-10       Impact factor: 53.242

2.  Transcriptome-Wide Mapping of RNA 5-Methylcytosine in Arabidopsis mRNAs and Noncoding RNAs.

Authors:  Rakesh David; Alice Burgess; Brian Parker; Jun Li; Kalinya Pulsford; Tennille Sibbritt; Thomas Preiss; Iain Robert Searle
Journal:  Plant Cell       Date:  2017-01-06       Impact factor: 11.277

Review 3.  DNA methylome analysis using short bisulfite sequencing data.

Authors:  Felix Krueger; Benjamin Kreck; Andre Franke; Simon R Andrews
Journal:  Nat Methods       Date:  2012-01-30       Impact factor: 28.547

4.  Evaluation of preprocessing, mapping and postprocessing algorithms for analyzing whole genome bisulfite sequencing data.

Authors:  Junko Tsuji; Zhiping Weng
Journal:  Brief Bioinform       Date:  2015-12-01       Impact factor: 11.622

Review 5.  Analyzing the cancer methylome through targeted bisulfite sequencing.

Authors:  Eun-Joon Lee; Junfeng Luo; James M Wilson; Huidong Shi
Journal:  Cancer Lett       Date:  2012-11-28       Impact factor: 8.679

Review 6.  Integrating Epigenomics into the Understanding of Biomedical Insight.

Authors:  Yixing Han; Ximiao He
Journal:  Bioinform Biol Insights       Date:  2016-12-04

7.  BatMeth: improved mapper for bisulfite sequencing reads on DNA methylation.

Authors:  Jing-Quan Lim; Chandana Tennakoon; Guoliang Li; Eleanor Wong; Yijun Ruan; Chia-Lin Wei; Wing-Kin Sung
Journal:  Genome Biol       Date:  2012-10-03       Impact factor: 13.583

8.  BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions.

Authors:  Kasper D Hansen; Benjamin Langmead; Rafael A Irizarry
Journal:  Genome Biol       Date:  2012-10-03       Impact factor: 13.583

9.  High-throughput sequencing of cytosine methylation in plant DNA.

Authors:  Thomas J Hardcastle
Journal:  Plant Methods       Date:  2013-06-07       Impact factor: 4.993

10.  CRCDA--Comprehensive resources for cancer NGS data analysis.

Authors:  Manonanthini Thangam; Ramesh Kumar Gopal
Journal:  Database (Oxford)       Date:  2015-10-08       Impact factor: 3.451

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.