Literature DB >> 20416082

BS Seeker: precise mapping for bisulfite sequencing.

Pao-Yang Chen¹, Shawn J Cokus, Matteo Pellegrini.

Abstract

BACKGROUND: Bisulfite sequencing using next generation sequencers yields genome-wide measurements of DNA methylation at single nucleotide resolution. Traditional aligners are not designed for mapping bisulfite-treated reads, where the unmethylated Cs are converted to Ts. We have developed BS Seeker, an approach that converts the genome to a three-letter alphabet and uses Bowtie to align bisulfite-treated reads to a reference genome. It uses sequence tags to reduce mapping ambiguity. Post-processing of the alignments removes non-unique and low-quality mappings.
RESULTS: We tested our aligner on synthetic data, a bisulfite-converted Arabidopsis library, and human libraries generated from two different experimental protocols. We evaluated the performance of our approach and compared it to other bisulfite aligners. The results demonstrate that among the aligners tested, BS Seeker is more versatile and faster. When mapping to the human genome, BS Seeker generates alignments significantly faster than RMAP and BSMAP. Furthermore, BS Seeker is the only alignment tool that can explicitly account for tags which are generated by certain library construction protocols.
CONCLUSIONS: BS Seeker provides fast and accurate mapping of bisulfite-converted reads. It can work with BS reads generated from the two different experimental protocols, and is able to efficiently map reads to large mammalian genomes. The Python program is freely available at http://pellegrini.mcdb.ucla.edu/BS_Seeker/BS_Seeker.html.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
Sulfites
hydrogen sulfite

Year: 2010 PMID： 20416082 PMCID： PMC2871274 DOI： 10.1186/1471-2105-11-203

Source DB: PubMed Journal: BMC Bioinformatics ISSN： 1471-2105 Impact factor: 3.169

Background

Epigenetic regulation, such as cytosine (C) DNA methylation, is important in gene regulation and transposon silencing. The gold standard technique for studying DNA methylation is genomic bisulfite sequencing [1]. Sodium bisulfite converts unmethylated Cs to uracils, but 5-methylcytosines remain unconverted. Hence, after PCR amplification, unmethylated Cs are converted to thymines (T) while methylated Cs are unchanged. Recently, Cokus et al. and Lister et al. developed protocols, BS-Seq [2] and MethylC-seq [3], which couple bisulfite sequencing with next generation sequencing and completed a first single nucleotide resolution map of methylation in Arabidopsis. While these approaches open up new avenues for genome-wide measurements of DNA methylation [3-6], aligning millions of bisulfite-treated short reads (BS reads) onto the reference genome remains a challenge. Mapping bisulfite-converted reads leads to ambiguity, since Ts in the read can map to both genomic Cs or Ts. Most alignment tools, such as BLAT [7], SOAP [8], and Bowtie [9] do not explicitly enable bisulfite mapping. Currently, there are only a few aligners explicitly designed for mapping BS reads. CokusAlignment [2] treats each cycle in a read as probabilities of A, C, T, G and uses a suffix tree searching algorithm. However, to date only the Arabidopsis suffix tree has been published. Other newly-developed bisulfite mapping software includes BSMAP [10], RMAP [11] and MAQ [12], which, unlike CokusAlignment, model reads as discrete base calls instead of probability vectors. BSMAP enumerates all possible combinations of C/T conversion in the BS read to find the uniquely mapping position with the least mismatches on the reference genome. It is reported [10] to have a similar sensitivity as CokusAlignment and outperformed the methods described in [3] and [4]. The bisulfite mapping in RMAP uses Wildcard matching for mapping Ts. MAQ also has a methylation alignment mode and it assigns non-unique reads randomly to one of the best-matching positions. Two library protocols have been developed for constructing bisulfite-converted libraries (see Fig 1). Cokus et al's protocol [2] uses two amplification steps: the first amplification generates both forward and reverse bisulfite-converted sequences ligated with DNA adapters of DpnI restriction sites. These sequences are then digested by DpnI restriction enzymes that result in the 5-bp sequence tags on the bisulfite-converted sequences. There are two patterns of tags based on the forward (+FW) and reverse (-FW) directions of the bisulfite-converted sequences. After the ligation with standard Solexa adapters and the second amplification step, four types of bisulfite-converted reads are generated. They are forward and reverse reads from Watson (+FW, +RC) and Crick stands (-FW, -RC), respectively (see Fig 2A). These tags are essential to reduce the ambiguity of certain classes of reads. Unlike BSMAP, MAQ or RMAP, BS Seeker is able to explicitly use the tags to improve mapping. The second experimental protocol (i.e. Lister et al) generates bisulfite libraries using premethylated adapters, and in this case no tags are present and all reads are +FW or -FW.

Figure 1

Figure 2

Schematic diagrams of the 4 forms of BS reads, mapping and post processing. 2A. BS reads may be in one of the 4 forms: +FW, +RC, -FW, -RC. 2B. Bowtie aligns C/T converted reads to the C/T converted strands. During the post processing, the number of mismatches is counted except those between read Ts and genomic Cs. Low-quality mappings with many mismatches are removed.

The two library protocols generating bisulfite-converted reads. Cokus et al's experimental protocol uses two amplification steps for generating bisulfite-converted sequences and for high throughput sequencing. The bisulfite-converted reads are preceded by one of two tags in the first 5 nucleotides of reads. Lister et al's protocol generates bisulfite libraries using premethylated adapters, and in this case no tags are present. BS Seeker uses Bowtie for mapping BS reads generated from either experimental protocol. It maps C/T converted FW reads to the C/T converted reference strands, and G/A converted RC reads to G/A converted, reverse complements of the reference strands. Post-processing removes low-quality mappings based on the number of mismatches. The strategy of bisulfite converting the reference sequence by treating all Cs as Ts (Gs as As) has also been used in [4,6,13], although implementations of this approach have not been published. However, most of these approaches are less precise than the ones we present here as they cannot correctly handle reads that are partially methylated. For the evaluation of our aligner, we use synthetic BS reads to assess the accuracy of mapping. We evaluate the mapping results by comparing the methylation statistics in the synthetic reads and in the mapped sequences. We also use BS Seeker to align an Arabidopsis library and human libraries from the two experimental protocols, which provides us with an evaluation using real data.

Implementation

Depending on the bisulfite experimental protocol that was used to generate the library [2,3], BS reads may be observed in either four or two forms. Cokus et al's protocol generates a forward read (+FW) from the Watson strand, the reverse complement (+RC) of +FW, a forward read (-FW) from the Crick strand, and the reverse complement (-RC) of -FW, see Fig 2A. Lister et al's protocol generates only +FW and -FW reads. We first describe how BS Seeker handles data generated from Cokus et al's protocol. It first converts all Cs to Ts on FW reads and both strands of the reference genome, so that the subsequent mapping is performed using only 3 letters, A, T, G. Similarly, G/A conversion is performed on RC reads and both strands of the reverse complement of the reference genome. Then it uses Bowtie to map the C/T converted FW reads to the C/T converted Watson and Crick strands, and the G/A converted RC reads to the two G/A converted reverse complements of the Watson and Crick strands. Reads that do not have a tag are treated as if they could be both FW and RC reads. During each of the four runs of Bowtie, the mapped positions for each read are recorded. After all the runs of Bowtie are complete, only unique alignments are retained. Here, we define unique alignments as those that have no other hits with - the same or fewer mismatches in the 3-letter alignment (between the converted read and the converted genomic sequence). Finally, we calculate the number of mismatches. For this calculation we consider a read T that aligns to a genomic C as a match, while a read C that aligns to a genomic T is considered a mismatch (see Fig 2B). Similarly, when aligning RC reads, a read A that aligns to a genomic G is considered a match, while a read G that aligns to a genomic A is considered a mismatch. Low-quality alignments with the number of mismatches larger than the user-defined value are discarded. Aligning reads generated from Lister et al's protocol is simpler, since there are no RC reads, and consequently Bowtie is only run twice. Schematic diagrams of the 4 forms of BS reads, mapping and post processing. 2A. BS reads may be in one of the 4 forms: +FW, +RC, -FW, -RC. 2B. Bowtie aligns C/T converted reads to the C/T converted strands. During the post processing, the number of mismatches is counted except those between read Ts and genomic Cs. Low-quality mappings with many mismatches are removed.

Results and Discussion

We first tested BS Seeker against three bisulfite aligners, BSMAP, RMAP and MAQ, by mapping synthetic reads. The sensitivity and the specificity of the aligner's output are assessed by calculating the percentage of reads that mapped uniquely and their accuracy, which is the ratio of the number of correctly mapped reads over the total uniquely mapped reads. We also calculate the inferred average methylation rate in order to discern possible mapping bias from the aligners. We then show our mapability on a lane of experimental data from Arabidopsis. Finally, we mapped the BS reads of human libraries generated from the two experimental protocols, in order to measure the performance of BS-Seeker on different protocols, and compare the speed of alignment to RMAP.

Mapping synthetic reads

We simulated 36-mer BS reads from human chromosome 21. One million contained no base calling errors and another one million had base calling errors that follow the error distribution from [14] (see Supplementary Information in Additional file 1 for details). These were mapped to chromosome 21 using all four aligners. The simulated data were generated using both protocols. As shown in Table 1 and Supplementary Table S1 in Additional file 1, we found that in all cases BSMAP was significantly slower than the other aligners. MAQ's strategy, randomly assigning one of the best-matching positions for non-unique reads, results in a lower accuracy and biased estimates of the methylation rates. When the simulated reads are based on Lister et al's protocol with no tags, RMAP has a slight speed advantage over BS Seeker, but the two methods are otherwise quite similar. However, when the Cokus et al library protocol is used, BS Seeker had a higher accuracy and shorter run time than RMAP, since it is able to explicitly account for the FW/RC tags. Furthermore, the methylation rate inferred from the mapped reads using BS Seeker is very close to that of the initial synthetic data, indicating that the alignments are less biased than those of RMAP. These results suggest that BS Seeker is able to work with data from both protocols. By explicitly using tag information, it is able to optimally align reads generated by the Cokus et al. protocol.

Table 1

Mapping 1M synthetic human chr. 21 reads onto human chr. 21

Aligner	Experimental protocol	Uniquely (or best ^a) mapped reads ^b(%)	Accuracy (%)	Methylation rates ^c(CG/CHG/CHH) (%)			CPU time
No base calling error

BS Seeker	Lister et al	91.7	100	72.0	0	0	209 s ^d

BSMAP	Lister et al	92.1	100	72.3	0	0	15h43m20s
RMAP	Lister et al	91.7	100	72.0	0	0	185 s
MAQ	Lister et al	>99.9	93.4	67.7	0	0	353 s

BS Seeker	Cokus et al	89.6	100	72.0	0	0	263 s ^d
BSMAP	Cokus et al	89.8	99.6	72.4	0	0	15 h 46 m 40 s
RMAP	Cokus et al	80.2	99.0	71.3	0	0.1	400 s
MAQ	Cokus et al	73.0	92.2	68.2	0	0	665 s

Simulated base calling errors

BS Seeker	Lister et al	91.2	99.54	71.5	0.4	0.4	217 s
BSMAP	Lister et al	91.1	99.57	72.1	0.4	0.4	15 h 19 m 51 s
RMAP	Lister et al	91.0	99.52	71.6	0.4	0.4	188 s
MAQ	Lister et al	99.5	92.9	67.8	0.4	0.4	340 s

a BS Seeker, BSMAP, and RMAP discard non-uniquely mapped reads. MAQ keeps non-uniquely mapped reads and assigns them to one of the best-matching positions. b Up to 2 mismatches are allowed. c The simulated methylation rates are set to be 72%, 0%, and 0% for CG, CHG, and CHH. d Preprocessing reference genome needs 2-5 cpu minutes for the first run.

Mapping 1M synthetic human chr. 21 reads onto human chr. 21 a BS Seeker, BSMAP, and RMAP discard non-uniquely mapped reads. MAQ keeps non-uniquely mapped reads and assigns them to one of the best-matching positions. b Up to 2 mismatches are allowed. c The simulated methylation rates are set to be 72%, 0%, and 0% for CG, CHG, and CHH. d Preprocessing reference genome needs 2-5 cpu minutes for the first run.

Mapping BS reads from Arabidopsis

We mapped 2,946,339 BS reads of a single lane of an Arabidopsis library from Cokus, et al., (2008), which is the same library tested in Xi and Li, (2009). BS Seeker is able to uniquely map 56.3% of the reads, which compares favorably to the coverage reported in Xi et al. The methylation rates inferred from the mapped reads for CG, CHG and CHH (H stands for A or C or T) are 25.5%, 8%, and 2.2% respectively, which are very close to the published results [2], i.e, 24% CG, 6.7% CHG and 1.7% CHH methylation.

Mapping BS reads from Human libraries

In order to test the performance of BS Seeker on large genomes, we used it to map BS reads from human libraries generated from the two protocols [6]. The mapabilities of reads from Cokus et al's protocol and from Lister et al's protocol are very close (38.3% and 38.6% respectively). The CG methylation rate we obtained from the mapping of reads from Lister et al's protocol is 82.3%, which coincides with the published result of 82.7% [6] (see Supplementary Information in Additional file 1 for the distributions of methylation levels). We also used RMAP to align the same human libraries. The running time for RMAP to map one lane of reads is between 11.9-13.7 hours, while for BS Seeker it is between 20-50 minutes. Both aligners showed close mapability, see Table 2. The significant advantage on the mapping efficiency suggests that BS Seeker is the most efficient bisulfite aligner for mapping large genomes.

Table 2

Comparison of mapping efficiency on mapping three lanes of human data

Reads file [6]	Number of reads	Uniquely mapped reads (%)		Running time

		BS Seeker	RMAP	BS Seeker	RMAP
SRR019048	15,311,970	40.8	41.0	50 mins	13.70 hours
SRR019501	7,217,883	52.0	52.5	26 mins	11.94 hours
SRR019597	5,943,586	62.0	62.1	20 mins	13.45 hours

Comparison of mapping efficiency on mapping three lanes of human data

Conclusions

We present a tool for mapping BS reads, the BS Seeker. It is simple to operate, and achieves high accuracy and coverage. We have made this tool publicly available for the community.

Availability and requirements

• Project name: BS Seeker • Project home page: http://pellegrini.mcdb.ucla.edu/BS_Seeker/BS_Seeker.html • Operating system(s): Linux, Mac OS • Programming language: Python • Other requirements: Python 2.5.2 or higher, Bowtie 0.10.0 or higher • License: Free for all users • Any restrictions to use by non-academics: None, the software is readily available to any scientist wishing to use it for non-commercial purposes. • The software (source code) and the examples are attached in Additional file 2 and Additional file 3, respectively. It can also be downloaded via the project home page.

Authors' contributions

PC wrote BS Seeker and drafted the manuscript. SC and MP participated in the design and helped to draft the manuscript. All authors read and approved the final manuscript.

Additional file 1

Supplementary Information; supplementary materials to the BS Seeker project. Click here for file

Additional file 2

BS Seeker software: a compressed file containing the source code for BS Seeker. Click here for file

Additional file 3

Examples for running BS Seeker. This compressed file includes two example datasets (reads and the reference genome) for testing BS Seeker. A step by step instruction is included in Readme and can be also viewed from the project webpage at http://pellegrini.mcdb.ucla.edu/BS_Seeker/EXAMPLES.html. Click here for file

14 in total

1. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

2. A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands.

Authors: M Frommer; L E McDonald; D S Millar; C M Collis; F Watt; G W Grigg; P L Molloy; C L Paul
Journal: Proc Natl Acad Sci U S A Date: 1992-03-01 Impact factor: 11.205

3. Mapping short DNA sequencing reads and calling variants using mapping quality scores.

Authors: Heng Li; Jue Ruan; Richard Durbin
Journal: Genome Res Date: 2008-08-19 Impact factor: 9.043

4. High-throughput bisulfite sequencing in mammalian genomes.

Authors: Zachary D Smith; Hongcang Gu; Christoph Bock; Andreas Gnirke; Alexander Meissner
Journal: Methods Date: 2009-05-12 Impact factor: 3.608

5. Genome-scale DNA methylation maps of pluripotent and differentiated cells.

Authors: Alexander Meissner; Tarjei S Mikkelsen; Hongcang Gu; Marius Wernig; Jacob Hanna; Andrey Sivachenko; Xiaolan Zhang; Bradley E Bernstein; Chad Nusbaum; David B Jaffe; Andreas Gnirke; Rudolf Jaenisch; Eric S Lander
Journal: Nature Date: 2008-07-06 Impact factor: 49.962

6. Highly integrated single-base resolution maps of the epigenome in Arabidopsis.

Authors: Ryan Lister; Ronan C O'Malley; Julian Tonti-Filippini; Brian D Gregory; Charles C Berry; A Harvey Millar; Joseph R Ecker
Journal: Cell Date: 2008-05-02 Impact factor: 41.582

7. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.

Authors: Ben Langmead; Cole Trapnell; Mihai Pop; Steven L Salzberg
Journal: Genome Biol Date: 2009-03-04 Impact factor: 13.583

8. Human DNA methylomes at base resolution show widespread epigenomic differences.

Authors: Ryan Lister; Mattia Pelizzola; Robert H Dowen; R David Hawkins; Gary Hon; Julian Tonti-Filippini; Joseph R Nery; Leonard Lee; Zhen Ye; Que-Minh Ngo; Lee Edsall; Jessica Antosiewicz-Bourget; Ron Stewart; Victor Ruotti; A Harvey Millar; James A Thomson; Bing Ren; Joseph R Ecker
Journal: Nature Date: 2009-10-14 Impact factor: 49.962

9. SOAP: short oligonucleotide alignment program.

Authors: Ruiqiang Li; Yingrui Li; Karsten Kristiansen; Jun Wang
Journal: Bioinformatics Date: 2008-01-28 Impact factor: 6.937

10. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning.

Authors: Shawn J Cokus; Suhua Feng; Xiaoyu Zhang; Zugen Chen; Barry Merriman; Christian D Haudenschild; Sriharsa Pradhan; Stanley F Nelson; Matteo Pellegrini; Steven E Jacobsen
Journal: Nature Date: 2008-02-17 Impact factor: 49.962

117 in total

1. RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing.

Authors: Yuanxin Xi; Christoph Bock; Fabian Müller; Deqiang Sun; Alexander Meissner; Wei Li
Journal: Bioinformatics Date: 2011-12-06 Impact factor: 6.937

2. Transcriptome and methylome interactions in rice hybrids.

Authors: Ramakrishna K Chodavarapu; Suhua Feng; Bo Ding; Stacey A Simon; David Lopez; Yulin Jia; Guo-Liang Wang; Blake C Meyers; Steven E Jacobsen; Matteo Pellegrini
Journal: Proc Natl Acad Sci U S A Date: 2012-07-09 Impact factor: 11.205

3. Genome-wide identification of regulatory DNA elements and protein-binding footprints using signatures of open chromatin in Arabidopsis.

Authors: Wenli Zhang; Tao Zhang; Yufeng Wu; Jiming Jiang
Journal: Plant Cell Date: 2012-07-05 Impact factor: 11.277

Review 4. Suppression Subtractive Hybridization Versus Next-Generation Sequencing in Plant Genetic Engineering: Challenges and Perspectives.

Authors: Mahbod Sahebi; Mohamed M Hanafi; Parisa Azizi; Abdul Hakim; Sadegh Ashkani; Rambod Abiri
Journal: Mol Biotechnol Date: 2015-10 Impact factor: 2.695

5. Profiling DNA methylome landscapes of mammalian cells with single-cell reduced-representation bisulfite sequencing.

Authors: Hongshan Guo; Ping Zhu; Fan Guo; Xianlong Li; Xinglong Wu; Xiaoying Fan; Lu Wen; Fuchou Tang
Journal: Nat Protoc Date: 2015-04-02 Impact factor: 13.491

6. INVOLVED IN DE NOVO 2-containing complex involved in RNA-directed DNA methylation in Arabidopsis.

Authors: Israel Ausin; Maxim V C Greenberg; Dhirendra K Simanshu; Christopher J Hale; Ajay A Vashisht; Stacey A Simon; Tzuu-fen Lee; Suhua Feng; Sophia D Española; Blake C Meyers; James A Wohlschlegel; Dinshaw J Patel; Steven E Jacobsen
Journal: Proc Natl Acad Sci U S A Date: 2012-05-16 Impact factor: 11.205

7. BRAT-BW: efficient and accurate mapping of bisulfite-treated reads.

Authors: Elena Y Harris; Nadia Ponts; Karine G Le Roch; Stefano Lonardi
Journal: Bioinformatics Date: 2012-05-03 Impact factor: 6.937

8. Intrauterine calorie restriction affects placental DNA methylation and gene expression.

Authors: Pao-Yang Chen; Amit Ganguly; Liudmilla Rubbi; Luz D Orozco; Marco Morselli; Davin Ashraf; Artur Jaroszewicz; Suhua Feng; Steve E Jacobsen; Atsushi Nakano; Sherin U Devaskar; Matteo Pellegrini
Journal: Physiol Genomics Date: 2013-05-21 Impact factor: 3.107

9. The human placenta methylome.

Authors: Diane I Schroeder; John D Blair; Paul Lott; Hung On Ken Yu; Danna Hong; Florence Crary; Paul Ashwood; Cheryl Walker; Ian Korf; Wendy P Robinson; Janine M LaSalle
Journal: Proc Natl Acad Sci U S A Date: 2013-03-25 Impact factor: 11.205

10. Non-exhaustive DNA methylation-mediated transposon silencing in the black truffle genome, a complex fungal genome with massive repeat element content.

Authors: Barbara Montanini; Pao-Yang Chen; Marco Morselli; Artur Jaroszewicz; David Lopez; Francis Martin; Simone Ottonello; Matteo Pellegrini
Journal: Genome Biol Date: 2014-07-31 Impact factor: 13.583