| Literature DB >> 24839440 |
Hong Tran1, Jacob Porter1, Ming-An Sun2, Hehuang Xie2, Liqing Zhang1.
Abstract
Background. Large-scale bisulfite treatment and short reads sequencing technology allow comprehensive estimation of methylation states of Cs in the genomes of different tissues, cell types, and developmental stages. Accurate characterization of DNA methylation is essential for understanding genotype phenotype association, gene and environment interaction, diseases, and cancer. Aligning bisulfite short reads to a reference genome has been a challenging task. We compared five bisulfite short read mapping tools, BSMAP, Bismark, BS-Seeker, BiSS, and BRAT-BW, representing two classes of mapping algorithms (hash table and suffix/prefix tries). We examined their mapping efficiency (i.e., the percentage of reads that can be mapped to the genomes), usability, running time, and effects of changing default parameter settings using both real and simulated reads. We also investigated how preprocessing data might affect mapping efficiency. Conclusion. Among the five programs compared, in terms of mapping efficiency, Bismark performs the best on the real data, followed by BiSS, BSMAP, and finally BRAT-BW and BS-Seeker with very similar performance. If CPU time is not a constraint, Bismark is a good choice of program for mapping bisulfite treated short reads. Data quality impacts a great deal mapping efficiency. Although increasing the number of mismatches allowed can increase mapping efficiency, it not only significantly slows down the program, but also runs the risk of having increased false positives. Therefore, users should carefully set the related parameters depending on the quality of their sequencing data.Entities:
Year: 2014 PMID: 24839440 PMCID: PMC4009243 DOI: 10.1155/2014/472045
Source DB: PubMed Journal: Adv Bioinformatics ISSN: 1687-8027
Figure 1Bisulfite mapping tools classification. The tools can be divided into two groups based on indexing strategies: hash tables or suffix/prefix tries. Each of the groups is classified further into subgroups where some example programs are shown. *BFAST uses multiple index strategies: both hashing and suffix tree.
Detailed comparison of different bisulfite short reads mapping tools.
| Programs | Year | Algorithmic Technique used | Language | Aligner | Input | Output | Min./Max. read length | Mismatches | Indels | Gaps | Single/Paired-end | Multi-threaded | Nondirectional |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ERNE-bs5 | 2012 | Hash genome indexing uses a 5-letter (Cm, Cu) for storing methylation information and uses a weighted context-aware Hamming distance to identify a T coming from an unmethylated C. | C++ | None | gz/bz2/fastq/fasta | BAM/ | up to 600 bp | 1 every 15 bp (-errors arg) | Yes | Yes | both | Yes | No |
|
| |||||||||||||
| BatMeth | 2012 | FM index integrates mismatch counting, list filtering and mismatch stage filtering and fast mapping onto two indexes. | Perl/C++ | None | fasta | NA | NA | up to 5 (- | No | No | Yes | Yes | Yes |
|
| |||||||||||||
| BiSS | 2012 | Reference genome hashing, local Smith-Waterman alignment | Perl | None | fasta/fastq/gz/SAM/BAM | SAM/BAM/Next GenMap | up to 4096 bp | (- | Yes | Yes | Yes | Yes | No |
|
| |||||||||||||
| Bismark | 2011 | FM-Index enumerates all possible T to C conversion | Perl | Bowtie/Bowtie2 | fasta/fastq | BAM/SAM | Bowtie: up to 1000 bp Bowtie 2: unlimited | 0 or 1 in a seed (- | Yes | Yes | both | Yes | Yes |
|
| |||||||||||||
| BS-Seeker2 | 2013 | FM-Index enumerates all possible T to C conversion | Python | Bowtie2/Bowtie/SOAP/RMAP | fasta, fastq, qseq, pure sequence | BAM/SAM/BS-Seeker | 50–500 bp | up to 4 per read (- | Yes | Yes | Single | No | Yes |
|
| |||||||||||||
| BS-Seeker | 2010 | FM-Index, enumerates all possible T to C conversion, converts the genome to 3 letters, and uses Bowtie to align reads | Python | Bowtie | fasta, fastq, qseq, pure sequence | BAM/SAM/BS_Seeker | 50–250 bp | up to 3 per read (- | Yes | No | Single | No | Yes |
|
| |||||||||||||
| BSMAP | 2009 | hashing of reference genome and bitwise masking tries all possible T to C combinations for reads | Python | SOAP | fasta/fastq/ | SAM/txt | up to 144 bp | up to 15 in a read (- | up to 3 bp | both | Yes | Yes | |
|
| |||||||||||||
| RMAP | 2008 | Wildcard matching for mapping Ts, incorporates the use of quality scores directly into the mapping process | C++ | fastq/fasta | BED | unlimited | up to 10 in a read (- | No | No | both | No | No | |
|
| |||||||||||||
| BRAT-BW | 2012 | Converts a TA reference and CG reference; two FM indices are built on the positive strand of the reference genome | C++ | Text file with input file names in fastq, sequence only | txt | 32 bp-unlimited | unlimited | No | No | both | Yes | Yes | |
|
| |||||||||||||
| MAQ | 2008 | Builds multiple hash tables to index the reads, scans the reference genome against the hash tables to find hits | Perl/C/C++ | fastq | maq | Up to 63 bp | up to 3 per read | Yes, - | No | both | No | No | |
|
| |||||||||||||
| PASH | 2010 | Implements | C | fastq | Txt/SAM | NA | Yes | Yes | No | Single | No | No | |
|
| |||||||||||||
| Novo-align | 2010 | Hashing genome | C/C++ | fastq | SAM/BAM | up to 8 per read, 16 for paired end reads | Yes | Yes | up to 7 bp on single end reads | Both | No | Yes | |
|
| |||||||||||||
| Methyl-coder | 2011 | FM-Index, all Cs converted to Ts | C/C++/Python | GSNAP/bowtie | fastq/fasta | BAM/SAM | Bowtie: up to 1000 bp | Yes | No | Yes | both | No | No |
|
| |||||||||||||
| GSNAP | 2005 |
| C/Perl | gzip/fastq, fasta/bzip2 | SAM/GSNAP | 14–250 bp | Yes | Yes | Yes | both | yes | No | |
|
| |||||||||||||
| BFAST | 2009 | Uses multiple indexing strategies: hashing and suffix array of the reference genome | C | fastq/bz2/gzip | SAM | NA | Yes | Yes | Yes | both | Yes | Yes | |
|
| |||||||||||||
| Segemehl | 2008 | Enhanced suffix arrays to find exact and inexact matches. Align to read using Myers bitvector algorithm | C/C++ | fasta | SAM | unlimited | Yes | (- | Yes | both | Yes | No | |
*BFAST does not have a direct option for bisulfite mapping, users have to convert Cs to Ts in both a reference genome and reads and then align converted reads to the converted reference genome.
*Parenthesis in mismatches column indicates parameter for mismatches in a program.
∗1A min percentages of matches per read.
Figure 2Mapping efficiency on ten human blood datasets for BSMAP, Bismark, BS-Seeker, BRAT-BW, and BiSS with zero mismatches allowed between reads and the reference genome.
Figure 3CPU running time (on a log scale) on human blood data for BSMAP, Bismark, BS-Seeker, BRAT-BW, and BiSS with zero mismatches allowed between reads and the reference genome.
Figure 5The effect of trimming reads on mapping efficiency on ten human blood, ten human brain, and eight mouse brain datasets for BSMAP and Bismark.
Figure 6Mean and standard deviations of mapping percentages across ten human blood, ten human brain, and eight mouse brain datasets.
Improvement in mapping efficiency after using BSMAP and BS-Seeker to map unmapped reads from Bismark on human blood data.
| File name | Total number of reads | Unmapped reads in BISMARK | Overall improvement using BSMAP | Overall improvement using BS-Seeker |
|---|---|---|---|---|
| SRR342552 | 23,472,574 | 10512269 | 3.72% | 0.90% |
| SRR342553 | 23,749,583 | 10610307 | 4.24% | 1.03% |
| SRR342554 | 25,232,053 | 11277407 | 4.29% | 1.07% |
| SRR342555 | 23,750,428 | 10452979 | 4.23% | 1.01% |
| SRR342556 | 23,140,352 | 10204603 | 4.28% | 1.06% |
| SRR342557 | 23,089,492 | 10093756 | 4.33% | 1.05% |
| SRR342558 | 21,205,564 | 9215604 | 4.26% | 1.04% |
| SRR342560 | 26,174,056 | 11491673 | 4.17% | 1.01% |
| SRR342561 | 25,457,341 | 11271400 | 4.16% | 1.02% |
Figure 4Unique mapping efficiency on ten human blood datasets from BS-Seeker with different numbers of mismatches allowed between reads and the reference genome (0, 1, 2, and 3 mismatches).
Figure 7The effect of sequencing error on mapping efficiency for BSMAP and Bismark using simulated data generated from Sherman simulator with varying sequencing error from 0.1 to 4.75% (e.g., sequencing error 0.1% means 1 error in every 1000 bases) for read length = 101 bp, CG = 10% (10% of all CG-cytosines will be converted into thymines), and CH = 98.5% (98.5% of all CH-cytosines will be converted into thymines).
Figure 8The effect of read length on mapping efficiency for BSMAP and Bismark using simulated data generated from Sherman simulator with different read lengths (from 40 to 160 bps) for sequencing error e = 0.16, CG = 10%, and CH = 98.5% for mouse and e = 0.16, CG = 19.73%, and CH = 98.9% for human.