| Literature DB >> 24261665 |
Changjin Hong1, Nathan L Clement, Spencer Clement, Saher Sue Hammoud, Douglas T Carrell, Bradley R Cairns, Quinn Snell, Mark J Clement, William Evan Johnson.
Abstract
BACKGROUND: DNA methylation has been linked to many important biological phenomena. Researchers have recently begun to sequence bisulfite treated DNA to determine its pattern of methylation. However, sequencing reads from bisulfite-converted DNA can vary significantly from the reference genome because of incomplete bisulfite conversion, genome variation, sequencing errors, and poor quality bases. Therefore, it is often difficult to align reads to the correct locations in the reference genome. Furthermore, bisulfite sequencing experiments have the additional complexity of having to estimate the DNA methylation levels within the sample.Entities:
Mesh:
Substances:
Year: 2013 PMID: 24261665 PMCID: PMC3924334 DOI: 10.1186/1471-2105-14-337
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Parameters used in the two experiments for each of the aligners tested
| -m 17 -s 1 -T 20 -a 0.90 (-a 0.92) -b | |
| -k 17 -s 2 -r -A 20 -t 75 (-t 90) -b2 | |
| -n 0 -w 100 -v 3 | |
| -n 2 -l 50 | |
| -N 1 -L 20 –bowtie2 –min-score L,0,-0.6 | |
| -Q1 -j1 -d120 -n20 -f1 | last-map-probs.py -s150 -m0.95 | |
| default settings for single-end reads |
Parameters in parentheses were used for the human BSR dataset. Up to 3 to 4 sequencing errors or mutations were allowed in a 100-bp long BSR. Because the human genome contains many repeated genomic regions, 20 valid mapping locations were allowed for a given read when the alignment method supported this feature.
Simulated bisulfite read experiment
| | | | | | | |
| Total reads aligned (%) | 156.6M( | 155.8M(97.4) | 153.4M(95.9) | 149.5M(93.4) | 145.4M(90.8) | 158.7M(99.2) |
| Correctly aligned (%) | 155.2M( | 154.2M(96.3) | 150.2M(93.9) | 149.2M(93.2) | 145.2M(90.7) | 155.1M(96.9) |
| Incorrectly aligned (%) | 1.4M(0.9) | 1.7M(1.1) | 1.5M(1.0) | 0.3M(0.2) | 0.2M( | 3.6M(2.3) |
| | | | | | | |
| Total reads aligned (%) | 69.0M(97.8) | 66.0M(93.6) | 65.3M(92.6) | 63.6M(90.2) | 59.6M(84.4) | 70.3M( |
| Correctly aligned (%) | 67.7M( | 65.3M(92.6) | 63.9M(90.1) | 63.3M(89.8) | 59.4M(84.1) | 66.7M(94.6) |
| Incorrectly aligned (%) | 1.3M(1.8) | 0.7M(1.0) | 1.4M(2.0) | 0.3M(0.4) | 0.2M( | 3.5M(5.1) |
| | | | | | | |
| Ave. absolute estimation err. | 0.11 | 0.69 | 0.22 | 0.11 | 0.10 | - |
| Standard err. | 0.056 | 0.066 | 0.067 | 0.064 | 0.062 | - |
| | | | | | | |
| Total compute time (16 CPUs) | 39 h 50 m | 29 h 25 m | 46 h 16 m | 97 h 26 m | 58 h 20 m | |
| Peak memory usage (GB) | 44.8 | 14.5 | 9.4 | 7.9 | 15.9 | |
| Reads per second per CPU | 68 | 92 | 607 | 448 | 26 |
Simulation study of 160 million (M) simulated BSRs generated from the human genome reference sequence. The GNUMAP-bs algorithm was the most sensitive aligner, especially for reads with ≥1 sequence variant (sequencing errors or mutations). The Bismark algorithm had the smallest error rate with 1.2 M fewer erroneously assigned reads than GNUMAP-bs, however GNUMAP-bs correctly aligned 6 to 10 M more reads. The BSMAP algorithm had the fastest total run time, however its sensitivity was less than the sensitivity of the GNUMAP-bs algorithm. LAST mapped nearly all reads with a sensitivity that was comparable to that of GNUMAP-bs, but the mapping error rate for LAST was much higher than it was for GNUMAP-bs.
Human bisulfite read experiment
| | | | | | | | |
| Reads Aligned | 1.57M | 1.65M | 1.34M | 1.31M | 1.21M | 3.02M | 0.98M |
| CGs Covered | 330K | 336K | 330K | 310K | 299K | 373K | 275K |
| CG read coverage | 6.3 | 6.1 | 5.8 | 5.8 | 5.7 | 7.6 | 5.0 |
| | | | | | | | |
| CGs Covered | 7,902 | 7,802 | 7,747 | 7,561 | 7,331 | 8,606 | 6,690 |
| Correlation | 0.887 | 0.889 | 0.887 | 0.888 | 0.895 | 0.882 | 0.894 |
| Concordance | 0.869 | 0.867 | 0.865 | 0.867 | 0.872 | 0.858 | 0.872 |
| CG Read coverage | 4.6 | 4.5 | 4.4 | 4.3 | 4.3 | 4.9 | 3.7 |
Study of 162.3 M Illumina BSRs on human chromosome 22 (chr22) from a human donor sample. The top rows show the overall mapping efficiency, including covered CGs sites and read coverage aligned to chr22. The bottom rows show the consistency of each mapping result with known methylation levels at 13,563 CGs sites on chr22 available from the HEP database. The concordance statistic is defined as the fraction of overlapped CGs covered by the aligner with HEP CGs for which the methylation prediction error was smaller than 0.25.
Figure 1workflow for mapping high-throughput bisulfite reads. A flow-chart of the GNUMAP-bs algorithm is shown. For the reference genome, all Cs are converted to Ts and then each k-mer in the genome is hashed, producing a list of positions in the genome to which the k-mer sequence is mapped. Given a completely BS-converted read (e.g. TGGTGTT) where the corresponding original bisulfite read (BSR) is known (e.g. CGGTGTC), a query k-mer (e.g. TGGTG) is searched against the BS-converted hash table. Locations with a k-mer match to both the genome and the read are aligned using the original read and original genome and the GNUMAP-bs probabilistic Needleman–Wunsch (N-W) algorithm. If the alignment score passes a quality threshold, the location is considered a match and recorded on the genome for future output. The posterior probability is computed for each mapped location using Equation (1). Finally, all the mapped BSRs are used to quantify the fraction of methylation at each CG location across the reference genome.
Figure 2Relative complement mapping consistency of with HEP methylation profiles of human chromosome 22. Venn diagrams between GNUMAP-bs and (a) Novoalign, (b) BSMAP, and (c) Bismark showing both the number of covered/uncovered CG sites and the concordance (in parenthesis) of these sites with the HEP methylation profiles. The estimated levels of methylation in the additional CG sites covered by the probabilistic aligners but not by the other aligners are highly concordant with the HEP results.