Seokjun Soe, Yoonjae Park, Heejoon Chae.
Abstract
BACKGROUND: Bisulfite sequencing is one of the major high-resolution DNA methylation measurement methods. Because sodium bisulfite treatment selectively converts unmethylated cytosines, processing bisulfite-treated sequencing reads requires additional, computationally demanding steps. However, the dearth of efficient aligners designed for bisulfite-treated sequencing has become a bottleneck for large-scale DNA methylome analyses.
Keywords: Alignment; Apache Spark; Bisulfite sequencing; DNA methylation
Year: 2018 PMID: 30526492 PMCID: PMC6288881 DOI: 10.1186/s12859-018-2498-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The analysis workflow within BiSpark consists of four processing phases: (1) distributing the reads into key-value pairs, (2) transforming reads into 'three-letter' reads and mapping them to the transformed reference genome, (3) aggregating mapping results and filtering ambiguous reads, and (4) profiling the methylation information for each read. The figure depicts the case in which the library of the input data is non-directional
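The 'three-letter' transformation and the per-read methylation profiling named in the caption can be illustrated with a minimal sketch in plain Python. This is not BiSpark's code: the function names (`to_key_value`, `c_to_t`, `g_to_a`, `three_letter_variants`, `call_methylation`), the read identifiers, and the `M`/`u`/`.` call symbols are all hypothetical, and the sketch assumes the usual bisulfite convention that an unmethylated C is read as T (or G as A on the opposite strand). Alignment itself (phases 2 and 3) is omitted.

```python
def to_key_value(reads):
    """Phase 1 (sketch): turn a list of read sequences into (key, value) pairs.
    The 'read<i>' key format is an assumption made for this example."""
    return [(f"read{i}", seq) for i, seq in enumerate(reads)]


def c_to_t(seq):
    """Three-letter transformation: replace every C with T."""
    return seq.upper().replace("C", "T")


def g_to_a(seq):
    """Three-letter transformation for the opposite strand: replace every G with A."""
    return seq.upper().replace("G", "A")


def three_letter_variants(pair):
    """Phase 2 (sketch): for a non-directional library, emit both conversions
    of a read so that each can be mapped to the matching converted reference."""
    read_id, seq = pair
    return [(read_id, ("C2T", c_to_t(seq))), (read_id, ("G2A", g_to_a(seq)))]


def call_methylation(read, ref):
    """Phase 4 (sketch): compare the ORIGINAL read with the reference window
    it mapped to. 'M' = reference C kept as C (methylated), 'u' = reference C
    read as T (unmethylated), '.' = any other position."""
    calls = []
    for r, g in zip(read.upper(), ref.upper()):
        if g == "C" and r == "C":
            calls.append("M")
        elif g == "C" and r == "T":
            calls.append("u")
        else:
            calls.append(".")
    return "".join(calls)


if __name__ == "__main__":
    pairs = to_key_value(["ATGC"])
    print(three_letter_variants(pairs[0]))
    print(call_methylation("ATGC", "ACGC"))
```

In a Spark setting the same functions would typically be applied with map-style operations over the distributed key-value pairs; that wiring is left out here to keep the sketch free of dependencies.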
Experimental data for performance evaluation
| Data set | Tailored data size | # of reads | Description |
|---|---|---|---|
| Simulation data | 122 MB | 1,000,000 | Simulation set with 0% error |
| | 122 MB | 1,000,000 | Simulation set with 1% error |
| | 122 MB | 1,000,000 | Simulation set with 2% error |
| GEO WGBS data (GSE80911) | 1.6 GB | 10,000,000 | 10 million reads real data set |
| | 7.9 GB | 50,000,000 | 50 million reads real data set |
| | 16 GB | 100,000,000 | 100 million reads real data set |
| | 32 GB | 200,000,000 | 200 million reads real data set |
| Reference genome | | | Build 37, hg19 |
Simulation data sets were generated by Sherman [26] with error rates of 0%, 1% and 2%, where the error rate is the mean error rate per bp and the error curve follows an exponential decay model. Each test data set is tailored from the original WGBS data based on the number of reads
Testbed for performance evaluation
| System/framework | Description | Specification/version |
|---|---|---|
| Master | 1 master node of cluster | CPU: 2.2 GHz (Intel Xeon E5-2407), Memory: 8 GB |
| Slaves | {10,20,40} slave nodes of cluster | CPU: 3.3 GHz (Intel i3-3220), Memory: 8 GB |
| Single server | 24 core single server | CPU: 2.6 GHz (Intel Xeon X5650), Memory: 94 GB |
| Apache Hadoop | Distributed file system | v2.6.0 |
| Apache Spark | Data processing framework | v1.6.0 |
| Bowtie2 | General short read aligner | v2.2.9 |
| CloudAligner | Bisulfite aligner on cluster | v1.8 |
| Bison | Bisulfite aligner on cluster | v0.3.3 |
| Bismark | Bisulfite aligner on single machine | v0.18.1 |
Mappability, precision, sensitivity and accuracy of aligners
| Data set | Aligner | Mappability | Precision | Sensitivity | Accuracy |
|---|---|---|---|---|---|
| With 0% error | BiSpark† | 0.9569 | 1.0 | 0.9569 | 0.9569 |
| | Bismark | 0.9454 | 1.0 | 0.9454 | 0.9454 |
| | Bison | 0.8030 | 0.6090 | 0.7129 | 0.4891 |
| With 1% error | BiSpark† | 0.9494 | 0.9892 | 0.9489 | 0.9392 |
| | Bismark | 0.9440 | 0.9961 | 0.9438 | 0.9403 |
| | Bison | 0.8297 | 0.5812 | 0.7391 | 0.4823 |
| With 2% error | BiSpark† | 0.9422 | 0.9800 | 0.9411 | 0.9234 |
| | Bismark | 0.9182 | 0.9862 | 0.9171 | 0.9055 |
| | Bison | 0.8315 | 0.5729 | 0.7387 | 0.4763 |
†The results of BiSpark-plain and BiSpark-balance are both reported as BiSpark because the two versions differ only in how the data is distributed, so their results are always the same
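The metric definitions are not given in this excerpt. As an assumption, take precision P = TP/(TP+FP) and sensitivity S = TP/(TP+FN). The tabulated accuracy values are then consistent with accuracy = TP/(TP+FP+FN), which can be rewritten as 1/(1/P + 1/S - 1). The short check below recomputes accuracy from the precision and sensitivity columns and compares it with the reported column; the function names are hypothetical and the numbers are copied from the table.

```python
# (data set, aligner, precision, sensitivity, reported accuracy), copied from the table
ROWS = [
    ("0% error", "BiSpark", 1.0,    0.9569, 0.9569),
    ("0% error", "Bismark", 1.0,    0.9454, 0.9454),
    ("0% error", "Bison",   0.6090, 0.7129, 0.4891),
    ("1% error", "BiSpark", 0.9892, 0.9489, 0.9392),
    ("1% error", "Bismark", 0.9961, 0.9438, 0.9403),
    ("1% error", "Bison",   0.5812, 0.7391, 0.4823),
    ("2% error", "BiSpark", 0.9800, 0.9411, 0.9234),
    ("2% error", "Bismark", 0.9862, 0.9171, 0.9055),
    ("2% error", "Bison",   0.5729, 0.7387, 0.4763),
]


def accuracy_from(precision, sensitivity):
    """Assumed definition: TP / (TP + FP + FN), expressed through
    P = TP/(TP+FP) and S = TP/(TP+FN) as 1 / (1/P + 1/S - 1)."""
    return 1.0 / (1.0 / precision + 1.0 / sensitivity - 1.0)


def max_deviation(rows=ROWS):
    """Largest absolute gap between recomputed and reported accuracy."""
    return max(abs(accuracy_from(p, s) - acc) for _, _, p, s, acc in rows)


if __name__ == "__main__":
    for data_set, aligner, p, s, acc in ROWS:
        print(f"{data_set:9s} {aligner:8s} recomputed={accuracy_from(p, s):.4f} reported={acc:.4f}")
    print(f"max deviation: {max_deviation():.5f}")
```

Under this assumed definition every row agrees with the reported accuracy to within the rounding of the four-decimal inputs, which also explains why accuracy equals sensitivity whenever precision is 1.0.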
Fig. 2 Comparison between BiSpark and other aligners for bisulfite-treated reads. In the performance test, BiSpark outperforms all other aligners in terms of scalability to (a) data size and (b) cluster size