Chengxi Ye, Chiaowen Hsiao, Héctor Corrada Bravo.
Abstract
MOTIVATION: Base-calling of sequencing data produced by high-throughput sequencing platforms is a fundamental process in current bioinformatics analysis. However, existing third-party probabilistic or machine-learning methods that significantly improve the accuracy of base-calls on these platforms are impractical for production use due to their computational inefficiency.
Year: 2014 PMID: 24413520 PMCID: PMC3998134 DOI: 10.1093/bioinformatics/btu010
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Signal properties in the base-calling problem. (A) Fluorescence intensity measurements from one cluster for 50 sequencing cycles. Cross-talk and signal decay effects are clearly observed in this data. Background intensity increases as sequencing progresses. (B) The phasing effect demonstrated on a subset of data from (A). High intensity in the C channel in cycle 32 affects background intensity in the C channel in neighboring cycles.
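The effects named in this caption can be illustrated with a toy forward model. The sketch below is not the model used by BlindCall: `simulate_cluster` is a hypothetical helper, and the cross-talk matrix, decay rate and phasing/pre-phasing fractions are illustrative assumptions. It covers cross-talk, signal decay and phasing only, and it does not model the rising background intensity.

```python
import numpy as np

def simulate_cluster(bases, crosstalk, decay=0.98, phasing=0.01,
                     prephasing=0.01, noise=0.0, seed=0):
    """Toy forward model for one cluster (needs at least 3 cycles).

    Returns an (n_cycles, 4) array of intensities in channel order A, C, G, T.
    All parameter values are illustrative assumptions, not fitted values.
    """
    rng = np.random.default_rng(seed)
    idx = np.array(["ACGT".index(b) for b in bases])
    n = len(idx)

    # Ideal signal: exactly one active channel per cycle.
    signal = np.zeros((n, 4))
    signal[np.arange(n), idx] = 1.0

    # Phasing: a cycle also sees a fraction of the previous base (lagging
    # strands) and of the next base (leading strands), within each channel.
    kernel = np.array([prephasing, 1.0 - phasing - prephasing, phasing])
    blurred = np.column_stack(
        [np.convolve(signal[:, c], kernel, mode="same") for c in range(4)]
    )

    # Signal decay: intensity shrinks geometrically with the cycle number.
    blurred *= decay ** np.arange(n)[:, None]

    # Cross-talk: each base also excites the other channels.
    observed = blurred @ np.asarray(crosstalk).T
    return observed + noise * rng.standard_normal(observed.shape)
```

With an identity cross-talk matrix and no decay, a single C surrounded by A's leaks into the C channel of both neighboring cycles, which is the pattern panel (B) describes.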
Fig. 2. The BlindCall architecture. BlindCall consists of two modules: (A) the training module uses blind deconvolution to simultaneously estimate model parameters and produce a deconvolved signal from which base-calling is done; (B) the calling module uses the parameters estimated in the training module to produce a deconvolved output signal.
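To make the calling step concrete, here is a minimal sketch of non-blind deconvolution with parameters assumed to be already known. It is not the BlindCall implementation and it does not perform the blind estimation done by the training module. The function names, the tridiagonal phasing model and the linear cross-talk model are assumptions made for illustration, and signal decay and noise are left out.

```python
import numpy as np

def phasing_matrix(n, phasing, prephasing):
    """n x n matrix P with observed = P @ signal (assumed tridiagonal model)."""
    P = np.eye(n) * (1.0 - phasing - prephasing)
    P += np.eye(n, k=-1) * phasing      # lagging strands show the previous base
    P += np.eye(n, k=1) * prephasing    # leading strands show the next base
    return P

def call_bases(intensity, crosstalk, phasing, prephasing):
    """Invert the assumed model intensity = P @ signal @ crosstalk.T.

    intensity: (n_cycles, 4) array in channel order A, C, G, T.
    Returns the called read and the deconvolved signal.
    """
    n = intensity.shape[0]
    P = phasing_matrix(n, phasing, prephasing)
    dephased = np.linalg.solve(P, intensity)            # undo mixing across cycles
    signal = np.linalg.solve(crosstalk, dephased.T).T   # undo mixing across channels
    read = "".join("ACGT"[i] for i in signal.argmax(axis=1))
    return read, signal
```

Solving the two linear systems separately keeps the cost of the cross-talk step linear in the number of cycles. The phasing step here uses a dense solver for brevity; a banded solver would be the natural choice for long reads.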
Base-caller accuracy and runtime comparison
| | Bustard | AYB | BlindCall slow | BlindCall fast | freeIbis |
|---|---|---|---|---|---|
| Perfect reads | 1 446 079 | 1 532 000 | 1 509 451 | 1 508 779 | 1 530 099 |
| Error rate (%) | 0.29 | 0.21 | 0.23 | 0.23 | 0.21 |
| Time (min) | 17 | 217 | 8/12 | 4/8 | 9/126 |

Assembly results

| Coverage | Bustard N50 | Bustard Maximum | AYB N50 | AYB Maximum | BlindCall slow N50 | BlindCall slow Maximum | BlindCall fast N50 | BlindCall fast Maximum | freeIbis N50 | freeIbis Maximum |
|---|---|---|---|---|---|---|---|---|---|---|
| 5× | 610 | 1122 | 628 | 1155 | 629 | 1164 | 623 | 1167 | 649 | 1184 |
| 10× | 3375 | 3469 | 3198 | 3322 | 3382 | 3487 | 3389 | 3485 | 3306 | 3418 |
| 20× | 4466 | 4478 | 4627 | 4637 | 4511 | 4523 | 4470 | 4483 | 4333 | 4357 |
Accuracy and run times for Bustard, AYB, freeIbis and BlindCall on a dataset of 1.9 million reads from a HiSeq 2000 run of PhiX174. BlindCall fast corresponds to the non-iterative version of the blind-deconvolution method. Running times for BlindCall are reported as (processing time/total time), where the total time includes reading intensity data from disk and writing base-calls to disk. For freeIbis, we report the time as (prediction time with a single thread/training time with 10 threads). BlindCall produced base-calls of accuracy comparable to AYB and freeIbis in significantly less computational time (8 min/12 min versus 217 min and 126 min, respectively). It is also faster than Bustard (8 min/12 min versus 17 min). AYB, freeIbis and BlindCall all improve on Bustard base-calls. We also compared assemblies of the PhiX174 genome using reads generated by Bustard, BlindCall, freeIbis and AYB. The reported N50s and maximum contig lengths are averages over 100 random samples with the corresponding coverage (5×, 10× or 20×). Although BlindCall processes data at a significantly lower computational cost, the assemblies obtained using BlindCall are of comparable quality to those obtained using AYB or freeIbis.
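For readers unfamiliar with the N50 statistic reported in the assembly rows, a minimal sketch of its standard definition follows. The function name `n50` is chosen here for illustration, and this is not the evaluation script used in the paper.

```python
def n50(contig_lengths):
    """Largest length L such that contigs of length >= L together cover
    at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")
```

For contigs of lengths 2 through 10 (54 bases in total), the three longest contigs already reach 27 bases, so `n50` returns 8. Averaging this value over repeated random read samples at a fixed coverage gives figures of the kind shown in the table.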
Fig. 3. Third-party base callers improve Bustard per-cycle error rate. We plot error rate of each base-caller per sequencing cycle on the PhiX174 test data. All three base callers significantly improve accuracy over Bustard, especially in later cycles. BlindCall is able to achieve comparable accuracy while processing data at a much faster rate.
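The quantities plotted here and tabulated above (per-cycle error rate, overall error rate, perfect reads) can be computed as in the sketch below. It assumes each called read has already been paired with the true sequence it came from, for example by alignment to the PhiX174 reference, and that only substitutions occur. `error_summary` is a hypothetical helper, not the paper's evaluation code.

```python
import numpy as np

def error_summary(called, truth):
    """called, truth: equal-length lists of equal-length base strings.

    Assumes reads are already paired with their true sequences and that
    errors are substitutions only (no insertions or deletions).
    """
    c = np.array([list(read) for read in called])
    t = np.array([list(read) for read in truth])
    mismatch = c != t                                   # (n_reads, n_cycles)
    return {
        "per_cycle_error": mismatch.mean(axis=0),       # one rate per cycle
        "error_rate": float(mismatch.mean()),           # over all called bases
        "perfect_reads": int((~mismatch.any(axis=1)).sum()),
    }
```

Multiplying `error_rate` by 100 gives a percentage comparable to the "Error rate (%)" rows in the tables.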
Accuracy comparison
| Ibis Test | PhiX174 (AYB) | ||||||
|---|---|---|---|---|---|---|---|
| | Perfect reads | Error rate (%) | Perfect reads | Error rate (%) | Perfect reads | Error rate (%) | |
| Bustard | 99 834 | 1.45 | 1 557 963 | 2.01 | 24 478 | 0.49 | |
| AYB | 133 537 | 0.73 | 2 304 005 | 1.26 | 26 878 | 0.38 | |
| BlindCall slow | 110 951 | 1.12 | 1 902 621 | 1.61 | 25 144 | 0.45 | |
| BlindCall fast | 105 312 | 1.26 | 1 856 286 | 1.66 | 24 740 | 0.47 | |
| Time (min), BlindCall slow | 0.08/0.3/1 | | 0.11/6/10 | | 0.15/14/22 | | |
| Time (min), BlindCall fast | 0.08/0.1/1 | | 0.11/3/8 | | 0.15/7/16 | | |
Accuracy for Bustard, AYB and BlindCall on various datasets. BlindCall produced accuracy comparable to state-of-the-art base callers in significantly less computational time. All methods improve on Bustard base-calls. Run times for BlindCall are reported as (training time/processing time/total time) in minutes, where the total time includes reading intensity data from disk and writing base-calls to disk.
Fig. 4. BlindCall produces accurate calibrated quality scores. We plot observed error rates (on the PHRED scale) for Bustard, AYB and BlindCall against the error rates predicted by their quality scores, and observe high correlation for all base callers.
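A calibration check of this kind can be sketched as follows, using the standard PHRED definition Q = -10 log10(p), where p is the error probability. The helper names and the binning by exact quality value are assumptions for illustration, and this is not the procedure used to produce the figure.

```python
import numpy as np

def phred(p):
    """PHRED-scaled quality for an error probability p (standard definition)."""
    return -10.0 * np.log10(p)

def calibration_table(pred_q, is_error):
    """Compare predicted quality scores with empirically observed ones.

    pred_q:   integer quality score assigned to each called base.
    is_error: True where the called base was wrong.
    Returns rows of (predicted Q, number of bases, number of errors, observed Q).
    """
    pred_q = np.asarray(pred_q)
    is_error = np.asarray(is_error, dtype=bool)
    rows = []
    for q in np.unique(pred_q):
        mask = pred_q == q
        n_bases = int(mask.sum())
        n_errors = int(is_error[mask].sum())
        # With no observed errors the empirical quality is unbounded.
        observed_q = float(phred(n_errors / n_bases)) if n_errors else float("inf")
        rows.append((int(q), n_bases, n_errors, observed_q))
    return rows
```

Well-calibrated scores give observed Q values close to the predicted ones, so the points fall near the diagonal when plotted.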
Fig. 5. Base-calling by blind deconvolution is scalable to long read lengths. We compare the computational time of BlindCall with that of the state-of-the-art probabilistic base caller AYB, the state-of-the-art supervised learning method freeIbis, and Illumina’s Bustard on the PhiX174 dataset reported in Table 1, as a function of the number of sequencing cycles. Since most model-based base callers resort to a dynamic programming solution, their running time is quadratic with respect to the read length. In contrast, BlindCall scales linearly with read length. Base callers based on the blind deconvolution framework will be able to scale as sequencers produce longer reads. freeIbis also scales linearly but is much slower than BlindCall.