Christian Ledergerber, Christophe Dessimoz.
Abstract
Next-generation sequencing platforms are dramatically reducing the cost of DNA sequencing. With these technologies, bases are inferred from light intensity signals, a process commonly referred to as base-calling. Thus, understanding and improving the quality of sequence data generated using these approaches are of high interest. Recently, a number of papers have characterized the biases associated with base-calling and proposed methodological improvements. In this review, we summarize recent developments in base-calling approaches for the Illumina and Roche 454 sequencing platforms.
Year: 2011 PMID: 21245079 PMCID: PMC3178052 DOI: 10.1093/bib/bbq077
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1: Illustration of the commonly modeled biases in base-callers for the Illumina platform. ϕ: Phasing can be observed as leading (gray arrow) and lagging (black arrows) signal increases before and after each intensity peak. This is illustrated by the averaged intensities of the cytosine channel when sequencing GCAGTAGTGTTGGTTCTGTAGTGGAATGTGCGGTTGTTGAGAATTCAGTA. Cross-talk correction and normalization have been applied and the first cycle has been omitted. δ: Signal decay is illustrated by the intensity signal obtained when sequencing the microsatellite sequence ACACAC…; shown are the averaged intensities of cytosine (red) and adenine (blue) after cross-talk correction and normalization. Again, the first cycle is not shown. µ: Mixed clusters occur whenever more than one template co-locates on the tile. ω: The image shows local averages of the fluorescence intensities across the area of a tile. Due to optical effects, stronger intensities are measured toward the center of the image. Σ: The intensity quadruples of the four bases are not orthogonal. Shown is the projection of the measured intensities of the first sequencing cycle of the phiX174 data onto the axes corresponding to A and C. τ: In past chemistries, the T fluorophore was not washed away efficiently and hence accumulated with a growing number of cycles. The illustration shows the intensity values for one tile of a 51-cycle PhiX 174 RF1 run after correction by Bustard. Shown is the 95th percentile of the signal intensities in each channel and cycle. Figure credits: ϕ, δ from [16]; ω, Σ from [17]; τ from [18].
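The phasing effect (ϕ) described in the caption can be reproduced qualitatively with a toy simulation. The sketch below is only an illustration and is not the model used by any of the base-callers discussed here. It assumes that, in every cycle, a fixed fraction of the strands in a cluster fails to extend (lagging) and a fixed fraction extends by two bases (leading). The function name `simulate_channel` and the rates `p_lag` and `p_lead` are hypothetical.

```python
def simulate_channel(template, base, p_lag=0.02, p_lead=0.01):
    """Expected fraction of strands emitting `base` after each cycle.

    Toy assumption: per cycle, a strand advances by 0 positions with
    probability p_lag, by 2 with probability p_lead, and by 1 otherwise.
    """
    n = len(template)
    p_ok = 1.0 - p_lag - p_lead
    # dist[k] is the fraction of strands with k bases incorporated so far.
    dist = [0.0] * (n + 1)
    dist[0] = 1.0
    signal = []
    for _cycle in range(n):
        new = [0.0] * (n + 1)
        for k, frac in enumerate(dist):
            if frac == 0.0:
                continue
            for step, prob in ((0, p_lag), (1, p_ok), (2, p_lead)):
                j = k + step
                if j <= n:  # strands running past the template end are dropped
                    new[j] += frac * prob
        dist = new
        # A strand with k bases incorporated fluoresces as template[k - 1].
        signal.append(
            sum(dist[k] for k in range(1, n + 1) if template[k - 1] == base)
        )
    return signal


# Cytosine channel for a short, made-up template.
c_channel = simulate_channel("GCAG", "C")
```

With nonzero rates, the cytosine signal is already above zero one cycle before the C peak (leading) and remains above zero one cycle after it (lagging), which is the pattern the arrows in the figure point to.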
A summary of the available applications used for base-calling on the Illumina platform
| Name | Statistical approach | Biases explicitly corrected | Training data required | Quality score | Practical notes | References |
|---|---|---|---|---|---|---|
| Bustard | Parametric Model | Σ, ϕ, δ | No | Phred | Not freely retrievable | |
| Alta-Cyclic | Mixed Parametric and SVM | Σ, ϕ, δ | Yes | Phred | No longer maintained; requires a Sun Grid Engine cluster environment | [ |
| Rolexa | Parametric Model | Σ, ϕ, ω | No | IUPAC | No longer maintained | [ |
| Swift | Parametric Model | Σ, ϕ, µ | No | Phred | No longer maintained | [ |
| BayesCall / naiveBayesCall | Parametric Model | Σ, ϕ, δ | No | Phred | | [ |
| Seraphim | Parametric Model | Σ, ϕ, δ | No | Phred | We did not succeed installing it | [ |
| Ibis | Fully empirical SVM | (n/a) | Yes | Phred | | [ |
| BING | Parametric Model | Σ, ϕ | No | None | Not freely retrievable; requires own image processing as input | [ |
We give a short description of the statistical approach used by each application. Next, the biases explicitly modeled and corrected by the application are reported (see Figure 1 for details). Alta-Cyclic and Ibis rely on supervised learning and require training data. Finally, the uncertainty of the base calls (sequencing quality) is reported either as Phred scores or using IUPAC codes. For details, please refer to the main text.
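The two quality representations named in the table can be made concrete with a short sketch. It uses the standard Phred definition, Q = -10 log10(P), where P is the estimated probability that a called base is wrong, and the standard IUPAC nucleotide ambiguity codes. The function names are illustrative and are not taken from any of the tools listed; how a given base-caller decides which bases remain plausible at a position is not shown here.

```python
import math


def phred_from_error_prob(p_error):
    """Phred quality score for an error probability in (0, 1]."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_prob_from_phred(q):
    """Error probability implied by a Phred quality score."""
    return 10.0 ** (-q / 10.0)


# Standard IUPAC codes, keyed by the set of bases that remain plausible.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_code(bases):
    """IUPAC symbol for a non-empty collection of candidate bases."""
    return IUPAC[frozenset(bases)]
```

For example, an error probability of 1 in 1000 corresponds to a Phred score of 30, and a position where only A and G remain plausible is written as R.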
Figure 2: Comparison of base-callers for the Illumina platform. (A) Error rate of base-callers for the Illumina platform (FC-104-10xx). The test data consist of 286 847 reads (length 51, chemistry FC-104-10xx) from the phiX174 control lane, provided by Martin Kirchner, who also provided the results for Bustard 1.95 and Alta-Cyclic. The method used here is identical to that of [18]. (B) Time required on a 2 GHz AMD Opteron with eight cores for training (blue) and base-calling (green). For Ibis, training was performed on a set of 1.15 million reads disjoint from the test set. For (naive)BayesCall, an unsupervised learning method, parameter estimation was performed on the test data set itself (the base-caller randomly selects 250 reads for this purpose). For Rolexa, for which there was no clear separation between a parameter estimation phase and a base-calling phase, all time was attributed to base-calling. Note that training only needs to be performed once, but base-calling on a full lane involves about 40 times more reads than in our test set. (C) Phred score accuracy. Deviations from the 45-degree line indicate either underestimation (curve above the line) or overestimation (curve below the line) of the quality of the bases called. We only report data for quality scores with at least 20 000 bases.
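A curve such as the one in panel (C) can be computed roughly as follows. This is a minimal sketch, assuming each called base is available as a pair of its predicted Phred score and a flag saying whether the call matched the known reference. The function name `empirical_quality` and the parameter `min_bases` are hypothetical; the default value mirrors the 20 000-base threshold stated in the caption.

```python
import math
from collections import defaultdict


def empirical_quality(calls, min_bases=20000):
    """Map each predicted Phred score to the observed Phred score.

    calls: iterable of (predicted_q, is_correct) pairs, one per base.
    Scores seen on fewer than min_bases bases are skipped, as are
    scores with no errors, where the observed quality is undefined.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for predicted_q, is_correct in calls:
        totals[predicted_q] += 1
        if not is_correct:
            errors[predicted_q] += 1

    observed = {}
    for q in sorted(totals):
        if totals[q] < min_bases or errors[q] == 0:
            continue
        observed[q] = -10.0 * math.log10(errors[q] / totals[q])
    return observed
```

Plotting the observed score against the predicted score gives the calibration curve: points above the 45-degree line mean the base-caller underestimated the quality of its calls, and points below mean it overestimated it.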