Julie A Sleep, Andreas W Schreiber, Ute Baumann.
Abstract
BACKGROUND: Next (second) generation sequencing is an increasingly important tool for many areas of molecular biology; however, care must be taken when interpreting its output. Even a low error rate can cause a large number of errors because of the high number of nucleotides being sequenced. Distinguishing sequencing errors from true biological variants is a challenging task, and it is even more difficult for organisms without a reference genome.
Year: 2013 PMID: 24350580 PMCID: PMC3879328 DOI: 10.1186/1471-2105-14-367
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Number of vertices plotted against sequence abundance. Number of vertices for each parent node (Y) plotted against abundance (X) for sequences of length 21. The theoretical curve given by the function Y = 3L[1 - (1 - p)^X] ([14]), using p = 0.0004, is shown in grey. This function explains the general trend of the data but not the substantial variation in the number of variants.
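The theoretical curve in Figure 1 can be evaluated with a few lines of Python. This is a minimal sketch that assumes the curve has the form Y = 3L[1 - (1 - p)^X], where the exponent is the abundance X, L is the sequence length, and 3L counts the possible single-substitution variants of a sequence. The function name `expected_variant_vertices` is hypothetical and used only for illustration.

```python
def expected_variant_vertices(abundance, length=21, p=0.0004):
    """Evaluate Y = 3L[1 - (1 - p)^X] for a given abundance X.

    Assumption: the exponent in the caption's formula is the abundance X.
    length (L) and p default to the values quoted in the Figure 1 caption.
    """
    if abundance < 0:
        raise ValueError("abundance must be non-negative")
    return 3 * length * (1.0 - (1.0 - p) ** abundance)


# Tabulate the curve at a few illustrative abundances.
for x in (1, 10, 100, 1000, 10000):
    print(x, expected_variant_vertices(x))
```

Under this form the curve starts near 3Lp for a single read and saturates at 3L, the total number of single-substitution variants, as abundance grows.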
Properties of created subgraphs
| Sequence length | 1 | 2-20 | 20-40 | 40+ | |
| 20 | 33,170 | 2,530 | 27 | 17 | 992 |
| 21 | 132,373 | 11,992 | 170 | 105 | 1,048 |
| 22 | 86,118 | 7,078 | 63 | 32 | 387 |
| 23 | 171,287 | 9,714 | 79 | 47 | 296 |
| 24 | 1,277,008 | 101,108 | 1,264 | 757 | 2,030 |
Figure 2. Example model fit. Data points and fitted model for the probability of an A being misread as a C, for (a) an Illumina GA data set and (b) an Illumina HiSeq data set.
Figure 3. Modelled error rates. Modelled error rates from (a) an Illumina GA data set (lane 2), (b) an Illumina GA data set (lane 4) and (c) an Illumina HiSeq data set (lane 2).
Summary of modelled error probabilities and model parameters
| A → C | 1.4E-03 | 1.4E-04 | 0.11 | 8.2E-04 | 2.2E-04 | 0.08 | 2.7E-04 | 3.7E-05 | 0.06 |
| A → G | 5.1E-04 | 2.0E-04 | 0.04 | 4.8E-04 | 1.7E-04 | 0.07 | 4.1E-04 | 1.5E-04 | 0.03 |
| A → T | 4.1E-04 | 3.4E-05 | 2.1E-04 | 5.8E-05 | 2.0E-04 | 5.6E-05 | 0.03 | ||
| C → A | 0.07 | 7.9E-04 | 0.07 | 6.6E-05 | 6.9E-05 | 0.05 | |||
| C → G | 4.2E-04 | 8.0E-05 | 0.05 | 2.9E-04 | 5.3E-05 | 0.09 | 1.6E-04 | 5.2E-05 | 0.04 |
| C → T | 6.3E-04 | 2.1E-04 | 0.07 | 5.9E-04 | 1.9E-04 | 0.08 | 6.2E-04 | 3.1E-04 | -0.01 |
| G → A | 4.3E-04 | 1.6E-04 | 0.05 | 3.7E-04 | 2.0E-04 | 0.03 | 6.1E-04 | 4.7E-04 | -0.08 |
| G → C | 5.1E-04 | 1.4E-04 | 0.10 | 7.8E-04 | 1.3E-04 | 0.09 | 6.9E-05 | 3.1E-04 | -0.11 |
| G → T | 1.5E-03 | 3.5E-04 | 0.10 | 3.3E-04 | 0.08 | -0.13 | |||
| T → A | 3.6E-04 | 7.4E-05 | 0.08 | 2.4E-04 | 1.0E-04 | 0.06 | 1.4E-04 | 5.4E-05 | 0.05 |
| T → C | 6.1E-04 | 3.5E-04 | 0.04 | 5.6E-04 | 3.6E-04 | 0.04 | 5.1E-04 | 1.4E-04 | 0.02 |
| T → G | 3.3E-04 | 2.8E-04 | 0.05 | 3.4E-04 | 2.7E-04 | 0.08 | 1.3E-04 | 2.0E-05 | |
Probabilities for position 1, and coefficients A and exponents of the fitted exponential curves for positions 2 to 24, for the data sets corresponding to Figure 3.
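As an illustration of how such a model could be evaluated, the sketch below computes a per-position error probability from one table row. It rests on several assumptions that the table itself does not state: that each group of three values is (position-1 probability, coefficient A, exponent), that the fitted curve has the form A·exp(exponent·x), and that x is the 1-based read position. The function name `modelled_error_rate` and the symbol x are hypothetical.

```python
import math


def modelled_error_rate(position, p_first, coefficient, exponent):
    """Return the modelled substitution probability at a read position.

    Assumptions (not confirmed by the source):
      - position 1 uses its own tabulated probability, p_first;
      - positions >= 2 follow coefficient * exp(exponent * position).
    """
    if position < 1:
        raise ValueError("position is 1-based")
    if position == 1:
        return p_first
    return coefficient * math.exp(exponent * position)


# A -> C row, first three values of the table (assumed column order).
rates = [modelled_error_rate(x, 1.4e-3, 1.4e-4, 0.11) for x in range(1, 25)]
for pos, rate in enumerate(rates, start=1):
    print(pos, f"{rate:.2e}")
```

With a positive exponent the modelled rate rises towards the end of the read, while a negative exponent (as in some rows of the third data set) gives a falling rate.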
Summary of model parameters resulting from simulated data
| A → C | 1.4E-03 | 1.8E-04 | 0.10 |
| A → G | 6.8E-04 | 2.3E-04 | 0.04 |
| A → T | 4.6E-04 | 4.6E-05 | |
| C → A | 0.07 | ||
| C → G | 4.3E-04 | 8.0E-05 | 0.08 |
| C → T | 8.3E-04 | 2.4E-04 | 0.06 |
| G → A | 4.9E-04 | 2.0E-04 | 0.06 |
| G → C | 5.3E-04 | 1.4E-04 | 0.11 |
| G → T | 1.8E-03 | 2.7E-04 | 0.14 |
| T → A | 4.1E-04 | 1.2E-04 | 0.06 |
| T → C | 5.9E-04 | 2.9E-04 | 0.06 |
| T → G | 3.9E-04 | 4.5E-04 | 0.02 |
Probabilities for position 1, and coefficients A and exponents of the fitted exponential curves for positions 2 to 24, for the simulated data set. The corresponding figure is shown in Additional file 3.
Evaluation of error correction algorithm on PhiX genomic sequences
| Mapping result | Predicted correct | Predicted erroneous |
| Exact match | 10115 | 8 |
| 1 mismatch | 2779 | 64137 |
| 2 mismatches | 164 | 17636 |
| 3 mismatches | 14 | 3217 |
Sequence counts comparing our model predictions of correct and erroneous sequences to results obtained by mapping the sequences to the corresponding genome.
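The counts in this table can be summarised as two agreement rates. The sketch below assumes that the first numeric column holds sequences the model predicted to be correct and the second holds sequences it predicted to be erroneous, and that an exact match to the PhiX genome is treated as a truly correct sequence. The names `counts` and `agreement` are hypothetical.

```python
# Rows: mapping result -> (predicted correct, predicted erroneous).
# The column order is an assumption inferred from the table caption.
counts = {
    "Exact match": (10115, 8),
    "1 mismatch": (2779, 64137),
    "2 mismatches": (164, 17636),
    "3 mismatches": (14, 3217),
}


def agreement(table):
    """Fraction of sequences where the prediction agrees with the mapping."""
    exact_ok, exact_flagged = table["Exact match"]
    mism_ok = sum(ok for key, (ok, _) in table.items() if key != "Exact match")
    mism_flagged = sum(bad for key, (_, bad) in table.items() if key != "Exact match")
    return {
        "exact_predicted_correct": exact_ok / (exact_ok + exact_flagged),
        "mismatched_predicted_erroneous": mism_flagged / (mism_ok + mism_flagged),
    }


summary = agreement(counts)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Keeping the two rates separate avoids letting the much larger mismatched class dominate a single accuracy figure.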