Susanne Balzer, Ketil Malde, Anders Lanzén, Animesh Sharma, Inge Jonassen.
Abstract
MOTIVATION: The commercial launch of 454 pyrosequencing in 2005 was a milestone in genome sequencing in terms of performance and cost. Throughout the three available releases, average read lengths have increased to approximately 500 base pairs and are thus approaching read lengths obtained from traditional Sanger sequencing. Study design of sequencing projects would benefit from being able to simulate experiments.
Year: 2010 PMID: 20823302 PMCID: PMC2935434 DOI: 10.1093/bioinformatics/btq365
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Data basis for building the empirical distributions
| | E. coli | D. labrax | Total |
|---|---|---|---|
| SFF files | | | |
| Number of reads | 1 176 344 | 1 270 325 | 2 446 669 |
| Average read length | 534.1 | 532.8 | 533.4 |
| Number of bases | 92 924 311 | 85 822 587 | 178 746 898 |
| Number of flow values | 142 361 278 | 130 621 280 | 272 982 558 |
| Reference genome | | | |
| Number of bases | 4 639 675 | 13 213 695 | – |
| Empirical distributions | | | |
| Number of flow values in noise distributions | 280 763 949 | 285 227 582 | 565 991 531 |
| Number of flow values in homopolymer distributions | 314 495 947 | 278 127 101 | 592 623 048 |
a After 454 quality trimming.
b Without Ns.
c Homopolymer lengths 1–5; equal to the number of homopolymer runs in the BLAST results.
Fig. 1. (a) A 454 flowgram: cyclic flowing during one read. The light signal strengths (flow values) are directly translated into homopolymer runs. (b) Absolute frequencies of flow values (E. coli). Left: original data without quality trimming; right: quality-trimmed data. The trimming algorithm sharpens the separation of the homopolymer length distributions and levels out discrepancies between the nucleotides, so that the curves for the four nucleotides are nearly identical.
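The translation from flow values to homopolymer runs described in this caption can be illustrated with a minimal sketch. It assumes a cyclic flow order of T, A, C, G (the caption does not state the order) and assumes simple rounding to the nearest integer; the function name `flowgram_to_sequence` is hypothetical and this is not presented as the instrument's actual base caller.

```python
FLOW_ORDER = "TACG"  # assumed nucleotide flow order; not stated in the caption


def flowgram_to_sequence(flow_values, flow_order=FLOW_ORDER):
    """Translate flow values into bases by rounding each value to a run length."""
    bases = []
    for i, value in enumerate(flow_values):
        nucleotide = flow_order[i % len(flow_order)]
        # Round half up; flow values are non-negative light intensities.
        run_length = int(value + 0.5)
        bases.append(nucleotide * run_length)
    return "".join(bases)


# Two flow cycles spelling out the TCAG tag, followed by a homopolymer run of two Ts.
flows = [1.02, 0.08, 0.97, 0.11, 0.05, 1.04, 0.12, 0.99, 2.10, 0.03]
print(flowgram_to_sequence(flows))  # TCAGTT
```

Under these assumptions, a flow value near 0 contributes no base (noise), and a value near 2 contributes a run of two identical bases.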
Fig. 3. Empirical distributions (smoothed average of E. coli and D. labrax) on a logarithmic scale. In gray: fitted (log-)normal distributions.
Parameters of the empirical distributions
| Homopolymer length | Mean | Standard deviation |
|---|---|---|
| 0 | 0.1230 | 0.0737 |
| 1 | 1.0193 | 0.1227 |
| 2 | 2.0006 | 0.1585 |
| 3 | 2.9934 | 0.2188 |
| 4 | 3.9962 | 0.3168 |
| 5 | 4.9550 | 0.3863 |
| Linear regression for | 0.03494 + |
a Normal distribution. Mean and standard deviation of the normal distribution around homopolymer lengths of 6, 7, etc.
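As a rough illustration of how these parameters might be used, the sketch below fits an ordinary least-squares line to the tabulated standard deviations for lengths 1–5 and uses it to extrapolate to longer homopolymers. Choosing lengths 1–5 as the regression points is an assumption, as are the helper names (`fit_line`, `flow_sd`, `simulate_flow_value`). For simplicity, length 0 is also sampled from a normal distribution, although the Fig. 3 caption mentions (log-)normal fits. The fitted intercept comes out close to the 0.03494 shown in the regression row; the slope is only estimated here.

```python
import random

# (mean, standard deviation) per homopolymer length, copied from the table.
PARAMS = {
    0: (0.1230, 0.0737),
    1: (1.0193, 0.1227),
    2: (2.0006, 0.1585),
    3: (2.9934, 0.2188),
    4: (3.9962, 0.3168),
    5: (4.9550, 0.3863),
}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Assumption: regress the standard deviation on length using lengths 1-5.
lengths = [1, 2, 3, 4, 5]
intercept, slope = fit_line(lengths, [PARAMS[n][1] for n in lengths])


def flow_sd(n):
    """Tabulated standard deviation for n <= 5, extrapolated line otherwise."""
    return PARAMS[n][1] if n in PARAMS else intercept + slope * n


def simulate_flow_value(n, rng=random):
    """Draw one noisy flow value for a homopolymer of length n (normal model)."""
    mean = PARAMS[n][0] if n in PARAMS else float(n)
    return max(0.0, rng.gauss(mean, flow_sd(n)))  # flow values cannot be negative


print(round(intercept, 3), round(slope, 3))
print(simulate_flow_value(6, random.Random(1)))
```

Passing a seeded `random.Random` instance keeps a simulated run reproducible.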
Fig. 2. (a) Absolute frequencies of flow values by flow cycle. The 200 flow cycles of a Titanium run correspond to 200 × 4 = 800 flows. The first two flow cycles contain the TCAG tag and are omitted here. Towards the end of a run, flow values tend to lie further from their ideal (integer) values, but there are fewer of them because many values from later flow cycles have been trimmed away. (b) Standard deviation of flow values (measured as the difference from their closest integer) by flow cycle. The standard deviation increases almost linearly. Only flow values <5.5 were included.
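The quantity plotted in panel (b) can be sketched as follows: group the flows of a read into cycles of four, take each value's difference from its closest integer, and compute the standard deviation per cycle. The function name `deviation_sd_by_cycle` is hypothetical, and the use of the population standard deviation is an assumption; the caption does not say which estimator was used.

```python
from collections import defaultdict
from statistics import pstdev

FLOWS_PER_CYCLE = 4  # one flow per nucleotide, so 200 cycles give 800 flows


def deviation_sd_by_cycle(flow_values, max_value=5.5):
    """Standard deviation of (flow value - closest integer), per flow cycle."""
    deviations = defaultdict(list)
    for i, value in enumerate(flow_values):
        if value >= max_value:
            continue  # only flow values below 5.5 are included
        cycle = i // FLOWS_PER_CYCLE
        deviations[cycle].append(value - round(value))
    return {cycle: pstdev(d) for cycle, d in sorted(deviations.items())}


# Cycle 0 is noise-free; cycle 1 drifts away from the integers.
example = [1.0, 0.0, 2.0, 1.0, 1.2, 0.1, 1.9, 0.8]
print(deviation_sd_by_cycle(example))
```

In practice the deviations would be pooled over all reads before taking the standard deviation, rather than computed for a single read as in this toy example.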
De novo-based and reference-based N50 for E. coli
| Coverage | Real | 200 cycles (simulated) | 400 cycles (simulated) |
|---|---|---|---|
| De novo N50 for E. coli | | | |
| 1 | 649 | 651 | 995 |
| 5 | 2406 | 7045 | 7623 |
| 10 | 23 613 | 132 913 | 104 012 |
| 15 | 67 231 | 173 592 | 178 129 |
| 20 | 86 902 | 172 127 | 203 060 |
| 25 | 95 348 | 176 747 | 207 011 |
| 30 | 97 821 | 171 819 | 207 011 |
| Reference-based N50 for E. coli | | | |
| 1 | 895 | 1093 | 1681 |
| 5 | 8305 | 31 730 | 40 321 |
| 10 | 76 687 | 207 827 | 2 343 849 |
| 15 | 110 013 | 207 856 | 2 496 857 |
| 20 | 118 387 | 207 740 | 2 497 013 |
| 25 | 161 266 | 207 899 | 2 497 058 |
| 30 | 177 489 | 207 845 | 2 724 990 |
Fig. 4. De novo and reference-based N50 for E. coli. Both real and simulated 454 data were assembled using Newbler v2.3.
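For readers unfamiliar with the metric, N50 is the contig length at which the contigs of that length or longer cover at least half of the total assembled bases. The sketch below computes it from a list of contig lengths. How the reference-based variant was computed is not stated in this excerpt; the optional `total` argument is an assumption that follows one common convention, in which half of the reference genome length replaces half of the assembly length.

```python
def n50(contig_lengths, total=None):
    """Return the N50 of the given contig lengths.

    If `total` is given (for example a reference genome length), half of
    that value is used as the threshold instead of half the assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half = (sum(lengths) if total is None else total) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("contigs do not cover half of the total length")


contigs = [80, 70, 50, 40, 30, 20]  # toy contig lengths, 290 bases in total
print(n50(contigs))             # 70: 80 + 70 = 150 covers half of 290
print(n50(contigs, total=400))  # 50: 80 + 70 + 50 = 200 covers half of 400
```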