Nicolas J Parker, Andrew G Parker.
Abstract
BACKGROUND: The advent of pyrophosphate sequencing makes large volumes of sequencing data available at a lower cost than previously possible. However, the short read lengths are difficult to assemble and the large dataset is difficult to handle. During the sequencing of a virus from the tsetse fly, Glossina pallidipes, we found the need for tools to quickly search a set of reads for near-exact text matches.
Year: 2008 PMID: 18423012 PMCID: PMC2374781 DOI: 10.1186/1751-0473-3-5
Source DB: PubMed Journal: Source Code Biol Med ISSN: 1751-0473
Figure 1. Graphical representation of the output from re. The sequence of red points on the x-axis near 182400 represents a mismatch region where a single base was inserted by the assembler. The areas with many more reads than the average show identical repeats. The match length was set to 27 for this figure.
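One way to read the plot described in this caption is as a per-position count of reads that match the assembled sequence over a fixed match length. The sketch below illustrates that idea only; it is not the authors' program. The function names, the exact-match rule, and the toy sequences are assumptions made for illustration. Positions with a count of zero would correspond to a mismatch region, and positions with counts well above the average would suggest repeats.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmer_read_profile(reference: str, reads: list[str], k: int = 27) -> list[int]:
    """For each k-mer start position in `reference`, count the reads that
    contain that k-mer exactly, in either orientation (hypothetical helper)."""
    profile = []
    for start in range(len(reference) - k + 1):
        kmer = reference[start:start + k]
        rc = reverse_complement(kmer)
        profile.append(sum(1 for read in reads if kmer in read or rc in read))
    return profile


# Toy data, invented for this example (not taken from the paper).
reference = "ACGTAGGCTA"
reads = ["ACGTAGG", "TAGCCTAC"]  # the second read is a reverse-strand read
profile = kmer_read_profile(reference, reads, k=4)
zero_positions = [i for i, n in enumerate(profile) if n == 0]
print(profile, zero_positions)
```

This linear scan over all reads is quadratic in practice and only suitable for small inputs; an indexed lookup would be needed for a full 454 data set.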
and , the individual variants of the repeats being indicated by a sequence number, e.g.
Figure 2. Number of reads for repeat elements. a) For an intermediate stage of finishing. b) After completion of finishing; the dashed line is the computed trend line with gradient 74.02 and intercept zero.
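The trend line in panel b) is constrained to pass through the origin. For a least-squares line of that form, y = m·x, the gradient is m = Σxy / Σx². The sketch below shows the calculation, assuming x is the copy count of a repeat element and y is its read count; the caption does not state the axes, so that mapping is an assumption, and the sample values are invented rather than taken from the paper.

```python
def slope_through_origin(xs: list[float], ys: list[float]) -> float:
    """Least-squares gradient of y = m * x (intercept fixed at zero)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    sum_xx = sum(x * x for x in xs)
    if sum_xx == 0:
        raise ValueError("at least one x value must be non-zero")
    return sum(x * y for x, y in zip(xs, ys)) / sum_xx


# Invented illustrative values: copy counts and read counts.
copies = [1, 2, 3]
read_counts = [70, 150, 225]
gradient = slope_through_origin(copies, read_counts)
print(round(gradient, 2))
```

Fixing the intercept at zero encodes the expectation that an element with no copies attracts no reads, so the gradient can be read as reads per copy.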
Figure 3. Homopolymer lengths. a) Reported homopolymer lengths in the 454 data set for all 9-mers of A and T in the GpSGHV. b) Reported homopolymer lengths in the 454 data set for all 6-mers of C and G and a random selection of 6-mers of A and T in the GpSGHV.
Homopolymer read lengths
| Homopolymer length | Forward reads | Reverse reads | Total reads |
| 8 | 2 | 6 | 8 |
| 9 | 10 | 22 | 32 |
| 10 | 12 | 20 | 32 |
| 11 | 1 | 6 | 7 |
Read lengths in the 454 data for the GpSGHV sequence starting at position 86587, TTTT*AAAA*TTAAAAAAAAATCCTCCGAGT
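The spread of reported lengths in this table can be summarized with a read-weighted mean. The sketch below uses only the counts from the table; the variable and function names are illustrative assumptions.

```python
# Rows copied from the table: reported length -> (forward reads, reverse reads)
READ_COUNTS = {8: (2, 6), 9: (10, 22), 10: (12, 20), 11: (1, 6)}


def summarize(counts: dict[int, tuple[int, int]]) -> tuple[dict[int, int], int, float]:
    """Return per-length totals, the overall read count, and the
    read-weighted mean of the reported homopolymer length."""
    totals = {length: fwd + rev for length, (fwd, rev) in counts.items()}
    n_reads = sum(totals.values())
    mean_length = sum(length * n for length, n in totals.items()) / n_reads
    return totals, n_reads, mean_length


totals, n_reads, mean_length = summarize(READ_COUNTS)
print(totals, n_reads, round(mean_length, 2))
```

The per-length totals reproduce the Total column, which serves as a quick consistency check on the transcribed counts.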
Bases inserted due to incomplete extension.
| Sequence | Forward | Reverse |
| TTAAAATAAAAAAAAAG | 17 | 0 |
| TTAAAATAAAAAAAAAGCGCAT | 8 | 14 |
| TAAAAAAATAAAAAACT | 18 | 2 |
| TAAAAAAATAAAAAACTGGTTT | 11 | 13 |
| TATTGTAAATAAAAAAAAAATA | 18 | 1 |
| TATTGTAAATAAAAAAAAAATATACA | 6 | 15 |
Paired output from count, with the incorrect sequence followed by the corrected sequence. In each case, incomplete extension has resulted in the insertion of a spurious base in the following flow cycle in the forward direction, but few or no reads in the reverse direction. When the spurious base is removed, the numbers of forward and reverse reads are more nearly matched. In the third example, this effect caused a single A to be read as AA.
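The forward and reverse columns in this table can be thought of as strand-specific match counts for a query sequence. The following is a minimal sketch of such a tally, not the authors' count program: it assumes exact substring matching, counts each read at most once per strand, and uses invented toy reads.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def strand_counts(query: str, reads: list[str]) -> tuple[int, int]:
    """Count reads containing `query` as written (forward) and reads
    containing its reverse complement (reverse). A palindromic query
    would be counted on both strands for the same read."""
    rc = reverse_complement(query)
    forward = sum(1 for read in reads if query in read)
    reverse = sum(1 for read in reads if rc in read)
    return forward, reverse


# Toy reads, invented for this example (not taken from the paper).
reads = ["GGACGTTTCC", "CCAAACGTGG", "TTTTTTTT"]
forward, reverse = strand_counts("ACGTTT", reads)
print(forward, reverse)
```

A strongly one-sided result for a candidate sequence, as in the incorrect rows of the table, is the kind of imbalance that would flag a strand-specific artifact.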