Douglas W Bryant, Weng-Keen Wong, Todd C Mockler.
Abstract
BACKGROUND: New rapid high-throughput sequencing technologies have sparked the creation of a new class of assembler. Since all high-throughput sequencing platforms incorporate errors in their output, short-read assemblers must be designed to account for this error while utilizing all available data.
Year: 2009 PMID: 19239711 PMCID: PMC2653489 DOI: 10.1186/1471-2105-10-69
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
in Figure 1. Then, like VCAKE, QSRA finds all n of the k-mer reads that exactly match the 3' end of the seed (now the growing contig) down to a minimum number of matching bases, u, a user-defined parameter. QSRA finds these matching reads by searching the prefix-tree data structure. Each matching read is stored in a linked list along with the number of times it and its reverse complement occurred in the set of input reads, as well as the number of bases in the read that match the growing contig. This last measure indicates the position at which the bases of each matching read cease to overlap the growing contig, and it is updated each time the growing contig is extended.
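The search for matching reads described above can be sketched roughly as follows. This is a minimal illustration, not the QSRA implementation: the nested-dict prefix tree, the `"$"` end marker, and the function names `build_prefix_tree` and `find_matching_reads` are all assumptions made for this example, and a plain Python list stands in for the linked list. Only the parameters `k` (read length) and `u` (minimum overlap) come from the text.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]

def build_prefix_tree(reads):
    """Index every read and its reverse complement in a nested-dict trie.

    The hypothetical "$" key at a node stores how many times the sequence
    ending there occurred in the input, in either orientation.
    """
    root = {}
    for read in reads:
        # A set avoids double counting reads equal to their own reverse complement.
        for seq in {read, reverse_complement(read)}:
            node = root
            for base in seq:
                node = node.setdefault(base, {})
            node["$"] = node.get("$", 0) + 1
    return root

def _reads_below(node, prefix):
    """Yield (read, count) for every read stored at or below this node."""
    if "$" in node:
        yield prefix, node["$"]
    for base, child in node.items():
        if base != "$":
            yield from _reads_below(child, prefix + base)

def find_matching_reads(tree, contig, k, u):
    """Return (read, count, overlap) for reads whose first `overlap` bases
    exactly match the 3' end of the contig, with u <= overlap < k."""
    if u < 1:
        raise ValueError("minimum overlap u must be at least 1")
    matches = []
    for overlap in range(min(k - 1, len(contig)), u - 1, -1):
        suffix = contig[-overlap:]
        node = tree
        for base in suffix:
            node = node.get(base)
            if node is None:
                break  # no read starts with this suffix
        else:
            for read, count in _reads_below(node, suffix):
                matches.append((read, count, overlap))
    return matches
```

For example, `find_matching_reads(build_prefix_tree(reads), contig, k=36, u=11)` would list every indexed 36-mer that overlaps the contig end by 11 to 35 bases; the values 36 and 11 are illustrative only. A read may appear more than once if it matches at several overlap lengths.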
Figure 1. QSRA Algorithm. The basic QSRA algorithm, using default values for user-defined parameters. As illustrated, each iteration of contig extension (line 22) is contingent on there existing sufficiently many matching reads,
Table 1. Results of assemblies of actual Illumina sequencing data on a 3.0 GHz Xeon processor with 32 GB memory.
| pina | 5.33 | 479 | SSAKE | 3.2 | 2008 | 0.91 | 3463 | 79.8 | 3051 | 241 | N/A | 24686 | -m 16 |
| pina | 5.33 | 479 | VCAKE | 1.0 | 05/2007 | 0.74 | 8400 | 68.1 | 1721 | 101 | N/A | 188778 | -k 33 -o 34 |
| pina | 5.33 | 479 | VELVET | 0.6.04 | 03/2008 | 0.36 | 74 | 58.5 | 3076 | 285 | N/A | 464 | -min_contig_lgth 34 |
| pina | 5.33 | 479 | EDENA | 2.1.1 | 2008 | 0.24 | 210 | 77.8 | 3329 | 400 | N/A | 3377 | -c 34 |
| pina | 5.33 | 479 | QSRA | 06032008 | 06/2008 | 0.84 | 1553 | 93.1 | 1046 | 94 | 86 | 32473 | -k 33 -o 34 |
| pina | 5.33 | 479 | QSRA* | 06032008 | 06/2008 | 0.91 | 1301 | 99.3 | 1771 | 85 | 85 | 83004 | -k 33 -o 34 |
| gera03 | 5.18 | 376 | SSAKE | 3.2 | 2008 | 0.78 | 1936 | 85.1 | 3613 | 347 | 42 | 18093 | -m 16 |
| gera03 | 5.18 | 376 | VCAKE | 1.0 | 05/2007 | 0.64 | 3114 | 82.6 | 1964 | 157 | 96 | 175451 | -k 33 -o 34 |
| gera03 | 5.18 | 376 | VELVET | 0.6.04 | 03/2008 | 0.32 | 55 | 60.2 | 4296 | 386 | N/A | 311 | -min_contig_lgth 34 |
| gera03 | 5.18 | 376 | EDENA | 2.1.1 | 2008 | 0.16 | 98 | 88.9 | 3285 | 535 | 41 | 1977 | -c 34 |
| gera03 | 5.18 | 376 | QSRA | 06032008 | 06/2008 | 0.69 | 733 | 99.1 | 3012 | 71 | 71 | 21132 | -k 33 -o 34 |
| gera03 | 5.18 | 376 | QSRA* | 06032008 | 06/2008 | 0.75 | 641 | 99.0 | 3012 | 569 | 167 | 60584 | -k 33 -o 34 |
| suisp | 2.26 | 49 | SSAKE | 3.2 | 2008 | 2.21 | 5941 | 95.8 | 6475 | 1036 | 355 | 15632 | -m 16 |
| suisp | 2.26 | 49 | VCAKE | 1.0 | 05/2007 | 1.66 | 7202 | 99.0 | 11894 | 1577 | 718 | 487006 | -k 36, -o 37 |
| suisp | 2.26 | 49 | VELVET | 0.6.04 | 03/2008 | 0.74 | 144 | 96.4 | 18690 | 4401 | 1992 | 1185 | -min_contig_lgth 37 |
| suisp | 2.26 | 49 | EDENA | 2.1.1 | 2008 | 0.48 | 357 | 97.3 | 8829 | 1836 | 759 | 3254 | -c 37 |
| suisp | 2.26 | 49 | QSRA | 06032008 | 06/2008 | 1.89 | 3329 | 96.9 | 11934 | 2432 | 259 | 18834 | -k 36 -o 37 |
| suisp | 2.26 | 49 | QSRA* | 06032008 | 06/2008 | 2.18 | 3628 | 98.5 | 11934 | 2370 | 259 | 168464 | -k 36 -o 37 |
Six assemblies were run for each of the three data sets: SSAKE, VCAKE, VELVET, EDENA, QSRA without q-values (listed as QSRA in Table 1), and QSRA with q-values (listed as QSRA* in Table 1). In addition to the runtime options listed for VELVET, each VELVET run used a tile size of 19 and a coverage cutoff of 5. Only contigs were used in the calculation of coverage, N50, and N80 values; unextended singletons were discarded. Coverage values were determined through analysis of BLAT output by comparing the total number of bases in the reference genome with the number of bases uniquely "hit" by the BLAT alignments of assembled contigs. Thus, any contig that BLAT, using the default value of 90% identity, could not match to its reference genome did not contribute to coverage calculations. N50 and N80 values are the length of the largest contig in the output such that it and all longer contigs together account for at least 50% or 80%, respectively, of total genome coverage. For S. suis, 43.8% of the 36-mer Illumina reads in the data set matched perfectly to the reference genome, which corresponds to an estimated average error rate per sequence base of 2.26%.
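The N50/N80 definition and the error-rate estimate above can be illustrated with a short sketch. Both functions are assumptions made for illustration rather than the authors' scripts. `n_stat` uses the sum of contig lengths as the total unless a different `total` (for example, covered genome bases) is supplied. `per_base_error_rate` assumes that errors are independent and uniform across read positions, so that a read of length L is error-free with probability (1 - e)^L; the source does not state how its estimate was derived.

```python
def n_stat(contig_lengths, fraction, total=None):
    """Length of the largest contig such that it and all longer contigs
    together account for at least `fraction` of `total`.

    `total` defaults to the summed contig length (an assumption here).
    Returns None if the contigs never reach the target.
    """
    if total is None:
        total = sum(contig_lengths)
    target = fraction * total
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None

def per_base_error_rate(perfect_fraction, read_length):
    """Solve (1 - e) ** read_length == perfect_fraction for e,
    assuming independent, uniform per-base errors."""
    return 1.0 - perfect_fraction ** (1.0 / read_length)

# N50 and N80 for a toy set of contig lengths.
lengths = [10, 8, 6, 4, 3]
n50 = n_stat(lengths, 0.5)
n80 = n_stat(lengths, 0.8)

# 43.8% of 36-mer reads matching perfectly, as reported for S. suis.
estimated_error = per_base_error_rate(0.438, 36)
```

Under these assumptions the S. suis figures give an estimate of roughly 2.3% per base, close to the 2.26% reported; the small difference is consistent with rounding of the 43.8% input.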