Xiaoqiu Huang, Shiaw-Pyng Yang, Asif T Chinwalla, LaDeana W Hillier, Patrick Minx, Elaine R Mardis, Richard K Wilson.
Abstract
We introduce a data structure called a superword array for quickly finding matches between DNA sequences. The superword array possesses some desirable features of both the lookup table and the suffix array. We describe simple algorithms for constructing a superword array and for using it to find pairs of sequences that share a unique superword. The algorithms are implemented in a genome assembly program called PCAP.REP for computation of overlaps between reads. Experimental results from running PCAP.REP and PCAP on a whole-genome dataset show that PCAP.REP produced a more accurate and more contiguous assembly than PCAP.
Year: 2006 PMID: 16397298 PMCID: PMC1325203 DOI: 10.1093/nar/gkj419
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. An example superword array for a combined sequence of two short reads. The combined sequence is given on the top row. For each base of the combined sequence, the position of the base is shown below the base on the second row and the code of the word of length 2 starting at the position is shown below the position on the third row. The ordered sequence positions in the superword array are shown on the fourth row. For each sequence position in the superword array, the code of the superword at word level 3 starting at the position is given as a column of three word codes on the bottom three rows.
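The structure described in the caption can be illustrated with a minimal Python sketch. It is not the PCAP.REP implementation, and it rests on several assumptions that this record does not state: a superword at word level v is taken to be v consecutive, non-overlapping words of length w; words are encoded as base-4 integers with A=0, C=1, G=2, T=3; superwords that would cross a read boundary are skipped; and positions are ordered with a general-purpose sort. All function names are hypothetical. The sketch reports read pairs that share any superword and does not attempt the filtering implied by "unique".

```python
from typing import Dict, List, Set, Tuple

# Assumed base encoding; the record does not specify one.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def word_code(seq: str, pos: int, w: int) -> int:
    """Encode the word of length w starting at pos as a base-4 integer."""
    code = 0
    for base in seq[pos:pos + w]:
        code = code * 4 + BASE_CODE[base]
    return code


def superword_code(seq: str, pos: int, w: int, v: int) -> Tuple[int, ...]:
    """Codes of the v consecutive words of length w starting at pos (assumed layout)."""
    return tuple(word_code(seq, pos + i * w, w) for i in range(v))


def build_superword_array(reads: List[str], w: int, v: int):
    """Return (combined sequence, positions sorted by superword code, position -> read index)."""
    combined = "".join(reads)
    span = w * v
    positions: List[int] = []
    owner: Dict[int, int] = {}
    start = 0
    for idx, read in enumerate(reads):
        # Keep only positions whose superword lies entirely inside this read.
        for off in range(len(read) - span + 1):
            positions.append(start + off)
            owner[start + off] = idx
        start += len(read)
    positions.sort(key=lambda p: superword_code(combined, p, w, v))
    return combined, positions, owner


def reads_sharing_superword(reads: List[str], w: int, v: int) -> Set[Tuple[int, int]]:
    """Scan runs of equal superword codes and collect pairs of distinct reads in each run."""
    combined, positions, owner = build_superword_array(reads, w, v)
    pairs: Set[Tuple[int, int]] = set()
    i = 0
    while i < len(positions):
        code = superword_code(combined, positions[i], w, v)
        j = i
        while j < len(positions) and superword_code(combined, positions[j], w, v) == code:
            j += 1
        group = sorted({owner[p] for p in positions[i:j]})
        for a_idx, a in enumerate(group):
            for b in group[a_idx + 1:]:
                pairs.add((a, b))
        i = j
    return pairs


if __name__ == "__main__":
    example_reads = ["ACGTACGT", "TTACGTAC", "GGGGGGGG"]
    print(reads_sharing_superword(example_reads, w=2, v=3))
```

Because positions with equal superword codes end up adjacent after sorting, candidate read pairs can be collected in a single pass over the array, which is the property the sketch is meant to show.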
The numbers and lengths of contigs in the top two groups for each of the assemblies produced by PCAP.REP and PCAP
| Group | Program | Number | N50 length (bp) | Maximum length (bp) | Total length (Mb) |
|---|---|---|---|---|---|
| One | PCAP.REP | 110 | 188 832 | 747 209 | 20 |
| One | PCAP | 136 | 162 560 | 461 995 | 20 |
| Two | PCAP.REP | 696 | 47 730 | 91 678 | 19 |
| Two | PCAP | 8981 | 1859 | 64 190 | 19 |
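The N50 column is not defined in this record. Assuming the common definition (the largest length L such that sequences of length at least L account for at least half of the total length), it can be computed as in the following sketch. The function name `n50` and the sample lengths are illustrative only and are not taken from the assemblies above.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths (common definition, assumed)."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 requires at least one length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first length where the running sum covers half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Made-up contig lengths in bp, for illustration only.
    print(n50([2, 3, 4, 5, 6]))
```

Under this definition, a higher N50 at the same total length means that more of the assembly sits in long sequences.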
The numbers and lengths of supercontigs in the top two groups for each of the assemblies produced by PCAP.REP and PCAP
| Group | Program | Number | N50 length (bp) | Maximum length (bp) | Total length (Mb) |
|---|---|---|---|---|---|
| One | PCAP.REP | 14 | 1 365 435 | 2 314 423 | 20 |
| One | PCAP | 23 | 978 523 | 2 076 521 | 20 |
| Two | PCAP.REP | 101 | 321 449 | 888 557 | 19 |
| Two | PCAP | 7164 | 4396 | 387 574 | 19 |
Three difference rates between the set of finished sequences and the set of contig consensus sequences for each of the assemblies produced by PCAP.REP and PCAP
| Program | Substitution rate | Deletion rate | Insertion rate |
|---|---|---|---|
| PCAP.REP | 0.001376 | 0.000089 | 0.000040 |
| PCAP | 0.002194 | 0.000164 | 0.000136 |
Numbers of local misassembly events per Mb in three categories for each of the assemblies produced by PCAP.REP and PCAP
| Program | Misordering | Interruption | Missing |
|---|---|---|---|
| PCAP.REP | 0.536441 | 0.387430 | 0.298023 |
| PCAP | 0.947530 | 13.265424 | 0.411969 |