Sergey Koren, Jason R Miller, Brian P Walenz, Granger Sutton.
Abstract
BACKGROUND: Finishing is the process of improving the quality and utility of draft genome sequences generated by shotgun sequencing and computational assembly. Finishing can involve targeted sequencing. Finishing reads may be incorporated by manual or automated means. One automated method uses targeted addition by local re-assembly of gap regions. An obvious alternative uses de novo assembly of all the reads.
Year: 2010 PMID: 20831800 PMCID: PMC2945939 DOI: 10.1186/1471-2105-11-457
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Use of finishing reads. Two algorithms for assembling shotgun reads and finishing reads. The control treats both read types equally. The bounded algorithm attempts to assemble finishing reads consistently with their bounding constraints. For each algorithm, the figure shows its construction of a scaffold from contigs (rectangles) with 2X in shotgun reads (black lines). Each finishing read (colored line) has a corresponding pair of PCR primer sites (arrows of same color). External to the scaffold is a unitig (grey area) deemed repetitive due to high coverage. (a) A mate pair constraint (curve) localizes one read and the unitig to this gap. Nevertheless, the control algorithm cannot tile this gap with reads. The bounded algorithm localizes two finishing reads by their primer sites. The bounded algorithm does tile the gap with reads, enabling a more accurate consensus sequence. (b) The control cannot localize the unitig or any reads to this gap. It does not close the gap. The bounded algorithm localizes the unitig by finishing reads and their primer sites. It tiles the gap with finishing reads from the unitig. (c) Both algorithms assemble finishing reads from a gap that is not a genomic repeat. In our data sets, most finishing reads fit gaps of this type.
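The idea in the caption, that a finishing read is localized by its pair of primer sites and that a gap is closed once reads tile it, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Bound` and `Placement` types, the `slack` parameter, and the use of plain scaffold coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Bound:
    """Hypothetical bounding constraint: scaffold coordinates of the two primer sites."""
    left: int
    right: int


@dataclass(frozen=True)
class Placement:
    """Hypothetical candidate placement of one finishing read on the scaffold."""
    start: int
    end: int


def is_consistent(p: Placement, b: Bound, slack: int = 0) -> bool:
    """A placement is consistent if the read lies between its primer sites.

    `slack` (an assumed parameter) tolerates small coordinate uncertainty.
    """
    return p.start < p.end and b.left - slack <= p.start and p.end <= b.right + slack


def tiles_gap(placements: Iterable[Placement], gap_start: int, gap_end: int) -> bool:
    """Return True if the placements cover [gap_start, gap_end) without a hole."""
    covered_to = gap_start
    for p in sorted(placements, key=lambda q: q.start):
        if p.start > covered_to:
            return False  # uncovered stretch before this read begins
        covered_to = max(covered_to, p.end)
        if covered_to >= gap_end:
            return True
    return covered_to >= gap_end


# Illustrative values only; they are not taken from the paper.
bound = Bound(left=1000, right=3000)
candidates = [Placement(1100, 2000), Placement(1900, 2900), Placement(5000, 5800)]
kept = [p for p in candidates if is_consistent(p, bound)]
gap_closed = tiles_gap(kept, gap_start=1500, gap_end=2500)
```

In this toy example the third candidate falls outside the primer sites and is rejected, while the two remaining reads overlap and together span the gap.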
Table 1. Results using the bounded read placement algorithm.
| Species | Candidate gaps | # Bounds | # Finishing reads | Placed finishing reads (Control) | Placed finishing reads (Bounded) | Placed finishing reads (Alternate) | Gaps closed (Control) | Gaps closed (Bounded) | Gaps closed (Alternate) |
|---|---|---|---|---|---|---|---|---|---|
| | 14 (1) | 56 | 128 | 26 | 92 | N/A | 0 | 11 | N/A |
| | 2 (0) | 18 | 33 | 14 | 23 | N/A | 0 | 1 | N/A |
| | 9 (0) | 23 | 40 | 4 | 27 | N/A | 0 | 4 | N/A |
| | 11 (2) | 14 | 21 | 3 | 21 | 17 | 0 | 6 | 4 |
| | 49 (2) | 23 | 60 | 12 | 49 | 11 | 0 | 29 | 0 |
| | 4 (0) | 3 | 3 | 0 | 2 | 0 | 0 | 1 | 0 |
| Total | 89 (5) | 137 | 285 | 59 | 214 | 28 | 0 | 52 | 4 |
Comparison of three algorithms. Control uses finishing reads like WGS reads. Bounded uses finishing reads with placement constraints. Alternate uses finishing reads in a second round of assembly without constraints. Candidate gaps include both regions in the control assembly between finishing constraints with zero coverage and a consensus sequence derived from a repeat unitig or no consensus sequence in the control assembly. The parentheses indicate the number of gaps with no consensus sequence in the control assembly. The gap and spanning constraint are not necessarily 1-to-1. Bounds: The total number of bounding constraints that span the repeat gap or were not satisfied in both control and bounded assemblies. Finishing reads: The total number of finishing reads generated for the bounds in the table. Placed finishing reads: The total number of finishing reads placed in the assembly by each of the assembly algorithms. Gaps closed: The number of gaps closed by filling in missing consensus sequence or by tiling repeat instances with reads. By definition, the control assembly always closes 0 gaps. The bounded assembly joins were verified by alignment to finished reference, where available.
Table 2. Consensus quality of the bounded read placement algorithm.
| Species | # consensus differences | True positives | False positives |
|---|---|---|---|
| | 14 | 14 | 0 |
| | 5 | 5 | 0 |
| | 47 | 44 | 3 |
| | N/A | N/A | N/A |
| | N/A | N/A | N/A |
| | N/A | N/A | N/A |
| Total | 66 | 63 | 3 |
Performance of the same three algorithms described in Table 1. Number of consensus differences: The total number of bases in consensus that are different between the bounded and control assemblies versus reference. True positive: Number of consensus base changes that are supported by the reference. False positive: The number of consensus base changes that differ from the reference. The finishing reads used for E. coli K12 did not come from the same strain as the reference. We cannot validate whether a consensus discrepancy between an assembly and the reference is due to assembly error or to strain-level differences. Consensus quality could not be measured on the two genomes that lack a reference.
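The true-positive and false-positive definitions above can be made concrete with a short Python sketch. It assumes that the control consensus, the bounded consensus, and the reference have already been aligned to equal length, column by column; the function name `classify_changes` and the example sequences are hypothetical and do not come from the paper.

```python
from typing import Tuple


def classify_changes(control: str, bounded: str, reference: str) -> Tuple[int, int]:
    """Count consensus base changes (bounded vs. control) against a reference.

    Assumes the three sequences are pre-aligned and of equal length.
    Returns (true_positives, false_positives):
      - true positive: the changed base agrees with the reference
      - false positive: the changed base differs from the reference
    """
    if not (len(control) == len(bounded) == len(reference)):
        raise ValueError("sequences must be aligned to the same length")
    true_pos = false_pos = 0
    for c, b, r in zip(control, bounded, reference):
        if b == c:
            continue  # no change between assemblies at this column
        if b == r:
            true_pos += 1
        else:
            false_pos += 1
    return true_pos, false_pos


# Toy sequences: column 2 changes G->C (matches the reference),
# column 5 changes C->G (does not match the reference).
tp, fp = classify_changes("ACGTAC", "ACCTAG", "ACCTAC")
```

Under these definitions the number of consensus differences is the sum of the two counts, which agrees with the totals row of the table (63 + 3 = 66).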
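Table 3. Results using contig metrics for bounding read placement algorithm.
| Species | Contig count (Control) | Contig count (Bounded) | Contig count (Alternate) | (unlabeled) (Control) | (unlabeled) (Bounded) | (unlabeled) (Alternate) | Contig bases (Control) | Contig bases (Bounded) | Contig bases (Alternate) |
|---|---|---|---|---|---|---|---|---|---|
| | 6 | 5 | N/A | 2,315,032 | 4,484,293 | N/A | 5,656,811 | 5,661,119 | N/A |
| | 6 | 6 | N/A | 3,620,140 | 3,620,144 | N/A | 4,813,438 | 4,813,442 | N/A |
| | 19 | 19 | N/A | 424,003 | 424,003 | N/A | 5,835,215 | 5,834,616 | N/A |
| | 4,273 | 4,273 | 5,765 | 12,070 | 12,070 | 11,444 | 37,616,884 | 37,616,884 | 47,976,992 |
| | 313 | 314 | 387 | 27,255 | 27,255 | 16,838 | 4,679,711 | 4,679,711 | 4,441,778 |
| | 26 | 26 | 38 | 307,040 | 307,040 | 152,524 | 2,525,388 | 2,525,392 | 2,507,351 |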
Performance of the same three algorithms described in Table 1. Contig count: number of contigs whose consensus is at least 2Kbp. Contig bases: sum of consensus lengths for contigs at least 2Kbp long.
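The two metrics defined in this legend are straightforward to compute from a list of contig consensus lengths. The following Python sketch assumes the 2 Kbp threshold means 2,000 bp; the function name `contig_metrics` and the example lengths are hypothetical.

```python
from typing import Iterable, Tuple


def contig_metrics(lengths: Iterable[int], min_len: int = 2000) -> Tuple[int, int]:
    """Return (contig count, contig bases) over contigs of at least `min_len` bp.

    `min_len` defaults to 2,000 bp, an assumed reading of the "2Kbp" cutoff.
    """
    kept = [n for n in lengths if n >= min_len]
    return len(kept), sum(kept)


# Illustrative lengths only; the 1,500 bp contig falls below the cutoff.
count, bases = contig_metrics([1500, 2000, 10000])
```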