Sante Gnerre, Eric S Lander, Kerstin Lindblad-Toh, David B Jaffe.
Abstract
We describe a new assembly algorithm in which a genome assembly with low sequence coverage, either throughout the genome or locally due to cloning bias, is considerably improved through an assisting process via a related genome. We show that the information provided by aligning the whole-genome shotgun reads of the target against a reference genome can be used to substantially improve the quality of the resulting assembly.
Year: 2009 PMID: 19712469 PMCID: PMC2745769 DOI: 10.1186/gb-2009-10-8-r88
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Assisted assembly principle. (a) In this example, five reads align uniquely to the reference genome, and the two leftmost of these (purple) also appear as the two rightmost reads in an existing de novo contig. We can then extend the de novo contig by using the three unassembled reads (green), even if there is no supporting linking evidence (in general, ARACHNE requires a read to be linked to the contig it overlaps before using it to extend the contig). (b) Two scaffolds (blue and purple) are mapped and oriented on the reference genome by the trusted green reads. Furthermore, the two scaffolds are joined by a single link (black dotted line), although this is not trusted per se. The ARACHNE scaffolding algorithm would not normally join the two scaffolds; however, in this case the separation of the two scaffolds implied by the link is consistent with the separation implied by the mapping on the reference genome, and we thus implicitly validate the black dotted link and join the two scaffolds. (c) Trusted read placements anchor portions of a single scaffold onto two distant parts of the reference genome, suggesting either a bona fide syntenic break or a misassembly. To test for the latter, the contested region on the scaffold is subject to a stringent test for misassembly, and broken if it fails. The same level of stringency of misassembly testing could not be applied to the entire assembly because, at low coverage, there would be too many false positives.
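The link validation described in panel (b) can be read as a simple consistency check. The following is a minimal sketch under stated assumptions, not ARACHNE's actual implementation: the `Placement` record, the function names, and the `tolerance` parameter are all hypothetical, and both scaffolds are assumed to have already been oriented on the same strand of the reference by their trusted reads.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Placement:
    """Hypothetical record of where a scaffold maps on the reference."""
    chrom: str   # reference sequence name
    start: int   # reference coordinate where the scaffold begins
    end: int     # reference coordinate where the scaffold ends (exclusive)


def reference_gap(left: Placement, right: Placement) -> Optional[int]:
    """Separation implied by the reference mapping, or None if the two
    scaffolds map to different reference sequences."""
    if left.chrom != right.chrom:
        return None
    return right.start - left.end


def link_is_validated(left: Placement, right: Placement,
                      link_gap: int, tolerance: int) -> bool:
    """Accept an otherwise untrusted link only if the gap it implies agrees
    with the gap implied by the reference, within `tolerance` bases.
    The tolerance is an illustrative parameter, not a value from the paper."""
    ref_gap = reference_gap(left, right)
    if ref_gap is None:
        return False
    return abs(ref_gap - link_gap) <= tolerance


blue = Placement("chr1", 1_000, 5_000)
purple = Placement("chr1", 7_000, 12_000)
print(link_is_validated(blue, purple, link_gap=1_800, tolerance=500))
```

In this toy case the reference implies a 2,000-base gap and the link implies 1,800 bases, so the link would be accepted at a 500-base tolerance. How tolerant the real algorithm is, and how it accounts for divergence between target and reference, is not specified here.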
Comparison between initial, assisted, and theoretical 2× canine assemblies
| | Initial | Assisted | Theoretical |
| Bases assembled (%) | 81.0 | 86.5 | 94.1 |
| Total contig length (Mb) | 1,697 | 1,823 | 1,969 |
| N50 contig (kb) | 2.5 | 2.8 | 3.3 |
| N50 scaffold gapped (kb) | 18.6 | 53.1 | 4,039.7 |
| N50 scaffold ungapped (kb) | 10.3 | 36.8 | 3,519.1 |
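For readers unfamiliar with the N50 rows, the following is a minimal sketch of the statistic using its common definition: the largest length L such that pieces of length at least L contain at least half of the total length. The function names are illustrative. Treating "gapped" as contig bases plus estimated gap sizes and "ungapped" as contig bases only is an assumption about how these rows are defined, not something stated in this record.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.
    Returns 0 for an empty collection."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def scaffold_length(contig_lengths, gap_lengths, gapped=True):
    """Length of one scaffold, with or without its estimated gaps
    (assumed meaning of the gapped/ungapped rows)."""
    bases = sum(contig_lengths)
    return bases + sum(gap_lengths) if gapped else bases


# Toy example: three scaffolds, each given as (contig lengths, gap lengths).
scaffolds = [([4000, 3000], [1000]), ([2500], []), ([1500, 500], [200])]
print(n50([scaffold_length(c, g, gapped=True) for c, g in scaffolds]))
print(n50([scaffold_length(c, g, gapped=False) for c, g in scaffolds]))
```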
Figure 2. Validation test. From the target assembly, we randomly select a pair of high-quality k-mers at distance d from each other. The pair is declared valid if the two k-mers are both present in the reference genome, with the same orientation and a separation d', approximately equal to d. This operation is repeated for many pairs. We report the fraction of such pairs that are valid.
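The validation test in this caption can be sketched in a few lines. This is an illustrative toy, not the authors' code: it assumes a single-sequence reference held in memory, exact k-mer matching, and a hypothetical `tolerance` parameter standing in for "approximately equal". It does not model the "high-quality" filter, which would need base quality scores.

```python
import random
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def build_index(reference, k):
    """Map every k-mer to its (position, strand) occurrences in the reference.
    Strand is +1 for a forward match, -1 for a reverse-complement match."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        kmer = reference[i:i + k]
        index[kmer].append((i, 1))
        index[reverse_complement(kmer)].append((i, -1))
    return index


def pair_is_valid(index, kmer_a, kmer_b, d, tolerance):
    """True if both k-mers occur in the reference on the same strand with a
    separation d' such that |d' - d| <= tolerance."""
    for pos_a, strand_a in index.get(kmer_a, ()):
        for pos_b, strand_b in index.get(kmer_b, ()):
            if strand_a != strand_b:
                continue
            # On the reverse strand the order of the two k-mers is flipped.
            d_ref = strand_a * (pos_b - pos_a)
            if abs(d_ref - d) <= tolerance:
                return True
    return False


def valid_fraction(target, reference, k, d, n_pairs, tolerance, seed=0):
    """Fraction of randomly chosen k-mer pairs at distance d in the target
    that are valid against the reference."""
    last_start = len(target) - k - d
    if last_start < 0:
        raise ValueError("target is too short for this k and d")
    rng = random.Random(seed)
    index = build_index(reference, k)
    valid = 0
    for _ in range(n_pairs):
        p = rng.randint(0, last_start)
        valid += pair_is_valid(index, target[p:p + k],
                               target[p + d:p + d + k], d, tolerance)
    return valid / n_pairs
```

A target identical to the reference, or equal to its reverse complement, scores 1.0 under this sketch. Misjoins lower the score for pairs that span the breakpoint, which is why sweeping d, as in the accuracy table that follows, is informative.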
Accuracy of initial and assisted assemblies, estimated using the Assembly proximity test*
| Initial draft | 97.9% | 97.5% | 97.4% | 97.1% | 96.2% | 95.3% | 94.4% |
| Assisted | 98.2% | 98.1% | 98.1% | 98.0% | 98.0% | 97.9% | 97.9% |
*Random paired k-mers were selected from the 2× canine assemblies and then matched against the high quality draft assembly. The table shows the success rate for various values of d (the distance between the pairs).
Assembly statistics for initial drafts and assisted assemblies for a selection of 2× mammal assemblies
| Bases assembled (%) | 76.1 | 85.7 | 77.5 | 84.2 | 80.1 | 85.3 | 75.6 | 82.4 |
| Total contig length (Mb) | 1,672 | 1,905 | 2,089 | 2,314 | 1,925 | 2,080 | 1,658 | 1,853 |
| N50 contig (kb) | 2.6 | 2.9 | 2.7 | 2.7 | 2.7 | 2.9 | 2.5 | 2.6 |
| N50 scaffold gapped (kb) | 13.6 | 71.6 | 11.8 | 37.0 | 13.3 | 53.9 | 11.0 | 44.5 |
| N50 scaffold ungapped (kb) | 9.1 | 37.6 | 8.4 | 15.9 | 9.5 | 20.1 | 7.6 | 12.2 |
*All assemblies were assisted against two references, Homo sapiens and C. familiaris.
Assembly statistics for initial drafts and assisted assemblies for the 8× assembly of P. falciparum HB3, which has severe cloning bias
| | Initial draft | Assisted |
| Bases assembled (%) | 85.6 | 93.4 |
| Total contig length (Mb) | 19.8 | 23.5 |
| N50 contig (kb) | 13.7 | 15.4 |
| N50 scaffold gapped (kb) | 17.0 | 48.8 |
| N50 scaffold ungapped (kb) | 16.8 | 47.5 |
Statistics of the alignments of reads onto the reference genomes
| 79.1% | 74.3% |
| 64.1% | 35.1% |
| 51.1% | 22.7% |
| 55.3% | 25.2% |
| 68.8% | 38.0% |
| 47.8% | 18.5% |
| 49.3% | 28.8% |
| 48.8% | 29.8% |
| 59.6% | 43.9% |
| 41.6% | 22.4% |
The projects from the Mammal24 set were assisted against both human and canine references.