Michael Roberts, Aleksey V Zimin, Wayne Hayes, Brian R Hunt, Cevat Ustun, James R White, Paul Havlak, James Yorke.
Abstract
The assembly methods used for whole-genome shotgun (WGS) data have a major impact on the quality of resulting draft genomes. We present a novel algorithm to generate a set of "reliable" overlaps based on identifying repeat k-mers. To demonstrate the benefits of using reliable overlaps, we have created a version of the Phrap assembly program that uses only overlaps from a specific list. We call this version PhrapUMD. Integrating PhrapUMD and our "reliable-overlap" algorithm with the Baylor College of Medicine assembler, Atlas, we assemble the BACs from the Rattus norvegicus genome project. Starting with the same data as the Nov. 2002 Atlas assembly, we compare our results and the Atlas assembly to the 4.3 Mb of rat sequence in the 21 BACs that have been finished. Our version of the draft assembly of the 21 BACs increases the coverage of finished sequence from 93.4% to 96.3%, while simultaneously reducing the base error rate from 4.5 to 1.1 errors per 10,000 bases. There are a number of ways of assessing the relative merits of assemblies when the finished sequence is available. If one views the overall quality of an assembly as proportional to the inverse of the product of the error rate and sequence missed, then the assembly presented here is seven times better. The UMD Overlapper with options for reliable overlaps is available from the authors at http://www.genome.umd.edu. We also provide the changes to the Phrap source code enabling it to use only the reliable overlaps.
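The "seven times better" figure can be checked from the numbers quoted in the abstract. The following is a minimal sketch, assuming that "sequence missed" means 100% minus the percentage of finished sequence covered; the function name `quality` is hypothetical and not taken from the paper.

```python
def quality(errors_per_10k: float, coverage_pct: float) -> float:
    """Inverse of (base error rate x percent of finished sequence missed)."""
    missed_pct = 100.0 - coverage_pct  # assumption: missed = 100% - coverage
    return 1.0 / (errors_per_10k * missed_pct)


# Values quoted in the abstract for the 21 finished BACs.
atlas = quality(errors_per_10k=4.5, coverage_pct=93.4)
umd = quality(errors_per_10k=1.1, coverage_pct=96.3)

ratio = umd / atlas
print(f"UMD+Atlas is about {ratio:.1f} times better by this measure")
```

Under this reading the ratio is (4.5 × 6.6) / (1.1 × 3.7), which comes out a little above seven and agrees with the abstract's claim.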
Year: 2008 PMID: 18350171 PMCID: PMC2266800 DOI: 10.1371/journal.pone.0001836
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Illustration of the technique that identifies reliable overlaps: (a) A scenario where a genome contains two copies of a repeat region R.
The correct positions of reads A, B, C and D are shown. (b) A “fork” in the overlaps. (c) A scenario where reads A and D have the same sequencing error at the same base.
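To make the idea of repeat k-mers concrete, here is a toy sketch. It is not the UMD Overlapper's algorithm, whose details are not given in this excerpt. It assumes, purely for illustration, that a k-mer counts as a "repeat" when it occurs more than `max_count` times across all reads, and that an overlap is trusted only if it contains at least one non-repeat k-mer. All function names, the threshold, and the example reads are hypothetical.

```python
from collections import Counter


def kmers(seq: str, k: int) -> list[str]:
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def repeat_kmers(reads: list[str], k: int, max_count: int) -> set[str]:
    """k-mers seen more than max_count times across all reads (assumed rule)."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, n in counts.items() if n > max_count}


def overlap_is_reliable(overlap_seq: str, repeats: set[str], k: int) -> bool:
    """Assumed rule: the overlap must contain at least one non-repeat k-mer."""
    return any(km not in repeats for km in kmers(overlap_seq, k))


# Three toy reads that all share the segment TTTGGG.
reads = ["AAATTTGGG", "TTTGGGCCC", "CATTTGGGA"]
repeats = repeat_kmers(reads, k=3, max_count=2)

inside_repeat = overlap_is_reliable("TTTGGG", repeats, k=3)
extends_past = overlap_is_reliable("GGGCCC", repeats, k=3)
print(inside_repeat, extends_past)
```

In this toy case the overlap lying wholly inside the shared segment is rejected, while the one extending into unique sequence is kept. That is the kind of ambiguity panel (a) depicts for reads that fall within repeat region R.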
Comparison of the three assemblies for the subset of the 21 BACs from the Rat genome.
| Assembly | % Non-Matching Contig Tails | % of Finished Sequence Matched | % Interior Error Rate | Number of Contigs |
| original Atlas | 0.331 | 93.4 | 0.045 | 377 |
| original Atlas with UMD Plausible | 0.448 | 96.1 | 0.041 | 375 |
| original Atlas with UMD Reliable | 0.118 | 96.3 | 0.012 | 480 |
| two-pass Atlas with UMD Reliable | 0.075 | 96.3 | 0.011 | 371 |
The “original Atlas with UMD Plausible” and “original Atlas with UMD Reliable” assembly results were obtained by replacing Phrap with PhrapUMD, using UMD plausible and reliable overlaps respectively. The best assembly (the bottom row) uses PhrapUMD and UMD reliable overlaps with the two-pass approach described in the “Methods” section. It matches almost 3% more of the finished sequence than original Atlas with Phrap, at less than 1/4 of the original base error rate.
Figure 2. Two alignments of assemblies to the finished sequence of BAC GQQD.
The original Atlas assembly created two scaffolds covering only 73.2% of the finished sequence. Note the misplaced 20 Kb segment in the Atlas assembly. The UMD+Atlas assembly of GQQD correctly places the 20 Kb segment that was originally misplaced and creates a single scaffold of the BAC covering 93.3% of the finished sequence. This UMD+Atlas assembly used reliable overlaps. This was the BAC that gave Atlas the most trouble.
Figure 3. Two alignments of assemblies to the finished sequence of BAC GMEZ.
The original Atlas assembly created a single scaffold. The UMD+Atlas assembly of GMEZ assembled a 26 Kb section from the middle of the bigger scaffold into a separate Scaffold 1. Note that the large scaffold gap in Scaffold 2 is estimated correctly. This UMD+Atlas assembly used reliable overlaps. This was the BAC that gave UMD+Atlas the most trouble and the only case where the UMD+Atlas assembly had two scaffolds.