Aaron E Darling, Andrew Tritt, Jonathan A Eisen, Marc T Facciotti.
Abstract
SUMMARY: High-throughput DNA sequencing technologies have spurred the development of numerous novel methods for genome assembly. With few exceptions, these algorithms are heuristic and require one or more parameters to be set manually by the user. One approach to parameter tuning is to assemble data from an organism for which a high-quality reference genome is available and to measure the accuracy of the resulting assembly against that reference. We developed a system that measures assembly quality under several scoring metrics and compares assembly quality across a variety of assemblers, sequence data types, and parameter choices. When used with training data such as a high-quality reference genome and sequence reads from the same organism, our program can be used to manually identify an optimal sequencing and assembly strategy for de novo sequencing of related organisms.
AVAILABILITY: GPL source code and a usage tutorial are available at http://ngopt.googlecode.com
CONTACT: aarondarling@ucdavis.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Year: 2011 PMID: 21810901 PMCID: PMC3179657 DOI: 10.1093/bioinformatics/btr451
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Mauve assembly metrics for three assemblies of H. volcanii DS2
| Metric | volc454 | volcV | volcIDBA |
|---|---|---|---|
| Scaffold count | 157 | 1394 | 50 |
| Miscalled bases | 81 | 948 | 235 |
| Uncalled bases | 0 | 53 899 | 15 188 |
| Extra bases (%) | 0.04 | 10.8 | 2.54 |
| Missing bases (%) | 3.13 | 5.87 | 2.71 |
| Extra segments | 43 | 1079 | 262 |
| Missing segments | 117 | 1144 | 192 |
| DCJ Distance | 114 | 909 | 61 |
| Intact CDS (%) | 99.3 | 87.8 | 97.3 |
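As a rough illustration of how the metrics in the table might be compared across assemblies, the Python sketch below copies the table values and reports which assembly scores best on each metric. This is not the scoring procedure of the published software. The names `METRICS`, `HIGHER_IS_BETTER`, `best_assembly` and `tally_wins` are hypothetical, and the direction of each metric (lower is better for everything except intact CDS) is an assumption made for this example.

```python
# Metric values copied from the table above, keyed by assembly name.
METRICS = {
    "Scaffold count":     {"volc454": 157,  "volcV": 1394,  "volcIDBA": 50},
    "Miscalled bases":    {"volc454": 81,   "volcV": 948,   "volcIDBA": 235},
    "Uncalled bases":     {"volc454": 0,    "volcV": 53899, "volcIDBA": 15188},
    "Extra bases (%)":    {"volc454": 0.04, "volcV": 10.8,  "volcIDBA": 2.54},
    "Missing bases (%)":  {"volc454": 3.13, "volcV": 5.87,  "volcIDBA": 2.71},
    "Extra segments":     {"volc454": 43,   "volcV": 1079,  "volcIDBA": 262},
    "Missing segments":   {"volc454": 117,  "volcV": 1144,  "volcIDBA": 192},
    "DCJ Distance":       {"volc454": 114,  "volcV": 909,   "volcIDBA": 61},
    "Intact CDS (%)":     {"volc454": 99.3, "volcV": 87.8,  "volcIDBA": 97.3},
}

# Assumption for this sketch: every metric counts errors (lower is better)
# except the fraction of intact coding sequences (higher is better).
HIGHER_IS_BETTER = {"Intact CDS (%)"}


def best_assembly(metric, values):
    """Return the name of the assembly with the best value for one metric."""
    pick = max if metric in HIGHER_IS_BETTER else min
    return pick(values, key=values.get)


def tally_wins(metrics):
    """Count, per assembly, the number of metrics on which it scores best."""
    wins = {}
    for metric, values in metrics.items():
        winner = best_assembly(metric, values)
        wins[winner] = wins.get(winner, 0) + 1
    return wins


if __name__ == "__main__":
    for metric, values in METRICS.items():
        print(f"{metric}: {best_assembly(metric, values)}")
    print(tally_wins(METRICS))
```

With the table values, volc454 scores best on six of the nine metrics and volcIDBA on the remaining three (scaffold count, missing bases and DCJ distance). An unweighted tally like this treats all metrics as equally important, which is a simplification; in practice the choice between assemblies depends on which error types matter for the downstream analysis.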
Fig. 1. (A) Density of extra and missing segments in the assemblies of H. volcanii DS2. Reference genome coordinates are given on the x-axis, and red vertical bars delineate the boundaries of the five circular replicons in the reference genome. (B) Size distribution of missing and extra segments in each assembly. The size of a missing segment is given on the x-axis, and the count of missing segments at that size on the y-axis.
Fig. 2. Biased errors in the base calling of each assembly. Errors are not uniformly random in any of the three assemblies. See Supplementary Material for more details.