Mauve assembly metrics.

Aaron E Darling, Andrew Tritt, Jonathan A Eisen, Marc T Facciotti.

Abstract

SUMMARY: High-throughput DNA sequencing technologies have spurred the development of numerous novel methods for genome assembly. With few exceptions, these algorithms are heuristic and require one or more parameters to be manually set by the user. One approach to parameter tuning involves assembling data from an organism with an available high-quality reference genome and measuring assembly accuracy against that reference. We developed a system to measure assembly quality under several scoring metrics, and to compare assembly quality across a variety of assemblers, sequence data types and parameter choices. When used in conjunction with training data such as a high-quality reference genome and sequence reads from the same organism, our program can be used to manually identify an optimal sequencing and assembly strategy for de novo sequencing of related organisms.

AVAILABILITY: GPL source code and a usage tutorial are available at http://ngopt.googlecode.com

CONTACT: aarondarling@ucdavis.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2011 PMID: 21810901 PMCID: PMC3179657 DOI: 10.1093/bioinformatics/btr451

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Given high-throughput sequencing data, most current genome assemblers apply deterministic heuristics to infer the genome sequence. Usually a variety of parameters can be used to control the heuristic, and the optimal combination of values may not be obvious. Given a training dataset consisting of high-quality reference genomes and sequence reads generated from those genomes, it may be possible to manually or automatically select a good set of assembly parameters. A key requirement for this task is a means to quantify the accuracy of an assembly.

Measuring the accuracy with which an assembly reconstructs the reference genome presents another inference problem. It is usually unknown which part of the inferred assembly corresponds to which part of the reference genome. We must therefore map parts of the assembly back onto the reference genome through sequence alignment, which usually takes one of two forms: local alignment, exemplified by algorithms like BLAST (Altschul et al., 1997), and whole-genome alignment with algorithms like MUMmer (Kurtz et al., 2004) or Mauve (Darling et al., 2004).

We introduce a new set of assembly accuracy metrics based on the progressiveMauve genome aligner (Darling et al., 2010). In our method, the assembly contigs and/or scaffolds are first reordered to match a reference genome with the Mauve Contig Mover (Rissman et al., 2009). The ordered, aligned assembly is then compared to the reference to identify differences. Our method is most closely related to MUMmer's dnadiff program, which can measure assembly errors using genome alignment (Phillippy et al., 2008). We, however, use a different alignment heuristic and evaluate some new types of error such as rearrangement distance.

Several ongoing efforts are directed at measuring assembly accuracy on particular datasets, including the Assemblathon, GAGE and dnGASP. These initiatives use tools like dnadiff, the Mauve Assembly Metrics, and others.

2 METHODS

In the present work we summarize differences in a pairwise alignment of the assembly and reference genome [e.g. as computed by progressiveMauve (Darling et al., 2010)]. We illustrate this process by way of example. Given the following reference genome and assembled genome:

Reference: AGGCTAGCGCGCGATTAGGATC
Assembly:  AGTAGCGGGCCGATTAAGANC

A genome alignment of the reference and assembly might look like:

Reference: AGGCTAGCGCG-CGATTAGGATC
Assembly:  AG--TAGCGGGCCGATTAAGANC

From this alignment, we would calculate the assembly scoring metrics as follows (not an exhaustive list of metrics):

Miscalled bases: 2 (C→G and G→A)
Uncalled bases: 1 (N)
Extra bases: 1 (Insertion of C in assembly)
Missing bases: 2 (Deletion of GC in assembly)
Number of extra segments: 1
Number of missing segments: 1

In addition to metrics summarizing the number of base miscalls, missing segments and extra segments (each also evaluated by dnadiff), our method produces a variety of other metrics. The locations of miscalled bases, missing segments and extra segments are exported to a tab-delimited text file for subsequent analysis. GC content of the missing and extra regions is also exported. Misassemblies are identified as rearrangement breakpoints inside contigs. The double cut and join (DCJ) distance (Bergeron et al.) between the assembly and reference is calculated to estimate the combined effect of misassembly and lack-of-assembly errors (excess contig breaks) on rearrangement distance. Finally, each protein coding sequence in the reference genome is checked in the assembly for whether it yields an intact coding sequence, with the types and locations of substitution and frameshift errors reported.
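The base-level counts in the worked example above can be reproduced with a short script. The following is a minimal sketch in Python, not the Mauve implementation: the function name `score_alignment` and the `AlignmentMetrics` container are hypothetical, gaps are assumed to be written as `-` (with a two-base gap after the leading AG in the assembly row), and an `N` that falls inside an insertion is assumed to count as an extra base rather than an uncalled one.

```python
from dataclasses import dataclass

GAP = "-"  # assumed gap character in both aligned rows


@dataclass
class AlignmentMetrics:
    """Hypothetical container for the base-level metrics of one pairwise alignment."""
    miscalled_bases: int = 0
    uncalled_bases: int = 0
    extra_bases: int = 0
    missing_bases: int = 0
    extra_segments: int = 0
    missing_segments: int = 0


def score_alignment(reference: str, assembly: str) -> AlignmentMetrics:
    """Count differences column by column in a gapped pairwise alignment."""
    if len(reference) != len(assembly):
        raise ValueError("aligned rows must have equal length")

    metrics = AlignmentMetrics()
    previous = "aligned"  # state of the previous informative column
    for ref_base, asm_base in zip(reference.upper(), assembly.upper()):
        if ref_base == GAP and asm_base == GAP:
            continue  # a column of two gaps carries no information
        if asm_base == GAP:
            # Reference base absent from the assembly.
            state = "missing"
            metrics.missing_bases += 1
            if previous != "missing":
                metrics.missing_segments += 1  # a new run of missing bases starts
        elif ref_base == GAP:
            # Assembly base absent from the reference.
            state = "extra"
            metrics.extra_bases += 1
            if previous != "extra":
                metrics.extra_segments += 1  # a new run of extra bases starts
        else:
            state = "aligned"
            if asm_base == "N":
                metrics.uncalled_bases += 1
            elif asm_base != ref_base:
                metrics.miscalled_bases += 1
        previous = state
    return metrics


metrics = score_alignment(
    "AGGCTAGCGCG-CGATTAGGATC",
    "AG--TAGCGGGCCGATTAAGANC",
)
print(metrics)
```

On the example alignment this yields 2 miscalled bases, 1 uncalled base, 1 extra base, 2 missing bases, 1 extra segment and 1 missing segment, matching the counts listed above. Segments are counted as maximal runs of consecutive gap columns, which is one plausible reading of "segment" in this context.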

2.1 Assembling genomes of Haloarchaea

In an ongoing effort, we are sequencing de novo the genomes of 60 halophilic archaea. Four of these organisms have high-quality reference genomes completed independently of our project. We elected to demonstrate our new assembly metrics on one of these organisms, Haloferax volcanii strain DS2. This organism has a 4.0 Mbp genome organized into five circular replicons with about 100 repetitive IS elements of 1–2 Kbp each (Hartman ). Using 454 and Illumina resequencing data, we generated three different assemblies to compare with our software. The assemblies are named volc454, volcV and volcIDBA (see Supplementary Material for sequencing and assembly details). We scored each assembly against the reference genome using the aforementioned method. An overview of each assembly's metrics is given in Table 1. The location of assembly errors is mapped on the H.volcanii DS2 reference genome in Figure 1. Finally, Figure 2 illustrates that each sequencing and assembly strategy appears to have bias in the direction of erroneous base calls.

Table 1.

Mauve assembly metrics for three assemblies of H.volcanii DS2

Metric	volc454	volcV	volcIDBA
Scaffold count	157	1394	50
Miscalled bases	81	948	235
Uncalled bases	0	53 899	15 188
Extra bases (%)	0.04	10.8	2.54
Missing bases (%)	3.13	5.87	2.71
Extra segments	43	1079	262
Missing segments	117	1144	192
DCJ Distance	114	909	61
Intact CDS (%)	99.3	87.8	97.3

Fig. 1.

(A) Density of extra and missing segments in the assemblies of H.volcanii DS2. Reference genome coordinates are given on the x-axis, and red vertical bars delineate the boundaries of the five circular replicons in the reference genome. (B) Size distribution of missing and extra segments in each assembly. The size of a missing segment is given on the x-axis, and the count of missing segments at that size on the y-axis.

Fig. 2.

Biased errors in the base calling of each assembly. Errors are not uniformly random in any of the three assemblies. See Supplementary Material for more details.

3 DISCUSSION

The assembly metrics we describe illustrate substantial differences between sequencing and assembly strategies. For example, the volc454 assembly captured nearly all coding genes in the reference genome, but had a high scaffold count relative to the volcIDBA assembly. Striking an ideal balance between assembly error types, rates and sequencing cost is an exercise left for users of our software.

When a finished reference genome is available and has been resequenced, the assembly metrics calculated by our system can be used to guide selection of sequencing strategy and tune assembly parameters. The reported metrics may form the basis for a future automated system to perform supervised machine learning of assembly parameters by conducting a parameter sweep over a large number of assembly strategies. Finally, we note that genome alignment algorithms are not perfect and some differences between the assembly and the reference may be due to alignment error and not true assembly errors.

Funding: National Science Foundation award (ER 0949453).

Conflict of Interest: none declared.

REFERENCES

1. Mauve: multiple alignment of conserved genomic sequence with rearrangements.

Authors: Aaron C E Darling; Bob Mau; Frederick R Blattner; Nicole T Perna
Journal: Genome Res Date: 2004-07 Impact factor: 9.043

2. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

3. The complete genome sequence of Haloferax volcanii DS2, a model archaeon.

Authors: Amber L Hartman; Cédric Norais; Jonathan H Badger; Stéphane Delmas; Sam Haldenby; Ramana Madupu; Jeffrey Robinson; Hoda Khouri; Qinghu Ren; Todd M Lowe; Julie Maupin-Furlow; Mecky Pohlschroder; Charles Daniels; Friedhelm Pfeiffer; Thorsten Allers; Jonathan A Eisen
Journal: PLoS One Date: 2010-03-19 Impact factor: 3.240

4. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement.

Authors: Aaron E Darling; Bob Mau; Nicole T Perna
Journal: PLoS One Date: 2010-06-25 Impact factor: 3.240

5. Versatile and open software for comparing large genomes.

Authors: Stefan Kurtz; Adam Phillippy; Arthur L Delcher; Michael Smoot; Martin Shumway; Corina Antonescu; Steven L Salzberg
Journal: Genome Biol Date: 2004-01-30 Impact factor: 13.583

6. Reordering contigs of draft genomes using the Mauve aligner.

Authors: Anna I Rissman; Bob Mau; Bryan S Biehl; Aaron E Darling; Jeremy D Glasner; Nicole T Perna
Journal: Bioinformatics Date: 2009-06-10 Impact factor: 6.937

7. Genome assembly forensics: finding the elusive mis-assembly.

Authors: Adam M Phillippy; Michael C Schatz; Mihai Pop
Journal: Genome Biol Date: 2008-03-14 Impact factor: 13.583
