Reordering contigs of draft genomes using the Mauve aligner.

Anna I Rissman, Bob Mau, Bryan S Biehl, Aaron E Darling, Jeremy D Glasner, Nicole T Perna.

Abstract

SUMMARY: Mauve Contig Mover provides a new method for proposing the relative order of contigs that make up a draft genome based on comparison to a complete or draft reference genome. A novel application of the Mauve aligner and viewer provides an automated reordering algorithm coupled with a powerful drill-down display allowing detailed exploration of results. AVAILABILITY: The software is available for download at http://gel.ahabs.wisc.edu/mauve.

Year: 2009 PMID: 19515959 PMCID: PMC2723005 DOI: 10.1093/bioinformatics/btp356

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

New high-throughput technologies have greatly reduced the cost of genome sequencing, leading to an abundance of draft-quality genome sequences that may be composed of hundreds or thousands of contigs. Ordering and orienting these contigs into larger units (scaffolds or supercontigs) facilitates genome closure and comparative analyses. Contigs can be ordered based on additional data, such as the presence of discontinuous portions of the same sequencing template (clone or fragment) in two contigs, but this type of information is not available for all projects. However, even without additional data, contig order can be predicted by comparison with a reference genome that is expected to have conserved genome organization. We present a new method for comparative contig ordering based on iterative genome alignment using Mauve. The reference used may be draft quality itself, or may have divergent genetic content. The Mauve aligner has been used extensively for microbial genome comparisons because it effectively identifies and aligns homologous regions even if genomes have undergone rearrangements, large insertions or deletions, and substantial sequence divergence. Mauve Contig Mover (MCM) provides advantages over methods that rely on matches in limited regions near the ends of contigs, require anchors at both ends of contigs, force users to exclude lineage-specific sequences at contig boundaries, or are unable to resolve which, if any, copies of repeated sequences are consistent with more extensive collinearity (Darling et al., 2004; Richter et al., 2007; van Hijum et al., 2005). An interactive full-genome alignment display shows the relative order of the contigs as well as potential gaps in sequence coverage and regions of possible rearrangement or misassembly. After reordering, Mauve is a useful platform for further detailed comparative sequence analysis that is often the motivation for the sequencing effort itself.

2 METHODS

The Mauve aligner filters and sorts internally identified matches into locally collinear blocks (LCBs). Each LCB represents a region of homologous sequence without rearrangement among the input genomes. Each LCB must be separated from the next by rearrangement in at least one genome (Darling et al., 2004). Contig boundaries (edges) represent potentially artificial LCB edges. Therefore, finding the contig order that minimizes the number of LCBs caused by contig edges is equivalent to finding a likely contig order.

Using the Mauve alignment LCBs, the reordering process occurs in three steps: placing contigs with no apparent conflict in ordering information, placing contigs with conflicting information into intermediary anchor positions, and finally matching LCB ends that extend to contig boundaries. Each step occurs in at most O(n^2) time, where n is the number of LCBs, plus the time required for alignment (Darling et al., 2004). Mauve assumes contigs are in the correct order when filtering matches, so as the order is optimized, alignment results change. Therefore, results are refined through iterative alignment until no further ordering is possible.

MCM outputs a series of Mauve alignments, each representing an iteration of the reordering. In addition to the standard Mauve output, the reorder process produces a FastA file containing the new order and orientation, as well as a list of ordered contigs including name and coordinate location.

The standard Mauve visualization can be applied in novel ways to analyze contig order. For example, we have used it to identify potential misassemblies in contigs, and to evaluate the presence or absence of genes split by contig boundaries or by rearrangements. If FastAs representing the order produced by other programs are created, Mauve can also be used to compare results, as in Supplementary Figure 1. Furthermore, annotations from GenBank format input can be viewed, even once reordered.
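The link between contig order and LCB count can be illustrated with a toy model. The sketch below is not the MCM algorithm. It assumes, purely for illustration, that each contig has already been reduced to a list of signed reference-block numbers (block i is the i-th collinear segment of the reference, and a negative sign means reverse strand), and that every contig contains at least one block. The function names `flip`, `count_lcbs` and `reorder` are hypothetical. The greedy `reorder` step simply orients each contig by majority strand and sorts by leftmost reference block; it ignores the conflict handling and iterative realignment described above.

```python
from typing import List

Contig = List[int]  # signed reference-block ids in the order they occur on the contig


def flip(contig: Contig) -> Contig:
    """Reverse-complement a contig: reverse the block order and negate each sign."""
    return [-b for b in reversed(contig)]


def count_lcbs(contigs: List[Contig]) -> int:
    """Count maximal collinear runs in the concatenated draft.

    Adjacent blocks a, b stay in the same LCB only if b == a + 1, which
    covers forward runs (3, 4) as well as reverse runs (-4, -3).
    """
    blocks = [b for contig in contigs for b in contig]
    if not blocks:
        return 0
    breaks = sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)
    return breaks + 1


def reorder(contigs: List[Contig]) -> List[Contig]:
    """Toy heuristic: orient by majority strand, then sort by leftmost reference block."""
    oriented = []
    for contig in contigs:
        reverse_votes = sum(1 for b in contig if b < 0)
        oriented.append(flip(contig) if reverse_votes * 2 > len(contig) else contig)
    return sorted(oriented, key=lambda c: min(abs(b) for b in c))


# Four contigs of a hypothetical draft, in assembler output order.
draft = [[5, 6], [-2, -1], [3, 4], [7]]
print(count_lcbs(draft))           # LCBs implied by the original contig order
print(count_lcbs(reorder(draft)))  # LCBs after the toy reordering
```

In this example every LCB boundary in the original order is an artifact of contig order or orientation, so reordering collapses the draft into a single collinear block. With real data, true rearrangements and repeats leave residual LCBs that no contig order can remove.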

3 RESULTS

We have used MCM to order contigs for a variety of different bacterial genome projects based on comparison to the single best reference sequence available, and show some of our results in Table 1. These projects include draft genomes assembled from Sanger sequencing as well as short read generating technologies developed by 454 and Illumina, with the most fragmented example involving a 5 Mb genome with more than 1000 contigs. The draft and reference genome combinations selected include comparisons of genomes from different species, the same species, different strains and different assemblies of the same genome. Many of the draft genomes available through the Enteropathogen Resource Integration Center (Glasner et al., 2008a) and the ASAP database (Glasner et al., 2006) have been ordered using MCM. Examples are available as Supplementary Material on our web site. Supplementary Figure 1 shows a Yersinia pestis strain FV1 draft genome (Touchman et al., 2007) reordered and aligned to the complete Y. pestis CO92 reference genome. MCM was able to order 356 out of 400 contigs (4211103 out of 4472646 bp) reducing the alignment from 359 to 11 LCBs. Supplementary Figure 1 also shows the utility of the Mauve Viewer for comparing different suggested contig orders.

Table 1.

Summary of results of Mauve Contig Mover reorders

Draft genome	Reference genome	Contigs in draft	bp in draft	Mauve: contigs ordered	Mauve: % bp ordered	Projector: contigs ordered	Projector: % bp ordered	OSLay: contigs ordered	OSLay: % bp ordered
P. brasiliensis Pbr1692 (Glasner et al., 2008b)	P. atroseptica SCRI1043 (Toth et al., 2004)	1370	4918574	121	95.9	89	90.7	112	93.5
P. carotovorum WPP14 (Glasner et al., 2008b)	P. atroseptica SCRI1043 (Toth et al., 2004)	741	4823187	222	96.4	176	90.7	198	93.8
Yersinia pestis FV-1 (Touchman et al., 2007)	Yersinia pestis CO92 (Parkhill et al., 2001)	400	4472646	355	94.1	353	93.4	Did not finish
Escherichia coli EC4501 (Glasner et al., 2008a)	Escherichia coli Sakai (Hayashi et al., 2001)	250	5677181	140	93.4	107	93.3	177	90.9
E. coli MG1655 mutant* (Glasner et al., 2008a)	Escherichia coli MG1655 (Blattner et al., 1997)	1663	4554569	1068	98.8	725	96.4	784	94.6
Erwinia chrysanthemi 3937 v3	E. chrysanthemi 3937 v6b (Glasner et al., 2008a)	767	5119283	228	95.4	212	95.4	267	90.7

Data includes the draft sequence and reference sequence used to perform the order, the number of contigs and base pairs contained in the draft, and the number of contigs and percentage of base pairs ordered by Mauve, Projector (van Hijum et al., 2005), and OSLay (Richter et al., 2007). Pectobacterium is abbreviated P. in table. All drafts were sequenced using 454 technology, except (*), which used Solexa. While we included numbers from OSLay reorders, the structure suggested by the OSLay reorder differs significantly from that of Mauve and Projector, as can be seen in Supplementary Figure 1. Table 2 and Supplementary Table 1 also summarize correctly ordered quantities based on artificially cut genomes.

Table 2.

Overview of percents ordered and correctly ordered

Artificial draft	Reference	bp	% of total bp ordered: Mauve	% of total bp ordered: Projector	% of total bp ordered: OSLay		% of total bp correct: Mauve	% of total bp correct: Projector	% of total bp correct: OSLay
P. atroseptica SCRI1043 (Toth et al., 2004)	P. atroseptica SCRI1043 (Toth et al., 2004)	5064019	99.4	98.8	97.7		99.4	98.7	95.1
Escherichia coli EDL933 (Perna et al., 2001)	Escherichia coli MG1655 (Blattner et al., 1997)	5528133	95.0	91.7	85.9		94.1	82.6	78.4
Yersinia pestis KIM (Deng et al., 2002)	Yersinia pestis CO92 (Parkhill et al., 2001)	4781603	96.5	96.8	92.7		90.4	61.7	66.8
Overall average			96.8	95.1	91.9		94.2	80.7	78.9

Each row is an average of the orders of the same sequences listed below in Supplementary Table 2. The draft, in each case, was artificially cut into pieces using in-house software. The pieces were ordered using Mauve, Projector and OSLay, and the results compared to the correct order. A piece (contig) was considered out of order if it was out of position relative to the closest correctly ordered contig on either side. The table shows the total number of base pairs, the percentage ordered using each algorithm, and the percent of the total base pairs that were correctly ordered. Draft sequences are prone to errors and omissions that have not been modeled in the artificially partitioned ‘drafts’ used. Therefore, these figures are meant to bound the number of ordered base pairs. Each row represents different genomes with different divergence, giving an idea of these percentages over a range of data.
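The caption above does not spell out how the out-of-order criterion was implemented. One possible reading, shown here only as an assumption, treats the longest run of contigs whose true positions increase along the predicted order as the correctly ordered backbone and counts every other placed contig as out of order. The function names and the toy input are hypothetical, true ranks are assumed to be distinct, and ties between equally long backbones are resolved arbitrarily.

```python
from bisect import bisect_left
from typing import List, Tuple


def on_backbone(true_ranks: List[int]) -> List[bool]:
    """Mark contigs lying on one longest increasing subsequence of true ranks."""
    tail_idx: List[int] = []  # tail_idx[k]: index of the contig ending the best run of length k + 1
    tail_val: List[int] = []  # tail_val[k]: true rank of that contig (kept sorted)
    prev = [-1] * len(true_ranks)
    for i, rank in enumerate(true_ranks):
        k = bisect_left(tail_val, rank)
        if k == len(tail_val):
            tail_idx.append(i)
            tail_val.append(rank)
        else:
            tail_idx[k] = i
            tail_val[k] = rank
        prev[i] = tail_idx[k - 1] if k > 0 else -1
    keep = [False] * len(true_ranks)
    i = tail_idx[-1] if tail_idx else -1
    while i != -1:  # walk back along the chosen backbone
        keep[i] = True
        i = prev[i]
    return keep


def percent_ordered_and_correct(
    placed: List[Tuple[int, int]], total_bp: int
) -> Tuple[float, float]:
    """placed: (true_rank, length_bp) per contig, listed in predicted order.

    total_bp also counts contigs the tool left unplaced.
    """
    keep = on_backbone([rank for rank, _ in placed])
    ordered_bp = sum(length for _, length in placed)
    correct_bp = sum(length for (_, length), ok in zip(placed, keep) if ok)
    return 100.0 * ordered_bp / total_bp, 100.0 * correct_bp / total_bp


# Toy draft: 10000 bp in total, 8000 bp placed, contigs with true ranks 3 and 2 swapped.
placed = [(0, 1000), (1, 2000), (3, 500), (2, 1500), (4, 3000)]
ordered_pct, correct_pct = percent_ordered_and_correct(placed, total_bp=10000)
print(ordered_pct, correct_pct)
```

Both percentages use the total base pairs of the draft as the denominator, matching the two column groups of Table 2, so the percentage correct can never exceed the percentage ordered.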

We urge caution in interpretation of contig order predicted using MCM or any other algorithm. Many true bacterial genome rearrangements occur at repetitive sequences, which pose challenges for both genome assembly and alignment. Ordering contigs based on comparative analyses can mask true rearrangements anchored in repeats at these contig breaks. Annotations of complete and partial repeats can be viewed in the Mauve alignment display, providing a means of identifying such regions. Conversely, misassemblies can appear as false rearrangements. Mauve clearly displays these regions, and PCR primers can be designed to verify the rearrangement or demonstrate the misassembly. The Mauve Viewer also allows exploration of alternate positions for contigs with multiple LCBs. A comparison of the number of LCBs between reference and draft to the number between other genomes expected to show similar levels of rearrangement can provide an estimate of this effect. Table 2 summarizes reorders without modeling these effects, showing accuracy between 90.4% and 99.4%. Generally, closer alignments will provide more accurate reorders covering more of the draft. Because MCM maximizes collinearity among genome sequences under comparison, it produces alignments that are easily visualized and provides an excellent platform for analysis and finishing of draft genomes.

REFERENCES

1. Mauve: multiple alignment of conserved genomic sequence with rearrangements.

Authors: Aaron C E Darling; Bob Mau; Frederick R Blattner; Nicole T Perna
Journal: Genome Res Date: 2004-07 Impact factor: 9.043

2. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7.

Authors: N T Perna; G Plunkett; V Burland; B Mau; J D Glasner; D J Rose; G F Mayhew; P S Evans; J Gregor; H A Kirkpatrick; G Pósfai; J Hackett; S Klink; A Boutin; Y Shao; L Miller; E J Grotbeck; N W Davis; A Lim; E T Dimalanta; K D Potamousis; J Apodaca; T S Anantharaman; J Lin; G Yen; D C Schwartz; R A Welch; F R Blattner
Journal: Nature Date: 2001-01-25 Impact factor: 49.962

3. Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12.

Authors: T Hayashi; K Makino; M Ohnishi; K Kurokawa; K Ishii; K Yokoyama; C G Han; E Ohtsubo; K Nakayama; T Murata; M Tanaka; T Tobe; T Iida; H Takami; T Honda; C Sasakawa; N Ogasawara; T Yasunaga; S Kuhara; T Shiba; M Hattori; H Shinagawa
Journal: DNA Res Date: 2001-02-28 Impact factor: 4.458

4. Genome sequence of Yersinia pestis, the causative agent of plague.

Authors: J Parkhill; B W Wren; N R Thomson; R W Titball; M T Holden; M B Prentice; M Sebaihia; K D James; C Churcher; K L Mungall; S Baker; D Basham; S D Bentley; K Brooks; A M Cerdeño-Tárraga; T Chillingworth; A Cronin; R M Davies; P Davis; G Dougan; T Feltwell; N Hamlin; S Holroyd; K Jagels; A V Karlyshev; S Leather; S Moule; P C Oyston; M Quail; K Rutherford; M Simmonds; J Skelton; K Stevens; S Whitehead; B G Barrell
Journal: Nature Date: 2001-10-04 Impact factor: 49.962

5. Niche-specificity and the variable fraction of the Pectobacterium pan-genome.

Authors: J D Glasner; M Marquez-Villavicencio; H-S Kim; C E Jahn; B Ma; B S Biehl; A I Rissman; B Mole; X Yi; C-H Yang; J L Dangl; S R Grant; N T Perna; A O Charkowski
Journal: Mol Plant Microbe Interact Date: 2008-12 Impact factor: 4.171

6. Genome sequence of Yersinia pestis KIM.

Authors: Wen Deng; Valerie Burland; Guy Plunkett; Adam Boutin; George F Mayhew; Paul Liss; Nicole T Perna; Debra J Rose; Bob Mau; Shiguo Zhou; David C Schwartz; Jaqueline D Fetherston; Luther E Lindler; Robert R Brubaker; Gregory V Plano; Susan C Straley; Kathleen A McDonough; Matthew L Nilles; Jyl S Matson; Frederick R Blattner; Robert D Perry
Journal: J Bacteriol Date: 2002-08 Impact factor: 3.490

7. Genome sequence of the enterobacterial phytopathogen Erwinia carotovora subsp. atroseptica and characterization of virulence factors.

Authors: K S Bell; M Sebaihia; L Pritchard; M T G Holden; L J Hyman; M C Holeva; N R Thomson; S D Bentley; L J C Churcher; K Mungall; R Atkin; N Bason; K Brooks; T Chillingworth; K Clark; J Doggett; A Fraser; Z Hance; H Hauser; K Jagels; S Moule; H Norbertczak; D Ormond; C Price; M A Quail; M Sanders; D Walker; S Whitehead; G P C Salmond; P R J Birch; J Parkhill; I K Toth
Journal: Proc Natl Acad Sci U S A Date: 2004-07-19 Impact factor: 11.205

8. ASAP: a resource for annotating, curating, comparing, and disseminating genomic data.

Authors: Jeremy D Glasner; Michael Rusch; Paul Liss; Guy Plunkett; Eric L Cabot; Aaron Darling; Bradley D Anderson; Paul Infield-Harm; Michael C Gilson; Nicole T Perna
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

9. A North American Yersinia pestis draft genome sequence: SNPs and phylogenetic analysis.

Authors: Jeffrey W Touchman; David M Wagner; Jicheng Hao; Stephen D Mastrian; Maulik K Shah; Amy J Vogler; Christopher J Allender; Erin A Clark; Debbie S Benitez; David J Youngkin; Jessica M Girard; Raymond K Auerbach; Stephen M Beckstrom-Sternberg; Paul Keim
Journal: PLoS One Date: 2007-02-21 Impact factor: 3.240

10. Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies.

Authors: Sacha A F T van Hijum; Aldert L Zomer; Oscar P Kuipers; Jan Kok
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971
