
r2cat: synteny plots and comparative assembly.

Peter Husemann, Jens Stoye.

Abstract

SUMMARY: Recent parallel pyrosequencing methods and the increasing number of finished genomes encourage the sequencing and investigation of closely related strains. Although the sequencing itself becomes easier and cheaper with each machine generation, the finishing of the genomes remains difficult. Instead of the desired whole genomic sequence, a set of contigs is the result of the assembly. In this applications note, we present the tool r2cat (related reference contig arrangement tool) that helps in the task of comparative assembly and also provides an interactive visualization for synteny inspection.

Year:  2009        PMID: 20015948      PMCID: PMC2820676          DOI: 10.1093/bioinformatics/btp690

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

With the advent of high-throughput sequencing machines, it has become easier and cheaper to sequence a genome. A decade ago, a sequencing project lasted for years and required a million-dollar budget, whereas today the sequencing itself takes days and costs only a few thousand dollars. Nevertheless, the effort to close a genome completely is still non-negligible, and thus one very important step in genome finishing remains the closure of gaps between contigs. This task becomes easier if the order and the relative orientation of the contigs are known. Mapping the contigs onto a closely related genome provides this kind of information. Consequently, a program that orders contigs according to their matches and visualizes the synteny between the contigs and a reference genome can help to close the gaps.

A number of tools have been developed to aid in this task, such as Projector2 (van Hijum et al., 2005), a web service that maps contigs onto a template genome and visualizes the result; OSLay (Richter et al., 2007), which computes an optimal syntenic layout for a set of contigs; and ABACAS (Assefa et al., 2009), which orders contigs using several external programs for matching, primer design and visualization.

Our program r2cat (related reference contig arrangement tool) is able to quickly match a set of contigs onto a related genome, order the contigs according to their matches and display the result in an interactive synteny plot. The matching, however, is not restricted to contigs, so the program can also be used to visualize the synteny of two finished genomes. The software is open source and available within the Comparative Genomics – Contig Arrangement Toolsuite (cg-cat; http://bibiserv.techfak.uni-bielefeld.de/cg-cat) on the Bielefeld Bioinformatics Server (BiBiServ).

2 METHODS

In a first step, similar regions between the contigs and a related reference genome have to be determined. For this task, a q-gram filter (Rasmussen et al., 2006) is used. It finds regions of up to 8% difference that contain at least 44 exact matches of possibly overlapping 11-mers, each no further apart than 64 bases. All these matching regions are displayed in an interactive synteny plot, as shown in Figure 1. In a second step, the contigs can be ordered and oriented automatically according to their matches. To this end, a sliding window approach determines the position of a contig on the reference sequence where it gains the most matches. A manual correction, however, is easily possible.
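The thresholds above can be illustrated with a minimal Python sketch. It is not the q-gram filter of Rasmussen et al. (2006) and not the r2cat implementation: it only indexes the 11-mers of a reference, collects exact hits of a query, and reports runs of hits on a single diagonal. The function names `index_qgrams` and `candidate_regions` are hypothetical, the 64-base limit is read here as the distance between consecutive hits (an assumption), and the tolerance for up to 8% difference is not modelled at all.

```python
import random
from collections import defaultdict

Q = 11          # q-gram length given in the text
MIN_HITS = 44   # minimum number of exact q-gram hits given in the text
MAX_GAP = 64    # maximum distance between hits given in the text


def index_qgrams(reference, q=Q):
    """Map every q-gram of the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - q + 1):
        index[reference[pos:pos + q]].append(pos)
    return index


def candidate_regions(query, index, q=Q, min_hits=MIN_HITS, max_gap=MAX_GAP):
    """Return sorted (query_start, query_end, ref_start, ref_end, hits) tuples.

    Simplification (assumption): hits are chained only on one exact diagonal,
    so insertions and deletions are not tolerated in this sketch.
    """
    by_diagonal = defaultdict(list)
    for qpos in range(len(query) - q + 1):
        for rpos in index.get(query[qpos:qpos + q], ()):
            by_diagonal[rpos - qpos].append(qpos)  # qpos values arrive sorted

    regions = []
    for diagonal, qpositions in by_diagonal.items():
        run = [qpositions[0]]
        # The trailing None forces the last run to be flushed.
        for qpos in qpositions[1:] + [None]:
            if qpos is not None and qpos - run[-1] <= max_gap:
                run.append(qpos)
                continue
            if len(run) >= min_hits:
                start, end = run[0], run[-1] + q
                regions.append((start, end, start + diagonal, end + diagonal, len(run)))
            if qpos is not None:
                run = [qpos]
    return sorted(regions)


# Toy data: a random reference and a "contig" copied from positions 100..199.
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(600))
contig = reference[100:200]
regions = candidate_regions(contig, index_qgrams(reference))
print(regions)
```

In this toy case the contig is an exact copy, so all 90 of its 11-mers hit the same diagonal. Matching the reverse complement of a contig, which r2cat also does, would require running the same search on the reverse-complemented query.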
Fig. 1. Synteny plots produced by r2cat. The contigs of C.urealyticum (NCBI number: NC_010545) are mapped onto the reference sequence of C.jeikeium (NC_007164).

The resulting order can then be helpful for gap-closing purposes in the finishing phase of a sequencing project, assuming that the corresponding genomes have a high degree of synteny.
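The sliding window placement described in this section might be sketched as follows. This is an illustration under stated assumptions, not the r2cat code: the `Match` record, the functions `best_window` and `arrange`, and the scoring rule (a window as long as the contig, scored by the summed length of the matches that start inside it) are all hypothetical choices. Orientation is decided here by a simple majority of matched bases, and unmatched contigs are appended at the end.

```python
from dataclasses import dataclass


@dataclass
class Match:
    ref_start: int   # start of the match on the reference
    length: int      # number of matched bases
    reverse: bool    # True if the reverse complement of the contig matched


def best_window(matches, contig_length):
    """Return (window_start, score, reverse) for the best-scoring window.

    Only windows that begin at a match start are tried: shifting a window
    to the right up to the first match it contains never loses a match.
    Assumes contig_length > 0.
    """
    ordered = sorted(matches, key=lambda m: m.ref_start)
    best = (0, 0, False)
    right = 0
    score = 0
    for left, first in enumerate(ordered):
        # Grow the window to cover all matches starting before its right edge.
        while (right < len(ordered)
               and ordered[right].ref_start < first.ref_start + contig_length):
            score += ordered[right].length
            right += 1
        if score > best[1]:
            reverse_bases = sum(m.length for m in ordered[left:right] if m.reverse)
            best = (first.ref_start, score, 2 * reverse_bases > score)
        score -= first.length  # the leftmost match leaves the next window
    return best


def arrange(contigs):
    """Order contigs by their best window; contigs maps name -> (length, matches)."""
    placed, unplaced = [], []
    for name, (length, matches) in contigs.items():
        if matches:
            start, _, reverse = best_window(matches, length)
            placed.append((start, name, reverse))
        else:
            unplaced.append((name, False))
    placed.sort()
    return [(name, reverse) for _, name, reverse in placed] + unplaced


# Toy input with made-up coordinates.
contigs = {
    "c1": (1000, [Match(5000, 300, False), Match(5400, 400, False),
                  Match(90000, 200, False)]),
    "c2": (800, [Match(1200, 500, True), Match(1800, 200, True)]),
    "c3": (500, []),
}
layout = arrange(contigs)
print(layout)  # [('c2', True), ('c1', False), ('c3', False)]
```

In the toy input, the isolated match of `c1` at position 90000 is outvoted by the two matches near position 5000, which is the behaviour the sliding window is meant to provide for contigs with scattered repeat matches.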

3 IMPLEMENTATION

The tool r2cat, which implements the matching, ordering and visualization, is written in Java and can be started from the Internet without installation using the Java WebStart Framework. The sources are licensed under GPL and available from the author.

Matching and ordering: the fast built-in matching runs well for prokaryotic genomes up to 12 MB. The matching routine is capable of handling multichromosomal genomes, provided in multi-FASTA files, and also finds matches for the reverse complement of each contig. After the matching, the contigs can be arranged automatically. The matches, as well as the inferred order and orientation, can be stored in and retrieved from human-readable text files. These can also be parsed by other programs or modified by hand if necessary.

Visualization: the implemented visualization displays all matches in a dotplot, thus providing a quick overview of the synteny. A horizontal bar at the bottom helps to assess the coverage of the matches: maximum coverage is displayed in black and fades to light grey with less coverage. Uncovered regions are marked explicitly. The implementation features an export of the synteny plot to either bitmap or vector-based graphics formats. Some of the latter are editable and are thus well suited for high-quality synteny plots to be used in publications and other print media. The view area itself is zoomable and pannable. Contigs as well as single matches can be selected and displayed in separate table views. The contig table allows the user to reorder the contigs manually, if necessary, using drag and drop. The contigs can then be saved in the displayed order in FASTA format for further processing. While the main focus of this tool is to order a set of contigs, the synteny visualization can also be used to investigate the relationship between two species if, instead of the contigs, the genomic sequence of a related genome is chosen for matching.
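The coverage bar described above could be computed along the following lines. This is only a sketch in Python under assumptions, not the Java code of r2cat: the functions `coverage_depth` and `grey_levels`, the binning of the reference into equal-sized segments, and the grey value 200 for the lightest shade are all hypothetical. Matches are given as half-open intervals on the reference.

```python
def coverage_depth(intervals, ref_length):
    """Per-base coverage depth from half-open (start, end) match intervals."""
    diff = [0] * (ref_length + 1)
    for start, end in intervals:
        # Clamp to the reference so stray coordinates cannot index out of range.
        diff[min(max(start, 0), ref_length)] += 1
        diff[min(max(end, 0), ref_length)] -= 1
    depth, running = [], 0
    for delta in diff[:ref_length]:
        running += delta
        depth.append(running)
    return depth


def grey_levels(depth, bins, lightest=200):
    """Map the mean depth of each bin to a grey value.

    0 means black (maximum coverage), values towards `lightest` mean light
    grey (low coverage), and None marks an uncovered bin explicitly.
    Assumes a non-empty depth list and bins > 0.
    """
    size = -(-len(depth) // bins)  # ceiling division
    means = []
    for i in range(0, len(depth), size):
        chunk = depth[i:i + size]
        means.append(sum(chunk) / len(chunk))
    top = max(means)
    levels = []
    for mean in means:
        if mean == 0:
            levels.append(None)
        else:
            levels.append(round(lightest * (1 - mean / top)))
    return levels


# Toy input: a reference of 100 bases covered by three matches.
depth = coverage_depth([(0, 50), (25, 50), (75, 100)], 100)
levels = grey_levels(depth, bins=4)
print(levels)  # [100, 0, None, 100]
```

Keeping uncovered bins as a separate value rather than as the lightest grey mirrors the statement that uncovered regions are marked explicitly.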

4 RESULTS

To show that the matching implemented in r2cat is competitive, we compared it with the three well-known matching programs BLAST, BLAT and MUMmer. Each program was used on two prokaryotic datasets to match a set of contigs onto a reference genome. The first dataset “S.suis”, taken from Assefa et al. (2009), consists of 281 contigs (2.1 Mb) of a Streptococcus suis strain that were matched onto the genome of another strain, SC84 (2.1 Mb, NCBI number: NC_012924). The second dataset “S.meliloti” consists of 446 contigs in 7.2 Mb of a Sinorhizobium meliloti strain that were matched onto a reference genome with three replicons: one chromosome (3.65 Mb, NC_003047) and two megaplasmids (1.68 Mb, NC_003078; 1.35 Mb, NC_003037). Table 1 shows, for each program and dataset, the time that was needed for matching and the number of contigs that could not be matched and thus could not be ordered.
Table 1. Times for matching a set of contigs on a reference genome

            S.suis                  S.meliloti
Program     Time (s)   Unmatched    Time (s)   Unmatched
blast         20.0          0        162.1          0
blat          46.9         94        700.8         84
nucmer         9.8        109         45.6         92
r2cat          6.2        102         45.4         75

Additionally, the number of contigs is given that could not be matched. The employed programs are BLAST (Altschul et al., 1990, blastall v. 2.2.19), BLAT (Kent, 2002, blat v. 15), MUMmer (Kurtz et al., 2004, nucmer v. 3.06), and our matching routine implemented within r2cat. The experiments were performed on a sparcv9 processor operating at 1593 MHz.


5 CONCLUSION

Our software r2cat is suited for quick synteny visualization as well as contig ordering using a single reference genome. The speed of our matching is competitive with other established programs, and the automated contig arrangement is helpful in the finishing phase of a sequencing project by giving valuable hints on the order and orientation of the contigs. The vector graphics export of the visualization provides a handy way to generate publication-quality graphics. Matching, ordering and visualization are combined in a single application that can easily be used with Java WebStart. The program has already been applied in the sequencing project of Rhizobium lupini (now Agrobacterium sp. H13.3). A next step could be to extend the comparative assembly to employ several references and their phylogenetic relationships, as explored e.g. in Husemann and Stoye (2010).
REFERENCES

1.  BLAT--the BLAST-like alignment tool.

Authors:  W James Kent
Journal:  Genome Res       Date:  2002-04       Impact factor: 9.043

2.  Basic local alignment search tool.

Authors:  S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal:  J Mol Biol       Date:  1990-10-05       Impact factor: 5.469

3.  Efficient q-gram filters for finding all epsilon-matches over a given length.

Authors:  Kim R Rasmussen; Jens Stoye; Eugene W Myers
Journal:  J Comput Biol       Date:  2006-03       Impact factor: 1.479

4.  OSLay: optimal syntenic layout of unfinished assemblies.

Authors:  Daniel C Richter; Stephan C Schuster; Daniel H Huson
Journal:  Bioinformatics       Date:  2007-04-26       Impact factor: 6.937

5.  Versatile and open software for comparing large genomes.

Authors:  Stefan Kurtz; Adam Phillippy; Arthur L Delcher; Michael Smoot; Martin Shumway; Corina Antonescu; Steven L Salzberg
Journal:  Genome Biol       Date:  2004-01-30       Impact factor: 13.583

6.  Phylogenetic comparative assembly.

Authors:  Peter Husemann; Jens Stoye
Journal:  Algorithms Mol Biol       Date:  2010-01-04       Impact factor: 1.405

7.  Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies.

Authors:  Sacha A F T van Hijum; Aldert L Zomer; Oscar P Kuipers; Jan Kok
Journal:  Nucleic Acids Res       Date:  2005-07-01       Impact factor: 16.971

8.  ABACAS: algorithm-based automatic contiguation of assembled sequences.

Authors:  Samuel Assefa; Thomas M Keane; Thomas D Otto; Chris Newbold; Matthew Berriman
Journal:  Bioinformatics       Date:  2009-06-03       Impact factor: 6.937
