Marnix H Medema1, Eriko Takano, Rainer Breitling. 1. Department of Microbial Physiology, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Groningen, The Netherlands.
Abstract
The genes encoding many biomolecular systems and pathways are genomically organized in operons or gene clusters. With MultiGeneBlast, we provide a user-friendly and effective tool to perform homology searches with operons or gene clusters as basic units, instead of single genes. The contextualization offered by MultiGeneBlast allows users to get a better understanding of the function, evolutionary history, and practical applications of such genomic regions. The tool is fully equipped with applications to generate search databases from GenBank or from the user's own sequence data. Finally, an architecture search mode allows searching for gene clusters with novel configurations, by detecting genomic regions with any user-specified combination of genes. Sources, precompiled binaries, and a graphical tutorial of MultiGeneBlast are freely available from http://multigeneblast.sourceforge.net/.
Many biological systems and pathways, not only from bacteria, archaea, and fungi, but also from plants (Field and Osbourn 2008) and animals (Garcia-Fernandez 2005) are encoded by genes that are physically clustered together on the chromosome in operons or gene clusters (Fischbach and Voigt 2010). The architectures of these gene clusters are sometimes well-conserved between species, but they may also evolve quickly through rearrangements, insertions, deletions, and duplications. In many cases, knowing the evolutionary context of a gene cluster can reveal much about its function, by offering information on which other organisms possess a similar biomolecular system or pathway as encoded by the gene cluster, which parts are most strongly evolutionarily conserved, and what variants of the system or pathway exist. Homology searching can also be useful for mining large numbers of gene or operon variants from homologous gene clusters, which can then function as building blocks for the synthetic biology engineering of novel pathways or systems (Medema et al. 2012).
Although several efficient and user-friendly tools are available to perform homology searches for single genes and proteins (e.g., National Center for Biotechnology Information [NCBI]’s Basic Local Alignment Search Tool+ [BLAST+] implementation [Camacho et al. 2009]), there are few options to exhaustively mine the databases for homologs of entire operons or gene clusters. Tools such as JGI integrated microbial genomes (IMG; Mavromatis et al. 2009), PSAT (Fong et al. 2008), CCGV (Revanna et al. 2009), EDGAR (Blom et al. 2009), and Absynte (Despalins et al. 2011) each offer the possibility to perform gene neighborhood comparisons across prokaryotic genomes on precomputed data sets, but none of these allow searches against the entire GenBank database (Benson et al. 2013), nor do they allow generating custom databases from the user’s own sequence data. Another tool, SynBlast (Lehmann et al. 2008), is restricted to organisms whose genetic information is deposited in ENSEMBL (Flicek et al. 2012).
Here, we present MultiGeneBlast, a comprehensive BLAST implementation to perform homology searches on multigene modules, which is built as a wrapper around NCBI BLAST+. As with the normal NCBI BLAST+ suite, the user can search the entire GenBank database or create his/her own databases. Additionally, MultiGeneBlast has the ability to perform “architecture searches,” which allow finding genomic loci containing homologs of specific user-specified combinations of genes. Multiple sequence alignments of homologs can be generated automatically after the search, and all results are visualized in a user-friendly interactive eXtensible HyperText Markup Language (XHTML) page.
Implementation of the Software
MultiGeneBlast functions as a Python-based wrapper around the blastp program from the NCBI BLAST+ suite (Camacho et al. 2009), which allows detecting even distant homology between genes by using the amino acid translation as a proxy for the gene sequence. MultiGeneBlast uses a specific database format in which each FASTA header in the database contains information on the parent nucleotide entry of the protein sequence as well as on the start and end positions and strand orientation of the gene that encodes it, in addition to its own functional annotation and accession number. To make it possible to search unannotated genome sequences for homologous gene clusters, raw nucleotide databases can also be created, on which the tblastn algorithm is used instead of blastp. The MultiGeneBlast implementation (fig. 1) builds on code written earlier for gene cluster comparison in antiSMASH (Medema, Blin, et al. 2011).
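To illustrate how a FASTA header can carry the positional information listed above, the following minimal sketch parses such a header into a record. The pipe-delimited field order, the `GeneRecord` class, the `parse_header` function, and the example identifiers are all hypothetical and chosen for illustration only; the actual header layout used by MultiGeneBlast is not specified in this text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    """Information a database header is described as carrying."""
    accession: str   # accession number of the protein
    parent: str      # parent nucleotide entry (scaffold)
    start: int       # start position of the encoding gene
    end: int         # end position of the encoding gene
    strand: str      # "+" or "-"
    annotation: str  # functional annotation


def parse_header(header: str) -> GeneRecord:
    """Parse a header of the ASSUMED form
    >accession|parent|start-end|strand|annotation
    (a hypothetical layout, not the real MultiGeneBlast format)."""
    fields = header.lstrip(">").strip().split("|", 4)
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {header!r}")
    accession, parent, span, strand, annotation = fields
    start_text, end_text = span.split("-")
    if strand not in ("+", "-"):
        raise ValueError(f"unknown strand symbol: {strand!r}")
    return GeneRecord(accession, parent, int(start_text), int(end_text),
                      strand, annotation)


# Placeholder identifiers, not real database entries.
record = parse_header(">prot_0001|scaffold_A|1200-2450|+|putative kinase")
```

Keeping the coordinates in the header means that a BLAST hit can be placed on its parent scaffold without a second lookup, which is the property the mapping step in the search process relies on.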
Fig. 1. Outline of the homology search process by MultiGeneBlast. First, the amino acid translation of each gene sequence within the query gene cluster is searched against the selected MultiGeneBlast database, yielding a data set of BLAST hits. The BLAST hits are then mapped to their parent nucleotide scaffolds, based on the information from the database. The nucleotide scaffolds are then sorted according to their empirical similarity scores with the query gene cluster. Finally, the sorted list of genomic loci is displayed in an interactive XHTML file that can be viewed with any modern web browser.
Setting up a MultiGeneBlast run can be done not only from the command line (table 1) but also with a user-friendly graphical user interface (GUI) (fig. 2) that allows easy selection of genomic regions (see our graphical tutorial in supplementary file S1, Supplementary Material online). As in our gene cluster analysis tool antiSMASH, the output is visualized in an interactive XHTML page that can be opened in a web browser. The XHTML page shows a scalable vector graphics (SVG) visualization of all sorted genomic loci (fig. 3), and clicking on a gene leads to the display of annotation information, details of any blastp/tblastn hit to the (translated) sequence of this gene (percentage identity, sequence coverage, E-value, and bit score), and a direct link to run an individual blastp search with the amino acid translation of this gene on the NCBI server. Optionally, multiple sequence alignments of the amino acid translations of each query gene sequence with those of its homologs can also be generated using MUSCLE (Edgar 2004).
Table 1. Applications in the MultiGeneBlast Package.
Name              Short Description
multigeneblast    Main command-line application to run MultiGeneBlast searches
mgb_gui           GUI for configuring and starting a MultiGeneBlast run
makedb            Application to construct MultiGeneBlast databases from user data
makegbdb          Application to construct MultiGeneBlast databases from GenBank divisions
makendb           Application to construct raw nucleotide MultiGeneBlast databases from user data
makegbndb         Application to construct raw nucleotide MultiGeneBlast databases from GenBank divisions
format_embl.py    Script to generate EMBL input files from a genome sequence + gene annotations
Fig. 2. A user-friendly GUI allows easy construction of databases and easy use of the program. (A) User-friendly selection of input files and databases. (B) Direct download of GenBank entries from NCBI and simple button to download MultiGeneBlast-reformatted GenBank database. (C) Options to design databases from files, from online GenBank entries or from entire GenBank divisions. (D) Link to the MultiGeneBlast website with help pages, a tutorial, and various downloads.
Fig. 3. Example output of a MultiGeneBlast run. The output consists of an interactive XHTML page, in which additional information on each gene appears on mouse-over or by clicking on a gene. This feature works for colored homologous genes and white nonhomologous genes. The first example output shown here displays a homology search for the coumermycin biosynthetic gene cluster, which identifies gene clusters encoding related compounds. The second example output shows the power of an architecture search to find specific pathways: By using a query of a type III polyketide synthase and a terpene cyclase, biosynthetic gene clusters encoding hybrid polyketide-terpene compounds are identified straightforwardly. Single alignments of the query gene clusters with any particular hit gene cluster can also be selected from a drop-down menu. All gene cluster images are stored in SVG format, so they can easily be transformed into publication-quality figures.
Two Distinct Search Modes
MultiGeneBlast offers two distinct search modes: “homology search” and “architecture search.” The homology search mode serves to find homologs of a known operon or gene cluster and, hence, is an extended version of a standard BLAST homology search. The input for a homology search consists of an annotated genome sequence in GBK or EMBL format, together with the start and end locations spanning the query gene cluster or operon. As an alternative to start and end sites, a list of the genes that constitute the gene cluster can be provided, which has the advantage that specific genes within the gene cluster can be left out of the analysis. After running a separate blastp search for each amino acid sequence encoded in the query genomic region, MultiGeneBlast locates all hits on their parent nucleotide scaffolds in the database. Each nucleotide scaffold that received blastp/tblastn hits is then subdivided into genomic loci containing blastp/tblastn hits with a maximum mutual distance of a given number of kilobases. The default value for this distance is 20 kb, a value which has been shown to work well for most bacterial gene clusters (Medema, Blin, et al. 2011), but higher values could work better for gene clusters in fungi and plants. Similar to the ClusterBlast implementation in antiSMASH (Medema, Blin, et al. 2011), genomic loci are then sorted by an empirical similarity score S = h + i · s, in which h represents the number of query genes with BLAST hits of at least a user-specified sequence coverage and percentage identity to the query, s represents the number of contiguous gene pairs with conserved synteny, and i represents a weighting factor that determines the weight of the synteny in determining the score. The default value for i is 0.5, which gives the number of homologous genes twice the weight of the conservation of synteny. If the obtained scores are equal, the loci are subsequently sorted by their cumulative blastp/tblastn bit scores.
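The locus subdivision and the score S = h + i · s described above can be sketched as follows. This is an illustrative approximation rather than the MultiGeneBlast source code: the `Hit` class and function names are hypothetical, hits are assumed to have already passed the coverage and identity thresholds, "maximum mutual distance" is interpreted as the gap between a hit and the preceding hits in the same locus, and a gene pair is counted as syntenic when two hits that are neighbors in the locus correspond to neighboring genes in the query.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    query_index: int  # position of the query gene within the query cluster
    scaffold: str     # parent nucleotide scaffold of the hit
    start: int
    end: int
    bit_score: float


def group_into_loci(hits: List[Hit], max_distance: int = 20_000) -> List[List[Hit]]:
    """Split hits into loci: a new locus starts when the scaffold changes
    or the gap to the current locus exceeds max_distance (default 20 kb)."""
    loci: List[List[Hit]] = []
    for hit in sorted(hits, key=lambda h: (h.scaffold, h.start)):
        current = loci[-1] if loci else None
        if (current
                and current[-1].scaffold == hit.scaffold
                and hit.start - max(h.end for h in current) <= max_distance):
            current.append(hit)
        else:
            loci.append([hit])
    return loci


def similarity_score(locus: List[Hit], synteny_weight: float = 0.5) -> float:
    """S = h + i * s, with i = synteny_weight (default 0.5)."""
    ordered = sorted(locus, key=lambda h: h.start)
    h = len({hit.query_index for hit in ordered})  # query genes with a hit
    # ASSUMPTION: neighboring hits that map to neighboring query genes
    # count as one syntenic pair.
    s = sum(1 for a, b in zip(ordered, ordered[1:])
            if abs(a.query_index - b.query_index) == 1)
    return h + synteny_weight * s


def rank_loci(loci: List[List[Hit]], synteny_weight: float = 0.5) -> List[List[Hit]]:
    """Sort by score, breaking ties by cumulative bit score."""
    return sorted(
        loci,
        key=lambda locus: (similarity_score(locus, synteny_weight),
                           sum(hit.bit_score for hit in locus)),
        reverse=True,
    )


hits = [
    Hit(0, "scaffold_A", 1_000, 2_000, 200.0),
    Hit(1, "scaffold_A", 2_500, 3_500, 150.0),
    Hit(2, "scaffold_A", 4_000, 5_000, 100.0),
    Hit(0, "scaffold_A", 60_000, 61_000, 300.0),  # more than 20 kb away
    Hit(1, "scaffold_B", 100, 900, 50.0),
]
ranked = rank_loci(group_into_loci(hits))
```

In this example the three neighboring hits on `scaffold_A` form one locus with h = 3 and s = 2, giving S = 4.0, whereas the two single-hit loci both score 1.0 and are ordered by bit score.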
When testing the algorithm on a number of (semi-)manual gene cluster comparisons from the recent scientific literature, we observed that MultiGeneBlast could replicate their results accurately, as well as identify additional homologous but compositionally distinct gene clusters (supplementary file S2, Supplementary Material online).
The architecture search mode differs from a standard homology search in that the query input consists not of a known genomic region but of a FASTA file with multiple protein sequence entries, designed by the user. Thus, the user can search for all genomic loci containing a combination of certain genes within the same gene cluster. This can be of great use, for example, when searching for gene clusters encoding specific metabolic pathways containing a specified combination of enzymatic steps.
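The idea behind an architecture search can be sketched as a filter over loci: read the names of the query proteins from a multi-entry FASTA file, then keep the loci in which enough of those proteins have a homolog. The function names, the `min_fraction` parameter, and the locus and protein identifiers below are hypothetical and serve only as an illustration; the mapping from loci to hit query names is assumed to come from a preceding BLAST step.

```python
import math
from typing import Dict, Iterable, List, Set


def read_fasta_names(text: str) -> List[str]:
    """Return the first word of every non-empty FASTA header line."""
    return [line[1:].split()[0]
            for line in text.splitlines()
            if line.startswith(">") and line[1:].strip()]


def filter_loci(locus_hits: Dict[str, Set[str]],
                query_names: Iterable[str],
                min_fraction: float = 1.0) -> List[str]:
    """Keep loci containing homologs of at least min_fraction of the query
    proteins, most complete loci first (ties sorted by locus name).
    min_fraction is an ASSUMED parameter for this sketch."""
    names = set(query_names)
    needed = math.ceil(min_fraction * len(names))
    matches = [(len(found & names), locus)
               for locus, found in locus_hits.items()
               if len(found & names) >= needed]
    matches.sort(key=lambda pair: (-pair[0], pair[1]))
    return [locus for _, locus in matches]


# Query modeled on the example in the text: a type III polyketide
# synthase combined with a terpene cyclase (sequences are placeholders).
query_fasta = """>type_III_pks
MSTAAA
>terpene_cyclase
MKLVVV
"""
locus_hits = {
    "locus_1": {"type_III_pks", "terpene_cyclase"},
    "locus_2": {"type_III_pks"},
    "locus_3": set(),
}
names = read_fasta_names(query_fasta)
complete = filter_loci(locus_hits, names)
partial = filter_loci(locus_hits, names, min_fraction=0.5)
```

Unlike the homology search, nothing here depends on the order of the query proteins, which reflects that the query is a user-designed combination of genes rather than a known genomic region.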
Creating Custom Databases for MultiGeneBlast
MultiGeneBlast is shipped with a database consisting of the translated amino acid sequences of all gene sequences in the GenBank database (December 12, 2012), reformatted with new FASTA headers as stated earlier. Updated versions of this database will be made available for download regularly. MultiGeneBlast also offers two tools to generate custom databases. The first tool, MakeGBDB, allows the user to construct databases from a specified subset of the GenBank subdivisions (such as BCT for bacteria and PLN for plants). The tool downloads the specified subdivisions from the NCBI FTP server and then parses them to generate a MultiGeneBlast database. The second tool, MakeDB, allows the user to construct databases from his/her own sequence data and takes as input a user-specified set of sequence files in GBK or EMBL format. For convenience, a script to generate EMBL files from nucleotide FASTA files and gene annotations is also provided.
New Approaches
MultiGeneBlast is the first full-fledged BLAST implementation that combines the input of multiple genes into a single query. Compared with previous tools for the comparative analysis of operons and gene clusters, MultiGeneBlast offers a unique set of options (table 2).
Table 2.
Comparison of Different Software Tools for Gene Cluster Homology Searches.
Software
Web Tool
Stand-Alone Tool
Not Restricted to Precomputed Data
Can Search Entire GenBank Database
Based on Multiple Gene Queries
Allows Input of Personal Sequence Data
Allows Creation of Custom Databases
Architecture Search Mode
Command-Line Available
Open Source
MultiGeneBlast
X
X
X
X
X
X
X
X
X
IMG
X
EDGAR
X
Absynte
X
X
PSAT
X
X
CCGV
X
X
X
X
X
SynBlast
X
X
X
X
First, MultiGeneBlast allows the creation of databases from any combination of published and unpublished data, including the user’s personal sequence data. As the costs of DNA sequencing are continuously decreasing, more and more laboratories have large amounts of unpublished sequence data that need to be analyzed before online publication. No tools have been published thus far that offer the possibility to select the user’s own sequence data as both query and subject of the analysis. The IMG framework, arguably the most popular tool for gene neighborhood analysis at the moment, by design does not allow any custom queries that are outside the precomputed database, nor does it offer the option to search against custom-designed databases. In contrast, the user-friendly GUI of MultiGeneBlast makes it easy even for biologists with little or no bioinformatic expertise to design their own databases and search them with their own sequence data.
Second, most existing tools do not allow searches against the entire GenBank database but only against subsets of sequences (usually whole genome sequences) for which precomputed results have been obtained. Thousands of known and characterized gene clusters (especially biosynthetic ones) are not part of any whole-genome sequence but were instead cloned directly from the environment, or are part of a metagenomic data set, and are therefore not present in databases such as that of IMG. MultiGeneBlast, however, offers the opportunity to perform a truly exhaustive search to find all homologous genetic elements that are present in the current databases.
Third, the architecture search mode is unique to MultiGeneBlast and allows finding operons that are not similar to any operon known in advance by the user but instead contain homologs of a user-specified combination of genes.
Finally, unlike most available tools, MultiGeneBlast can be used from the command line and also generates a tab-delimited TXT output, so it can easily be integrated in a larger computational pipeline. With relatively simple scripting, large numbers of queries can thus be searched against one or more databases to perform higher-level bioinformatic analyses.
Practical Applications of MultiGeneBlast
MultiGeneBlast offers a simple and intuitive tool to perform comparative genomic analysis, facilitating functional inference and evolutionary studies of gene clusters encoding biomolecular machines or pathways.
A major application of MultiGeneBlast is to get a quick overview of the biomolecular diversity of an entire genetic element in diverse organisms and to survey all the variants that have evolved. Because MultiGeneBlast does not just display the genomic neighborhoods of one single gene but finds genomic loci with a combination of any of a list of query genes, the output will contain variants of the query genomic region consisting of any subset of that region in any arrangement. This avoids the risk of missing variants that do not contain the query gene, in contrast to approaches based on single gene input. When combining the list of identified gene cluster variants with phylogenetic information (of either species or representative genes), the evolutionary history of a gene cluster can be reconstructed, which can give valuable insight into the biomolecular functions of the various components of the encoded system. Based on patterns of evolutionary conservation, one can sometimes also get a better idea of which genes do and which genes do not belong to the gene cluster as a functional unit.
Often, distinct subclusters with separate evolutionary histories together constitute a larger gene cluster (Fischbach et al. 2008). A MultiGeneBlast analysis of the entire gene cluster may reveal its fundamental architecture, through the identification of distinct patterns of conservation of various subsets of genes from the gene cluster. This also cannot be achieved by approaches based on a single gene query.
Another important and promising application of the approach is to rapidly harvest gene parts for the synthetic biology design of biochemical pathways (Medema, Breitling et al. 2011; Medema et al. 2012). When generating synthetic versions of a particular biochemical system for heterologous implementation in a pre-engineered host, it is of great importance to test multiple versions of the system to find the one that functions best in a particular organism (Bayer et al. 2009). Because MultiGeneBlast can search the entire GenBank database, as well as any personal sequence data that may be available, it can quickly and reliably be used to identify all extant versions of an operon or gene cluster in an exhaustive manner.
Of course, many more applications of the tool are possible, as the colocalization of functionally related genes is a recurring evolutionary motif: MultiGeneBlast provides a general search tool that can be exploited in a wide range of comparative genomics studies of homologous multigene units, by expert bioinformaticians and experimental biologists alike.
Supplementary Material
Supplementary files S1 and S2 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
Christiam Camacho; George Coulouris; Vahram Avagyan; Ning Ma; Jason Papadopoulos; Kevin Bealer; Thomas L Madden. BMC Bioinformatics. 2009-12-15.
Travis S Bayer; Daniel M Widmaier; Karsten Temme; Ethan A Mirsky; Daniel V Santi; Christopher A Voigt. J Am Chem Soc. 2009-05-13.
Marnix H Medema; Kai Blin; Peter Cimermancic; Victor de Jager; Piotr Zakrzewski; Michael A Fischbach; Tilmann Weber; Eriko Takano; Rainer Breitling. Nucleic Acids Res. 2011-06-14.
Konstantinos Mavromatis; Ken Chu; Natalia Ivanova; Sean D Hooper; Victor M Markowitz; Nikos C Kyrpides. PLoS One. 2009-11-24.
Jochen Blom; Stefan P Albaum; Daniel Doppmeier; Alfred Pühler; Frank-Jörg Vorhölter; Martha Zakrzewski; Alexander Goesmann. BMC Bioinformatics. 2009-05-20.