Detecting sequence homology at the gene cluster level with MultiGeneBlast.

Marnix H Medema, Eriko Takano, Rainer Breitling.

Abstract

The genes encoding many biomolecular systems and pathways are genomically organized in operons or gene clusters. With MultiGeneBlast, we provide a user-friendly and effective tool to perform homology searches with operons or gene clusters as basic units, instead of single genes. The contextualization offered by MultiGeneBlast allows users to get a better understanding of the function, evolutionary history, and practical applications of such genomic regions. The tool is fully equipped with applications to generate search databases from GenBank or from the user's own sequence data. Finally, an architecture search mode allows searching for gene clusters with novel configurations, by detecting genomic regions with any user-specified combination of genes. Sources, precompiled binaries, and a graphical tutorial of MultiGeneBlast are freely available from http://multigeneblast.sourceforge.net/.

Year: 2013 PMID: 23412913 PMCID: PMC3670737 DOI: 10.1093/molbev/mst025

Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240

Background and Rationale

Many biological systems and pathways, not only from bacteria, archaea, and fungi, but also from plants (Field and Osbourn 2008) and animals (Garcia-Fernandez 2005) are encoded by genes that are physically clustered together on the chromosome in operons or gene clusters (Fischbach and Voigt 2010). The architectures of these gene clusters are sometimes well-conserved between species, but they may also evolve quickly through rearrangements, insertions, deletions, and duplications. In many cases, knowing the evolutionary context of a gene cluster can reveal much about its function, by offering information on which other organisms possess a similar biomolecular system or pathway as encoded by the gene cluster, which parts are most strongly evolutionarily conserved, and what variants of the system or pathway exist. Homology searching can also be useful for mining large numbers of gene or operon variants from homologous gene clusters, which can then function as building blocks for the synthetic biology engineering of novel pathways or systems (Medema et al. 2012).

Although several efficient and user-friendly tools are available to perform homology searches for single genes and proteins (e.g., National Center for Biotechnology Information [NCBI]’s Basic Local Alignment Search Tool+ [BLAST+] implementation [Camacho et al. 2009]), there are few options to exhaustively mine the databases for homologs of entire operons or gene clusters. Tools such as JGI integrated microbial genomes (IMG; Mavromatis et al. 2009), PSAT (Fong et al. 2008), CCGV (Revanna et al. 2009), EDGAR (Blom et al. 2009), and Absynte (Despalins et al. 2011) each offer the possibility to perform gene neighborhood comparisons across prokaryotic genomes on precomputed data sets, but none of these allow searches against the entire GenBank database (Benson et al. 2013), nor do they allow generating custom databases from the user’s own sequence data. Another tool, SynBlast (Lehmann et al. 2008), is restricted to organisms whose genetic information is deposited in ENSEMBL (Flicek et al. 2012).

Here, we present MultiGeneBlast, a comprehensive BLAST implementation to perform homology searches on multigene modules, which is built as a wrapper around NCBI BLAST+. As with the normal NCBI BLAST+ suite, the user can search the entire GenBank database or create his/her own databases. Additionally, MultiGeneBlast has the ability to perform “architecture searches,” which allow finding genomic loci containing homologs of specific user-specified combinations of genes. Multiple sequence alignments of homologs can be generated automatically after the search, and all results are visualized in a user-friendly interactive eXtensible HyperText Markup Language (XHTML) page.

Implementation of the Software

MultiGeneBlast functions as a Python-based wrapper around the blastp program from the NCBI BLAST+ suite (Camacho et al. 2009), which allows detecting even distant homology between genes by using the amino acid translation as a proxy for the gene sequence. MultiGeneBlast uses a specific database format in which each FASTA header contains, besides the functional annotation and accession number of the protein, information on its parent nucleotide entry and on the start position, end position, and strand orientation of the gene that encodes it. To make it possible to search unannotated genome sequences for homologous gene clusters, raw nucleotide databases can also be created, on which the tblastn algorithm is used instead of blastp. The MultiGeneBlast implementation (fig. 1) extends code written earlier for gene cluster comparison in antiSMASH (Medema, Blin, et al. 2011).
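To make the idea of a location-aware FASTA header concrete, the following is a minimal sketch in Python. The pipe-delimited field order, the `GeneRecord` class, the function names, and the example accession are all hypothetical and chosen for illustration; the actual header layout used by MultiGeneBlast is not specified in the text and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    """One protein entry plus the genomic context of its gene."""
    parent: str      # accession of the parent nucleotide entry
    start: int       # start position of the gene on the parent entry
    end: int         # end position of the gene on the parent entry
    strand: str      # "+" or "-"
    accession: str   # protein accession number
    annotation: str  # functional annotation (free text)


def format_header(rec: GeneRecord) -> str:
    """Build a FASTA header in the assumed (hypothetical) layout."""
    return (f">{rec.parent}|{rec.start}-{rec.end}|{rec.strand}"
            f"|{rec.accession}|{rec.annotation}")


def parse_header(line: str) -> GeneRecord:
    """Recover the genomic context from a header in the assumed layout."""
    # Split at most 4 times so that "|" inside the annotation is preserved.
    fields = line.strip().lstrip(">").split("|", 4)
    if len(fields) != 5:
        raise ValueError(f"unexpected header layout: {line!r}")
    parent, span, strand, accession, annotation = fields
    start, end = span.split("-")
    return GeneRecord(parent, int(start), int(end), strand,
                      accession, annotation)


# Made-up example entry, used only to show the round trip.
example = GeneRecord("NC_000001", 100, 1300, "+", "ABC12345",
                     "hypothetical protein")
header = format_header(example)
```

Storing the coordinates in the header means that a BLAST hit to a protein can be placed on its parent scaffold without a second lookup in the original GenBank or EMBL file.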

Fig. 1. Outline of the homology search process by MultiGeneBlast. First, the amino acid translation of each gene sequence within the query gene cluster is searched against the selected MultiGeneBlast database, yielding a data set of BLAST hits. The BLAST hits are then mapped to their parent nucleotide scaffolds, based on the information from the database. The nucleotide scaffolds are then sorted according to their empirical similarity scores with the query gene cluster. Finally, the sorted list of genomic loci is displayed in an interactive XHTML file that can be viewed with any modern web browser.

Setting up a MultiGeneBlast run can be done not only from the command line (table 1) but also with a user-friendly graphical user interface (GUI) (fig. 2) that allows easy selection of genomic regions (see our graphical tutorial in supplementary file S1, Supplementary Material online). As in our gene cluster analysis tool antiSMASH, the output is visualized in an interactive XHTML page that can be opened in a web browser. The XHTML page shows a scalable vector graphics (SVG) visualization of all sorted genomic loci (fig. 3), and clicking on a gene leads to the display of annotation information, details of any blastp/tblastn hit to the (translated) sequence of this gene (percentage identity, sequence coverage, E-value, and bit score), and a direct link to run an individual blastp search with the amino acid translation of this gene on the NCBI server. Optionally, multiple sequence alignments of the amino acid translations of each query gene sequence with those of its homologs can also be generated using MUSCLE (Edgar 2004).
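The step in which BLAST hits are mapped to their parent nucleotide scaffolds can be sketched as follows. This is an illustration under stated assumptions, not the MultiGeneBlast code: it assumes the hits are available in the default 12-column BLAST+ tabular format (`-outfmt 6`), and the `gene_locations` lookup table (subject ID to scaffold, start, end) is a hypothetical stand-in for the location information stored in the database.

```python
from collections import defaultdict

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ("qseqid sseqid pident length mismatch gapopen "
           "qstart qend sstart send evalue bitscore").split()


def parse_tabular(lines):
    """Parse tab-separated BLAST hits into a list of dictionaries."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, line.split("\t")))
        hits.append({
            "query": row["qseqid"],
            "subject": row["sseqid"],
            "pident": float(row["pident"]),
            "evalue": float(row["evalue"]),
            "bitscore": float(row["bitscore"]),
        })
    return hits


def map_to_scaffolds(hits, gene_locations):
    """Group hits by parent scaffold, sorted by gene start position.

    gene_locations maps a subject ID to (scaffold, start, end); hits to
    subjects without a known location are dropped.
    """
    by_scaffold = defaultdict(list)
    for hit in hits:
        location = gene_locations.get(hit["subject"])
        if location is None:
            continue
        scaffold, start, end = location
        by_scaffold[scaffold].append({**hit, "start": start, "end": end})
    for scaffold_hits in by_scaffold.values():
        scaffold_hits.sort(key=lambda h: h["start"])
    return dict(by_scaffold)
```

The grouped and position-sorted hits are the natural input for splitting each scaffold into loci and scoring them.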

Table 1.

Applications in the MultiGeneBlast Package.

Name	Short Description
multigeneblast	Main command-line application to run MultiGeneBlast searches
mgb_gui	GUI for configuring and starting a MultiGeneBlast run
makedb	Application to construct MultiGeneBlast databases from user data
makegbdb	Application to construct MultiGeneBlast databases from GenBank divisions
makendb	Application to construct raw nucleotide MultiGeneBlast databases from user data
makegbndb	Application to construct raw nucleotide MultiGeneBlast databases from GenBank divisions
format_embl.py	Script to generate EMBL input files from a genome sequence + gene annotations

Fig. 3. Example output of a MultiGeneBlast run. The output consists of an interactive XHTML page, in which additional information on each gene appears on mouse-over or by clicking on a gene. This feature works for colored homologous genes and white nonhomologous genes. The first example output shown here displays a homology search for the coumermycin biosynthetic gene cluster, which identifies gene clusters encoding related compounds. The second example output shows the power of an architecture search to find specific pathways: By using a query of a type III polyketide synthase and a terpene cyclase, biosynthetic gene clusters encoding hybrid polyketide-terpene compounds are identified straightforwardly. Single alignments of the query gene clusters with any particular hit gene cluster can also be selected from a drop-down menu. All gene cluster images are stored in SVG format, so they can easily be transformed into publication-quality figures.

Fig. 2. A user-friendly GUI allows easy construction of databases and easy use of the program. (A) User-friendly selection of input files and databases. (B) Direct download of GenBank entries from NCBI and simple button to download MultiGeneBlast-reformatted GenBank database. (C) Options to design databases from files, from online GenBank entries or from entire GenBank divisions. (D) Link to the MultiGeneBlast website with help pages, a tutorial, and various downloads.

Two Distinct Search Modes

MultiGeneBlast offers two distinct search modes: “homology search” and “architecture search.” The homology search mode serves to find homologs of a known operon or gene cluster and, hence, is an extended version of a standard BLAST homology search. The input for a homology search consists of an annotated genome sequence in GBK or EMBL format, together with the start and end locations spanning the query gene cluster or operon. As an alternative to start and end locations, a list of the genes that constitute the gene cluster can be provided, which has the advantage that specific genes within the gene cluster can be left out of the analysis. After running a separate blastp search for each amino acid sequence encoded in the query genomic region, MultiGeneBlast locates all hits on their parent nucleotide scaffolds in the database. Each nucleotide scaffold that received blastp/tblastn hits is then subdivided into genomic loci containing blastp/tblastn hits with a maximum mutual distance of a given number of kilobases. The default value for this distance is 20 kb, a value which has been shown to work well for most bacterial gene clusters (Medema, Blin, et al. 2011), but higher values could work better for gene clusters in fungi and plants. Similar to the ClusterBlast implementation in antiSMASH (Medema, Blin, et al. 2011), genomic loci are then sorted by an empirical similarity score S = h + i · s, in which h represents the number of query genes with BLAST hits of at least a user-specified sequence coverage and percentage identity to the query, s represents the number of contiguous gene pairs with conserved synteny, and i represents a weighting factor that determines the weight of the synteny in determining the score. The default value for i is 0.5, which gives the number of homologous genes twice the weight of the conservation of synteny. If the obtained scores are equal, the loci are subsequently sorted by their cumulative blastp/tblastn bit scores.
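The locus splitting and the score S = h + i · s described above can be illustrated with a short Python sketch. Several details are assumptions made for the example rather than statements about the MultiGeneBlast source: hits are taken to be already filtered by coverage and identity, "maximum mutual distance" is read as the gap between a hit and the preceding hits of the same locus, and a gene pair is counted as syntenic when two neighboring hit genes match query genes that are also neighbors in the query. All function and field names are hypothetical.

```python
def split_into_loci(hits, max_distance=20_000):
    """Split the hits on one scaffold into loci.

    A hit joins the current locus when its start lies within
    max_distance bp of the furthest end seen in that locus
    (an assumed reading of the 20 kb default).
    """
    loci = []
    for hit in sorted(hits, key=lambda h: h["start"]):
        if loci and hit["start"] - max(h["end"] for h in loci[-1]) <= max_distance:
            loci[-1].append(hit)
        else:
            loci.append([hit])
    return loci


def similarity_score(locus, query_order, synteny_weight=0.5):
    """Compute S = h + i * s for one locus."""
    position = {gene: idx for idx, gene in enumerate(query_order)}
    # h: number of distinct query genes with a hit in this locus.
    h = len({hit["query"] for hit in locus})
    # s: neighboring hit pairs whose query genes are also neighbors
    # (assumed definition of conserved synteny).
    ordered = sorted(locus, key=lambda x: x["start"])
    s = sum(
        1
        for a, b in zip(ordered, ordered[1:])
        if abs(position[a["query"]] - position[b["query"]]) == 1
    )
    return h + synteny_weight * s


def rank_loci(loci, query_order, synteny_weight=0.5):
    """Sort loci by score, breaking ties by cumulative bit score."""
    return sorted(
        loci,
        key=lambda locus: (
            similarity_score(locus, query_order, synteny_weight),
            sum(hit["bitscore"] for hit in locus),
        ),
        reverse=True,
    )
```

With the default weight of 0.5, a locus that contains hits to three query genes in their original order has h = 3 and s = 2, giving S = 3 + 0.5 · 2 = 4, whereas an isolated single-gene hit scores 1.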
When testing the algorithm on a number of (semi-)manual gene cluster comparisons from the recent scientific literature, we observed that MultiGeneBlast could replicate their results accurately, as well as identify additional homologous but compositionally distinct gene clusters (supplementary file S2, Supplementary Material online).

The architecture search mode differs from a standard homology search in that the query input consists not of a known genomic region but of a FASTA file with multiple protein sequence entries, designed by the user. Thus, the user can search for all genomic loci containing a combination of certain genes within the same gene cluster. This can be of great use, for example, when searching for gene clusters encoding specific metabolic pathways containing a specified combination of enzymatic steps.
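The core idea of an architecture search, keeping only loci that contain homologs of a user-specified combination of proteins, can be sketched in Python as follows. This is a simplified illustration with hypothetical function names and data layout; it assumes that hits have already been grouped into loci, and the optional `min_queries` threshold is an assumption of the sketch, not a documented MultiGeneBlast option.

```python
def read_fasta_ids(fasta_text):
    """Return the sequence IDs (first word of each header) in a FASTA text."""
    return [
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and len(line) > 1
    ]


def architecture_matches(loci, query_ids, min_queries=None):
    """Select loci that contain hits to the required query proteins.

    loci        : list of loci, each a list of hit dicts with a "query" key
    query_ids   : IDs of the proteins in the user-designed query FASTA
    min_queries : minimum number of distinct query proteins that must be
                  hit; defaults to all of them (hypothetical parameter)
    """
    required = set(query_ids)
    threshold = len(required) if min_queries is None else min_queries
    matches = []
    for locus in loci:
        found = {hit["query"] for hit in locus} & required
        if len(found) >= threshold:
            matches.append((locus, sorted(found)))
    return matches


# Made-up query with two proteins, mirroring the polyketide-terpene example.
query_fasta = ">pks3 type III polyketide synthase\nMSTA\n>tc terpene cyclase\nMKLV\n"
query_ids = read_fasta_ids(query_fasta)
```

Because the query is a free combination of proteins rather than an existing genomic region, gene order in the query carries no meaning here, so the filter looks only at which query proteins are represented in each locus.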

Creating Custom Databases for MultiGeneBlast

MultiGeneBlast is shipped with a database consisting of the translated amino acid sequences of all gene sequences in the GenBank database (December 12, 2012), reformatted with new FASTA headers as stated earlier. Updated versions of this database will be made available for download regularly. MultiGeneBlast also offers two tools to generate custom databases. The first tool, MakeGBDB, allows the user to construct databases from a specified subset of the GenBank subdivisions (such as BCT for bacteria and PLN for plants). The tool downloads the specified subdivisions from the NCBI FTP server and then parses them to generate a MultiGeneBlast database. The second tool, MakeDB, allows the user to construct databases from his/her own sequence data and takes as input a user-specified set of sequence files in GBK or EMBL format. For convenience, a script to generate EMBL files from nucleotide FASTA files and gene annotations is also provided.

New Approaches

MultiGeneBlast is the first full-fledged BLAST implementation that combines the input of multiple genes into a single query. Compared with previous tools for the comparative analysis of operons and gene clusters, MultiGeneBlast offers a unique set of options (table 2).

Table 2.

Comparison of Different Software Tools for Gene Cluster Homology Searches.

Software	Web Tool	Stand-Alone Tool	Not Restricted to Precomputed Data	Can Search Entire GenBank Database	Based on Multiple Gene Queries	Allows Input of Personal Sequence Data	Allows Creation of Custom Databases	Architecture Search Mode	Command-Line Available	Open Source
MultiGeneBlast		X	X	X	X	X	X	X	X	X
IMG	X
EDGAR	X
Absynte	X					X
PSAT	X									X
CCGV	X		X			X			X	X
SynBlast		X	X						X	X

First, MultiGeneBlast allows users to create databases from any combination of published and unpublished data, including the user’s personal sequence data. As the costs of DNA sequencing are continuously decreasing, more and more laboratories have large amounts of unpublished sequence data that need to be analyzed before online publication. No tools have been published thus far that offer the possibility to select the user’s own sequence data as both query and subject of the analysis. The IMG framework, arguably the most popular tool for gene neighborhood analysis at the moment, by design does not allow any custom queries that are outside the precomputed database, nor does it offer the option to search against custom-designed databases. In contrast, the user-friendly GUI of MultiGeneBlast makes it easy even for biologists with little or no bioinformatic expertise to design their own databases and search them with their own sequence data.

Second, most existing tools do not allow searches against the entire GenBank database but only against subsets of sequences (usually whole genome sequences) for which precomputed results have been obtained. Thousands of known and characterized gene clusters (especially biosynthetic ones) are not part of any whole-genome sequence but were instead cloned directly from the environment, or are part of a metagenomic data set, and are therefore not present in databases such as that of IMG. MultiGeneBlast, however, offers the opportunity to perform a truly exhaustive search to find all homologous genetic elements that are present in the current databases.

Third, the architecture search mode is unique to MultiGeneBlast and allows finding operons that are not similar to any operon known in advance by the user but instead contain homologs of a user-specified combination of genes.
Finally, unlike most available tools, MultiGeneBlast can be used from the command line and also generates a tab-delimited TXT output, so it can easily be integrated in a larger computational pipeline. With relatively simple scripting, large numbers of queries can thus be searched against one or more databases to perform higher-level bioinformatic analyses.

Practical Applications of MultiGeneBlast

MultiGeneBlast offers a simple and intuitive tool to perform comparative genomic analysis, facilitating functional inference and evolutionary studies of gene clusters encoding biomolecular machines or pathways. A major application of MultiGeneBlast is to get a quick overview of the biomolecular diversity of an entire genetic element in diverse organisms and to survey all the variants that have evolved. Because MultiGeneBlast does not just display the genomic neighborhoods of one single gene but finds genomic loci with a combination of any of a list of query genes, the output will contain variants of the query genomic region consisting of any subset of that region in any arrangement. This avoids the risk of missing variants that do not contain the query gene, in contrast to approaches based on single gene input. When combining the list of identified gene cluster variants with phylogenetic information (of either species or representative genes), the evolutionary history of a gene cluster can be reconstructed, which can give valuable insight into the biomolecular functions of the various components of the encoded system. Based on patterns of evolutionary conservation, one can sometimes also get a better idea of which genes do and which genes do not belong to the gene cluster as a functional unit. Often, distinct subclusters with separate evolutionary histories together constitute a larger gene cluster (Fischbach et al. 2008). A MultiGeneBlast analysis of the entire gene cluster may reveal its fundamental architecture, through the identification of distinct patterns of conservation of various subsets of genes from the gene cluster. This also cannot be achieved by approaches based on a single gene query. Another important and promising application of the approach is to rapidly harvest gene parts for the synthetic biology design of biochemical pathways (Medema, Breitling et al. 2011; Medema et al. 2012). 
When generating synthetic versions of a particular biochemical system for heterologous implementation in a pre-engineered host, it is of great importance to test multiple versions of the system to find the one that functions best in a particular organism (Bayer et al. 2009). Because MultiGeneBlast can search the entire GenBank database, as well as any personal sequence data that may be available, it can quickly and reliably be used to identify all extant versions of an operon or gene cluster in an exhaustive manner. Of course, many more applications of the tool are possible, as the colocalization of functionally related genes is a recurring evolutionary motif: MultiGeneBlast provides a general search tool that can be exploited in a wide range of comparative genomics studies of homologous multigene units, by expert bioinformaticians and experimental biologists alike.

Supplementary Material

Supplementary files S1 and S2 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

References (10 of 18 listed)

1. Exploiting plug-and-play synthetic biology for drug discovery and production in microorganisms. (Review)

Authors: Marnix H Medema; Rainer Breitling; Roel Bovenberg; Eriko Takano
Journal: Nat Rev Microbiol Date: 2010-12-29 Impact factor: 60.633

2. A web-based software system for dynamic gene cluster comparison across multiple genomes.

Authors: Kashi Vishwanath Revanna; Vivek Krishnakumar; Qunfeng Dong
Journal: Bioinformatics Date: 2009-02-09 Impact factor: 6.937

3. Prokaryotic gene clusters: a rich toolbox for synthetic biology. (Review)

Authors: Michael Fischbach; Christopher A Voigt
Journal: Biotechnol J Date: 2010-12 Impact factor: 4.677

4. BLAST+: architecture and applications.

Authors: Christiam Camacho; George Coulouris; Vahram Avagyan; Ning Ma; Jason Papadopoulos; Kevin Bealer; Thomas L Madden
Journal: BMC Bioinformatics Date: 2009-12-15 Impact factor: 3.169

5. Metabolic diversification--independent assembly of operon-like gene clusters in different plants.

Authors: Ben Field; Anne E Osbourn
Journal: Science Date: 2008-03-20 Impact factor: 47.728

6. Synthesis of methyl halides from biomass using engineered microbes.

Authors: Travis S Bayer; Daniel M Widmaier; Karsten Temme; Ethan A Mirsky; Daniel V Santi; Christopher A Voigt
Journal: J Am Chem Soc Date: 2009-05-13 Impact factor: 15.419

7. antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences.

Authors: Marnix H Medema; Kai Blin; Peter Cimermancic; Victor de Jager; Piotr Zakrzewski; Michael A Fischbach; Tilmann Weber; Eriko Takano; Rainer Breitling
Journal: Nucleic Acids Res Date: 2011-06-14 Impact factor: 16.971

8. Gene context analysis in the Integrated Microbial Genomes (IMG) data management system.

Authors: Konstantinos Mavromatis; Ken Chu; Natalia Ivanova; Sean D Hooper; Victor M Markowitz; Nikos C Kyrpides
Journal: PLoS One Date: 2009-11-24 Impact factor: 3.240

9. EDGAR: a software framework for the comparative analysis of prokaryotic genomes.

Authors: Jochen Blom; Stefan P Albaum; Daniel Doppmeier; Alfred Pühler; Frank-Jörg Vorhölter; Martha Zakrzewski; Alexander Goesmann
Journal: BMC Bioinformatics Date: 2009-05-20 Impact factor: 3.169

10. PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes.

Authors: Christine Fong; Laurence Rohmer; Matthew Radey; Michael Wasnick; Mitchell J Brittnacher
Journal: BMC Bioinformatics Date: 2008-03-26 Impact factor: 3.169
