
GCView: the genomic context viewer for protein homology searches.

Abstract

Genomic neighborhood can provide important insights into evolution and function of a protein or gene. When looking at operons, changes in operon structure and composition can only be revealed by looking at the operon as a whole. To facilitate the analysis of the genomic context of a query in multiple organisms we have developed Genomic Context Viewer (GCView). GCView accepts results from one or multiple protein homology searches such as BLASTp as input. For each hit, the neighboring protein-coding genes are extracted, the regions of homology are labeled for each input and the results are presented as a clear, interactive graphical output. It is also possible to add more searches to iteratively refine the output. GCView groups outputs by the hits for different proteins. This allows for easy comparison of different operon compositions and structures. The tool is embedded in the framework of the Bioinformatics Toolkit of the Max-Planck Institute for Developmental Biology (MPI Toolkit). Job results from the homology search tools inside the MPI Toolkit can be forwarded to GCView and results can be subsequently analyzed by sequence analysis tools. Results are stored online, allowing for later reinspection. GCView is freely available at http://toolkit.tuebingen.mpg.de/gcview.


Year: 2011 PMID: 21609955 PMCID: PMC3125770 DOI: 10.1093/nar/gkr364

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

In bacterial and archaeal genomes, about one half of all protein-coding genes are organized into operons (1). But even for the other half, conservation of the genomic context, i.e. the genes upstream and downstream on the chromosome, is observable between related species (2). The genomic context can provide important information about duplication, insertion, translocation or deletion events. While the past decades have equipped scientists with a broad range of excellent bioinformatics tools for analysis and comparison of single protein sequences, taking a step back and looking at the bigger genomic picture and comparing it between different organisms is still largely manual work. For many well-annotated proteins and operons, databases like BioCyc (3), STRING (4), The SEED (5) or Ensembl Bacteria (6) can provide important information. However, looking beyond the content of those databases to extend the search into more genomes or investigating less well-characterized proteins can be challenging.

GCView, the Genomic Context Viewer for protein homology searches, aims to ease and automate the manual process of extracting and comparing genomic regions of interest. It is integrated into the Bioinformatics Toolkit of the Max-Planck Institute for Developmental Biology (MPI Toolkit) (7) and can be accessed through a user-friendly web interface at http://toolkit.tuebingen.mpg.de/gcview. This website is free and open to all users and there is no login requirement. GCView uses protein homology to assign corresponding genes. The underlying homology information is taken from standard protein homology search tools like BLASTp or PSI-BLAST (8). In contrast to the above-mentioned databases such as STRING, the homology searches are not precomputed, giving the user full control over and insight into the processes leading to the final result. GCView can integrate multiple searches (e.g. one for each component of an operon) and compile a comprehensive overview of the combinatorial variants found in different genomes. Genomes featuring the same number and order of genes of interest are grouped together. The results can be mapped onto a taxonomy tree for a quick overview of the distribution of operon structures throughout all sequenced prokaryotic organisms.

The output is a series of images showing the genomic regions that contain the genes of interest. Additionally, for each image a list of the encoded proteins is provided that contains additional information such as descriptions and database links. Hits from the underlying searches are colored in the output for easy identification. The integration into the MPI Toolkit allows users to run homology search jobs independently of GCView, providing maximum control over the input parameters, and then to internally forward the results to GCView for integration. Likewise, the results from GCView can be forwarded to other specialized tools for a more detailed analysis of subsets of proteins or genes. All results are stored on the server for 2 weeks and can be revisited and reviewed at a later time point. It is possible to create an account on the MPI Toolkit, which allows jobs to be bound to the account and saved for extended periods of time.

FUNCTIONALITY

The design goal for GCView was to provide a quick and accurate overview of the combinatorial variants of operons in different genomes, based on well-established homology search methods and accessible through a user-friendly, straightforward web interface. The workflow of the tool is summarized in Figure 1.

Figure 1.

GCView workflow. Input: red; processing: yellow and results: green.

Input

GCView accepts several different types of input: FASTA protein sequences, protein GI or UniProt identifiers and forwarded homology search jobs. Currently GCView is limited to protein homology searches or protein sequences as input, mostly due to the higher sensitivity of protein searches compared to DNA searches. The inclusion of DNA searches (BLASTn) is planned for a future version. It is possible to use not only full protein sequences, but also single domains as the query for the search. Genes containing multiple domains will be labeled accordingly in the output. Primarily, homology search jobs can be forwarded to GCView within the MPI Toolkit. If, alternatively, FASTA sequences or protein identifiers are provided, GCView internally executes a PSI-BLAST run for each sequence or identifier provided and analyzes the results. Additional input parameters are the size of the genomic region to be displayed and the E-value cutoffs for the results to be included in the output. The size of the genomic region is interpreted as the number of genes to be extracted before the first hit and after the last hit in any genome. Note that the quality of the GCView results strongly depends on the underlying homology search being exhaustive, i.e. containing results at least up to the E-value cutoff specified for GCView. This is especially important in Group View: only exhaustive searches lead to a maximum of labeled operon components. Operons with unlabeled components lead to additional groups, which would not be observed after an exhaustive search. For the same reason, caution is advised when using BLAST databases prefiltered at a certain sequence similarity cutoff. For technical reasons, it is only possible to use BLAST databases that contain GI or UniProt identifiers. Using a database that does not provide appropriate identifiers in the output will not give any results in GCView.
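The E-value cutoff and the exhaustiveness requirement described above can be illustrated with a minimal Python sketch. It is not GCView's own code: it assumes the homology search was exported in the standard 12-column BLAST tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), and the names `Hit`, `parse_tabular`, `filter_hits` and `search_is_exhaustive` are hypothetical. The exhaustiveness check is only a heuristic introduced here: if the search reported at least one hit worse than the cutoff, the hit list was presumably not truncated before reaching it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    query: str       # identifier of the query sequence
    subject: str     # identifier of the database protein that was hit
    identity: float  # percent identity of the alignment
    s_start: int     # start of the homologous region on the subject
    s_end: int       # end of the homologous region on the subject
    evalue: float


def parse_tabular(lines):
    """Parse 12-column BLAST tabular lines into Hit records (assumed layout)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[8]), int(f[9]), float(f[10])))
    return hits


def filter_hits(hits, evalue_cutoff):
    """Keep hits at or below the cutoff (treating it as inclusive is an assumption)."""
    return [h for h in hits if h.evalue <= evalue_cutoff]


def search_is_exhaustive(hits, evalue_cutoff):
    """Heuristic: the search is taken to cover the cutoff if it reported
    at least one hit that is worse than the cutoff."""
    return any(h.evalue > evalue_cutoff for h in hits)
```

A search for which `search_is_exhaustive` returns `False` may have stopped before the cutoff, which is the situation the text warns can produce spurious extra groups.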

Processing

From each input homology search, a list of protein GI numbers is extracted along with the exact region and degree of similarity. The lists are filtered to retain only proteins with E-values below the threshold specified in the input and to remove proteins from organisms that have not been fully sequenced. The backend database of sequenced genome data is built from the genomes found in NCBI GenBank (9) (ftp://ftp.ncbi.nih.gov/genomes/Bacteria) and comprises fully sequenced bacterial and archaeal genomes. For each hit, the genes upstream and downstream of the hit are extracted from the database, resulting in one genome chunk for each hit. The number of genes extracted depends on the range set in the input parameters. Overlapping regions from the same genome are subsequently merged. This implies that an operon which has been duplicated in a genome can show up as one or two chunks, depending on the distance between the duplicates and the range settings. After merging, the resulting regions are grouped by the number and order of genes of interest.
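The extract, merge and group steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the GCView implementation: genes are addressed by their index along a linear replicon (circular chromosomes are ignored), windows merge only when they share at least one gene, gene orientation is ignored, and every function name and the toy data in the usage note are hypothetical.

```python
from collections import defaultdict


def hit_windows(hit_indices, flank, n_genes):
    """One inclusive (start, end) gene-index window per hit, clipped to the replicon."""
    return [(max(0, i - flank), min(n_genes - 1, i + flank))
            for i in sorted(hit_indices)]


def merge_windows(windows):
    """Merge windows that overlap, i.e. share at least one gene index."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def region_signature(window, labels):
    """Ordered tuple of query labels found inside the window.

    The tuple encodes both the number and the order of genes of interest.
    """
    start, end = window
    return tuple(labels[i] for i in range(start, end + 1) if i in labels)


def group_by_signature(genomes, flank):
    """genomes maps genome_id -> (n_genes, {gene_index: query_label}).

    Returns {signature: [(genome_id, window), ...]}.
    """
    groups = defaultdict(list)
    for genome_id, (n_genes, labels) in genomes.items():
        for window in merge_windows(hit_windows(labels, flank, n_genes)):
            groups[region_signature(window, labels)].append((genome_id, window))
    return dict(groups)
```

With toy data such as `{"G1": (20, {5: "qA", 6: "qB", 7: "qC"}), "G3": (40, {3: "qA", 4: "qB", 30: "qA", 31: "qB"})}` and `flank=2`, genome G1 yields one merged region, while the two distant copies in G3 stay separate and fall into the same two-gene group. Raising `flank` far enough would merge the G3 copies into a single chunk, which is the dependence on range settings noted in the text.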

Output

GCView generates two different views for the results: the Group View and the Taxonomy View. Both views contain the same information; the difference is in the sorting. Figure 2 shows example outputs for both views for two different runs of GCView.

Figure 2.

Example output. (A) Using GCView to look at different operon components. The lac operon (Demo Data) is shown in Group View with one group expanded. Inset: Group View overview for the same run. (B) Using GCView to look at single domains in different contexts. POTRA domains from Omp85 and related proteins (10) in different organisms shown in Group View. Inset: Taxonomy View for the same run.

The Group View presents an overview of the results. A group comprises all organisms which contain a specific number and order of the genes of interest. A schematic image of each group summarizes which of the genes of interest can be found in the group and in which order they appear in the genome. Each query gene is represented by a colored arrow. The colors are explained in the legend, which is displayed at the top of the page. Additionally, the identifier of the input query is indicated on each arrow. The arrows in the Group View are not to scale and the colors do not indicate the degree of identity between query and hit sequences. Fused arrows indicate that multiple query sequences were mapped onto one gene. Gray boxes represent one or multiple genes that are not homologous to any of the query sequences but are located between genes of interest. A number indicates how many genes are represented by the corresponding box. The groups can be expanded to view the detailed genomic context for each organism in the respective group.

The Taxonomy View maps all results onto a taxonomy tree. The numbers next to the organism names represent the number of hits in this taxon and its sub-groups. Branches of the tree can be collapsed or expanded as required. The detailed information for each hit can be viewed at the leaves of the tree.

The detail representation of every genomic region is identical in both views. Each representation contains a genome ID, indicating the nucleotide GI number of the genome from which the corresponding region was extracted. If genes of interest are located in several non-overlapping regions of the same genome (e.g. due to operon duplication), multiple representations with the same nucleotide ID are shown, one for each region. A schematic image of the region shows the genetic neighborhood of the genes of interest. Protein-coding genes are shown as arrows. Regions of homology to the genes of interest are highlighted in the corresponding colors, which are indicated in the legend. In contrast to the Group View, the intensity of the color corresponds to the identity score of the hit and the arrow length correlates with the length of the gene. Please note that the scale may differ between different images. The ruler at the bottom of each image shows the position in the genome. Each section of the ruler corresponds to 1000 bp. Various details for each gene (description, precise location, length, distance to neighboring genes) can be viewed by hovering the mouse over the arrows. Clicking on an arrow expands a detailed list of the genes in the image and the search hits therein. The selected gene is highlighted in the list.

A clipboard widget located in the right corner of the screen can be used to pick genes from the output. These genes can be forwarded to sequence retrieval tools for further analysis or used in another GCView run for an iterative expansion of the set of analyzed genes.
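Two of the summaries described in this section, the gray boxes of the Group View and the per-taxon numbers of the Taxonomy View, reduce to simple bookkeeping. The Python sketch below shows one way such values could be derived. It is an illustration only: the function names, the representation of a region as an inclusive gene-index window with a `{gene_index: query_label}` dictionary, and the representation of a lineage as a root-to-leaf sequence of taxon names are all assumptions, and fused arrows (several queries mapped onto one gene) are not modeled.

```python
from collections import Counter


def group_schematic(window, labels):
    """Summarize a region as query labels interleaved with gap counts.

    An integer stands for a gray box: the number of consecutive genes without
    homology to any query that lie between two genes of interest. Flanking
    genes before the first and after the last gene of interest are omitted.
    """
    start, end = window
    items = []
    gap = 0
    seen_first = False
    for i in range(start, end + 1):
        if i in labels:
            if seen_first and gap:
                items.append(gap)  # close the gray box that preceded this gene
            items.append(labels[i])
            seen_first = True
            gap = 0
        elif seen_first:
            gap += 1
    return items


def taxon_hit_counts(hit_lineages):
    """Count hits for every taxon, including hits in its sub-groups.

    Each lineage is a root-to-leaf sequence of taxon names. Keys of the result
    are lineage prefixes, so equally named taxa on different branches stay apart.
    """
    counts = Counter()
    for lineage in hit_lineages:
        for depth in range(1, len(lineage) + 1):
            counts[tuple(lineage[:depth])] += 1
    return counts
```

For example, a region spanning gene indices 0 to 9 with genes of interest at positions 2, 3 and 6 is summarized as two adjacent arrows, a gray box of size 2, and a third arrow.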

CONCLUSIONS

We present GCView, an interactive web tool for automated retrieval and comparison of the genomic context of protein-coding genes. The underlying homology searches use protein sequences instead of DNA for higher sensitivity. Compared to classical databases like The SEED or BioCyc, the advantages of GCView are: (i) a greater focus on the query, as only the homologs of the input proteins are highlighted, and the degree of similarity is easily visible from the output; (ii) interactivity, as the query can iteratively be extended by more proteins of interest; (iii) transparency, as the user can have full control over the parameters of the underlying homology search; and (iv) flexibility, as e.g. single domains can be used as query, revealing different domain contexts. GCView is embedded into the MPI Toolkit, which allows users to save their GCView runs for later reinspection and directly analyze the genes found by GCView using a broad range of sequence and structure analysis tools.

FUNDING

Funding for the project as well as for open access charge: Departmental funding of the Max Planck Society.

Conflict of interest statement. None declared.


1. Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs.

Authors: Jan O Korbel; Lars J Jensen; Christian von Mering; Peer Bork
Journal: Nat Biotechnol Date: 2004-07 Impact factor: 54.908

2. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

3. Ensembl Genomes: extending Ensembl across the taxonomic space.

Authors: P J Kersey; D Lawson; E Birney; P S Derwent; M Haimel; J Herrero; S Keenan; A Kerhornou; G Koscielny; A Kähäri; R J Kinsella; E Kulesha; U Maheswari; K Megy; M Nuhn; G Proctor; D Staines; F Valentin; A J Vilella; A Yates
Journal: Nucleic Acids Res Date: 2009-11-01 Impact factor: 16.971

4. Omp85 from the thermophilic cyanobacterium Thermosynechococcus elongatus differs from proteobacterial Omp85 in structure and domain composition.

Authors: Thomas Arnold; Kornelius Zeth; Dirk Linke
Journal: J Biol Chem Date: 2010-03-29 Impact factor: 5.157

5. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored.

Authors: Damian Szklarczyk; Andrea Franceschini; Michael Kuhn; Milan Simonovic; Alexander Roth; Pablo Minguez; Tobias Doerks; Manuel Stark; Jean Muller; Peer Bork; Lars J Jensen; Christian von Mering
Journal: Nucleic Acids Res Date: 2010-11-02 Impact factor: 16.971

6. GenBank.

Authors: Dennis A Benson; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Eric W Sayers
Journal: Nucleic Acids Res Date: 2010-11-10 Impact factor: 16.971

7. Expansion of the BioCyc collection of pathway/genome databases to 160 genomes.

Authors: Peter D Karp; Christos A Ouzounis; Caroline Moore-Kochlacs; Leon Goldovsky; Pallavi Kaipa; Dag Ahrén; Sophia Tsoka; Nikos Darzentas; Victor Kunin; Núria López-Bigas
Journal: Nucleic Acids Res Date: 2005-10-24 Impact factor: 16.971

8. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes.

Authors: Ross Overbeek; Tadhg Begley; Ralph M Butler; Jomuna V Choudhuri; Han-Yu Chuang; Matthew Cohoon; Valérie de Crécy-Lagard; Naryttza Diaz; Terry Disz; Robert Edwards; Michael Fonstein; Ed D Frank; Svetlana Gerdes; Elizabeth M Glass; Alexander Goesmann; Andrew Hanson; Dirk Iwata-Reuyl; Roy Jensen; Neema Jamshidi; Lutz Krause; Michael Kubal; Niels Larsen; Burkhard Linke; Alice C McHardy; Folker Meyer; Heiko Neuweger; Gary Olsen; Robert Olson; Andrei Osterman; Vasiliy Portnoy; Gordon D Pusch; Dmitry A Rodionov; Christian Rückert; Jason Steiner; Rick Stevens; Ines Thiele; Olga Vassieva; Yuzhen Ye; Olga Zagnitko; Veronika Vonstein
Journal: Nucleic Acids Res Date: 2005-10-07 Impact factor: 16.971

9. The life-cycle of operons.

Authors: Morgan N Price; Adam P Arkin; Eric J Alm
Journal: PLoS Genet Date: 2006-06-23 Impact factor: 5.917

10. The MPI Bioinformatics Toolkit for protein sequence analysis.

Authors: Andreas Biegert; Christian Mayer; Michael Remmert; Johannes Söding; Andrei N Lupas
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971
