FlaGs and webFlaGs: discovering novel biology through the analysis of gene neighbourhood conservation.

Chayan Kumar Saha¹, Rodrigo Sanches Pires², Harald Brolin³, Maxence Delannoy⁴, Gemma Catherine Atkinson¹.

Abstract

SUMMARY: Analysis of conservation of gene neighbourhoods over different evolutionary levels is important for understanding operon and gene cluster evolution, and predicting functional associations. Our tool FlaGs (standing for Flanking Genes) takes a list of NCBI protein accessions as input, clusters neighbourhood-encoded proteins into homologous groups using sensitive sequence searching, and outputs a graphical visualization of the gene neighbourhood and its conservation, along with a phylogenetic tree annotated with flanking gene conservation. FlaGs has demonstrated utility for molecular evolutionary analysis, having uncovered a new toxin-antitoxin system in prokaryotes and bacteriophages. The web tool version of FlaGs (webFlaGs) can optionally include a BLASTP search against a reduced RefSeq database to generate an input accession list and analyse neighbourhood conservation within the same run.
AVAILABILITY AND IMPLEMENTATION: FlaGs can be downloaded from https://github.com/GCA-VH-lab/FlaGs or run online at http://www.webflags.se/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2021 PMID: 32956448 PMCID: PMC8189683 DOI: 10.1093/bioinformatics/btaa788

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Conservation of gene order at long evolutionary distances is a strong indicator of a functional relationship among genes (Overbeek ). Extreme examples are the tryptophan biosynthesis (Dandekar, 1998) and str ribosomal protein operons (Lechner ), which are conserved from bacteria to archaea. The vast amount of genomic sequence data that has become available in recent decades is a treasure trove of clues about the function of uncharacterized proteins, and the pathways in which they are involved (Gabaldon and Huynen, 2004). High-throughput identification of gene order conservation in genomes is a promising approach for predicting the involvement of proteins in particular pathways or systems. In addition to yielding functional predictions, the identification of conserved genomic architectures is essential for understanding the evolutionary dynamics behind the formation and restructuring of gene clusters, including reassembly of operons after disruption during evolution (Omelchenko ). While there is a range of tools that analyse gene neighbourhood conservation or integrate these data along with other metrics for functional association prediction, these tend to be either restrictive in the genomes that can be considered (e.g. only complete genomes or those of model organisms) or require the creation of local genome databases (Garcia ; Lemoine ; Martinez-Guerrero ; Overmars ; Szklarczyk ). Other tools that connect to the National Center for Biotechnology Information (NCBI; https://www.ncbi.nlm.nih.gov/) to detect operons may lack sensitive sequence searching for homology assignments of neighbourhood genes (Gumerov and Zhulin, 2020). We felt there was a need for a tool that allows the use of the huge quantity of publicly accessible data in the NCBI RefSeq database (O'Leary ) and is sensitive enough to answer questions about homologous proteins over any evolutionary distance, from the strain or isolate level, to inter-kingdom or even inter-domain comparisons.
We set out to build a Python tool that fulfils our list of essential criteria: it allows the user complete control over the input genomes being analysed; has a simple input format that does not require coding, downloading of genomes or formatting of databases; can nevertheless also run on locally stored genomes, for offline analyses or for genomes that are not public; can be run via a server with results emailed to the user; can detect remote homology, making it suitable for analysing the most distant relationships among proteins and taxa as well as for closer comparative analyses; outputs gene neighbourhoods annotated onto a phylogenetic tree; and produces publication-quality editable vector graphics.

2 The FlaGs workflow

Our resulting tool that fulfils the above requirements is called FlaGs (standing for Flanking Genes) (Fig. 1A). FlaGs takes in user-determined NCBI accessions that link to the RefSeq database (around 170 million proteins from almost 100 000 organisms as of March 2020). Input files can be prepared quickly from selected sequences in the output of an NCBI BLASTP search against the RefSeq database without any scripting (see the manual; Supplementary Materials File S1). An optional addition to the input file is the NCBI genome assembly identifier to target a particular genome. FlaGs clusters flanking gene-encoded proteins using the sensitive Hidden Markov Model-based method Jackhmmer, part of the HMMER distribution (Eddy, 2011). There are three ways to run FlaGs: through the web server at www.webflags.se, which can optionally include a BLASTP search against microbial RefSeq genomes or a representative genome database to identify homologues with which to run FlaGs (Camacho ); locally, with FlaGs querying NCBI as it runs, and not requiring locally stored genomes; and locally, using locally stored genomes in RefSeq GFF and protein FASTA format. FlaGs outputs information on the conservation of flanking gene-encoded proteins, and their identity, in graphical and text formats (Fig. 1A).

Fig. 1.

The FlaGs workflow and example results. (A) The user inputs a list of protein accession numbers (optionally with GCF assembly IDs) and can specify the number of adjacent flanking genes to consider, and the sensitivity of the Jackhmmer search through changing the E value cut-off and number of iterations. The web version of FlaGs (webFlaGs) can optionally use a single protein sequence or NCBI accession and begin by executing a BLASTP search against the RefSeq database (excluding eukaryotes) or a representative genome database to generate the input list of accessions. The output always includes a to-scale figure of flanking genes, a description of the flanking gene identities as a legend, and optionally, a phylogenetic tree annotated with colour- and number-coded pennant flags. (B) Example results using toxins of the toxSAS toxin–antitoxin system (Jimmy ) as the query. Empty genes with grey borders are not conserved in the dataset, and grey genes with blue borders are pseudogenes. In this example, FlaGs reveals four different homologous groups of antitoxins as flanking genes, two of which (green and yellow) are antitoxins for the same cognate toxin. Group number 5 is an integrase. As FlaGs does not require complete genomes, regions can lack flanking genes on one side if the query gene is close to the end of a contig, as is the case with Arthrobacter castelli in this example.
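The two central steps of the workflow, collecting a fixed number of flanking genes around each query and grouping the encoded proteins by homology, can be illustrated with a minimal Python sketch. This is not the FlaGs source code: the function names `flanking_window` and `cluster_by_hits`, the gene labels, the default values and the single-linkage grouping rule are all assumptions made for illustration, and the pairwise hits are supplied by hand, standing in for the E values that a Jackhmmer search would report. Strand orientation and gene coordinates are ignored.

```python
from collections import defaultdict


def flanking_window(gene_order, query, n_flank=4):
    """Return (upstream, downstream) neighbours of `query` on one contig.

    `gene_order` is a list of gene labels sorted by position. Fewer than
    `n_flank` genes are returned on a side when the query lies close to
    the end of the contig. The default of 4 is a placeholder.
    """
    i = gene_order.index(query)
    upstream = gene_order[max(0, i - n_flank):i]
    downstream = gene_order[i + 1:i + 1 + n_flank]
    return upstream, downstream


def cluster_by_hits(proteins, hits, evalue_cutoff=1e-10):
    """Group proteins by single linkage over pairwise hits (hypothetical rule).

    `hits` is an iterable of (protein_a, protein_b, evalue) tuples. Two
    proteins end up in the same group when a chain of hits at or below
    `evalue_cutoff` connects them. Groups with a single member are
    treated as not conserved and dropped; the rest are numbered from 1,
    largest group first.
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the group representative, compressing paths.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b, evalue in hits:
        if evalue <= evalue_cutoff and a in parent and b in parent:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for p in proteins:
        groups[find(p)].append(p)

    conserved = sorted(
        (sorted(members) for members in groups.values() if len(members) > 1),
        key=lambda members: (-len(members), members[0]),
    )
    return {number: members for number, members in enumerate(conserved, start=1)}


# Illustrative data only: two contigs, the second ending right at the query.
contig_a = ["g1", "g2", "tox", "anti", "g5"]
contig_b = ["toxB", "antiB"]

up_a, down_a = flanking_window(contig_a, "tox", n_flank=2)
up_b, down_b = flanking_window(contig_b, "toxB", n_flank=2)

flank_proteins = up_a + down_a + up_b + down_b
made_up_hits = [("anti", "antiB", 1e-30), ("g1", "g5", 0.5)]
print(cluster_by_hits(flank_proteins, made_up_hits))
```

In this toy input the second contig has no upstream genes, mirroring the contig-end case noted in the figure legend, and only the two antitoxin-like proteins are linked by a hit below the cut-off, so they form the single numbered group. Lowering the stringency of the cut-off would merge more proteins into groups, which is the trade-off the E value setting controls.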
The output always includes a to-scale diagram of flanking genes, number- and colour-coded by conservation group (Fig. 1B). A ‘description’ file is also included, which acts as a legend for interpreting the flanking gene diagram. An optional output is a phylogenetic tree annotated with flanking genes reduced to triangular pennant-like flags. The tree-building feature uses the ETE 3 Python environment (Huerta-Cepas ).
FlaGs is a flexible tool for sensitive detection of flanking gene conservation at any evolutionary distance, and displays results in an intuitive, publication-quality vector graphics format. The utility of FlaGs is exemplified by our recent discovery of a novel toxin–antitoxin system exploiting growth control via ppApp alarmone nucleotide signalling (Jimmy ).
The web server STRING is one of the most widely used tools to study the gene neighbourhood conservation of a gene of interest (Szklarczyk ). STRING’s great strength is that it brings together pre-computed association data from a number of different sources to predict functional associations. It is an excellent first port of call for predicting the function of conserved genes. STRING, however, uses a limited set of around 5000 input organisms, and does not include bacteriophages. It is therefore somewhat limited when addressing neighbourhood conservation of genes with extremely patchy distributions, as is often the case with genes belonging to the accessory component of pangenomes. The discovery of toxSASs was possible only because FlaGs has access to the extensive cellular and viral genome resources in the RefSeq database.
We expect that FlaGs will continue to be successful in the prediction and evolutionary analysis of genomic loci with various functions: not just toxin–antitoxin systems but also, for example, secretion systems (where it has already been used in the description of a novel system (Palmer )), antibiotic biogenesis clusters, viral defence mechanisms, gene transfer agents, pathogenicity islands and transposons. A future direction of FlaGs is to go beyond RefSeq, taking advantage of all the genomic data stored in GenBank, which will further increase the genomes accessible to neighbourhood analysis by FlaGs by hundreds of thousands.

17 in total

1. The use of gene clusters to infer functional coupling.

Authors: R Overbeek; M Fonstein; M D'Souza; G D Pusch; N Maltsev
Journal: Proc Natl Acad Sci U S A Date: 1999-03-16 Impact factor: 11.205

2. Prediction of protein function and pathways in the genome era.

Authors: T Gabaldón; M A Huynen
Journal: Cell Mol Life Sci Date: 2004-04 Impact factor: 9.261

3. A holin/peptidoglycan hydrolase-dependent protein secretion system.

Authors: Tracy Palmer; Alexander J Finney; Chayan Kumar Saha; Gemma C Atkinson; Frank Sargent
Journal: Mol Microbiol Date: 2020-10-12 Impact factor: 3.501

4. BLAST+: architecture and applications.

Authors: Christiam Camacho; George Coulouris; Vahram Avagyan; Ning Ma; Jason Papadopoulos; Kevin Bealer; Thomas L Madden
Journal: BMC Bioinformatics Date: 2009-12-15 Impact factor: 3.169

5. MGcV: the microbial genomic context viewer for comparative genome analysis.

Authors: Lex Overmars; Robert Kerkhoven; Roland J Siezen; Christof Francke
Journal: BMC Genomics Date: 2013-04-01 Impact factor: 3.969

6. Accelerated Profile HMM Searches.

Authors: Sean R Eddy
Journal: PLoS Comput Biol Date: 2011-10-20 Impact factor: 4.475

7. GeConT 2: gene context analysis for orthologous proteins, conserved domains and metabolic pathways.

Authors: C E Martinez-Guerrero; R Ciria; C Abreu-Goodger; G Moreno-Hagelsieb; E Merino
Journal: Nucleic Acids Res Date: 2008-05-29 Impact factor: 16.971

8. STRING v10: protein-protein interaction networks, integrated over the tree of life.

Authors: Damian Szklarczyk; Andrea Franceschini; Stefan Wyder; Kristoffer Forslund; Davide Heller; Jaime Huerta-Cepas; Milan Simonovic; Alexander Roth; Alberto Santos; Kalliopi P Tsafou; Michael Kuhn; Peer Bork; Lars J Jensen; Christian von Mering
Journal: Nucleic Acids Res Date: 2014-10-28 Impact factor: 16.971

9. ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data.

Authors: Jaime Huerta-Cepas; François Serra; Peer Bork
Journal: Mol Biol Evol Date: 2016-02-26 Impact factor: 16.240

10. TREND: a platform for exploring protein function in prokaryotes based on phylogenetic, domain architecture and gene neighborhood analyses.

Authors: Vadim M Gumerov; Igor B Zhulin
Journal: Nucleic Acids Res Date: 2020-07-02 Impact factor: 16.971
