Literature DB >> 32282909

TREND: a platform for exploring protein function in prokaryotes based on phylogenetic, domain architecture and gene neighborhood analyses.

Abstract

Key steps in a computational study of protein function involve analysis of (i) relationships between homologous proteins, (ii) protein domain architecture and (iii) gene neighborhoods the corresponding proteins are encoded in. Each of these steps requires a separate computational task and sets of tools. Currently in order to relate protein features and gene neighborhoods information to phylogeny, researchers need to prepare all the necessary data and combine them by hand, which is time-consuming and error-prone. Here, we present a new platform, TREND (tree-based exploration of neighborhoods and domains), which can perform all the necessary steps in automated fashion and put the derived information into phylogenomic context, thus making evolutionary based protein function analysis more efficient. A rich set of adjustable components allows a user to run the computational steps specific to his task. TREND is freely available at http://trend.zhulinlab.org.

Entities: Chemical

Mesh：

Substances：

Year: 2020 PMID： 32282909 PMCID： PMC7319448 DOI： 10.1093/nar/gkaa243

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

In prokaryotes, genes that encode functionally linked (e.g. involved in the same biological pathway) proteins are often organized in gene clusters (1). Efficient regulation of transcription employs the same principle organizing genes in operons. Gene order in clusters/operons is often conserved especially in more closely related species (1). Operons that encode subunits of highly conserved multi-protein complexes, such as ribosomes, are conserved even in distantly related bacterial genomes and even between archaea and bacteria (2). Gene order conservation can be efficiently utilized in functional prediction of uncharacterized and poorly characterized proteins (3,4). Arrangement of constituent domains (domain architecture) to a large extent determines the function of a protein (5). Globular protein domains are structurally compact, independently folding units (6). Individual domains can expand their functional repertoire in a number of ways, including residue mutations, changes in the core fold of a domain that change ligand binding properties, altered interaction partners and loop extensions (7). Altered combination of domains is a source of functional changes in proteins and identification of complete domain architecture is an efficient instrument for their functional characterization (8). A protein sequence-based phylogenetic tree groups the proteins in evolutionary related clusters. Placing information about the domain architectures and gene neighborhoods onto the protein phylogenetic tree provides the means to understand evolutionary changes in proteins of known function and to facilitate functional prediction of uncharacterized genes. There is a limited number of resources available for protein analysis in evolutionary context. Online version of MAFFT (9) gives an option to align sequences using iterative refinement method and to produce a Neighbor-Joining or UPGMA tree based on provided sequences. ClustalW2—Phylogeny and Simple Phylogeny (10) have an option to build a tree using a provided multiple sequence alignment (MSA). NGPhylogeny.fr (11) offers an option to construct MSA and build and visualize a phylogenetic tree; it allows a user to adjust parameters of each step and select a tree building program. IQ-TREE (12) is another online tool that builds MSA and a maximum likelihood tree. iTOL (13) enables combining a tree with the features identified in a set of protein sequences. The user needs to identify protein features and build a phylogenetic tree beforehand using external tools, and the data needs to be prepared according to the iTOL requirements. The STRING database (14) can produce gene neighborhoods for 4613 prokaryotic organisms (as of 2 November 2020). Protein sequences or identifiers from the same organism should be provided as input. STRING has no option for building a protein tree for evolutionary inference based on provided protein sequences. The online ETE Toolkit TreeViewer (15) combines tree and MSA in one picture. MSA can be presented as aligned blocks or in condensed format. Both, a tree and an alignment should be provided as input. TreeDom (16) allows a user to build a tree and map Pfam domains on it fetching the alignment and domain architectures from Pfam database. TreeDom is limited to building a Neighbor-Joining tree, as input PFAM-family identifiers should be used. TreeDom does not identify other protein features and does not provide an option for protein alignment. AnnoTree (17) enables visualization of precomputed genome annotations across the bacterial and archaeal phylogenies. Here, we describe an intuitive platform, TREND, which has a capability to automatically identify protein features, calculate gene neighborhoods of each gene corresponding to provided prokaryotic proteins, cluster neighboring genes, identify operons, and put all these data in a phylogenetic context. To complete the full analysis, only a set of proteins sequences or protein identifiers is needed as input. The results are presented as an interactive picture on the web-site and can be downloaded for subsequent analysis. TREND is the only resource that can calculate gene neighborhoods for more than 125 000 prokaryotic genomes, identify protein features and correlate this with phylogeny in automated fashion, while providing ample opportunities for adjusting each step of the analysis.

RESULTS

Main features

TREND has an extensible modular architecture and consists of four main components: (i) a component dedicated to building a phylogenetic tree by running multiple sequence alignment and a tree building software; (ii) a component for identification of protein features and combining the results with the protein tree; (iii) a component for gene neighborhoods identification and mapping this information on the protein tree and (iv) a component for clustering genes and identification of operons. These components are grouped in two pipelines: Domains and Neighborhoods (Figure 1).

Figure 1.

Domains and Neighborhoods pipelines in TREND. Steps that can be skipped are marked by dashed lines. If identifiers provided instead of protein sequences the sequences will be fetched from NCBI and/or MiST databases. Optionally the redundancy of the provided sequences can be reduced based on the identity level specified by a user.

Domains pipeline

A typical analysis starts with collecting similar protein sequences, for example, by running BLASTP against a protein sequence database using the protein of interest as a query. The collected set of protein sequences can be used as input for TREND (Figure 1). Another option is to input NCBI or MiST (18) compatible protein identifiers, such as RefSeq Id, locus tag or MiST stable Id, and corresponding sequences will be fetched from these resources. Upon the start, the pipeline will process input, identify protein features, build a tree, combine the results in one interactive picture, and present it to a user. A user can choose a database of Pfam profile HMMs (19) or CDD position-specific scoring matrices (20) for protein domains identification. We also integrated a collection of profile HMMs specific for signal transduction pathways in prokaryotes (18). Protein domains are identified by running HMMER (21) or RPS-BLAST (20) against the corresponding databases. In addition to protein domains, transmembrane and low-complexity regions are identified by running TMHMM (22) and SEG (23), correspondingly. Upon completion of this step, the sequences will be aligned using the algorithms implemented in MAFFT (9) and the tree will be built based on the MSA using FastTree (24) or MEGA (25). The user can customize each step adjusting corresponding parameters, choosing the alignment algorithm and the phylogenetic tree building method. The sequences in MSA will be sorted according to the order of the tree leaves, thus enabling the identification of amino acid residues that are conserved in specific, phylogenetically related groups of sequences. In addition, if a corresponding option is selected, the sequences in MSA and corresponding leaves will be enumerated. The pipeline also has a built in sequence redundancy reduction step implemented using CD-HIT (26), which will be executed, if the corresponding option is selected. The domain analysis pipeline can be customized in a way that the tree will be built using specified parts of sequences, while protein features will be identified using full-length sequences. This is helpful when the evolutionary history of a specific domain or other fragment of sequences is explored. For this option, a user need to click ‘Add second area’ button and another input area will appear where fragments of sequences should be inserted. If a user already has an alignment in the FASTA format, which was edited and prepared for the analysis, he can skip the alignment step and start the pipeline from tree building. If a user has a tree in the Newick format, it can be used together with the original sequences and (optionally) with an alignment, which was used to build the tree, to run the pipeline in a Partial mode. In this case, only the steps not related to tree building will be run. Instead of providing sequences used to build the tree, corresponding protein Ids can be placed at the beginning of the tree leaf names separated from the rest of the names by an underscore, a space, a forward slash or a vertical bar. In this case, the sequences will be fetched from NCBI or MiST. The Partial mode is particularly useful in domain evolution analysis and ancestral domain architecture reconstruction (27), in which case a prebuilt organismal phylogenetic tree can be combined with protein sequences containing a domain(s) of interest. TREND can also be used just as an online tree building tool if the protein identification step is skipped. As a result, an interactive tree will be produced (Figure 2). Clicking on features will open an information block with the details of the identified features. Domain analysis details contain links to entries in corresponding databases (Pfam and CDD) for each identified domain. Clicking on a feature will highlight the part of a sequence corresponding to the feature. The tree can be saved in SVG and Newick formats and used for further analysis. The identified features can be saved in JSON format for subsequent computational analysis, and the MSA—in FASTA.

Figure 2.

Phylogenetic tree of Pseudomonas aeruginosa PilJ chemoreceptor homologs combined with domain architecture (identified protein domains and transmembrane regions) generated by TREND. Domain nomenclature is according to the Pfam database.

Neighborhoods pipeline

To start the gene neighborhoods pipeline, four types of input information can be used: (i) a set of protein sequences in FASTA format, (ii) an MSA in FASTA format, (iii) a tree in Newick format or (iv) a list of NCBI and/or MiST compatible protein identifiers separated by a new line character. The main prerequisite for the first tree types of input is the placement of protein identifiers at the beginning of corresponding protein names, separating them by space, vertical bar, forward slash or underscore from the rest of the names. Gene neighborhoods will be identified by fetching corresponding information from MiST database (18). Although any identifiers can be used to build the tree, it is required for the gene neighborhoods analysis to provide NCBI RefSeq identifiers, locus tags or MiST identifiers. If sequences or a list of Ids are provided, then the tree will be built based on the MSA constructed at the first step. If an MSA or a Newick-formatted tree was provided, then corresponding steps will be skipped. The type of the input is detected automatically and validated prior to running the pipeline. A user can specify the number of neighboring genes to be identified and processed for each provided protein. Upon identifying of gene neighborhoods, genes will be clustered based on shared domains of the encoded proteins. A user can control the permitted number of non shared domains for corresponding genes to be grouped in a cluster. In this case, not all domains in proteins should be shared for genes to be grouped. The next step is identification of operons based on the distance between the genes. The default distance between the genes is set to 200 bp (28). This parameter can be changed by a user. Similar to Domain analysis pipeline, sequences and corresponding tree leaves can be enumerated and sequence redundancy can be reduced. All this information will be combined and presented to a user as an interactive picture (Figure 3). All genes with the same set of domains, i.e. those that belong to the same cluster, are shown in the same color. Genes that belong to the same operon have the borders with the same color. Genes corresponding to the proteins that were used as queries are shown with thicker borders. Hovering the mouse over the picture shows the corresponding gene information including its product name, NCBI and MiST ids, links to corresponding databases, encoded domains and cluster ids. Gene neighborhood information together with the identified clusters and operons can be saved in JSON format and the picture can be saved in SVG format. Parsing the JSON formatted file allows a researcher to run an additional analysis, for example, to identify groups of gene clusters based on gene contents.

Figure 3.

Phylogenetic tree of a subset of CheA kinase homologs and their corresponding gene neighborhoods generated by TREND. Genes that encode proteins consisting of the same domains are shown in the same color: cheA genes are shown in blue. Genes encoding proteins from corresponding branches of the phylogenetic tree (left panel) are marked with bold borders (in this case, cheA genes).

CASE STUDY

To show how the usual computational workflow can be simplified using TREND we used it for the analysis of oxygen di-iron proteins (ODP) described in (29). In this work a new member of metalloenzymes superfamily (Pfam Lactamase_B PF00753) was identified, which was recruited to various bacterial and archaeal signal transduction (ST) pathways, including chemotaxis, to function as oxygen and iron sensors. The analysis done in that work identified several groups of ODPs: two groups of the sensors encoded in operons with methyl-accepting chemotaxis encoding genes in two different orders, ODPs in operons with different ST related genes, and the proteins fused with signal transduction domains. The computational workflow to make such an analysis would include running a set of tools to align proteins, build a tree, identify domains, identify gene neighborhoods for each gene, and then manually cross relating the generated data with extreme care in order to identify the groups and the events of domains fusion or detecting operons with ST proteins encoding genes. Applying TREND and interactively exploring the results we easily identified the branches on the tree (Supplementary Figure S1) with the described two types of operons (29) and the branches with signal transduction genes encoded in operons with ODP genes. Using Domains pipeline we identified domain architectures of the ODPs and observed the described groups of proteins fused with the signal transduction related domains (Supplementary Figure S1). Thus, TREND makes identification of genomic events more efficient allowing a researcher to automate all the necessary steps, and prevents the errors that can originate from the manual analysis.

CONCLUSION AND FUTURE DEVELOPMENTS

TREND is a highly customizable computational platform that allows researchers to explore protein function identifying protein features, gene neighborhoods and operons, clustering neighboring genes, and integrating all these data into phylogenomic context and cross-referencing with RefSeq, Pfam, CDD and MiST databases. The extensible modular architecture facilitates adding new functionality with not much effort. The platform has a potential to grow into one of the major hubs for computational protein analysis and accelerate the research in computational biology. Among the future developments, we are planning to integrate Markov clustering of homologs and correlating the results with phylogeny, additional software for protein tree building and built-in BLASTP for collecting homologous proteins. We will be updating the web server by integrating new releases of Pfam, CDD and MiST databases and the underlying software when they become available.

IMPLEMENTATION AND AVAILABILITY

The backend is implemented in Java using Spring Boot (https://spring.io/projects/spring-boot) and Hibernate ORM (https://hibernate.org/orm/) frameworks, PostgreSQL is used as a database system, Tomcat—as a web server, NGINX—as a reverse proxy. Separate scripts are written in Python and integrated in the application. Thymeleaf (https://www.thymeleaf.org/) is used as a template engine. The frontend is developed using jQuery, d3.js, Bootstrap and custom packages for data manipulation and visualization. TREND is freely available at http://trend.zhulinlab.org. GitHub repository location – https://github.com/ToshkaDev/TREND. Click here for additional data file.

29 in total

1. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes.

Authors: A Krogh; B Larsson; G von Heijne; E L Sonnhammer
Journal: J Mol Biol Date: 2001-01-19 Impact factor: 5.469

2. Exploitation of gene context.

Authors: M Huynen; B Snel; W Lathe; P Bork
Journal: Curr Opin Struct Biol Date: 2000-06 Impact factor: 6.809

3. FastTree 2--approximately maximum-likelihood trees for large alignments.

Authors: Morgan N Price; Paramvir S Dehal; Adam P Arkin
Journal: PLoS One Date: 2010-03-10 Impact factor: 3.240

4. Evolution of Protein Domain Architectures.

Authors: Sofia K Forslund; Mateusz Kaduk; Erik L L Sonnhammer
Journal: Methods Mol Biol Date: 2019

Review 5. Dynamics and adaptive benefits of modular protein evolution.

Authors: Erich Bornberg-Bauer; M Mar Albà
Journal: Curr Opin Struct Biol Date: 2013-04-03 Impact factor: 6.809

6. CDD: NCBI's conserved domain database.

Authors: Aron Marchler-Bauer; Myra K Derbyshire; Noreen R Gonzales; Shennan Lu; Farideh Chitsaz; Lewis Y Geer; Renata C Geer; Jane He; Marc Gwadz; David I Hurwitz; Christopher J Lanczycki; Fu Lu; Gabriele H Marchler; James S Song; Narmada Thanki; Zhouxi Wang; Roxanne A Yamashita; Dachuan Zhang; Chanjuan Zheng; Stephen H Bryant
Journal: Nucleic Acids Res Date: 2014-11-20 Impact factor: 16.971

7. A di-iron protein recruited as an Fe[II] and oxygen sensor for bacterial chemotaxis functions by stabilizing an iron-peroxy species.

Authors: Alise R Muok; Yijie Deng; Vadim M Gumerov; Jenna E Chong; Jennifer R DeRosa; Kurni Kurniyati; Rachael E Coleman; Kyle M Lancaster; Chunhao Li; Igor B Zhulin; Brian R Crane
Journal: Proc Natl Acad Sci U S A Date: 2019-07-03 Impact factor: 11.205

8. Accelerated Profile HMM Searches.

Authors: Sean R Eddy
Journal: PLoS Comput Biol Date: 2011-10-20 Impact factor: 4.475

9. Grammar of protein domain architectures.

Authors: Lijia Yu; Deepak Kumar Tanwar; Emanuel Diego S Penha; Yuri I Wolf; Eugene V Koonin; Malay Kumar Basu
Journal: Proc Natl Acad Sci U S A Date: 2019-02-07 Impact factor: 11.205

10. Interactive Tree Of Life (iTOL) v4: recent updates and new developments.

Authors: Ivica Letunic; Peer Bork
Journal: Nucleic Acids Res Date: 2019-07-02 Impact factor: 16.971

13 in total

1. The Campylobacter jejuni chemoreceptor Tlp10 has a bimodal ligand-binding domain and specificity for multiple classes of chemoeffectors.

Authors: Bassam A Elgamoudi; Ekaterina P Andrianova; Lucy K Shewell; Christopher J Day; Rebecca M King; Hossinur Rahman; Lauren E Hartley-Tassell; Igor B Zhulin; Victoria Korolik
Journal: Sci Signal Date: 2021-01-05 Impact factor: 8.192

2. Amino acid sensor conserved from bacteria to humans.

Authors: Vadim M Gumerov; Ekaterina P Andrianova; Miguel A Matilla; Karen M Page; Elizabet Monteagudo-Cascales; Annette C Dolphin; Tino Krell; Igor B Zhulin
Journal: Proc Natl Acad Sci U S A Date: 2022-03-01 Impact factor: 11.205

3. Methylation-Independent Chemotaxis Systems Are the Norm for Gastric-Colonizing Helicobacter Species.

Authors: Xiaolin Liu; Karen M Ottemann
Journal: J Bacteriol Date: 2022-08-16 Impact factor: 3.476

4. The Natural Product Domain Seeker version 2 (NaPDoS2) webtool relates ketosynthase phylogeny to biosynthetic function.

Authors: Leesa J Klau; Sheila Podell; Kaitlin E Creamer; Alyssa M Demko; Hans W Singh; Eric E Allen; Bradley S Moore; Nadine Ziemert; Anne Catrin Letzel; Paul R Jensen
Journal: J Biol Chem Date: 2022-09-12 Impact factor: 5.486

5. Characterization of Opposing Responses to Phenol by Bacillus subtilis Chemoreceptors.

Authors: Girija A Bodhankar; Payman Tohidifar; Zachary L Foust; George W Ordal; Christopher V Rao
Journal: J Bacteriol Date: 2022-01-10 Impact factor: 3.476

6. Synthaser: a CD-Search enabled Python toolkit for analysing domain architecture of fungal secondary metabolite megasynth(et)ases.

Authors: Cameron L M Gilchrist; Yit-Heng Chooi
Journal: Fungal Biol Biotechnol Date: 2021-11-11