OBI: A computational tool for the analysis and systematization of the positive selection in proteins.

Julián H Calvento¹, Franco Leonardo Bulgarelli², Ana Julia Velez Rueda¹.

Abstract

Positive selection analysis has multiple applications, including vaccine design and the detection of circulating drug-resistant pathogen variants in populations, and multiple tools exist to perform it. However, applying these tools to a large number of protein families, or as part of a comprehensive phylogenomics pipeline, can be challenging. Since many standard bioinformatics tools are only available as executables, integrating them into complex bioinformatics pipelines may not be possible. We have developed OBI, an open-source tool that aims to facilitate positive selection analysis on a large scale. It can be used as a stand-alone command-line app that is easily installed as a Conda package. Some advantages of using OBI are:
• It speeds up the analysis by automating the entire process
• It allows multiple starting points and customization for the analysis
• It allows the retrieval and linkage of structural and evolutive data for a protein through
We hope to provide with OBI a solution for reliably speeding up large-scale protein evolutionary and structural analysis.

Keywords: Proteins evolution; Python library; Structural bioinformatics pipeline

Year: 2022 PMID: 35910305 PMCID: PMC9334345 DOI: 10.1016/j.mex.2022.101786

Source DB: PubMed Journal: MethodsX ISSN: 2215-0161

Specifications Table

Method details

Introduction and general background

Despite their robustness, proteins exhibit remarkable evolutionary adaptability, and new functionalities have emerged throughout the history of the planet [1,2]. We now know that new enzyme functions can evolve within a few decades, as has happened with enzymes that break down synthetic chemicals that first appeared during the 20th century [3,4] and with the alarming evolution of drug resistance. There is evidence that evolution operates by selecting functional dynamic movements or restricting structural movements that are detrimental to protein function [5,6]. Selection processes then allow organisms to adapt better to their environment. Therefore, identifying the sites of a protein that are subject to positive selection can enrich studies of evolutionary biology and functional characterization. Positive selection analysis is a bioinformatic prediction technique with multiple applications, including vaccine design and the detection of new drug-resistant pathogenic variants [7,8]. However, efficient detection of positive selection can be problematic, since selection often operates on only a few sites over a short evolutionary time frame [9], [10], [11]. Consequently, choosing an appropriate detection method and interpreting its results correctly is critical. Here we present OBI, a tool that integrates several bioinformatics tools and is optimized for evolutionary inference and positive selection analysis. In addition, OBI maps this sequence-level information onto the protein structure. Given only a protein's FASTA sequence, our tool retrieves the homologous proteins [12] and their gene sequences using Entrez [13], and performs the positive selection analysis using HyPhy [14].
Furthermore, OBI links the evolutionary information with the structural data available for the protein of interest, allowing the user to easily detect cases of positive selection related to structural changes and their possible association with protein activity and function. An additional complication arises when these analyses are applied on a large scale, to a large number of protein families, or as part of a larger pipeline. Such analyses require automation, optimized computing speed, and interoperability between tools, which makes them hard to achieve. OBI is an open-source tool that facilitates the analysis of positive selection on a large scale. We have implemented it as a stand-alone command-line app, developed entirely in Python, that can be easily installed and used as a Conda package.

Package structure and user interface

OBI has a pipeline architecture [15], in which a protein sequence is processed in successive stages until a positive selection analysis report is produced. In each stage, in-house utilities are combined with frequently used bioinformatics tools and resources such as Biopython, BLAST [16], or UniProt [17]. The whole pipeline can be run through a command-line interface, which allows analysis parameters to be specified, such as the min-coverage for getting the targets or the e-value used for filtering the hits obtained (Fig. 1A). Information on all parameters and their usage can be accessed by running obi --help (Fig. 1B).
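The effect of the e-value and min-coverage parameters mentioned above can be illustrated with a minimal sketch. It is not OBI's implementation: the `Hit` class, the `filter_hits` function, the default thresholds, and the definition of coverage as the aligned fraction of the query are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical container for one search hit (not a type defined by OBI)."""
    accession: str
    evalue: float
    aligned_length: int  # number of query residues covered by the hit


def filter_hits(hits, query_length, max_evalue=1e-5, min_coverage=0.9):
    """Keep hits with a low enough e-value that cover enough of the query.

    Both thresholds are illustrative defaults, not values taken from OBI.
    """
    kept = []
    for hit in hits:
        coverage = hit.aligned_length / query_length
        if hit.evalue <= max_evalue and coverage >= min_coverage:
            kept.append(hit)
    return kept


# Made-up hits against a 141-residue query.
hits = [
    Hit("A", 1e-50, 140),  # passes both filters
    Hit("B", 1e-3, 141),   # rejected: e-value too high
    Hit("C", 1e-20, 60),   # rejected: covers too little of the query
]
selected = filter_hits(hits, query_length=141)
print([hit.accession for hit in selected])
```

Stricter thresholds yield fewer but more reliable homologs, which is the trade-off these two parameters expose to the user.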

Fig. 1

A. The data preparation pipeline flow includes I) the homologous protein search using BLAST, the sequence clustering using CD-HIT, and the sequence alignment using Clustal; and II) the construction of a nucleotide alignment guided by the amino acid alignment; B. General information on the obi command and its usage can be accessed with the --help flag; C. OBI provides different running configurations that allow users to customize the pipeline run according to their preferences.

Data preparation

OBI exposes several configurable entry points, which supports its use in different contexts and lets users curate the data manually if needed. When running the complete pipeline, the query sequence provided by the user is processed in three steps: data preparation, positive selection analysis, and the production of an output with the information obtained from the positive selection analysis. In the initial stage, the software retrieves the homologous proteins for the query sequence, provided by the user in a FASTA file, using the Python implementation of BLAST [18]. These results can be filtered according to user preferences to build the most appropriate sequence alignment, which is necessary for reliable evolutionary inferences (see Fig. 1C) [19,20]. After the homologous sequences are retrieved, they are clustered to reduce redundancy and improve the performance of the following steps [21] (see Fig. 1A). The CD-HIT algorithm [22] is used for this step. The outputs generated include a FASTA file with the query protein and the non-redundant homologous sequences, which are subsequently aligned using Clustal Omega (or ClustalO) [23] to feed the evolutionary reconstruction software. In this step, OBI also retrieves the coding gene sequences for all the homologous proteins through Entrez [13]. This database links gene-oriented and genome information with protein information. From the information provided by Entrez, and guided by the protein alignment previously obtained, OBI builds an equivalent nucleotide alignment of the coding regions to feed the evolutionary reconstruction software in the following step. When alignment curation is required, users can omit the --include-analysis parameter, so that the positive selection analysis is not executed.
After manually curating the outputs of the first step, users can resume the pipeline from the curated outputs by running the analysis command (Fig. 2B).
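The idea of building a nucleotide alignment guided by the protein alignment can be sketched as follows. This is a simplified illustration, not OBI's code: the function name and the toy sequences are hypothetical, and the sketch assumes that each coding sequence starts in frame and encodes exactly the aligned protein (a trailing stop codon is ignored).

```python
def protein_to_codon_alignment(aligned_protein, cds, gap="-"):
    """Thread an unaligned coding sequence onto a gapped protein sequence.

    Each residue consumes the next three nucleotides of the coding sequence,
    and each gap in the protein alignment becomes a three-character gap.
    """
    residues = aligned_protein.replace(gap, "")
    if len(cds) < 3 * len(residues):
        raise ValueError("coding sequence is shorter than the protein requires")
    codons = []
    position = 0
    for symbol in aligned_protein:
        if symbol == gap:
            codons.append(gap * 3)
        else:
            codons.append(cds[position:position + 3])
            position += 3
    return "".join(codons)


# Toy protein alignment and matching coding sequences (made-up data).
aligned = {"seq1": "MK-L", "seq2": "MKVL"}
coding = {"seq1": "ATGAAACTG", "seq2": "ATGAAAGTGCTG"}

nucleotide_alignment = {
    name: protein_to_codon_alignment(aligned[name], coding[name])
    for name in aligned
}
for name, row in nucleotide_alignment.items():
    print(name, row)
```

Because gaps are inserted in whole codons, the reading frame is preserved in every row, which codon-based selection methods require.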

Fig. 2

A. Alternative local or remote positive selection analysis flows: OBI provides users with two different strategies for running the positive selection analysis; B. Positive selection analysis can be run separately from the alignment construction and homologous protein retrieval, by using the output files of the first step; C. When running OBI in remote mode, the analysis can be resumed with the resume command.

Positive selection analysis

The OBI pipeline provides two alternative execution workflows for the positive selection analysis (Fig. 2A). Executing the pipeline locally requires extra installations, which OBI handles for the user during its initial setup. Phylogenetic inference is performed with the IQ-TREE [24,25] software, which finds the best maximum likelihood tree [26] guided by a heuristic search. This phylogenetic tree serves as input for positive selection analysis with HyPhy [14]. In particular, OBI uses the MEME method [27], a computational technique that aims to identify instances of episodic and pervasive positive selection at the level of an individual site. It has been shown to outperform other models under a broad range of scenarios [28], [29], [30]. OBI also lets users run HyPhy remotely through the Datamonkey REST API [31]. As an initial result of remote execution, the response from the server is persisted to a datamonkey_response.json file within the output directory, so that the user can resume the work with the resume command (Fig. 2C). When this command is run, OBI finds the previously saved response, checks the status of the analysis on the HyPhy server and, if it has finished, fetches the results to continue with the rest of the pipeline.
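The save-and-resume behaviour described above can be sketched offline. Only the file name datamonkey_response.json comes from the text; the `job_id`, `state`, and `results` keys, the function names, and the `fake_fetch_status` stand-in for the remote status request are assumptions made for illustration and do not reflect the actual Datamonkey API or OBI's code.

```python
import json
import tempfile
from pathlib import Path

RESPONSE_FILE = "datamonkey_response.json"  # file name given in the text


def save_response(output_dir, response):
    """Persist the server's initial response so the run can be resumed later."""
    path = Path(output_dir) / RESPONSE_FILE
    path.write_text(json.dumps(response))
    return path


def resume(output_dir, fetch_status):
    """Reload the saved response and return the results if the job finished.

    fetch_status is a hypothetical callable standing in for the remote status
    request; it takes a job id and returns a dict with "state" and "results".
    """
    path = Path(output_dir) / RESPONSE_FILE
    if not path.exists():
        raise FileNotFoundError(f"no saved response in {output_dir}")
    saved = json.loads(path.read_text())
    status = fetch_status(saved["job_id"])
    if status["state"] != "completed":
        return None  # still running: the caller can try again later
    return status["results"]


def fake_fetch_status(job_id):
    """Offline stand-in for the server, used only in this demonstration."""
    return {"state": "completed", "results": {"job_id": job_id, "sites": 3}}


with tempfile.TemporaryDirectory() as output_dir:
    save_response(output_dir, {"job_id": "abc123"})
    results = resume(output_dir, fake_fetch_status)

print(results)
```

Persisting the response to disk means a long remote analysis does not require the local process to stay alive while the server works.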

Report and deliverables

Our tool allows the user to obtain all its intermediate results individually. Both the protein and nucleic acid alignments are provided as deliverables, to make the analysis reproducible. The evolutionary analysis generates multiple deliverables, such as the BLAST search result and the phylogenetic trees. From the positive selection analysis results, OBI generates a summary report. For each codon, this report contains the gene sequence id, the codon's sequence, the position in the codon alignment, the positive selection analysis p-value, the corresponding amino acid in the protein, the position in the protein alignment, and the related PDB information for the protein. The structural information of each analyzed protein is automatically mapped to the protein sequence using the SIFTS database [32]. This report is written into the chosen results directory with the name positive_selection.json (Fig. 3B). A complete input and output example can be found in the OBI project's repository, together with the commands for executing the pipeline on the human hemoglobin protein.
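A report of this kind could be post-processed as in the sketch below, which keeps the codons with a low p-value that also have structural information. The JSON layout, the field names, the sample values, and the 0.1 threshold are all hypothetical; the real positive_selection.json may be organized differently, so the keys should be checked against an actual report before reuse.

```python
import json

# Hypothetical report content with made-up values; the real file may differ.
SAMPLE_REPORT = """
[
  {"gene_id": "gene_1", "codon": "GTG", "codon_position": 1,
   "p_value": 0.67, "amino_acid": "V", "protein_position": 1, "pdb": []},
  {"gene_id": "gene_1", "codon": "CTG", "codon_position": 2,
   "p_value": 0.03, "amino_acid": "L", "protein_position": 2,
   "pdb": [{"pdb_id": "example", "chain": "A", "residue_number": 2}]},
  {"gene_id": "gene_1", "codon": "TCT", "codon_position": 3,
   "p_value": 0.05, "amino_acid": "S", "protein_position": 3, "pdb": []}
]
"""


def sites_under_selection(records, threshold=0.1):
    """Return the records whose p-value is at or below the threshold."""
    return [record for record in records if record["p_value"] <= threshold]


def with_structure(records):
    """Return the records that carry at least one mapped PDB residue."""
    return [record for record in records if record["pdb"]]


records = json.loads(SAMPLE_REPORT)
candidates = sites_under_selection(records)
structural_candidates = with_structure(candidates)

print([record["protein_position"] for record in candidates])
print([record["protein_position"] for record in structural_candidates])
```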

Fig. 3

Report and deliverables example: A. OBI's command-line interface running the analysis command; B. Example of the JSON report file content; C. Example of a phylogenetic tree generated by OBI; D. Example of a protein sequence alignment.

Software distribution

The OBI software is distributed via Conda, an open-source package and environment management system commonly used for bioinformatics and research projects. OBI is a multiplatform tool, meaning that it can be installed and used on Windows, Linux, and macOS. OBI can be used as a stand-alone tool for automated bioinformatic analysis, which may be useful for users without coding skills. Alternatively, it can be used as a Python library, which allows easy integration into other bioinformatics pipelines. This is a key aspect of OBI, since many standard bioinformatics tools, such as HyPhy and BLAST, are only available as executables, which reduces interoperability. The OBI project is open to contributions, and it can also be downloaded and installed from the source code on GitHub (https://github.com/jcalvento/obi).

Conclusion

Here we presented OBI, an open-source tool built in Python that aims to ease the positive selection analysis of proteins. It provides a starting point for several specific pipelines and future work. Its open-source code can be easily integrated into bioinformatics pipelines as a Conda package, or used as an initial code base to be adapted. Our software allows not only the full analysis of a query protein but also user-customized analysis with different entry points. OBI automatically retrieves all the homologous sequences for the analysis and maps the positions under positive selection to all the PDB structures available for the query protein. The high-level approach for retrieving the structural and evolutive data for a protein through OBI facilitates its application to large-scale analysis. Our tool makes a significant contribution to bioinformatics, since it addresses a problem of great interest to the field by applying software architecture techniques that maximize robustness and flexibility. We hope to provide with OBI a tool that reliably speeds up the evolutionary and structural analysis of proteins on a large scale.

Declaration of Competing Interest

The authors have no conflicts of interest to declare.

Subject area:	Bioinformatics
More specific subject area:	Biochemistry, Genetics and Molecular Biology; Chemistry; Computer Science
Method name:	OBI
Name and reference of original method:	HyPhy
Resource availability:	https://anaconda.org/jcalvento/obi

29 in total

1. UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View.

Authors: Emmanuel Boutet; Damien Lieberherr; Michael Tognolli; Michel Schneider; Parit Bansal; Alan J Bridge; Sylvain Poux; Lydie Bougueleret; Ioannis Xenarios
Journal: Methods Mol Biol Date: 2016

2. Elucidation of phenotypic adaptations: Molecular analyses of dim-light vision proteins in vertebrates.

Authors: Shozo Yokoyama; Takashi Tada; Huan Zhang; Lyle Britt
Journal: Proc Natl Acad Sci U S A Date: 2008-09-03 Impact factor: 11.205

3. Evolution of conformational dynamics determines the conversion of a promiscuous generalist into a specialist enzyme.

Authors: Taisong Zou; Valeria A Risso; Jose A Gavira; Jose M Sanchez-Ruiz; S Banu Ozkan
Journal: Mol Biol Evol Date: 2014-10-13 Impact factor: 16.240

4. Datamonkey 2.0: A Modern Web Application for Characterizing Selective and Other Evolutionary Processes.

Authors: Steven Weaver; Stephen D Shank; Stephanie J Spielman; Michael Li; Spencer V Muse; Sergei L Kosakovsky Pond
Journal: Mol Biol Evol Date: 2018-03-01 Impact factor: 16.240

5. Protein promiscuity: drug resistance and native functions--HIV-1 case.

Authors: Ariel Fernández; Dan S Tawfik; Ben Berkhout; Rogier Sanders; Andrzej Kloczkowski; Taner Sen; Bob Jernigan
Journal: J Biomol Struct Dyn Date: 2005-06

6. Adaptation and convergence in circadian-related genes in Iberian freshwater fish.

Authors: Maria M Coelho; Vitor C Sousa; João M Moreno; Tiago F Jesus
Journal: BMC Ecol Evol Date: 2021-03-08

7. Identification of positive selection in genes is greatly improved by using experimentally informed site-specific models.

Authors: Jesse D Bloom
Journal: Biol Direct Date: 2017-01-17 Impact factor: 4.540

8. DGINN, an automated and highly-flexible pipeline for the detection of genetic innovations on protein-coding genes.

Authors: Lea Picard; Quentin Ganivet; Omran Allatif; Andrea Cimarelli; Laurent Guéguen; Lucie Etienne
Journal: Nucleic Acids Res Date: 2020-10-09 Impact factor: 16.971

9. CD-HIT Suite: a web server for clustering and comparing biological sequences.

Authors: Ying Huang; Beifang Niu; Ying Gao; Limin Fu; Weizhong Li
Journal: Bioinformatics Date: 2010-01-06 Impact factor: 6.937

10. SIFTS: updated Structure Integration with Function, Taxonomy and Sequences resource allows 40-fold increase in coverage of structure-based annotations for proteins.

Authors: Jose M Dana; Aleksandras Gutmanas; Nidhi Tyagi; Guoying Qi; Claire O'Donovan; Maria Martin; Sameer Velankar
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971