
GASS-WEB: a web server for identifying enzyme active sites based on genetic algorithms.

João P A Moraes¹, Gisele L Pappa², Douglas E V Pires³, Sandro C Izidoro¹.

Abstract

Enzyme active sites are important and conserved functional regions of proteins whose identification can be an invaluable step toward protein function prediction. Most of the existing methods for this task are based on active site similarity and present limitations, including performing only exact matches on template residues, restricting template size, and being unable to find inter-domain active sites. To fill this gap, we propose GASS-WEB, a user-friendly web server that uses GASS (Genetic Active Site Search), a method based on an evolutionary algorithm to search for similar active sites in proteins. GASS-WEB can be used under two different scenarios: (i) given a protein of interest, to match a set of specific active site templates; or (ii) given an active site template, to look for it in a database of protein structures. The method has been shown to be very effective in a range of experiments and was able to correctly identify >90% of the catalogued active sites from the Catalytic Site Atlas. It also achieved a Matthews correlation coefficient of 0.63 on the Critical Assessment of protein Structure Prediction (CASP 10) dataset. In our analysis, GASS ranked fourth among 18 methods. GASS-WEB is freely available at http://gass.unifei.edu.br/.

Substances: Enzymes

Year: 2017 PMID: 28459991 PMCID: PMC5570142 DOI: 10.1093/nar/gkx337

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Active sites are regions, usually on the surface of enzymes, shaped during evolution to either catalyse a reaction or bind the substrate. The active site can therefore be divided into two parts: the catalytic site and the substrate binding site (1). Active site amino acid residues are known to be more conserved during evolution than other enzyme regions, a useful property that has been exploited in function prediction tasks (2,3). A number of methods based on the structure of these active sites have been proposed over the years to infer protein function from active site similarity (4–6). Given an active site template, these methods use different mathematical modelling and searching procedures to match the template to a given set of proteins (4–8). Many of the currently available methods, however, present limitations such as performing only exact matches on template residues (not accounting for conservative mutations), restricting the number of amino acids in the template, or pruning the search space using ad hoc procedures, and they are usually not capable of finding inter-domain active sites. To tackle these problems, we proposed GASS (Genetic Active Site Search) (9), a search method based on genetic algorithms. Since then, we have proposed MeGASS (10), a multi-objective version of GASS that also considers the depth of the residues when searching for active sites. This is important because, in general, active sites lie close to the protein surface to allow access of the substrate. Here we propose a user-friendly web server implementation of our method, called GASS-WEB. In addition to catalytic site templates, the method can now perform searches based on binding site templates.
The web server interface has been improved, and complementary information such as enzyme EC number, UniProt accession code and structure resolution is now presented on the results page. The method can be used under two different scenarios: (i) given a protein of interest, it matches the protein to a set of specific templates (i.e. known active sites) stored in a database; or (ii) given an active site template, it searches for it in a database of protein structures. GASS-WEB is freely available through an intuitive web interface at http://gass.unifei.edu.br/.

MATERIALS AND METHODS

Datasets

The GASS-WEB database consists of active site templates and their respective protein structures obtained from the Protein Data Bank. GASS-WEB uses 1691 catalytic site templates based on the Catalytic Site Atlas (CSA) (11); only literature entries were used. The database also comprises 1819 binding site templates from CSA (literature entries only) and 23 318 enzymes from the NCBI-VAST non-redundant database (12). A complementary dataset composed of six subsets (DS) randomly chosen from CSA was also used to test GASS-WEB. Each DS covers a specific protein family (distinct EC number). All protein structures were extracted from the PDB, and their active sites were validated by CSA. Each subset has its own literature-derived template and proteins annotated as CSA HOMOLOG (Supplementary Table S1). In total, the dataset has 551 proteins and 6 literature (LIT) templates.

Genetic active site search

The heuristic search behind GASS-WEB uses the GASS method, which works in two steps: (i) a data pre-processing step, which generates the databases used by the server, and (ii) a genetic algorithm (GA) (13,14), which performs the search itself (Figure 1). The pre-processing step retrieves the proteins of interest to the user and the active site templates from the Protein Data Bank (PDB) (15), CSA and the NCBI-VAST database and returns, for each amino acid, its name, chain, last heavy atom and the coordinates of that atom, which together compose the template. This information is stored in a relational database and accessed by GASS to create its initial pool of candidate active sites. GASS then performs a heuristic search to find matching active sites in the selected proteins and outputs one or more candidate active sites. The method can also search a database of protein structures for a given active site template. To account for conservative mutations, GASS can optionally consult a substitution matrix (8).
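To illustrate the pre-processing step, the following minimal Python sketch extracts one representative point per residue from PDB-format text. It assumes that "last heavy atom" means the last non-hydrogen ATOM record listed for each residue; the GASS implementation may define it differently. The names `ResiduePoint` and `last_heavy_atoms` are hypothetical and are not taken from the GASS code base.

```python
from typing import Dict, NamedTuple, Tuple


class ResiduePoint(NamedTuple):
    res_name: str                      # three-letter residue name, e.g. "HIS"
    chain: str                         # chain identifier
    res_seq: int                       # residue sequence number
    atom_name: str                     # name of the retained atom
    xyz: Tuple[float, float, float]    # coordinates in angstroms


def last_heavy_atoms(pdb_text: str) -> Dict[Tuple[str, int], ResiduePoint]:
    """Map (chain, residue number) to the last non-hydrogen atom of that residue.

    Assumption: "last" means last in file order. Insertion codes and
    alternate locations are ignored to keep the sketch short.
    """
    points: Dict[Tuple[str, int], ResiduePoint] = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        # Fixed-column fields of the PDB ATOM record (0-based slices).
        atom_name = line[12:16].strip()
        element = line[76:78].strip().upper()
        # Fall back to the atom name when the element columns are empty.
        looks_like_h = atom_name.lstrip("0123456789").startswith("H")
        if element == "H" or (not element and looks_like_h):
            continue
        chain = line[21]
        res_seq = int(line[22:26])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        # Later atoms of the same residue overwrite earlier ones.
        points[(chain, res_seq)] = ResiduePoint(
            line[17:20].strip(), chain, res_seq, atom_name, xyz
        )
    return points
```

A record of this kind (name, chain, atom, coordinates) is what would then be written to the relational database for each template residue.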

Figure 1.

GASS method: data are extracted from PDB, CSA and NCBI and pre-processed. GASS performs a heuristic search to find matching active sites in the proteins of interest or, given an active site template, searches for it in the database of protein structures using a genetic algorithm. Matching active sites are then returned to the user.

The GA implemented by GASS evolves a population of individuals, where each individual represents a solution to the problem, i.e. a candidate active site match. These solutions are evaluated and ranked according to a fitness function (Supplementary Equation S1), which for GASS-WEB is a modified root-mean-squared deviation (RMSD) between the template and searched amino acid residues. The fitness function indicates the degree of structural similarity of the candidate active site to the template. After evaluation, individuals are selected to undergo crossover and mutation operations according to tuned probabilities. This process continues until a stop criterion, usually a maximum number of generations (iterations), is met. Unlike other search methods, GASS returns a ranking of the n best solutions found. Since the best solution (best fitness) may be buried in the protein instead of lying in a pocket, it can be useful for the user to analyse a set of solutions to filter out potential false positives.
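The evolutionary loop described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the GASS implementation: the fitness used here is the root-mean-square difference between corresponding pairwise distances, a simple stand-in for the modified RMSD of Supplementary Equation S1; residues are matched by exact type only (no substitution matrix); and the operators (tournament selection, one-point crossover, point mutation, two-individual elitism) and all parameter values are hypothetical choices.

```python
import itertools
import math
import random
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
Residue = Tuple[str, Point]        # (residue type, representative atom position)
Individual = Tuple[int, ...]       # one protein residue index per template position


def distance_rms(template: Sequence[Point], candidate: Sequence[Point]) -> float:
    """RMS difference between corresponding pairwise distances (stand-in fitness)."""
    pairs = list(itertools.combinations(range(len(template)), 2))
    squared = [
        (math.dist(template[i], template[j]) - math.dist(candidate[i], candidate[j])) ** 2
        for i, j in pairs
    ]
    return math.sqrt(sum(squared) / len(squared))


def ga_search(template_types: Sequence[str], template_xyz: Sequence[Point],
              protein: Sequence[Residue], generations: int = 100, pop_size: int = 60,
              p_cross: float = 0.9, p_mut: float = 0.3, n_best: int = 5,
              seed: int = 0) -> List[Tuple[Individual, float]]:
    """Return up to n_best (individual, fitness) pairs, best (lowest) fitness first."""
    rng = random.Random(seed)
    # Each template position may only be matched by residues of the same type.
    pools = [[i for i, (kind, _) in enumerate(protein) if kind == wanted]
             for wanted in template_types]
    if len(pools) < 2 or any(not pool for pool in pools):
        return []

    cache: Dict[Individual, float] = {}

    def fitness(ind: Individual) -> float:
        if ind not in cache:
            if len(set(ind)) < len(ind):          # same residue used twice: invalid
                cache[ind] = math.inf
            else:
                cache[ind] = distance_rms(template_xyz, [protein[i][1] for i in ind])
        return cache[ind]

    population = [tuple(rng.choice(pool) for pool in pools) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness)
        next_pop = ranked[:2]                      # elitism: keep the two best
        while len(next_pop) < pop_size:
            a = min(rng.sample(population, 3), key=fitness)   # tournament selection
            b = min(rng.sample(population, 3), key=fitness)
            child = list(a)
            if rng.random() < p_cross:
                cut = rng.randrange(1, len(child))            # one-point crossover
                child = list(a[:cut] + b[cut:])
            if rng.random() < p_mut:
                pos = rng.randrange(len(child))               # point mutation
                child[pos] = rng.choice(pools[pos])
            next_pop.append(tuple(child))
        population = next_pop
    for ind in population:                         # evaluate the final generation
        fitness(ind)

    # The cache holds every individual evaluated; rank the valid ones.
    valid = [(ind, fit) for ind, fit in cache.items() if math.isfinite(fit)]
    return sorted(valid, key=lambda item: item[1])[:n_best]
```

Returning a ranked list instead of a single winner mirrors the point made in the text: the user can inspect several candidates and discard buried or otherwise implausible ones.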

WEB SERVER

GASS-WEB was implemented using the Flask framework for Python (16), and the front-end was designed with the Bootstrap framework (17), a well-established tool for intuitive design. The GASS-WEB back-end runs on top of an Apache server, and communication with the server goes through the Web Server Gateway Interface (WSGI), as specified by Python Enhancement Proposal (PEP) 333. The server works by filtering and redirecting the user input to a C++ implementation of the GASS search method. All requests are queued using Redis Queue (RQ) and handled by RQ workers, which allows easy control and parallel processing of running jobs.
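The request flow (a WSGI entry point that validates input, queues a job and lets a worker run the search) can be sketched with the Python standard library alone. The sketch deliberately does not use Flask, Redis or RQ: an in-process `queue.Queue` and a thread stand in for the Redis-backed queue and its workers. The function `run_search`, the query parameters `pdb` and `size`, and the JSON layout are all hypothetical. The callable follows the WSGI calling convention in its Python 3 form (PEP 3333, the successor of PEP 333), so the response body is bytes.

```python
import json
import queue
import threading
import uuid
from urllib.parse import parse_qs

jobs = {}                  # job id -> {"status": ..., "result": ...}
pending = queue.Queue()    # stand-in for the Redis-backed job queue


def run_search(pdb_code, template_size):
    """Placeholder for the call into the C++ search program (hypothetical)."""
    return {"pdb": pdb_code, "template_size": template_size, "matches": []}


def worker():
    """Stand-in for a queue worker: take jobs one by one and run them."""
    while True:
        job_id, args = pending.get()
        jobs[job_id] = {"status": "finished", "result": run_search(*args)}
        pending.task_done()


def application(environ, start_response):
    """WSGI callable: validate the query, enqueue a job, answer with its id."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    pdb_code = params.get("pdb", [""])[0]
    size = params.get("size", ["3"])[0]
    headers = [("Content-Type", "application/json")]
    if len(pdb_code) != 4 or not size.isdigit():
        start_response("400 Bad Request", headers)
        return [json.dumps({"error": "expected pdb=<4 chars>&size=<int>"}).encode("utf-8")]
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued"}
    pending.put((job_id, (pdb_code.upper(), int(size))))
    start_response("202 Accepted", headers)
    return [json.dumps({"job": job_id}).encode("utf-8")]


# One background worker; more threads would process jobs in parallel.
threading.Thread(target=worker, daemon=True).start()
```

Decoupling request handling from job execution in this way is what lets long searches run without blocking the web front-end.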

Input

All searches on GASS-WEB are performed based on a PDB file, which can be provided by the user through the four-letter code (as used by the RCSB PDB) or by uploading a custom file. A custom file must comply with the PDB format. The file is then validated and converted into a binary file for the heuristic search method. GASS-WEB offers three types of search: searching proteins for CSA active sites (using CSA catalytic site templates), searching proteins for CSA binding sites (using binding site templates generated from CSA literature entries) and searching the NCBI-VAST database for specific active site templates. The first two, CSA sites search and CSA binding sites search, take the same input. To perform a CSA sites search, for example, the user provides a protein structure by either uploading a PDB-format file or supplying a four-letter PDB code (Figure 2A1 and A2). The next step is to choose the template size for matching, which is the number of residues in the active site (Figure 2A3). The query can then be submitted to GASS-WEB (Figure 2A4). GASS-WEB takes about 1 min to show the results for both search types.
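The two ways of supplying a structure can be illustrated with a small Python sketch. The checks shown are assumptions about what a minimal validation could look like (a four-character identifier made of a digit followed by three alphanumeric characters, or an uploaded text containing at least one ATOM record); the server's actual validation rules are not described in the text, and `resolve_input` is a hypothetical name.

```python
import re
from typing import Optional

# Assumed pattern for a four-character PDB identifier: a digit, then three alphanumerics.
PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")


def resolve_input(pdb_code: Optional[str], uploaded_text: Optional[str]) -> dict:
    """Decide which structure source to use and run a minimal sanity check."""
    if pdb_code:
        code = pdb_code.strip()
        if not PDB_ID.fullmatch(code):
            raise ValueError("PDB code must be four characters, e.g. 2GCT")
        return {"source": "pdb_code", "pdb_id": code.upper()}
    if uploaded_text:
        # Count coordinate records; a file without any cannot be searched.
        n_atoms = sum(1 for line in uploaded_text.splitlines() if line.startswith("ATOM"))
        if n_atoms == 0:
            raise ValueError("uploaded file contains no ATOM records")
        return {"source": "upload", "atom_records": n_atoms}
    raise ValueError("provide either a PDB code or an uploaded PDB file")
```

For example, `resolve_input("2gct", None)` returns `{"source": "pdb_code", "pdb_id": "2GCT"}`.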

Figure 2.

(A) Protein search for catalytic or binding sites requires the protein PDB file (Step 1) and the template size (Step 2). (B) NCBI-VAST Database search requires the PDB file (Step 1) and a template (Step 2), and has an optional field for email allowing the user to be emailed once the search finishes (Step 3). (C) Protein search for catalytic sites results page.

The search against the NCBI-VAST database also requires a protein structure, provided by either uploading a file (PDB format) or supplying a four-letter PDB code (Figure 2B1 and B2). Unlike the other two searches, it requires an active site template (Figure 2B3). The format of the template is detailed on the website. This search takes considerably longer to finish than the other two (about 50 min), as it compares the template to all proteins in the NCBI-VAST database. For this reason, an optional field was added that allows the user to receive the results via email once the search finishes (Figure 2B4).

Output

After the search is complete, the user is automatically redirected to the results page, where results are displayed and can also be downloaded as a CSV file (Figure 2C8). Results are kept for a week and can be accessed using the job URL. The results are displayed in a table ordered by the fitness score (modified RMSD) of the matched residues of each solution (Figure 2C1), followed by the list of residues found by GASS-WEB (Figure 2C2), the template (PDB ID) (Figure 2C3), the residue list of the matched template (Figure 2C4), and the EC number, UniProt accession code and resolution of the matched template (Figure 2C5, C6 and C7). Columns C2 and C3 in Figure 2 have a small icon to visualize the predicted active site and the template using the GLMol plugin. When searching for binding sites, a column displaying the ligand in the matched template is also added to the results. Reporting a ranking of candidate active sites allows users to evaluate the results closely and to identify potential false positives easily, given the complementary supporting information shown for each solution found.
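The shape of such a results table can be sketched in Python as a list of records sorted by fitness and serialised to CSV. The field names below are hypothetical labels for the columns listed in the text, not the header of the file the server actually produces.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class Match:
    fitness: float            # modified RMSD of the solution, lower is better
    matched_residues: str     # residues found in the query protein
    template_pdb: str         # PDB ID of the matched template
    template_residues: str    # residues of the matched template
    ec_number: str
    uniprot: str
    resolution: float         # resolution of the template structure, in angstroms


def to_csv(matches: List[Match]) -> str:
    """Serialise matches as CSV text, best (lowest) fitness first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=[f.name for f in fields(Match)], lineterminator="\n"
    )
    writer.writeheader()
    for match in sorted(matches, key=lambda m: m.fitness):
        writer.writerow(asdict(match))
    return buffer.getvalue()
```

Sorting before writing keeps the downloaded file in the same order as the on-screen ranking.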

RESULTS

The GASS method was previously tested and proved to be very effective in a number of experiments. Based on the CSA annotation, it correctly identified more than 90% of the catalogued catalytic sites. In specific enzyme sets such as Nitric Oxide Synthase (125 enzymes) and Trypsin-like (1085 enzymes), GASS-WEB found more than 90% of the active sites correctly within the top five positions of the ranking (9). This property is very desirable since it facilitates the user's visual inspection. It also achieved a Matthews correlation coefficient (MCC) of 0.63 on the Critical Assessment of protein Structure Prediction (CASP 10) dataset. In our analysis, GASS ranked fourth among 18 methods (9). To further evaluate the method's performance, we carried out a test on a dataset composed of six subsets (DS1:DS6) randomly chosen from CSA literature entries. Each DS has a template and a specific protein family (distinct EC number). All proteins were extracted from the PDB and their active sites validated in CSA. The average accuracy of GASS-WEB over all subsets was 99%. Figure 3 (blue) shows a cumulative match score (CMS) curve for the experiment. This curve shows the relation between the number of correct catalytic sites found according to CSA and their position in the GASS-WEB ranking. As observed, the best catalytic site candidates appear mostly in the top five positions of the ranking. The accuracy value of each subset as well as a ROC graph are in the Supplementary Data.
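For readers unfamiliar with the two measures, the sketch below computes the Matthews correlation coefficient from a confusion matrix and a cumulative match score curve from ranking positions. The MCC formula is the standard one. The CMS definition used here (the fraction of proteins whose first correct site appears at rank k or better) is an assumption about how the curve in Figure 3 is built, and the function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cumulative_match_scores(first_correct_rank: Sequence[Optional[int]],
                            max_rank: int) -> List[float]:
    """Fraction of proteins with a correct site at rank <= k, for k = 1..max_rank.

    `first_correct_rank` holds, per protein, the 1-based rank of the first
    correct site, or None when no correct site was reported.
    """
    total = len(first_correct_rank)
    if total == 0:
        return [0.0] * max_rank
    return [
        sum(1 for r in first_correct_rank if r is not None and r <= k) / total
        for k in range(1, max_rank + 1)
    ]
```

With five proteins whose first correct hits are at ranks 1, 1, 2, none and 5, `cumulative_match_scores([1, 1, 2, None, 5], 5)` gives `[0.4, 0.6, 0.6, 0.6, 0.8]`.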

Figure 3.

CMS score of catalytic sites found correctly according to the CSA—Datasets DS1:DS6 (blue). CMS score of catalytic sites found correctly according to the CSA—NCBI-VAST database experiment (red).

In another experiment, GASS-WEB correctly found the catalytic site of the enzyme 2GCT (Gamma-Chymotrypsin A), as annotated in CSA, at the first position of the ranking. Annotated by homology, the enzyme has HIS 57 and ASP 102 in chain B, and SER 195 in chain C, showing that finding inter-domain catalytic sites is not a limitation for the method. Figure 2C shows the result of this experiment. Further case studies can be found in Supplementary Data. In a complementary experiment, we evaluated the fitness distribution obtained by GASS-WEB on a large-scale search, using the NCBI-VAST database (23 318 proteins) and the template enzyme 1ACB (bovine alpha-chymotrypsin-eglin C complex). The number of catalytic sites found according to CSA was 270 (1.16% of all structures); of these, 138 (51.11%) had a fitness below 1 Å and 126 (46.67%) had a fitness between 1 Å and 5 Å. Thus, 97.78% of the sites found according to the CSA have a fitness value ≤5 Å. This indicates that, according to the fitness function (Supplementary Data), the sites found have a high degree of structural similarity with the template. Figure 3 (red) shows a CMS curve for the experiment. Among the 200 similar active sites reported by GASS-WEB using the template 1ACB, the number of catalytic sites found according to CSA was 31. It is important to emphasize that the results reported by GASS-WEB include correct active sites that are not annotated in the CSA (for more examples, please see Supplementary Data). Like the fitness value, the ranking position can be very useful information, assisting the user in the inspection and validation of the results.
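The percentages quoted for the large-scale search follow directly from the reported counts, as the short Python check below shows. Only the counts stated in the text are used; the variable names are illustrative.

```python
# Counts reported for the NCBI-VAST search with template 1ACB.
total_structures = 23318
sites_found = 270          # catalytic sites found according to CSA
below_1 = 138              # fitness below 1 angstrom
between_1_and_5 = 126      # fitness between 1 and 5 angstroms


def pct(part: int, whole: int) -> float:
    """Percentage rounded to two decimals, as in the text."""
    return round(100.0 * part / whole, 2)


summary = {
    "found_of_all_structures": pct(sites_found, total_structures),   # 1.16
    "below_1": pct(below_1, sites_found),                            # 51.11
    "between_1_and_5": pct(between_1_and_5, sites_found),            # 46.67
    "at_most_5": pct(below_1 + between_1_and_5, sites_found),        # 97.78
    "above_5_count": sites_found - below_1 - between_1_and_5,        # 6
}
print(summary)
```

The remaining 6 of the 270 sites are therefore the ones with a fitness value above 5 Å.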

CONCLUSION

GASS-WEB is a free and user-friendly web server for searching for similar active sites based on data from the PDB, CSA and NCBI-VAST. Based on these three resources, GASS-WEB can use catalytic and binding site templates to search for similar sites in a protein. It can also use a specific active site template to search for similar active sites in a protein database. Free of the limitations of traditional methods (performing only exact matches, restricting the number of amino acids in the template, pruning the search space using ad hoc procedures, and being unable to find inter-domain active sites), the method proved effective in finding similar active sites in most datasets. In our analysis, GASS-WEB achieved an MCC of 0.63 on the Critical Assessment of protein Structure Prediction (CASP 10) dataset (ranking fourth among 18 methods) and correctly identified >90% of the catalogued active sites in CSA. In a dataset with six specific protein families (551 proteins and 6 templates), the average accuracy of GASS-WEB was 99%. In a large-scale search (NCBI-VAST), 97.78% of the sites found according to the CSA had a fitness value ≤5 Å and appeared mostly in the top five positions of the ranking. This implies that both fitness value and ranking position can help the user in the inspection and validation of the results. During the experiments, some sites that agree with the literature but are not included in the CSA were found (for more details, please see Supplementary Data). This indicates that our method can help improve and extend the coverage of resources such as the CSA and similar databases. We believe, therefore, that GASS-WEB could be an invaluable tool for assisting protein function prediction and active site annotation.

10 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Annotation in three dimensions. PINTS: Patterns in Non-homologous Tertiary Structures.

Authors: Alexander Stark; Robert B Russell
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

3. A threading-based method (FINDSITE) for ligand-binding site prediction and functional annotation.

Authors: Michal Brylinski; Jeffrey Skolnick
Journal: Proc Natl Acad Sci U S A Date: 2007-12-28 Impact factor: 11.205

4. GASS: identifying enzyme active sites with genetic algorithms.

Authors: Sandro C Izidoro; Raquel C de Melo-Minardi; Gisele L Pappa
Journal: Bioinformatics Date: 2014-11-10 Impact factor: 6.937

5. 3DLigandSite: predicting ligand-binding sites using similar structures.

Authors: Mark N Wass; Lawrence A Kelley; Michael J E Sternberg
Journal: Nucleic Acids Res Date: 2010-05-31 Impact factor: 16.971

6. The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes.

Authors: Nicholas Furnham; Gemma L Holliday; Tjaart A P de Beer; Julius O B Jacobsen; William R Pearson; Janet M Thornton
Journal: Nucleic Acids Res Date: 2013-12-06 Impact factor: 16.971

7. MMDB and VAST+: tracking structural similarities between macromolecular complexes.

Authors: Thomas Madej; Christopher J Lanczycki; Dachuan Zhang; Paul A Thiessen; Renata C Geer; Aron Marchler-Bauer; Stephen H Bryant
Journal: Nucleic Acids Res Date: 2013-12-06 Impact factor: 16.971

Review 8. Proteins and Their Interacting Partners: An Introduction to Protein-Ligand Binding Site Prediction Methods.

Authors: Daniel Barry Roche; Danielle Allison Brackenridge; Liam James McGuffin
Journal: Int J Mol Sci Date: 2015-12-15 Impact factor: 5.923

9. SPRITE and ASSAM: web servers for side chain 3D-motif searching in protein structures.

Authors: Nurul Nadzirin; Eleanor J Gardiner; Peter Willett; Peter J Artymiuk; Mohd Firdaus-Raih
Journal: Nucleic Acids Res Date: 2012-05-09 Impact factor: 16.971

10. Rapid catalytic template searching as an enzyme function prediction procedure.

Authors: Jerome P Nilmeier; Daniel A Kirshner; Sergio E Wong; Felice C Lightstone
Journal: PLoS One Date: 2013-05-10 Impact factor: 3.240
