MetalDetector v2.0: predicting the geometry of metal binding sites from protein sequence.

Andrea Passerini, Marco Lippi, Paolo Frasconi.

Abstract

MetalDetector identifies CYS and HIS involved in transition metal protein binding sites, starting from sequence alone. A major new feature of release 2.0 is the ability to predict which residues are jointly involved in the coordination of the same metal ion. The server is available at http://metaldetector.dsi.unifi.it/v2.0/.

Year: 2011 PMID: 21576237 PMCID: PMC3125771 DOI: 10.1093/nar/gkr365

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Metalloproteins are a large and diverse class of proteins that bind one or more metal ions in their native conformation (1). Metal atoms play a wide range of structural, regulatory or catalytic roles that are critical to protein function (2). Zinc ions, for instance, help stabilize the structure of a huge number of transcription factors such as zinc fingers. Enzymes often employ metal ions as cofactors in their catalytic sites (3). Metal binding proteins are implicated in heavy metal toxicity and in processes such as apoptosis (4), aging (5) and carcinogenesis (6). Identifying metal binding sites in novel proteins can significantly contribute to their functional characterization, as well as help in understanding metal-related malfunctions. High-throughput X-ray absorption spectroscopy (HT-XAS) has recently proved capable of identifying metalloproteins with high reliability (7,8). However, the specific ligands involved in binding the metal ion(s) cannot be identified by this technique. Bioinformatics tools can significantly contribute to a detailed annotation of metal binding sites, as well as to scaling up to proteome-wide analyses. Motif-based approaches, relying on regular expression patterns or Pfam probabilistic models, have been employed (9) for sequence-based predictions on entire proteomes. The drawback of these methods is that they cannot identify novel sites: regular expression patterns tend to be quite specific but have low coverage (many false negatives), and Pfam models are limited to known metal-binding domains. To overcome these limitations, a number of supervised learning techniques [e.g. (10,11,12)] have recently been developed for predicting the metal bonding state of all residues in a sequence. The task consists of discriminating between free and metal-bonded residues (or disulfide-bonded residues, for cysteines).
MetalDetector (13) predicts the metal-bonding state of CYS and HIS residues, focusing on transition metals, heme and Fe/S groups as candidate heterogens. The system has been active since April 2008 and has served roughly 10 000 queries so far. It was recently (8) employed in combination with HT-XAS to identify putative metal binding sites in a large set of protein targets generated within the Protein Structure Initiative (http://www.structuralgenomics.org). Identification of binding site geometry is the main new feature of release 2.0 presented in this article. The task consists of predicting the number of ions bound to the protein together with their respective sets of ligands in the sequence. Figure 1 shows an example of a protein kinase C cysteine-rich domain (PDB entry 1tbn). It highlights the 3D structure of the binding sites (top) and a graph-based representation of the input sequence together with the desired output (bottom). These predictions can have a significant impact on a number of tasks, including: detailed functional annotation of experimentally unsolved proteins, e.g. characterization of active sites in enzymes, many of which employ metal ions as cofactors (3); and experimental determination of new metalloproteins, as the prediction of metal binding sites can guide the preparation of samples for in vitro studies (7).

Figure 1.

Metal binding prediction subtasks. (a) given sequence; (b) candidate ligands (CYS and HIS) are assigned a bonding state (boldface for metal binding); (c) metal-binding residues are grouped to form binding site configurations.

There exist several web servers for metal-binding site prediction. DiANNA (10) predicts cysteine-bonding states only and cannot reconstruct metal-binding site geometry; MetSite (14) identifies sites using sequence profile information in combination with approximate structural data coming from low-resolution (or predicted) models; FINDSITE-metal (15) predicts metal-binding sites from evolutionarily related templates detected by threading; Feature (16) identifies zinc-binding sites for proteins whose 3D structure is given. The applicability of the structure-based servers is thus limited to structurally determined proteins, or proteins for which a reasonable 3D model can be derived. SeqCHED (17) is a recently developed server predicting metal binding geometry from protein sequence, which relies on remote homology detection to create a structural model of the target protein, over which the original CHED (18) structure-based algorithm is applied. It thus cannot predict metal binding sites for proteins having novel folds. Similar limitations hold for the above-mentioned pattern-based or domain-based approaches. MetalDetector2 is the first server capable of predicting metal binding geometry for novel folds starting from sequence information alone.

MATERIALS AND METHODS

Overview

There are two crucial aspects concerning prediction of metal binding geometry. First, the number of admissible configurations can be extremely large. For a protein chain with n CYS and HIS (candidate ligands), m ions and k_i ligands for the i-th ion, the number of configurations is the multinomial coefficient $\frac{n!}{k_1!\,k_2!\cdots k_m!}$. In practice, each ion is coordinated by a variable number of ligands (typically ranging from 1 to 4, but occasionally more), and each protein chain binds a variable number of ions (typically ranging from 1 to 4). Assuming n = 12, m = 2 and k_1 = k_2 = 4 (as in the small example shown in Figure 1), we obtain 831 600 alternative configurations. We are not considering the rare exceptions in which a CYS or HIS residue can bind multiple ions (in the December 2009 release of PDB, only 0.9% of HIS and 1.6% of CYS are found to be within 3 Å of two different ions). This assumption allows us to develop an efficient polynomial-time algorithm (19) for geometry prediction. To reduce the output search space and improve accuracy, we limit the maximum number of ions to 4 (covering 97% of known transition metal sites in current PDB).

The second key aspect of the task is that the participation of a residue in a metal binding site should not be predicted independently of the other residues: interdependencies between candidates should be taken into account to form a collective prediction. These aspects strongly suggest solutions based on structured-output learning (20). This recent research field aims at generalizing learning algorithms, traditionally developed for classification or regression tasks, to predict outputs consisting of complex structures (like the one shown in Figure 1c).

In MetalDetector2, identification of binding geometry is decomposed into two cascaded subtasks. The first task consists of assigning one of two bonding states to every CYS and HIS (positive cases are metal-binding residues, negative cases are the rest, including half-cystines, i.e. cysteines forming disulfide bridges). The second task consists of grouping together metal-binding CYS and HIS, assigning them a conventional metal-ion identifier. This process is illustrated in Figure 1. Identification of the chemical element involved is not attempted. The server uses a combination of different machine learning algorithms. The overall operation flow is shown in Figure 2.
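The size of the search space quoted above can be checked with a few lines of Python. This is only a sketch: it assumes the count has the form n!/(k_1! ... k_m!), which is the form that reproduces the 831 600 figure for n = 12 and two ions with four ligands each. The function name `count_configurations` is hypothetical and not part of the server.

```python
from math import factorial, prod


def count_configurations(n, ligands_per_ion):
    """Count configurations as n! / (k_1! * ... * k_m!).

    n               -- number of candidate ligands (CYS and HIS) in the chain
    ligands_per_ion -- list [k_1, ..., k_m] with the ligand count of each ion

    Assumption: this closed form is inferred from the worked example in the
    text (n = 12, k_1 = k_2 = 4 gives 831 600).
    """
    if sum(ligands_per_ion) > n:
        raise ValueError("more ligands requested than candidate residues")
    return factorial(n) // prod(factorial(k) for k in ligands_per_ion)


if __name__ == "__main__":
    print(count_configurations(12, [4, 4]))  # 831600
```

Even for this small example the count is far too large for exhaustive enumeration to be attractive, which is the motivation the text gives for an efficient inference algorithm.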

Figure 2.

Schematic diagram of methods in MetalDetector v2.0.

Bonding state identification

Bonding state identification was the only functionality of MetalDetector1 (13) and is the first stage of prediction in MetalDetector2. In Refs (11,13), we used a bidirectional recurrent neural network and Viterbi decoding with a simple probabilistic automaton to refine local predictions and obtain a collective assignment. In MetalDetector1, it was important to train the predictor including examples of non-metalloproteins and chains rich in disulfide bridges (since otherwise metal-binding CYS and half-cystines could be easily confused). When the input chain is not known to be a metalloprotein, we still rely on MetalDetector1 for prediction (Figure 2). On the other hand, if the input chain is known to be a metalloprotein (users can select a checkbox in the web interface to indicate this knowledge), then half-cystines are rare (<3%) and better accuracy can be obtained by training on metalloproteins only. In this case, half-cystines are not predicted and we solve the supervised sequence labeling task using SVM-HMM (20), a model that can essentially be interpreted as a hidden Markov model with discriminatively learned parameters, and that collectively assigns a bonding state to all CYS and HIS in the sequence. The sequence given to SVM-HMM is the subsequence containing CYS and HIS only, and the observations (emissions) for each position include vectors of multiple alignment profiles among other features. Preliminary experiments showed that the performance difference between MetalDetector1 and SVM-HMM is negligible under the same experimental conditions, while the latter is much simpler to train and engineer. Notably, knowing that a protein binds metal simplifies the prediction task by reducing the space of candidate outputs, resulting in better prediction accuracy on average.
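As a small illustration of the CYS/HIS subsequence mentioned above, the candidate ligands of a chain can be listed with their 1-based sequence positions. This is a minimal sketch only; the function name `candidate_ligands` is hypothetical, and the per-position features used by the server (alignment profiles and others) are not reproduced here.

```python
def candidate_ligands(sequence):
    """Return (position, residue) pairs for every CYS (C) and HIS (H).

    Positions are 1-based, following the usual convention for protein
    sequences. Feature extraction is deliberately left out of this sketch.
    """
    return [
        (position, residue)
        for position, residue in enumerate(sequence.upper(), start=1)
        if residue in ("C", "H")
    ]


if __name__ == "__main__":
    # Toy sequence, not taken from the paper.
    print(candidate_ligands("MCAHKC"))
```

A sequence labeler would then assign one bonding-state label to each element of this list, rather than to every residue in the chain.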

Binding geometry identification

The core and novel feature of MetalDetector2 takes as input a protein chain and a (predicted) bonding state assignment, and predicts binding geometry. This task is formalized as link prediction in a bipartite graph, where a ligand node is connected to an ion node if and only if the residue coordinates that ion. To solve the structured-output learning problem, we introduce a function F(x, y) measuring the ‘compatibility’ between the input information x (sequence and bonding state assignment) and every admissible binding geometry y. The function is a linear combination of features of both x and y. The difficulty in this learning task is the inference step, where F must be maximized with respect to y (in general, this is a hard combinatorial optimization problem). It turns out that under relatively mild assumptions, namely that every CYS or HIS coordinates at most one metal ion, there exists a greedy algorithm that efficiently identifies the binding configuration y maximizing F; see Ref. (19) for details. Features of x and y required to construct F are defined by means of a kernel function that defines the similarity between two chains. The kernel takes into account several sources of information, including the coordination pattern of each (predicted) site and multiple alignment profiles.
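To make the bipartite formulation concrete, the sketch below greedily selects ligand-ion links from a table of hypothetical edge scores while enforcing that each residue is linked to at most one ion. It is a simplified illustration only: it does not reproduce the algorithm of Ref. (19), the learned function F, or any optimality guarantee, and the names `greedy_link_prediction`, `edge_scores` and `max_ligands` are assumptions made for this example.

```python
def greedy_link_prediction(edge_scores, max_ligands=4):
    """Greedily pick (residue, ion) links in order of decreasing score.

    edge_scores -- dict mapping (residue_position, ion_id) to a hypothetical
                   compatibility score; only positive scores are considered
    max_ligands -- cap on ligands per ion (an assumption of this sketch)

    Returns a dict mapping each ion_id to its sorted list of residues.
    """
    assigned = {}  # residue -> ion, enforces at most one ion per residue
    sites = {}     # ion -> list of residues
    ranked = sorted(edge_scores.items(), key=lambda item: item[1], reverse=True)
    for (residue, ion), score in ranked:
        if score <= 0:
            break  # all remaining links score no better
        if residue in assigned:
            continue  # residue already coordinates another ion
        if len(sites.get(ion, [])) >= max_ligands:
            continue  # site is full
        assigned[residue] = ion
        sites.setdefault(ion, []).append(residue)
    return {ion: sorted(members) for ion, members in sites.items()}


if __name__ == "__main__":
    toy_scores = {
        (3, 1): 0.9, (3, 2): 0.2,
        (16, 2): 0.8, (16, 1): 0.7,
        (33, 1): 0.6, (35, 1): -0.5,
    }
    print(greedy_link_prediction(toy_scores))
```

In this toy run residue 35 is left unassigned because its only link has a negative score, which mirrors the idea that a candidate residue may end up outside every site.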

THE WEB SERVER INTERFACE

Input

The input sequence can be entered either as a plain amino acid string or in FASTA format. The web interface allows users to choose between three different settings, corresponding to the three different paths in Figure 2: (i) no prior knowledge (default operation mode); (ii) the chain is known to belong to a metalloprotein; (iii) the chain is known to belong to a metalloprotein, and the user can also provide (a guess for) the bonding state of each CYS and HIS. Note that checking in the web interface that a chain is known to bind metal is a form of positive evidence (i.e. not checking it means ignorance, not negative evidence). This knowledge can be obtained, for example, if the protein was annotated as a metalloprotein via HT-XAS (7,8).
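For readers preparing input programmatically, the two accepted formats can be normalized to a single residue string before submission. The helper below is a hypothetical sketch and is not part of the server; it assumes a single FASTA record and simply drops header lines that start with ">".

```python
def read_sequence(text):
    """Normalize a plain amino acid string or a single FASTA record.

    Header lines (starting with '>') and blank lines are dropped, the
    remaining lines are joined, and the result is upper-cased.
    Assumption: the input holds one record only; several records would be
    concatenated by this sketch.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    residues = [line for line in lines if line and not line.startswith(">")]
    return "".join(residues).replace(" ", "").upper()


if __name__ == "__main__":
    # Toy record; the header and residues are illustrative only.
    print(read_sequence(">example\nMCAH\nKC\n"))
```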

Output

Output is either presented on a separate web page or delivered via e-mail. It consists of a table having an entry for each CYS and HIS, with the indication of its position within the sequence, its predicted bonding state and, if the residue was predicted as metal bonded, the assigned metal ion identifier. Residues predicted to coordinate the same ion share the same identifier. Every identifier is an integer ranging from 1 to 4 (the maximum number of binding sites that can be predicted). Its value has no special biochemical semantics, but lower values correspond to a higher level of confidence for the predictor, as the greedy algorithm first builds the sites in which it is more confident. Figure 3 shows a web browser output for PDB entry 1t3qA.

Figure 3.

Output of the predictor for PDB entry 1t3qA.

RESULTS AND DISCUSSION

We evaluated performance according to several measures:

(i) precision (P_B) and recall (R_B) of residue bonding state; precision is the ratio of true positives to the total number of residues predicted in metal-bonding state; recall (or sensitivity) is the ratio of true positives to the total number of metal-binding residues;

(ii) precision (P_E) and recall (R_E) of ligand prediction, i.e. assignment of a residue to a metal ion. As we are not trying to predict the chemical element of each ion but to correctly group together ligands of the same ion, equivalence classes due to arbitrary reordering of ion identifiers are taken into account. In Figure 1, for instance, the correct labeling is {(3,33,36,52), (16,19,41,44)}. A prediction like {(16,19,41,52), (33,35,36)} would contain five out of seven correct assignments, while the true overall number of ligands is eight, giving P_E = 5/7 and R_E = 5/8. Note that the measure also accounts for ligands predicted as non-metal-binding, like 3 or 44, and non-ligands predicted as metal binding, like 35. The former negatively affect recall, the latter precision;

(iii) true-positive hit rate (H_T) and false-positive hit rate (H_F), where a hit is counted whenever the intersection between a predicted and a true site is non-empty: H_T is, therefore, the fraction of sites having at least one correctly identified ligand, and H_F is the fraction of predicted sites having no correctly identified residues.

The server was tested on three distinct data sets, which differ in the criteria used for redundancy elimination. (All the data sets are available online at the server website and in the Supplementary Data.) The first data set was obtained starting from the one in Ref. (11), where redundancy between sequences was removed using UniqueProt (21). The 199 metal-binding chains were collected from that data set, after removing sites containing residues different from CYS/HIS, or with a coordination number greater than four. Results in the first row of Table 1 are averages over 30 different random train/test splits, always in a ratio of 80/20. When starting from known bonding state, the predictor achieves P_E = R_E = 90 ± 3 on this data set. Finally, we measured accuracy in the metalloprotein prediction task (i.e. classifying the whole sequence as metalloprotein or not) on the whole data set in Ref. (11): MetalDetector v2.0 correctly classified 65% of the metalloproteins in this data set, and correctly classified as non-metalloproteins 96% of the 2362 chains having no metal-bonded CYS/HIS.
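The ligand-level and site-level measures described above can be sketched in a few lines. The code below is one possible reading of those definitions, not the authors' evaluation script: it searches all reorderings of ion identifiers for the matching with the most correct assignments, and counts a hit whenever a predicted and a true site share at least one residue. The function names are hypothetical, and both inputs are assumed to be non-empty lists of sites.

```python
from fractions import Fraction
from itertools import permutations


def ligand_precision_recall(true_sites, pred_sites):
    """Precision and recall of ligand-to-ion assignment.

    Ion identifiers are arbitrary, so every one-to-one matching between
    predicted and true sites is tried and the best one is kept.
    """
    true_sets = [set(site) for site in true_sites]
    pred_sets = [set(site) for site in pred_sites]
    # Pad with empty sites so both lists have the same length.
    size = max(len(true_sets), len(pred_sets))
    true_sets += [set() for _ in range(size - len(true_sets))]
    pred_sets += [set() for _ in range(size - len(pred_sets))]
    best = max(
        sum(len(p & t) for p, t in zip(pred_sets, ordering))
        for ordering in permutations(true_sets)
    )
    n_pred = sum(len(p) for p in pred_sets)
    n_true = sum(len(t) for t in true_sets)
    precision = Fraction(best, n_pred) if n_pred else Fraction(0)
    recall = Fraction(best, n_true) if n_true else Fraction(0)
    return precision, recall


def hit_rates(true_sites, pred_sites):
    """True-positive and false-positive hit rates at the site level."""
    true_sets = [set(site) for site in true_sites]
    pred_sets = [set(site) for site in pred_sites]
    hit_true = sum(any(t & p for p in pred_sets) for t in true_sets)
    missed_pred = sum(not any(p & t for t in true_sets) for p in pred_sets)
    return Fraction(hit_true, len(true_sets)), Fraction(missed_pred, len(pred_sets))


if __name__ == "__main__":
    true_sites = [(3, 33, 36, 52), (16, 19, 41, 44)]
    pred_sites = [(16, 19, 41, 52), (33, 35, 36)]
    print(ligand_precision_recall(true_sites, pred_sites))  # 5/7 and 5/8
    print(hit_rates(true_sites, pred_sites))
```

With at most four sites per chain, trying every reordering costs at most 24 matchings, so the brute-force search is adequate for this illustration.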

Table 1.

Evaluation of MetalDetector2

Data set	Size	P_B	R_B	P_E	R_E	H_T	H_F
UniqueProt	199	79 ± 4	88 ± 4	68 ± 4	74 ± 4	93 ± 4	10 ± 3
SCOP-folds	1824	62 ± 5	71 ± 10	61 ± 6	57 ± 7	70 ± 9	19 ± 4
SCOP-superfamilies	1466	60 ± 4	74 ± 10	56 ± 6	60 ± 10	74 ± 10	22 ± 5
PDB 2010	549	60	75	50	62	77	20

The second data set was built according to the Structural Classification of Proteins (SCOP) hierarchy (22): the goal here was to test the predictor on new (i.e. not seen during the training phase) SCOP folds/superfamilies. We started from the December 2009 release of PDB, extracting 17 783 protein chains with at least one CYS or HIS bonded to a metal ion, and we retained only those chains that were mapped in the SCOP 1.75 release (June 2009). After removing very few cases of chains bonded to more than five ions, we finally obtained a sequence-unique data set of 1824 protein chains by running CD-HIT v4.0 (23) with the sequence identity threshold set to 0.9 (default value). Using this second data set, we partitioned the chains into 10 different subsets, maintaining the same average percentage of ligands in each subset, and allowing no pair of chains in different subsets to belong to the same SCOP superfamily. In a second version of this data set, we considered SCOP folds instead of superfamilies, and we therefore had to discard multi-domain chains, as building the partition would otherwise have been infeasible: this version of the data set was therefore reduced to 1466 chains. We trained 10 different models, using 9 of the subsets as the training set and the remaining subset as the test set. Results are summarized in the second and third rows of Table 1. Performance measures are averaged over the 10 splits. The predictor available on the web server was trained on the whole SCOP-based data set. As a final test, we extracted 549 metal-bonded chains from PDB entries deposited in 2010 (after removing duplicates). Performance of the web server on this data set is reported in the fourth row of Table 1. Results in this setting are comparable to those obtained on the SCOP-based data sets. In the Supplementary Data, we show the breakdown of prediction performance according to the number of coordinating ligands per ion.
These results indicate that in the majority of cases MetalDetector2 is capable of identifying most of the binding site: in the PDB 2010 data set, for example, among the 268 sites having two coordinating residues, MetalDetector2 correctly identifies both residues in 41.6% of the cases and one of the two in 42.0% of the cases. In 65 and 62% of the cases, the server misses at most one ligand in the sites with three and four coordinating residues, respectively. Concerning precision, on average at least half of the returned candidates actually belong to the site.
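The group-disjoint cross-validation used for the SCOP-based data sets can be sketched as follows. This is a simplified illustration under stated assumptions: chains sharing a group label (for example a SCOP superfamily) are kept in the same subset and subset sizes are balanced greedily, whereas the additional balancing of ligand percentages described in the text is not attempted. The function name `partition_by_group` and the toy labels are hypothetical.

```python
from collections import defaultdict


def partition_by_group(chain_to_group, n_subsets=10):
    """Split chains into subsets so that no group spans two subsets.

    chain_to_group -- dict mapping a chain identifier to its group label
    n_subsets      -- number of cross-validation subsets

    Groups are placed largest first, each into the currently smallest subset.
    """
    groups = defaultdict(list)
    for chain, group in chain_to_group.items():
        groups[group].append(chain)
    subsets = [[] for _ in range(n_subsets)]
    for group in sorted(groups, key=lambda g: (-len(groups[g]), g)):
        smallest = min(range(n_subsets), key=lambda i: len(subsets[i]))
        subsets[smallest].extend(groups[group])
    return subsets


if __name__ == "__main__":
    toy = {"a": "sf1", "b": "sf1", "c": "sf2", "d": "sf3"}
    print(partition_by_group(toy, n_subsets=2))
```

Each subset would then serve once as the test set while the remaining subsets form the training set, so that test chains never share a group with training chains.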

CONCLUSION

This release of MetalDetector adds an important feature to metalloprotein prediction, namely the ability to identify the number of binding sites and the CYS and HIS ligands involved. Unlike existing servers that can perform this task, MetalDetector does not rely on 3D structure similarity and can predict binding sites of proteins in novel folds.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charge: LION lab, DISI, Unitn. Conflict of interest statement. None declared.

20 in total

1. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.

Authors: Weizhong Li; Adam Godzik
Journal: Bioinformatics Date: 2006-05-26 Impact factor: 6.937

2. Robust recognition of zinc binding sites in proteins.

Authors: Jessica C Ebert; Russ B Altman
Journal: Protein Sci Date: 2007-11-27 Impact factor: 6.725

3. Prediction of zinc-binding sites in proteins from sequence.

Authors: Nanjiang Shu; Tuping Zhou; Sven Hovmöller
Journal: Bioinformatics Date: 2008-02-01 Impact factor: 6.937

4. Characterization of metalloproteins by high-throughput X-ray absorption spectroscopy.

Authors: Wuxian Shi; Marco Punta; Jen Bohon; J Michael Sauder; Rhijuta D'Mello; Mike Sullivan; John Toomey; Don Abel; Marco Lippi; Andrea Passerini; Paolo Frasconi; Stephen K Burley; Burkhard Rost; Mark R Chance
Journal: Genome Res Date: 2011-04-11 Impact factor: 9.043

5. SCOP: a structural classification of proteins database for the investigation of sequences and structures.

Authors: A G Murzin; S E Brenner; T Hubbard; C Chothia
Journal: J Mol Biol Date: 1995-04-07 Impact factor: 5.469

6. Identifying cysteines and histidines in transition-metal-binding sites using support vector machines and neural networks.

Authors: Andrea Passerini; Marco Punta; Alessio Ceroni; Burkhard Rost; Paolo Frasconi
Journal: Proteins Date: 2006-11-01

Review 7. Zinc, antioxidant systems and metallothionein in metal mediated-apoptosis: biochemical and cytochemical aspects.

Authors: Alessia Formigari; Paola Irato; Alessandro Santon
Journal: Comp Biochem Physiol C Toxicol Pharmacol Date: 2007-08-01 Impact factor: 3.228

8. Prediction of transition metal-binding sites from apo protein structures.

Authors: Mariana Babor; Sergey Gerzon; Barak Raveh; Vladimir Sobolev; Marvin Edelman
Journal: Proteins Date: 2008-01-01

9. MetalDetector: a web server for predicting metal-binding sites and disulfide bridges in proteins from sequence.

Authors: Marco Lippi; Andrea Passerini; Marco Punta; Burkhard Rost; Paolo Frasconi
Journal: Bioinformatics Date: 2008-07-16 Impact factor: 6.937

10. DiANNA 1.1: an extension of the DiANNA web server for ternary cysteine classification.

Authors: F Ferrè; P Clote
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971
