Prediction of Metal Ion Binding Sites in Proteins from Amino Acid Sequences by Using Simplified Amino Acid Alphabets and Random Forest Model.

Suresh Kumar

Abstract

Metal ions bound by metal-binding proteins (metalloproteins) are important for protein stability and also serve as cofactors in functions such as controlling metabolism, regulating signal transport, and metal homeostasis. In structural genomics, prediction of metal-binding proteins helps in selecting a suitable growth medium for overexpression studies and in obtaining the functional protein. Computational prediction using machine learning has been widely used in bioinformatics, on the premise that all the necessary information is contained in the amino acid sequence. In this study, random forest prediction systems were deployed with simplified amino acid alphabets to predict the binding sites of individual major metal ions: copper, calcium, cobalt, iron, magnesium, manganese, nickel, and zinc.

Keywords:  amino acid sequence; binding sites; machine learning; proteins

Year:  2017        PMID: 29307143      PMCID: PMC5769865          DOI: 10.5808/GI.2017.15.4.162

Source DB:  PubMed          Journal:  Genomics Inform        ISSN: 1598-866X


Introduction

Amino acids are the building blocks of proteins. The primary structure of a protein is determined by the arrangement of the 20 naturally occurring amino acids. The function of a protein is determined by its amino acid sequence and also depends on interactions with cofactors, metal ions, and other proteins. The proteomes of all organisms rely on metal ions and metal-binding cofactors to carry out essential functions. It has been estimated that approximately 30% of all proteins contain at least one metal. By binding metal ions or metal-containing cofactors, proteins gain stability and play vital roles in biological processes [1]. Proteins bind major metal ions such as transition metals, alkali metals, and alkaline earth metals. The metal ions that most frequently bind to proteins are sodium, copper, iron, magnesium, manganese, potassium, and zinc. In vitro, unfolded polypeptides have been observed to interact with metal ions that direct the folding process [2]. Experimental identification of metal binding, for example by metal ion affinity column chromatography [3, 4], electrophoretic mobility shift assay [5, 6], absorbance spectroscopy [7], gel electrophoresis [8], nuclear magnetic resonance spectroscopy [9-11], or mass spectrometry [3, 12], requires tedious steps and specific instruments, which makes these methods expensive and potentially unsuitable for unknown targets. There is therefore a need for computational predictors of metal ion binding in proteins to reduce time and cost. For example, such predictions are useful in structural genomics for selecting a proper growth medium for overexpression studies and for easier interpretation of electron density maps. Fortunately, metal-binding ability is encoded in the amino acid sequence, and the primary sequence also guides protein structure formation.
Through genome projects, the genomic sequences of various organisms have been annotated, including the metalloproteins they contain [1]. Bioinformatics has been extensively used to predict metal-binding ability from amino acid sequences. Various computational methods have been applied, such as artificial neural networks [13], support vector machines [14], decision tree algorithms [15], graph theory [16], the FoldX force field [17], CHED [18, 19], and geometry algorithm methods [16]. These methods depend either on sequence information alone or on both sequence and structure information. However, most of the available prediction methods are either based on knowledge of the apoprotein structure or restricted to a few specific cases, such as metal binding by histidines and cysteines. Most of these methods have been made available to the research community as standalone software or web servers [15, 20]. Because sequencing instruments have become cheaper and more advanced, the number of known protein sequences has grown much faster than the amount of protein structure data. This is because experimental determination of the three-dimensional structure of a protein is difficult and expensive. Various theoretical and experimental studies have shown that a minimal set of amino acids is sufficient for protein folding [21]. A minimal set of representative residues with similar features can be obtained by clustering the 20 amino acids into groups. The result is called a reduced or simplified amino acid alphabet. Several simplified amino acid alphabets have been proposed and applied to pattern recognition in the prediction of protein structure [22], to the generation of consensus sequences from multiple alignments, and to protein folding prediction [23].
Various computational predictors have used simplified amino acid alphabets to predict solubility on overexpression, detect remote homology [19], identify the defensin peptide family [24], and study the effects of cofactors on conformation transition [25], DNA-binding proteins [26], heat shock protein families [27], inter-residue interactions [28], protein adaptation to mutation [29], and protein disorder [30]. In the present study, a random forest algorithm was deployed to predict metal ion binding proteins based on the simplified amino acid alphabets proposed by Murphy et al. [21].

Methods

Dataset construction

All the protein sequences were downloaded from the UniProt database [31], available at http://www.uniprot.org/. The downloaded sequences annotated as metal-containing were grouped into eight subsets. Each subset, containing one of the metal species calcium, cobalt, copper, iron, magnesium, manganese, nickel, or zinc, was considered to be metal-containing, while all other entries were considered to be metal-free. Redundancy among the amino acid sequences was removed by clustering with the cd-hit program [32] at a threshold of 50% sequence identity, analogous to the UniRef50 list [33] available in the UniProt database. This resulted in eight datasets containing 186 calcium-containing proteins, 69 cobalt-containing proteins, 215 copper-containing proteins, 315 iron-containing proteins, 961 magnesium-containing proteins, 386 manganese-containing proteins, 74 nickel-containing proteins, and 1,716 zinc-containing proteins. All proteins containing calcium, cobalt, copper, magnesium, manganese, nickel, or zinc were then subtracted from the UniRef50 list, resulting in a collection of non-metalloproteins. The workflow of dataset construction is shown in Fig. 1. An imbalanced dataset can be handled in two ways, as described by Cohen et al. [34]: first, by pre-processing the data to re-establish class balance (either by upsizing the minority class or by downsizing the majority class); second, by modifying the learning algorithm itself to cope with imbalanced data. In this study, we pre-processed the data to obtain a balanced set of metal-binding and non-metal-binding proteins. To this end, sequences from the non-metalloprotein dataset were randomly selected so that each metal ion had a balanced set of metal-binding and non-metal-binding proteins.
Fig. 1

Construction of dataset used for prediction.
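
The balancing step described above, downsizing the majority class by random selection, can be illustrated with a minimal sketch. The function name, the seed, and the sequence identifiers are hypothetical and are not taken from the study; only the set size of 69 cobalt-containing proteins comes from the text.

```python
import random

def balance_by_undersampling(metal_ids, non_metal_ids, seed=0):
    """Randomly pick as many non-metal sequences as there are metal sequences."""
    if len(non_metal_ids) < len(metal_ids):
        raise ValueError("not enough non-metalloproteins to balance the set")
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    negatives = rng.sample(list(non_metal_ids), len(metal_ids))
    return list(metal_ids), negatives

# Toy identifiers (hypothetical, not real UniProt accessions).
cobalt = [f"CO_{i}" for i in range(69)]
background = [f"NM_{i}" for i in range(5000)]

positives, negatives = balance_by_undersampling(cobalt, background, seed=42)
print(len(positives), len(negatives))
```

Sampling without replacement keeps every selected non-metalloprotein unique, so the balanced set contains no duplicated negatives.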

Feature extraction by simplified amino acid alphabets

To investigate the effect of a particular class of amino acids on metal ion binding, the 20 amino acids were grouped into classes based on common properties, and the composition of the reduced sets of amino acids was considered. Feature extraction was done using the simplified amino acid alphabet. It has been estimated that reduced alphabets containing 10–12 letters can be used to design foldable sequences for a large number of protein families. This estimate is based on the observation that little of the information needed to pick out structural homologs in a clustered protein sequence database is lost when the amino acid alphabet is suitably reduced from 20 to 10 letters. A simplified amino acid alphabet of 18 characters, based on three independent amino acid classifications, was used (Table 1).
Table 1

The 18 variables, obtained by merging three simplified alphabets of amino acid residues used to represent protein sequences

Variable  Residues
V1        CMQLEKRA
V2        P
V3        ND
V4        G
V5        HWFY
V6        S
V7        TIV
V8        CFILMVW
V9        AG
V10       PH
V11       EDRK
V12       NQSTY
V13       FWY
V14       CILMV
V15       H
V16       ST
V17       EDNQ
V18       KR
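
As an illustration of how a sequence could be turned into the 18 variables of Table 1, the following minimal sketch computes the percentage of residues that fall in each cluster. The function name and the toy sequence are hypothetical, and it is an assumption that each variable is the summed percentage of its member residues.

```python
# Clusters copied from Table 1.
CLUSTERS = {
    "V1": "CMQLEKRA", "V2": "P", "V3": "ND", "V4": "G", "V5": "HWFY",
    "V6": "S", "V7": "TIV", "V8": "CFILMVW", "V9": "AG", "V10": "PH",
    "V11": "EDRK", "V12": "NQSTY", "V13": "FWY", "V14": "CILMV",
    "V15": "H", "V16": "ST", "V17": "EDNQ", "V18": "KR",
}

def cluster_composition(sequence):
    """Return the percentage of residues falling in each of the 18 clusters."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    features = {}
    for name, residues in CLUSTERS.items():
        count = sum(1 for aa in seq if aa in residues)
        features[name] = 100.0 * count / len(seq)
    return features

features = cluster_composition("MHHCPGKR")  # toy sequence, not a real protein
print(features["V15"], features["V18"])
```

Because the three alphabets overlap (histidine, for example, belongs to V5, V10, and V15), the 18 percentages do not sum to 100.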

Conformational similarity

Conformational similarity indices, proposed by Chakrabarti and Pal [28], are computed for the different residues from the distributions of the main-chain and side-chain torsion angles, and their values have been used to cluster amino acids in proteins. In this method, the 20 amino acids are grouped by torsion-angle-based conformational similarity into seven clusters ([CMQLEKRA], [P], [ND], [G], [HWFY], [S], and [TIV]), which are used as variables.

BLOSUM 50 substitution matrix

The simplified alphabet based on the BLOSUM 50 substitution matrix follows Cannata et al. [35]. The matrix is deduced from amino acid pair frequencies in aligned blocks of a protein sequence database and is widely used for sequence alignment and comparison. On the basis of this matrix and the possibility of forming foldable structures, the amino acids are grouped into the clusters [P], [KR], [EDNQ], [ST], [AG], [H], [CILMV], and [YWF].

Hydrophobicity

The hydrophobicity scale of Rose et al. [36] is correlated with the average buried area of amino acids in globular proteins. The resulting scale therefore reflects surface accessibility rather than the helices of a protein. The simplified alphabet based on this scale consists of the following clusters: [CFILMVW], [AG], [PH], [EDRK], and [NQSTY].

Random forest predictions

Random forest is an ensemble learning method for classification [37] that uses an ensemble of tree-structured classifiers. It is a popular algorithm that has been used in designing computational predictors for various biological problems. To classify a new object, its input vector is passed to each decision tree in the forest. Each tree casts a vote for a class, and the class with the most votes is output as the predicted class. The algorithm was run using the Weka package [38, 39]. To ensure that parameter estimation and model generation were completely independent of the test data, a nested cross-validation procedure was performed. Nested cross-validation [40] means that there is an outer cross-validation loop for model assessment and an inner loop for model selection. In this study, the original samples were randomly divided into k = 10 parts in the outer loop. Each of these parts was chosen in turn for assessment, and the remaining nine parts were used for model selection in the inner loop, where a type of cross-validation using the so-called out-of-bag samples was performed.
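
Two of the ideas above, the outer 10-fold partition and the majority vote over trees, can be sketched in a few lines. This is an illustration only: the study used Weka, the function names here are hypothetical, and the inner out-of-bag model selection is not reproduced.

```python
import random

def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k outer folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def majority_vote(tree_predictions):
    """Return the class predicted by most trees (ties go to the first seen)."""
    counts = {}
    for label in tree_predictions:
        counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get)

# 138 samples would correspond to a balanced cobalt set (69 + 69).
splits = list(k_fold_splits(138, k=10, seed=1))
print(len(splits), majority_vote(["metal", "non-metal", "metal"]))
```

Each sample appears in exactly one test fold, so every prediction used for assessment comes from a model that never saw that sample during training.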

Measurement of classifier’s performance

When the predictor is used to distinguish proteins containing a certain type of metal ion from proteins that do not contain any type of metal, it is important that both sets contain the same number of proteins; otherwise, several figures of merit that are commonly used to monitor prediction reliability would be seriously biased. The reliability of the predictions was monitored with the following quantities. If a protein of type 1 must be distinguished from a protein of type 2, a prediction was considered a true positive if type 1 was correctly predicted, a true negative if type 2 was correctly predicted, a false negative if a type 1 protein was predicted to be a type 2 protein, and a false positive if a type 2 protein was predicted to be a type 1 protein. From these counts, the following figures of merit were computed [41]: the sensitivity, the specificity, the accuracy, and the Matthews correlation, as defined in Eq. (1).
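
The four figures of merit can be computed from the true-positive, true-negative, false-positive, and false-negative counts as in the following minimal sketch. It assumes the standard definitions of these quantities; the function name and the example counts are hypothetical.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and Matthews correlation."""
    sensitivity = tp / (tp + fn)   # fraction of type 1 proteins recovered
    specificity = tn / (tn + fp)   # fraction of type 2 proteins recovered
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }

# Hypothetical counts for a balanced set of 100 + 100 proteins.
m = classification_metrics(tp=80, tn=70, fp=30, fn=20)
print({name: round(value, 3) for name, value in m.items()})
```

On a balanced set the accuracy equals the mean of sensitivity and specificity, which is why equal class sizes keep these quantities comparable.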

Results and Discussion

Amino acid cluster variables were obtained by using a simplified amino acid alphabet based on three independent amino acid classifications. Conformational similarity contains seven clusters: [CMQLEKRA], [P], [ND], [G], [HWFY], [S], and [TIV]. The BLOSUM 50 substitution matrix contains eight clusters: [P], [KR], [EDNQ], [ST], [AG], [H], [CILMV], and [YWF]. The hydrophobicity scale contains five clusters: [CFILMVW], [AG], [PH], [EDRK], and [NQSTY]. Of these 20 amino acid clusters, [P] and [AG], which are each present in more than one simplified alphabet, were counted only once, which results in 18 variables (Table 1). The 18 variables are represented by their percentage of occurrence: the percentage of occurrence pc of amino acid aa in protein i was computed for each of the 20 types of amino acids in each protein as per Eq. (2). The protein sequences, represented by the percentage of occurrence of the 18 variables, were submitted to the random forest algorithm in the Weka suite. Using all 18 variables, the metalloproteins were identified with accuracies ranging from 69% for manganese and zinc to 91% for nickel (Table 2). Prediction performance was further studied with a feature selection method in which one variable is removed at a time while the highest values of the performance indices are maintained. Variables are removed until there is an unacceptable degradation in performance. Feature selection eliminates variables that are irrelevant or redundant and can thus improve the accuracy of the learning algorithm. To select an optimal subset of variables, we first analyzed how individual attributes from the initial set of 18 variables contributed to predictive accuracy. For feature selection, we employed the wrapper approach, which uses the learning algorithm itself to evaluate candidate feature subsets. The wrapper method trains the model on a subset of features; based on the resulting performance, features are added or removed to improve the accuracy of the learning algorithm. We used backward feature elimination, which searches the feature space by starting with the full set and deleting attributes one at a time [42, 43].
Table 2

Overall prediction performance of the classifier in predicting individual metal ion binding sites

Metal  Sensitivity  Specificity  Matthews correlation  Accuracy
Ca     0.769        0.739        0.507                 0.754
Co     0.884        0.823        0.708                 0.853
Cu     0.746        0.815        0.563                 0.781
Fe     0.772        0.740        0.512                 0.756
Mg     0.766        0.714        0.481                 0.740
Mn     0.729        0.647        0.378                 0.688
Ni     0.945        0.869        0.817                 0.907
Zn     0.740        0.640        0.382                 0.690
The wrapper approach followed in this study consisted of the following steps. The data were partitioned by 10-fold cross-validation (k = 10). On each cross-validation training set, the learning machine was trained using all 18 variables to produce a ranking of the variables by importance, and the cross-validation test set predictions were recorded. The least important variable was then removed, another learning machine was trained on the remaining variables, and the cross-validation test set predictions were recorded again. This step was repeated, removing one variable at a time, until only a small number remained. The predictions from all 10 cross-validation test sets were aggregated, and the aggregate accuracy was computed at each step down in the number of variables. In this way, feature selection was carried out by the wrapper approach with the random forest algorithm. Based on aggregate accuracy, the important variable for copper ion prediction is PH, and the least preferred variables are AG and CMQLEKRA (Table 3): removing PH decreases the accuracy of the classifier, whereas removing AG and CMQLEKRA improves it. For calcium ion prediction, the least important variables are P and EDNQ; removing them improves the performance of the classifier (Table 4). Similarly, for cobalt ion prediction, CILMV is the least preferred variable, as removing it improves the performance of the classifier (Table 5). For iron ion prediction, removing variable CFILMVW improves the performance of the classifier (Table 6). For magnesium ion prediction, ST and ND are the least preferred variables (Table 7). For manganese ion prediction, removing variable FWY improves the accuracy of the classifier (Table 8). For nickel ion prediction, EDRK is the least preferred variable (Table 9). For zinc ion prediction, the least preferred variable is HWFY (Table 10).
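
The backward elimination loop can be sketched generically as follows. This is a simplified illustration and not the procedure used in the study: instead of ranking variables by random forest importance, it tries each single removal and keeps the one with the best score. The scoring function is a hypothetical stand-in for the aggregate cross-validated accuracy, and all names are assumptions.

```python
def backward_elimination(variables, score_fn, min_features=1):
    """Greedy backward elimination.

    score_fn(subset) is assumed to return a cross-validated accuracy.
    Returns a list of (removed_variable, remaining_subset, score) steps.
    """
    current = list(variables)
    history = [(None, tuple(current), score_fn(current))]
    while len(current) > min_features:
        candidates = []
        for var in current:
            subset = [v for v in current if v != var]
            candidates.append((score_fn(subset), var, subset))
        # Keep the removal that leaves the best-scoring subset.
        best_score, removed, best_subset = max(candidates, key=lambda c: c[0])
        current = best_subset
        history.append((removed, tuple(current), best_score))
    return history

# Toy scoring function (hypothetical): "PH" helps, "AG" hurts, "TIV" is neutral.
def toy_score(subset):
    score = 0.70
    if "PH" in subset:
        score += 0.08
    if "AG" in subset:
        score -= 0.02
    return score

history = backward_elimination(["PH", "AG", "TIV"], toy_score)
for removed, remaining, score in history:
    print(removed, remaining, round(score, 2))
```

Recording the score at every step makes it possible to pick the subset with the highest accuracy afterwards, which mirrors how the tables report performance after each removal.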
Table 3

Feature selection of variables in improving the performance of copper ion prediction against proteins that lack metal ions

Variable removed  Average sensitivity  Average specificity  Average accuracy  Average Matthews correlation
None              0.746                0.815                0.781             0.563
AG                0.762                0.809                0.786             0.571
CMQLEKRA          0.794                0.804                0.799             0.599
NQSTY             0.779                0.814                0.796             0.593
EDNQ              0.796                0.797                0.796             0.592
CFILMVW           0.785                0.803                0.794             0.588
TIV               0.785                0.798                0.792             0.583
PH                0.774                0.801                0.788             0.576
Table 4

Feature selection of variables in improving the performance of calcium ion prediction against proteins that lack metal ions

Variable removed  Average sensitivity  Average specificity  Average accuracy  Average Matthews correlation
None              0.769                0.738                0.754             0.507
P                 0.783                0.758                0.770             0.541
EDNQ              0.788                0.751                0.770             0.541
EDRK              0.796                0.758                0.777             0.554
PH                0.785                0.756                0.770             0.541
CILMV             0.801                0.754                0.777             0.556
AG                0.790                0.749                0.770             0.539
CFILMVW           0.789                0.765                0.777             0.554
NQSTY             0.785                0.767                0.776             0.552
CMQLEKRA          0.780                0.765                0.772             0.545
Table 5

Feature selection of variables in improving the performance of cobalt ion prediction against proteins that lack metal ions

Variable removed  Average sensitivity  Average specificity  Average accuracy  Average Matthews correlation
None              0.884                0.823                0.853             0.708
CILMV             0.903                0.842                0.872             0.747
CFILMVW           0.899                0.837                0.868             0.737
ND                0.894                0.828                0.861             0.724
EDNQ              0.884                0.833                0.858             0.717
PH                0.894                0.847                0.870             0.741
ST                0.903                0.837                0.870             0.742
NQSTY             0.860                0.833                0.846             0.693
Table 6

Feature selection of variables in improving the performance of iron ion prediction against proteins that lack metal ions

Variable removed  Average sensitivity  Average specificity  Average accuracy  Average Matthews correlation
None              0.772                0.740                0.756             0.512
NQSTY             0.778                0.731                0.754             0.509
S                 0.786                0.727                0.757             0.514
PH                0.786                0.724                0.755             0.511
CMQLEKRA          0.785                0.720                0.753             0.507
CFILMVW           0.787                0.734                0.761             0.523
AG                0.790                0.720                0.755             0.511
TIV               0.780                0.725                0.753             0.507
HWFY              0.790                0.735                0.762             0.525
Table 7

Feature selection of variables in improving the performance of magnesium ion prediction against proteins that lack metal ions

Variable removed  Average sensitivity  Average specificity  Average accuracy  Average Matthews correlation
None              0.766                0.714                0.740             0.481
ST                0.779                0.714                0.746             0.494
ND                0.774                0.720                0.747             0.494
NQSTY             0.767                0.717                0.742             0.485
S                 0.772                0.711                0.742             0.484
HWFY              0.770                0.716                0.743             0.487
PH                0.777                0.709                0.743             0.487
CMQLEKRA          0.775                0.708                0.741             0.484
Table 8

Feature selection of variables in improving the performance of manganese ion prediction against proteins that lack metal ions

Variable removed  Average sensitivity  Average specificity  Average accuracy  Average Matthews correlation
None              0.729                0.647                0.688             0.378
FWY               0.731                0.717                0.734             0.474
EDNQ              0.741                0.656                0.698             0.398
CMQLEKRA          0.750                0.647                0.698             0.399
AG                0.750                0.643                0.697             0.396
S                 0.739                0.660                0.700             0.400
Table 9

Feature selection of variables in improving the performance of nickel ion prediction against proteins that lack metal ions

Variable removed  Average sensitivity  Average specificity  Average accuracy  Average Matthews correlation
None              0.945                0.869                0.907             0.817
EDRK              0.950                0.887                0.918             0.838
G                 0.931                0.892                0.917             0.824
NQSTY             0.923                0.887                0.905             0.810
ST                0.941                0.878                0.909             0.821
EDNQ              0.936                0.865                0.900             0.803
FWY               0.918                0.860                0.889             0.780
HWFY              0.931                0.865                0.898             0.800
TIV               0.927                0.869                0.898             0.797
Table 10

Feature selection of variables in improving the performance of zinc metal ion prediction against proteins that lack metal ions

Variable removed  Average sensitivity  Average specificity  Average accuracy  Average Matthews correlation
None              0.740                0.640                0.690             0.382
HWFY              0.751                0.638                0.695             0.391
CMQLEKRA          0.750                0.636                0.692             0.386
AG                0.747                0.638                0.693             0.388
ST                0.743                0.644                0.693             0.389
EDNQ              0.743                0.636                0.689             0.381
For example, cobalt-binding proteins can be discriminated from non-metal-binding proteins using all 18 variables with an accuracy of 85% (Fig. 2). On removing variable V14 (CILMV) from the set, the accuracy of the predictor improves from 85% to 87%. After removal of variables V8 (CFILMVW), V3 (ND), V17 (EDNQ), V10 (PH), and V16 (ST), the accuracy values range from 86% to 87%. Removing variable V12 (NQSTY) then causes a marked drop in the accuracy of the classifier, to 84%. No further reduction of the set was possible, as the performance of the random forest classifier dropped if any further attributes were eliminated. The accuracy of prediction of metal-binding proteins can thus be improved (e.g., calcium from 74% to 77%, cobalt from 83% to 85%, and nickel from 69% to 77%) by eliminating certain noisy features, up to a certain limit beyond which no further improvement is possible. With this backward strategy of feature selection, the prediction performance can be slightly improved. Some variables are rejected for more than one metal: V14 (CILMV) for calcium and cobalt, and V8 (CFILMVW) for copper and iron.
Fig. 2

The performance graph of the Random forest classifier using feature selection (10-fold cross validation for cobalt ion prediction).

In this work, a new random forest-based approach was developed that combines hybrid features from simplified amino acid alphabets to predict metal ion binding sites for iron, copper, manganese, magnesium, nickel, calcium, cobalt, and zinc from amino acid sequence data. The results indicate that the random forest model predicts metal ion binding with accuracies between 69% and 91%, depending on the metal. Such metal-binding prediction methods help to avoid the selection of ‘impossible’ targets in structural biology and proteomics.
  43 in total

1.  Reduced amino acid alphabet is sufficient to accurately recognize intrinsically disordered protein.

Authors:  Edward A Weathers; Michael E Paulaitis; Thomas B Woolf; Jan H Hoh
Journal:  FEBS Lett       Date:  2004-10-22       Impact factor: 4.124

2.  Predicting calcium-binding sites in proteins - a graph theory and geometry approach.

Authors:  Hai Deng; Guantao Chen; Wei Yang; Jenny J Yang
Journal:  Proteins       Date:  2006-07-01

3.  Absorbance spectroscopy-based examination of effects of coagulation on the reactivity of fractions of natural organic matter with varying apparent molecular weights.

Authors:  Gregory Korshin; Christopher W K Chow; Rolando Fabris; Mary Drikas
Journal:  Water Res       Date:  2009-01-03       Impact factor: 11.236

4.  Classifier performance prediction for computer-aided diagnosis using a limited dataset.

Authors:  Berkman Sahiner; Heang-Ping Chan; Lubomir Hadjiiski
Journal:  Med Phys       Date:  2008-04       Impact factor: 4.071

5.  Electrophoretic Mobility Shift Assay (EMSA).

Authors:  M F Smith; S Delbary-Gossart
Journal:  Methods Mol Med       Date:  2001

6.  Effects of Cofactors on Conformation Transition of Random Peptides Consisting of a Reduced Amino Acid Alphabet.

Authors:  Ming-Feng Lu; Ying Xie; Yue-Jie Zhang; Xue-Yan Xing
Journal:  Protein Pept Lett       Date:  2015       Impact factor: 1.890

7.  Prediction of water and metal binding sites and their affinities by using the Fold-X force field.

Authors:  Joost W H Schymkowitz; Frederic Rousseau; Ivo C Martins; Jesper Ferkinghoff-Borg; Francois Stricher; Luis Serrano
Journal:  Proc Natl Acad Sci U S A       Date:  2005-07-08       Impact factor: 11.205

8.  Hydrophobicity of amino acid residues in globular proteins.

Authors:  G D Rose; A R Geselowitz; G J Lesser; R H Lee; M H Zehfus
Journal:  Science       Date:  1985-08-30       Impact factor: 47.728

9.  Bias in error estimation when using cross-validation for model selection.

Authors:  Sudhir Varma; Richard Simon
Journal:  BMC Bioinformatics       Date:  2006-02-23       Impact factor: 3.169

10.  iDPF-PseRAAAC: A Web-Server for Identifying the Defensin Peptide Family and Subfamily Using Pseudo Reduced Amino Acid Alphabet Composition.

Authors:  Yongchun Zuo; Yang Lv; Zhuying Wei; Lei Yang; Guangpeng Li; Guoliang Fan
Journal:  PLoS One       Date:  2015-12-29       Impact factor: 3.240
