
HBPred: a tool to identify growth hormone-binding proteins.

Hua Tang¹, Ya-Wei Zhao², Ping Zou¹, Chun-Mei Zhang¹, Rong Chen¹, Po Huang¹, Hao Lin².

Abstract

Hormone-binding proteins (HBPs) are soluble carrier proteins that interact selectively and non-covalently with hormones. HBPs play an important role in growth, but their function is still unclear. Correct recognition of HBPs is the first step toward studying their function and understanding their biological processes. However, it is difficult to recognize HBPs among the growing number of known proteins through traditional biochemical experiments because of the high experimental cost and long experimental period. To overcome these disadvantages, we designed a computational method for identifying HBPs accurately. First, we collected HBP data from UniProt to establish a high-quality benchmark dataset. Based on the dataset, the dipeptide composition was extracted from HBP residue sequences. To find the optimal features that provide key clues for HBP identification, analysis of variance (ANOVA) was performed for feature ranking. The optimal features were selected through the incremental feature selection strategy. Subsequently, the features were input into a support vector machine (SVM) for prediction model construction. Jackknife cross-validation results showed that 88.6% of HBPs and 81.3% of non-HBPs were correctly recognized, suggesting that the proposed model is powerful. This study provides a new strategy to identify HBPs. Moreover, based on the proposed model, we established a webserver called HBPred, which can be freely accessed at http://lin-group.cn/server/HBPred.


Keywords: Benchmark dataset; Dipeptide composition; Feature selection; Hormone-binding protein; Webserver

Year: 2018 PMID: 29989085 PMCID: PMC6036759 DOI: 10.7150/ijbs.24174

Source DB: PubMed Journal: Int J Biol Sci ISSN: 1449-2288 Impact factor: 6.580

Introduction

Hormone-binding proteins (HBPs) are proteins that selectively and non-covalently bind to hormones (as shown in Figure 1) and carry them to target tissues to produce a desired effect 1. HBPs were first recognized in the plasma of pregnant mice, rabbits and humans a decade ago. They are associated with the regulation of the hormone supply in the circulatory system and affect the metabolism or behavior of other cells that possess functional receptors for the hormone. The sex HBPs, produced mainly in the liver, bind to sex steroid hormones and thereby regulate their bioavailability 2. The abnormal expression of HBPs always causes various diseases 3. Thus, it is important to clarify the function of HBPs and their regulation mechanisms.

Figure 1

Schematic diagram of human growth hormone (red) binding to two HBPs (yellow) 4

The first step in studying the function of HBPs is to identify them accurately. However, with more and more proteins generated in the postgenomic age, it is difficult to determine HBPs with biochemical experiments because of expensive experimental materials and long experimental periods. Computational methods are a good choice for identifying HBPs quickly and accurately. Several machine learning methods, such as support vector machine (SVM), Mahalanobis discriminant (MD), increment of diversity (ID), neural network (NN) and random forest (RF), have been widely used in immunoglobulin prediction 5, apolipoprotein prediction 6, cell-penetrating peptide prediction 7, protein subcellular localization 8-14, conotoxin classification 15-17, ion channel prediction 18, 19, protein structure prediction 20-25, promoter prediction 26, 27, prediction of the origin of replication 28, 29 and the prediction of protein, DNA and RNA modification sites 30-33. These methods provide great convenience to scholars. However, to the best of our knowledge, there is no computational method for HBP identification. This study aims to develop a new predictor for identifying HBPs. Following previous comprehensive methods 34, five steps were conducted in this work to establish a statistical predictor for HBP identification. Firstly, functional HBPs were selected to construct a valid benchmark dataset to train and test the proposed method. Secondly, the dipeptide composition, which reflects the correlation between adjacent residues, was extracted to formulate the protein samples. Thirdly, an analysis of variance (ANOVA)-based technique was used to rank these features. Fourthly, the support vector machine, a widely used engine in bioinformatics, was selected to perform the prediction. Fifthly, jackknife cross-validation was used to objectively evaluate the anticipated accuracy of the predictor. In addition, based on the proposed model, we established a user-friendly web-server called HBPred for the identification of HBPs. These steps are introduced below.

Materials and Methods

Benchmark Dataset

In a statistical predictor, enough related functional data should be collected to obtain prior knowledge. Thus, it is important to construct an objective benchmark dataset to guarantee the robustness of the model. However, to our knowledge, no database for HBPs has been published. Thus, we searched and collected HBPs from the Universal Protein Resource (UniProt) 35, which provides a stable, comprehensive, and freely accessible central resource of protein sequences and functional annotations. Firstly, we selected the hormone-binding keyword in the molecular function item of Gene Ontology (GO) to generate the original HBP dataset, which yielded a total of 2460 HBPs. Subsequently, to improve the reliability of the dataset, the 2104 HBPs that were not manually annotated or reviewed were excluded. Finally, to avoid redundancy, which affects the accuracy estimation of the prediction model, we used CD-HIT 36, which has been widely used to cluster and compare protein or nucleotide sequences, to remove highly similar HBP sequences by setting the cutoff threshold to 0.6. In fact, a more objective dataset could be produced with a cutoff threshold of 0.25. However, we did not use such a stringent criterion in this study because the currently available data did not allow it: the number of remaining proteins would be too small to have statistical significance. As a result, a total of 123 HBPs were obtained and regarded as positive data. As a control, non-HBPs were obtained by using a similar selection strategy. To keep a balance between positive and negative data and to provide an objective evaluation of the model, 123 non-HBPs were randomly selected from UniProt as negative data. The identity between any two non-HBP sequences was also less than 60%. The benchmark dataset can be formulated as

$$S = S^{+} \cup S^{-} \quad (1)$$

where the subset $S^{+}$ contains 123 HBPs; $S^{-}$ contains 123 non-HBPs; and the symbol $\cup$ represents the union in set theory. All the data can be obtained from our website http://lin-group.cn/server/HBPred/download.html.

Sample descriptions

For an HBP P with L residues, how do we translate it into a mathematical expression for statistical prediction? This is the second important step in developing a predictor for identifying HBPs. Based on the widely accepted viewpoint that the protein sequence contains the key information that determines the protein's structure and function, we extracted the features from the primary sequences of HBPs and non-HBPs. The most straightforward way to formulate an HBP P with L residues is to use its residue sequence:

$$P = R_1 R_2 R_3 \cdots R_L \quad (2)$$

where $R_1$ represents the 1st residue of the HBP, $R_2$ the 2nd residue, and so forth. A straightforward way to perform statistical prediction on such sequences is to use search tools based on sequence similarity, such as FASTA and BLAST. However, when the training dataset contains no sequence similar to a query HBP, the similarity-based method fails. Machine learning methods can overcome this disadvantage, but they require protein samples to be translated into vectors of the same dimension. Generally, a simple vector used to represent a protein sample is its amino acid composition (AAC), or residue composition:

$$P = [f_1, f_2, \ldots, f_{20}]^{T} \quad (3)$$

where T is the transpose operator and $f_i$ is the normalized occurrence frequency of the i-th type of native residue in the protein chain, calculated as

$$f_i = \frac{n_i}{L} \quad (4)$$

where $n_i$ is the occurrence number of the i-th residue in the protein P. The AAC feature has been widely used in protein bioinformatics 12, 37-39. However, the AAC feature does not contain sequence order information, so the prediction quality is always far from satisfactory. To include the correlation information between two residues, we consider the dipeptide composition, which describes the correlation between two adjacent amino acid residues. Thus, an HBP P can be expressed as a 400-dimensional vector (20×20=400):

$$P = [f_1, f_2, \ldots, f_{400}]^{T} \quad (5)$$

where $f_i$ is the i-th component and T is the transpose operator. Each component is given by

$$f_1 = f(AA) = \frac{n_{AA}}{L-1},\quad f_2 = f(AC) = \frac{n_{AC}}{L-1},\quad \ldots,\quad f_{400} = f(YY) = \frac{n_{YY}}{L-1} \quad (6)$$

where A, C, …, W, and Y are the single-letter codes of the 20 native amino acids; $n_{AA}$ is the occurrence number of the dipeptide AA in the protein sequence (Eq. (2)); $n_{AC}$ that of the dipeptide AC, and so forth; and $L-1$ is the number of adjacent residue pairs in a sequence of L residues.
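The dipeptide encoding described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the names `AMINO_ACIDS`, `DIPEPTIDES` and `dipeptide_composition` are hypothetical, and the handling of non-standard symbols (skipping any pair that contains one and normalizing by the number of counted pairs, which equals L-1 for a fully standard sequence) is an assumption.

```python
from itertools import product

# 20 native residues ordered by one-letter code, so the dipeptides run AA, AC, ..., YY.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400-dimensional dipeptide frequency vector of one protein."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # assumption: pairs with non-standard symbols are skipped
            counts[pair] += 1
            total += 1
    if total == 0:  # sequence too short or without any standard pair
        return [0.0] * len(DIPEPTIDES)
    # For a fully standard sequence, total equals L - 1.
    return [counts[d] / total for d in DIPEPTIDES]


# Toy sequence with four adjacent pairs: AC, CA, AC, CA.
vector = dipeptide_composition("ACACA")
```

Every protein, whatever its length, is mapped to a vector of the same dimension, which is what the classifier requires.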

Feature ranking technique

From Eqs. (5-6), a total of 400 dipeptide frequencies were calculated. Previous studies 40-46 have shown that some features are noise or redundant information. In fact, in statistical learning it is widely accepted that, for high-dimensional features, many features make no contribution, or even a negative contribution, to the classification. Thus, it is necessary to rank the features and evaluate the contribution of every feature to the classification. According to statistical theory, ANOVA can be used to investigate the statistical significance of the ratio of the between-group variance to the within-group variance 47. This ratio, called the F-score, is used to describe the contribution of each feature:

$$F(k) = \frac{\left(\bar{f}_k^{+} - \bar{f}_k\right)^2 + \left(\bar{f}_k^{-} - \bar{f}_k\right)^2}{\dfrac{1}{n^{+}-1}\sum_{i=1}^{n^{+}}\left(f_{k,i}^{+} - \bar{f}_k^{+}\right)^2 + \dfrac{1}{n^{-}-1}\sum_{i=1}^{n^{-}}\left(f_{k,i}^{-} - \bar{f}_k^{-}\right)^2} \quad (7)$$

where $\bar{f}_k$, $\bar{f}_k^{+}$ and $\bar{f}_k^{-}$ are the means of the dipeptide k frequencies in all samples, HBP samples and non-HBP samples, respectively; $f_{k,i}^{+}$ and $f_{k,i}^{-}$ are the frequencies of dipeptide k in the i-th HBP and non-HBP sample; and $n^{+}$ and $n^{-}$ are the numbers of HBP and non-HBP samples. Thus, the numerator and denominator in Eq. (7) denote the variances between groups and within groups, respectively. The larger F(k) is, the better the prediction capability of feature k. Thus, the 400 dipeptides can be ranked according to their F-scores.
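As a rough illustration of the ranking step, the sketch below computes an F-score per feature and sorts the features by it. It assumes the common two-class form of the F-score (squared deviations of the two class means from the overall mean, divided by the sum of the two within-class sample variances); whether this matches the authors' exact implementation is an assumption, and the function names and toy values are hypothetical.

```python
def f_score(pos_values, neg_values):
    """Between-group over within-group variance for one feature (two classes)."""
    n_pos, n_neg = len(pos_values), len(neg_values)
    mean_pos = sum(pos_values) / n_pos
    mean_neg = sum(neg_values) / n_neg
    mean_all = (sum(pos_values) + sum(neg_values)) / (n_pos + n_neg)
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    within = (sum((v - mean_pos) ** 2 for v in pos_values) / (n_pos - 1)
              + sum((v - mean_neg) ** 2 for v in neg_values) / (n_neg - 1))
    if within == 0:  # degenerate case: no spread inside either class
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(pos_samples, neg_samples):
    """Return feature indices sorted by decreasing F-score, plus the scores."""
    n_features = len(pos_samples[0])
    scores = [
        f_score([row[k] for row in pos_samples], [row[k] for row in neg_samples])
        for k in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda k: scores[k], reverse=True)
    return order, scores


# Toy data: feature 0 separates the classes, feature 1 does not.
pos = [[0.9, 0.5], [0.8, 0.4], [1.0, 0.6]]
neg = [[0.1, 0.5], [0.2, 0.6], [0.0, 0.4]]
order, scores = rank_features(pos, neg)
```

In the toy data the informative feature is ranked first and the uninformative one receives a score of zero, because its class means coincide with the overall mean.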

Support vector machine (SVM)

In the construction of a predictor of HBPs, the third important step is to discriminate HBPs from non-HBPs with a powerful predictive algorithm. The SVM, which is powerful and popular in bioinformatics 48-56, was used in this study. The method was developed by Vapnik and his colleagues on the basis of statistical learning theory 57. By projecting samples with low-dimensional features into a high-dimensional Hilbert space, it searches for and constructs a separating hyperplane that classifies positive and negative samples with the maximal margin in that space, using the decision function

$$f(\mathbf{x}) = \mathrm{sgn}\left(\sum_{i=1}^{n} y_i \alpha_i K(\mathbf{x}, \mathbf{x}_i) + b\right)$$

where $\mathbf{x}_i$ is the i-th training vector; $y_i$ represents the type of the i-th training vector; $\alpha_i$ and $b$ are the coefficients and bias obtained in training; and $K(\mathbf{x}, \mathbf{x}_i)$ is called a kernel function, which defines an inner product in a high-dimensional feature space. The radial basis function (RBF) kernel, defined as

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2\right)$$

was used in this work because it is more suitable for nonlinear classification than other kernel functions. The free software package LibSVM, which can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm 58, was used to implement the SVM. Grid search was performed with grid.py, a tool distributed with LIBSVM, to optimize the regularization parameter $C$ and the kernel parameter $\gamma$. The search spaces for $C$ and $\gamma$ were scanned with a separate step gap for each parameter.
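To make the decision function concrete, the following sketch evaluates an RBF-kernel SVM decision for a query vector in plain Python. It does not train a model and does not use LibSVM: the support vectors, coefficients, bias and gamma below are hypothetical toy values chosen only for illustration, and the exponent range passed to `power_of_two_grid` is likewise an assumption, not the search space used in the study.

```python
import math


def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def svm_decision(x, support_vectors, labels, alphas, bias, gamma):
    """Sign of the kernel expansion: +1 for the positive class, -1 otherwise."""
    score = sum(
        alpha * y * rbf_kernel(x, sv, gamma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return 1 if score + bias >= 0 else -1


def power_of_two_grid(lo_exp, hi_exp, step):
    """Candidate values 2^lo_exp, 2^(lo_exp+step), ..., up to 2^hi_exp."""
    return [2.0 ** e for e in range(lo_exp, hi_exp + 1, step)]


# Hypothetical "trained" model with one support vector per class.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
labels = [-1, 1]
alphas = [1.0, 1.0]
bias = 0.0
gamma = 0.5

near_positive = svm_decision([0.9, 0.9], support_vectors, labels, alphas, bias, gamma)
near_negative = svm_decision([0.1, 0.1], support_vectors, labels, alphas, bias, gamma)
c_candidates = power_of_two_grid(-2, 2, 2)  # hypothetical exponents
```

A grid search simply repeats training and cross-validation for every pair of candidate values of C and gamma and keeps the pair with the best accuracy.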

Performance Evaluation

A suitable statistical test is extremely important in the performance evaluation of the proposed model. In this study, the jackknife cross-validation test was used to evaluate the proposed model because it is more suitable for small sample sizes and always yields a unique result for a given benchmark dataset 59-62. The following three indexes, called Sensitivity (Sn), Specificity (Sp) and Overall Accuracy (OA), were used:

$$Sn = \frac{TP}{TP + FN},\qquad Sp = \frac{TN}{TN + FP},\qquad OA = \frac{TP + TN}{TP + FN + TN + FP}$$

where $TP$ and $TN$ are the number of correctly identified HBPs (also called true positives) and the number of correctly identified non-HBPs (also called true negatives), respectively; $FN$ is the number of HBPs predicted as non-HBPs (false negatives); and $FP$ is the number of non-HBPs predicted as HBPs (false positives).
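The three indexes follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name `evaluate` and the counts in the example are hypothetical toy values, not results from this study.

```python
def evaluate(tp, fn, tn, fp):
    """Return (Sn, Sp, OA) from confusion-matrix counts."""
    sn = tp / (tp + fn)                   # fraction of HBPs recognized
    sp = tn / (tn + fp)                   # fraction of non-HBPs recognized
    oa = (tp + tn) / (tp + fn + tn + fp)  # fraction of all samples recognized
    return sn, sp, oa


# Toy counts for a balanced dataset of 100 positives and 100 negatives.
sn, sp, oa = evaluate(tp=90, fn=10, tn=80, fp=20)
```

On a balanced dataset such as the benchmark used here, OA is the plain average of Sn and Sp.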

Results

Prediction Performance

We first investigated the performance of the 400 dipeptide compositions in discriminating between HBPs and non-HBPs through the jackknife cross-validation test. The overall accuracy reached its maximum (75.6%) at C = 2 with the correspondingly optimized $\gamma$. Generally, high-dimensional features contain more information about HBPs. However, these features also contain noise or redundant information, which results in poor predictive capability in the cross-validation test 11. We reasoned that the HBP prediction accuracy could be further improved by excluding noise. Therefore, we used the ANOVA-based feature selection technique to find the feature subset that produces the maximum accuracy for distinguishing HBPs from non-HBPs. The F-scores of the 400 dipeptides were calculated according to Eq. (7). Then, we ranked the 400 dipeptides in decreasing order of their F-scores:

$$D = [d_1, d_2, d_3, \ldots, d_{400}]^{T}$$

where $d_1$ is the dipeptide with the maximum F-score; $d_2$ is the dipeptide with the second largest F-score; $d_3$ is the dipeptide with the third largest F-score, and so forth; T is the transpose operator. Subsequently, we used the incremental feature selection (IFS) strategy 5, 18, 19 to find the optimal features for HBP prediction, based on the following steps. Firstly, we obtained 400 feature subsets. The first feature subset contains only the first dipeptide in the ranked set D, so an arbitrary sample can be formulated as $P = [d_1]^{T}$. The second feature subset contains the first and second dipeptides in the ranked set, so an arbitrary sample can be formulated as $P = [d_1, d_2]^{T}$, and so on. The 400th feature subset contains all 400 dipeptides, whose accuracy was reported above. Secondly, all 400 feature subsets were input into the SVM for classification. The jackknife cross-validation test was used to evaluate all 400 models, giving a total of 400 OAs. The maximum OA can be easily observed by plotting the IFS curve.

When the top 73 dipeptides were used as inputs, the maximum OA of 84.9% was obtained. We also noticed that the 86th feature subset produced the same OA of 84.9% in the jackknife cross-validation test (blue dot on the IFS curve). Here, we used the 73rd feature subset to construct the final prediction model because it contains fewer features than the 86th feature subset. These 73 dipeptides have the highest F-scores, meaning that they have a high confidence level and can give more reliable information for classification. In addition, we investigated the Sn and Sp, which were 88.6% and 81.3%, respectively. The parameters $C$ and $\gamma$ were 8 and 0.03125, respectively. Because dipeptides with high F-scores generally give more reliable information for classification, we also extracted the top 20 dipeptides with the maximum F-scores to investigate their performance on HBP prediction. The OA reached 80.1% in the jackknife cross-validation test (green dot on the IFS curve). However, this number of features is too small to provide enough information, which explains the poorer performance of the 20 best dipeptides compared with the 73 best dipeptides.
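The IFS loop with jackknife (leave-one-out) evaluation can be sketched as follows. To keep the example self-contained, a simple nearest-centroid classifier is swapped in for the SVM, so this is an illustration of the selection procedure only and not the model used in the study; all function names and the toy data are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, query):
    """Stand-in classifier (NOT an SVM): pick the class with the closest mean vector."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return min(sorted(centroids), key=lambda lab: sq_dist(query, centroids[lab]))


def jackknife_accuracy(samples, labels):
    """Leave each sample out once, train on the rest, and test on the held-out sample."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def incremental_feature_selection(samples, labels, ranked_features):
    """Evaluate the top-1, top-2, ... feature subsets and keep the best one."""
    best_size, best_acc, curve = 0, -1.0, []
    for size in range(1, len(ranked_features) + 1):
        cols = ranked_features[:size]
        subset = [[row[c] for c in cols] for row in samples]
        acc = jackknife_accuracy(subset, labels)
        curve.append(acc)
        if acc > best_acc:  # strict ">" keeps the smaller subset when accuracies tie
            best_size, best_acc = size, acc
    return best_size, best_acc, curve


# Toy data: feature 0 is informative, feature 1 is noise.
samples = [[0.9, 0.1], [0.8, 0.9], [1.0, 0.5], [0.1, 0.2], [0.2, 0.8], [0.0, 0.5]]
labels = [1, 1, 1, -1, -1, -1]
best_size, best_acc, curve = incremental_feature_selection(samples, labels, [0, 1])
```

The strict comparison mirrors the choice made in the text: when two subsets reach the same accuracy, the one with fewer features is preferred.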

Feature analysis

To provide a visible and direct analysis of the contributions of different dipeptides to the prediction model, we drew a heat map (Figure 3) representing a matrix in which the elements represent the features and are encoded with different colors according to their normalized F-scores $F'(k)$, defined as 6, 47

$$F'(k) = \mathrm{sgn}\left(\bar{f}_k^{+} - \bar{f}_k^{-}\right) \times \frac{F(k) - F_{\min}}{F_{\max} - F_{\min}}$$

where $F_{\min}$ and $F_{\max}$ are the minimum and maximum F-scores of the 400 dipeptides; $\bar{f}_k^{+}$ and $\bar{f}_k^{-}$ are the average frequencies of the kth dipeptide in the HBP dataset and the non-HBP dataset, respectively; and sgn is the sign function. Thus, the upper and lower limits of $F'(k)$ are 1 and -1, respectively. The first and second residues of the 400 dipeptides are listed in the rows and columns of the heat map, respectively. If $F'(k) > 0$, the kth dipeptide prefers HBPs; otherwise it prefers non-HBPs. In Figure 3, the dipeptides in red and blue boxes are positively and negatively correlated with HBPs, respectively. The redder the element, the more relevant it is to HBPs, and vice versa. From the figure, we found that HBPs contain more Cys (C), His (H), Lys (K), Thr (T), Asn (N) and Arg (R) residues (red) than non-HBPs, whereas non-HBPs contain more Leu (L), Phe (F), Trp (W), and Tyr (Y) residues (blue).
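The color value of each heat-map cell can be sketched as a min-max normalized F-score carrying the sign of the difference between the two class means. The function name and the toy numbers below are hypothetical, and returning 0 when all F-scores are equal is an assumption made only to avoid division by zero.

```python
def signed_normalized_scores(f_scores, mean_pos, mean_neg):
    """Scale F-scores to [0, 1] and attach the sign of (mean in HBPs - mean in non-HBPs)."""
    f_min, f_max = min(f_scores), max(f_scores)
    span = f_max - f_min
    values = []
    for f, mp, mn in zip(f_scores, mean_pos, mean_neg):
        sign = (mp > mn) - (mp < mn)  # +1, 0 or -1
        scaled = (f - f_min) / span if span > 0 else 0.0
        values.append(sign * scaled)
    return values


# Toy values for three features: enriched in positives, neutral, enriched in negatives.
colors = signed_normalized_scores(
    f_scores=[16.0, 0.0, 4.0],
    mean_pos=[0.9, 0.5, 0.2],
    mean_neg=[0.1, 0.5, 0.6],
)
```

Positive values would be drawn in red (enriched in HBPs) and negative values in blue (enriched in non-HBPs), with the magnitude setting the color intensity.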

Figure 3

Heat map or chromaticity diagram for the F-scores of the 400 dipeptides. Red elements indicate the dipeptides enriched in HBPs, whereas blue elements indicate the dipeptides enriched in non-HBPs.

Discussion

The purpose of this work is to develop a powerful tool to accurately recognize HBPs. Currently, approaches for protein function prediction follow two main strategies: one is based on similarity search, and the other on machine learning. In the first strategy, the query sequence is aligned with the sequences in the benchmark dataset to find highly similar sequences or homologues. Well-known tools such as BLAST and FASTA are generally used to perform the sequence alignment. Their advantage is that they are not affected by sequence length. Although this kind of sequence model is straightforward and intuitive, it fails when a query sequence does not have significant similarity to any of the sequences in the training dataset. The machine learning-based method overcomes this disadvantage by transforming any sequence into a vector of the same dimension. Many feature models, such as amino acid composition (AAC) 37, n-mer peptide composition 8, 50, 63, 64, g-gap dipeptide composition 6, 12, 47, and pseudo amino acid composition (PseAAC) 5, 9, 10, 43, 65, 66, have been proposed to formulate protein sequences.

To improve protein function prediction, some scholars have used the Position-Specific Scoring Matrix (PSSM) 3, 67-71 and gene ontology (GO) 72-74 to describe protein samples. Although PSSM and GO always produce high accuracy for protein classification, formulating protein samples with these methods generally leads to significant flaws. PSSM is generated with PSI-BLAST 75, a similarity search tool. Therefore, it is necessary to search for a query protein in a big dataset (usually UniProt or SwissProt) by using PSI-BLAST. In most cases, the big dataset contains the query protein, so the cross-validated results obtained with the machine learning method are not objective or strict. If the dataset does not contain the query sequence but does contain a similar sequence, the cross-validated results can be accepted. However, it is then time-consuming and unnecessary to input the PSSM into a classifier, because BLAST or FASTA can give more accurate and straightforward results. Furthermore, if the dataset contains neither the query sequence nor a similar sequence, the PSSM cannot correctly reflect the consensus motif, which results in a wrong prediction.

We also considered GO information unsuitable for HBP prediction, for the following reasons. GO is designed to describe gene function along three aspects: molecular functions (molecular activities of gene products), cellular components (where gene products are active) and biological processes (pathways and processes made up of the activities of multiple gene products). Computational approaches for identifying protein types aim to determine protein functions. In other words, our computational approaches should be able to predict the GO information of proteins. If the GO information of a protein or its homologues has already been annotated, it is not necessary to predict the function of the protein. Thus, using GO information to predict protein function is like putting the cart before the horse. Besides, the dimension of GO information increases when a new GO node is added, so an old GO-based model cannot handle the new feature. Therefore, these two kinds of features were not adopted in our model. In fact, sequence information is the most objective feature for sample description, and it also obeys the theoretical biology route (also called the reverse biology route) that sequence determines structure, and structure determines function.

For the convenience of most wet-experimental users, a user-friendly web-server called HBPred was established based on the above calculations. The web server can be freely accessed at http://lin-group.cn/server/HBPred. The prediction page is shown in the accompanying figure. One may first upload a sequence file or paste protein sequences in FASTA format into the input box. Then, after clicking the “submit” button, the predicted results will be obtained.

Conclusion

We constructed an effective predictor to identify HBPs and achieved encouraging accuracy. We also discussed why PSSM and GO information are not suitable for HBP prediction. Because a free webserver provides convenience to most wet-experimental scholars 76-80, we established a new tool, called HBPred, to accurately predict potential novel HBPs. We expect that the tool will help scholars to improve drug development for relevant diseases. In the future, we will extend the prediction to the subtypes of HBPs.
