| Literature DB >> 30425802 |
Shaherin Basith, Balachandran Manavalan, Tae Hwan Shin, Gwang Lee.
Abstract
Growth hormone binding protein (GHBP) is a soluble carrier that selectively and non-covalently interacts with growth hormone, thereby acting as a modulator or inhibitor of growth hormone signalling. Accurate identification of GHBPs from protein sequences also provides important clues for understanding cell growth and cellular mechanisms. In the postgenomic era, a vast amount of protein sequence data has been garnered; hence, it is crucial to develop an automated computational method that enables fast and accurate identification of putative GHBPs among large numbers of candidate proteins. In this study, we describe a novel machine-learning-based predictor called iGHBP for the identification of GHBPs. To predict GHBPs from a given protein sequence, we trained an extremely randomised tree with an optimal feature set that was obtained from a combination of dipeptide composition and amino acid index values by applying a two-step feature selection protocol. During cross-validation analysis, iGHBP achieved an accuracy of 84.9%, approximately 7% higher than that of the control extremely randomised tree predictor trained with all features, demonstrating the effectiveness of our feature selection protocol. Furthermore, when objectively evaluated on an independent data set, the proposed iGHBP method displayed superior performance compared to the existing method. Additionally, a user-friendly web server implementing iGHBP has been established and is available at http://thegleelab.org/iGHBP.
Keywords: Extremely randomised tree; Growth hormone binding protein; Machine learning; Random forest; Support vector machine
Year: 2018 PMID: 30425802 PMCID: PMC6222285 DOI: 10.1016/j.csbj.2018.10.007
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
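The abstract reports an extremely randomised tree (ERT) trained on an optimal feature set and assessed by cross-validation. Below is a minimal sketch of that setup, assuming scikit-learn's ExtraTreesClassifier and leave-one-out cross-validation; the feature matrix, labels, and hyperparameters are placeholders rather than the authors' actual data or settings (the optimal set size of 190 is taken from the independent-test table further below).

```python
# Minimal sketch: ERT classifier scored by leave-one-out cross-validation.
# X and y are random placeholders for the encoded benchmarking data set.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((100, 190))          # placeholder for the 190 optimal features
y = rng.integers(0, 2, size=100)    # placeholder GHBP (1) / non-GHBP (0) labels

ert = ExtraTreesClassifier(n_estimators=100, random_state=0)  # illustrative settings
scores = cross_val_score(ert, X, y, cv=LeaveOneOut(), scoring="accuracy")
print(f"LOOCV accuracy: {scores.mean():.3f}")
```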
Fig. 1 Overview of the proposed methodology for predicting GHBPs, which involved the following steps: (i) data set construction; (ii) feature extraction; (iii) feature ranking; (iv) exploration of various machine-learning algorithms and selection of an appropriate one based on the performance obtained using sequential forward search; and (v) construction of the final prediction model, which separates the input into putative GHBPs and non-GHBPs.
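Step (ii) above extracts sequence-based encodings; dipeptide composition (DPC), one of the two encodings retained in the final iGHBP model, is the normalised frequency of each of the 400 ordered amino-acid pairs in a sequence. A generic sketch of this standard encoding (not the authors' code):

```python
# Dipeptide composition (DPC): 400 normalised dipeptide frequencies.
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def dipeptide_composition(seq: str) -> list[float]:
    """Return the 400-dimensional DPC vector of a protein sequence."""
    total = len(seq) - 1                      # number of overlapping dipeptides
    counts = {dp: 0 for dp in DIPEPTIDES}
    for i in range(total):
        dp = seq[i:i + 2]
        if dp in counts:                      # skip pairs with non-standard residues
            counts[dp] += 1
    return [counts[dp] / total for dp in DIPEPTIDES]

print(len(dipeptide_composition("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")))  # 400
```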
Fig. 2 Performance of different ML-based models using the benchmarking data set. AAC: amino acid composition; DPC: dipeptide composition; CTD: chain-transition-distribution; AAI: amino acid index; PCP: physicochemical properties; H1: AAC + AAI; H2: AAC + DPC + AAI; H3: AAC + DPC + AAI + CTD; H4: AAC + DPC + AAI + CTD + PCP; H5: AAC + DPC; H6: AAC + CTD; H7: AAC + PCP; H8: AAI + DPC; H9: AAI + DPC + CTD; H10: AAI + DPC + CTD + PCP; H11: AAI + CTD; H12: AAI + PCP; H13: DPC + CTD; H14: DPC + CTD + PCP; H15: DPC + PCP; and H16: CTD + PCP.
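Each hybrid encoding is the concatenation of its constituent feature vectors; for instance, given the standard 400-dimensional DPC, the 420-feature size of H8 in the table below implies a 20-dimensional AAI component. A trivial sketch with placeholder vectors:

```python
# Hybrid encodings are plain concatenations of per-encoding feature vectors.
import numpy as np

aai = np.zeros(20)     # amino acid index encoding (20-D, inferred from the table)
dpc = np.zeros(400)    # dipeptide composition (400-D)

h8 = np.concatenate([aai, dpc])   # H8: AAI + DPC
print(h8.shape)                   # (420,) -- matches the ERT row in the table
```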
The performance of the best model for each ML method obtained from different feature encodings.
| Methods | Features | MCC | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|---|---|
| ERT | H8 (420) | 0.546 | 0.772 | 0.740 | 0.805 | 0.813 |
| RF | H5 (420) | 0.546 | 0.776 | 0.829 | 0.724 | 0.805 |
| GB | H10 (577) | 0.545 | 0.772 | 0.789 | 0.756 | 0.806 |
| AB | H5 (420) | 0.531 | 0.764 | 0.715 | 0.813 | 0.767 |
| SVM | H4 (597) | 0.457 | 0.728 | 0.772 | 0.683 | 0.746 |
The first column represents the name of the method developed in this study. The second column represents the hybrid model and its corresponding number of features. The third, fourth, fifth, sixth, and seventh columns, respectively, represent the MCC, accuracy, sensitivity, specificity, and AUC. ERT: extremely randomised tree; RF: random forest; GB: gradient boosting; AB: AdaBoost; and SVM: support vector machine.
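For reference, the reported metrics follow their conventional definitions (the record does not restate the paper's formulas); with TP, TN, FP, and FN denoting true/false positives and negatives:

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}, \qquad \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

AUC is the area under the receiver operating characteristic curve (see Fig. 7).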
Fig. 3 Feature importance scores computed for the hybrid features H5 (A), H8 (B), and H10 (C) using the RF algorithm.
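A minimal sketch of the Fig. 3-style feature ranking, assuming Gini importances from scikit-learn's RandomForestClassifier; the encoded data and hyperparameters are illustrative placeholders.

```python
# Feature ranking via random-forest (Gini) importances, as in Fig. 3.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_hybrid = rng.random((100, 420))   # placeholder for e.g. H8 (AAI + DPC)
y = rng.integers(0, 2, size=100)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_hybrid, y)
ranking = np.argsort(rf.feature_importances_)[::-1]   # most important first
print(ranking[:10])
```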
Fig. 4 Sequential forward search (SFS) curve for discriminating GHBPs and non-GHBPs. The maximum accuracy (i.e., the SFS peak) obtained in leave-one-out cross-validation is indicated by the red circle.
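The SFS procedure behind Fig. 4 can be sketched as follows: features are added in ranked order, the model is re-scored by leave-one-out cross-validation at each step, and the accuracy peak fixes the optimal feature count. The step size, data, and hyperparameters below are placeholders.

```python
# Sequential forward search (SFS) over a precomputed feature ranking.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((60, 420))
y = rng.integers(0, 2, size=60)
ranking = rng.permutation(X.shape[1])    # stand-in for the RF-based ranking

results = []
for k in range(10, X.shape[1] + 1, 50):  # grow the feature set step by step
    clf = ExtraTreesClassifier(n_estimators=50, random_state=0)
    acc = cross_val_score(clf, X[:, ranking[:k]], y, cv=LeaveOneOut()).mean()
    results.append((k, acc))

best_k, best_acc = max(results, key=lambda t: t[1])   # the SFS peak
print(best_k, round(best_acc, 3))
```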
Fig. 5 Performance comparison between the control (without feature selection) and optimal-feature-set-based models of four different ML algorithms. On the x-axis, normal and bold fonts respectively represent the control model and the final model using the optimal feature set.
Fig. 6 Distribution of the GHBPs and non-GHBPs in the benchmarking data set using our hybrid features (A) and the optimal feature set (B).
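The record does not name the projection behind Fig. 6; a t-SNE embedding is one common choice for such two-dimensional class-distribution plots, sketched here on placeholder data.

```python
# Hypothetical Fig. 6-style scatter; t-SNE is an assumption, not the paper's stated method.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.random((100, 190))                 # placeholder optimal feature set
y = rng.integers(0, 2, size=100)

emb = TSNE(n_components=2, random_state=0).fit_transform(X)
for label, name in [(1, "GHBP"), (0, "non-GHBP")]:
    plt.scatter(*emb[y == label].T, s=12, label=name)
plt.legend()
plt.savefig("fig6_style.png")
```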
Performances of various methods on the independent data set.
| Methods | Features | MCC | Accuracy | Sensitivity | Specificity | AUC |
|---|---|---|---|---|---|---|
| ERT | 190 | 0.646 | 0.823 | 0.807 | 0.839 | 0.813 |
| RF | 241 | 0.472 | 0.726 | 0.871 | 0.581 | 0.777 |
| GB | 161 | 0.331 | 0.661 | 0.774 | 0.548 | 0.700 |
| AB | 167 | 0.324 | 0.661 | 0.613 | 0.710 | 0.675 |
| HBPred | 73 | 0.196 | 0.597 | 0.677 | 0.516 | 0.600 |
The first column represents the method name as used in this study. The second column represents the number of features present in the optimal feature set. The third, fourth, fifth, sixth and seventh columns, respectively, represent the MCC, accuracy, sensitivity, specificity, and AUC.
Fig. 7 Receiver operating characteristic curves of the various prediction models on (A) the benchmarking data set (leave-one-out cross-validation) and (B) the independent data set. A higher AUC value indicates better performance of a particular method.
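The ROC curves and AUC values of Fig. 7 can be reproduced from a fitted model's class probabilities; a sketch on placeholder data:

```python
# ROC curve and AUC from predicted class probabilities, as in Fig. 7.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 190))                 # placeholder features
y = rng.integers(0, 2, size=200)           # placeholder labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]      # probability of the GHBP class

fpr, tpr, _ = roc_curve(y_te, proba)
plt.plot(fpr, tpr, label=f"ERT (AUC = {roc_auc_score(y_te, proba):.3f})")
plt.plot([0, 1], [0, 1], "k--")            # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.savefig("roc.png")
```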