Chaolu Meng, Yang Hu, Ying Zhang, Fei Guo.
Abstract
Polystyrene binding peptides (PSBPs) play a key role in the immobilization process. The correct identification of PSBPs is the first step of all related work. In this paper, we propose a novel support vector machine-based bioinformatic identification model. The model is built in four machine learning steps: feature extraction, feature selection, model training, and optimization. In a five-fold cross-validation test, the model achieves a sensitivity (SN) of 90.38%, a specificity (SP) of 84.62%, an accuracy (ACC) of 87.50%, and an area under the curve (AUC) of 0.90. It outperforms the state-of-the-art identifier in terms of SN and ACC while using a smaller feature set. Furthermore, we constructed a web server that includes the proposed model, which is freely accessible at http://server.malab.cn/PSBP-SVM/index.jsp.
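As a rough check on how the reported percentages relate to one another, the sketch below computes SN, SP, and ACC from confusion-matrix counts. The counts are an assumption: they are not stated in the abstract, but were chosen to be consistent with the reported values on a dataset of 104 positive and 104 negative samples (the benchmark size given in the Figure 1 caption). The function name `binary_metrics` is illustrative only.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, and accuracy as percentages."""
    sn = 100.0 * tp / (tp + fn)                    # true positive rate
    sp = 100.0 * tn / (tn + fp)                    # true negative rate
    acc = 100.0 * (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    return sn, sp, acc


# Assumed counts (not from the paper): 94 of 104 positives and
# 88 of 104 negatives classified correctly.
sn, sp, acc = binary_metrics(tp=94, fn=10, tn=88, fp=16)
print(f"SN={sn:.2f}%  SP={sp:.2f}%  ACC={acc:.2f}%")
# prints: SN=90.38%  SP=84.62%  ACC=87.50%
```

AUC is not derivable from these counts alone, since it depends on the ranking of decision scores rather than on a single threshold.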
Keywords: bioinformatic; identifier; machine learning; polystyrene binding peptides; support vector machine
Year: 2020 PMID: 32296690 PMCID: PMC7137786 DOI: 10.3389/fbioe.2020.00245
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
FIGURE 1. The framework and identification process of the PSBP-SVM. (A) Data collection. The benchmark dataset consists of 104 positive samples and 104 negative samples. (B) Feature extraction. A 420-dimensional feature is extracted from the benchmark dataset. (C) Feature selection. The optimal feature set is generated by the ANOVA ranking algorithm and the IFS process. (D) Model training and optimization. The optimal feature set is used to train and optimize the model. The PSBP identification process is based on the four parts in yellow boxes. These parts are ➀ feature extraction, ➁ feature selection, ➂ PSBP identification, and ➃ the result.
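The feature extraction step in panel (B) can be illustrated with a minimal sketch. It assumes that the 420 dimensions are the 20 amino acid composition (AAC) frequencies plus the 400 dipeptide composition (DPC) frequencies; the captions mention AAC and DPC, but the exact encoding is not spelled out here, so this breakdown is an assumption. The example peptide and all function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def aac(seq):
    """Amino acid composition: frequency of each residue (20 values)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide composition: frequency of each adjacent pair (400 values)."""
    total = len(seq) - 1
    if total <= 0:
        return [0.0] * len(DIPEPTIDES)
    counts = {d: 0 for d in DIPEPTIDES}
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs with non-standard residues
            counts[pair] += 1
    return [counts[d] / total for d in DIPEPTIDES]


def extract_features(seq):
    """Concatenate AAC and DPC into one 420-dimensional vector."""
    seq = seq.upper()
    return aac(seq) + dpc(seq)


# Hypothetical 12-residue peptide, used only to exercise the functions.
features = extract_features("HWGMWSYFPRLT")
print(len(features))  # prints: 420
```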
FIGURE 2. Comparison of the PSBP-SVM and other identifiers. (A) Comparison with other feature extraction identifiers. (B) Comparison with other classification algorithm identifiers. (C) Comparison with other feature selection identifiers. (D) Comparison with the state-of-the-art identifier.
FIGURE 3. Analysis of the 123D optimal feature set. (A) Plot of the accuracy of incremental feature selection. (B) Composition of the optimal feature set. (C) DPC and AAC occurrences. (D) Number of dipeptide types in the DPC.
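The ANOVA ranking and incremental feature selection (IFS) behind an optimal feature set like this one can be sketched as follows. This is a generic illustration, not the authors' code: features are ranked by a one-way ANOVA F-score, then the top-k prefix of the ranking is evaluated for increasing k and the best-scoring prefix is kept. The `evaluate` callback is hypothetical and stands in for whatever score is used (for example, five-fold cross-validation accuracy of an SVM); the toy data are invented for demonstration.

```python
def anova_f(values, labels):
    """One-way ANOVA F-score of a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {y: sum(g) / len(g) for y, g in groups.items()}
    between = sum(len(g) * (means[y] - grand_mean) ** 2
                  for y, g in groups.items()) / (k - 1)
    within = sum((v - means[y]) ** 2
                 for y, g in groups.items() for v in g) / (n - k)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(X, y):
    """Return feature indices sorted by descending F-score."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def incremental_feature_selection(ranked, evaluate):
    """Evaluate each top-k prefix of the ranking; keep the best one."""
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        score = evaluate(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Toy data (invented): feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [3.0, 4.0], [3.2, 4.5]]
y = [1, 1, 0, 0]
ranked = rank_features(X, y)

# Hypothetical scorer: pretend that using only feature 0 is best.
best_subset, best_score = incremental_feature_selection(
    ranked, evaluate=lambda subset: 1.0 if subset == [0] else 0.5
)
print(ranked, best_subset)
```

Plotting the score returned by `evaluate` against k gives a curve of the kind described in panel (A).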