Chao Wang, Jin Wu, Lei Xu, Quan Zou.
Abstract
Non-classically secreted proteins (NCSPs) are proteins that are found in the extracellular environment despite lacking known signal peptides or secretion motifs. They usually perform different biological functions in intracellular and extracellular environments, and several of these functions are linked to bacterial virulence and cell defence. Accurate protein localization is essential for all living organisms; however, existing methods for NCSP identification perform unsatisfactorily and, in particular, suffer from data deficiency and possible overfitting. Further improvement is desirable, especially to address the lack of informative features and to mine subset-specific features in imbalanced datasets. In the present study, a new computational predictor was developed for NCSP prediction in Gram-positive bacteria. First, to address the possible prediction bias caused by data imbalance, ten balanced subdatasets were generated for ensemble model construction. Second, the F-score algorithm combined with sequential forward search was used to strengthen the feature representation ability for each training subdataset. Third, a subset-specific optimal feature combination process was adopted to characterize the original data from different aspects, and all subdataset-based models were integrated into a unified model, NonClasGP-Pred. In ten-fold cross-validation, the model achieved an accuracy of 93.23 %, a sensitivity of 100 %, a specificity of 89.01 %, a Matthews correlation coefficient of 87.68 % and an area under the curve of 0.9975. On the independent test dataset, the proposed model outperformed state-of-the-art available toolkits. For availability and implementation, see: http://lab.malab.cn/~wangchao/softwares/NonClasGP/.
Keywords: feature selection; imbalanced dataset; machine learning; model ensemble; non-classically secreted proteins
Year: 2020 PMID: 33245691 PMCID: PMC8116686 DOI: 10.1099/mgen.0.000483
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1. Framework of NonClasGP-Pred.
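The first step of the framework, generating ten balanced subdatasets, can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes the negatives are the majority class and that each subdataset pairs all positives with an equal-sized random draw of negatives. The function name `make_balanced_subsets`, the seed and the toy identifiers are hypothetical.

```python
import random

def make_balanced_subsets(positives, negatives, n_subsets=10, seed=0):
    """Pair all positives with an equal-sized random draw of negatives, n_subsets times.

    Assumption: random undersampling of the majority (negative) class.
    The source only states that ten balanced subdatasets were generated.
    """
    if len(negatives) < len(positives):
        raise ValueError("expected negatives to be the majority class")
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        # Sample without replacement within a subset; subsets may overlap each other.
        sampled_neg = rng.sample(negatives, len(positives))
        data = [(x, 1) for x in positives] + [(x, 0) for x in sampled_neg]
        rng.shuffle(data)
        subsets.append(data)
    return subsets

# Toy identifiers only; these are not the study's sequences or class sizes.
pos = [f"P{i}" for i in range(5)]
neg = [f"N{i}" for i in range(40)]
subsets = make_balanced_subsets(pos, neg)
```

Each element of `subsets` could then be used to train one sub-model, so that every sub-model sees all positives but a different slice of the negatives.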
Descriptor feature dimensions and parameter search range
| Feature descriptor | Parameter | Feature dimension | Search range | Optimal value |
|---|---|---|---|---|
| PAAC | λ | λ+20 | [1, 2, 3, …, 50] | 11 |
| CKSAAP | K | (K+1)×400 | [0, 1, 2, 3, …, 9] | 9 |
| NMBroto | nlag | nlag×8 | [1, 2, 3, …, 50] | 19 |
| QSOrder | nlag | nlag×2+40 | [1, 2, 3, …, 50] | 4 |
| AAC | – | 20 | – | – |
| CTDC | – | 39 | – | – |
| CTDT | – | 39 | – | – |
| CTriad | – | 343 | – | – |
| DDE | – | 400 | – | – |
| DPC | – | 400 | – | – |
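As a quick check on the table, the dimension formulas can be evaluated at the reported optimal parameter values. The snippet below is only arithmetic on the numbers listed above; the dictionary names are illustrative.

```python
# Dimension formulas for the four parameterised descriptors, as listed in the table.
DIMENSION = {
    "PAAC": lambda lam: lam + 20,
    "CKSAAP": lambda k: (k + 1) * 400,
    "NMBroto": lambda nlag: nlag * 8,
    "QSOrder": lambda nlag: nlag * 2 + 40,
}
# Optimal parameter values reported in the table.
OPTIMAL = {"PAAC": 11, "CKSAAP": 9, "NMBroto": 19, "QSOrder": 4}
# Descriptors with a fixed dimension (no tunable parameter).
FIXED = {"AAC": 20, "CTDC": 39, "CTDT": 39, "CTriad": 343, "DDE": 400, "DPC": 400}

dims = {name: formula(OPTIMAL[name]) for name, formula in DIMENSION.items()}
dims.update(FIXED)

for name, dim in dims.items():
    print(f"{name}: {dim}")
```

At the optimal settings this gives 31 dimensions for PAAC, 4000 for CKSAAP, 152 for NMBroto and 48 for QSOrder.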
Fig. 2. ROC curve of four feature descriptors with different algorithm parameters.
Fig. 3. Feature dimension and model performance (ACC) before and after feature selection. Dim-BFS: feature dimension before feature selection; Dim-AFS: feature dimension after feature selection; ACC-BFS: ACC of the model before feature selection; ACC-AFS: ACC of the model after feature selection.
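The feature selection step (F-score ranking followed by sequential forward search) might look roughly like the sketch below. Several things here are assumptions rather than details from the source: the F-score is taken in its commonly used two-class form (between-class separation of the feature means divided by the sum of within-class sample variances), the forward search adds features in descending F-score order and keeps the best-scoring prefix, and `centroid_accuracy` is a toy stand-in for the cross-validated classifier accuracy a real pipeline would use.

```python
from statistics import mean, variance

def f_scores(X, y):
    """F-score per feature column (assumed two-class form); higher is more discriminative."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        col_all = [row[j] for row in X]
        col_pos = [row[j] for row in pos]
        col_neg = [row[j] for row in neg]
        m, mp, mn = mean(col_all), mean(col_pos), mean(col_neg)
        between = (mp - m) ** 2 + (mn - m) ** 2
        within = variance(col_pos) + variance(col_neg)  # sample variances (n - 1)
        scores.append(between / within if within > 0 else 0.0)
    return scores

def forward_search(X, y, evaluate):
    """Add features in descending F-score order; keep the prefix with the best score."""
    scores = f_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(ranked) + 1):
        subset = [[row[j] for j in ranked[:k]] for row in X]
        score = evaluate(subset, y)
        if score > best_score:  # ties keep the smaller feature set
            best_k, best_score = k, score
    return ranked[:best_k], best_score

def centroid_accuracy(X, y):
    """Toy scorer: training accuracy of a nearest-centroid rule (not the study's classifier)."""
    def centroid(rows):
        return [mean(col) for col in zip(*rows)]
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    c1 = centroid([r for r, label in zip(X, y) if label == 1])
    c0 = centroid([r for r, label in zip(X, y) if label == 0])
    preds = [1 if sq_dist(r, c1) < sq_dist(r, c0) else 0 for r in X]
    return sum(p == label for p, label in zip(preds, y)) / len(y)

# Toy data: column 0 separates the classes, column 1 is noise.
X = [[1.0, 5.0], [1.2, 1.0], [0.9, 3.0], [3.0, 4.0], [3.1, 2.0], [2.9, 3.5]]
y = [0, 0, 0, 1, 1, 1]
selected, score = forward_search(X, y, centroid_accuracy)
```

On this toy data the search keeps only column 0, which mirrors the pattern in the figure: a much smaller feature set with accuracy that is no worse than before selection.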
Fig. 4. Subdataset-specific optimal feature combination. The black squares represent the composition of the best feature combination for a specific training subdataset based on ACC, and the grey squares represent alternative features that give an equally good model. For instance, QSOrder and AAC are alternatives for each other in the optimal feature subset of TD4; in other words, the combination NMBroto + QSOrder + CTDT + CTriad achieved an ACC value equal to that of the combination NMBroto + AAC + CTDT + CTriad.
Fig. 5. Performance comparison between the models built on individual training subsets and the ensemble model by ten-fold cross-validation.
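One simple way to combine the sub-models into a single predictor is to average their positive-class probabilities and threshold the mean. The record does not state which integration rule NonClasGP-Pred uses, so the averaging rule, the 0.5 threshold and the stand-in `models` below are all assumptions made for illustration; majority voting would be an equally plausible choice.

```python
def ensemble_predict(models, x, threshold=0.5):
    """Average the positive-class probabilities of all sub-models and threshold the mean."""
    probs = [model(x) for model in models]
    avg = sum(probs) / len(probs)
    return avg, int(avg >= threshold)

# Hypothetical stand-ins for ten trained sub-models: each maps a sample to P(NCSP).
biases = (-0.2, -0.15, -0.1, -0.05, 0.0, 0.0, 0.05, 0.1, 0.15, 0.2)
models = [lambda x, b=b: min(1.0, max(0.0, x + b)) for b in biases]

avg_hi, label_hi = ensemble_predict(models, 0.6)
avg_lo, label_lo = ensemble_predict(models, 0.3)
```

Averaging smooths out the disagreement between sub-models trained on different negative samples, which is the usual motivation for ensembling over balanced subsets.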
Fig. 6. Performance comparison between PeNGaRoo, SecretomeP and NonClasGP-Pred on independent test data.
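The threshold-based metrics used in these comparisons (accuracy, sensitivity, specificity and the Matthews correlation coefficient) follow from the confusion matrix by their standard definitions. The sketch below uses made-up counts purely to show the arithmetic; they are not results from the study.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics; assumes both classes are present."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)   # sensitivity (recall on positives)
    sp = tn / (tn + fp)   # specificity (recall on negatives)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}

# Illustrative counts only.
metrics = binary_metrics(tp=45, tn=40, fp=10, fn=5)
```

Note that the function returns MCC on its native scale of -1 to 1, whereas the abstract reports it as a percentage.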