Yi Xiong, Qiankun Wang, Junchen Yang, Xiaolei Zhu, Dong-Qing Wei.
Abstract
Gram-negative bacteria use various secretion systems to deliver their secreted effectors. Among these, the type IV secretion system is found in a wide variety of bacterial species and secretes type IV secreted effectors (T4SEs), which play vital roles in host-pathogen interactions. However, experimental approaches to identifying T4SEs are time- and resource-consuming. In the present study, we aimed to develop an in silico stacked ensemble method that predicts, from sequence information alone, whether a protein is an effector of the type IV secretion system. Protein sequences were encoded as position specific scoring matrix (PSSM)-composition features, obtained by summing the rows of the PSSM profile that correspond to the same amino acid residue. Based on the PSSM-composition features, we developed a stacked ensemble model, PredT4SE-Stack, to predict T4SEs. The model uses an ensemble of base-classifiers implemented with various machine learning algorithms, such as support vector machine, gradient boosting machine, and extremely randomized trees, whose outputs serve as inputs to the meta-classifier. Our results demonstrated that the PredT4SE-Stack framework is a feasible and effective way to accurately identify T4SEs from protein sequence information. The datasets and source code of PredT4SE-Stack are freely available at http://xbioinfo.sjtu.edu.cn/PredT4SE_Stack/index.php.
Keywords: machine learning; position specific scoring matrix; sequence information; stacked ensemble method; type IV secreted effector
Year: 2018 PMID: 30416498 PMCID: PMC6212463 DOI: 10.3389/fmicb.2018.02571
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
FIGURE 1. The illustration of PSSM-composition profile calculation for a query sequence.
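The PSSM-composition encoding described in the abstract (summing the PSSM rows that belong to the same amino acid) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `pssm_composition`, the column order `AMINO_ACIDS` (the usual PSI-BLAST ordering), the handling of non-standard residues, and the optional division by sequence length are all assumptions.

```python
# Assumed column order of a PSI-BLAST style PSSM (not stated in the source).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pssm_composition(sequence, pssm, normalize=True):
    """Sum PSSM rows by residue type and flatten the 20x20 result to 400 values.

    sequence  : protein sequence of length L
    pssm      : L rows of 20 scores, one row per sequence position
    normalize : divide by L (an assumption; the source only mentions summing)
    """
    if not sequence or len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must be non-empty and equally long")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    comp = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        i = index.get(residue)
        if i is None:
            continue  # skip non-standard residues such as 'X' (assumption)
        for j, score in enumerate(row):
            comp[i][j] += score
    if normalize:
        length = len(sequence)
        comp = [[value / length for value in row] for row in comp]
    # Row-major flattening: feature i*20 + j is (residue i, PSSM column j).
    return [value for row in comp for value in row]


# Toy input: three positions with constant rows, purely for illustration.
toy_sequence = "ACA"
toy_pssm = [[1.0] * 20, [2.0] * 20, [3.0] * 20]
features = pssm_composition(toy_sequence, toy_pssm, normalize=False)
print(len(features), features[0], features[4 * 20])
```

In the toy input the two alanine rows are added together and the single cysteine row is kept as-is, so every protein, whatever its length, maps to a fixed-size vector of 400 values.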
FIGURE 2. The framework of the stacked ensemble scheme proposed in PredT4SE-Stack.
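The two-stage idea behind a stacked ensemble can be sketched in plain Python. This is a hedged toy, not the PredT4SE-Stack implementation: the base learners here are simple stand-ins (a nearest-centroid scorer and a k-nearest-neighbour scorer) rather than SVM, GBM, or ERT, the data are synthetic, and the use of out-of-fold predictions as meta-features is an assumption about how the first-stage outputs are produced. All function names are hypothetical.

```python
import math
import random


def dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def fit_centroid(X, y):
    """Toy base learner: score by relative distance to the class centroids."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]  # needs both classes
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return lambda x: sigmoid(dist(x, cents[0]) - dist(x, cents[1]))


def fit_knn(X, y, k=5):
    """Toy base learner: fraction of positives among the k nearest points."""
    def predict(x):
        nearest = sorted(zip(X, y), key=lambda p: dist(x, p[0]))[:k]
        return sum(t for _, t in nearest) / len(nearest)
    return predict


def fit_logistic(Z, y, lr=0.5, epochs=500):
    """Meta-classifier: logistic regression trained by batch gradient descent."""
    w, b, n = [0.0] * len(Z[0]), 0.0, len(Z)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for z, t in zip(Z, y):
            err = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b) - t
            gw = [g + err * zi for g, zi in zip(gw, z)]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return lambda z: sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)


def out_of_fold(fit, X, y, n_folds=5):
    """Stage 1: each sample is scored by a model that never saw it in training."""
    oof = [0.0] * len(X)
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in range(fold, len(X), n_folds):
            oof[i] = model(X[i])
    return oof


# Synthetic two-class data (hypothetical, 4 features per sample).
random.seed(0)
y = [i % 2 for i in range(100)]
X = [[random.gauss(1.5 * t, 1.0) for _ in range(4)] for t in y]

base_fits = [fit_centroid, fit_knn]
Z = list(zip(*[out_of_fold(fit, X, y) for fit in base_fits]))  # meta-features
meta = fit_logistic(Z, y)                     # stage 2: train meta-classifier
base_models = [fit(X, y) for fit in base_fits]  # refit base learners on all data


def predict_stack(x):
    """Feed the base-classifier scores of x into the meta-classifier."""
    return meta([model(x) for model in base_models])


print(predict_stack([1.5] * 4), predict_stack([0.0] * 4))
```

Training the meta-classifier on out-of-fold scores rather than on scores for samples the base learners have already seen is a common way to keep the second stage from simply learning the first stage's overfitting.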
Performance comparison of eight types of base-classifiers in the first stage on Train-915 dataset using 5-fold cross validation.
| Method | Parameter | ACC (%) | SE (%) | SP (%) | PR (%) | MCC | F1 |
|---|---|---|---|---|---|---|---|
| NB | laplace = 0 | 73.2 | 81.0 | 69.3 | 57.0 | 0.476 | 0.669 |
| KNN | k = 10 | 85.5 | 82.0 | 87.2 | 76.3 | 0.680 | 0.790 |
| LR | family = “binomial” | 87.9 | 74.8 | 94.4 | 87.1 | 0.722 | 0.803 |
| RF | ntree = 500 | 88.5 | 72.5 | 96.6 | 91.4 | 0.738 | 0.807 |
| ERT | numRandomCuts = 9 | 89.4 | 74.8 | 96.7 | 92.1 | 0.759 | 0.824 |
| SVM | cost = 1, gamma = 2^−8, kernel = “radial” | 90.2 | 78.0 | 96.2 | 91.6 | 0.777 | 0.839 |
| XGB | eta = 0.3, max_depth = 6, nrounds = 500, objective = “binary:logistic” | 90.1 | 78.7 | 95.7 | 90.4 | 0.774 | 0.840 |
| GBM | learn_rate = 0.7, max_depth = 9, ntrees = 50 | 90.5 | 80.0 | 95.7 | 90.7 | 0.784 | 0.847 |
FIGURE 3. ROC curves of base-classifiers in the first stage on Train-915 dataset using 5-fold cross validation.
Performance comparison of eight types of meta-classifiers in the second stage on Train-915 dataset using 5-fold cross validation.
| Method | Parameter | ACC (%) | SE (%) | SP (%) | PR (%) | MCC | F1 |
|---|---|---|---|---|---|---|---|
| ERT | numRandomCuts = 9 | 88.9 | 80.3 | 93.1 | 86.5 | 0.752 | 0.828 |
| RF | ntree = 500 | 90.4 | 81.0 | 95.1 | 89.7 | 0.783 | 0.847 |
| SVM | cost = 10, gamma = 2^−10, kernel = “radial” | 90.6 | 80.3 | 95.7 | 90.7 | 0.787 | 0.849 |
| GBM | learn_rate = 0.1, max_depth = 3, ntrees = 50 | 90.6 | 82.0 | 94.9 | 89.3 | 0.788 | 0.851 |
| XGB | eta = 0.1, max_depth = 2, nrounds = 100, objective = “binary:logistic” | 90.7 | 81.3 | 95.4 | 90.4 | 0.791 | 0.852 |
| NB | laplace = 0 | 90.9 | 82.3 | 95.2 | 89.9 | 0.795 | 0.857 |
| KNN | k = 19 | 91.0 | 82.0 | 95.6 | 90.5 | 0.797 | 0.857 |
| LR | family = “binomial” | 91.1 | 81.0 | 96.2 | 91.9 | 0.800 | 0.858 |
FIGURE 4. ROC curves of meta-classifiers in the second stage on Train-915 dataset using 5-fold cross validation.
Performance comparison of eight types of meta-classifiers in the second stage on the independent dataset Test-850.
| Method | ACC (%) | SE (%) | SP (%) | PR (%) | MCC | F1 |
|---|---|---|---|---|---|---|
| XGB | 92.4 | 85.3 | 93.0 | 54.2 | 0.643 | 0.663 |
| GBM | 93.1 | 88.0 | 93.5 | 56.9 | 0.674 | 0.691 |
| KNN | 93.5 | 88.0 | 94.1 | 58.9 | 0.688 | 0.706 |
| RF | 93.8 | 86.7 | 94.5 | 60.2 | 0.691 | 0.710 |
| NB | 93.8 | 88.0 | 94.3 | 60.0 | 0.696 | 0.714 |
| ERT | 94.0 | 88.0 | 94.6 | 61.1 | 0.703 | 0.721 |
| LR | 94.4 | 88.0 | 95.0 | 62.9 | 0.715 | 0.733 |
| SVM | 94.5 | 86.7 | 95.2 | 63.7 | 0.715 | 0.734 |
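The column headings in these tables are not defined in this record. Assuming they carry their usual meanings (accuracy, sensitivity, specificity, precision, Matthews correlation coefficient, and F1 score), they can all be computed from the four confusion-matrix counts, as in the sketch below. The counts used are not taken from the source: they are one set of values, back-calculated here, that is consistent with the SVM row on Test-850, and the function name `binary_metrics` is hypothetical.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    se = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    pr = tp / (tp + fp)                      # precision
    acc = (tp + tn) / (tp + fn + tn + fp)    # accuracy
    f1 = 2 * pr * se / (pr + se)             # harmonic mean of PR and SE
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"ACC": acc, "SE": se, "SP": sp, "PR": pr, "MCC": mcc, "F1": f1}


# Assumed counts (not reported in the source), chosen to match the SVM row.
metrics = binary_metrics(tp=65, fn=10, tn=738, fp=37)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts the precision is far lower than the sensitivity and specificity, which is the pattern expected when negatives greatly outnumber positives: even a small false-positive rate yields many false positives relative to the true positives.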
Performance comparison between our method and the other method on the independent dataset Test-850.
| Method | ACC (%) | SE (%) | SP (%) | PR (%) | MCC | F1 |
|---|---|---|---|---|---|---|
|  | 85.3 | 90.7 | 84.8 | 36.6 | 0.518 | 0.521 |
| PredT4SE-Stack (SVM, 0.23) | 87.5 | 90.7 | 87.2 | 40.7 | 0.556 | 0.562 |
| PredT4SE-Stack (SVM, 0.50) | 94.5 | 86.7 | 95.2 | 63.7 | 0.715 | 0.734 |
| PredT4SE-Stack (LR, 0.11) | 88.7 | 90.7 | 88.5 | 43.3 | 0.579 | 0.586 |
| PredT4SE-Stack (LR, 0.50) | 94.4 | 88.0 | 95.0 | 62.9 | 0.715 | 0.733 |
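The second value in each PredT4SE-Stack label (0.23, 0.11, 0.50) is not explained in this record. Assuming it is the probability cutoff above which a protein is called a T4SE, the trade-off visible in the table (lower cutoff, higher SE, lower SP) can be reproduced with a small sketch. The scores and labels below are hypothetical and the function names are illustrative only.

```python
def confusion_at(scores, labels, threshold):
    """Count TP, FN, TN, FP when score >= threshold is predicted positive."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp, fn, tn, fp


def sensitivity_specificity(scores, labels, threshold):
    tp, fn, tn, fp = confusion_at(scores, labels, threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical meta-classifier probabilities for 4 positives and 6 negatives.
scores = [0.95, 0.80, 0.40, 0.30, 0.60, 0.35, 0.20, 0.10, 0.05, 0.02]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

for cutoff in (0.50, 0.23):
    se, sp = sensitivity_specificity(scores, labels, cutoff)
    print(f"cutoff={cutoff:.2f}  SE={se:.3f}  SP={sp:.3f}")
```

Lowering the cutoff can only turn negative predictions into positive ones, so sensitivity never decreases and specificity never increases; which operating point is preferable depends on how costly missed effectors are relative to false alarms.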