Zhila Esna Ashari, Nairanjana Dasgupta, Kelly A Brayton, Shira L Broschat.
Abstract
Type IV secretion systems (T4SSs) are multi-protein complexes in a number of bacterial pathogens that can translocate proteins and DNA to the host. Most T4SSs function in conjugation and translocate DNA; however, approximately 13% function to secrete proteins, delivering effector proteins into the cytosol of eukaryotic host cells. Upon entry, these effectors manipulate the host cell's machinery for the pathogen's benefit, which can result in serious illness or death of the host. For this reason, recognition of T4SS effectors has become an important subject. Much previous work has focused on verifying effectors experimentally, a costly endeavor in terms of money, time, and effort. Good predictions for effectors will help to focus experimental validation and decrease testing costs. In recent years, several scoring and machine learning-based methods have been suggested for predicting T4SS effector proteins. These methods have used different sets of features for prediction, and their predictions have been inconsistent. In this paper, an optimal set of features is presented for predicting T4SS effector proteins using a statistical approach. A thorough literature search was performed to find features that have been proposed. Feature values were calculated for datasets of known effectors and non-effectors for T4SS-containing pathogens from four genera with a sufficient number of known effectors: Legionella pneumophila, Coxiella burnetii, Brucella spp, and Bartonella spp. The features were ranked, and less important features were filtered out. Correlations between the remaining features were removed, and dimensionality reduction was accomplished using principal component analysis and factor analysis. Finally, the optimal features for each pathogen were chosen by building logistic regression models and evaluating each model.
The results based on evaluation of our logistic regression models confirm the effectiveness of our four optimal sets of features, and based on these an optimal set of features is proposed for all T4SS effector proteins.
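
The final step of the workflow, fitting a logistic regression model on a reduced feature set and evaluating it, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the two features, the toy values, the learning rate, the number of epochs, and the function names `fit_logistic` and `predict_proba` are all assumptions made for the example.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    """Predicted probability that a protein with features x is an effector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical standardized feature values (for instance, two retained factors).
# 1 = known effector, 0 = known non-effector. Illustrative data only.
X = [[1.2, 0.8], [0.9, 1.1], [1.5, 0.3], [-1.0, -0.7], [-1.3, -0.2], [-0.8, -1.4]]
y = [1, 1, 1, 0, 0, 0]

w, b = fit_logistic(X, y)
predictions = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in X]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(accuracy)
```

In practice a model would be evaluated on proteins held out from fitting; the toy data here is scored on the training set only to keep the sketch short.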
Year: 2018 PMID: 29742157 PMCID: PMC5942808 DOI: 10.1371/journal.pone.0197041
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Workflow used to identify optimal features for predicting T4SS effector proteins.
Fig 2. PCA scree plots showing each eigenvalue versus its factor (principal component) number.
The dashed vertical line in each plot shows a cut-off value of one for the eigenvalue. Factors to the right of each line were discarded. The number of factors used for each pathogen is given in the top right corner of each plot. (A) L. pneumophila, (B) C. burnetii, (C) Brucella spp, and (D) Bartonella spp.
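
The cut-off rule in this caption (keep factors whose eigenvalue exceeds one) can be expressed in a few lines. The sketch below assumes the eigenvalues have already been computed from a feature correlation matrix; the function name `retain_by_eigenvalue` and the eigenvalues themselves are hypothetical and are not taken from the paper.

```python
def retain_by_eigenvalue(eigenvalues, cutoff=1.0):
    """Return the eigenvalues kept under an 'eigenvalue > cutoff' rule,
    plus the fraction of total variance they account for."""
    ordered = sorted(eigenvalues, reverse=True)  # scree plots use descending order
    kept = [ev for ev in ordered if ev > cutoff]
    explained = sum(kept) / sum(ordered)
    return kept, explained

# Hypothetical eigenvalues of a 6-feature correlation matrix (they sum to 6).
eigenvalues = [2.7, 1.4, 1.1, 0.5, 0.2, 0.1]
kept, explained = retain_by_eigenvalue(eigenvalues)
print(len(kept), round(explained, 3))
```

For a correlation matrix the eigenvalues sum to the number of features, so a cut-off of one keeps only the factors that explain more variance than a single standardized feature would.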
Hosmer-Lemeshow goodness-of-fit test: Concordant percentages between effector predictions from our logistic regression models and known effectors.
| Concordant percentage | |||
|---|---|---|---|
| 97.8 | 95.8 | 98.4 | 98.0 |
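
One common definition of a concordant percentage for a logistic regression model is the share of (effector, non-effector) pairs in which the effector receives the higher predicted probability. The sketch below computes that quantity on made-up values. Whether this is exactly how the percentages in the table were obtained is an assumption, and the function name `percent_concordant` and the data are hypothetical. It does not compute the Hosmer-Lemeshow statistic itself.

```python
def percent_concordant(probabilities, labels):
    """Share of (effector, non-effector) pairs in which the effector
    receives the higher predicted probability, as a percentage."""
    pos = [p for p, t in zip(probabilities, labels) if t == 1]
    neg = [p for p, t in zip(probabilities, labels) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one effector and one non-effector")
    concordant = sum(1 for pp in pos for pn in neg if pp > pn)
    return 100.0 * concordant / (len(pos) * len(neg))

# Hypothetical predicted probabilities and true labels (1 = effector).
probs = [0.95, 0.80, 0.40, 0.60, 0.10, 0.05]
labels = [1, 1, 1, 0, 0, 0]
concordance = percent_concordant(probs, labels)
print(round(concordance, 1))
```

Here eight of the nine effector/non-effector pairs are ordered correctly; the only discordant pair is the effector scored 0.40 against the non-effector scored 0.60.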
Fig 3. Histograms of residuals, showing the frequency of each residual interval.
Residuals represent the difference between true and predicted values using our final logistic regression models. The residuals are concentrated around zero, normally distributed, not skewed, and free of outliers. (A) L. pneumophila, (B) C. burnetii, (C) Brucella spp, and (D) Bartonella spp.
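
A histogram of this kind can be built by taking each residual as the true label minus the predicted probability and counting how many fall in each interval. The sketch below assumes raw residuals and equal-width bins over [-1, 1]; the paper does not state which residual type or binning was used, and the function name `residual_histogram` and the data are hypothetical.

```python
def residual_histogram(y_true, y_prob, n_bins=4):
    """Bin raw residuals (true label minus predicted probability) into
    equal-width intervals over [-1, 1] and count each interval."""
    residuals = [t - p for t, p in zip(y_true, y_prob)]
    width = 2.0 / n_bins
    counts = [0] * n_bins
    for r in residuals:
        # A residual of exactly 1 is placed in the last bin.
        idx = min(int((r + 1.0) / width), n_bins - 1)
        counts[idx] += 1
    return residuals, counts

# Hypothetical labels (1 = effector) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3]
residuals, counts = residual_histogram(y_true, y_prob)
print(counts)
```

With four bins the intervals are [-1, -0.5), [-0.5, 0), [0, 0.5), and [0.5, 1], so counts concentrated in the two middle bins correspond to residuals close to zero.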
Features in different steps: features are ranked by p-value, and those selected by the filtering method are underlined for each pathogen.
Those selected using logistic regression are in bold. The last column shows the features selected for T4SS effector prediction. The upper part of the table shows vector features, and the bottom part shows features that were not selected for any of the pathogens.
| No. | Features | Selected | ||||
|---|---|---|---|---|---|---|
| 1 | AA composition | * | ||||
| 2 | Auto-covariance of PSSM | * | ||||
| 3 | PSSM composition | * | ||||
| 4 | Dipeptide composition | * | ||||
| 5 | Homology to known effectors | * | ||||
| 6 | Average hydropathy | 13 | * | |||
| 7 | Total hydropathy | 8 | * | |||
| 8 | Hydropathy of C terminal | 21 | 23 | * | ||
| 9 | Pepcoil hitcount | 7 | 19 | * | ||
| 10 | Hydropathy of N terminal | 28 | * | |||
| 11 | Pepcoil length | 8 | 20 | * | ||
| 12 | Charge of C terminal | 35 | 3 | 9 | ||
| 13 | Coiled coil domain | 11 | 22 | * | ||
| 14 | Signal peptide probability | 37 | 2 | 27 | * | |
| 15 | Polarity | 29 | 29 | 15 | * | |
| 16 | Molecular mass | 28 | 23 | 16 | * | |
| 17 | Maximum cleavage site probability | 36 | 16 | 24 | * | |
| 18 | Transmembrane helices | 14 | 30 | 14 | ||
| 19 | Length | 15 | 24 | 18 | * | |
| 20 | Isoelectric point | 30 | 25 | 17 | * | |
| 21 | Ank domain | 31 | 28 | |||
| 22 | Basicity of N terminal | 34 | 22 | 8 | * | |
| 23 | E-Block | 18 | 10 | 28 | ||
| 24 | Coiled coils secondary structure | 40 | 17 | 25 | ||
| 25 | 38 | 19 | 11 | * | ||
| 26 | 24 | 26 | ||||
| 27 | Transmembrane prediction by Philius | 35 | 13 | 20 | 6 | |
| 28 | Total charge | 21 | 39 | 18 | 7 | |
| 29 | Charge of N terminal | 23 | 17 | 27 | 10 | |
| 30 | Basicity of C terminal | 31 | 38 | 4 | 21 | |
| 31 | Combined content of I, L, V and F | 41 | 32 | 9 | 28 | |
| 32 | Combined content of D and E | 42 | 22 | 31 | 28 | |
| 33 | Combined content of N and Q | 28 | 33 | 31 | 28 | |
| 34 | Combined content of R, K and H | 37 | 25 | 6 | 28 | |
| 35 | Combined content of S and T | 43 | 27 | 31 | 28 | |
| 36 | Combined content of S, N, E, and K | 27 | 19 | 12 | 28 | |
| 37 | Combined content of V, A, G and I | 43 | 23 | 31 | 28 | |
| 38 | Protein subcellular localization | 23 | 13 | 5 | 13 | |
| 39 | DUF domain | 33 | 26 | 15 | 28 | |
| 40 | TM domain | 25 | 16 | 14 | 12 | |
| 41 | F-box domain | 26 | 31 | 31 | 28 | |
| 42 | F-box like domain | 29 | 40 | 31 | 28 | |
| 43 | U-box domain | 34 | 40 | 31 | 28 | |
| 44 | Pkinase domain | 39 | 20 | 31 | 28 | |
| 45 | LRR domain | 43 | 40 | 31 | 28 | |
| 46 | TPR domain | 43 | 40 | 31 | 28 | |
| 47 | Sel1 domain | 32 | 21 | 31 | 28 | |
| 48 | Patatin domain | 22 | 40 | 31 | 28 | |
| 49 | NLS domain | 20 | 24 | 31 | 26 | |
| 50 | MLS domain | 36 | 40 | 31 | 28 | |
| 51 | Prenylation domain | 30 | 30 | 31 | 28 |
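
Ranking features by p-value, as the table caption describes, can be sketched as below. The repeated rank values in the table suggest that tied features share a rank, but the exact tie-handling rule is an assumption here; the sketch uses standard competition ranking (1, 2, 2, 4). The function name `rank_by_p_value`, the feature names, and the p-values are hypothetical.

```python
def rank_by_p_value(p_values):
    """Map feature name -> rank, smallest p-value first. Tied p-values
    share a rank and the following ranks are skipped (1, 2, 2, 4)."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    ranks = {}
    previous_p, previous_rank = None, 0
    for position, (name, p) in enumerate(ordered, start=1):
        rank = previous_rank if p == previous_p else position
        ranks[name] = rank
        previous_p, previous_rank = p, rank
    return ranks

# Hypothetical p-values for five unnamed features. Illustrative only.
p_values = {
    "feature_a": 0.001,
    "feature_b": 0.20,
    "feature_c": 0.03,
    "feature_d": 1.0,
    "feature_e": 1.0,
}
ranks = rank_by_p_value(p_values)
print(ranks)
```

Features that are absent from every protein in a dataset would all receive the same uninformative p-value, which is one plausible reason for long runs of identical ranks in a column.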