Christopher K Hobbs, Vanessa L Porter, Maxwell L S Stow, Bupe A Siame, Herbert H Tsang, Ka Yin Leung.
Abstract
BACKGROUND: Many gram-negative bacteria use type III secretion systems (T3SSs) to translocate effector proteins into host cells. T3SS effectors can give some bacteria a competitive edge over others within the same environment and can help bacteria to invade the host cells and allow them to multiply rapidly within the host. Therefore, developing efficient methods to identify effectors scattered in bacterial genomes can lead to a better understanding of host-pathogen interactions and ultimately to important medical and biotechnological applications.
Keywords: Effector prediction; Gram-negative bacteria; Machine learning; Type III secretion system
Year: 2016 PMID: 27993130 PMCID: PMC5168842 DOI: 10.1186/s12864-016-3363-1
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
A summary of all 21 attributes used in the project
| Program | Attribute | Full length or N30 region |
|---|---|---|
| Pepstats (a) | Peptide properties: tiny, small, aliphatic, aromatic, polar, non-polar, charged, basic, acidic | N30 region |
| | Charge | Full length |
| CAI (a) | Codon adaptation index | Full length |
| ProtParam (b) | Isoelectric point | Full length |
| POODLE-S (c) | N-terminal disorder (N30 disorder) | N30 region |
| This study | G + C content | Full length |
Web links of the above programs
a http://emboss.sourceforge.net/
b http://web.expasy.org/protparam/
c www.cbrc.jp/cbrc-software
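Two of the attributes above are simple enough to sketch directly. The following is a minimal illustration, not the study's own code: it assumes that the "N30 region" means the first 30 N-terminal residues of a protein and that G + C content is the percentage of G and C among the unambiguous bases of the full-length nucleotide sequence. The function names `n30_region` and `gc_content_percent` are hypothetical.

```python
def n30_region(protein_sequence: str, length: int = 30) -> str:
    """Return the N-terminal region (assumed here: the first 30 residues)."""
    return protein_sequence[:length]


def gc_content_percent(nucleotide_sequence: str) -> float:
    """Return G + C content as a percentage of the unambiguous bases.

    Assumption: ambiguous symbols such as N are ignored; the excerpt does
    not state how the study handled them.
    """
    bases = [b for b in nucleotide_sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G, or T")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    print(n30_region("MKLSTNAVQ" * 5))
    print(gc_content_percent("ATGGCGTTAACC"))
```

Returning a percentage keeps the result on the same scale as the G + C values reported in the attribute statistics.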
Fig. 1 An overview of GenSET Phase 1 selection of the training and testing sets for T3SS effector prediction. Protein or nucleotide sequences from each genome were grouped into three categories: (i) all known T3SS effectors, (ii) non-effectors including non-T3SS annotated proteins, and (iii) all unannotated hypothetical proteins including all T3SS-related proteins. Fifteen randomly picked effectors (E. coli, S. dysenteriae, and S. Typhimurium) or 21 effectors (P. syringae) from (i) became the positive set. The negative training set was a 10-fold larger group of non-effectors randomly selected from (ii) of the same genome. GenSET was trained on the positive and negative sets (the training set) using unfiltered and filtered attributes and then applied to all remaining sequences of the whole genome (the testing set)
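The Phase 1 selection described in this caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' implementation: the function name `split_training_and_testing`, its parameters, and the fixed random seed are assumptions, and sequences are represented only by identifier strings.

```python
import random


def split_training_and_testing(effectors, non_effectors, hypotheticals,
                               n_positive=15, negative_ratio=10, seed=0):
    """Build Phase 1-style training and testing sets for one genome.

    effectors      -- category (i): known T3SS effectors
    non_effectors  -- category (ii): annotated non-effectors
    hypotheticals  -- category (iii): unannotated / T3SS-related proteins
    """
    rng = random.Random(seed)  # seed fixed only to make the sketch repeatable
    positives = rng.sample(effectors, n_positive)
    # The negative set is 10-fold larger than the positive set.
    negatives = rng.sample(non_effectors, n_positive * negative_ratio)
    used = set(positives) | set(negatives)
    # Everything not used for training forms the testing set.
    everything = list(effectors) + list(non_effectors) + list(hypotheticals)
    testing = [seq_id for seq_id in everything if seq_id not in used]
    return positives, negatives, testing


if __name__ == "__main__":
    # Toy genome with invented identifiers and sizes.
    eff = [f"eff{i}" for i in range(21)]
    non = [f"non{i}" for i in range(4000)]
    hyp = [f"hyp{i}" for i in range(500)]
    pos, neg, test = split_training_and_testing(eff, non, hyp)
    print(len(pos), len(neg), len(test))
```

With 21 known effectors and 15 drawn for training, six known effectors remain in the testing set, which is how a small denominator such as 6 can arise when prediction rates are reported per organism.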
Statistical analysis of attributes chosen by the feature selection methods for the four organisms. Four attributes (PEPIB, CAI, N30 disorder, and G + C content) out of the 21 appeared in at least three of the four organisms tested (actual values for all attributes are given in Additional file 1)
| Organism | Attribute | Positive set: Average | Positive set: SD (b) | Negative set: Average | Negative set: SD |
|---|---|---|---|---|---|
| | Non-Polar | 48.67 | 8.89 | 58.42 | 10.70 |
| | PEPIB | 0.86 | 0.07 | 0.77 | 0.14 |
| | CAI | 0.58 | 0.02 | 0.71 | 0.06 |
| | N30 disorder | 0.53 | 0.11 | 0.38 | 0.13 |
| | | | | | |
| | Tiny | 41.11 | 9.45 | 30.32 | 9.09 |
| | Charge (a) | 0.93 | 1.25 | 0.60 | 2.40 |
| | pI (a) | 7.74 | 1.42 | 6.71 | 1.76 |
| | PEPIB | 0.90 | 0.11 | 0.74 | 0.16 |
| | Aliphatic index | 61.19 | 18.33 | 100.96 | 32.15 |
| | CAI | 0.51 | 0.06 | 0.68 | 0.07 |
| | G + C content | 51.62 | 5.77 | 59.04 | 3.91 |
| | N30 disorder | 0.67 | 0.07 | 0.41 | 0.14 |
| | | | | | |
| | Non-Polar | 45.11 | 9.07 | 56.09 | 10.11 |
| | Tiny (a) | 24.22 | 10.87 | 29.16 | 9.29 |
| | G + C content | 34.62 | 1.80 | 51.65 | 3.70 |
| | N30 disorder | 0.47 | 0.12 | 0.22 | 0.13 |
| | | | | | |
| | PEPIB | 0.88 | 0.12 | 0.74 | 0.21 |
| | Instability index | 65.21 | 22.55 | 38.43 | 22.54 |
| | CAI | 0.58 | 0.04 | 0.69 | 0.05 |
| | G + C content | 43.99 | 6.18 | 52.18 | 5.74 |
a Attributes that are not statistically different between the positive and negative sets
b SD: standard deviation
Average performance of the algorithms on the four organisms using unfiltered (U) and filtered (F) attributes. The PPV (positive predictive value or precision), TPR (True positive rate or sensitivity), SPC (specificity or true negative rate), and AUC (area under the curve) values were calculated using the trained (voting) algorithm (actual values for all algorithms are given in Additional file 2)
| GenSET | Organism | Attributes (a) | PPV | TPR | SPC | AUC |
|---|---|---|---|---|---|---|
| Phase 1 | | U | 0.068 | 1.000 | 0.980 | 0.998 |
| | | F | 0.082 | 1.000 | 0.984 | 0.993 |
| | | U | 0.300 | 0.750 | 0.989 | 0.988 |
| | | F | 0.520 | 0.542 | 0.997 | 0.970 |
| | | U | 0.086 | 1.000 | 0.984 | 0.999 |
| | | F | 0.106 | 1.000 | 0.987 | 0.997 |
| | | U | 0.025 | 1.000 | 0.962 | 0.987 |
| | | F | 0.022 | 1.000 | 0.956 | 0.984 |
| Phase 2 | | U | 0.112 | 0.950 | 0.957 | 0.980 |
| | | F | 0.132 | 0.800 | 0.970 | 0.979 |
| | | U | 0.429 | 0.273 | 0.999 | 0.943 |
| | | F | 0.000 | 0.000 | 0.999 | 0.882 |
| | | U | 0.330 | 0.800 | 0.982 | 0.981 |
| | | F | 0.337 | 0.711 | 0.984 | 0.976 |
| | | U | 0.364 | 0.444 | 0.998 | 0.967 |
| | | F | 0.000 | 0.000 | 0.999 | 0.955 |
a The voting algorithm used unfiltered (U) attributes (utilizing all 21 attributes) and filtered (F) attributes (utilizing a subset of attributes selected by the feature selection methods)
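The three threshold-based measures reported above follow directly from confusion-matrix counts. The sketch below uses the standard definitions; the function name `classification_metrics` and the example counts are hypothetical, and returning 0.0 for an empty denominator is an assumed convention, not something stated in the source. AUC is omitted because it requires the ranked prediction scores rather than counts.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute PPV, TPR, and SPC from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Assumed convention: report 0.0 when the denominator is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "PPV": ratio(tp, tp + fp),  # precision
        "TPR": ratio(tp, tp + fn),  # sensitivity
        "SPC": ratio(tn, tn + fp),  # specificity
    }


if __name__ == "__main__":
    # Hypothetical counts: 6 known effectors, all recovered, among
    # roughly 4000 other proteins of which 82 are also called positive.
    metrics = classification_metrics(tp=6, fp=82, tn=4000, fn=0)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

The example shows why a whole-genome testing set can combine perfect sensitivity and high specificity with a low PPV: when true effectors are rare, even a small false-positive rate yields far more false positives than true positives. Proteins counted as false positives here may also include effectors that have not yet been confirmed.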
GenSET prediction of known effectors that were included in the testing set. The number (and percentage) of known effectors among the top 40 protein candidates and in the overall prediction by GenSET are shown (see Additional file 3 for all top 40 proteins predicted to be T3SS effectors)
| GenSET | Organism | Unfiltered set: Top 40 | Unfiltered set: Overall (c) | Filtered set: Top 40 | Filtered set: Overall |
|---|---|---|---|---|---|
| Phase 1 | | 5/6 (b) | 5/6 | 4/6 | 5/6 |
| | | 16/30 | 16/30 | 13/30 | 13/30 |
| | | 5/9 | 8/9 | 5/9 | 5/9 |
| | | 8/9 | 8/9 | 6/9 | 6/9 |
| Phase 1 Average (a) | All organisms | 70.3% | 78.6% | 58.1% | 62.2% |
| Phase 2 | | 14/21 | 19/21 (90.5%) | 12/21 (57.1%) | 16/21 (76.2%) |
| | | 2/8 | 2/8 | 0/8 | 0/8 |
| | | 20/51 (39.2%) | 36/51 (70.6%) | 20/51 (39.2%) | 32/51 (62.7%) |
| | | 4/9 | 4/9 | 0/9 | 0/9 |
| Phase 2 Average | All organisms | 43.8% | 57.6% | 24.1% | 34.7% |
a Average scores were calculated by averaging the unfiltered and filtered percent values of the four organisms
b Effector prediction rates were calculated as the number of effectors in the top 40 positive predictions over the total number of known effectors in the testing set
c Overall denotes all confirmed effector proteins that were predicted to be effectors with true probabilities ≥ 0.5
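The averages in the table above can be reproduced from the per-organism fractions by converting each fraction to a rate and taking the unweighted mean over the four organisms. The sketch below does this for every column; the dictionary keys and the function name `average_percent` are labels introduced here for illustration.

```python
from fractions import Fraction

# Per-organism fractions transcribed from the table, in row order.
phase1 = {
    "unfiltered_top40": ["5/6", "16/30", "5/9", "8/9"],
    "unfiltered_overall": ["5/6", "16/30", "8/9", "8/9"],
    "filtered_top40": ["4/6", "13/30", "5/9", "6/9"],
    "filtered_overall": ["5/6", "13/30", "5/9", "6/9"],
}
phase2 = {
    "unfiltered_top40": ["14/21", "2/8", "20/51", "4/9"],
    "unfiltered_overall": ["19/21", "2/8", "36/51", "4/9"],
    "filtered_top40": ["12/21", "0/8", "20/51", "0/9"],
    "filtered_overall": ["16/21", "0/8", "32/51", "0/9"],
}


def average_percent(fraction_strings):
    """Unweighted mean of the per-organism rates, as a percentage."""
    rates = [Fraction(text) for text in fraction_strings]
    mean_rate = sum(rates) / len(rates)
    return round(float(mean_rate * 100), 1)


if __name__ == "__main__":
    for label, table in (("Phase 1", phase1), ("Phase 2", phase2)):
        for column, values in table.items():
            print(label, column, average_percent(values))
```

Because the mean is unweighted, an organism with only six test effectors contributes as much to the average as one with thirty.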
GenSET performance was compared with six other machine learning programs using the average TPR (sensitivity), SPC (specificity), and AUC (area under the curve) values obtained with unfiltered attributes from the four organisms. GenSET 1 performed better than the other six programs in all areas, whereas GenSET 2 gave better specificities
| Program | TPR | SPC | AUC | Reference |
|---|---|---|---|---|
| EffectiveT3 | ~0.710 | ~0.850 | 0.85–0.86 | [ |
| T3MM | ~0.839 | ~0.903 | N/A (c) | [ |
| SIEVE | 0.9 | 0.88 | 0.95–0.96 | [ |
| BPBAac | ~0.910 | ~0.974 | 0.989 | [ |
| T3SEpre | 0.927 | 0.945 | N/A | [ |
| Meta-analytic | ~0.90 | ~0.90 | | [ |
| GenSET 1 (a) | | | | This study |
| GenSET 2 | 0.617 | | 0.968 | This study |
a GenSET 1 and GenSET 2 denote average results from Phase 1 and Phase 2 respectively
b Bold number denotes the highest value in a given column
c N/A: not applicable or not available
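Since GenSET 1 and GenSET 2 denote Phase 1 and Phase 2 averages obtained with unfiltered attributes, those averages can be recomputed from the unfiltered (U) rows of the per-organism performance table. The sketch below does so under the assumption that a plain arithmetic mean over the four organisms was used; the variable and function names are illustrative. Because the inputs are already rounded to three decimals, the recomputed means may differ from the published values in the last digit.

```python
from statistics import mean

# Unfiltered (U) rows of the per-organism performance table: (TPR, SPC, AUC).
phase1_unfiltered = [
    (1.000, 0.980, 0.998),
    (0.750, 0.989, 0.988),
    (1.000, 0.984, 0.999),
    (1.000, 0.962, 0.987),
]
phase2_unfiltered = [
    (0.950, 0.957, 0.980),
    (0.273, 0.999, 0.943),
    (0.800, 0.982, 0.981),
    (0.444, 0.998, 0.967),
]


def column_means(rows):
    """Arithmetic mean of each measure across organisms."""
    tpr, spc, auc = zip(*rows)
    return {"TPR": mean(tpr), "SPC": mean(spc), "AUC": mean(auc)}


if __name__ == "__main__":
    for label, rows in (("GenSET 1", phase1_unfiltered),
                        ("GenSET 2", phase2_unfiltered)):
        means = column_means(rows)
        print(label, {name: f"{value:.4f}" for name, value in means.items()})
```

The recomputed Phase 2 means for TPR and AUC agree, to within rounding, with the GenSET 2 values of 0.617 and 0.968 listed in the comparison, which supports the plain-mean assumption.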
GenSET 1 T3SS effector predictions on four organisms were compared with those of three other available machine learning programs. GenSET 1 performed better than the other programs in three of the four organisms tested; the exception was S. dysenteriae (see Additional files 3 and 4 for the actual data)
| Program (a) | Top 40 positive predictions out of total confirmed effectors | | | |
|---|---|---|---|---|
| EffectiveT3 | 7/24 (29.2%) | 11/21 (52.4%) | 9/51 (17.7%) | 2/24 (8.3%) (d) |
| T3MM | | 5/21 (23.8%) | 21/51 (41.2%) | 4/24 (16.7%) (d) |
| BPBAac | 13/24 (54.2%) | 12/21 (57.1%) | 20/51 (39.2%) | 7/24 (29.2%) (d) |
| GenSET 1 (b) | 5/9 (55.6%) | | | |
a Other programs, namely SIEVE, T3SEpre, and Meta-analytic, were not available or accessible at the time of investigation
b For the GenSET 1 method, 15 or 21 effectors were removed from the total effectors to form the positive data sets. Thus, its totals are smaller than those of the other programs
c Bold number denotes the highest value in a given column
d Top 40 positive prediction for SPI-2 effectors
e Top 40 positive prediction for SPI-1 effectors
Fig. 2 An overview of GenSET Phase 2 selection of the training and testing sets for T3SS effector prediction. The positive set comprised 30 known effectors (15 each from S. dysenteriae and S. Typhimurium). The negative set was a 10-fold larger group of randomly selected non-effectors from the two organisms. Machine learning was similar to that in GenSET Phase 1 and used unfiltered and filtered attributes. The trained algorithm was applied to the genome of a third organism for T3SS effector prediction. The testing sets comprised all sequences in the test organism’s genome (i.e. E. coli, Y. pestis, P. syringae, and S. fredii)