Bing Ma, Amy O Charkowski, Jeremy D Glasner, Nicole T Perna.
Abstract
BACKGROUND: A wealth of genome sequences has provided thousands of genes of unknown function, but identifying functions for the large numbers of hypothetical genes in phytopathogens remains a challenge that affects all research on plant-microbe interactions. Decades of research on the molecular basis of pathogenesis focused on a limited number of factors associated with long-known host-microbe interaction systems, providing limited direction for addressing this challenge. Computational approaches to identify virulence genes often rely on two strategies: searching for sequence similarity to known host-microbe interaction factors from other organisms, and identifying islands of genes that discriminate between pathogens of one type and closely related non-pathogens or pathogens of a different type. The former is limited to known genes, excluding the vast collections of genes of unknown function found in every genome. The latter lacks specificity, since many genes in genomic islands have little to do with host interaction.

RESULT: In this study, we developed a supervised machine learning approach designed to recognize patterns from large and disparate data types in order to identify candidate host-microbe interaction factors. The soft rot Enterobacteriaceae strains Dickeya dadantii 3937 and Pectobacterium carotovorum WPP14 were used to develop this tool, because these pathogens are important on multiple high-value crops worldwide and more genomic and functional data are available for the Enterobacteriaceae than for any other microbial family. Our approach achieved greater than 90% precision and a recall rate over 80% in 10-fold cross-validation tests.
Year: 2014 PMID: 24952641 PMCID: PMC4079955 DOI: 10.1186/1471-2164-15-508
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Genome-wide target class label assignment to each protein coding gene as a data point for 3937 and WPP14
| Strain | Total # CDS* | IF** | CF** | Training data set | Testing data set | Pseudogene |
|---|---|---|---|---|---|---|
| Dd3937 | 4520 | 267 | 1264 | 1531 | 2989 | 28 |
| WPP14 | 4590 | 233 | 1111 | 1344 | 3246 | 174 |
*Only protein-coding genes are used; pseudogenes are not included.
**IF stands for host-microbe interaction factor; CF stands for genes involved in core biological processes
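The counts in this table are tied together by simple arithmetic: the training set is the sum of the labeled IF and CF genes, and the testing set is whatever remains of the protein-coding genes. The short Python sketch below checks those relations and reports the share of positive (IF) examples in each training set. The numbers are copied from the table; the names `rows`, `check_row`, and `positive_fraction` are illustrative only, and the row order is assumed to follow the caption (3937 first, then WPP14).

```python
# Counts copied from the table above.
# Assumption: the first data row is strain 3937 and the second is WPP14,
# following the order used in the caption.
rows = {
    "3937": {"cds": 4520, "IF": 267, "CF": 1264, "train": 1531, "test": 2989},
    "WPP14": {"cds": 4590, "IF": 233, "CF": 1111, "train": 1344, "test": 3246},
}


def check_row(r):
    """True when training = IF + CF and testing = total CDS - training."""
    return r["IF"] + r["CF"] == r["train"] and r["cds"] - r["train"] == r["test"]


def positive_fraction(r):
    """Share of training examples labeled as interaction factors (IF)."""
    return r["IF"] / r["train"]


for strain, r in rows.items():
    print(strain, check_row(r), round(positive_fraction(r), 3))
```

Roughly one training example in six is a positive in both strains, which is one reason precision and recall are more informative here than accuracy alone.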
Ontology for host-microbe interaction, and genome-wide category assignment for data points in Dickeya dadantii 3937 (Dd3937) and Pectobacterium carotovorum WPP14 (WPP14)
| GO term and name | Dd3937 | WPP14 |
|---|---|---|
| GO:0052192 movement in environment of other organism involved in symbiotic interaction | 41 | 41 |
| GO:0052048 interaction with host via secreted substance involved in symbiotic interaction | 54 | 53 |
| GO:0051816 acquisition of nutrients from other organism during symbiotic interaction | 103 | 81 |
| GO:0044413 avoidance of host defenses | 43 | 34 |
| GO:0043903 regulation of symbiosis, encompassing mutualism through parasitism | 13 | 9 |
| *GO:0044403 symbiosis, encompassing mutualism through parasitism | 13 | 15 |
| Total | 267 | 233 |
*This term is the parent of all other terms listed in this table and is used as a generic catch-all for host-microbe interaction factors that lack more specific GO term annotations.
List of all attribute categories used to build the data sets in this study, and the number of attributes in each category for all data points in the training data sets for Dd3937 and WPP14
| Category | Subcategory | Dd3937 | WPP14 | Reference |
|---|---|---|---|---|
| | Gamma strains | 239 | 239 | Additional file |
| | Non-gamma strains | 58 | 58 | Additional file |
| | | | | |
| | Taxonomy Statistics | 76 | 76 | Additional file |
| | Lifestyle Statistics | 118 | 118 | Additional file |
| | | | | |
| | GC content | 1 | 1 | This study |
| | subcellular localization | 1 | 1 | [ |
| | phylogenetic profile | 6 | 6 | [ |
| | fingerprints scanning | 3 | 3 | [ |
| | codon adaptation index (CAI) | 3 | 3 | [ |
| | physical adjacency (gene neighbor) | 2 | 2 | [ |
| | Operon prediction | 1 | 1 | [ |
| | phylogenetic conservation | 1 | 1 | This study |
| | COG functional category | 1 | 1 | [ |
| | Genomic island | 4 | 1 | [ |
| | computed structural and physicochemical features of proteins and peptides | 40 | 66 | [ |
| | | | | |
| | binding site prediction | 32 | 0 | Additional file |
| | Gene expression | 14 | 3 | Additional file |
| | proteomics | 6 | 0 | Additional file |
Figure 1. Flow chart of the procedures for performing the supervised machine learning tasks of host-microbe interaction factor prediction.
Figure 2. ROC curve comparing classifier performance on different data sets containing the various types of attributes listed in Table 3. (TPR: True Positive Rate; FPR: False Positive Rate).
Figure 3. PR (Precision-Recall) curve to evaluate strategies for boosting classifier performance.
Statistics for positive class object prediction and parameters used in selected learning schemes for both 3937 and WPP14
| Classifiers | Precision | TPR/recall/sensitivity | specificity/TNR | accuracy | F-measure | AUC |
|---|---|---|---|---|---|---|
| 3937 | | | | | | |
| Random Forest | 0.93 | 0.81 | 0.98 | 0.94 | 0.87 | 0.97 |
| Bayesian Network | 0.91 | 0.85 | 0.97 | 0.94 | 0.88 | 0.97 |
| SMO using RBF kernels | 0.93 | 0.85 | 0.98 | 0.95 | 0.89 | 0.92 |
| SMO using polynomial kernels | 0.91 | 0.87 | 0.97 | 0.95 | 0.95 | 0.89 |
| Adaptive Boosting (Naïve Bayes)* | 0.84 | 0.89 | 0.95 | 0.93 | 0.87 | 0.96 |
| Adaptive Boosting (Decision Tree)* | 0.96 | 0.91 | 0.99 | 0.97 | 0.93 | 0.98 |
| Adaptive Boosting (IBK)* | 0.96 | 0.84 | 0.99 | 0.95 | 0.90 | 0.99 |
| Adaptive Boosting (Decision Stump)* | 0.92 | 0.87 | 0.98 | 0.95 | 0.89 | 0.97 |
| Multi-Boosting (Decision Tree)* | 0.97 | 0.91 | 0.99 | 0.97 | 0.94 | 0.98 |
| Multi-Boosting (IBK)* | 0.91 | 0.77 | 0.98 | 0.93 | 0.84 | 0.93 |
| Multi-Boosting (Naïve Bayes)* | 0.90 | 0.91 | 0.97 | 0.95 | 0.91 | 0.96 |
| Logit-Boosting (Decision Stump)* | 0.91 | 0.90 | 0.97 | 0.96 | 0.91 | 0.98 |
| WPP14 | | | | | | |
| Random Forest | 0.89 | 0.81 | 0.97 | 0.93 | 0.85 | 0.97 |
| Bayesian Network | 0.90 | 0.83 | 0.97 | 0.94 | 0.87 | 0.97 |
| SMO using RBF kernels | 0.94 | 0.84 | 0.98 | 0.95 | 0.89 | 0.91 |
| SMO using polynomial kernels | 0.93 | 0.86 | 0.98 | 0.95 | 0.95 | 0.89 |
| Adaptive Boosting (Naïve Bayes)* | 0.89 | 0.89 | 0.97 | 0.95 | 0.89 | 0.96 |
| Adaptive Boosting (Decision Tree)* | 0.95 | 0.86 | 0.99 | 0.96 | 0.90 | 0.98 |
| Adaptive Boosting (IBK)* | 0.87 | 0.83 | 0.96 | 0.93 | 0.85 | 0.92 |
| Logit-Boosting (Decision Stump)* | 0.90 | 0.85 | 0.97 | 0.94 | 0.88 | 0.97 |
| Multi-Boosting (Decision Tree)* | 0.94 | 0.86 | 0.98 | 0.96 | 0.90 | 0.98 |
| Multi-Boosting (Decision Stump)* | 0.91 | 0.75 | 0.98 | 0.93 | 0.82 | 0.97 |
| Multi-Boosting (Naïve Bayes)* | 0.90 | 0.89 | 0.97 | 0.95 | 0.89 | 0.96 |
| Logit-Boosting (Decision Stump)* | 0.90 | 0.87 | 0.97 | 0.95 | 0.89 | 0.97 |
*Denotes an ensemble classifier; the base learner is shown in parentheses.
Abbreviations: SMO: Support Vector Machine using Sequential Minimal Optimization; IBK: instance-based learner with K-nearest neighbor classifier; RBF: Radial Basis Function kernels.
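The column headings of this table all derive from the four cells of a binary confusion matrix, with IF as the positive class. As a minimal sketch of those standard definitions (not the authors' tooling), the Python below computes them from raw counts. The function names and the example counts are hypothetical, and the code assumes no denominator is zero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives. Assumes every denominator below is non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called TPR or sensitivity
    specificity = tn / (tn + fp)       # also called TNR
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_measure": f_measure(precision, recall),
    }


def f_measure(precision, recall):
    """F-measure as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to illustrate the calculation.
print(binary_metrics(tp=80, fp=10, tn=190, fn=20))

# Precision and recall taken from the first Random Forest row of the table.
print(round(f_measure(0.93, 0.81), 2))
```

Because the F-measure depends only on precision and recall, it can be recomputed for any row as a quick consistency check on the reported values.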
Figure 4. Comparison of the selected learning schemes. (a) ROC curve for Dickeya dadantii 3937, (b) ROC curve for Pectobacterium carotovorum WPP14. (TPR: True Positive Rate; FPR: False Positive Rate).
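An ROC curve of this kind is traced by sorting genes by their predicted probability and lowering the decision threshold one score at a time, recording FPR and TPR at each step; the AUC is the area under the resulting polyline. The following Python sketch illustrates the construction on toy data. The labels and scores are invented for the example, the function names are hypothetical, and the trapezoidal rule is used for the area.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points from sweeping the threshold high to low.

    labels: 1 for the positive class, 0 for the negative class.
    scores: predicted probability of the positive class.
    Assumes at least one positive and one negative example.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Consume all examples sharing this score so ties yield one point.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data, invented for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
print(curve)
print(auc(curve))
```

In the toy data, eight of the nine positive-negative pairs are ranked correctly, so the area is 8/9, which matches the interpretation of AUC as the probability that a randomly chosen positive outranks a randomly chosen negative.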
Top 50 predicted host-microbe interaction factors from 3937
| FeatureID | Prob | Name | Annotation |
|---|---|---|---|
| ABF-0018715 | 0.922 | virB8 | Inner membrane protein forms channel for type IV secretion of T-DNA complex (VirB8) |
| ABF-0020188 | 0.922 | | Predicted cell-wall-anchored protein SasA (LPXTG motif) this is up-regulated by hrpY; we have a mutation in this gene. |
| ABF-0019950 | 0.922 | | Putative multicopper oxidase |
| ABF-0019360 | 0.922 | | hypothetical protein |
| ABF-0019151 | 0.922 | | chrysobactin synthetase cbsF |
| ABF-0019124 | 0.922 | | Biopolymer transport protein ExbD/TolR |
| ABF-0019122 | 0.922 | | MotA/TolQ/ExbB proton channel family protein |
| ABF-0019117 | 0.922 | sftP | TonB-dependent receptor |
| ABF-0019116 | 0.922 | | hypothetical protein |
| ABF-0018783 | 0.922 | | putative transmembrane protein |
| ABF-0018775 | 0.922 | | Holin |
| ABF-0018724 | 0.922 | | putative ATP/GTP-binding protein remnant |
| ABF-0018722 | 0.922 | virB2 | Major pilus subunit of type IV secretion complex (VirB2) |
| ABF-0047137 | 0.922 | | hypothetical protein |
| ABF-0018717 | 0.922 | virB6 | Integral inner membrane protein of type IV secretion complex (VirB6) |
| ABF-0018716 | 0.922 | virB7 | TriF protein |
| ABF-0018713 | 0.922 | virB10 | Inner membrane protein forms channel for type IV secretion of T-DNA complex (VirB10) |
| ABF-0018712 | 0.922 | virB11 | ATPase provides energy for both assembly of type IV secretion complex and secretion of T-DNA complex (VirB11) |
| ABF-0018601 | 0.922 | | hypothetical protein |
| ABF-0018207 | 0.922 | | hypothetical protein |
| ABF-0018199 | 0.922 | ganC | putative truncated PTS system EIIBC component |
| ABF-0017777 | 0.922 | hecA2 | Putative member of ShlA/HecA/FhaA exoprotein family |
| ABF-0015606 | 0.922 | | ABC transporter permease protein |
| ABF-0015604 | 0.922 | | Amino acid ABC transporter, periplasmic amino acid-binding protein |
| ABF-0015543 | 0.922 | | hypothetical protein 15544 is up-regulated by hrpY. Is 15543 in the same operon? We have a mutation in 15544 |
| ABF-0015387 | 0.922 | nipE | necrosis-inducing protein |
| ABF-0014838 | 0.922 | | putative exported protein |
| ABF-0014623 | 0.922 | | Type IV pilus biogenesis protein PilN |
| ABF-0018720 | 0.922 | virB4 | ATPase provides energy for both assembly of type IV secretion complex and secretion of T-DNA complex (VirB4) |
| ABF-0018714 | 0.922 | virB9 | VirB9 |
| ABF-0047204 | 0.922 | | hypothetical protein |
| ABF-0015913 | 0.921 | ppdA | Prepilin peptidase dependent protein A |
| ABF-0017252 | 0.921 | | Conjugative transfer protein TrbG |
| ABF-0018195 | 0.921 | ganG | galactan ABC transport system, permease component |
| ABF-0016407 | 0.921 | | hypothetical protein |
| ABF-0018205 | 0.921 | | Pirin |
| ABF-0019418 | 0.921 | | Cellulose 1,4-beta-cellobiosidase precursor |
| ABF-0019468 | 0.921 | | hypothetical protein |
| ABF-0019566 | 0.921 | | hypothetical protein |
| ABF-0016680 | 0.921 | | Iron utilization protein |
| ABF-0020727 | 0.921 | sttG | General secretion pathway protein G |
| ABF-0019115 | 0.921 | | hypothetical protein |
| ABF-0015381 | 0.921 | avrM | Avirulence protein |
| ABF-0018723 | 0.921 | virB1 | VirB1 |
| ABF-0015598 | 0.921 | | hypothetical protein |
| ABF-0015609 | 0.921 | | Branched-chain amino acid aminotransferase |
| ABF-0018193 | 0.921 | ganF | galactan ABC transport system, permease component |
| ABF-0017097 | 0.921 | | Methyl-accepting chemotaxis protein |
| ABF-0020433 | 0.921 | | hypothetical protein |
| ABF-0019153 | 0.921 | cbsH | chrysobactin oligopeptidase CbsH |
Top 50 predicted host-microbe interaction factors from WPP14
| ID | Prob | Name | Product |
|---|---|---|---|
| ADT-0001591 | 0.912 | | hypothetical protein |
| ADT-0003750 | 0.912 | | putative exported protein |
| ADT-0000805 | 0.912 | dltB | peptidoglycan biosynthesis protein |
| ADT-0003928 | 0.911 | | pectate lyase |
| ADT-0003247 | 0.911 | | methyl-accepting chemotaxis protein |
| ADT-0000806 | 0.911 | dltD | poly(glycerophosphate chain) D-alanine transfer protein |
| ADT-0003745 | 0.911 | | ABC transporter ATP binding protein |
| ADT-0002063 | 0.911 | | hypothetical protein |
| ADT-0000400 | 0.911 | hasE | HlyD family secretion protein |
| ADT-0003089 | 0.911 | | N-terminal fragment of a diguanylate cyclase (pseudogene) |
| ADT-0003418 | 0.910 | | methyl-accepting chemotaxis protein |
| ADT-0000941 | 0.910 | | methyl-accepting chemotaxis protein |
| ADT-0006368 | 0.910 | | hypothetical protein |
| ADT-0005582 | 0.910 | | hypothetical protein |
| ADT-0000983 | 0.910 | | methyl-accepting chemotaxis protein |
| ADT-0001252 | 0.910 | | ABC transporter permease protein |
| ADT-0003245 | 0.910 | | methyl-accepting chemotaxis protein |
| ADT-0000027 | 0.910 | | methyl-accepting chemotaxis protein |
| ADT-0003542 | 0.909 | | putative type IV pilus protein |
| ADT-0001195 | 0.909 | | LysR-family transcriptional regulator |
| ADT-0004315 | 0.909 | astB | sulfate ester ABC transporter permease protein |
| ADT-0003152 | 0.909 | | methyl-accepting chemotaxis protein |
| ADT-0002357 | 0.908 | | methyl-accepting chemotaxis protein |
| ADT-0000543 | 0.908 | | ABC transporter, substrate binding protein |
| ADT-0001392 | 0.908 | | putative exported protein |
| ADT-0002087 | 0.908 | | putative signaling protein |
| ADT-0001868 | 0.908 | | LysR-family transcriptional regulator |
| ADT-0000803 | 0.908 | | acyl carrier protein |
| ADT-0000571 | 0.908 | | putative cellulase |
| ADT-0000535 | 0.908 | | putative lipoprotein |
| ADT-0001404 | 0.907 | | hypothetical protein |
| ADT-0004320 | 0.907 | sftP | TonB-dependent receptor |
| ADT-0001744 | 0.907 | | putative exported protein |
| ADT-0003391 | 0.907 | | putative membrane protein |
| ADT-0003535 | 0.907 | | hypothetical protein |
| ADT-0003563 | 0.907 | | LysR-family transcriptional regulator |
| ADT-0001980 | 0.907 | | hypothetical protein |
| ADT-0000804 | 0.907 | dltA | putative D-alanine--poly(phosphoribitol) ligase subunit 1 |
| ADT-0001616 | 0.906 | | putative transport system membrane protein |
| ADT-0001394 | 0.906 | | hypothetical protein |
| ADT-0001320 | 0.906 | | methyl-accepting chemotaxis protein |
| ADT-0001567 | 0.906 | | putative exported protein |
| ADT-0005614 | 0.906 | | hypothetical protein |
| ADT-0001436 | 0.906 | | putative component of polysulfide reductase |
| ADT-0004253 | 0.906 | occQ | octopine transport system permease protein |
| ADT-0001493 | 0.905 | | hypothetical protein |
| ADT-0001492 | 0.905 | | putative lipoprotein |
| ADT-0002704 | 0.905 | | putative lipoprotein |
| ADT-0002584 | 0.905 | | ABC transporter, membrane spanning protein |
List of 56 genes predicted as host-microbe interaction factors in both 3937 and WPP14
| Dd3937 FeatureID | Dd3937 Name | Dd3937 Product | WPP14 FeatureID | WPP14 Name | WPP14 Product |
|---|---|---|---|---|---|
| ABF-0019117 | sftP | TonB-dependent receptor | ADT-0004320 | sftP | TonB-dependent receptor |
| ABF-0019116 | | hypothetical protein | ADT-0004318 | | unknown |
| ABF-0018207 | | hypothetical protein | ADT-0001980 | | hypothetical protein |
| ABF-0015604 | | Amino acid ABC transporter | ADT-0000748 | | putative extracellular solute-binding protein |
| ABF-0015387 | nipE | necrosis-inducing protein | ADT-0000781 | | putative exported protein |
| ABF-0014838 | | putative exported protein | ADT-0002655 | | putative exported protein |
| ABF-0019124 | | Biopolymer transport protein ExbD/TolR | ADT-0002263 | | putative biopolymer transport protein |
| ABF-0019115 | | hypothetical protein | ADT-0002265 | | hypothetical protein |
| ABF-0017097 | | Methyl-accepting chemotaxis protein | ADT-0003418 | | methyl-accepting chemotaxis protein |
| ABF-0019566 | | hypothetical protein | ADT-0001832 | | putative exported protein |
| ABF-0016407 | | hypothetical protein | ADT-0001404 | | hypothetical protein |
| ABF-0015906 | | 6-phosphogluconolactonase | ADT-0003106 | | putative exported protein |
| ABF-0019118 | atsR | Alkanesulfonates-binding protein | ADT-0001174 | atsR | putative sulfate ester binding protein |
| ABF-0019125 | astB | Alkanesulfonates transport system permease protein | ADT-0004315 | astB | sulfate ester ABC transporter permease protein |
| ABF-0017125 | inh | Alkaline proteinase inhibitor precursor | ADT-0001911 | inh | protease inhibitor |
| ABF-0019002 | | hypothetical protein | ADT-0001744 | | putative exported protein |
| ABF-0019205 | | ABC transporter | ADT-0002584 | | ABC transporter |
| ABF-0014642 | | hypothetical protein | ADT-0000571 | | putative cellulase |
| ABF-0019092 | | Transcriptional activator protein lysR | ADT-0001195 | | LysR-family transcriptional regulator |
| ABF-0016585 | | Methyl-accepting chemotaxis protein | ADT-0001320 | | methyl-accepting chemotaxis protein |
| ABF-0019383 | | D-alanyl transfer protein DltB | ADT-0000805 | dltB | peptidoglycan biosynthesis protein |
| ABF-0019855 | | Methyl-accepting chemotaxis protein II (aspartate chemoreceptor protein) | ADT-0001887 | | putative methyl-accepting chemotaxis protein |
| ABF-0015168 | chmX | Methyl-accepting chemotaxis protein III (ribose and galactose chemoreceptor protein) | ADT-0003152 | | methyl-accepting chemotaxis protein |
| ABF-0018737 | | DNA-binding protein | ADT-0003335 | | putative regulatory protein |
| ABF-0019933 | | hypothetical protein | ADT-0003354 | | hypothetical protein |
| ABF-0014645 | | Paraquat-inducible protein A | ADT-0002701 | | putative membrane protein |
| ABF-0017674 | | Methyl-accepting chemotaxis protein | ADT-0003245 | | methyl-accepting chemotaxis protein |
| ABF-0020681 | | hypothetical protein | ADT-0002418 | | RES domain-containing protein |
| ABF-0015907 | | TonB-dependent hemin | ADT-0002398 | | TonB-dependent hemin |
| ABF-0018934 | | 4-aminobutyrate aminotransferase | ADT-0002845 | | putative class-III aminotransferase |
| ABF-0014824 | | Methyl-accepting chemotaxis protein II (aspartate chemoreceptor protein) | ADT-0002104 | | methyl-accepting chemotaxis protein |
| ABF-0018178 | | Iron(III) dicitrate-binding protein | ADT-0002009 | | putative periplasmic substrate-binding transport protein |
| ABF-0019391 | | Pectate lyase | ADT-0003928 | | pectate lyase |
| ABF-0015887 | | hypothetical protein | ADT-0002063 | | hypothetical protein |
| ABF-0016115 | | Methyl-accepting chemotaxis protein | ADT-0000027 | | methyl-accepting chemotaxis protein |
| ABF-0019101 | atsB | Alkanesulfonates transport system permease protein | ADT-0003749 | atsB | putative sulfate ester transporter |
| ABF-0019214 | | Glucosamine kinase GpsK | ADT-0003604 | | hypothetical protein |
| ABF-0016752 | | Ferric siderophore transport system | ADT-0003559 | | TonB-like protein |
| ABF-0016218 | | Fosmidomycin resistance protein | ADT-0001196 | | MFS efflux transporter |
| ABF-0046571 | | Putative DNA-binding transcriptional regulatory family of the TetR family | ADT-0003719 | | TetR-family transcriptional regulator |
| ABF-0014644 | | Probable lipoprotein | ADT-0000406 | | putative lipoprotein |
| ABF-0015918 | ppdC | Putative prepilin peptidase dependent protein | ADT-0002557 | ppdC | putative prepilin peptidase dependent protein c precursor |
| ABF-0018572 | | ABC transporter | ADT-0001164 | | putative iron (III) ABC transporter |
| ABF-0017527 | | Lysophospholipase | ADT-0001494 | | putative lipoprotein |
| ABF-0047106 | | putative lipoprotein | ADT-0002704 | | putative lipoprotein |
| ABF-0016810 | | Drug resistance transporter | ADT-0001435 | | putative membrane protein |
| ABF-0019088 | | Dihydrodipicolinate synthase | ADT-0002292 | | putative dihydrodipicolinate synthetase |
| ABF-0014868 | | Ferrichrome-iron receptor | ADT-0004187 | | TonB dependent receptor |
| ABF-0017095 | | hypothetical protein | ADT-0000555 | | putative exported protein |
| ABF-0018540 | | Oxidoreductase | ADT-0000962 | | probable short-chain dehydrogenase |
| ABF-0014948 | | hypothetical protein | ADT-0002252 | | putative exported protein |
| ABF-0020431 | | Methyl-accepting chemotaxis protein I (serine chemoreceptor protein) | ADT-0000661 | | methyl-accepting chemotaxis protein |
| ABF-0019851 | | Methyl-accepting chemotaxis protein III (ribose and galactose chemoreceptor protein) | ADT-0001602 | | methyl-accepting chemotaxis protein |
| ABF-0020368 | | hypothetical protein | ADT-0002020 | | putative exported protein |
| ABF-0016058 | | Poly(glycerophosphate chain) D-alanine transfer protein DltD | ADT-0000806 | dltD | poly(glycerophosphate chain) D-alanine transfer protein |
| ABF-0019212 | | N-Acetyl-D-glucosamine ABC transport system | ADT-0002138 | | extracellular solute-binding protein |