Juan Javier Díaz-Mejía, Mohan Babu, Andrew Emili.
Abstract
The bacterial cell-envelope consists of a complex arrangement of lipids, proteins and carbohydrates that serves as the interface between a microorganism and its environment or, with pathogens, a human host. Escherichia coli has long been investigated as a leading model system to elucidate the fundamental mechanisms underlying microbial cell-envelope biology. This includes extensive descriptions of the molecular identities, biochemical activities and evolutionary trajectories of integral transmembrane proteins, many of which play critical roles in infectious disease and antibiotic resistance. Strikingly, however, only half of the c. 1200 putative cell-envelope-related proteins of E. coli currently have experimentally attributed functions, indicating an opportunity for discovery. In this review, we summarize the state of the art of computational and proteomic approaches for determining the components of the E. coli cell-envelope proteome, as well as exploring the physical and functional interactions that underlie its biogenesis and functionality. We also provide a comprehensive comparative benchmarking analysis on the performance of different bioinformatic and proteomic methods commonly used to determine the subcellular localization of bacterial proteins.
Year: 2008 PMID: 19054114 PMCID: PMC2704936 DOI: 10.1111/j.1574-6976.2008.00141.x
Source DB: PubMed Journal: FEMS Microbiol Rev ISSN: 0168-6445 Impact factor: 16.408
Fig. 1. A general functional classification of the Escherichia coli cell-envelope related proteome. A set of 1179 proteins tentatively forming the cell-envelope proteome of E. coli K-12 (substrain W3110) was selected by combining the results of four different predictors of protein global subcellular localization by ‘Majority Consensus’ (see section ‘Majority Consensus’ improves the prediction of global subcellular localization for details). The number of proteins for each compartment forming the ‘Majority Consensus’ is shown in parentheses. Fractions represent the number of proteins in each functional category, according to the COGs database (Tatusov ), divided by the total number of E. coli proteins in the respective category. In comparison with the cytoplasmic proteins (the remaining fraction not shown in each functional category), the cell-envelope proteome is markedly enriched in proteins with an unknown function (c. 70%). Two COG categories, namely Translation and DNA replication, recombination and repair, are not shown, as none of these 1179 proteins is classified into such categories. IM, inner membrane; PE, periplasmic; OM, outer membrane; EC, extracellular.
Fig. 2. A middle-level functional classification of the E. coli cell-envelope-related proteome. The 1179 proteins in the ‘Majority Consensus’ tentatively forming the cell-envelope proteome of E. coli K-12 were mapped against the middle-level terms in the hierarchy of functional annotations in the database MultiFun (Serres ). Fractions represent the number of cell-envelope proteins for each MultiFun functional category, divided by the total number of E. coli proteins in the respective category. Only categories with fractions of tentative cell-envelope proteins >0.2 are shown. Subcellular localization acronyms are described as in Fig. 1. Struct, Structural components; Inf, inner membrane protein folding.
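The per-category fractions described in the captions of Figs 1 and 2 can be illustrated with a minimal sketch. The protein identifiers, category names and the function `envelope_fraction_by_category` below are hypothetical, and the sketch assumes each protein carries a single functional category, which the source does not state.

```python
from collections import Counter
from typing import Dict, Set


def envelope_fraction_by_category(
    category_of: Dict[str, str], envelope_ids: Set[str]
) -> Dict[str, float]:
    """Fraction of proteins in each category that are cell-envelope related.

    Assumes one category per protein (a simplification for illustration).
    """
    totals = Counter(category_of.values())
    in_envelope = Counter(
        cat for pid, cat in category_of.items() if pid in envelope_ids
    )
    return {cat: in_envelope[cat] / totals[cat] for cat in totals}


# Toy data: six hypothetical proteins, three of them predicted as cell-envelope.
category_of = {
    "p1": "Transport", "p2": "Transport",
    "p3": "Unknown", "p4": "Unknown", "p5": "Unknown",
    "p6": "Translation",
}
envelope_ids = {"p1", "p3", "p4"}

fractions = envelope_fraction_by_category(category_of, envelope_ids)
# Keep only categories above the 0.2 cutoff used for display in Fig. 2.
shown = {cat: f for cat, f in fractions.items() if f > 0.2}
```

With real annotations, a protein assigned to several categories would need to be counted once per category in both the numerator and the denominator.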
Data sources of known and predicted protein subcellular localization analyzed in this study
| Type of data source or program | Subcellular localization | References | Version | Batch |
|---|---|---|---|---|
| Predictors of α-helix topology | | | | |
| MEMSAT3 | IM | | 3.0 | S |
| Phobius (PolyPhobius) | IM | | – | M |
| ConPredII | IM | | 2005 | M |
| TMHMM | IM | | 2.0 | M |
| HMMTOP | IM | | 2.0 | M |
| Discriminators of OM proteins and predictors of β-barrel topology | | | | |
| BOMP | OM* | | – | M |
| TMB-Hunt | OM* | | – | S, M |
| TMBETADISC-RBF | OM* | | – | M, G |
| TMBETA-NET | OM* and OM** | | – | S |
| PRED-TMBB | OM** | | – | S |
| PROFtmb | OM* and OM** | | – | M (10) |
| Predictors of signal peptides | | | | |
| DOLOP | LP | | 1.0 | G |
| TatP | PE, EC | | 1.0 | M |
| SignalP | PE, OM, EC | | 3.0 | M |
| LipoP | LP | | 1.0 | M |
| Predictors of global subcellular localization | | | | |
| Majority consensus | CY, IM, PE, OM, EC | This study | NA | NA |
| Gneg-PLoc | CY, IM, PE, OM, EC, FB, FG, NC | | 2.5 | S |
| CELLO II | CY, IM, PE, OM, EC | | 2.5 | M |
| PSORTb | CY, IM, PE, OM, EC | | 2.0 | M, G |
| P-CLASSIFIER | CY, IM, PE, OM, EC | | 2005 | M (100) |
| Proteome Analyst | CY, IM, PE, OM, EC | | 2.5 | M, G |
| Proteomic studies | | | | |
| Zhang | CY, IM, PE, OM | | NA | NA |
| Lopez-Campistrous | CY, IM, PE, OM | | NA | NA |
| Daley | IM | | NA | NA |
| Mori and colleagues | M | Unpublished | NA | NA |
| Molloy | OM | | NA | NA |
| Knowledge databases | | | | |
| TOPDB | IM, OM | | 2007 | M |
| EcoCyc | CY, IM, PE, OM, EC, LP | | 11.6 | M |
| Riley | CY, IM, PE, OM, LP | | NA | M |
| ePSORTdb | CY, IM, PE, OM, EC | | 2.0 | M |
| MultiFun | CY, IM, PE, OM, EC | | 2007 | M |
| CCDB | CY, IM, M, PE, OM, EC, LP | | 2006 | M |
| Uniprot | CY, IM, M, PE, OM, EC, LP | | 55.5 | M, G |
| DOLOP | LP | | 2005 | M |
Corresponding websites are provided in Table S1.
Some programs or databases do not provide a version other than referring to the year of the last webpage update. In all these cases, the data were collected in May, 2008. NA, not available; –, no version availability.
Corresponding websites allow the submission of multiple sequences (M), provide precomputed genomic results (G), or allow the submission of single sequences only (S). PROFtmb and P-CLASSIFIER allow the submission of up to 10 and 100 sequences per run, respectively. TMB-Hunt allows the submission of multiple sequences if a homology-based step is turned off.
Some α-helix programs, such as Phobius and ConPredII, have their own signal peptide predictors.
DOLOP detects potential lipoprotein features at the N-terminus of protein sequences (not necessarily signal peptides). It also provides a list of experimentally verified lipoproteins.
The ‘Majority Consensus’ is not a predictor itself; it is the integration of results from the four global predictors of subcellular localization with predictions available in batch mode (PSORTb, Proteome Analyst, CELLO II and P-CLASSIFIER).
Gneg-PLoc provides precomputed results for proteins that have no subcellular localization annotations or are annotated with uncertain terms such as ‘probable’, ‘potential’, ‘likely’, or ‘by similarity’ in Swiss-Prot.
http://ecoli.naist.jp/GFP/gfp_top.jsp
CY, cytoplasmic; PE, periplasmic; OM*, discriminator of outer membrane β-barrels; OM**, β-barrel topology predictor; M, membrane (undefined if IM or OM); EC, extracellular; FB, fimbriae; FG, flagellum; NC, nucleoid; LP, lipoproteins (might be part of different cell-envelope compartments).
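The integration step behind the ‘Majority Consensus’ can be sketched as a simple vote over the four batch-mode global predictors. The exact rule used in the study is described in a section not reproduced here, so the strict-majority rule below (more than half of the predictors must agree) is an assumption, and the function name and the example calls are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_consensus(calls: Sequence[Optional[str]]) -> Optional[str]:
    """Return the compartment named by more than half of the predictors.

    `calls` holds one label per predictor ("CY", "IM", "PE", "OM" or "EC"),
    or None when a predictor gave no localization. Returns None when no
    compartment reaches a strict majority (an assumed rule, not the paper's).
    """
    votes = Counter(c for c in calls if c is not None)
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label if count > len(calls) / 2 else None


# Hypothetical calls, in the order PSORTb, Proteome Analyst, CELLO II, P-CLASSIFIER.
example = {
    "protA": ["IM", "IM", "IM", "CY"],   # three of four agree
    "protB": ["PE", "OM", "PE", None],   # two of four: no strict majority
    "protC": ["OM", None, None, None],   # a single vote
}
consensus = {name: majority_consensus(c) for name, c in example.items()}
```

A looser rule (for example, accepting a two-of-four plurality) would assign more proteins at the cost of precision; which trade-off the authors chose should be checked against the original section.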
‘Performance’ comparison of predictors of global protein subcellular localization, α-helices (TMHs) and β-barrels (TMBs)†
| Predictor | TP | FP | FN | TN | Precision (%) | Sensitivity (%) | MCC |
|---|---|---|---|---|---|---|---|
| Cytoplasmic | |||||||
| Majority Consensus* | 131 | 3 | 14 | 151 | 97.76 | 90.34 | 0.89 |
| Proteome Analyst* | 119 | 7 | 26 | 147 | 94.44 | 82.07 | 0.78 |
| CELLO II* | 135 | 24 | 10 | 130 | 84.91 | 93.10 | 0.78 |
| PSORTb* | 108 | 2 | 37 | 152 | 98.18 | 74.48 | 0.76 |
| P-CLASSIFIER* | 135 | 29 | 10 | 125 | 82.32 | 93.10 | 0.75 |
| GnegPLoc* | 132 | 50 | 6 | 95 | 72.53 | 95.65 | 0.64 |
| Inner membrane and α-helices | |||||||
| Proteome Analyst* | 65 | 10 | 4 | 220 | 86.67 | 94.20 | 0.87 |
| Majority Consensus* | 54 | 1 | 15 | 229 | 98.18 | 78.26 | 0.85 |
| Phobius [≥1 TMHs] | 55 | 2 | 14 | 228 | 96.49 | 79.71 | 0.85 |
| PSORTb* | 53 | 2 | 16 | 228 | 96.36 | 76.81 | 0.83 |
| TMHMM [≥2 TMHs] | 43 | 1 | 26 | 229 | 97.73 | 62.32 | 0.74 |
| GnegPLoc* | 48 | 6 | 19 | 210 | 88.89 | 71.64 | 0.74 |
| TMHMM [≥1 TMHs] | 53 | 12 | 16 | 218 | 81.54 | 76.81 | 0.73 |
| Phobius [≥2 TMHs] | 43 | 2 | 26 | 228 | 95.56 | 62.32 | 0.72 |
| ConPredII [≥2 TMHs] | 43 | 2 | 26 | 228 | 95.56 | 62.32 | 0.72 |
| CELLO II* | 43 | 2 | 26 | 228 | 95.56 | 62.32 | 0.72 |
| P-CLASSIFIER* | 41 | 2 | 28 | 228 | 95.35 | 59.42 | 0.70 |
| MEMSAT3 [≥2 TMHs] | 42 | 3 | 27 | 227 | 93.33 | 60.87 | 0.70 |
| ConPredII [≥1 TMHs] | 56 | 21 | 13 | 209 | 72.73 | 81.16 | 0.69 |
| HMMTOP [≥2 TMHs] | 42 | 9 | 27 | 221 | 82.35 | 60.87 | 0.64 |
| HMMTOP [≥1 TMHs] | 54 | 76 | 15 | 154 | 41.54 | 78.26 | 0.38 |
| MEMSAT3 [≥1 TMHs] | 69 | 230 | 0 | 0 | 23.08 | 100.00 | NA |
| Periplasmic | |||||||
| Majority Consensus* | 21 | 6 | 8 | 264 | 77.78 | 72.41 | 0.72 |
| PSORTb* | 17 | 2 | 12 | 268 | 89.47 | 58.62 | 0.70 |
| Proteome Analyst* | 21 | 13 | 8 | 257 | 61.76 | 72.41 | 0.63 |
| CELLO II* | 22 | 22 | 7 | 248 | 50.00 | 75.86 | 0.57 |
| P-CLASSIFIER* | 19 | 21 | 10 | 249 | 47.50 | 65.52 | 0.50 |
| GnegPLoc* | 8 | 7 | 20 | 248 | 53.33 | 28.57 | 0.34 |
| Outer membrane and β-barrels | |||||||
| PSORTb* | 30 | 0 | 8 | 261 | 100.00 | 78.95 | 0.88 |
| Proteome Analyst* | 30 | 0 | 8 | 261 | 100.00 | 78.95 | 0.88 |
| Majority Consensus* | 29 | 0 | 9 | 261 | 100.00 | 76.32 | 0.86 |
| PRED-TMBB [≥3 TMBs] | 26 | 6 | 12 | 117 | 81.25 | 68.42 | 0.68 |
| PROFtmb [≥3 TMBs] | 19 | 0 | 19 | 123 | 100.00 | 50.00 | 0.66 |
| PROFtmb [≥2 TMBs] | 19 | 0 | 19 | 123 | 100.00 | 50.00 | 0.66 |
| BOMP [ | 20 | 1 | 18 | 122 | 95.24 | 52.63 | 0.65 |
| GnegPLoc* | 18 | 4 | 15 | 246 | 81.82 | 54.55 | 0.63 |
| TMBETA-NET [≥3 TMBs] | 31 | 24 | 7 | 99 | 56.36 | 81.58 | 0.56 |
| TMBETA-NET [≥2 TMBs] | 31 | 24 | 7 | 99 | 56.36 | 81.58 | 0.56 |
| CELLO II* | 21 | 10 | 17 | 251 | 67.74 | 55.26 | 0.56 |
| PRED-TMBB [≥2 TMBs] | 35 | 40 | 3 | 83 | 46.67 | 92.11 | 0.51 |
| P-CLASSIFIER* | 20 | 13 | 18 | 248 | 60.61 | 52.63 | 0.51 |
| TMB-Hunt | 18 | 10 | 20 | 251 | 64.29 | 47.37 | 0.50 |
| TMBETADISC-RBF | 28 | 36 | 10 | 225 | 43.75 | 73.68 | 0.49 |
| Extracellular | |||||||
| Majority Consensus* | 7 | 0 | 11 | 281 | 100.00 | 38.89 | 0.61 |
| Proteome Analyst* | 10 | 8 | 8 | 273 | 55.56 | 55.56 | 0.53 |
| PSORTb* | 5 | 0 | 13 | 281 | 100.00 | 27.78 | 0.52 |
| CELLO II* | 8 | 12 | 10 | 269 | 40.00 | 44.44 | 0.38 |
| P-CLASSIFIER* | 6 | 13 | 12 | 268 | 31.58 | 33.33 | 0.28 |
| GnegPLoc* | 4 | 6 | 13 | 260 | 40.00 | 23.53 | 0.27 |
A set of 299 proteins from Gram-negative bacterial species was used as the reference gold standard, with the exception of PRED-TMBB, PROFtmb and TMBETA-NET, which allow the submission of only one or a few sequences at a time. For these programs, we randomly selected a subset of 161 proteins, restricting the subsets of CY, IM and OM to 38 proteins each. Sixteen of the 299 proteins, predicted by Gneg-PLoc as part of the nucleoid, flagellum or fimbriae, were excluded from the Gneg-PLoc performance analysis. All predictions were run in September 2008 (see Table S2 for details).
TP, true positives; FP, false positives; FN, false negatives; TN, true negatives. Precision = TP / (TP+FP); Sensitivity = TP / (TP+FN). Sections for each subcellular localization in this table list predictors from higher to lower Matthews Correlation Coefficient (MCC), calculated with the following formula:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
Predictors of global subcellular localization are denoted by (*).
Predictors of TMHs and TMBs were analyzed twice, varying the minimal number of transmembrane elements required to count as a true hit (shown in square brackets).
PRED-TMBB includes three methods; only the Viterbi method is shown here. The other two methods resulted in similar performance (Table S2).
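The statistics in this table can be recomputed from the four confusion counts. The sketch below assumes the standard definition of the Matthews Correlation Coefficient, which reproduces the tabulated values for the rows checked here; the function names and the toy helix counts are hypothetical. It also shows how a minimal-element threshold (≥1 or ≥2 TMHs) turns per-protein element counts into confusion counts.

```python
import math
from typing import Optional, Sequence, Tuple


def confusion_counts(
    n_elements: Sequence[int], is_positive: Sequence[bool], min_elements: int
) -> Tuple[int, int, int, int]:
    """Count (TP, FP, FN, TN) when a protein is called positive if it has
    at least `min_elements` predicted transmembrane elements."""
    tp = fp = fn = tn = 0
    for n, gold in zip(n_elements, is_positive):
        predicted = n >= min_elements
        if predicted and gold:
            tp += 1
        elif predicted:
            fp += 1
        elif gold:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp: int, fp: int) -> Optional[float]:
    """Precision in percent, or None when nothing was predicted positive."""
    return 100 * tp / (tp + fp) if tp + fp else None


def sensitivity(tp: int, fn: int) -> Optional[float]:
    """Sensitivity in percent, or None when there are no gold positives."""
    return 100 * tp / (tp + fn) if tp + fn else None


def mcc(tp: int, fp: int, fn: int, tn: int) -> Optional[float]:
    """Matthews Correlation Coefficient, or None when it is undefined
    (any marginal total is zero, as for MEMSAT3 with >=1 TMHs)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else None


# Majority Consensus, cytoplasmic row: TP=131, FP=3, FN=14, TN=151.
row = (131, 3, 14, 151)
stats = (
    round(precision(row[0], row[1]), 2),
    round(sensitivity(row[0], row[2]), 2),
    round(mcc(*row), 2),
)

# Toy example: predicted helix counts for five hypothetical proteins.
helices = [0, 1, 3, 12, 1]
gold_im = [False, False, True, True, True]
at_least_1 = confusion_counts(helices, gold_im, min_elements=1)
at_least_2 = confusion_counts(helices, gold_im, min_elements=2)
```

Raising the threshold from one to two elements trades sensitivity for precision, which is the pattern visible for most α-helix predictors in the table. Returning `None` rather than zero for an undefined MCC mirrors the NA entry for MEMSAT3 [≥1 TMHs].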
Fig. 3. ‘Agreement’ analysis between pairs of bioinformatic predictors of protein subcellular localization. The 4220 proteins forming the E. coli K-12 proteome were subjected to prediction of global subcellular localization (*) and specific features (α-helices, β-barrels and signal peptides) by different computational methods. Each square in the matrix represents the number of proteins predicted to be located in a given compartment by any two predictors (P1 and P2). Results from P1 are plotted on the x-axis, while predictions of P2 are plotted on the y-axis. The number of predicted proteins for each subcellular location by each method is shown in parentheses. The darker the square intersecting any two methods, the higher the ‘Agreement’ between them (see section ‘Statistical parameters to evaluate the performance of predictors of subcellular localization’ for details). Major discrepancies between methods are highlighted in red frames. TIMP α-helix predictors were evaluated for one or more helices (≥1 TMHs) and for two or more helices (≥2 TMHs); only the option with a higher ‘Performance’ (Table 2) is shown. CY, cytoplasmic; SP, signal peptide; ‘?’ refers to proteins with no predicted localization. Other subcellular localization acronyms are described as in Fig. 1. Subcellular localization predictions and ‘Agreement’ values used to construct this plot are available in Table S1.
Fig. 4. ‘Agreement’ analysis between pairs of proteomic studies, bioinformatic tools and knowledge databases predicting or describing the E. coli cell-envelope-related proteome. Bioinformatic methods are represented by the ‘Majority Consensus’ of predictors of global subcellular localization (*). Proteomic studies are denoted by ‘p’, gold standard reference databases of protein subcellular localization are denoted by ‘g’ and other databases by ‘d’. Each square in the matrix represents the number of proteins predicted or described to be located in a given compartment by any two data sources. The darker the square intersecting any two data sources (D1 and D2), the higher the ‘Agreement’ between them (see section ‘Statistical parameters to evaluate the performance of predictors of subcellular localization’ for details). Predictions or descriptions of D1 are plotted on the x-axis, while predictions or descriptions of D2 are plotted on the y-axis. The number of predicted proteins for each subcellular location is shown in parentheses. Major discrepancies between data sources are highlighted in red frames. The list of cell-envelope proteins according to different proteomic methods is shown in Table S1. Subcellular localization acronyms are described as in Figs 1 and 3.
Data sources of experimental and bioinformatic PPI and protein functional interactions
| Source | Data provided in each study | Reference |
|---|---|---|
| PPI and protein complexes | ||
| DIP | Manually and automatically curated PPI | |
| IntAct | Manually, automatically curated, and directly submitted biomolecular interactions | |
| BIND | Manually, automatically curated, and directly submitted biomolecular interactions, protein complexes and pathway information | |
| Butland | A high-throughput PPI study | |
| TCDB | Manually curated transporter complexes classified functionally and evolutionarily | |
| Protein functional interactions | ||
| Najafabadi & Salavati | Sequence-based prediction of protein functional interactions by means of codon usage | |
| STRING | Known and predicted PPI and protein functional interactions derived from bioinformatic and experimental resources | |
| NEBULON | Protein functional interactions predicted from operon predictions and rearrangements | |
Fig. 5. A census of the cell-envelope-related PPIs and protein complexes in knowledge databases. PPIs contained in the DIP, BIND and IntAct databases were filtered to obtain interactions derived from low-throughput (PPI_lt) and high-throughput (PPI_ht) experiments. Protein complex co-memberships (PCCM) annotated in the databases EcoCyc and TCDB are shown as edges connecting all-against-all proteins (nodes) forming a complex. Only interactions between proteins predicted as cell-envelope related according to the ‘Majority Consensus’ of predictors of global subcellular localization are shown. Node colors denote COG functional assignments, with the exception of grey nodes, where the poorly characterized proteins were assigned to categories ‘R’ and ‘S’, denoting proteins with no COG functional assignment. Proteins with grey nodes, depicted by blue labels, correspond to MultiFun functional assignments. Proteins depicted in red nodes were categorized under cell-envelope and OM biogenesis based on the COG functional assignment.
Fig. 6. Selection of cell-envelope candidates for affinity tagging and purification using bioinformatic and proteomic data sources. (a) Western blotting of E. coli SPA-tagged TIMP and periplasmic proteins solubilized with eight different detergents, probed for the presence of the SPA-tag using an anti-FLAG antibody. The concentration of detergent used in the purification is shown in parentheses. The three detergents that most effectively solubilized the membrane proteins are indicated in a rectangular box with broken lines. The set of 34 candidates comprising TIMP and periplasmic proteins was selected according to the predicted number of transmembrane α-helices and signal peptides, respectively, based on Phobius predictions (see Table S1 for the list). (b) SPA-purified E. coli membrane protein baits identified by mass spectrometry. The bar graph shows the recovery and detection coverage for affinity-tagged and -purified E. coli TIMP baits spanning both single membrane and polytopic (>10-TMH) transmembrane helices identified by MS. DM, n-dodecyl-β-d-maltoside. The acronyms of the other chemicals are described in the text.