Kyle Ellrott, Lukasz Jaroszewski, Weizhong Li, John C Wooley, Adam Godzik.
Abstract
The microbes that inhabit particular environments must be able to perform molecular functions that provide them with a competitive advantage to thrive in those environments. As most molecular functions are performed by proteins and are conserved between related proteins, we can expect that organisms successful in a given environmental niche would contain protein families that are specific for functions that are important in that environment. For instance, the human gut is rich in polysaccharides from the diet or secreted by the host, and is dominated by Bacteroides, whose genomes contain a highly expanded repertoire of protein families involved in carbohydrate metabolism. To identify other protein families that are specific to this environment, we investigated the distribution of protein families in the currently available human gut genomic and metagenomic data. Using an automated procedure, we identified a group of protein families strongly overrepresented in the human gut. These not only include many families described previously but also, interestingly, a large group of previously unrecognized protein families, which suggests that we still have much to discover about this environment. The identification and analysis of these families could provide us with new information about an environment critical to our health and well-being.
Year: 2010 PMID: 20532204 PMCID: PMC2880560 DOI: 10.1371/journal.pcbi.1000798
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. Coverage of genomic and metagenomic datasets with protein families.
Sequence sets include Human Gut Related (A), Human Gut Unrelated (B) and Metagenomic sequences (C). The unassigned proteins (green) consist of singletons and small sequence clusters (see text for details).
Figure 2. Size distribution of protein families in human gut metagenomics data: PfamA protein families (red) and new families found in this work (blue).
Figure 3. The distribution of “essentiality coefficients” for protein families.
PFAM families [5] are shown in the left panel and the new families introduced in this manuscript in the right panel.
Table 1. The 10 most overrepresented (Ov) PfamA families in the human gut microbiome.
| Family Id | Family name | | g | n | G | N | Ov/Ex/Es | PSI |
|---|---|---|---|---|---|---|---|---|
| | Domain of unknown function (DUF1735) | Conserved genomic neighbor of SusC/SusD, remote homology to SusE | 93 | 0 | 12 | 0 | 590.68/7.15/0.18 | 393045, JCSG, Diffraction-quality Crystals |
| | Osmosensory transporter coiled coil | | 90 | 0 | 8 | 0 | 571.62/10.00/0.12 | 2nocA, NESG |
| | Protein of unknown function (DUF1202) | Structural protein, probably involved in maintaining cell shape | 61 | 0 | 6 | 0 | 387.43/8.71/0.09 | N/A |
| | VirE N-terminal domain | | 72 | 1 | 12 | 1 | 228.65/5.04/0.18 | 388157, JCSG, Expressed |
| | GBS Bsp-like repeat | | 33 | 0 | 6 | 0 | 209.59/4.71/0.09 | APC88089.1, Purified |
| | DNA binding domain of tn916 integrase | | 33 | 0 | 16 | 0 | 209.59/1.94/0.25 | N/A |
| | Protein of unknown function (DUF2534) | Sensory, regulatory proteins | 29 | 0 | 7 | 0 | 184.19/3.63/0.11 | Z1342_ECO57, Montreal-Kingston, Expressed |
| | Enterobacterial EspB protein | | 25 | 0 | 1 | 0 | 158.78/12.50/0.02 | N/A |
| | Protein of unknown function (DUF1158) | Cytotoxic phage protein | 20 | 0 | 7 | 0 | 127.03/2.50/0.11 | ESSD_ECOLI, Montreal-Kingston, Cloned |
| | Excisionase from transposon Tn916 | | 39 | 1 | 16 | 1 | 123.85/1.79/0.24 | N/A |
The exact definition of the Ov category is given in the Methods section. Columns provide numerical values for: g (total number of representatives in genomes of human gut microbiome microbes), n (total number of representatives in genomes of microbes not associated with the human gut microbiome), G (number of microbes from the human gut microbiome with at least one representative of a family), and N (number of microbes not associated with the human gut microbiome with at least one representative of a family). Complete statistics for all Pfam protein families analyzed in this study are provided in the Supplementary Material.
Table 2. The 10 most expanded (Ex) PfamA families in the human gut microbiome.
| Family Id | Family name | g | n | G | N | Ov/Ex/Es | PSI |
|---|---|---|---|---|---|---|---|
| | RagB, SusD and hypothetical proteins | 784 | 54 | 13 | 6 | 90.54/48.29/0.19 | 3cghA, JCSG |
| | TonB dependent receptor, branch of the SusC megafamily | 1,285 | 2,679 | 23 | 171 | 3.05/37.97/0.01 | APC6611, MCSG, Purified |
| | TonB-dependent Receptor Plug Domain, SusC domain and a branch of the SusC megafamily | 1,292 | 2,686 | 24 | 172 | 3.05/36.15/0.02 | APC62280.2, MCSG, Crystallized |
| | Fimbrial protein | 205 | 15 | 8 | 8 | 81.38/21.11/0.11 | APC6678, MCSG, Purified |
| | Spore coat protein (Spore_GerQ) | 102 | 281 | 2 | 20 | 2.30/20.62/−0.01 | NYSGXRC-10075, SGX, Soluble |
| | ABC transporter | 4,938 | 27,601 | 65 | 493 | 1.14/18.95/0.00 | 282417, JCSG, PDB: 1VPL |
| | Bacterial regulatory helix-turn-helix proteins, AraC family | 1,982 | 3,901 | 62 | 294 | 3.23/18.24/0.36 | NYSGXRC-11003f, SGX, PDB: 3BT3 |
| | MatE | 1,332 | 1,965 | 63 | 347 | 4.30/15.17/0.27 | 282685, JCSG, Expressed |
| | Y_Y_Y domain | 291 | 172 | 17 | 49 | 10.68/12.73/0.16 | APC81251, MCSG, Cloned |
| | Enterobacterial EspB protein | 25 | 0 | 1 | 0 | 158.78/12.50/0.02 | N/A |
The exact definition of the Ex category is given in the Methods section. For details of the Table 2 columns, see the legend for Table 1.
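The Ov/Ex/Es column packs three scores into a single cell, so it has to be split before the families can be ranked by any one score. The following is a minimal Python sketch of that step, not part of the original study: `Scores` and `split_scores` are hypothetical helper names, and the three sample cells are copied from Tables 1 and 2. Note that some cells use the Unicode minus sign (U+2212), which Python's `float()` does not accept.

```python
from typing import NamedTuple


class Scores(NamedTuple):
    """The three values stored in one 'Ov/Ex/Es' table cell."""
    ov: float  # overrepresentation score
    ex: float  # expansion score
    es: float  # essentiality score


def split_scores(cell: str) -> Scores:
    """Split an 'Ov/Ex/Es' cell such as '590.68/7.15/0.18' into floats."""
    # Normalize the Unicode minus sign (U+2212) to an ASCII hyphen-minus,
    # because float() rejects U+2212.
    parts = cell.replace("\u2212", "-").split("/")
    if len(parts) != 3:
        raise ValueError(f"expected three '/'-separated values, got {cell!r}")
    ov, ex, es = (float(p) for p in parts)
    return Scores(ov, ex, es)


# (family name, Ov/Ex/Es cell) pairs copied from Tables 1 and 2.
rows = [
    ("Domain of unknown function (DUF1735)", "590.68/7.15/0.18"),
    ("RagB, SusD and hypothetical proteins", "90.54/48.29/0.19"),
    ("Spore coat protein (Spore_GerQ)", "2.30/20.62/\u22120.01"),
]

# Rank the sample rows by the expansion score, highest first.
ranked_by_ex = sorted(rows, key=lambda r: split_scores(r[1]).ex, reverse=True)

for name, cell in ranked_by_ex:
    print(f"{split_scores(cell).ex:7.2f}  {name}")
```

Sorting on a different field (`ov` or `es`) only requires changing the attribute in the `key` function.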
Table 3. The 10 most essential (Es) PfamA families in the human gut microbiome.
| Family Id | Family name | g | n | G | N | Ov/Ex/Es | PSI |
|---|---|---|---|---|---|---|---|
| | Pyruvate formate lyase | 127 | 151 | 61 | 103 | 5.31/0.60/0.73 | NYSGXRC-12027a, SGX, Purified |
| | Glycine radical | 179 | 219 | 62 | 114 | 5.17/0.94/0.72 | NYSGXRC-12027a, SGX, Purified |
| | Homoserine O-succinyltransferase | 58 | 118 | 58 | 114 | 3.10/−0.04/0.66 | 2ghrA, JCSG |
| | Domain of unknown function DUF | 131 | 296 | 54 | 92 | 2.80/−0.80/0.64 | N/A |
| | Glycosyl hydrolases family 2, TIM barrel domain | 293 | 159 | 52 | 80 | 11.63/3.57/0.64 | NYSGXRC-12014c, SGX, PDB: 3BGA |
| | Glycosyl hydrolases family 25 | 103 | 106 | 49 | 62 | 6.11/0.38/0.63 | 388675, JCSG, Crystallized |
| | Protein of unknown function (DUF1212) | 90 | 203 | 59 | 139 | 2.80/0.05/0.63 | APC20809.1, MCSG, Expressed |
| | Glycosyl hydrolases family 2, sugar binding domain | 397 | 218 | 52 | 92 | 11.51/5.15/0.61 | NYSGXRC-12014c, SGX, PDB: 3BGA |
| | Glycosyl hydrolases family 2, immunoglobulin-like beta-sandwich domain | 266 | 126 | 48 | 69 | 13.30/3.63/0.60 | NYSGXRC-12014c, SGX, PDB: 3BGA |
| | Galactokinase galactose-binding signature | 58 | 150 | 56 | 135 | 2.44/−0.09/0.59 | N/A |
The exact definition of the Es category is given in the Methods section. For details of the Table 3 columns, see the legend for Table 1.
Table 4. The 10 most overrepresented (Ov) new families, from the set of over 180 curated novel families identified in this work.
| Family ID | Family description | g | n | G | N | Ov/Ex/Es | Most advanced PSI target (id, center, status) |
|---|---|---|---|---|---|---|---|
| | No hypothesis about function | 105 | 0 | 14 | 0 | 666.89/7.00/0.22 | 390317, JCSG, Diffraction-quality Crystals |
| | Contains putative lipoproteins | 60 | 0 | 13 | 0 | 381.08/4.29/0.20 | N/A |
| | No hypothesis about function | 92 | 1 | 21 | 1 | 292.16/3.68/0.32 | NYSGXRC-T1444, NYSGXRC, Work Stopped |
| | Contains conserved hypothetical proteins found in conjugate transposon TraH. | 40 | 0 | 13 | 0 | 254.05/2.86/0.20 | 390153, JCSG, Diffraction-quality crystals |
| | No hypothesis about function | 35 | 0 | 16 | 0 | 222.30/2.06/0.25 | N/A |
| | No hypothesis about function | 32 | 0 | 13 | 0 | 203.24/2.29/0.20 | NYSGXRC-12097b, NYSGXRC, Native diffraction data |
| | No hypothesis about function | 31 | 0 | 15 | 0 | 196.89/1.94/0.23 | N/A |
| | No hypothesis about function | 28 | 0 | 13 | 0 | 177.84/2.00/0.20 | 393207, JCSG, Crystallized |
| | No hypothesis about function | 27 | 0 | 18 | 0 | 171.49/1.42/0.28 | N/A |
| | Remote homology to HD domain (PF01966) | 24 | 0 | 22 | 0 | 152.43/1.04/0.34 | N/A |
The exact definition of the Ov category is given in the Methods section. Columns provide numerical values for: g (total number of representatives in genomes of human gut microbiome microbes), n (total number of representatives in genomes of microbes not associated with the human gut microbiome), G (number of microbes from the human gut microbiome with at least one representative of a family), and N (number of microbes not associated with the human gut microbiome with at least one representative of a family).
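As a sanity check on the transcribed numbers, the Ov values in the tables above appear to scale with g / (n + 1). This is an empirical observation about the tabulated rows only; it is not the definition from the Methods section, which is not reproduced here, and the scale factor below is fitted from the data rather than taken from the paper. The sketch uses hypothetical helper names (`fit_scale`, `predicted_ov`) and (g, n, Ov) triples copied from Tables 1 to 4.

```python
# (g, n, Ov) triples copied from Tables 1-4 above.
rows = [
    (93, 0, 590.68),      # Table 1, DUF1735
    (90, 0, 571.62),      # Table 1, osmosensory transporter coiled coil
    (72, 1, 228.65),      # Table 1, VirE N-terminal domain
    (784, 54, 90.54),     # Table 2, RagB/SusD
    (1285, 2679, 3.05),   # Table 2, TonB dependent receptor
    (4938, 27601, 1.14),  # Table 2, ABC transporter
    (127, 151, 5.31),     # Table 3, pyruvate formate lyase
    (105, 0, 666.89),     # Table 4, first new family
]


def fit_scale(data):
    """Estimate the scale factor from rows with n == 0, where Ov / g is the scale.

    ASSUMPTION: Ov is proportional to g / (n + 1); this is inferred from the
    table values, not stated in the text shown here.
    """
    ratios = [ov / g for g, n, ov in data if n == 0]
    return sum(ratios) / len(ratios)


def predicted_ov(g, n, scale):
    """Predict Ov under the assumed form scale * g / (n + 1)."""
    return scale * g / (n + 1)


scale = fit_scale(rows)
# Largest absolute gap between the prediction and the tabulated Ov value.
max_error = max(abs(predicted_ov(g, n, scale) - ov) for g, n, ov in rows)

print(f"fitted scale: {scale:.4f}")
print(f"largest |predicted - tabulated|: {max_error:.4f}")
```

A gap below the two-decimal rounding of the table suggests the rows were transcribed consistently; a larger gap on some row would point to a transcription error in g, n, or Ov, or to a limit of the assumed form.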
Table 5. The 10 most expanded (Ex) new families, from the set of over 180 curated novel families identified in this work.
| Family ID | Family description | g | n | G | N | Ov/Ex/Es | Most advanced PSI target (id, center, status) |
|---|---|---|---|---|---|---|---|
| | Contains putative TonB-linked outer membrane proteins, part of SusC? Remote homology to several outer membrane receptors | 930 | 158 | 14 | 58 | 37.15/59.32/0.10 | N/A |
| | Contains putative TonB-linked outer membrane proteins, part of SusC? | 908 | 76 | 14 | 8 | 74.90/52.09/0.20 | APC62223.1, MCSG, Purified |
| | N-terminal subdomain of SusD | 819 | 56 | 14 | 6 | 91.26/46.60/0.20 | 3ejn, JCSG, In PDB |
| | Branch of SusD family | 764 | 61 | 13 | 5 | 78.27/44.40/0.19 | 3cgh, JCSG, In PDB |
| | No hypothesis about function | 105 | 0 | 14 | 0 | 666.89/7.00/0.22 | 390317, JCSG, Diffraction-quality Crystals |
| | Contains putative lipoproteins | 60 | 0 | 13 | 0 | 381.08/4.29/0.20 | N/A |
| | No hypothesis about function | 92 | 1 | 21 | 1 | 292.16/3.68/0.32 | NYSGXRC-T1444, NYSGXRC, Work Stopped |
| | Branch of SusD family | 98 | 17 | 13 | 4 | 34.58/3.60/0.19 | 3cgh, JCSG, In PDB |
| | Remote homology to Flagellar basal body-associated protein | 21 | 0 | 5 | 0 | 133.38/3.50/0.08 | N/A |
The exact definition of the Ex category is given in the Methods section. For details of the Table 5 columns, see the legend for Table 4.
Table 6. The 10 most essential (Es) new families, from the set of over 180 curated novel families identified in this work.
| Family ID | Family description | g | n | G | N | Ov/Ex/Es | Most advanced PSI target (id, center, status) |
|---|---|---|---|---|---|---|---|
| | No hypothesis about function | 41 | 53 | 41 | 24 | 4.82/−1.14/0.58 | 387995, JCSG, Diffraction-quality crystals |
| | Putative glycosyl hydrolase, remote homology to dextranase, polygalacturonase | 204 | 186 | 44 | 88 | 6.93/2.44/0.50 | 281957, JCSG, Crystallized |
| | Contains vancomycin b-type resistance proteins vanW, C-terminal domain homologous to L,D-transpeptidase | 68 | 105 | 37 | 61 | 4.07/0.10/0.45 | NYSGXRC-10212m, NYSGXRC, Purified |
| | Contains ABC transporters, remote homology to predicted membrane proteins | 129 | 274 | 45 | 129 | 2.98/0.70/0.43 | N/A |
| | Small GTP-binding protein | 34 | 45 | 33 | 44 | 4.69/0.00/0.42 | 389596, JCSG, Diffraction-quality Crystals |
| | Domain present in radical SAM domain proteins | 29 | 38 | 29 | 34 | 4.72/−0.12/0.38 | APC20476, MCSG, Purified |
| | Contains sortases SrtC | 36 | 3 | 23 | 3 | 57.16/0.75/0.35 | N/A |
| | No hypothesis about function | 27 | 36 | 27 | 36 | 4.63/−0.01/0.34 | APC27927, MCSG, Work Stopped |
| | Remote homology to HD domain (PF01966) | 24 | 0 | 22 | 0 | 152.43/1.04/0.34 | N/A |
The exact definition of the Es category is given in the Methods section. For details of the Table 6 columns, see the legend for Table 4.