Host-microbiome interactions and the microbial community have broad impact in human health and diseases. Most microbiome based studies are performed at the genome level based on next-generation sequencing techniques, but metaproteomics is emerging as a powerful technique to study microbiome functional activity by characterizing the complex and dynamic composition of microbial proteins. We conducted a large-scale survey of human gut microbiome metaproteomic data to identify generalist species that are ubiquitously expressed across all samples and specialists that are highly expressed in a small subset of samples associated with a certain phenotype. We were able to utilize the metaproteomic mass spectrometry data to reveal the protein landscapes of these species, which enables the characterization of the expression levels of proteins of different functions and underlying regulatory mechanisms, such as operons. Finally, we were able to recover a large number of open reading frames (ORFs) with spectral support, which were missed by de novo protein-coding gene predictors. We showed that a majority of the rescued ORFs overlapped with de novo predicted protein-coding genes, but on opposite strands or in different frames. Together, these demonstrate applications of metaproteomics for the characterization of important gut bacterial species.
Host-microbiome interactions and the microbial community have broad impact in human health and diseases. Most microbiome based studies are performed at the genome level based on next-generation sequencing techniques, but metaproteomics is emerging as a powerful technique to study microbiome functional activity by characterizing the complex and dynamic composition of microbial proteins. We conducted a large-scale survey of human gut microbiome metaproteomic data to identify generalist species that are ubiquitously expressed across all samples and specialists that are highly expressed in a small subset of samples associated with a certain phenotype. We were able to utilize the metaproteomic mass spectrometry data to reveal the protein landscapes of these species, which enables the characterization of the expression levels of proteins of different functions and underlying regulatory mechanisms, such as operons. Finally, we were able to recover a large number of open reading frames (ORFs) with spectral support, which were missed by de novo protein-coding gene predictors. We showed that a majority of the rescued ORFs overlapped with de novo predicted protein-coding genes, but on opposite strands or in different frames. Together, these demonstrate applications of metaproteomics for the characterization of important gut bacterial species.
It is now well established that microbial species inhabit many ecosystems, which drives diversity due to the need to adapt to these environments [1]. Molecular diversity and robustness also allows these species to inhabit microbiomes that are host associated, such as the human gut [2-7]. Due to its importance to human health and disease, the human gut microbiome has been extensively sequenced, leading to the identification of more than a thousand distinct species, with knowledge about diversity increased by every new study [8]. Culture based whole genome sequencing techniques [9, 10] have identified a few hundred human gut associated genomes. Culture free techniques, revolutionized by metagenomics coupled with computational genome binning methods, have resulted in many more metagenome-assembled genomes (MAGs) [1, 3, 5, 8–11]. A unified genome catalog contains more than 200,000 reference genomes from the human gut microbiome [8]. Maintaining a comprehensive genomic catalog of bacteria and archaea will provide the basis needed to perform large scale multi-omic comparative genomic studies. Large scale studies based on whole genome sequences will be central in understanding details about mechanisms of microbial interactions with each other and with their environments and hosts. These studies will also be critical for uncovering details about metabolic pathways and key functions at the protein level by examining proteome landscapes and employing various data-mining techniques to identify genes and functions of interest, such as CRISPR-cas systems [12] and anti-CRISPRs [13].Recent progress has dramatically increased the collection of microbial species that are related to human health and diseases, most notably the accumulation of MAGs, many of which represent new species. Experimental studies of these new species in terms of their expression and functions remains scarce. Computational gene predictors have become an essential first step in the annotation of these new genomes. De novo gene prediction techniques are commonly used because they are not constrained by sequence similarity with known ones [14, 15]. De novo gene prediction remains a unsolved problem, with different tools, such as FragGeneScan (FGS), Prokka, and GenMark, producing largely consistent but not perfect predictions because most of the predicted genes remain hypothetical without functional annotations. Proteomic studies have been used to improve understanding of the microbial world beyond genomics. Proteomics allows correction of bad gene predictions, and the discovery of protein products from the regions of the genome not yet predicted to be coding areas [16]. Proteomics has been used as a tool for studying bacterial virulence and antimicrobial resistance [17]. The Multidimensional Protein Identification Technology (MudPIT) approach was used to study Pseudomonas aeruginosa membrane-associated proteins, which contribute to P. aeruginosa cells’ antibiotic resistance and is involved in their interaction with host cells [18].Motivated by recent expansion of microbial genome catalogues, our previously defined reference based peptide identification pipeline (HAPiID) [19], and the increasing number of publicly available gut metaproteomics datasets, we conducted a large scale survey of metaproteomics data of the human gut microbiome to study the proteome landscapes of the various microbial species dominating the human gut. Our aim was to mimic targeted proteomics studies, which traditionally focused on single cultured species, by leveraging the available metaproteomics datasets. We were able to identify species that are ubiquitously expressed across all samples spanning various phenotypes. Furthermore, by focusing on the most highly expressed genome sequences at the protein level, we represented the expressed proteome as a network and extracted co-abundant protein modules. We used such network information to study the various metabolic pathways that a protein with unknown function might be involved in and also identified modules that were expressed in hosts with specific phenotypes. We also leveraged proteome information to identify and annotate potential operon structures within genomes and recover open reading frames (ORFs) with spectral evidence that were otherwise missed by computational protein coding gene predictors. Operon structures have been exploited for computational functional predictions (guilt by association) [20, 21]. We made the results from our analysis available at a publicly accessible website, providing the protein expression and putative operon structures with spectral support for many human gut related microbial species for the first time.
Results
Summary of the metaproteomic identification results
Using more than one thousand metaproteomic datasets, we were able to reveal information about the functional landscapes for many bacteria species that are important for human health and disease. Due to the throughput limitation of the metaproteomic experiments, protein expressions of the rare species were not observable for all the bacterial species that were identified through metagenomic research. In total, we were able to identify 2,511 distinct genomes that were expressed in at least one out of the total 1,276 samples (see Table 1; see details of the datasets in S1 Data). A total of 13,460,264 spectra were matched to peptides, out of which 12,950,155 spectra (96.2%) were matched to these 2,511 genomes. The remaining 510,109 identified spectra (which could either be from rarer species or a result of false identifications) were not considered further in this study. Fig 1 summarizes the spectral support for these genomes with x-axis showing the number of supporting samples and y-axis showing the total number of identified spectra for each genome (a data point in the plot).
Table 1
Summary of the metaproteomics datasets that were analyzed.
Study
# of MS files in original study
# samples for downsteam analysis
phenotypes
Rechenberger et al. 2019 [22]
424
318
AL
Tanca et al. 2015 [23]
5
5
healthy
Cerdo et al. 2018 [24]
56
8
healthy infants
Gavin et al. 2018 [25]
101
101
healthy/seronegative T1D/seropositive
Long et al. 2020 [26]
38
10
colorectal cancer/healthy
Lehmann et al. 2019 [27]
77
56
healthy/CD/UC/IBS colon adenoma gastric carcinoma
Zhang et al. 2018 [28]
203
167
healthy/CD/UC
Zhang et al. 2020 [29]
48
37
healthy/CD
Zhang et al. 2016 [30]
8
8
healthy
Lloyd-Price et al. 2019 [31]
641
493
healthy/CD/UC
Zhang et al. 2017 [32]
45
9
healthy/CD/UC
Hickl et al. 2019 [33]
30
3
healthy
Young et al. 2015 [34]
176
9
female preterm infant
Blakeley-Ruiz et al. 2019 [35]
572
52
CD (resection surgery)
Total
2424
1276
# samples: not all datasets were used in our downstream analyses. We discarded the datasets that had < 1000 identified peptides and for references 32–35, we grouped fractions into single samples.
Fig 1
Scatter plot summarizing the expression of 2,511 human associated microbial genomes in different samples.
X-axis shows the number of samples in which each genome was observed, and y-axis shows the total support spectra for each genome. The four “outlier” genomes that are highly expressed and/or broadly distributed are highlighted in red with their species names shown in the plot, followed by the number of samples supporting their expression in the parenthesis.
# samples: not all datasets were used in our downstream analyses. We discarded the datasets that had < 1000 identified peptides and for references 32–35, we grouped fractions into single samples.
Scatter plot summarizing the expression of 2,511 human associated microbial genomes in different samples.
X-axis shows the number of samples in which each genome was observed, and y-axis shows the total support spectra for each genome. The four “outlier” genomes that are highly expressed and/or broadly distributed are highlighted in red with their species names shown in the plot, followed by the number of samples supporting their expression in the parenthesis.
Not all highly-expressed species are equal: Some are generalist and some are specialist
We saw variation in metaproteomic support for the different genomes (see Fig 1) with the most highly expressed genome having over one million supporting spectra. The mean number of spectra expressed by each genome was 4,780 and the median was 385, which indicates very few highly expressed genomes and many low expression genomes at the protein level. S1 Fig shows a long-tailed abundance distribution for the human gut associated bacteria, with many species observed with very low abundance. The top 100 most abundant genomes (less than 4% of the total number of expressed genomes) contributed more than 61% of the identified spectra (7,327,171 spectra). Fig 2 summarizes the taxonomic composition at the genus level of these top 100 genomes. These 100 genomes represent a total of 34 genera in five most abundant phyla of the human gut Bacteriodetes, Firmicutes, Actinobacteria, Proteobacteria and Verrucomicrobiota (S2 Fig).
Fig 2
Piechart summarizing the taxonomic composition at the genus level for the top 100 highly expressed genomes.
Phocaeicola is the most abundant phylum (23.4%, purple), followed by Bacteroides (13.4%, red), Faecalibacterium (11.5%, green), and the rest can be followed by reading left to right in the legend and counter-countwise accordingly in the graph.
Piechart summarizing the taxonomic composition at the genus level for the top 100 highly expressed genomes.
Phocaeicola is the most abundant phylum (23.4%, purple), followed by Bacteroides (13.4%, red), Faecalibacterium (11.5%, green), and the rest can be followed by reading left to right in the legend and counter-countwise accordingly in the graph.The top two most abundant species belong to the genus Phocaeicola (Phocaeicola vulgatus and Phocaeicola dorei), which expressed more than 11.9% of the total number of spectra in a total of 543 and 308 samples respectively. We refer to these two species as generalists herein forward. These two species share similar phenotype expression patterns because they both appear in many healthy samples and in disease samples, including acute leukemia (AL), type-1 diabetes (T1D), Crohn’s disease (CD), irritable bowel syndrome (IBS), colorectcal cancer (CRC), colon adenoma and CD (followed by resection surgery), with the exception of gastric carcinoma where only Phocaeicola vulgatus was found to be expressed. We also identified two species Lactobacillus amylovorus and Limosilactobacillus mucosae, which were the 4 and the 11 most highly expressed genomes but are only expressed in 25 and 16 samples respectively with L. amylovorus identifications coming from three studies [22, 27, 35] and L. mucosae also coming from three studies [26, 27, 35]. We refer to these two species as specialists. They were not found to be expressed in a single healthy sample, but both were highly expressed in CD (followed by resection surgery) patients, and the former was also found in AL patients with high abundance. We also looked into the metagenomic support of these species. Analysis of the matched metagenomic data showed that the Lactobacillus genus has a relative abundance of 2% [26, 27, 35]. We also checked the presence of these two species using an independent collection of metagenomic datasets from CD patients, which contains a total of 108 metagenomic samples from four projects including [36-39]. Kraken2 (v2.0.8, [40]) and Bracken (v2.6.2, [41]) was used to quantify the relative abundances of the species. The results showed that these two species are of very low abundance in most of the CD datasets but are in relative high abundance (e.g.. 2% or 4%) in a small number of CD patients, which is in agreement with the observation based on the metaproteomic data. Further, our analyses suggested that this species had high protein expressions (quantified by metaproteomic approach) that was disproportional to their abundance (quantified by the metagenomic approach).We next checked whether or not these two pairs of species have similar functional profiles when compared with each other, over the samples in which they are expressed. To do that, we annotated their expressed proteins, as much as possible, with COG terms (see Methods for more details). For each of the four genomes mentioned above, we computed the relative abundances for the COG terms associated with their expressed protein sequences. We summarize the distribution of the high level COG categories, represented by 25 single letter groups, in Fig 3. We noticed that, overall, the two generalist species share more similar functional profiles, and the two specialist genomes share more similar functional profiles (Fig 3). The two specialist genomes had significantly higher relative protein expression levels for the COG categories G (Carbohydrate transfer and metabolism), J (Translation, including ribosome structure and biogenesis) and T (Signal transduction). On the other hand, the two generalist species had relatively higher expression levels in the COG categories P (inorganic ion transport and metabolism), M (Cell wall/membrane/envelope biogeneis), C (Energy production and conversion), U (Intracellular trafficking, secretion, and vesicular transport) and W (Extracellular structures). It should be noted that the latter functional category W had no observed expression within the two specialist species.
Fig 3
Barplot summarizing the relative abundances of the COG functional categories of the two generalist (blue/green) and two specialist (red/orange) highly abundant genomes.
Protein co-expression modules and their applications
We focused on the top 100 most highly expressed genomes in this section to assess the presence/expression of their proteins among the different host phenotypes (see S3 Fig for the summary). We derived groups of proteins (protein co-expression modules) that had similar presence/expression patterns across samples and were mostly found in one of the 11 phenotypes. We extracted a total of 854 such modules, composed of 3,697 protein sequences (see Table 2). Fig 4 shows two such protein modules. The first module containing proteins that were exclusively expressed in CD patients (the proteins were observed in high abundance although they were only observed in 16 CD samples), and the second one contains proteins mostly expressed (but in less abundance) in AL patients. We noticed that many of the proteins within the modules lack functional annotations but have connections to those whose functions are previously characterized.
Table 2
Phenotype specific protein co-expression modules.
Phenotype
# of modules
# of proteins
AL
232
1,250
CD
96
307
CD (resection surgery)
225
816
colon adenoma
6
17
CRC
0
0
gastric carcinoma
9
23
healthy & healthy seronegative
169
873
healthy infant
25
92
IBS
6
26
preterm infants
0
0
T1D
12
45
UC
74
248
Fig 4
Example protein co-abundant modules extracted from specialists.
(a) Limosilactobacillus mucosae a specialist found in CD samples (L. mucosae was one of the two most highly expressed specialists discussed in this paper), and (b) Escherichia flexneri a specialist found in acute leukemia patients. The nodes are proteins: unknown proteins are in blue, and they are connected to proteins with functional annotations (shown in other colors) through our co-occurrence analysis.
Example protein co-abundant modules extracted from specialists.
(a) Limosilactobacillus mucosae a specialist found in CD samples (L. mucosae was one of the two most highly expressed specialists discussed in this paper), and (b) Escherichia flexneri a specialist found in acute leukemia patients. The nodes are proteins: unknown proteins are in blue, and they are connected to proteins with functional annotations (shown in other colors) through our co-occurrence analysis.We explored the possibility of suggesting functions for proteins that lack COG or KEGG annotations, but are co-expressed with other annotated proteins (guilt-by-association). We report pathway associations when such proteins share edges with other annotated ones within the modules. By doing so we were able to suggest potential pathways for a total of 7,682 proteins using COG annotations and 5,800 proteins using KEGG annotations (9,566 receiving either COG or KEGG pathway). On the other hand, there were a total of 3,747 proteins using COG annotations and 2,080 proteins using KEGG annotations respectively, that either had no edges with any other annotated proteins or were found in modules that were completely composed of un-annotated protein sequences. We note just like any guilt-by-association approaches, our protein-coexpression based analysis can provide hints to potential biological processes (pathways) that these proteins might be involved, however, their exact functions may need to be studied further using other approaches.
Putative operon structures with spectral support
We applied the described pipeline to extract potential operon candidates from genomes that were expressed in relatively high number of samples (see Methods for more details). Two lists of potential operons were produced: one with spectral support (as described in the Methods section), and the other one of operons without consideration of spectral information. In total, we analyzed 278 genomes, and made the results available on the GutBac website. From these genomes, a total of 4,089 potential operon structures with spectral support and 36,633 suggested operon structures with or without spectral support were identified. We compared our predictions with those predicted using fgenesB for the top 10 most highly expressed genomes, and the results are summarized in Table 3. The number of potential operons with spectral support were always lower than those predicted by fgeneB and those suggested by our pipeline without spectral support. However, on average there was slightly better agreement between the fgenesB predictions and those with spectral support (> 92%) compared to the suggestions without spectral support (88%). It should be noted that the increase in the difference between the reported numbers is expected in this case, as we included more genomes with decreasing spectral coverage.
Table 3
Summary of predicted operons for the selected genomes by fgenesB and our approaches including the total with and without spectra support.
genome id
fgenesB
predicted operons
supported*
no-support**
GCF_003475695.1
875
817 (699#)
421 (375)
390 (324)
20287_6_9
1038
878 (759)
378 (349)
500 (410)
BackhedF_2015__SID70_4M__bin_5
675
533 (492)
185 (174)
348 (318)
GCF_003465905.1
454
301 (279)
125 (117)
176 (162)
LiJ_2017__H2M414927__bin_20
486
374 (353)
154 (143)
220 (210)
12718_7_31
910
766 (650)
289 (265)
477 (385)
GCF_003471795.1
747
549 (478)
274 (244)
275 (234)
GCF_000169015.1
835
694 (586)
299 (267)
395 (319)
GCF_000209425.1
560
400 (362)
114 (111)
286 (251)
GCF_003433995.1
667
511 (476)
164 (152)
347 (324)
*, ** putative operons with and without spectral support, respectively.
# numbers in parentheses reflect the number of operons that overlap with fgenesB predictions (we considered that two operons overlap if half of the smaller operon overlaps with the larger one).
*, ** putative operons with and without spectral support, respectively.# numbers in parentheses reflect the number of operons that overlap with fgenesB predictions (we considered that two operons overlap if half of the smaller operon overlaps with the larger one).
Recovery of missed ORFs with spectral support
We incorporated metaproteomics datasets to recover ORFs with spectral support which were otherwise missed by de novo gene predictors mentioned in the Methods section above. Novel ORFs in comparison to the two computational gene/protein prediction methods (FGS and Prokka) were extracted as explained in Methods. From the 2,511 distinct genomes that were identified to be expressed at the protein level in at least one sample, a total of 23,949 putative ORFs were identified which were otherwise missed, when FGS was employed for gene prediction. Similarly, a total of 22,658 novel ORFs were recovered when Prokka was used for gene prediction. The majority of these recovered ORFs were overlapping (at 84%); a more detailed results for this comparison is summarized in S4 Fig. Unsurprisingly, there is positive correlation between the number of ORFs recovered for genomes with their protein expression levels in both cases, as shown in S5 and S6 Figs. This further suggests potential improvement in bacterial annotations with the increase in the throughput of metaproteomics. The rescued ORFs tend to be short (which is expected as longer ORFs are harder to be missed by de novo gene predictors), albeit there are some very long ORFs (see S7 Fig). We compiled lists of rescued ORFs with metaproteomics support, along with putative annotations of the ORFs if available, and made all the results available for download through the GutBac website.We examined the relationship of the rescued ORFs with respect to the protein coding genes predicted by FGS or Prokka. We found that among the rescued ORFs (22,949) missed by FGS, the majority of them are either on the opposite strands (11,955, 53%) of already predicted protein coding genes by FGS, or the same strand but in different frames (7,347, 33%). Similarly, among the 22,658 rescued ORFs missed by Prokka, 10,807 (47.7%) are on the opposite strands of predicted protein coding genes by Prokka and 6,734 (29.7%) are of the same strand but in different frames. The genome that has the most number of rescued ORFs is Phocaeicola vulgatus (accession ID: GCF_003475695.1), one of the two generalists we discussed above. Of the 426 recused ORFs missed by Prokka and 437 missed by FGS, 232 and 249 cases respectively were found on the opposite strands of predicted genes, and 97 and 122 cases were found to be encoded by a different frame of overlapping genes. Fig 5 shows plots of three regions in this genome that contain rescued ORFs with spectral support (together with predicted operons). The first example (illustrated on the top in Fig 5) involves two rescued ORFs, among which one is a large ORF that was missed by Prokka (but was predicted as a protein coding gene by FGS), and could be re-identified using metaproteomic data (we note this ORF was missed by Prokka probably because it overlaps with a tRNA gene, shown as a red arrow in the plot). The second example (shown in the middle in Fig 5) also contains two rescued ORFs, and we note the first ORF (from 86972 to 87128 bp in NP_QRPW01000005.1) was missed by both FGS and Prokka but we found metaproteomic evidence for this ORF. In fact, this ORF is part of a large operon that involves genes encoding ribosomal proteins, and hmmscan search (https://www.ebi.ac.uk/Tools/hmmer/search/hmmscan) shows that the rescued ORF is ribosomal protein S11. In the last example, the larger rescued ORF of 199 aa overlaps (on the opposite strand) with two putative protein coding genes. This ORF is supported by one peptide of 2 peptide spectrum matches (PSMs) but similarity search (using hmmscan search) did not return similar sequences, so further investigation could be needed.
Fig 5
Selected cases of rescued ORFs using metaproteomic data in P. vulgatus genome.
The three blocks of arrows represent genes predicted from three regions in this genome: from the top to the bottom are contigs with IDs of NZ_QRPW01000017.1, NZ_QRPW01000005.1, and NZ_QRPW01000002.1, respectively. Genes predicted by FGS and/or Prokka are shown as purple arrows around the central lines, each representing a segment of the genome: dark purple and light purple are for genes with spectral support in at least two metaproteomic datasets, or only one metaproteomic dataset, respectively. The red small arrow represents a tRNA gene predicted by Prokka. Rescued ORFs are shown as green arrows above the lines. Genes in the same putative operon structure are surrounded in orange squares.
Selected cases of rescued ORFs using metaproteomic data in P. vulgatus genome.
The three blocks of arrows represent genes predicted from three regions in this genome: from the top to the bottom are contigs with IDs of NZ_QRPW01000017.1, NZ_QRPW01000005.1, and NZ_QRPW01000002.1, respectively. Genes predicted by FGS and/or Prokka are shown as purple arrows around the central lines, each representing a segment of the genome: dark purple and light purple are for genes with spectral support in at least two metaproteomic datasets, or only one metaproteomic dataset, respectively. The red small arrow represents a tRNA gene predicted by Prokka. Rescued ORFs are shown as green arrows above the lines. Genes in the same putative operon structure are surrounded in orange squares.Further confirmation of the rescued ORFs will require experimental validation. Here we used three indirect evidences to support the validity of some of the rescued ORFs. First, 25% of the rescued ORFs have some function annotations (to COG categories in this paper) and they are likely to be true. S8 Fig summaries the functional distributions of the rescued ORFs (details of the functional annotations can be found on GutBac). Second, clustering of ORFs revealed 1,346 groups of ORFs each having at least 2 members, among which 37% received functional annotations and the remaining 63% did not (but we would argue they are also likely to be true as they are non-unique events). For example, the largest group of ORFs without any functional annotation (the fourth largest group of rescued ORFs with or without functional annotations) includes 61 rescued ORFs from various of genomes that share sequence similarity. Third, we compared rescued ORFs to putative proteins annotated by NCBI for the top 5 genomes that encode the highest number of novel ORFs and also have a RefSeq genome in the NCBI database (the five NCBI genomes are GCF_000159855.2, GCF_000169015.1, GCF_003433995.1, GCF_003464525.1 and GCF_003475695.1), and the results showed that about 6% of the rescued ORFs had perfect match with these proteins.
Discussions
By taking advantage of the availability of many metaproteomics datasets, we were able to probe the protein landscapes for many human-associated microbial species. However, due to the relatively low throughput of metaproteomics comparing to metagenomics (and metatranscriptomics), the number of genomes we studied (in depth) was rather limited, even though a huge number of reference genomes for human gut microbiome were available. The HAPiID pipeline includes more than 6000 genomes for peptide and protein identification from metaproteomic data, and we were only able to provide genome-level protein landscape analysis for 40% of these genomes.Using metaproteomic data, we were able to identify a large number of ORFs that had spectral support but were missed by de novo gene predictors. Many of these rescued ORFs are relatively short (otherwise they are unlikely to be missed by protein coding gene predictors), and we want to emphasize that they need to be interpreted cautiously. Some of them could be false identifications and some of them may reflect translation that does not result in proteins with biological significance. In the case of the 199aa ORF discussed in the results, evidence of this ORF was based on a single peptide of 2 PSMs and more focused mass spectrometry analysis (higher throughput techniques or a focus on this species) could be used to improve support for this existence of this ORF and eliminate the possibility of it being a false identification. Finally, we hope the finding of “rescued” ORFs (such as the one that overlaps with a tRNA gene) and their analysis can inspire ideas for improving de novo protein coding gene predictions.With the increasing number of microbial genomes being sequenced, functional annotation becomes an immediate need. Numerous genomes are being computationally assembled as a result of these metagenomic sequencing efforts coupled with emerging computational genome assembly and binning tools. This is expanding the gap between the amount of whole microbial genomes recovered and the fraction of annotation the community has concerning these newly discovered genomes. While metaproteomics based techniques are still low throughput compared to the sequencing base techniques, we believe there is great value in studying the proteome of these microbial communities directly within their environment and augmenting a third level of information on top of metagenome and metatranscriptome, to capture such microbial interactions at a more granular scale, i.e., both functional and pathway levels, rather than a mere interpretation of the genome abundance levels of the different microbial entities at some taxonomic level.We hope that our results would serve as a resource for the study of gut microbial community in particular to cast more light over the microbial dark matter, and broaden our understanding at the functional level. Furthermore, we also hope that this work would inspire others and serve as an example on how to utilize metaproteomics as a tool for large scale analysis to study the microbial functional landscapes, especially as metaproteomics throughput improves and different mass spectrometry methods that may improve results, such as multiplexing and data independent approaches, become more common.
Methods
Metaproteomics samples
Human gut metaproteomic datasets were obtained from the publicly accessible proteome exchange database [42]. We extracted a total of 2,424 Thermo Fisher RAW files, from 14 recent studies, spanning 11 distinct phenoypes [22-35]. A more detailed summary of the datasets used can be found in Table 1. Four of the studies were based on fractionation approaches to increase sequencing depth. For peptide-spectral-matching and to identify expressed proteins, we ran these individual samples through our previously developed HAPiID pipeline [19]. For fractionated samples, we performed peptide spectral matching for each individual fraction separately and then combined the identification results from every fraction into results for a single sample.
Peptide identification and proteome quantification
To identify peptides and quantify proteome content of our proteomics dataset, we first ran each individual sample through our previously implemented HAPiID proteomics framework (https://github.com/mgtools/HAPiID) [19]. HAPiID uses two steps to optimize the use of the reference genomes and metagenome-assembled genomes (MAGs) as the universal reference for human gut metaproteomic MS/MS data analysis. For this work, we expanded the search protein database used by HAPiID to include proteins predicted from 6,160 non-redundant microbial whole genome sequences or MAGs, and used it as a reference to compute theoretical spectra, for peptide spectral matching. These microbial genomes were collected from five recent studies [3, 5, 9–11], and then filtered and dereplicated by dRep [43], using 90% sequence identity for the primary clusters, and 99% sequences identity for the secondary clusters with a minimum of 60% genome alignment coverage. Default parameters for the two-step search strategy were used over all samples. For the peptide spectral matching step, the MSGF+ search engine was used with the following settings [44]: high-resolution LTQ as the instrument type, precursor mass tolerance of 15 PPM, isotope error range between -1 and 2, a maximum of 3 fixed modifications, variable oxidation of methionine, fixed carboamidomethyl of cysteine, maximum charge of 7 and a minimum charge of 1 (S1 Table shows the breakdown of the identified peptides with different charges), and allowing for semi-tryptic fragmentation up to two missed-cleavages. A strict FDR cutoff at 1% was enforced by using target-decoy database approach with reverse protein sequences as decoy. A final set of samples was maintained by discarding the ones where we identify less than 1,000 unique peptide sequences. After combining fractions for the fractionated samples, our final dataset was composed of 1,276 samples.Genomes identified by the profiling step of HAPiID were considered as expressed genomes and their relative abundances at proteome level were estimated using all identified spectra (from the two steps of identifications by HAPiID) as following. All uniquely mapped spectra (i.e., their corresponding peptides are unique to individual genomes) get assigned to their corresponding genomes, followed by partial allocations of the multi-mapped spectra (shared by multiple genomes). Quantities from unique spectra to genomes are used for weighted assignment of the multi-mapped spectra shared between multiple genomes, similar to the approach proposed in Qin et al. [45] (genomes that do not receive unique spectra are not considered further). In addition to the absolute counts of spectra assigned to each genome, we also computed the normalized spectra counts to consider the differences of proteome sizes for the various genomes and the different metaproteomic throughput for comparative analysis. Spectral counts were first normalized (per million amino acids) by the respective proteome sizes (calculated as the total combined lengths of protein sequences for each genome), and then normalized by the total number of identified spectra for each sample to account for sequencing depths. Using normalized spectra counts is important for downstream comparative analyses of the expression of identified species.
Protein function annotation
We used both COG [46] and KEGG [47] databases to assign functional annotations to protein sequences. The latest version of clusters of orthologs (COG) database (release 2020), was used to assign COG terms, with significant hits, to the predicted protein sequences from whole genome sequences. A COG protein reference database was first created from the curated list of 3,213,025 protein sequences with COG assignments. Diamond blast [48], with the –more-sensitive setting, was employed to assign COG terms to our query sequences. A protein sequence was annotated with a COG term if the alignment between the query protein sequence and the reference protein sequence covers at least 50% of the total length of the COG domain, with a minimum e-value score of 0.01. A total of 10,546,105 protein sequences (out of 15,072,008 protein sequences), each had at least one COG hit, using these criteria. Protein sequences that did not receive any COG hit were assigned to the category S (i.e., no COG-term assignments). We also performed COG based pathway annotation, for protein sequences with COG hits that participate in certain pathways. To transfer KEGG annotations to our protein sequences we first annotated protein sequences with KO terms using KEGG’s blast-Koala tool, over the latest version of KEGG database [47], followed by mapping putative pathways that each protein participates in using KEGG’s pathway mapper tool. Furthermore, we post-processed protein to KEGG pathway maps by using our previously developed tool, MinPath [49], to overcome potential overestimation of pathways by the naive function-to-pathway mapping method.
Inference of protein co-expression network
For the top 100 most highly expressed genomes, we created protein expression profile M x N matrices where M represents the list of expressed proteins and N represents the list of samples where the particular genome is expressed in, and a particular entry m, n represents the number of spectra expressed by the protein m in the sample n. From this matrix, protein expression profile vectors were then extracted for each expressed protein and all against all pairwise correlation coefficients were calculated using the program fastspar [50], which is a fast and scalable implementation of the original sparCC correlation measure [51]. The resulting M x M correlation matrix is used to construct a protein co-abudnance network. Based on author recommendations, we used a minimum sparCC correlation coefficient of 0.3 or higher to infer an edge between two expressed proteins. To extract proteome connected components, we developed an in house script using python’s networkX module [52]. A graph node in this context is an expressed protein sequence, and an edge between two protein sequences represents a sparCC correlation coefficient of 0.3 or higher, between the expression profile of these two proteins. We first identified all the maximum cliques within the network by an iterative approach. For all the remaining nodes that did not form cliques, we identified the best candidate connected components, by adding them to the clique where they share the highest number of edges with. Cytoscape was used for network visualization [53].
Host phenotype specificity of gut bacterial proteins
To quantify if a protein’s expression is only observed in gut microbiomes with a specific host phenotype or broadly found in samples with different phenotypes, we defined the host phenotype specificity (HPS) of a protein using Shannon’s entropy measure [54], as following:
where N is the total number of possible phenotypes that a protein i was found to be expressed, and p represents the proportion of the protein i being expressed in sample with phenotype j. The proportions for each protein to be found expressed in different phenotypes are calculated based on the spectral counts, normalized by sequencing depth. Using this metric, phenotypic entropy closer to zero indicates phenotype specific expression patterns, and those with values closer to 1 indicate more broadly distributed protein expressions, likely to be expected in a broad range of phenotypes.
Inference of operon structures with protein expression support
Protein spectral support was integrated with genomic context to extract potential candidates for operon structures within highly expressed genomes. For each genome, we extracted clusters of proteins from the same contig and strand containing member protein sequences within 100 bases. We integrated spectral support to filter out clusters that did not have at least half of its protein members expressed in 2 or more samples and required at least 2 spectra per sample. To validate our results, we also predicted operon structures using the fgenesB program [55, 56]. FgenesB is a bacterial operon and gene prediction program, based on pattern Markov chains. The FgenesB software suite, however, only provides a web interface with limited submissions per day per user, therefore an automated method to detect operon structures, that could be integrated in pipelines and run offline would be highly desirable. For comparison, we counted a predicted operon as overlapping with those of fgenesB if the two suggested operon regions on the genome overlap by more than 70%.
Recovering of proteins missed by gene predictors but are supported by metaproteomics data
FragGeneScan (FGS) [57] was used to predict protein coding genes for the genomes used in our reference protein database. For each of the highly abundant genomes identified in each sample, by selecting top N genome sequences covering 80% of the identified spectra in step 1 of our HAPiID pipeline (see [19] for more details), we construct a protein database using 6 frame genome translation sequences instead of predicted protein sequences as a reference database. We then identified peptides from each sample based on this method of database construction. For each new peptide we extracted open reading frames (ORFs) that surround them, and then filtered out those that did not have a blast hit of 70% or higher with the respective predicted protein predicted by FGS. Similarly, we repeated the same task by using Prokka [58] for protein prediction instead of FGS.
Availability of the results
Results are available through the GutBac website at https://omics.informatics.indiana.edu/GutBac/. The website includes GFF files with predicted proteins for each genome, FNA and FFN files containing relevant genomic sequences, GFF files containing both the predicted proteins and the missed ORFs for each genome, and CSV files containing the list of predicted operons for each genome. The website includes contig-specific plots generated by DNA Features Viewer [59] containing the missed ORFs, predicted proteins, and predicted operons for both FGS and Prokka. The website also features genome-specific searching and filtering. Finally, the website has the MSGF+ identifications, networks and annotations for each genome, and the genome to species mapping.
An Excel file with details of the metaproteomics datasets used in this analysis.
(XLSX)Click here for additional data file.
Summary of parent ion charges.
(PDF)Click here for additional data file.
Number of normalized spectra per genome.
For clarity, we only showed the top 500genomes. Purple represents the top 100 genomes and pink the top 500.(TIF)Click here for additional data file.(TIF)Click here for additional data file.
Boxplots summarizing the host phenotype specificity of expressed proteins encoded by the top 100 most abundant genomes.
Gray line indicates the average host phenotype specificity of the proteins in each genome.(TIF)Click here for additional data file.
Venn diagram summarizing the overlap between the rescued ORFs that were missed by FGS and Prokka.
(TIF)Click here for additional data file.
Scatter plots showing the relationship between the expression levels of the different genomes at the protein level with the total number of rescued ORFs missed by FGS.
(TIF)Click here for additional data file.
Scatter plots showing the relationship between the expression levels of the different genomes at the protein level with the total number of rescued ORFs missed by Prokka.
(TIF)Click here for additional data file.
Length distribution of the rescued ORFs.
(TIF)Click here for additional data file.
Barplot summarizing the functional annotations (in COG) of rescued ORFs.
Authors: Lucas B Harrington; David Paez-Espino; Brett T Staahl; Janice S Chen; Enbo Ma; Nikos C Kyrpides; Jennifer A Doudna Journal: Nat Commun Date: 2017-11-10 Impact factor: 14.919
Authors: Robert D Stewart; Marc D Auffret; Amanda Warr; Andrew H Wiser; Maximilian O Press; Kyle W Langford; Ivan Liachko; Timothy J Snelling; Richard J Dewhurst; Alan W Walker; Rainer Roehe; Mick Watson Journal: Nat Commun Date: 2018-02-28 Impact factor: 14.919
Authors: Julia Rechenberger; Patroklos Samaras; Anna Jarzab; Juergen Behr; Martin Frejno; Ana Djukovic; Jaime Sanz; Eva M González-Barberá; Miguel Salavert; Jose Luis López-Hontangas; Karina B Xavier; Laurent Debrauwer; Jean-Marc Rolain; Miguel Sanz; Marc Garcia-Garcera; Mathias Wilhelm; Carles Ubeda; Bernhard Kuster Journal: Proteomes Date: 2019-01-08
Authors: Ramnik J Xavier; Curtis Huttenhower; Jason Lloyd-Price; Cesar Arze; Ashwin N Ananthakrishnan; Melanie Schirmer; Julian Avila-Pacheco; Tiffany W Poon; Elizabeth Andrews; Nadim J Ajami; Kevin S Bonham; Colin J Brislawn; David Casero; Holly Courtney; Antonio Gonzalez; Thomas G Graeber; A Brantley Hall; Kathleen Lake; Carol J Landers; Himel Mallick; Damian R Plichta; Mahadev Prasad; Gholamali Rahnavard; Jenny Sauk; Dmitry Shungin; Yoshiki Vázquez-Baeza; Richard A White; Jonathan Braun; Lee A Denson; Janet K Jansson; Rob Knight; Subra Kugathasan; Dermot P B McGovern; Joseph F Petrosino; Thaddeus S Stappenbeck; Harland S Winter; Clary B Clish; Eric A Franzosa; Hera Vlamakis Journal: Nature Date: 2019-05-29 Impact factor: 49.962
Authors: Stephen Nayfach; Simon Roux; Rekha Seshadri; Daniel Udwary; Neha Varghese; Frederik Schulz; Dongying Wu; David Paez-Espino; I-Min Chen; Marcel Huntemann; Krishna Palaniappan; Joshua Ladau; Supratim Mukherjee; T B K Reddy; Torben Nielsen; Edward Kirton; José P Faria; Janaka N Edirisinghe; Christopher S Henry; Sean P Jungbluth; Dylan Chivian; Paramvir Dehal; Elisha M Wood-Charlson; Adam P Arkin; Susannah G Tringe; Axel Visel; Tanja Woyke; Nigel J Mouncey; Natalia N Ivanova; Nikos C Kyrpides; Emiley A Eloe-Fadrosh Journal: Nat Biotechnol Date: 2020-11-09 Impact factor: 54.908
Authors: Cindy J Castelle; Laura A Hug; Kelly C Wrighton; Brian C Thomas; Kenneth H Williams; Dongying Wu; Susannah G Tringe; Steven W Singer; Jonathan A Eisen; Jillian F Banfield Journal: Nat Commun Date: 2013 Impact factor: 14.919