Nicolas Nalpas, Lesley Hoyles, Viktoria Anselm, Tariq Ganief, Laura Martinez-Gili, Cristina Grau, Irina Droste-Borel, Laetitia Davidovic, Xavier Altafaj, Marc-Emmanuel Dumas, Boris Macek.
Abstract
Intestinal microbiota plays a key role in shaping host homeostasis by regulating metabolism, immune responses and behavior. Its dysregulation has been associated with metabolic, immune and neuropsychiatric disorders and is accompanied by changes in bacterial metabolic regulation. Although proteomics is well suited for analysis of individual microbes, metaproteomics of fecal samples is challenging due to the physical structure of the sample, presence of contaminating host proteins and coexistence of hundreds of taxa. Furthermore, there is a lack of consensus regarding preparation of fecal samples, as well as downstream bioinformatic analyses following metaproteomics data acquisition. Here we assess sample preparation and data analysis strategies applied to mouse feces in a typical mass spectrometry-based metaproteomic experiment. We show that subtle changes in sample preparation protocols may influence interpretation of biological findings. Two-step database search strategies led to significant underestimation of false positive protein identifications. Unipept software provided the highest sensitivity and specificity in taxonomic annotation of the identified peptides of unknown origin. Comparison of matching metaproteome and metagenome data revealed a positive correlation between protein and gene abundances. Notably, nearly all functional categories of detected protein groups were differentially abundant in the metaproteome compared to what would be expected from the metagenome, highlighting the need to perform metaproteomics when studying complex microbiome samples.
Keywords: Metaproteomics; Mus musculus; mass spectrometry; microbiome; proteogenomics
Year: 2021 PMID: 34763597 PMCID: PMC8726736 DOI: 10.1080/19490976.2021.1994836
Source DB: PubMed Journal: Gut Microbes ISSN: 1949-0976
Figure 1. Low speed centrifugation impacts protein identification and taxonomic representation. (a) Number of MS/MS spectra, peptides and protein groups per sample for the comparison between LSC (red) and nLSC (blue) methods. (b) Number of identified MS/MS spectra, peptides and protein groups per sample for the comparison between LSC-in solution digestion (red), LSC-FASP (gray), nLSC-in solution digestion (blue) and nLSC-FASP (orange) methods. (a-b) Significance marks correspond to t-tests with N = 12 (a) or N = 6 (b): * p-value ≤ .05, ** ≤ .01, *** ≤ .001. (c) Hierarchical representation of Unipept-derived taxonomy (down to phylum level) for the peptides identified with the LSC and nLSC methods. The barplot represents the taxonomic abundance for LSC (red) and nLSC (blue) methods based on peptide counts (only for taxa identified with 3 or more peptides). (d) Overlap in the overall identified peptides or protein groups between the LSC and nLSC methods. (e) Volcano plot of the protein abundance comparison between LSC and nLSC approaches. Protein groups are called significant based on a paired t-test (N = 12) with FDR ≤ .01 and absolute fold-change ≥ 2.5. (f) KEGG pathway over-representation testing for the protein groups that significantly increase (red) or decrease (blue) in abundance between LSC and nLSC sample preparation approaches. Fisher exact-test threshold (gold dotted line) set to adjusted p-value ≤ .05.
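The significance rule used for the volcano plot in panel (e) can be sketched as a small classifier. This is an illustrative sketch only: the function name `classify_protein_group` and its parameters are hypothetical, and it assumes that "absolute fold-change ≥ 2.5" refers to a linear ratio, so the cutoff on the log2 scale is log2(2.5). The caption does not state the scale, and the direction of the ratio is left generic.

```python
import math


def classify_protein_group(log2_fc, adj_p, max_fdr=0.01, min_fold=2.5):
    """Label one protein group as 'up', 'down' or 'ns' (not significant).

    Assumption: min_fold is a linear fold-change, so it is converted to
    the log2 scale before being compared with the log2 fold-change.
    """
    if adj_p > max_fdr or abs(log2_fc) < math.log2(min_fold):
        return "ns"
    return "up" if log2_fc > 0 else "down"


# Made-up values, for illustration only.
labels = [
    classify_protein_group(2.0, 0.001),   # large change, passes FDR
    classify_protein_group(-1.5, 0.005),  # large negative change, passes FDR
    classify_protein_group(1.0, 0.001),   # change too small
    classify_protein_group(3.0, 0.02),    # fails the FDR threshold
]
```

With these inputs, `labels` is `["up", "down", "ns", "ns"]`.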
Figure 2. Two-step database search in combination with target-decoy strategy leads to a dramatic increase in false positive rate. (a) The protein group count is shown for single- or two-step search strategies across increasingly large protein sequence databases. Counts are color-coded per category, with eukaryote (gray), bacteria (red), contaminant (blue) and reverse (orange) hits. (b) The FDR is calculated for single- or two-step search strategies across increasingly large protein sequence databases. The FDR is calculated based on reverse hits only (circle shape) or reverse plus bacterial hits (triangle shape). (c & d) The sensitivity (c) and factual FDR (d) based on protein group identification across increasingly large protein sequence databases. The compared database search strategies are single-step (blue), two-step taxon filtering (gray) and two-step protein filtering without (red) or with (orange) database sectioning. Lines represent the median (and the shading corresponds to the standard error) from N = 8 LC-MS/MS runs. (e) The true positive count based on protein groups identified with a minimum of one (shaded coloring) or two (unshaded coloring) unique peptides for the largest database (i.e. 20). The compared database search strategies are single-step (blue), two-step taxon filtering (gray) and two-step protein filtering without (red) or with (orange) database sectioning. Bars and numbers indicate the median count, while error bars correspond to the standard deviation, from N = 8 LC-MS/MS runs. The overall maximum of the true positive count based on single-step search is indicated as a horizontal dotted line (gold).
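The two FDR variants in panel (b) can be illustrated with a minimal sketch. It assumes the FDR is taken as the fraction of all reported protein groups that fall into categories counted as false, and that bacterial hits are treated as known false positives in this benchmark, as the caption implies. The exact formula used in the study is not given here. The function name `fdr` and the counts below are made up for illustration.

```python
def fdr(counts, false_categories):
    """Fraction of reported hits that belong to categories counted as false.

    counts: mapping of hit category -> number of protein groups.
    false_categories: categories treated as false positives.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no hits to evaluate")
    false_hits = sum(counts.get(cat, 0) for cat in false_categories)
    return false_hits / total


# Illustrative counts only, not data from the study.
counts = {"eukaryote": 900, "bacteria": 60, "contaminant": 20, "reverse": 20}

decoy_fdr = fdr(counts, {"reverse"})                # reverse hits only: 0.02
factual_fdr = fdr(counts, {"reverse", "bacteria"})  # reverse + bacterial: 0.08
```

In this toy case the decoy-only estimate (2%) is a quarter of the estimate that also counts bacterial hits (8%), which is the kind of underestimation the figure title describes.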
Figure 3. Unipept software provides the most precise taxonomic annotation of MS-based peptide identification. (a) Percentage of taxon-annotated peptides at each taxonomic level for the comparison between Kraken2 (red), Diamond (gray) and Unipept (blue) software. (b) Assessment of the impact of the minimum PSM count per taxon on the F-measure for taxonomic annotation. The F-measure was compared between Kraken2 (red), Diamond (gray) and Unipept (blue) software. (c) Heatmap representing the correlation (Spearman ρ) in taxonomic abundance between sample input protein (expectation) and different taxonomic annotation software (i.e. Kraken2, Diamond and Unipept). The correlation was computed overall, as well as for each taxonomic level. (d) Organisms pooled in artificial samples are ranked based on the protein material input, as displayed in the left-most barplots (x-axis in log10 scale). The proteome size (ORFs) for these organisms on the UniProt web resource is displayed in the right-most barplot (x-axis in log10 scale). The heatmap compares the taxon identification across samples between Kraken2, Diamond and Unipept. (a-d) Samples from the study by Kleiner and colleagues, with N = 8. (e) Overlap in the overall identified taxa between the Kraken2 (red), Diamond (gray) and Unipept (blue) software. (f) A comparison of the F-measure distribution for taxonomic annotation between the Kraken2 (red), Diamond (gray) and Unipept (blue) software. Each point represents an individual mouse. (e-f) Samples from this study using mouse fecal material, with N = 38.
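The effect of a minimum PSM count on the F-measure, as in panel (b), can be sketched as follows. This uses the standard definition of the F-measure as the harmonic mean of precision and recall. The function name `f_measure`, the placeholder taxon names and the PSM counts are hypothetical, and the study's exact scoring procedure may differ.

```python
def f_measure(expected_taxa, psm_counts, min_psm=1):
    """F-measure of taxon calls against a known sample composition.

    expected_taxa: set of taxa truly present in the sample.
    psm_counts: mapping of reported taxon -> number of PSMs.
    min_psm: a taxon is called only if it has at least this many PSMs.
    """
    called = {taxon for taxon, n in psm_counts.items() if n >= min_psm}
    true_positives = len(called & expected_taxa)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(called)
    recall = true_positives / len(expected_taxa)
    return 2 * precision * recall / (precision + recall)


# Placeholder taxa and counts, for illustration only.
expected = {"taxon_a", "taxon_b", "taxon_c", "taxon_d"}
psm_counts = {"taxon_a": 10, "taxon_b": 5, "taxon_c": 1, "taxon_x": 1, "taxon_y": 2}

f_at_1 = f_measure(expected, psm_counts, min_psm=1)  # precision 3/5, recall 3/4
f_at_2 = f_measure(expected, psm_counts, min_psm=2)  # precision 2/3, recall 2/4
```

Raising the threshold removes one false call (`taxon_x`) but also one true taxon (`taxon_c`), so the F-measure here drops from 2/3 to 4/7. The threshold trades precision against recall.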
Figure 4. Functionally active pathways derived from the metaproteome differ from the metagenome potential. (a) Correlation is shown between the abundance of each protein group (metaproteome) and the corresponding gene “group” (metagenome). Correlation was tested using Spearman’s rank correlation and the p-value was adjusted for multiple testing using Benjamini-Hochberg correction. Significantly positively correlating protein/gene groups are in red, while significantly negatively correlating protein/gene groups are in green (adjusted p-value ≤ .05). (b) GSEA of KEGG pathways based on ranking of the protein/gene group correlations. Pathway node color corresponds to the GSEA adjusted p-value and node size matches the number of protein/gene groups assigned to the pathway. (c) Comparison of the proportion of selected KEGG functional categories (level 2) between metaproteome (red) and metagenome (gray). Paired t-test p-values are indicated (N = 38). (d) GSEA of KEGG pathways based on ranking of t-test results from KEGG orthology proportion between metaproteome and metagenome. KEGG pathways are color-coded based on KEGG functional categories (level 2). Only significantly over-represented KEGG pathways are shown, with adjusted p-value ≤ .05. (e) Interaction network between KEGG orthologies and KEGG pathways for the KEGG functional category “Protein families: genetic information processing”. Pathway node size corresponds to the number of KEGG orthologies associated with it. KEGG orthologies are color-coded based on the directional adjusted p-value from the t-test comparison between metaproteome and metagenome.
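The multiple-testing step mentioned in panel (a) can be illustrated with a minimal pure-Python sketch of the Benjamini-Hochberg adjustment. The function name `benjamini_hochberg` and the p-values are hypothetical, and the study itself presumably relied on an established statistics package rather than code like this.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values, e.g. one per protein/gene group correlation test.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
significant = [p <= 0.05 for p in adj]
```

For these four values only the first test stays at or below the .05 threshold after adjustment, although three of the raw p-values were below .05.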
Performance comparison of different sample preparation and data analysis steps. In bold are the best methods according to the assessed criteria: peptide/protein count, host/dietary contamination, Firmicutes or Bacteroidetes representation, time efficiency, FDR, identification rate, taxon-assigned peptides and taxonomic identification precision. Performance is shown with a minus sign for poor, an equals sign for similar/no difference, or a plus sign for good performance.
| | | Peptide/protein count | Host/dietary contamination | Firmicutes | Bacteroidetes | Time efficiency | FDR | Identification rate | Taxon assigned peptides | Precision |
|---|---|---|---|---|---|---|---|---|---|---|
| Centrifugation | **LSC** | + | - | - | + | - | | | | |
| | nLSC | - | + | + | - | + | | | | |
| Digestion | **In-solution** | + | - | | | | | | | |
| | FASP | - | + | | | | | | | |
| Search strategy | **Single-step** | - | + | + | - | | | | | |
| | Two-step protein | + | - | – | + | | | | | |
| | Two-step sections | + | – | - | + | | | | | |
| | Two-step taxa | - | - | + | - | | | | | |
| Taxon quantification | Kraken2 | + | + | – | | | | | | |
| | Diamond | – | - | - | | | | | | |
| | **Unipept** | + | + | + | | | | | | |