J Alfredo Blakeley-Ruiz1,2, Manuel Kleiner2.
Abstract
Mass spectrometry-based metaproteomics has emerged as a prominent technique for interrogating the functions of specific organisms in microbial communities, in addition to total community function. Identifying proteins by mass spectrometry requires matching mass spectra of fragmented peptide ions to a database of protein sequences corresponding to the proteins in the sample. This sequence database determines which protein sequences can be identified from the measurement, and as such the taxonomic and functional information that can be inferred from a metaproteomics measurement. Thus, the construction of the protein sequence database directly impacts the outcome of any metaproteomics study. Several factors, such as source of sequence information and database curation, need to be considered during database construction to maximize accurate protein identifications traceable to the species of origin. In this review, we provide an overview of existing strategies for database construction and the relevant studies that have sought to test and validate these strategies. Based on this review of the literature and our experience, we provide a decision tree and best practices for choosing and implementing database construction strategies.
Keywords: Metagenomics; Metaproteome; Microbial community; Microbial ecology; Microbiome; Microbiota; Multi-omics
Year: 2022 PMID: 35242286 PMCID: PMC8861567 DOI: 10.1016/j.csbj.2022.01.018
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. Overview of how protein sequence database construction impacts protein identification. The figure shows the wet-lab experimental portion of acquiring metaproteomic mass spectrometry data and the computational steps performed on the database that mirror the wet-lab steps. In the database search step, the experimental MS2 spectra are compared with the spectra generated in silico from the database. The figure distinguishes two database quality levels (curated and uncurated) and their ultimate impact on the output and interpretation of metaproteomic experiments.
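
The computational side of Fig. 1 can be illustrated with a minimal sketch of in silico digestion, assuming trypsin-like cleavage (after K or R, but not before P), no missed cleavages, and no modifications. The function names, the `min_len` cutoff, and the toy sequences are hypothetical and chosen only for illustration; real search engines apply more elaborate rules.

```python
from collections import defaultdict

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
# Non-standard residue codes (e.g. X, U) are not handled in this sketch.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056  # added once per peptide for the free termini


def tryptic_digest(protein, min_len=6):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return [p for p in peptides if len(p) >= min_len]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER_MASS


def build_peptide_index(database, min_len=6):
    """Map each in silico peptide to the set of proteins that contain it."""
    index = defaultdict(set)
    for protein_id, sequence in database.items():
        for peptide in tryptic_digest(sequence, min_len):
            index[peptide].add(protein_id)
    return dict(index)


# Toy database with made-up sequences; protA and protB share one peptide.
toy_database = {
    "protA": "AAAAAAKPGGGGGGRCCCCCCK",
    "protB": "MSSSSSRCCCCCCK",
}
index = build_peptide_index(toy_database)
```

Only peptides present in such an index can be matched to a spectrum, which is why sequences missing from the database cannot be identified, and why a peptide shared by several entries cannot be attributed to a single protein without further inference.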
Characteristics, advantages and disadvantages of sequence sources for metaproteomic databases.
| | Matched metagenome | Unmatched metagenome | Unrestricted reference database | Restricted database (amplicon sequencing) | Restricted database (defined community) |
|---|---|---|---|---|---|
| Monetary cost | Sample-type dependent: $100-$2,000 per sample or per set of pooled samples | Free | Free | $50-$100 per sample | Free |
| Time cost (labor & computation) | Months to a year if genome-resolved, otherwise weeks | Days | Days | Weeks | Days |
| Presence of sequences representing proteins not actually in the sample | Low, sequences are derived from sample | Medium, sequences are derived from system but not specific sample | High, sequences represent all of sequenced life | Medium, sequences are derived from same taxa as the sample, but not the same genomes | Low, exact composition is known and reference database is used |
| Likelihood of sequences missing | Low to medium, dependent on depth of sequencing and inclusion of unbinned sequences | Medium to high, dependent on similarity between previously sequenced samples and samples measured by metaproteomics | Medium to high: even if relatives of community members are present in public repositories, closely related strains can differ significantly in gene content | Medium to high: even if representative genomes for identified taxa are available, closely related strains can differ significantly in gene content | None to low |
| Potential sources for redundant (highly similar or identical) sequences | Artificial: bringing together sequences from sequential gene prediction and multiple assemblies. Biological: similar genes in different strains from the same species or genus. | Artificial: bringing together sequences from sequential gene prediction and multiple assemblies. Biological: similar genes in different strains from the same species or genus. | Artificial: bringing together sequences from multiple sources. | Artificial: bringing together sequences from multiple sources. Biological: similar genes in different strains from the same species or genus. | Biological: similar genes in different strains from the same species or genus. |
| Taxonomic resolution | If genome-resolved subspecies to species, otherwise genus to phylum based on LCA to reference databases | If genome-resolved subspecies to species, otherwise genus to phylum based on LCA to reference databases | Genus to phylum based on LCA of all matches in the reference databases | Genus to phylum based on LCA to reference databases | Subspecies to species |
| Likelihood of misidentifying taxa | Low | Medium, dependent on relevance of metagenome to sample | High, many sequences missing from database and many sequences in the database are not in the sample | Medium, dependent on relevance of selected reference genomes to actual genomes in sample | Low |
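
The redundancy and taxonomic-resolution rows of the table are connected: when identical sequences from different strains or sources are collapsed into one database entry, the entry can only be assigned to the lowest common ancestor (LCA) of its sources. The following is a minimal sketch of that idea, assuming each entry carries a root-to-leaf lineage tuple. The function names and the placeholder taxa are hypothetical, and only exact duplicates are merged here, whereas clustering tools typically also merge near-identical sequences.

```python
from collections import defaultdict


def collapse_identical(entries):
    """Group (protein_id, lineage, sequence) entries by exact sequence."""
    groups = defaultdict(list)
    for protein_id, lineage, sequence in entries:
        groups[sequence].append((protein_id, lineage))
    return dict(groups)


def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of root-to-leaf lineage tuples."""
    lca = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # lineages diverge at this rank
        lca.append(ranks[0])
    return tuple(lca)


def annotate_groups(entries):
    """Assign each unique sequence the LCA of all lineages it came from."""
    return {
        sequence: lowest_common_ancestor([lineage for _, lineage in members])
        for sequence, members in collapse_identical(entries).items()
    }


# Placeholder taxa and made-up sequences, for illustration only.
toy_entries = [
    ("p1", ("Bacteria", "PhylumX", "GenusA", "GenusA sp1"), "MKTAYIAK"),
    ("p2", ("Bacteria", "PhylumX", "GenusA", "GenusA sp2"), "MKTAYIAK"),
    ("p3", ("Bacteria", "PhylumX", "GenusB", "GenusB sp1"), "MSLLQRWK"),
]
annotations = annotate_groups(toy_entries)
```

In this toy case the sequence shared by two species of the same genus resolves only to genus level, while the unique sequence keeps its species-level assignment, mirroring the "genus to phylum" versus "subspecies to species" entries in the table.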
Fig. 2. Decision tree reflecting the steps to take when constructing a protein sequence database for metaproteomics. We define a synthetic community as one that is designed by the researcher (e.g. defined communities, mock communities, gnotobiotic systems). We define a natural community as a community taken from the environment (e.g. soil, fecal, ocean).