
Considerations for constructing a protein sequence database for metaproteomics.

J Alfredo Blakeley-Ruiz^1,2, Manuel Kleiner².

Abstract

Mass spectrometry-based metaproteomics has emerged as a prominent technique for interrogating the functions of specific organisms in microbial communities, in addition to total community function. Identifying proteins by mass spectrometry requires matching mass spectra of fragmented peptide ions to a database of protein sequences corresponding to the proteins in the sample. This sequence database determines which protein sequences can be identified from the measurement, and as such the taxonomic and functional information that can be inferred from a metaproteomics measurement. Thus, the construction of the protein sequence database directly impacts the outcome of any metaproteomics study. Several factors, such as source of sequence information and database curation, need to be considered during database construction to maximize accurate protein identifications traceable to the species of origin. In this review, we provide an overview of existing strategies for database construction and the relevant studies that have sought to test and validate these strategies. Based on this review of the literature and our experience we provide a decision tree and best practices for choosing and implementing database construction strategies.


Keywords: Metagenomics; Metaproteome; Microbial community; Microbial ecology; Microbiome; Microbiota; Multi-omics

Year: 2022 PMID: 35242286 PMCID: PMC8861567 DOI: 10.1016/j.csbj.2022.01.018

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271

Introduction

Metaproteomics is an umbrella term encompassing approaches for the large-scale identification and quantification of proteins from microbial communities [1]. Metaproteomics provides insights into the expressed genes and thus actual phenotypes at the molecular level, whereas the more popular DNA sequencing approaches can only determine functional potential by profiling gene content [2], [3], [4]. As part of the molecular phenotype of the cell, proteins provide more direct insight into what is happening physiologically in the cells within microbial communities [5]. For example, a study of symbiotic marine worms discovered abundant enzymes for the use of carbon monoxide as an energy source in the symbionts, revealing the first animal known to be able to use this poisonous gas [6]. Another study identified an increase in the abundances of iron sequestration enzymes in the microbiota of human preterm infants with necrotizing enterocolitis (NEC), suggesting an association between iron homeostasis and NEC [7]. In mouse gut communities, Patnode et al. found distinct expression of polysaccharide utilization loci (PULs) in the presence of different food-grade fibers and showed that these PULs were necessary for the competitive fitness of specific Bacteroides species in the presence of these fibers [8]. Finally, Li et al. confirmed the assimilation of methanol by microbes in the plant rhizosphere by first detecting abundant methanol dehydrogenases and associated oxidation pathways, then using 13C-labeled methanol to confirm the incorporation of labeled carbon into proteins of organisms that expressed these proteins [9].

The leading technique for identifying and quantifying proteins in biological samples is called shotgun proteomics [10]. In shotgun proteomics, proteins are first isolated from samples and then, in some workflows, separated by gel electrophoresis. Isolated proteins are digested into peptides using trypsin, and these peptides are separated by liquid chromatography based on physicochemical properties before analysis in a mass spectrometer. Both intact peptide masses (MS1) and, after fragmentation, the masses of their fragments (MS2) are measured in the mass spectrometer. This technique is called tandem mass spectrometry. To identify proteins, a bioinformatics method called database search matches peptide tandem mass spectra to theoretical spectra derived from an in silico digested protein database (Fig. 1) [11], [12], [13], [14]. Tens of thousands of peptides can be analyzed using this method [15], [16], [17]. These peptides can subsequently be used to infer the presence of thousands of proteins in the sample.
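The in silico digestion step mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular search engine. It assumes the commonly used trypsin rule (cleavage after K or R unless the next residue is P); the function names, the length limits, and the toy sequence are hypothetical.

```python
def cleavage_fragments(sequence):
    """Split a protein after every K or R that is not followed by P (assumed rule)."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # C-terminal remainder
    return fragments


def tryptic_peptides(sequence, missed_cleavages=0, min_len=1, max_len=50):
    """Return the set of peptides, allowing up to `missed_cleavages` skipped sites."""
    fragments = cleavage_fragments(sequence)
    peptides = set()
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides


# Hypothetical toy sequence, not a real protein.
toy_protein = "AAAKGGGRPCCCKDDD"
print(sorted(tryptic_peptides(toy_protein)))
print(sorted(tryptic_peptides(toy_protein, missed_cleavages=1)))
```

Allowing missed cleavages enlarges the set of candidate peptides per protein, which is one of the ways the search space grows beyond the raw number of database sequences.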

Fig. 1

Overview of how protein sequence database construction impacts protein identification. The figure shows an overview of the wet-lab experimental portion of acquiring metaproteomic mass spectrometry data and the computational steps done on the database that mirror the wet-lab steps. In the database search step, the experimental MS2 spectra are compared with the in silico generated spectra from the database. We differentiate in the figure between two distinct database quality levels (curated and uncurated) and their ultimate impact on the output and interpretation of metaproteomic experiments.

While the shotgun proteomics approaches described above were originally developed to analyze proteomes of individual organisms, they have been adapted in a very similar form for metaproteomics [1], [5], [18]. However, metaproteomics comes with unique challenges not encountered when working with single organisms, including (1) the difficulty of obtaining protein sequences from the organisms in the often highly diverse microbial communities and (2) the fact that the presence of homologous sequences from related organisms can make protein inference much more difficult [19], [20]. Some researchers avoid the protein inference problem by using a peptide-centric approach, which skips the protein inference step and infers taxonomy and function directly from detected peptides by matching the peptide sequences to peptide sequences generated from public protein sequence reference databases [21], [22]. The peptide-centric approach has the advantage that it avoids the protein inference step; however, it comes at the cost of being unable to know which specific proteins are present in a sample, and thus functions and the presence of taxonomic groups have to be aggregated at a relatively unspecific level. In this review, we focus on a protein-centric approach, which seeks to identify and quantify specific proteins, which are ultimately the biological unit investigated with metaproteomics.
The protein-centric approach, when coupled with a well curated protein sequence database, is more sensitive and selective for taxonomic and functional annotation than the peptide-centric approach as key information for the taxonomic and functional classification of a protein can be accessed [23], [24]. This information includes the genome of origin of the protein sequence [24], [25], [26], which provides information on the taxon of origin and the neighborhood of the expressed gene, which, in prokaryotes, can often be very informative for deriving protein function [27], [28]. Our intended audiences for this review are (1) proteomics experts who have not previously worked with microbial communities and as such may be unfamiliar with some of the additional challenges involved in database construction for metaproteomics, (2) microbiologists who are interested in metaproteomics and are looking for concise guidance on specific elements of the metaproteomic process, and (3) metaproteomics experts seeking an overview of which database construction strategies have been developed and validated so far and where further need for development and validation still exists. Here we focus on construction of the protein sequence database, a key element of any metaproteomic study. This review is divided into two sections. In the first section we describe how protein database source and construction can impact peptide identification, protein inference, and taxonomic assignment. In the second section we provide a decision making framework for constructing protein databases for metaproteomics. For other topics, we refer the reader to articles on the overall metaproteomic workflow [18], [29], and on methodological considerations for specific components of the metaproteomic workflow [30], [31], [32], [33], [34], [35], [36]. 
Tandem mass spectrometry: A mass spectrometry technique in which (peptide) ions are isolated by their mass to charge ratio and then fragmented in the mass spectrometer; both the initial mass of the ion and the masses of the resulting fragment ions are recorded.
MS1: Mass spectrum of intact (peptide) ions, i.e. not fragmented.
MS2: Mass spectrum of (peptide) fragment ions generated by tandem mass spectrometry.
Database search algorithm: An algorithm that identifies peptides from tandem mass spectrometry data by matching experimental MS2 spectra to MS2 spectra generated in silico from a protein database. The algorithm also scores and ranks the best matches for each MS2 spectrum.
Protein database: Database of protein sequences in FASTA file format used by a database search algorithm.
Peptide-spectrum match (PSM): Match between an MS2 spectrum and a peptide sequence found by a database search algorithm.
False discovery rate (FDR): Proportion of PSMs, peptides or proteins passing selection criteria (e.g. search algorithm score threshold) that are incorrect.
Target-decoy search strategy: A database search strategy to estimate false discovery rates in which the mass spectrometry data is searched against a database made up of correct protein sequences (targets) and incorrect protein sequences (decoys). Commonly, decoy sequences are generated by reversing the sequences in the target database.
Metagenome-assembled genome (MAG): Genome fragments (contigs) extracted from metagenomic assemblies and combined into what is thought to be a close representation of an actual individual genome that matches a specific strain/species in the sequenced sample. Sometimes the product of the initial contig grouping is called a “bin”, and the bin is considered a “MAG” only after various quality checks for completeness and contamination show that quality thresholds are met [37].
Genome-resolved metagenomics: A metagenomic processing strategy in which the goal is to extract metagenome-assembled genomes. At the center of this strategy are binning methods, which group sequenced genome fragments into MAGs based on characteristics such as tetranucleotide frequency and sequencing coverage.
Matched metagenome database: Protein database that is derived from metagenomic sequencing of samples that match the ones used for metaproteomics.
Reference protein database: Protein database derived from public repositories such as NCBI or UniProt.
Unmatched metagenome: Protein database derived from previous metagenomics and isolate sequencing efforts of a specific system. Sometimes these metagenomes are published as gene catalogs. Examples include the mouse gene catalog [38] or the IGC [39].
Protein unique peptide: A peptide identified by a database search algorithm that is unique to a single protein sequence in the database.
Average nucleotide identity (ANI): A measure of genome similarity that is commonly used to classify genomes into species [40], [41].
Protein group: A group of similar, but not identical, protein sequences that share identified peptides and which cannot be distinguished due to the lack of protein unique peptides.
Lowest common ancestor (LCA): A taxonomy assignment strategy that matches a sequence (raw read, protein, or contig) to a reference database and assigns the taxonomy of the lowest unambiguous taxonomic rank of similar sequences in the reference database.
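The lowest common ancestor definition can be made concrete with a short Python sketch. It assumes that each matching reference sequence comes with a lineage listed from the highest to the lowest rank, and it returns the longest shared prefix of those lineages. The function name and the simplified four-rank lineages (domain, phylum, genus, species) are illustrative assumptions, not part of any specific tool.

```python
def lowest_common_ancestor(lineages):
    """Return the ranks shared by all lineages, from the root downwards."""
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):  # walk rank by rank across all matches
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break  # first disagreement: stop at the previous rank
    return shared


# Simplified, illustrative lineages for the matches of one query sequence.
matches = [
    ["Bacteria", "Bacteroidota", "Bacteroides", "Bacteroides ovatus"],
    ["Bacteria", "Bacteroidota", "Bacteroides", "Bacteroides thetaiotaomicron"],
]
print(lowest_common_ancestor(matches))
```

In this toy case the assignment stops at the genus, which illustrates why LCA-based annotation of sequences with close homologs in several species yields coarser taxonomic resolution than assignment via a genome of origin.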

Protein database source and construction affects peptide identification, protein inference, and taxonomic assignment in metaproteomics

Assigning peptide sequences to tandem mass spectra and inferring proteins depends on the sequence database provided to the database search algorithm. In this section, we provide an in-depth review of how the sequence database is interconnected with peptide identification and protein inference, and how the source of sequences in the database influences both. Furthermore, we review studies that have sought to evaluate the impact of different database sources and construction strategies on the quality and information content of metaproteomic data.

Peptide identification by database search algorithms

To understand the importance of protein sequence databases for shotgun proteomics, it is critical to understand how database search algorithms work. For a detailed explanation, we refer readers to an excellent article by Marcotte [42]. Briefly, database search relies on a target protein sequence database to provide a search space of theoretical MS2 spectra for peptides that might be in the sample. The algorithm tries to match the experimental peptide MS2 spectra to these theoretical spectra. If a spectrum is successfully matched to a peptide, this is referred to as a peptide-spectrum match (PSM). Database search algorithms score and rank PSMs based on the similarity of the match between theoretical and experimental MS2 spectra. Since the introduction of the first such algorithm, SEQUEST, in 1994, many additional algorithms have been developed with accompanying improvements in the scoring scheme and search speed [11], [43], [44], [45], [46]. Metaproteomic studies tend to have more mass spectra and a larger search space than single organism studies, making certain search algorithms unable to handle the data due to time or memory limitations. There are, however, many database search platforms able to process metaproteomic data. These platforms include MetaproteomeAnalyzer [47], MetaProIQ [48], and Sipros Ensemble [49], which have been built specifically for metaproteomic data, and more general proteomics pipelines, such as the open source Crux toolkit [50] and commercial pipelines such as Thermo Fisher’s Proteome Discoverer and Bioinformatics Solutions’ PEAKS. After database search, PSMs are filtered to retain only high-quality PSMs. Commonly, PSMs are filtered based on database search algorithm scores and peptide properties such as length and missed cleavages to meet a specified false discovery rate (FDR). To calculate the FDR, a decoy database made up of reversed or randomized sequences from the target database is included in the database search [51]. 
The FDR is calculated using a target-decoy competition, where the top PSM (target or decoy) for an MS2 spectrum is retained and the FDR is the proportion of total PSMs that are decoy hits at the score threshold used [51]. Crude FDR filtering using individual database search algorithm scores can lead to biased removal of PSMs with specific properties. This issue has been addressed by the development of machine learning algorithms, such as Percolator [52] and Sipros Ensemble [49], which consider a large diversity of scores and peptide properties for FDR-based PSM filtering. Advice on FDR thresholds is out of scope for this article, but thresholds typically range between one and five percent. PSMs from the target database whose score passes the FDR threshold are considered identified and used for protein inference. A problem with the target-decoy competition for FDR calculation that the literature has just started to address is that it assumes that there is only one peptide per MS2 spectrum; however, sometimes the mass spectrometer co-isolates multiple peptides with similar mass to charge ratios. The higher complexity of metaproteomics studies increases the probability of these co-isolation events. Though some solutions have been suggested [53], [54], [55], further discussion is beyond the scope of this review. The size and comprehensiveness of the protein database impact the number of PSMs that can be identified. The comprehensiveness of the database (i.e. how many of the proteins in the sample are represented as sequences in the database) constrains the maximum number of peptides that can be detected in a sample, as peptides not present in the database cannot be identified by database search. Increased database size, especially with regard to sequences not expected to be present in the sample, increases the number of high-scoring random hits to both the target and decoy portion of the database. 
These high-scoring random hits are false positives that lead to a tighter score threshold needed to attain the desired false discovery rate [56], [57]. The tighter score threshold needed for large databases with unnecessary sequences leads to the filtering out of true PSMs that would be retained with score thresholds needed for a smaller, better fitting database. Small non-comprehensive databases experience a similar problem, when not handled carefully, as similar sequences (very similar precursor mass and some shared y- and b-ions) can receive the best score when the real sequence is absent [58]. Thus, both a large database, especially one with many sequences not relevant to the sample, and a small non-comprehensive database limit the potential for peptide identifications. One proposed solution for issues with very large databases is a two-step or multi-step search approach. In this approach, searches made with higher (>5%) or no FDR thresholding are used to generate a protein database restricted to just sequences that had a match in the initial search against the very large database. Two-step approaches have been shown to increase peptide identifications when databases are very large [59]; however, the validity of this kind of approach is debated because it takes advantage of prior information to improve the FDR [32], [60], [61]. Some solutions to this problem have been suggested but not fully validated [58], [62]. Automated peptide de novo sequencing from MS2 spectra represents an alternative to database search algorithms for obtaining peptide sequences from mass spectrometry data [63], [64], [65]. While no longer constrained by the database search space for peptide identification [66], peptide de novo sequencing approaches typically generate fewer peptides than database search [67], and still depend on a protein sequence database for protein inference [68] and subsequent biological interpretation [66]. 
Thus, de novo peptide sequencing approaches do not overcome the need for high quality protein sequence databases in metaproteomics.
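The target-decoy logic described in this section can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions and does not reproduce the procedure of any tool named above: decoys are built by sequence reversal, each spectrum keeps only its best-scoring candidate, and the FDR at a score threshold is estimated as the number of decoy PSMs divided by the number of target PSMs, which is one commonly used estimator. All function names, the `DECOY_` prefix, and the scores are hypothetical.

```python
def reverse_decoys(targets, prefix="DECOY_"):
    """Build a decoy entry for each target protein by reversing its sequence."""
    return {prefix + protein_id: seq[::-1] for protein_id, seq in targets.items()}


def target_decoy_competition(candidates):
    """Keep the top-scoring (score, is_decoy) candidate per spectrum.

    On a score tie the decoy wins because True > False, a conservative choice.
    """
    return [max(hits) for hits in candidates.values() if hits]


def estimate_fdr(psms, threshold):
    """Estimate FDR as decoy PSMs / target PSMs at or above the threshold."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def score_threshold_for_fdr(psms, max_fdr):
    """Return the lowest score threshold whose estimated FDR is <= max_fdr."""
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        # The estimate is not monotone in the threshold, so scan all values.
        if estimate_fdr(psms, threshold) <= max_fdr:
            best = threshold
    return best


# Hypothetical PSMs after competition: (score, is_decoy).
psms = [(10, False), (9, False), (8, False), (7, True),
        (6, False), (5, False), (4, True), (3, True)]
print(score_threshold_for_fdr(psms, max_fdr=0.25))
print(score_threshold_for_fdr(psms, max_fdr=0.10))
```

In this toy example the stricter FDR limit forces a higher score threshold and discards two target PSMs. More high-scoring random decoy hits, as expected from an oversized database, push the threshold upward in the same way.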

Protein inference after database search

Peptides that pass the FDR threshold can be used to infer proteins by mapping identified peptide sequences back to proteins in the protein database. A challenge with this approach, known as the protein inference problem, is that some peptides are shared between protein sequences, making it difficult to determine which protein was the actual source of the peptide and should thus be identified as present in the sample [20]. Metaproteomics exacerbates this problem on two levels. First, metaproteomic samples often contain many relatively closely related strains/species which have a partially shared set of homologous proteins. Depending on the sequence similarity between these homologs, a set of peptides that can be derived from these proteins will be identical between multiple strains/species, making them “non-unique” to a protein from a specific strain/species. These “non-unique” peptides can therefore only be used to determine the presence of a protein, but not its source species/strain. Second, sequence databases often contain very similar or identical sequences. This sequence redundancy can be caused either by having multiple identifiers for identical or highly similar sequences as a result of bringing together data from multiple metagenomic assemblies or public databases, or by the presence of strains/species with very similar sequences in a sequenced sample. Ultimately, both the presence of proteins that yield identical peptides and sequence database redundancy lead to the same outcome of protein inference, namely the ambiguous matching of peptides to multiple sequences in the database. While no perfect approach exists to address the protein inference challenge in metaproteomics, or even in single-organism proteomics, there are several approaches to limit the impact of protein inference challenges on metaproteomic data interpretation. To address the protein inference challenge, several metrics can be employed to improve confidence in protein identification. 
The most critical is filtering of inferred proteins to attain a specific FDR cutoff. FDRs on the protein level are estimated with the same target-decoy approach described above for PSM identification. Different protein inference methods and algorithms use a diversity of parameters to filter inferred proteins. These parameters can include the number of peptides matching to a protein [15], [69], [70], uniqueness of peptides matching to a protein (unique peptides) [26], [71], and the quality scores or FDR of peptides matching to a protein [72], [73]. In addition to FDR filtering, parsimony methods are frequently used to group proteins that share peptides, but have no independent evidence in the form of unique peptides, into protein groups of shared evidence [74], [75], [76]. Thus, the detection of peptides unique to a protein sequence critically impacts the interpretation of identified proteins in metaproteomics [24]. Proteins with unique peptides have the advantage of being unambiguously identified and can be directly linked back to the taxon of origin if the protein sequence came from a taxonomically classified genome, whereas protein groups may not be attributable to a specific taxon, but only to a wider group of taxa. Protein groups, however, can still provide a clear identification of a particular protein function. To deal with unnecessary sequence redundancy in databases and to increase the number of identified unique peptides, metaproteomics researchers frequently use a sequence clustering algorithm, such as CD-HIT [77] or UCLUST [78], to group highly similar or identical protein sequences, adding only one representative protein sequence with a single identifier to the protein sequence database [18]. This approach has been applied in many studies [7], [26], [30], [70], [79] and is discussed in detail in section 3.3.
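The relationship between shared peptides, unique peptides, and protein groups can be sketched in Python. This is a deliberately simplified illustration rather than one of the cited parsimony algorithms: peptides are mapped to proteins by plain substring matching, and proteins are grouped only when they are supported by exactly the same set of identified peptides. The function names and toy sequences are hypothetical.

```python
from collections import defaultdict


def map_peptides_to_proteins(identified_peptides, database):
    """Map each identified peptide to all database proteins that contain it."""
    return {
        peptide: {pid for pid, seq in database.items() if peptide in seq}
        for peptide in identified_peptides
    }


def unique_peptides(peptide_to_proteins):
    """Peptides that match exactly one protein sequence in the database."""
    return {pep for pep, prots in peptide_to_proteins.items() if len(prots) == 1}


def group_proteins(peptide_to_proteins):
    """Group proteins that share an identical set of identified peptides."""
    protein_to_peptides = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides[protein].add(peptide)
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].append(protein)
    return {tuple(sorted(prots)): set(peps) for peps, prots in groups.items()}


# Hypothetical homologs from three strains; B1 and C1 differ only in a
# region that yielded no identified peptide.
database = {
    "A1": "AAAKCCCKDDDK",
    "B1": "AAAKCCCKEEEK",
    "C1": "AAAKCCCKEEEKFFFR",
}
mapping = map_peptides_to_proteins(["AAAK", "CCCK", "DDDK", "EEEK"], database)
print(unique_peptides(mapping))
print(group_proteins(mapping))
```

Here only A1 has a unique peptide and can be traced to a single strain, while B1 and C1 collapse into one protein group whose function may be clear but whose taxonomic origin remains ambiguous.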

Sources of sequence information to construct protein databases for metaproteomic studies

In metaproteomics, creating a protein database that is both comprehensive and not larger than it needs to be is particularly challenging. In a proteomics study with only one organism, the associated protein sequence database comes from predicted or known protein sequences derived from the organism's genome [56]. Ideally, this protein database comes from publicly available and reviewed reference proteomes, such as those found in NCBI RefSeq [80] or UniProt [81]. In metaproteomics, sequences from multiple organisms need to be acquired to create a comprehensive representation of the protein sequences likely to be in the sample. With the exception of artificially created, fully defined, communities [8], [82] or symbioses with a limited number of highly specific symbionts [6], [83], [84], [85], [86], it is often not possible to create this database by just combining the relevant RefSeq/UniProt reference proteomes or by using previous sequencing initiatives because the composition of environmental microbial communities cannot be known in advance without some form of prior sequencing. Even if the composition of a community is known (e.g. from amplicon sequencing), genomes for the organisms in the community are often not available in public databases [87], [88], [174]. To complicate matters further, the microbial composition of a system can in many cases differ from sample to sample, as is the case for intestinal microbiome samples [89], [90]. All this makes assembling the set of sequences needed for a metaproteomic study a task that requires careful consideration to obtain and combine the best possible set of sequences. In the following, we describe the different types of sources of sequence information that have been used in the past to create metaproteomic protein databases.

Matched metagenomes: sequences collected from metagenomes assembled from a set of samples that match the metaproteomic samples [26], [70], [79]. The main advantage of this sequence source is that it provides a set of protein sequences derived from the genomes of the organisms present in the samples interrogated by metaproteomics. With extensive processing, these matched metagenomes can be made genome-resolved by extracting metagenome-assembled genomes (MAGs) using a variety of binning methods to assign genome fragments (contigs) from a metagenomic assembly to individual genomes (see section 3.2) [120], [121], [122]. Protein sequences predicted from MAGs have the advantage that more information is available for analyses after identification, such as genomic neighborhood and taxonomic classification by the genome of origin [7], [25], [26]. Genome-resolved metagenomics, however, has the limitation that it has historically required extensive metagenomics expertise often unavailable to mass spectrometry groups. It is also currently infeasible to bin all the sequences in a metagenome into MAGs, which leads to loss of information if only binned sequences are used [6]. To save time, the protein database can also be assembled from genes predicted directly from raw sequence information [88], [94] or unbinned metagenomic assemblies [70], [79], [88]. This comes at the cost of lost taxonomic resolution as discussed in section 3.4.

Unmatched metagenomes: sequences collected from metagenomic data from the same system (e.g., human or mouse gut) but different samples potentially from different studies and laboratories [48], [169], sometimes called gene catalogs [38], [39], [91]. Use of sequences from unmatched metagenomes is most common in human microbiome studies where there have already been massive sequencing efforts [48], [169]. The use of results from these sequencing initiatives for metaproteomics results in databases of millions of sequences that can make it difficult to achieve a high number of identifications at a low false discovery rate without a multi-step search strategy [48], [59], [170]. This approach also limits the taxonomic resolution to higher levels (e.g. phylum or genus level) when MAGs are not included [169], [170], [171]. In instances where the community is known to be the same despite different samples, such as microbial symbioses with highly specific symbionts, this approach can be equivalent to a matched metagenome [6], [83], [84], [85], [86].

Unrestricted reference databases: this approach uses all of the sequences from one of the major sequence repositories (e.g. NCBI RefSeq). Large unrestricted databases have the advantage of covering a large sequence space; however, they suffer from not being specific to the sample, leading to the possibility of false hits [57], [102] and low identification numbers [88], [92], [93], [94] due to the tight PSM identification scores needed to achieve a desired FDR with a large database (see section 2.1). Also, public databases are currently very incomplete with regard to genome coverage for microbial communities [87], [88], [174], as evidenced by the fact that 80%-90% of the MAGs in metagenome projects belong to unnamed species absent from public repositories [172], [173].

Restricted reference databases: this approach uses prior knowledge of the community’s composition to acquire taxonomically relevant reference proteomes. For artificially defined communities (e.g., germ-free mice inoculated with a set of bacterial isolates), this approach is equivalent to or better than matched metagenomes in terms of taxonomic resolution and completeness because the exact community composition is known in advance and reference proteomes of the specific members can be used [8], [82]. When the exact community composition is not known, an alternative approach is to use results from a phylogenetic marker gene analysis of the sample, such as 16S rRNA gene sequencing, to identify reference genomes that correspond most closely to the phylogeny of the marker genes [88], [106]. This approach relies on the relevant reference genomes being present in a public repository, and depends on strains from the same species, let alone genus, having similar gene content. Many strains from the same species, however, do not have the same gene content as evidenced by studies on the massive pangenomes of some microbial species such as Legionella pneumophila [175] and Escherichia coli [176], which have variable gene content between strains.

We outline the advantages and disadvantages of these different sources of sequence information in Table 1. As mentioned above, there are some specific cases where previous sequence information or reference databases can provide equivalent or better sequence information than a matched metagenome. The matched metagenome, however, is often a critical component, along with some specific reference genomes, for creating a database that is comprehensive without adding too many extraneous sequences.
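As a rough illustration of combining a matched metagenome with selected reference genomes, the following Python sketch merges FASTA records from several sources and collapses exact duplicate sequences under one representative identifier. It is only an assumption-laden toy: it handles identical sequences, whereas the clustering tools named earlier (CD-HIT, UCLUST) group sequences by similarity. The function names, the `source|id` naming scheme, and the example records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (identifier, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            fields = line[1:].split()
            header, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_and_dereplicate(sources):
    """Merge sources and keep one representative per identical sequence.

    Returns the combined FASTA text and a dict mapping each representative
    identifier to all identifiers that share its sequence.
    """
    representative = {}  # sequence -> representative identifier
    members = {}         # representative identifier -> all identifiers
    for source, records in sources.items():
        for protein_id, seq in records:
            full_id = f"{source}|{protein_id}"
            if seq not in representative:
                representative[seq] = full_id
                members[full_id] = [full_id]
            else:
                members[representative[seq]].append(full_id)
    fasta = "".join(f">{rid}\n{seq}\n" for seq, rid in representative.items())
    return fasta, members


# Hypothetical records: one metagenome gene is identical to a reference protein.
metagenome = parse_fasta(">contig1_gene1 hypothetical\nMKT\nAAA\n>contig2_gene5\nMSSQ\n")
reference = parse_fasta(">ref1\nMKTAAA\n>ref2\nMLLP\n")
fasta, members = merge_and_dereplicate({"mg": metagenome, "ref": reference})
print(fasta)
print(members)
```

Keeping the membership table alongside the collapsed FASTA preserves the link from a representative sequence back to every source identifier, which matters when identified proteins are later traced to a genome or taxon of origin.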

Table 1

Characteristics, advantages and disadvantages of sequence sources for metaproteomic databases.

	Matched metagenome	Unmatched metagenome	Unrestricted reference database	Restricted database amplicon sequencing	Restricted database defined community
Monetary cost	Sample type dependent $100-$2,000/sample or pooled samples	Free	Free	$50-$100/sample	Free
Time cost (labor & computation)	Genome-resolved month-year, otherwise weeks	Days	Days	Weeks	Days
Presence of sequences representing proteins not actually in the sample	Low, sequences are derived from sample	Medium, sequences are derived from system but not specific sample	High, sequences represent all of sequenced life	Medium, sequences are derived from same taxa as the sample, but not the same genomes	Low, exact composition is known and reference database is used
Likelihood of sequences missing	Low to medium, dependent on depth of sequencing and inclusion of unbinned sequences.	Medium to high, dependent on similarity between previously sequenced samples and samples measured by metaproteomics.	Medium to high, even if relatives of community members are present in public repositories, even closely related strains differ significantly in gene content.	Medium to high, even if representative genomes for identified taxa are available, closely related strains differ significantly in gene content.	None to low
Potential sources for redundant (highly similar or identical) sequences	Artificial: bringing together sequences from sequential gene prediction and multiple assemblies. Biological: similar genes in different strains from the same species or genus.	Artificial: bringing together sequences from sequential gene prediction and multiple assemblies. Biological: similar genes in different strains from the same species or genus.	Artificial: bringing together sequences from multiple sources. Biological: similar genes in different strains from the same species or genus.	Artificial: bringing together sequences from multiple sources. Biological: similar genes in different strains from the same species or genus.	Biological: similar genes in different strains from the same species or genus.
Taxonomic resolution	If genome-resolved subspecies to species, otherwise genus to phylum based on LCA to reference databases	If genome-resolved subspecies to species, otherwise genus to phylum based on LCA to reference databases	Genus to phylum based on LCA of all matches in the reference databases	Genus to phylum based on LCA to reference databases	Subspecies to species
Likelihood of misidentifying taxa	Low	Medium, dependent on relevance of metagenome to sample	High, many sequences missing from database and many sequences in the database are not in the sample	Medium, dependent on relevance of selected reference genomes to actual genomes in sample	Low

Characteristics, advantages and disadvantages of sequence sources for metaproteomic databases.

Studies on the effects of protein database construction on metaproteomic studies

Comparisons of the impact of different database construction strategies on peptide identification, protein inference, taxonomic assignment, and functional assignments in metaproteomic studies are critical for making informed decisions for protein database construction. Several studies have evaluated the effects of protein database construction on peptide identification, particularly with regard to the source of sequence information. These studies focused on the number of peptides identified and generally found that protein databases from matched or unmatched metagenomes yielded more peptide identifications than protein databases from reference proteomes [48], [88], [92], [93], [94]. A 2016 article by Tanca et al. presents a thorough evaluation of the effects of different database construction approaches on peptide identification [88]. In that paper, Tanca et al. compared databases derived from matched metagenomes to databases derived from UniProt [81]. These UniProt databases either contained all bacterial sequences in UniProt or were restricted to taxa identified by 16S rRNA gene sequencing at the family, genus, or species level. Tanca et al. found that the matched metagenomes yielded more peptides from human stool samples than any database constructed from UniProt. They also compared matched and unmatched metagenomes to a UniProt derived protein database for mouse and human fecal samples, and found that the matched metagenomes yielded more peptide identifications than unmatched metagenomes. In this evaluation, Tanca et al. found that the mouse microbiome was underrepresented in UniProt and as a result the UniProt database had even fewer peptide identifications for the mice as compared to humans, indicating that reference proteomes may not be a good source of sequence information for environmental samples in general. Furthermore, Tanca et al. 
evaluated the effects on peptide identification of databases that combined protein sequences from multiple sources. They found that combining metagenomes from multiple human or mouse subjects yielded more peptide identifications, as long as the matched metagenome was included. Furthermore, they found that combining all the databases, including the UniProt one, did not decrease the number of identifications. This result indicates that missing sequences have a greater impact on the peptides identified than increased database size. This is in line with a 2013 study, in which Tanca et al. found that a protein database made up of genomes sequenced from isolates in a mock community yielded more peptides than a database derived from metagenomic sequencing of the mock community [92]. In another study on Arctic Ocean samples, May et al. also found that combining the results from reference database and metagenome derived protein databases yielded more peptides than metagenome derived databases alone; however, these databases were not searched together, making this result inadequate for determining the impact of database size on peptide identification [94]. The results from May et al., however, do provide some insight on the impact of database size on peptide identification. The authors found that thousands of peptides identified in the smaller metagenome-derived database were missed when searching against the larger NCBI environmental database despite those peptides being present in the NCBI database, which shows that increased database size leads to peptides not being identified. A contrasting result was obtained by Zhang et al., who found that an unmatched metagenomic database made up of millions of sequences from the extensive metagenomic sequencing efforts previously done in humans yielded more peptide identifications than a matched metagenome [39], [48]. 
This result could suggest that with enough sequencing efforts, as has been done in humans, unmatched metagenomes could be equivalent to a matched metagenome in terms of number of peptides identified; however, Zhang et al. conducted this comparison using a two-step search approach, which blunts the effects of large protein database size using techniques that have not been fully validated (see subsection 2.1). Together these studies show the importance of matched metagenomes for creating complete protein databases when sequences for proteins in the sample are not present in public databases. These studies also show that matched metagenomes alone do not necessarily provide complete databases. One feature that has a major impact on the completeness of protein databases from matched metagenomes is sequencing depth. In their 2016 article, Tanca et al. showed that increasing sequencing depth had a positive linear relationship with the number of peptides identified. Sequencing depth in this evaluation was limited to eighteen megabase pairs (Mbps) per sample. Later in the study, when comparing matched to unmatched metagenomes in mice and humans, they sequenced metagenomes that were six gigabase pairs (Gbps) per sample. These sequencing depths are not high enough to produce a complete evaluation of the effects of sequencing depth on peptide identification based on simple calculations (expected number of organisms × average length of an organism’s genome × desired sequence coverage). For comparison, assuming an average genome length of 3.8 Mbps [95] and a desired coverage of 20-fold [96], 7.6 Gbps are needed to obtain good coverage of a 100 species community. As such, further evaluations of sufficient depth are needed to determine how much sequencing is really needed to generate a protein database from matched metagenomes that covers all detectable proteins in metaproteomic samples. 
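The simple depth calculation described above can be written out as a short Python sketch. The function name and its defaults are illustrative only; the defaults reuse the 3.8 Mbps average genome length and 20-fold coverage quoted in the text. The calculation implicitly treats all community members as equally abundant, so it is best read as a rough estimate rather than a guarantee of coverage for rare organisms.

```python
def required_sequencing_gbp(n_species, mean_genome_mbp=3.8, coverage=20):
    """Rough sequencing depth (in gigabase pairs) for a community.

    Implements: expected number of organisms x average genome length
    x desired coverage. Assumes equal abundance of all organisms.
    """
    total_mbp = n_species * mean_genome_mbp * coverage
    return total_mbp / 1000.0  # convert megabase pairs to gigabase pairs


if __name__ == "__main__":
    # 100-species community, values taken from the text
    print(round(required_sequencing_gbp(100), 2))
```

For a community with strongly uneven abundances, the same function could be applied with the coverage scaled up to reflect the rarest organism of interest.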
How protein sequences are predicted from a metagenome and whether they are predicted from raw reads or assembled contigs also impacts the completeness of a metagenome derived protein database, and as a result, how many peptides can be identified by database search. Proteins can be predicted from raw reads or contigs using brute-force six-frame translations or dedicated gene prediction software [88]. Six-frame translations extract all possible open reading frames (ORFs) above a certain length cutoff from a contiguous DNA sequence even if the ORFs overlap. Gene prediction software uses models of prokaryotic genes to predict non-overlapping ORFs likely corresponding to true genes [97]. In their 2016 article, Tanca et al. found that six-frame translation yielded fewer peptides than gene predictions on both assembled contigs and raw reads, and gene predictions from raw reads yielded slightly more peptides than gene predictions from contigs [88]. In contrast, significantly more peptides were assigned a functional annotation when genes were predicted from contigs instead of raw reads. May et al. identified substantially more peptides when predicting genes on raw reads versus contigs, but more peptides were identified in total when combining the two approaches [94]. The effect of raw reads or contigs on taxonomic assignment was inconclusive, and neither of these studies examined the effects of these database construction approaches on protein inference or the effect of protein inference on functional annotation. Our assumption would be that protein inference and gene predictions on contigs would yield better functional annotations since length or completeness of a predicted gene has been shown to yield more sensitive and accurate annotations [23], [98], [99], [100]. 
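To make the brute-force six-frame translation mentioned above concrete, the following Python sketch translates both strands of a DNA sequence in all three reading frames and keeps every stop-to-stop stretch above a length cutoff. It is a simplified illustration, not the procedure of any cited study: it uses the standard genetic code, does not require a start codon, and the function names and the default cutoff of 30 residues are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_orfs(dna, min_len=30):
    """Collect stop-to-stop peptide stretches of at least min_len residues
    from all six reading frames (overlapping stretches are all kept)."""
    dna = dna.upper()
    orfs = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            for segment in translate(strand[offset:]).split("*"):
                if len(segment) >= min_len:
                    orfs.append(segment)
    return orfs
```

Because every frame contributes candidate sequences, the output grows much faster than that of a gene predictor, which is one way to see why six-frame databases become large.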
Additionally, none of the evaluations mentioned in this paragraph processed their assembled metagenomes into MAGs, thus they were not able to evaluate whether MAGs improved taxonomic assignment or functional annotations. Benefits such as improved taxonomic classification, assignment of complete pathways to individual organisms, and the ability to analyze genes in their genomic context have led to MAGs being a critical component of the many studies that have investigated function with metaproteomics at the species and genome level [6], [7], [9], [25], [26], [90], [101]. Therefore, the use of genome-resolved metagenomes in metaproteomics databases deserves more careful future evaluations. The studies described in this section provide insight into the effect of database size and completeness on the number of peptides identified, but offer only limited information on the effects of protein database construction on protein inference, taxonomic resolution and accuracy, functional assignment and interpretation, and whether low FDR peptides are actually being identified accurately. More evaluations are needed, especially in light of an article by Timmins-Schiffman et al., which showed that protein databases generated from assembled metagenomes versus reference databases yielded very different taxonomic compositions and functional results [102]. These different taxonomic compositions were observable even at the phylum level, and the 10 functions that changed the most varied depending on the database used. Timmins-Schiffman et al. suggested that metagenome derived databases were likely the safer option based on these results, but they did not further evaluate if the metagenome was indeed the most accurate database in this study. In their 2013 study, Tanca et al. 
showed that assembled (but not genome-resolved) matched metagenomes had fewer mismatches at the species level than protein databases made up of all bacteria, fungi, and viruses in UniProt or NCBI, when evaluating taxonomy of a mock community of known composition [92]. Since this study was not genome-resolved, taxonomy was assigned separately to individual peptides using the UniPept [21] or MEGAN [103] software, which use LCA methods to identify a consensus taxonomy based on matching the sequences to taxonomically classified sequences in UniProt or NCBI. This leads to somewhat circular logic, as the taxonomy is being evaluated using the same databases to which the peptides are compared in the analysis. Further evaluations are needed to investigate the accuracy of species level assignments independent of these reference databases. This sort of evaluation would need to be done in the context of MAGs as discussed in the previous paragraph. Beyond taxonomic accuracy, the impact of different protein database construction strategies on FDR estimation of peptides and proteins still needs to be studied. Since FDR is just an estimation of peptide or protein identification accuracy, the actual accuracy needs to be evaluated empirically. Kumar et al. provide some insight into how to do this by including a set of sequences known to not be in the sample in the target database (an entrapment database) [62]. They then evaluated the number of identifications after database search that were from the entrapment database to estimate FDR calculation accuracy in their evaluation of multi-step search methods. An alternative approach could be to use spiked-in peptides or proteins in various quantities to create a population of peptides known to be in the sample. Spiked-in peptides are a form of ground truth typically used in the evaluation of proteomic quantification methods [104]. 
By spiking peptides into some samples, but not into others, spiked-in peptides could be used as a way to evaluate whether peptides known to be in the sample or absent from the sample are being detected by database search. Finally, studies are also needed to evaluate the effect of database construction on peptide and protein identification beyond just source of sequence information. For example, evaluations on the effect of sequencing depth and removal of sequence redundancy on peptide identification and particularly protein inference are still needed. Despite these limitations, the information above can be used to inform the construction of a protein database based on the current standards of the field, which we explore in section 3.
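As a minimal illustration of the entrapment idea described in this section, the Python sketch below computes the fraction of accepted identifications that map to entrapment sequences. It is not the procedure used by Kumar et al.; the function name and the `ENTRAP_` accession prefix are hypothetical, and the sketch assumes that entrapment sequences were given a recognizable prefix when the database was built.

```python
def entrapment_fraction(accessions, entrapment_prefix="ENTRAP_"):
    """Fraction of accepted identifications that hit entrapment sequences.

    accessions: protein accessions of identifications that passed the
    FDR threshold. Entrapment sequences are assumed to carry a
    recognizable prefix (hypothetical naming convention).
    """
    if not accessions:
        return 0.0
    n_entrapment = sum(1 for acc in accessions if acc.startswith(entrapment_prefix))
    return n_entrapment / len(accessions)


if __name__ == "__main__":
    hits = ["ENTRAP_0001", "MAG12_00045", "MAG03_01210", "HOST_P12345"]
    print(entrapment_fraction(hits))
```

A fraction well above the nominal FDR threshold would suggest that the reported FDR underestimates the true error rate for that database and search strategy.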

Considerations and best practice suggestions for constructing a metaproteomics protein database

The information provided in section 2 can guide decision making for constructing a metaproteomic protein database. A well-constructed database has three main elements: (1) comprehensive sequence coverage while minimizing irrelevant sequences (covered in subsections 3.1 and 3.2), (2) a link to the genome of origin for each protein sequence when possible (covered in 3.2 and 3.4), and (3) curation of redundant sequences to facilitate unambiguous protein inference and annotation (covered in 3.3 and 3.4). In Fig. 2 we provide a decision tree that divides database construction into four main steps: (1) community assessment, (2) sequence acquisition, (3) database construction, and (4) annotation. We discuss these steps in detail in the following subsections. In addition to the steps represented in the decision tree, we discuss functional annotation of the protein database in a separate subsection (section 3.5).

Fig. 2

Decision tree reflecting the steps to take when constructing a protein sequence database for metaproteomics. We define a synthetic community as one that is designed by the researcher (e.g. defined communities, mock communities, gnotobiotic systems). We define a natural community as a community taken from the environment (e.g. soil, fecal, ocean).

Assessing the community prior to protein database construction

The first step of any metaproteomics study should be to determine the composition of the studied microbial community to create a protein database. In most cases this can be done using prior literature, amplicon sequencing [105], [108], [109] or metagenomic sequencing [7], [107], [108]. In specific cases, the exact composition of a microbial community in a sample is known as in the case of constructed, fully defined communities [8], [82] or well-characterized highly specific symbioses [6], [83], [84], [85], [86]. Often, however, the exact community composition is unknown, as discussed in section 2.3. In these cases, sequencing the samples can provide insight into the steps that need to be taken. Though not typically done, it can be beneficial to conduct a preliminary amplicon sequencing analysis, prior to shotgun metagenomic sequencing. Amplicon sequencing results can be analyzed using robust analysis platforms such as mothur [177] or QIIME2 [178]. In contrast to metaproteomics, which allows the analysis of proteins from all domains of life in a single analysis, determining microbial community composition with amplicon sequencing may require separate analysis of multiple different marker genes to obtain a comprehensive overview of community composition (e.g. 16S rRNA gene for Archaea and Bacteria, and 18S rRNA gene and Internal Transcribed Spacer (ITS) for various eukaryotes) [109]. These preliminary analyses can provide insight into the availability of genomes from community members in public repositories, if and how much metagenomic sequencing is needed, and metagenomic processing steps needed to cover the community (see section 3.2).

Acquiring sequences for protein database construction

Once the community in the samples in question has been assessed, the next step is to gather protein sequences that cover all sequences of proteins potentially present in the samples. If the microbial community composition is known in advance and genomes of specific strains are publicly available, these microbial protein sequences can be downloaded directly from the source repository. In the case of a defined/synthetic community with reference genomes, this involves going to RefSeq or UniProt and downloading the relevant FASTA files for the reference proteomes of the strains in question. If genomes are not available in reference repositories, this involves acquiring the sequences from sources found through the data accessibility statement of previous manuscripts [6] or by sequencing the isolates making up the defined community [92]. Please note that we suggest only collecting reference protein sequences from public repositories if you can be certain that they correspond to the strains in your samples. We do not recommend the use of genomes from relatives based on phylogenetic marker gene analysis as gene content of even closely related strains can differ widely (see section 2.3). In addition to microbial sequences, additional sequences of proteins that may be in the sample need to be gathered, for example, host sequences for host-associated microbiomes, culture media components, dietary components if working with gut microbiomes, and common laboratory contaminants [6], [26], [79]. The cRAP database provides many of the common contaminants found in proteomics studies [179]. In the case of studies on host-associated microbiomes, such as from humans, mice, or Arabidopsis, complete protein sequence sets (reference proteomes) can be acquired from UniProt’s reference proteomes [81]. 
In most cases protein sequences for the studied microbial community are not available from repositories; in these cases, metagenomic and sometimes metatranscriptomic sequencing is the most straightforward way to obtain sequences for metaproteomics. The first decisions to make when starting a metagenomic study are the sequencing technology to use and the sequencing depth to aim for. Currently most of the tools for analyzing shotgun metagenomic data are built to use paired-end reads from Illumina sequencers. How much sequencing is needed can be calculated based on the assessment of microbial community composition suggested in section 3.1 (see section 2.4 for details on how this could be calculated). After DNA extraction, library preparation and sequencing of the samples can be done at a core sequencing facility or commercial service provider. Once raw sequence data are provided by the sequencing facility, publicly available tools that are relatively easy to install and have good documentation can be used to acquire a comprehensive set of predicted protein sequences, many of which come from MAGs. These tools can be used in individual steps as detailed below or as easily installable bundled workflows such as Anvi’o [110], MetaWRAP [111], and ATLAS [112]. Decontamination: sequencing reads are quality checked and trimmed for adapters and low quality regions if necessary. This can be done automatically using, for example, Trim Galore [113]. Additionally, undesired contaminating sequences, such as host derived sequences and the Illumina control spike-in PhiX, can be removed using an aligner, such as BBMap in the BBTools suite [114], BWA [115], or Bowtie [116]. Assembly: metagenome-specific assemblers extend short read sequences into contigs using iterative de Bruijn graph assembly. 
Generally, metaSPAdes [117] generates the most accurate assemblies, while MEGAHIT [118] generates reasonably good assemblies but has the advantage that it is an order of magnitude more efficient in terms of time and memory usage [119]. MEGAHIT also has the ability to assemble reads from multiple samples in tandem to form a consensus assembly (co-assembly) between all the samples in a study. Contig binning: To obtain genome-resolved metagenomes, contigs are grouped into bins using so-called binning approaches. Binning approaches use contig-intrinsic information such as read coverage in several samples (differential coverage) and tetranucleotide frequencies. These binning approaches are implemented in automated software tools such as the frequently used CONCOCT [122], MetaBat2 [121], and MaxBin [120]. Performance of the different binning approaches is sample specific and can be empirically evaluated with tools like DASTool [123] and MetaWRAP [111]. Evaluation of bin quality to determine which bins can be considered metagenome-assembled genomes: the most common approach for evaluating if a bin potentially corresponds to a partial or complete genome is the assessment of single copy gene content of bins [124], [125]. Essentially a list of genes that have been empirically shown to be present as a single copy in genomes of specific phylogenetic groups is used to taxonomically classify bins at a higher taxonomic level and to evaluate the percent completion and contamination (redundancy) of each bin. The tools Anvi’o, BUSCO, and CheckM produce this evaluation automatically [110], [124], [125]. Other metrics, such as number of tRNAs and the presence of rRNA genes can also be used to evaluate MAG quality, as well as general assembly quality metrics, such as N50 and circularity [37], [107]. There are additional steps that can be taken to improve bin quality, such as manual curation and re-assembly [107], facilitated by tools such as Anvi’o and MetaWRAP [110], [111]. 
For a good review on generating high quality MAGs see [107]. The specific completeness and contamination levels for when a bin can be considered a MAG vary in the literature. Generally, the recommendations set forth by the Genomic Standards Consortium (GSC) should be followed for any MAGs that will be submitted to a public repository, such as the NCBI databases [37], [126]. These recommendations include a > 50% completion and < 10% contamination cutoff for MAGs to be considered medium quality genomes [37]. It is still not clear, however, if these cutoffs are ideal for the construction of protein databases for metaproteomics, as there can still be useful information about gene neighborhoods and a protein’s organism of origin for organisms whose genome could not be assembled into a medium quality MAG. Further evaluations are needed to investigate the impacts of different MAG quality thresholds on protein database construction from MAGs, as discussed in section 2.4. It is possible that there will not be one hard set of rules, as these cutoffs may end up being system or study specific. Organizing MAGs into species and subspecies groups: once a set of acceptable MAGs has been selected, they can be grouped into species and subspecies groups by average nucleotide identity (ANI), with tools such as dRep [7], [127]. The genomic delimiter of bacterial species has been shown to be 95% ANI [40], [41]. Higher ANI thresholds, such as 98% have been used to delineate subspecies groupings [7]. dRep outputs the highest quality MAG, by single copy gene metrics, for an ANI group as a representative genome. dRep also outputs a table containing the information about which MAGs were grouped by ANI. Gene annotation: after a set of MAGs has been selected, prokaryotic gene calling algorithms can be used to predict genes on binned and unbinned contigs. 
Many metagenomic studies set a contig length cutoff of 1000 bp to reduce the number of predicted genes that are fragmented, but it can be beneficial to use a lower cutoff for a metaproteomic protein database in order to not lose potentially identifiable peptide sequences. Prodigal [97] and MetaGeneMark [128] are common gene annotators used in metagenomics and both tools output translated amino acid sequences for the predicted genes. Many bioinformatics tools for processing and analyzing metagenomic data or providing functional annotation to genes use Prodigal for gene prediction (more details in section 3.5) [125], [127], [129], [130], [131], [132]. The above steps favor the detection of bacterial and archaeal genes and MAGs. If assessment of the community, as described in section 3.1, indicates the substantial presence of one or more eukaryotic organisms in the community that has no or low quality public data, then additional steps need to be taken to acquire those protein sequences. Eukaryotic contigs will often be present in the unbinned fraction of a metagenomic assembly, and gene calls from a prokaryotic gene caller, like Prodigal, can still be used to identify eukaryotic proteins as was done with green algae by Kleiner et al. [24]. These gene predictions are, however, often highly fragmented and incomplete due to the presence of introns in the genes of many eukaryotes; nevertheless, they can potentially be classified as eukaryotic using an LCA approach as described later in section 3.4. To acquire better gene annotations for eukaryotes there are two options: (1) use de novo assembled transcripts from RNA-seq to identify eukaryotic transcripts and predict their complete encoded protein sequences as was done for crustacean [133] and gutless worm hosts [134], or (2) retrieve eukaryotic contigs from the metagenome and apply eukaryotic specific gene prediction algorithms. 
A workflow for retrieving eukaryotic contigs using machine learning, and applying binning methods to assemble eukaryotic MAGs, was proposed by West et al. and benchmarked using a variety of data sets [135]. West et al. further used this approach to identify proteins of the yeast Candida in preterm infants using metaproteomics [101]. The BUSCO tool can be used to classify bins as eukaryotic, bacterial, or archaeal, and provides gene predictions and completion versus contamination metrics for those bins [124]. None of the approaches described here to obtain eukaryotic protein sequences are ideal either in terms of quality in the case of Prodigal gene predictions, or labor in the case of the West et al. approach; however, they represent the current state of the field for acquisition of eukaryotic protein sequences for metaproteomics. In summary, assessment of community composition in advance provides a powerful framework with which to select the sources of sequence information needed for a metaproteomics study, whether it be reference proteomes for known communities or metagenomic/metatranscriptomic sequences for understudied communities.
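As an illustration of the bin evaluation step described earlier in this subsection, the Python sketch below applies the medium quality cutoffs quoted from the GSC recommendations (> 50% completion and < 10% contamination) to a table of bin statistics. The input format, field names, and function name are hypothetical; in practice these values would be taken from the output of a tool such as CheckM, and the thresholds may need to be adjusted for a given study.

```python
def select_medium_quality_bins(bins, min_completion=50.0, max_contamination=10.0):
    """Return names of bins passing the completion/contamination cutoffs.

    bins: list of dicts with keys "name", "completion", "contamination"
    (percent values; hypothetical layout for this example).
    """
    return [
        b["name"]
        for b in bins
        if b["completion"] > min_completion and b["contamination"] < max_contamination
    ]


if __name__ == "__main__":
    example_bins = [
        {"name": "bin_01", "completion": 96.2, "contamination": 1.4},
        {"name": "bin_02", "completion": 48.0, "contamination": 0.5},
        {"name": "bin_03", "completion": 81.7, "contamination": 14.9},
    ]
    print(select_medium_quality_bins(example_bins))
```

Bins that fail the cutoffs do not need to be discarded for metaproteomics; their contigs can still contribute predicted proteins as part of the unbinned fraction.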

Assembling and curating the protein database

After acquiring the protein sequences the next step is to assemble the database. For this, all sequences from different sources (in FASTA format) acquired in the sequence acquisition step need to be combined into a single database. A simple Linux utility like “cat” can be used to combine all FASTA files. Sometimes simply combining FASTA files is not the best course of action, for example, when the same sequence may be present multiple times in the various sequence sources used. In these cases sequences can be combined in a stepwise fashion considering their annotation quality (see below). The next step after combining sequences is to remove redundant sequences, i.e. highly similar or identical sequences that have different identifiers (accession numbers). This can be done by clustering sequences based on amino acid identity (percentage of amino acids that match between sequences) with algorithms, such as UCLUST [78] or CD-HIT [77], which in addition to a FASTA file with representative sequences also provide an output file that specifies which sequences were clustered together. For sequence clustering the choice of the identity threshold at which sequences are clustered is critical. Various studies have used sequence identity thresholds for clustering that range from 90 to 100 percent amino acid identity [7], [18], [26], [70], [79], [86]. Lower identity thresholds will result in smaller databases and allow for a greater number of possible unique peptides at the price of losing peptide sequences that are distinct between similar protein sequences, as well as the ability to differentiate proteins from very closely related organisms. Higher identity thresholds allow for better species and subspecies level resolution by retaining more similar sequences, at the cost of increased database size and identification of proteins without unique peptides. 
The optimal clustering threshold needs to be determined specifically for each study and sample type, for example, by searching a subset of the data against databases with different clustering levels and then evaluating quality metrics such as number of protein-unique peptides identified, number of unambiguously identified proteins, and percentage of proteins traceable to a specific species or subspecies group. If sequences from multiple sources are used and some sources have more useful information associated with them, such as being derived from taxonomically classified MAGs or reference proteomes, it can be beneficial to cluster sequences in a way that will preferentially retain the better annotated sequences as the representative sequence of a cluster. This was, for example, done in a study by Kleiner et al. when combining protein sequences from MAGs and unbinned contigs [6]. Kleiner et al. preferentially retained the well annotated sequences from the MAGs and only added sequences from unbinned contigs if no similar sequence from a MAG was available. For this, the authors used CD-HIT-2D, an extension of CD-HIT, which allows users to compare databases and output sequences that are unique to one of the databases. Specifically, after combining and clustering the well annotated sequences with CD-HIT, the authors compared the sequences from unbinned contigs against this initial “higher-quality” database using CD-HIT-2D and then added sequences absent from the “higher-quality” database to the final database. This approach will maximize unambiguous identification of protein sequences, while favoring better annotated sequences.
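The stepwise combination of sources described above can be sketched in Python as follows. This is a deliberately simplified stand-in for the CD-HIT and CD-HIT-2D workflow: it only removes exact (100% identical) duplicates, whereas the clustering tools also collapse highly similar sequences at a chosen identity threshold. The function names are illustrative, and the sources are assumed to be passed in order of decreasing annotation quality.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_by_priority(sources):
    """Merge FASTA texts, keeping the first occurrence of each sequence.

    sources: FASTA texts ordered from best to least well annotated, so an
    identical sequence keeps the header from the better annotated source.
    """
    seen = set()
    merged = []
    for text in sources:
        for header, sequence in parse_fasta(text):
            if sequence not in seen:
                seen.add(sequence)
                merged.append((header, sequence))
    return merged


if __name__ == "__main__":
    mag_proteins = ">MAG01_00001\nMKTAYIAK\n>MAG01_00002\nMLLSRQW\n"
    unbinned_proteins = ">contig9_1\nMKTAYIAK\n>contig9_2\nMAAPLV\n"
    for header, sequence in merge_by_priority([mag_proteins, unbinned_proteins]):
        print(f">{header}\n{sequence}")
```

In the example, the unbinned copy of the first protein is dropped and the MAG-derived header is retained, mirroring the preference for better annotated sequences.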

Annotating the protein database with taxonomic information

Once the protein database has been assembled, the database needs to be annotated with functional and taxonomic information. Taxonomy can be assigned to the protein sequences that make up a metaproteomic protein database based on their genome of origin or through a consensus taxonomy acquired from similarity searches against reference databases (i.e. LCA). For proteins that originate from a genomic unit (i.e., acquired from strain specific reference proteomes or MAGs), the most straightforward course of action is to assign the taxonomy of the genome. Proteins acquired from a strain-specific reference proteome can simply be assigned the species of that reference proteome; however, proteins acquired from MAGs require the MAG to be taxonomically classified. For MAGs, if the species has already been discovered and has a representative in a reference database, the MAG’s species can be assigned by matching with an ANI of 95% or greater to its representatives in a reference database. If the species of the MAG does not have a representative genome in a reference database, then the lowest possible taxonomy can be predicted using the consensus taxonomy from similarity searches of all the genes in the genome, as done by BAT [136], or by using phylogenomic methods to place the genome in a tree of life. GTDB-Tk does the ANI and phylogenetic comparison automatically for bacterial and archaeal MAGs based on the Genome Taxonomy Database [129]. The Genome Taxonomy Database is built from genomes of sufficient quality in NCBI’s GenBank along with some additional MAGs [129], [137], [138]. While MAGs that do not have a representative in a reference database cannot be assigned a species name, proteins from these MAGs can still be traced to an unnamed species using the species and subspecies groupings described in section 3.2. This unnamed species can be assigned a unique identifier for the study, which can then be used in the submission of the MAG to NCBI. 
Since proteins are often gathered from multiple sources and clustered to remove redundancy, as discussed in section 3.3, the taxonomic origin of all the sequences that make up a cluster should be considered when assigning taxonomy based on the genome of origin. Based on our experience, we suggest doing this as follows. If all protein sequences in a cluster come from genomes of the same subspecies or species (see section 3.2), then that species or subspecies can be assigned to the representative sequence of the cluster. If a cluster contains protein sequences from genomes of different species, then the representative protein could be labeled as multi-species, while retaining information about all the possible origin species. If none of the sequences that make up a cluster can be linked to a genome, then the taxonomy can be determined using LCA approaches as described below. For proteins that can be traced to a specific species, a predefined species code can be added to the identifier of the protein to facilitate interpretation once the metaproteomic data has been processed, as described previously [24]. For proteins that do not retain their genome of origin, for example, those from unbinned contigs in a metagenome, from unbinned gene catalogs built from previous metagenomic studies [38], [39], or from a general download of one of the major reference databases like UniProt, taxonomy can only be assigned by running similarity searches against reference databases and applying LCA algorithms. If proteins come from an unbinned contig with multiple genes, then the consensus taxonomy of all the genes in the contig can be used. This is done automatically by the algorithm CAT [136]. If proteins are independent singletons, they can be assigned a taxonomy using a standard metagenomic LCA method, such as those provided by MEGAN [103], Kaiju [139], Centrifuge [140], or Kraken2 [141].
For LCA approaches, if no sequence similar (>95% identity) to the protein in question is present in any genome in a public reference database, then a species-specific taxonomy cannot be assigned.
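The cluster labeling rules and the LCA fallback can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure of CAT, MEGAN, or any other tool: lineages are given as root-to-tip lists of names, the LCA is taken as their longest shared prefix without any score or coverage filtering, and cluster members that cannot be linked to a genome are simply ignored when at least one member can be. The function names and the example lineages are hypothetical.

```python
from typing import List, Optional, Sequence

MULTI_SPECIES = "multi-species"


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest shared prefix of root-to-tip lineages."""
    if not lineages:
        return []
    shared: List[str] = []
    for ranks in zip(*lineages):  # stops at the shortest lineage
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared


def label_cluster(member_species: Sequence[Optional[str]],
                  hit_lineages: Sequence[Sequence[str]] = ()) -> dict:
    """Label a cluster representative from the species of its member sequences.

    member_species: species of the genome of origin per member, None if unknown.
    hit_lineages: lineages of similarity-search hits, used only when no member
    can be linked to a genome.
    """
    known = sorted({s for s in member_species if s is not None})
    if len(known) == 1:
        return {"label": known[0], "origins": known}
    if len(known) > 1:
        # Keep every possible origin species alongside the multi-species label.
        return {"label": MULTI_SPECIES, "origins": known}
    lca = lowest_common_ancestor(hit_lineages)
    return {"label": lca[-1] if lca else "unclassified", "origins": []}


# Shortened, made-up lineages for illustration only.
hits = [
    ["Bacteria", "Bacteroidota", "Bacteroides", "Bacteroides ovatus"],
    ["Bacteria", "Bacteroidota", "Bacteroides", "Bacteroides uniformis"],
]
print(label_cluster(["Bacteroides ovatus", "Bacteroides ovatus"]))
print(label_cluster(["Bacteroides ovatus", "Bacteroides uniformis"]))
print(label_cluster([None], hit_lineages=hits))
```

In the last call no member is linked to a genome, so the label falls back to the deepest name shared by all hits, here the genus. This mirrors the point made above: without a close match in the reference database, the assignment cannot reach species level.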

Annotating the protein database with functional information

To assign putative functions to protein sequences, they are compared to sequences or profiles/models of sequence groups in public reference databases of protein function, for example eggNOG [142], KEGG [143], UniProt [81], InterPro [144], MEROPS [145], MetaCyc [146], and CAZy [147], among others. The KEGG and MetaCyc databases are mostly focused on enzymes, though they do provide information about other cellular processes, such as transporters. Other databases such as eggNOG, UniProt, and InterPro are more comprehensive, including information for many cellular processes. In addition to these more general databases there are specialized databases such as MEROPS and CAZy that focus on peptidases and carbohydrate-active enzymes, respectively. The quality of the functional annotations in these databases, and their link to the metaproteomic protein database, play a major role in determining the functional output of any metaproteomics study. An example of high-quality annotations would be the reviewed fraction of the UniProt database (Swiss-Prot) as compared to the computationally generated unreviewed fraction [81]. Functional information from these databases comes in the form of functional descriptions found in the header of protein sequences in FASTA files or in tables provided by these databases, and in the form of functional classification systems. These functional classification systems can be general, such as Gene Ontology (GO) terms [148]; based on protein families, like eggNOG [142], COG [149], or KEGG Orthology (KO) [143]; or more focused on specific metabolism, like the Enzyme Commission (EC) numbers [150], Transporter Classification (TC) [151], or Carbohydrate-Active enZYmes (CAZy) [147]. Several tools provide automated functional annotation of protein sequences with the above classifications. Evaluating these tools is outside the scope of this review, but several studies have been conducted that provide some insight [152], [153], [154].
There are full-service genome annotation tools, such as RAST [155], Prokka [130], DRAM [131], and MetaErg [132], that work on contigs and genomes, predicting genes and their functions in tandem. These tools also predict other non-protein-coding genetic features, such as tRNAs and rRNAs. RAST and Prokka are older software packages, originally developed for single prokaryotic genomes, and are limited to functional descriptions and EC numbers, while DRAM and MetaErg were more recently developed for unbinned metagenome and MAG annotation and provide a wide array of functional classifications. Web-based tools such as RAST have the advantage of providing easy access to visualizations of the gene neighborhood of all the genes predicted on a contig, providing insight into the potential function of unknown proteins and easy comparison with other genomes. If the protein database has already been compiled, but more functional annotations are needed, other annotation tools can be used to annotate protein sequences directly, such as eggNOG-mapper [152], InterProScan [156], dbCAN2 [157] and GhostKOALA [158]. These tools can be run on the full curated protein database or, to save computational time, on only the protein sequences that have been identified. InterProScan and eggNOG-mapper provide a wide array of functional annotation information, while dbCAN2 and GhostKOALA are more specialized, focusing on CAZy enzymes and KO terms, respectively. In the end, the choice of annotation tool depends on the desired functional outcomes of any given metaproteomic study.
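Restricting annotation to identified proteins amounts to subsetting the database FASTA file before passing it to an annotation tool. The following Python sketch shows one way to do this with the standard library only. It assumes that the protein identifier is the first whitespace-delimited token of each FASTA header and that the set of identified identifiers has already been exported from the search results; the function names and example records are hypothetical.

```python
import io
from typing import Iterable, Iterator, Set, Tuple


def read_fasta(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def subset_fasta(handle: Iterable[str], identified_ids: Set[str]) -> str:
    """Return FASTA text containing only the records whose ID was identified."""
    records = []
    for header, seq in read_fasta(handle):
        parts = header.split()
        protein_id = parts[0] if parts else ""  # assumed: ID is the first token
        if protein_id in identified_ids:
            records.append(f">{header}\n{seq}")
    return "\n".join(records) + ("\n" if records else "")


# Made-up records for illustration only.
database = io.StringIO(
    ">P1 hypothetical protein\nMKT\nAAA\n"
    ">P2 methanol dehydrogenase\nMSS\n"
)
print(subset_fasta(database, {"P2"}), end="")
```

The reduced file can then be submitted to whichever annotation tool fits the study, and the resulting annotations joined back to the identification table by protein identifier.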

Perspectives and concluding remarks

Construction of the protein sequence database plays a critical role in the outcome of any metaproteomics study. In this review, we have provided a comprehensive overview of the effects of protein databases on peptide identification and protein inference, as well as their subsequent taxonomic and functional interpretation. Existing evidence indicates that peptides and proteins are best identified, and taxonomically classified, when the database is complete and has minimal extraneous sequences, which is usually best accomplished through sequencing sample-matched metagenomes or comprehensive prior sequencing of a specific system. For continued improvement of protein sequence databases, future evaluations should focus on how different metagenomic processing methods used for protein database construction affect peptide identification numbers, peptide identification accuracy, and taxonomic accuracy (e.g., as discussed in section 2.4). For example, one option in database generation that has not been evaluated is whether and how combining metagenomes from multiple samples affects peptide identification numbers and accuracy, i.e., whether sample-specific databases or a database that combines all metagenomes from the whole experiment performs better. Improving the accuracy of peptide and protein identifications in the context of multi-step search strategies, such as two-step searches [59], is also needed because their validity has recently been put into question [62], and they are needed for the identification of peptides and proteins using the massive databases from previous sequencing initiatives [48]. As large genome-resolved gene catalogs become available for more biological systems, such as mice [159], it will become all the more critical to evaluate the utility of these databases for metaproteomic studies and to develop better database reduction strategies.
We still do not fully understand whether distinct construction strategies produce databases that perform differently on biological systems of variable complexity, since the most thorough evaluations described in this paper were done only on mouse and human gut samples [88]. Community efforts such as the Metaproteomics Initiative [160], which, for example, recently carried out an interlaboratory comparison of metaproteomic workflows [19], may represent an excellent mechanism to evaluate the impact of database construction approaches on metaproteomics. Emerging technologies in the realms of DNA sequencing, peptide mass spectrometry and novel protein measurement approaches will likely impact protein database construction for metaproteomics. With regard to DNA sequencing, long-read sequencing technologies, such as Oxford Nanopore [161] and PacBio [162], as well as sequencing technologies that connect DNA reads based on their cell of origin, such as Hi-C sequencing [163], provide avenues for obtaining higher quality MAGs with better taxonomic resolution. With regard to protein identification, new technologies, such as ion mobility spectrometry TOF mass spectrometers [164], data-independent acquisition (DIA) [165], or actual sequencing of proteins independent of mass spectrometry using nanopores [166], provide new avenues to improve metaproteomic depth, quantification and protein inference. These technologies are likely to change how protein databases affect the outcome of a metaproteomic study. In the case of the new mass spectrometry technologies, however, recent publications indicate that identification of proteins using these technologies will follow similar principles with regard to spectral matching and FDR calculations as database search [167], [168], indicating that many of the principles described in this review will still apply.
In the case of protein sequencing technologies, protein databases will likely no longer be needed for the identification of proteins; however, protein databases will still be important for taxonomic and functional classification. Therefore, at least for the foreseeable future, protein database construction remains critical for investigating molecular phenotypes of microbial communities.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

175 in total

1. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.

Authors: Weizhong Li; Adam Godzik
Journal: Bioinformatics Date: 2006-05-26 Impact factor: 6.937

2. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry.

Authors: Joshua E Elias; Steven P Gygi
Journal: Nat Methods Date: 2007-03 Impact factor: 28.547

3. Assessing the impact of protein extraction methods for human gut metaproteomics.

Authors: Xu Zhang; Leyuan Li; Janice Mayne; Zhibin Ning; Alain Stintzi; Daniel Figeys
Journal: J Proteomics Date: 2017-07-10 Impact factor: 4.044

4. The minimum information about a genome sequence (MIGS) specification.

Authors: Dawn Field; George Garrity; Tanya Gray; Norman Morrison; Jeremy Selengut; Peter Sterk; Tatiana Tatusova; Nicholas Thomson; Michael J Allen; Samuel V Angiuoli; Michael Ashburner; Nelson Axelrod; Sandra Baldauf; Stuart Ballard; Jeffrey Boore; Guy Cochrane; James Cole; Peter Dawyndt; Paul De Vos; Claude DePamphilis; Robert Edwards; Nadeem Faruque; Robert Feldman; Jack Gilbert; Paul Gilna; Frank Oliver Glöckner; Philip Goldstein; Robert Guralnick; Dan Haft; David Hancock; Henning Hermjakob; Christiane Hertz-Fowler; Phil Hugenholtz; Ian Joint; Leonid Kagan; Matthew Kane; Jessie Kennedy; George Kowalchuk; Renzo Kottmann; Eugene Kolker; Saul Kravitz; Nikos Kyrpides; Jim Leebens-Mack; Suzanna E Lewis; Kelvin Li; Allyson L Lister; Phillip Lord; Natalia Maltsev; Victor Markowitz; Jennifer Martiny; Barbara Methe; Ilene Mizrachi; Richard Moxon; Karen Nelson; Julian Parkhill; Lita Proctor; Owen White; Susanna-Assunta Sansone; Andrew Spiers; Robert Stevens; Paul Swift; Chris Taylor; Yoshio Tateno; Adrian Tett; Sarah Turner; David Ussery; Bob Vaughan; Naomi Ward; Trish Whetzel; Ingio San Gil; Gareth Wilson; Anil Wipat
Journal: Nat Biotechnol Date: 2008-05 Impact factor: 54.908

5. Database resources of the National Center for Biotechnology Information.

Authors:
Journal: Nucleic Acids Res Date: 2013-11-19 Impact factor: 16.971

6. Assessing species biomass contributions in microbial communities via metaproteomics.

Authors: Manuel Kleiner; Erin Thorson; Christine E Sharp; Xiaoli Dong; Dan Liu; Carmen Li; Marc Strous
Journal: Nat Commun Date: 2017-11-16 Impact factor: 14.919

7. Peptide-based functional annotation of carbohydrate-active enzymes by conserved unique peptide patterns (CUPP).

Authors: Kristian Barrett; Lene Lange
Journal: Biotechnol Biofuels Date: 2019-04-30 Impact factor: 6.040

8. Critical Assessment of MetaProteome Investigation (CAMPI): a multi-laboratory comparison of established workflows.

Authors: Tim Van Den Bossche; Benoit J Kunath; Kay Schallert; Stephanie S Schäpe; Paul E Abraham; Jean Armengaud; Magnus Ø Arntzen; Ariane Bassignani; Dirk Benndorf; Stephan Fuchs; Richard J Giannone; Timothy J Griffin; Live H Hagen; Rashi Halder; Céline Henry; Robert L Hettich; Robert Heyer; Pratik Jagtap; Nico Jehmlich; Marlene Jensen; Catherine Juste; Manuel Kleiner; Olivier Langella; Theresa Lehmann; Emma Leith; Patrick May; Bart Mesuere; Guylaine Miotello; Samantha L Peters; Olivier Pible; Pedro T Queiros; Udo Reichl; Bernhard Y Renard; Henning Schiebenhoefer; Alexander Sczyrba; Alessandro Tanca; Kathrin Trappe; Jean-Pierre Trezzi; Sergio Uzzau; Pieter Verschaffelt; Martin von Bergen; Paul Wilmes; Maximilian Wolf; Lennart Martens; Thilo Muth
Journal: Nat Commun Date: 2021-12-15 Impact factor: 14.919

9. Strategies to improve reference databases for soil microbiomes.

Authors: Jinlyung Choi; Fan Yang; Ramunas Stepanauskas; Erick Cardenas; Aaron Garoutte; Ryan Williams; Jared Flater; James M Tiedje; Kirsten S Hofmockel; Brian Gelder; Adina Howe
Journal: ISME J Date: 2016-12-09 Impact factor: 10.302
