Michael Riffle, Damon H May, Emma Timmins-Schiffman, Molly P Mikan, Daniel Jaschob, William Stafford Noble, Brook L Nunn.
Abstract
Metaproteomics is the characterization of all proteins being expressed by a community of organisms in a complex biological sample at a single point in time. Applications of metaproteomics range from the comparative analysis of environmental samples (such as ocean water and soil) to microbiome data from multicellular organisms (such as the human gut). Metaproteomics research is often focused on the quantitative functional makeup of the metaproteome and which organisms are making those proteins. That is: What are the functions of the currently expressed proteins? How much of the metaproteome is associated with those functions? And which microorganisms are expressing the proteins that perform those functions? However, traditional protein-centric functional analysis is greatly complicated by the large size, redundancy, and lack of biological annotations for the protein sequences in the database used to search the data. To help address these issues, we have developed an algorithm and web application (dubbed "MetaGOmics") that automates the quantitative functional (using Gene Ontology) and taxonomic analysis of metaproteomics data and subsequent visualization of the results. MetaGOmics is designed to overcome the shortcomings of traditional proteomics analysis when used with metaproteomics data. It is easy to use, requires minimal input, and fully automates most steps of the analysis, including comparing the functional makeup between samples. MetaGOmics is freely available at https://www.yeastrc.org/metagomics/.
Keywords: bioinformatics; data visualization; gene ontology; mass spectrometry; metaproteomics; proteomics; software
Year: 2017 PMID: 29280960 PMCID: PMC5874761 DOI: 10.3390/proteomes6010002
Source DB: PubMed Journal: Proteomes ISSN: 2227-7382
Figure 1. The MetaGOmics algorithm. (A) The first phase of the algorithm (functional analysis) examines all peptides identified in a mass spectrometry (MS) experiment. Each peptide is matched to proteins in the FASTA file (metaPROTEIN in figure), those are matched to UniProtKB proteins via BLAST (uniprotPROTEIN in figure), and Gene Ontology (GO) annotations for the UniProtKB proteins are used to create complete GO graphs for each protein containing direct annotations and all ancestor terms. All GO graphs from all proteins matched by a peptide are merged into a single, non-redundant GO graph (the union of the sets), and the spectral count of each term is increased by the spectral count for the peptide. This process is repeated for all peptides in the experiment to obtain final spectral counts for all GO terms; (B) The second phase of the algorithm, taxonomic analysis of functions, examines all peptides that are assigned a specific GO term. Each peptide is matched to a FASTA protein (metaPROTEIN in figure), the FASTA proteins are matched to UniProtKB proteins via BLAST (uniprotPROTEIN in figure), and taxonomic annotations for the UniProtKB proteins are used. A taxonomic tree is generated containing the direct taxonomic annotations and all ancestor terms. All taxonomic trees resulting from all matched proteins are merged such that the resulting tree contains only those terms present in all trees (the intersection of the sets). The taxonomic terms have their spectral count increased by the spectral count of the peptide. After all peptides assigned to a GO term are processed, the ratio of the spectral count of each taxonomic term to the total spectral count of the GO term is calculated. This provides the relative, unambiguous contribution (in spectral count) of each taxon to a GO term at any arbitrary level of the taxonomic tree.
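The functional-analysis phase described in panel (A) can be illustrated with a minimal Python sketch. It is not the MetaGOmics implementation: the peptide, protein, and UniProtKB identifiers are hypothetical, the BLAST step is replaced by a hard-coded lookup table, and the GO hierarchy is a simplified three-term toy. It only shows how the ancestor-closed GO term sets of all proteins matched by a peptide are merged by union before the peptide's spectral count is added once to each term.

```python
from collections import defaultdict

# Hypothetical toy data; none of these mappings come from the paper.
GO_PARENTS = {
    "GO:0005524": {"GO:0005488"},  # ATP binding -> binding (simplified hierarchy)
    "GO:0005488": {"GO:0003674"},  # binding -> molecular_function
    "GO:0003674": set(),           # root term
}
PEPTIDE_TO_PROTEINS = {"PEPTIDEA": {"meta1", "meta2"}, "PEPTIDEB": {"meta2"}}
PROTEIN_TO_UNIPROT = {"meta1": {"U1"}, "meta2": {"U2"}}  # stand-in for BLAST hits
UNIPROT_TO_GO = {"U1": {"GO:0005524"}, "U2": {"GO:0005488"}}


def with_ancestors(terms, parents):
    """Return the given GO terms plus every ancestor reachable via `parents`."""
    seen, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


def go_spectral_counts(peptide_counts):
    """Add each peptide's spectral count once to every term in its merged GO set."""
    totals = defaultdict(int)
    for peptide, count in peptide_counts.items():
        merged = set()  # union over all proteins matched by this peptide
        for meta in PEPTIDE_TO_PROTEINS.get(peptide, ()):
            for uniprot in PROTEIN_TO_UNIPROT.get(meta, ()):
                merged |= with_ancestors(UNIPROT_TO_GO.get(uniprot, ()), GO_PARENTS)
        for term in merged:
            totals[term] += count
    return dict(totals)


counts = go_spectral_counts({"PEPTIDEA": 10, "PEPTIDEB": 5})
```

Because the union is taken per peptide, a term shared by several matched proteins is counted only once for that peptide, which is what keeps the root terms from exceeding the total spectral count.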
Small subset of GO terms, spectral counts, and relative abundance ratio in a hypothetical mass spectrometry experiment.
| GO Accession String | GO Aspect | GO Name | Spectral Count | Ratio |
|---|---|---|---|---|
| GO:0005575 | cellular_component | cellular_component | 12,217 | 1 |
| GO:0008150 | biological_process | biological_process | 12,217 | 1 |
| GO:0003674 | molecular_function | molecular_function | 12,217 | 1 |
| unknownprc | biological_process | unknown biological process | 5472 | 0.45 |
| GO:0005488 | molecular_function | binding | 4185 | 0.34 |
| GO:0097159 | molecular_function | organic cyclic compound binding | 3579 | 0.29 |
| GO:1901363 | molecular_function | heterocyclic compound binding | 3579 | 0.29 |
| GO:0005524 | molecular_function | ATP binding | 1712 | 0.14 |
| GO:1901566 | biological_process | organonitrogen compound biosynthetic process | 1353 | 0.11 |
| GO:0042026 | biological_process | protein refolding | 1145 | 0.09 |
| GO:1990351 | cellular_component | transporter complex | 200 | 0.02 |
For a given GO term, the table lists each taxon, its taxonomic rank, its spectral count, the fraction of this GO term's spectral count, and the fraction of all spectra in the experiment that could be unambiguously attributed to that taxon. For example, 88% of the spectra for this GO term were attributable to the Bacteroidetes phylum, and 3.5% of the spectra in the experiment were attributable to this GO term and the Bacteroidetes phylum.
| Taxon Name | Taxonomy Rank | Spectral Count | Ratio of GO | Ratio of Experiment |
|---|---|---|---|---|
| | superkingdom | 240 | 0.88 | 3.50 × 10⁻² |
| | class | 141 | 0.52 | 2.05 × 10⁻² |
| | phylum | 141 | 0.52 | 2.05 × 10⁻² |
| | order | 141 | 0.52 | 2.05 × 10⁻² |
| | genus | 81 | 0.3 | 1.18 × 10⁻² |
| | family | 81 | 0.3 | 1.18 × 10⁻² |
| | phylum | 41 | 0.15 | 5.97 × 10⁻³ |
| | order | 33 | 0.12 | 4.81 × 10⁻³ |
| | family | 33 | 0.12 | 4.81 × 10⁻³ |
| | genus | 33 | 0.12 | 4.81 × 10⁻³ |
| | class | 33 | 0.12 | 4.81 × 10⁻³ |
| | species | 23 | 0.08 | 3.35 × 10⁻³ |
| | order | 6 | 0.02 | 8.74 × 10⁻⁴ |
| | class | 6 | 0.02 | 8.74 × 10⁻⁴ |
| | phylum | 5 | 0.02 | 7.28 × 10⁻⁴ |
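The taxonomic phase that produces a table of this shape can be sketched in Python as follows. This is an illustration under stated assumptions, not the published code: the lineages, UniProtKB identifiers, and counts are invented, and lineages are flat lists rather than a full taxonomy. The key step is that a peptide is credited only to the taxa shared by every protein it matches (set intersection), after which each taxon's count is divided by the GO term's total and by the experiment's total.

```python
from collections import defaultdict

# Hypothetical lineages (root to leaf); identifiers and taxa are illustrative only.
LINEAGES = {
    "U1": ["Bacteria", "Bacteroidetes", "Flavobacteriia"],
    "U2": ["Bacteria", "Bacteroidetes", "Cytophagia"],
    "U3": ["Bacteria", "Proteobacteria"],
}


def shared_taxa(proteins):
    """Taxa present in the lineage of every matched protein (set intersection)."""
    lineages = [set(LINEAGES[p]) for p in proteins]
    return set.intersection(*lineages) if lineages else set()


def taxon_ratios(peptides, experiment_total):
    """peptides: list of (spectral_count, matched_proteins) for a single GO term.

    Returns {taxon: (spectral_count, ratio_of_go, ratio_of_experiment)}.
    """
    go_total = sum(count for count, _ in peptides)
    counts = defaultdict(int)
    for count, proteins in peptides:
        for taxon in shared_taxa(proteins):
            counts[taxon] += count
    return {
        taxon: (n, n / go_total, n / experiment_total)
        for taxon, n in counts.items()
    }


rows = taxon_ratios([(6, ["U1", "U2"]), (4, ["U3"])], experiment_total=100)
```

In this toy input the first peptide matches proteins from two different classes, so it is credited to the phylum and superkingdom they share but to neither class. This is why ratios at deeper ranks need not sum to 1.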
For the comparison of two hypothetical MS experiments, a small subset of the GO terms, log-fold changes, and q-values for GO terms detected in the two experiments.
| GO Name | Log Fold Change | q-Value |
|---|---|---|
| outer membrane | 1.55 | 5.27 × 10⁻¹⁰⁶ |
| cell outer membrane | 1.55 | 5.61 × 10⁻¹⁰⁶ |
| external encapsulating structure part | 1.5 | 5.64 × 10⁻¹⁰² |
| membrane | 1.14 | 3.00 × 10⁻¹⁰¹ |
| receptor activity | 1.47 | 5.03 × 10⁻⁹³ |
| intrinsic component of membrane | 1.44 | 6.37 × 10⁻⁸⁸ |
| integral component of membrane | 1.44 | 6.37 × 10⁻⁸⁸ |
| molecular transducer activity | 1.35 | 4.14 × 10⁻⁸¹ |
| membrane part | 1.01 | 4.69 × 10⁻⁵³ |
| carbohydrate derivative binding | −2.03 | 1.25 × 10⁻⁴⁹ |
| ribonucleotide binding | −2.03 | 1.25 × 10⁻⁴⁹ |
| purine ribonucleoside binding | −2.04 | 5.36 × 10⁻⁴⁷ |
| ribonucleoside binding | −2.04 | 5.36 × 10⁻⁴⁷ |
| purine ribonucleoside triphosphate binding | −2.04 | 5.36 × 10⁻⁴⁷ |
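To make the two columns of the comparison table above concrete, the following Python sketch computes a base-2 log fold change between the spectral count ratios of one GO term in two experiments, together with a raw p-value. Both details are assumptions made for illustration: the pseudocount that avoids division by zero and the two-proportion z-test are common choices, and this record does not state which correction or statistical test MetaGOmics uses. The function names are hypothetical.

```python
import math


def log2_fold_change(count_a, total_a, count_b, total_b, pseudocount=1.0):
    """log2 of (ratio in experiment B / ratio in experiment A).

    The pseudocount is an assumed smoothing term so that a GO term absent
    from one experiment still yields a finite value.
    """
    ratio_a = (count_a + pseudocount) / (total_a + pseudocount)
    ratio_b = (count_b + pseudocount) / (total_b + pseudocount)
    return math.log2(ratio_b / ratio_a)


def two_proportion_p_value(count_a, total_a, count_b, total_b):
    """Two-sided p-value from a pooled two-proportion z-test (assumed test)."""
    pooled = (count_a + count_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (count_b / total_b - count_a / total_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area


lfc = log2_fold_change(100, 1000, 200, 1000, pseudocount=0.0)
p = two_proportion_p_value(100, 1000, 200, 1000)
```

With these toy counts the GO term's share of spectra doubles between the experiments, so the log fold change is 1. The raw p-values from all GO terms would then be adjusted for multiple testing to obtain q-values.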
Figure 2. Screenshots from the MetaGOmics web application. (A) A user fills out an initial form to create a context for MetaGOmics analysis. The user (1) uploads a FASTA file containing a database of protein sequences to which peptides should be matched, (2) selects a BLAST database to use for protein annotations, (3) chooses cutoffs for the BLAST hits, and (4) enters an email address to be notified when processing is complete; (B) After a user submits the form in part (A), a unique URL is created for a page where a user may perform MetaGOmics analysis using the desired FASTA file, BLAST database, and BLAST cutoff settings for all uploaded data. To upload data for analysis, the user clicks “Upload Peptide Count List” to upload a text file containing peptide sequences and spectral counts. Each row under “Uploaded Peptide Count Lists” shows each requested analysis and its current status. Upon completion, users may click the “Download GO Analysis” button to download the results as text reports or images. Two analyses may be compared by clicking the checkbox next to two rows and clicking “Compare Checked Runs.” Comparisons may also be downloaded as text reports or as images.
Figure 3. An example GO graph generated by comparing two experiments using the MetaGOmics server. Each node is a GO term, with lines indicating edges between those nodes in the GO structure. Each node is labeled with the name of the GO term, the log fold change, and the q-value. Nodes with a positive log-fold change are drawn as parallelograms and shaded yellow, with darker shades of yellow indicating more significant q-values. Nodes with a negative log-fold change are drawn as rectangles, with shades of blue indicating q-value significance. Grey terms are not statistically significant. In this example, GO terms with the “cellular component” aspect were compared. The ratios of spectra in the second experiment matching proteins that localized to the outer cell membrane, extracellular space, phosphopyruvate hydratase complex, and integral component of the membrane were significantly reduced, whereas the ratios of spectra matching proteins with an unknown cellular component and the ATP-binding cassette (ABC) transporter complex were increased.
Figure 4. GO term statistics produced by a MetaGOmics analysis comparing ocean surface to bottom water samples from May et al. All data from the analysis are available at https://www.yeastrc.org/metagomics/ocean. (A) Volcano plot depicting the negative log (base 10) of the q-value versus the log (base 2) fold change for all GO terms found in either sample. A horizontal reference line is added for a q-value cutoff of 0.01. A vertical reference line is added for no change. Each point is a GO term and has been colored and re-shaped according to its GO aspect; (B) A scatter plot depicting the log (base 2) fold change from the surface sample to the bottom sample for each GO term versus that GO term’s spectral count ratio in the surface sample. Each GO term has been colored either lavender (not statistically significant) or red (q-value ≤ 0.01).
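The coordinates used in the volcano plot of panel (A) follow directly from the two statistics reported per GO term. The short Python sketch below is only an illustration: the function name is hypothetical, and the two example terms and their values are taken from the table of leaf GO terms for the same surface-versus-bottom comparison.

```python
import math

Q_CUTOFF = 0.01  # q-value cutoff marked by the horizontal reference line


def volcano_point(log2_fc, q_value):
    """Return (x, y, significant) for one GO term on the volcano plot."""
    return log2_fc, -math.log10(q_value), q_value <= Q_CUTOFF


# (log2 fold change, q-value) pairs copied from the leaf GO term table.
terms = {
    "protein refolding": (1.66, 1.01e-167),
    "plasma membrane": (-0.29, 3.37e-4),
}
points = {name: volcano_point(fc, q) for name, (fc, q) in terms.items()}

# Height of the horizontal reference line on the -log10(q) axis.
cutoff_height = -math.log10(Q_CUTOFF)
```

Points above `cutoff_height` are significant at the 0.01 level, and the sign of the x coordinate tells whether the term's share of spectra rose or fell from surface to bottom.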
Up to the top 10 leaf GO terms with a q-value ≤ 0.01 for positive and negative log-fold changes comparing ocean water samples from BSt (Surface) to CS (Bottom) from May et al. Shown are the name of the GO term, the log-fold change from surface to bottom samples, and the q-value resulting from the Benjamini-Hochberg adjustment.
| GO Name | Log Fold Change | q-Value | GO Name | Log Fold Change | q-Value |
|---|---|---|---|---|---|
| **Biological process** | | | | | |
| | −6.09 | 3.49 × 10⁻⁷³ | protein refolding | 1.66 | 1.01 × 10⁻¹⁶⁷ |
| translation | −0.77 | 2.43 × 10⁻⁵⁷ | chromosome condensation | 1.52 | 3.69 × 10⁻⁵⁰ |
| translational elongation | −1.11 | 9.60 × 10⁻²⁶ | DNA repair | 1.61 | 1.82 × 10⁻⁷ |
| transcription anti-termination | −2.79 | 5.66 × 10⁻⁸ | dephosphorylation | 2.04 | 6.62 × 10⁻⁷ |
| fatty acid biosynthetic process | −1.21 | 6.42 × 10⁻⁸ | ‘de novo’ pyrimidine nucleobase biosynthetic process | 3.74 | 4.53 × 10⁻⁵ |
| GTP biosynthetic process | −5.12 | 7.30 × 10⁻⁸ | RNA phosphodiester bond hydrolysis, exonucleolytic | 1 | 5.52 × 10⁻⁵ |
| UTP biosynthetic process | −5.12 | 7.30 × 10⁻⁸ | mRNA catabolic process | 0.93 | 1.69 × 10⁻⁴ |
| CTP biosynthetic process | −5.12 | 7.30 × 10⁻⁸ | 7,8-dihydroneopterin 3′-triphosphate biosynthetic process | 3.09 | 4.48 × 10⁻³ |
| tricarboxylic acid cycle | −3.84 | 4.41 × 10⁻⁸ | response to cadmium ion | 1.45 | 7.42 × 10⁻³ |
| cell division | −1.07 | 1.18 × 10⁻⁵ | | | |
| **Molecular function** | | | | | |
| monosaccharide binding | −6.07 | 6.71 × 10⁻⁷² | histidine ammonia-lyase activity | 4.54 | 1.93 × 10⁻¹¹³ |
| receptor activity | −1.03 | 5.77 × 10⁻⁶⁵ | unfolded protein binding | 0.83 | 2.85 × 10⁻⁵⁷ |
| structural constituent of ribosome | −0.68 | 1.12 × 10⁻³⁴ | nitrate reductase activity | 7.35 | 4.13 × 10⁻³⁵ |
| DNA-directed RNA polymerase activity | −1.68 | 4.06 × 10⁻³⁴ | heme binding | 3.38 | 4.23 × 10⁻³⁵ |
| translation elongation factor activity | −1.09 | 7.50 × 10⁻²⁵ | ATP binding | 0.51 | 6.87 × 10⁻³⁴ |
| GTP binding | −0.92 | 6.40 × 10⁻¹⁷ | 4 iron, 4 sulfur cluster binding | 2.33 | 3.28 × 10⁻²⁷ |
| GTPase activity | −0.92 | 8.64 × 10⁻⁷ | prephenate dehydratase activity | 3.94 | 4.38 × 10⁻¹³ |
| nucleoside diphosphate kinase activity | −5.11 | 1.28 × 10⁻⁷ | selenium binding | 2 | 8.97 × 10⁻¹³ |
| tRNA binding | −0.87 | 3.00 × 10⁻⁶ | 4-phytase activity | 3.45 | 7.83 × 10⁻⁷³ |
| acetyl-CoA carboxylase activity | −3.91 | 3.46 × 10⁻⁶ | formate dehydrogenase (NAD+) activity | 1.48 | 9.79 × 10⁻⁶ |
| **Cellular component** | | | | | |
| cell outer membrane | −0.98 | 1.17 × 10⁻⁴³ | cytoplasm | 0.74 | 8.53 × 10⁻⁸⁴ |
| intracellular | −0.71 | 1.09 × 10⁻³⁶ | bacterial-type flagellum filament | 3.49 | 1.47 × 10⁻¹⁵ |
| ribosome | −0.64 | 1.78 × 10⁻³³ | bacterial-type flagellum | 2.01 | 1.53 × 10⁻⁸ |
| integral component of membrane | −0.98 | 9.77 × 10⁻²⁴ | unknown cellular component | 0.07 | 1.04 × 10⁻⁶ |
| thylakoid | −2.08 | 1.07 × 10⁻¹¹ | ATP-binding cassette (ABC) transporter complex | 0.6 | 1.04 × 10⁻⁵ |
| large ribosomal subunit | −1.21 | 1.54 × 10⁻¹¹ | cytosolic small ribosomal subunit | 3.66 | 9.19 × 10⁻³ |
| acetyl-CoA carboxylase complex | −3.84 | 4.22 × 10⁻⁶ | | | |
| plasma membrane | −0.29 | 3.37 × 10⁻⁴ | | | |
| pyruvate dehydrogenase complex | −3.99 | 7.77 × 10⁻⁴ | | | |
| proton-transporting ATP synthase complex, catalytic core F(1) | −0.37 | 1.36 × 10⁻³ | | | |
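The q-values in this table result from a Benjamini-Hochberg adjustment. As a reminder of what that adjustment does, here is a minimal pure-Python sketch of the standard procedure. It is a generic textbook implementation, not code from MetaGOmics, and the input p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that q-values are
    # non-decreasing in p; each raw value is p * m / rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

For these four made-up p-values the adjusted values are approximately 0.02, 0.04, 0.04, and 0.02. A GO term is then reported as significant when its q-value is at or below the chosen cutoff (0.01 in the figures and table here).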