
MOPED: Model Organism Protein Expression Database.

Eugene Kolker, Roger Higdon, Winston Haynes, Dean Welch, William Broomall, Doron Lancet, Larissa Stanberry, Natali Kolker.

Abstract

Large numbers of mass spectrometry proteomics studies are being conducted to understand all types of biological processes. The size and complexity of proteomics data hinder efforts to easily share, integrate, query and compare the studies. The Model Organism Protein Expression Database (MOPED, http://moped.proteinspire.org) is a new and expanding proteomics resource that enables rapid browsing of protein expression information from publicly available studies on humans and model organisms. MOPED is designed to simplify the comparison and sharing of proteomics data for the greater research community. MOPED uniquely provides protein level expression data, meta-analysis capabilities and quantitative data from standardized analysis. Data can be queried for specific proteins, browsed based on organism, tissue, localization and condition and sorted by false discovery rate and expression. MOPED empowers users to visualize their own expression data and compare it with existing studies. Further, MOPED links to various protein and pathway databases, including GeneCards, Entrez, UniProt, KEGG and Reactome. The current version of MOPED contains over 43,000 proteins with at least one spectral match and more than 11 million high certainty spectra.


Year: 2011 PMID： 22139914 PMCID： PMC3245040 DOI： 10.1093/nar/gkr1177

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

Protein expression, the presence or quantity of a protein in a biological sample, is one of the key measures essential for understanding biological processes. The data serve as a snapshot of the state of an organism at the time of sample collection. Notably, aberrant protein expression patterns in disease states may be indicative of the mis-regulations associated with the disease. MOPED (http://moped.proteinspire.org) was motivated, in part, by the idea that easy public access to protein expression data will enable scientists to better identify and understand protein expression patterns that are related to significant diseases and biological processes. Mass spectrometry-based proteomics is the most common approach used to survey complex samples for the presence of proteins and their expression (1,2). To provide ample context for the data contained in MOPED, we briefly describe a proteomics workflow. Prior to analysis by mass spectrometry, proteins are typically digested into their peptide components. Search engines such as Sequest, Mascot, X!Tandem and OMSSA match the spectra generated by tandem mass spectrometry with peptides from a target protein sequence database (3–6). Due to the highly complex nature of protein samples and their processing, as well as mass spectrometry instrumentation, approaches and analysis, peptide spectral matches are associated with varying degrees of uncertainty (7–9). Once peptide spectral matches are formed, the peptides are amalgamated into protein identifications with associated measures of statistical certainty. Commonly, peptide spectral matches are performed against decoy databases generated by reversing or randomizing the target database to estimate the false discovery rate (FDR) associated with protein and peptide identifications (10,11). 
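The target-decoy strategy mentioned above can be illustrated with a minimal sketch. It assumes a decoy database built by reversing each target sequence and a simple estimator (accepted decoy hits divided by accepted target hits). The function names, the `DECOY_` prefix, the sequences and the scores are all hypothetical, and this is not the procedure of any particular search engine or of the pipeline behind MOPED.

```python
def reverse_decoys(targets):
    """Build a decoy database by reversing each target sequence."""
    return {"DECOY_" + name: seq[::-1] for name, seq in targets.items()}


def estimate_fdr(matches, threshold):
    """Estimate the FDR among matches scoring at or above `threshold`.

    `matches` is a list of (score, is_decoy) pairs. The estimate is the
    number of accepted decoy hits divided by the number of accepted
    target hits (an assumed, commonly used form of the estimator).
    """
    accepted = [is_decoy for score, is_decoy in matches if score >= threshold]
    n_decoy = sum(accepted)
    n_target = len(accepted) - n_decoy
    if n_target == 0:
        return 0.0
    return n_decoy / n_target


# Hypothetical target database and scored peptide spectral matches.
targets = {"P1": "MKTAYIAK", "P2": "GAVLIPFW"}
decoys = reverse_decoys(targets)
matches = [(9.1, False), (8.7, False), (7.9, False), (7.5, True),
           (6.2, False), (5.8, True), (5.1, True), (4.4, False)]

print(decoys["DECOY_P1"])
print(estimate_fdr(matches, 6.0))
```

Raising the score threshold accepts fewer decoy hits relative to target hits, which is why identifications are usually reported at a chosen FDR cut-off rather than at a fixed score.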
From these searches, estimates of protein expression can be determined by using measures such as spectral counts (the number of identified spectra which correspond to a specific protein), sequence coverage and peak areas or intensities (12,13). Expression in mass spectrometry proteomics experiments can be measured dichotomously in terms of the certainty of a protein being present or with quantitative measures that reflect the protein's concentration. Relative expression measures are used for comparing the relative amounts of the same protein across different conditions. Absolute expression, the quantification of different proteins within the same sample, is difficult to measure, in part due to variability in individual protein responses to mass spectrometry assay methods. A number of websites provide host services for massive proteomics datasets (14–17). Although these repositories are excellent resources for accessing raw data and quick experimental summaries, they neither provide protein expression data, nor do they allow for a standardized comparison of expression levels across tissues, localizations and conditions. Furthermore, the extreme scale of data in these repositories makes meta-analysis and even simple querying of these datasets a staggering challenge, often worthy of its own publication (18,19). Such meta-analysis typically requires the download of raw data, whose volume is often measured in terabytes, and analysis of these data through a computationally intensive proteomics workflow. In cases where summary information is available, these data may be in varying formats, have been processed through non-standard pipelines and often provide limited or non-comparable statistical measures of protein identification certainty. Additionally, proteome profiles from other resources omit the relevant expression information (20). The aforementioned challenges hinder the utilization of publicly available proteomics data.
Enabling researchers to access these data in an effective manner is an important challenge in proteomics. MOPED complements the availability of raw data from other resources by presenting standardized data analysis and enabling the user to view experimental data relative to existing expression profiles across many different tissues, localizations and conditions (21). Where there are multiple experimental datasets for a given combination of organism, tissue, localization and condition, a meta-analysis is provided based on a recently published approach (18). The simple format of the MOPED data and the straightforward approach to meta-analysis allow for the uncomplicated combination of proteomics datasets. These features and comparisons empower the user to make meaningful statements about identified proteins with respect to the existing knowledge-base.

DATABASE CONTENT

Expression data

The core component of MOPED's database is the repository of expression information from public proteomics datasets. By storing and displaying essential summary information without requiring the user to download any files, MOPED simplifies access to the proteomics data. To maintain statistical integrity, MOPED requires that statistical measures be provided for each protein identification, including the protein FDR and spectral counts. A full list of required measures is found in Table 1. Users may submit data to MOPED by providing either raw files or pre-processed data. Currently, all data displayed in MOPED were analyzed using the standardized data analysis and statistical methods of the SPIRE pipeline (21,22).

Table 1.

The fields required for each protein expression data point in MOPED

Statistic	Definition
Expression percentile	The percentile (0–100%) corresponding to the protein expression level in this experiment
Normalized expression	Spectral count divided by sequence length, normalized to the maximum expression value in the experiment (0–1)
FDR	Cumulative FDR threshold for protein identification
Spectral count	Number of unique spectra identified that correspond to the identified protein
Unique peptides	Number of unique peptide sequences identified
Sequence coverage	Percentage of the protein sequence covered by identified peptide sequences
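As an illustration of how three of the Table 1 statistics relate to the underlying identifications, the following sketch computes normalized expression, expression percentile and sequence coverage for a toy experiment. The percentile rule (share of proteins with expression less than or equal to the protein in question) is an assumption, since the table does not specify it, and all function names and input values are hypothetical.

```python
def normalized_expression(spectral_counts, sequence_lengths):
    """Spectral count per residue, scaled so the experiment maximum is 1."""
    per_residue = {p: spectral_counts[p] / sequence_lengths[p]
                   for p in spectral_counts}
    top = max(per_residue.values())
    return {p: value / top for p, value in per_residue.items()}


def expression_percentile(values):
    """Percentile (0-100) of each protein within the experiment.

    Assumed rule: the share of proteins whose expression is less than
    or equal to this protein's expression.
    """
    n = len(values)
    return {p: 100.0 * sum(1 for w in values.values() if w <= v) / n
            for p, v in values.items()}


def sequence_coverage(sequence, peptides):
    """Percentage of residues covered by at least one identified peptide."""
    covered = [False] * len(sequence)
    for pep in peptides:
        start = sequence.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = sequence.find(pep, start + 1)
    return 100.0 * sum(covered) / len(sequence)


# Hypothetical experiment with three proteins.
counts = {"A": 50, "B": 20, "C": 5}
lengths = {"A": 500, "B": 100, "C": 250}
norm = normalized_expression(counts, lengths)
print(norm)
print(expression_percentile(norm))
print(sequence_coverage("MKTAYIAKQR", ["MKT", "IAK"]))
```

Dividing by sequence length keeps long proteins, which yield more peptides, from appearing more abundant than short proteins with the same spectral count.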


Meta-data

A major problem when accessing public data is a lack of specificity from data providers about experimental protocols. To prevent this frustration, MOPED requires a minimum amount of meta-data that must be included with each dataset. At the experiment level, users must supply a brief experimental description, the source organism from the NCBI taxon database and any applicable journal references (23). Additionally, each protein identification is associated with a tissue, localization and condition which align with the BRENDA Tissue Ontology, Cell Type Ontology and Disease Ontology, respectively (24–26).
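The minimum meta-data described above could be represented and checked along the following lines. This is a sketch under stated assumptions: the class names, field names and the `missing_fields` helper are hypothetical and do not describe MOPED's actual submission format. The only external fact used is that 9606 is the NCBI taxon identifier for Homo sapiens.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentMetadata:
    description: str                                 # brief experimental description
    ncbi_taxon_id: int                               # source organism (NCBI taxon)
    references: list = field(default_factory=list)   # journal references, if any


@dataclass
class IdentificationContext:
    tissue: str        # term meant to align with the BRENDA Tissue Ontology
    localization: str  # term meant to align with the Cell Type Ontology
    condition: str     # term meant to align with the Disease Ontology


def missing_fields(experiment, context):
    """List required meta-data fields that are empty or invalid."""
    missing = []
    if not experiment.description.strip():
        missing.append("description")
    if experiment.ncbi_taxon_id <= 0:
        missing.append("ncbi_taxon_id")
    for name in ("tissue", "localization", "condition"):
        if not getattr(context, name).strip():
            missing.append(name)
    return missing


# Hypothetical submission with the localization left blank.
experiment = ExperimentMetadata("Hypothetical liver proteome survey", 9606)
context = IdentificationContext(tissue="liver", localization="", condition="control")
print(missing_fields(experiment, context))
```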

Organisms

MOPED contains information on both humans and model organisms. Studying model organisms not only increases our understanding of biological systems but can also inform our knowledge of homologous systems in humans and other species (27). Thus far, MOPED contains data from four of the most studied organisms: Homo sapiens (human), Mus musculus (mouse), Caenorhabditis elegans (worm) and Saccharomyces cerevisiae (yeast).

Protein information

To maximize information content, MOPED has been built to link out to many of the most popular and useful data resources. In terms of protein identifiers, MOPED has universal links to the heavily utilized UniProt and NCBI databases and organism-specific links to the authoritative WormBase and Saccharomyces Genome Database (28–31). A symbiotic relationship has been established whereby MOPED links to GeneCards and GeneCards displays MOPED’s data (32). MOPED contains an innovative database that extends coverage of proteins to pathway databases (KEGG, Reactome, MetaCyc, PANTHER and SEED) using orthologous groups of proteins specified by both the aforementioned pathway databases and eggNOG (33–38). In total, MOPED links to 10 external databases.

Release statistics

As of 10 November 2011, MOPED contains 43 794 proteins with at least one high certainty spectral match, 23 167 proteins with an FDR<1% and more than 11 million spectra (39). These data come from 35 experiments on 4 organisms covering 13 tissues, 21 localizations and 10 conditions. Organism-specific release statistics are in Table 2. In addition to individual experiments, the database also contains meta-analyses of yeast and worm data based upon the recently published approach to meta-analysis (18).

Table 2.

Release statistics as of 10 November 2011

Species	Proteins with at least one spectral match	Proteins with <1% FDR	High confidence spectra
Homo sapiens (human)	15 847	6102	3 906 048
Mus musculus (mouse)	10 308	5935	2 650 237
Caenorhabditis elegans (worm)	10 922	7383	1 979 744
Saccharomyces cerevisiae (yeast)	6717	3747	2 809 390
Total	43 794	23 167	11 345 419


USER INTERFACE

MOPED front page

The MOPED front page (http://moped.proteinspire.org) provides a description of the MOPED resource and contains tabs to search the database, upload data and view help files.

MOPED search view

MOPED's access point to proteomics data is located in the ‘Search’ tab. From this view, users are able to access the entirety of MOPED's expression database (Figure 1, top). Protein expression data can be both browsed by categories such as organism, tissue and localization and queried by protein ID and keywords. After the user has selected filters, clicking the ‘Search’ button quickly renders all matching expression data points and associated meta-data. Most of the search view is dominated by the ‘Protein ID and Expression Summary’ section, which displays expression data resulting from the user's query. Each row in the expression summary table displays all statistical information contained in Table 1, as well as experimental meta-data. Complete protein annotations can be viewed by hovering over either the protein IDs or partial annotations. The set of meta-data corresponding to all displayed expression information is summarized under the separate ‘Experiment Summaries’ table. The filtering capabilities at the top of the MOPED interface's Search tab allow users to query on these different experiments.
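The browse, query and sort behaviour of the search view can be approximated by a small in-memory filter. This is only a sketch of the idea: the `ExpressionRecord` fields and the `search` function are hypothetical, not MOPED's schema or API. P06733 is the protein shown in Figure 1, but the second protein ID and every tissue, localization, FDR and expression value below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExpressionRecord:
    protein_id: str
    organism: str
    tissue: str
    localization: str
    condition: str
    fdr: float
    expression: float


def search(records, max_fdr=None, sort_by="fdr", **filters):
    """Return records matching every filter, sorted by FDR or expression.

    `filters` maps a field name (e.g. organism, tissue, protein_id) to the
    required value; `max_fdr` keeps only records at or below that FDR.
    """
    hits = [
        rec for rec in records
        if all(getattr(rec, name) == value for name, value in filters.items())
        and (max_fdr is None or rec.fdr <= max_fdr)
    ]
    if sort_by == "fdr":
        hits.sort(key=lambda rec: rec.fdr)  # most certain first
    elif sort_by == "expression":
        hits.sort(key=lambda rec: rec.expression, reverse=True)  # highest first
    else:
        raise ValueError(f"unknown sort key: {sort_by}")
    return hits


# Illustrative records; all values are made up.
records = [
    ExpressionRecord("P06733", "Homo sapiens", "liver", "cytoplasm", "control", 0.001, 0.82),
    ExpressionRecord("P06733", "Homo sapiens", "kidney", "cytoplasm", "control", 0.020, 0.35),
    ExpressionRecord("HYPO_1", "Homo sapiens", "liver", "mitochondrion", "control", 0.004, 0.91),
]
liver = search(records, max_fdr=0.01, sort_by="expression", tissue="liver")
print([rec.protein_id for rec in liver])
```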

Figure 1.

MOPED views. The main MOPED view (top) and the protein view (bottom). Clicking on links for an identified protein in the main MOPED view brings up the protein view. In this example, P06733 has been selected from the main MOPED view.

MOPED protein view

Clicking on a protein ID from any tab allows the user to open a page containing all stored information related to that protein, including the protein annotation, links to protein and pathway databases and identifications of that protein in other MOPED experiments (Figure 1, bottom). The primary advantage of MOPED's protein view over other databases is the presentation of expression data from many experiments side by side. On the protein page, MOPED automatically displays the expression information for that protein in every experiment contained in MOPED (Figure 1, bottom). Ideally, this information will enable the user to identify meaningful expression patterns across different conditions. The same expression information has been incorporated into both GeneCards (human data only) and SPIRE (21,32).

MOPED upload

Through the upload tab, users can compare their experimental data with the data contained in the MOPED servers. User upload of data automatically filters MOPED data to display only those proteins which were identified in the user's experiment. For identification-only queries, users are able to upload a list of UniProt protein identifiers. For expression-based queries, users may upload UniProt protein identifiers, expression and FDR values and condition names. Once this information has been uploaded, the user can experiment with several functionalities in the Upload tab (Figure 2). MOPED displays the data for proteins identified in both the user's experiment and experiments in the MOPED servers. These data may be interrogated in the same manner as the MOPED search page. For identification visualization, MOPED separates user data based on condition and generates overlap plots of the identifications with dynamic thresholding by protein FDR (Figure 3). For expression visualization, MOPED dynamically generates heatmaps of the user-uploaded data with user-specified expression value thresholding (Figure 4).
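The data preparation behind an overlap plot and a thresholded heatmap can be sketched as follows. The row layout (protein ID, condition, FDR, expression), the function names and all values are hypothetical. The heatmap rule used here, dropping proteins whose expression never reaches the threshold, is an assumption, since the text does not specify how the threshold is applied.

```python
def identifications_by_condition(rows, max_fdr):
    """Group protein IDs by condition, keeping identifications with FDR <= max_fdr."""
    groups = {}
    for protein_id, condition, fdr, _expression in rows:
        if fdr <= max_fdr:
            groups.setdefault(condition, set()).add(protein_id)
    return groups


def overlap_counts(set_a, set_b):
    """Counts for a two-set overlap plot: unique to A, shared, unique to B."""
    return len(set_a - set_b), len(set_a & set_b), len(set_b - set_a)


def heatmap_matrix(rows, min_expression):
    """Protein-by-condition matrix; missing cells are None.

    Assumed rule: a protein is kept only if its expression reaches
    `min_expression` in at least one condition.
    """
    conditions = sorted({condition for _, condition, _, _ in rows})
    values = {}
    for protein_id, condition, _fdr, expression in rows:
        values.setdefault(protein_id, {})[condition] = expression
    proteins = sorted(p for p, v in values.items()
                      if max(v.values()) >= min_expression)
    matrix = [[values[p].get(c) for c in conditions] for p in proteins]
    return proteins, conditions, matrix


# Hypothetical upload with two conditions, cancer and control.
rows = [
    ("P1", "cancer", 0.001, 0.9), ("P2", "cancer", 0.020, 0.4),
    ("P3", "cancer", 0.005, 0.1), ("P1", "control", 0.004, 0.3),
    ("P3", "control", 0.030, 0.2), ("P4", "control", 0.002, 0.7),
]
groups = identifications_by_condition(rows, max_fdr=0.01)
print(overlap_counts(groups["cancer"], groups["control"]))
print(heatmap_matrix(rows, min_expression=0.25))
```

Re-running `identifications_by_condition` with a different `max_fdr` mimics the dynamic FDR thresholding: relaxing the cut-off from 0.01 to 0.05 in this toy data increases the shared identifications from one to two.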

Figure 2.

Upload tab. Users may upload their own data through the upload tab. These data can then be visualized by clicking any of the ‘Generate’ links under their associated functionalities. Experiment summaries and details create a view at the bottom of the screen akin to the view in Figure 1. The overlap plot and heatmap views are seen in Figure 3 and Figure 4, respectively.

Figure 3.

Overlap plot. An overlap plot generated for data from Ref. (42) with two conditions, cancer and control.

Figure 4.

Heatmap view of user-uploaded expression data.

MOPED documentation

MOPED provides a comprehensive help file and a tutorial example that clarify its usage and highlight its features. This documentation is accessible under the Help tab and comes in the form of two PDF files. The tutorial contains real data examples.

FUTURE DIRECTIONS

Increased data and public data submission

MOPED is currently involved in a number of collaborations that will dramatically increase the amount of proteomics data available. Though all MOPED data are currently loaded in-house, work is in progress to create an interface for public submission of proteomics expression data. Users will be able to fulfill publication and grant requirements for data preservation by uploading their datasets to MOPED. Researchers interested in submitting their data are invited to contact the MOPED team at moped@proteinspire.org. In addition to increasing the number of protein identification experiments, MOPED plans to utilize data from relative expression experiments, providing users with expression ratios and statistical significance for many different condition comparisons.

Increased visualization

MOPED remains under continuous development to improve all components of the user experience. Currently, work is underway to develop a plug-in for Cytoscape that provides pathway level visualization of the experimental data currently residing in MOPED (40). The goal is to maximize the user's knowledge of fluctuating patterns of pathway regulation (Supplementary Figure S5). Additionally, scripts are being developed to dynamically visualize experimental expression relative to the MOPED experiments (Supplementary Figure S6).

Integration of other omics data

While proteomics data provide comprehensive insight into cellular mechanisms at the protein level, combining proteomics knowledge with other omics disciplines stands to develop a more complete understanding of complex biological systems. Metabolomics, transcriptomics, lipidomics and genomics are notable disciplines for which integrated analysis with proteomics is a natural extension. For example, proteomics data from MOPED could be linked with transcriptomics data from GEO for common organ, tissue, localization and condition combinations (41).
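One way such a linkage might work is to group both data types by their shared experimental context and keep only the contexts present in both. The sketch below is speculative: the key fields, the record layout and the `link_by_context` function are assumptions and do not describe an existing MOPED or GEO interface, and all records are invented.

```python
KEY_FIELDS = ("organism", "tissue", "localization", "condition")  # assumed keys


def link_by_context(protein_rows, transcript_rows, key_fields=KEY_FIELDS):
    """Pair proteomics and transcriptomics rows that share a context.

    Returns a dict mapping each shared context tuple to a
    (protein rows, transcript rows) pair.
    """
    def group(rows):
        grouped = {}
        for row in rows:
            key = tuple(row[name] for name in key_fields)
            grouped.setdefault(key, []).append(row)
        return grouped

    proteins = group(protein_rows)
    transcripts = group(transcript_rows)
    shared = proteins.keys() & transcripts.keys()
    return {key: (proteins[key], transcripts[key]) for key in shared}


# Hypothetical records from the two data types.
liver = {"organism": "Homo sapiens", "tissue": "liver",
         "localization": "cytoplasm", "condition": "control"}
kidney = dict(liver, tissue="kidney")
protein_rows = [dict(liver, protein_id="P1"), dict(kidney, protein_id="P2")]
transcript_rows = [dict(liver, gene="G1")]

linked = link_by_context(protein_rows, transcript_rows)
print(list(linked))
```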

DISCUSSION

Currently, proteomics datasets are either scattered throughout individual data repositories or trapped within labs’ own databases. Knowledge discovery is often obscured by bulky datasets, non-standard formats, missing meta-data and limited access to data. MOPED presents a solution which addresses these challenges. MOPED provides essential statistical summaries and a number of query and visualization tools to relate the findings to those observed in other experiments. Patterns of expression within and across sample sets can be visualized, proteins of interest can be directly queried and condition-specific expression data can be browsed. As a community resource, MOPED will increase reliable data proliferation and make analysis more comprehensive.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Figures 5 and 6.

FUNDING

National Science Foundation (DBI grant 0544757); National Institutes of Health (NIGMS grant 5R01 GM076680-02, NIDDK grants UO1 DK072473, 1U01DK089571-01); the McMillen Foundation (grant to E.K.). Funding for open access charge: Seattle Children's. Conflict of interest statement. None declared.

42 in total

Review 1. Advances in proteome analysis by mass spectrometry.

Authors: T J Griffin; D R Goodlett; R Aebersold
Journal: Curr Opin Biotechnol Date: 2001-12 Impact factor: 9.740

2. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search.

Authors: Andrew Keller; Alexey I Nesvizhskii; Eugene Kolker; Ruedi Aebersold
Journal: Anal Chem Date: 2002-10-15 Impact factor: 6.986

3. Open mass spectrometry search algorithm.

Authors: Lewis Y Geer; Sanford P Markey; Jeffrey A Kowalak; Lukas Wagner; Ming Xu; Dawn M Maynard; Xiaoyu Yang; Wenyao Shi; Stephen H Bryant
Journal: J Proteome Res Date: 2004 Sep-Oct Impact factor: 4.466

Review 4. Protein identification and expression analysis using mass spectrometry.

Authors: Eugene Kolker; Roger Higdon; Jason M Hogan
Journal: Trends Microbiol Date: 2006-04-17 Impact factor: 17.079

5. Annotating the human genome with Disease Ontology.

Authors: John D Osborne; Jared Flatow; Michelle Holko; Simon M Lin; Warren A Kibbe; Lihua Julie Zhu; Maria I Danila; Gang Feng; Rex L Chisholm
Journal: BMC Genomics Date: 2009-07-07 Impact factor: 3.969

6. Reactome: a database of reactions, pathways and biological processes.

Authors: David Croft; Gavin O'Kelly; Guanming Wu; Robin Haw; Marc Gillespie; Lisa Matthews; Michael Caudy; Phani Garapati; Gopal Gopinath; Bijay Jassal; Steven Jupe; Irina Kalatskaya; Shahana Mahajan; Bruce May; Nelson Ndegwa; Esther Schmidt; Veronica Shamovsky; Christina Yung; Ewan Birney; Henning Hermjakob; Peter D'Eustachio; Lincoln Stein
Journal: Nucleic Acids Res Date: 2010-11-09 Impact factor: 16.971

7. NCBI GEO: archive for functional genomics data sets--10 years on.

Authors: Tanya Barrett; Dennis B Troup; Stephen E Wilhite; Pierre Ledoux; Carlos Evangelista; Irene F Kim; Maxim Tomashevsky; Kimberly A Marshall; Katherine H Phillippy; Patti M Sherman; Rolf N Muertter; Michelle Holko; Oluwabukunmi Ayanbule; Andrey Yefanov; Alexandra Soboleva
Journal: Nucleic Acids Res Date: 2010-11-21 Impact factor: 16.971

8. Ongoing and future developments at the Universal Protein Resource.

Authors: UniProt Consortium
Journal: Nucleic Acids Res Date: 2010-11-04 Impact factor: 16.971

9. Cytoscape 2.8: new features for data integration and network visualization.

Authors: Michael E Smoot; Keiichiro Ono; Johannes Ruscheinski; Peng-Liang Wang; Trey Ideker
Journal: Bioinformatics Date: 2010-12-12 Impact factor: 6.937

10. An ontology for cell types.

Authors: Jonathan Bard; Seung Y Rhee; Michael Ashburner
Journal: Genome Biol Date: 2005-01-14 Impact factor: 13.583
