ProteomicsDB: toward a FAIR open-source resource for life-science research.

Ludwig Lautenbacher1, Patroklos Samaras2, Julian Muller2, Andreas Grafberger2, Marwin Shraideh3,4, Johannes Rank3,4, Simon T Fuchs3,4, Tobias K Schmidt2, Matthew The2, Christian Dallago5,6, Holger Wittges3,4, Burkhard Rost5,7, Helmut Krcmar3,4, Bernhard Kuster2,8, Mathias Wilhelm1.   

Abstract

ProteomicsDB (https://www.ProteomicsDB.org) is a multi-omics and multi-organism resource for life science research. In this update, we present our efforts to continuously develop and expand ProteomicsDB. The major focus over the last two years was improving the findability, accessibility, interoperability and reusability (FAIR) of the data as well as its implementation. For this purpose, we first release a new application programming interface (API) that provides systematic access to essentially all data in ProteomicsDB. Second, we release a new open-source user interface (UI) and show the advantages the scientific community gains from such software. With the new interface, two new visualizations of protein primary, secondary and tertiary structure as well as an updated spectrum viewer were added. Furthermore, we integrated ProteomicsDB with our deep neural network Prosit, which can predict the fragmentation characteristics and retention time of peptides. The result is an automatic processing pipeline that can be used to reevaluate database search engine results stored in ProteomicsDB. In addition, we extended the data content with experiments investigating different human biology as well as a newly supported organism.
© The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research.

Year:  2022        PMID: 34791421      PMCID: PMC8728203          DOI: 10.1093/nar/gkab1026

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

ProteomicsDB (https://www.ProteomicsDB.org) has developed into a multi-omics and multi-organism resource for life science research (1). It is built upon the in-memory-database technology HANA (2) enabling fast access to stored data and thus offering real-time data analytics capabilities. ProteomicsDB was originally developed to investigate large quantities of human quantitative mass spectrometry-based proteomics data, highlighted on one of the first drafts of the human proteome (3,4). However, over the past years it was extended to include additional organisms including Mus musculus and Arabidopsis thaliana (5) as well as additional omics types, such as transcriptomics and phenomics data (1,4). Because of this, ProteomicsDB has become a rich and valuable resource for life science research and extends beyond the scope of proteomics experiments. This is visible by the external resources integrating with ProteomicsDB, such as GeneCards (6), UniProt (7), OmniPathDB (8) and Gene Information eXtension (GIX) (9). Today, we notice on average ∼500 unique visitors per day. A unique characteristic of ProteomicsDB is its ability to integrate large amounts of diverse data. For example, while Expression Atlas (10) provides differential and baseline proteomics and transcriptomics data for a diverse set of organisms that can be explored online, the analysis is limited to the investigation of single experiments. In ProteomicsDB, the expression information across hundreds or thousands of experiments can be investigated simultaneously. In MaxQB (11), researchers are able to retrieve data from individual proteins similar to ProteomicsDB. However, the stored data are limited to proteomics with a limited number of distinct experiments. For example, the expression information of epidermal growth factor receptor (EGFR) in MaxQB covers 11 cell lines while ProteomicsDB provides information for 41 tissues and body-fluids as well as for 60 cell lines. 
For 52 of these, ProteomicsDB also provides cell viability information. Large data stewards, like ProteomicsDB, have the obligation to provide access to their data content in a way that also enables other researchers to reproduce, reanalyze and integrate the data. The specific requirements and principles behind this concern the Findability, Accessibility, Interoperability, and Reusability (FAIR) of (research) data (12). Following this movement, additional work expanded these principles to account for (research) software as well (13). The need for this separation becomes clear when considering one concrete principle. The reusability aspect of data is met when rich descriptions of the data are made available in a common data format. For software, this principle is additionally linked to the maintainability of the codebase. This includes the availability of appropriate documentation of the source code (13). The FAIR principles are at the very core of open science and are essential for the scientific community to use generated data effectively. As such, they were a major focus guiding the development of ProteomicsDB over the last two years. In this update, we discuss these developments and specifically highlight our progress in turning ProteomicsDB into a FAIR and open-source resource for life science research. For that purpose, we designed and implemented a reference architecture for ProteomicsDB (14) to enable fast development of new services and keep these services maintainable, manageable and extendable in the future. Based on this, we created a new API that gives users access to essentially all data stored in ProteomicsDB, achieving a major step toward enabling FAIR data access. We also release an open-source re-implementation of the user interface (UI) that not only turns the frontend into a resource that external developers can reuse and expand but also brings ProteomicsDB in accordance with modern web standards. 
In light of this, a new visualization was added that shows the primary, secondary and tertiary structure of proteins. In addition, we imported new data into ProteomicsDB, including data from a new organism, rice (Oryza sativa ssp. japonica), and we created a pipeline to improve the quality of the proteomics data stored within ProteomicsDB by using Prosit, a deep neural network that can predict various properties of peptides (15,16).

RESULTS

Full access to data stored in ProteomicsDB via new API

ProteomicsDB has offered access to its data in the form of an application programming interface (API) since its inception. However, the available APIs limited access to 10 predefined views, all centered on the proteomics data. Even then, users had no access to a large number of internal tables storing information on, for example, the controlled vocabularies used, nor to the omics data types added in the past years. For this reason, we developed a new central API, version two (APIv2.0), that provides access to essentially all data currently stored in ProteomicsDB (Figure 1). During its development, we followed the guidelines and recommendations of the FAIR principles (12) with a focus on making the API of ProteomicsDB accessible and usable for both (non-)bioinformatics researchers and developers. The new version incorporates the functionality of all previously offered APIs, turning it into the central (programmatic) access point to the data stored in ProteomicsDB.
Figure 1.

The architecture of ProteomicsDB. The data content and data layer of ProteomicsDB are accessible via three application programming interfaces (APIs). The API4UI is used by the frontend and contains predefined requests to the data in ProteomicsDB for the purpose of data visualization. The novel vue-based visualization layer of ProteomicsDB (top left) is separated into three levels. The proteomicsdb-components package is agnostic toward ProteomicsDB and thus usable on any website. The package proteomicsdb-wrappers connects the components with ProteomicsDB and can be re-used on any website as well. The package proteomicsdb-views contains the entire vue-based frontend of ProteomicsDB. The APIv1.1 is used by external resources (top right) and will remain publicly available. The new APIv2.0 provides access to virtually all datasets stored in ProteomicsDB.

An important aspect of offering FAIR data access is to use established standards. For this reason, we decided to use the OData Version 2.0 Protocol (https://www.odata.org/documentation/odata-version-2-0/overview/). OData is used for creating HTTP-based data services that can be queried by web clients using standard HTTP messages and respond in a standardized structure. For each OData service, metadata concerning the service is automatically created. This ensures the compliance of all already created and all future API endpoints regarding their findability, accessibility, interoperability and reusability. Furthermore, OData offers a large set of automatically generated functionalities, such as filtering and data formatting [in JSON (17) and XML (18)]. These features are consequently all available in our APIv2.0. For easier navigation, we separated the entire data model of ProteomicsDB into 19 topic clusters. A topic cluster groups multiple entities (e.g. samples and experiments) that contain information about a similar content type (e.g. the repository or transcriptomics data). 
For example, the repository of ProteomicsDB is such a topic cluster (Figure 2) where the data on, and relations between, projects, experiments, samples, files, measurements and supplementary files can be queried. In total, the APIv2.0 allows 93 entities to be queried. To query an entity, the URL only contains the requested entity, e.g. /api_v2/api.xsodata/Sample. This query will return the descriptions and metadata of all available samples in ProteomicsDB.
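As a rough illustration of such an entity query, the following Python sketch assembles a request URL and attaches OData Version 2.0 system query options such as `$format`, `$top` and `$filter`. Only the path /api_v2/api.xsodata/Sample is taken from the text. The host prefix in `BASE_URL` and the attribute name `SAMPLE_NAME` in the usage example are assumptions made for illustration and should be checked against the online API documentation.

```python
from urllib.parse import quote, urlencode

# Assumption: host prefix under which the service is mounted; verify before use.
BASE_URL = "https://www.proteomicsdb.org"
# Service path as quoted in the text.
SERVICE_PATH = "/api_v2/api.xsodata"


def entity_url(entity, **options):
    """Build an OData v2 query URL for one entity.

    Keyword arguments are mapped to system query options:
    top=5 becomes $top=5, format="json" becomes $format=json.
    """
    url = f"{BASE_URL}{SERVICE_PATH}/{quote(entity)}"
    if options:
        params = {f"${name}": str(value) for name, value in options.items()}
        # Keep "$" readable and encode spaces as %20 rather than "+".
        url += "?" + urlencode(params, safe="$", quote_via=quote)
    return url


if __name__ == "__main__":
    # All samples, first five rows, as JSON.
    print(entity_url("Sample", format="json", top=5))
    # Hypothetical attribute name, shown only to illustrate $filter.
    print(entity_url("Sample", filter="SAMPLE_NAME eq 'liver'"))
```

The URL is built as a plain string so that it can be inspected before any request is sent; it can then be passed to any HTTP client.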
Figure 2.

APIv2.0. The tables of ProteomicsDB are grouped into topic clusters (e.g. Repository and Peptide identification data, see Figure 1 data layers). Each table is available in the API as a separate entity (square boxes). To navigate between entities within (dashed black arrows) or across (solid black arrows) topic clusters, corresponding navigation properties were defined that allow the traversal of the available data. Detailed documentation of the API is available online at https://www.proteomicsdb.org/vue/apiv2/.

A central objective of the APIv2.0 was that users can navigate from one entity to another. This was realized through ‘navigation properties’, which allow users to traverse between entities easily in multiple directions. For example, from the list of samples users can navigate to a list of all files that are connected to a given sample or to the respective experiment of that sample (Figure 2). This can be achieved by querying for ‘/api_v2/api.xsodata/Sample(ID)/File’ or ‘/api_v2/api.xsodata/Sample(ID)/Experiment’, respectively. This feature is available for all entities within a topic cluster and, where possible, across topic clusters. With this step, we simplify access and allow users to systematically query for data originally separated into multiple APIs. In accordance with the FAIR principles, all entities in ProteomicsDB come with a Global Unique Identifier (G_UID) that follows the format: PRDB_UID:PRDB:: . A detailed description of the APIv2.0 is available online (https://www.proteomicsdb.org/vue/apiv2/). Here, we list all available entities, their attributes (columns) and possible navigation properties to other entities. Additionally, each navigation property and entity listed also includes an example request. In order to find relevant entities and navigation properties, we implemented a search functionality that allows searching for any content listed in the API documentation (i.e. 
entities, attributes, navigation properties and examples). We are continuously working on extending ProteomicsDB and, because of this, the APIv2.0 will also be subject to changes, such as the addition of new navigation properties, entities and columns. The newly developed reference architecture for ProteomicsDB (14) enables versioning. As a result, currently available endpoints will remain available even in the rare event of modifications to the internal representation of the data. When using the APIv2.0, we recommend requesting only the necessary data, e.g. by using the filtering options of OData, to reduce the overall response time, as the largest table of ProteomicsDB exceeds 40 billion entries. The new API is a substantial improvement over the status quo and will enable scientists to benefit from the wealth of data stored in ProteomicsDB and to integrate data from ProteomicsDB more easily into their applications and databases.
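The navigation pattern and the advice to request only necessary data can be sketched together in a few lines of Python. The URL shape `Sample(ID)/File` follows the text, and `$top` and `$select` are standard OData Version 2.0 query options. The host prefix, the integer key and the column names `FILE_ID` and `FILE_NAME` are hypothetical placeholders and not confirmed attributes of the API.

```python
from urllib.parse import quote

# Assumption: the host prefix is illustrative; the service path is from the text.
SERVICE_ROOT = "https://www.proteomicsdb.org/api_v2/api.xsodata"


def navigation_url(entity, key, navigation_property, top=None, select=None):
    """Build a URL that follows a navigation property from one entity instance.

    `top` limits the number of returned rows and `select` restricts the
    returned columns, which keeps responses small for large tables.
    """
    path = f"{SERVICE_ROOT}/{quote(entity)}({key})/{quote(navigation_property)}"
    options = ["$format=json"]
    if top is not None:
        options.append(f"$top={int(top)}")
    if select:
        options.append("$select=" + ",".join(quote(column) for column in select))
    return path + "?" + "&".join(options)


if __name__ == "__main__":
    # Files attached to a sample with the (hypothetical) key 42.
    print(navigation_url("Sample", 42, "File"))
    # Same request, limited to ten rows and two hypothetical columns.
    print(navigation_url("Sample", 42, "File", top=10,
                         select=["FILE_ID", "FILE_NAME"]))
```

Limiting rows and columns in the request itself, rather than filtering after download, is what reduces response time against very large tables.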

Open-source ProteomicsDB frontend via reimplementation in Vue.JS

The current user interface (UI) of ProteomicsDB was built on an SAP-specific framework, termed SAPUI5. However, even its open-source variant, OpenUI5, is infrequently used in research. Due to this, developers in the field of life science research are unlikely to integrate or reuse the applications and visualizations developed for ProteomicsDB. Hence, open-sourcing the current UI is of little value to the scientific community. In accordance with our goal of turning ProteomicsDB into a FAIR resource, we set out to re-implement the UI of ProteomicsDB focusing on modularity, reusability and flexibility. The current version of the re-implementation (https://www.proteomicsdb.org/vue) covers all functionality required to browse and interact with the results stored for a single protein of interest as well as two analytics. We selected Vue.js (https://vuejs.org/) in combination with the Vuetify (https://vuetifyjs.com/en/) package as the new frontend framework. This decision was made for two reasons. First, Vue.js is intuitive and well documented, which is important for creating a maintainable and reusable UI. (External) developers interested in generating a new visualization will benefit from this in particular. Second, the component system (modularization) of Vue.js allows easy encapsulation of functionality and subsequent reuse of visualizations. In line with our goal to improve the FAIRness of ProteomicsDB, we decided to exploit this core feature of Vue.js and separate our new interface into three functional levels (Figure 1, top left). The package proteomicsdb-components (https://github.com/wilhelm-lab/proteomicsdb-components) provides the base functionality for the different visualizations used in ProteomicsDB. These components are agnostic to ProteomicsDB and thus can be reused on any website without specific dependencies and can be connected to any other source of data. 
The package proteomicsdb-wrappers provides wrappers (https://github.com/wilhelm-lab/proteomicsdb-wrappers) for these visualizations that request the data from ProteomicsDB. These wrappers can also be used on any website but will require a connection to ProteomicsDB. Last, these visualizations are combined into views in the package proteomicsdb-views (https://github.com/wilhelm-lab/proteomicsdb-views) that can be thought of as subpages in ProteomicsDB. All of these levels are publicly available on GitHub as separate repositories. We expect that this will further improve the findability and accessibility, but particularly the reusability, of the code base of ProteomicsDB. Each of the three repositories is identified by an individual Digital Object Identifier (DOI), while each version can be uniquely identified by the associated git commit hash. With the switch to Vue.js and the reimplementation necessary for that, we also decided to redesign the layout of ProteomicsDB to provide a more intuitive and modern-looking experience (Figure 3). The organism selection previously located on the left of the screen has moved to a drop-down menu at the top left, next to the ProteomicsDB logo. The main tabs that were previously at the top of the screen can now be accessed on the right side of the screen after clicking the three stacked horizontal bars (hamburger button) in the top right of the screen. Otherwise, they are hidden to dedicate a larger proportion of the screen to the current view. At the top center of the screen, a new universal search field serves as a direct entry point to all aspects of ProteomicsDB.
Figure 3.

Screenshot of the new vue-based protein summary page. The organism selection is located at the top left next to the ProteomicsDB logo. In the top middle, a new universal search field was added visible at all times. The hamburger button on the top right opens the main navigation panel of ProteomicsDB. On the left, the protein navigation panel is shown. The protein summary page shows general information about the selected protein as well as the sequence coverage and the expression of the protein for tissues and body fluids.

After searching for a gene of interest and selecting a specific protein/isoform, the UI changes and a second menu appears on the left. This menu shows the different navigation options to investigate, for example, the observed peptides or expression pattern. The blue bubbles indicate whether and how much data are available in this view, for example, 137 distinct peptides identified for protein EGFR (Figure 3). The views available here are largely identical to the old UI, but some slight adjustments were made. For example, the biochemical assay tab was split into three separate views that show the available binding data for different inhibitors, melting behavior and turnover data. In addition to the redesign of the UI, two new visualizations were created for ProteomicsDB. First, the Feature Viewer (Figure 4), which is a custom adjustment (https://github.com/wilhelm-lab/protvista-proteomicsdb) of protvista-uniprot (https://github.com/ebi-webcomponents/protvista-uniprot) that depicts primary (e.g. sequence coverage and conservation) and secondary (e.g. domains, solvent accessibility and disordered regions) structure information of the selected protein. The properties shown originate from internal data or external resources (19–22) and are shown as separate tracks. 
Each track can be expanded to reveal a more detailed view (Figure 4, secondary structure), while a specific region of one attribute can be selected to reveal additional information (Figure 4, gray popup on the domain FU 496–547). In addition, available 3D structures are retrieved from PDB (22) and listed. A single structure can be selected (Figure 4, bottom left table, yellow highlight) and interactively investigated (Figure 4, bottom right structure viewer). If present in the structure, regions selected in the attribute view are automatically highlighted in the structure (Figure 4, yellow region highlighted in red in the 3D structure).
Figure 4.

Protein Feature Viewer. This interactive visualization depicts different information about the primary and secondary structure of the protein in separate tracks. Each of these tracks can be expanded to reveal a more detailed view, exemplified by the expanded predicted secondary structure. Each region of a track can be selected to reveal additional information, exemplified for the Furin-like-repeats domain. In the bottom left, the table shows available 3D structures from PDB for this protein. The selected structure is shown in the bottom right and the selected region (yellow highlight) is marked in red in the protein structure.

The second example of a vastly improved visualization is the spectrum viewer (Figure 5) that is a modified version of the Universal Spectrum Explorer (23). It is accessible by selecting a specific peptide of interest in either the Peptide MS/MS or Reference Peptides view that show a table with the observed or synthetic/predicted reference peptides for the selected protein. As in the old version, every peptide spectrum match (PSM) stored in ProteomicsDB can be investigated here. Selecting a PSM (Figure 5, top left) fetches the associated spectrum. By default, a corresponding predicted reference spectrum is generated in real-time by Prosit and can be used to manually verify the correctness of the identification. In addition, reference spectra stored in ProteomicsDB from e.g. ProteomeTools (24) can be selected.
Figure 5.

Spectrum viewer. The spectrum viewer (bottom) visualizes the selected peptide spectrum match from the table in the top left. The configuration element on the top right can be used for, but is not limited to, retrieving reference spectra depicted in the mirror view at the bottom. Reference spectra can be generated in real-time by Prosit or requested from ProteomeTools. Between the experimental and reference spectrum, the alignment error between an observed and reference peak is shown in parts-per-million (ppm). The spectral similarity between the experimental and reference spectrum is measured by calculating the Pearson correlation (PCC) and normalized spectral contrast angle (SA). The measures inside the brackets show the result of this comparison when taking either the peaks of the experimental or reference spectrum into account, whereas the values outside the brackets show the measures calculated taking all peaks from both spectra into account.

The reimplementation of the UI in Vue.js will not only enable external developers to reuse views and visualizations developed for ProteomicsDB but also shows that external views can be reused in ProteomicsDB. The availability of the source code on GitHub also creates a communication channel with users and developers, who can report bugs and request new features, all supporting the FAIRification of ProteomicsDB.
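The similarity measures named in the caption of Figure 5 can be illustrated with a minimal Python sketch. The text does not spell out the formulas used by the viewer. The code therefore assumes the commonly used definitions: the ppm error relative to the reference m/z, the standard Pearson correlation, and a normalized spectral contrast angle computed as 1 − 2·arccos(cosine similarity)/π on matched intensity vectors. Peak matching between the two spectra is assumed to have been done beforehand.

```python
import math


def ppm_error(observed_mz, reference_mz):
    """Mass deviation of an observed peak from its reference peak in ppm."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def pearson(x, y):
    """Pearson correlation of two equally long intensity vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spectral_angle(x, y):
    """Normalized spectral contrast angle (assumed definition).

    Returns 1.0 for identical intensity patterns and 0.0 for
    orthogonal (non-overlapping) ones.
    """
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cosine = sum(a * b for a, b in zip(x, y)) / (norm_x * norm_y)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


if __name__ == "__main__":
    experimental = [0.10, 0.80, 0.30, 0.00]  # made-up matched intensities
    predicted = [0.12, 0.75, 0.35, 0.05]
    print(pearson(experimental, predicted))
    print(spectral_angle(experimental, predicted))
    print(ppm_error(500.001, 500.0))
```

Restricting the vectors to peaks present in only one of the two spectra, or using the union of both, corresponds to the bracketed and unbracketed values described in the caption.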

Increasing peptide and protein coverage by rescoring of FAIR data

Our recently described deep neural network Prosit was trained to predict the fragment intensities and retention times of peptides (15). Such predictions can be used to improve the separation between correct and incorrect matches of database search engine results (25). To achieve this, theoretical spectra of the proposed peptide sequences are predicted using Prosit and compared to the experimentally observed spectrum. Based on this, a variety of intensity-based scores are calculated. The results of such rescoring support the view that published datasets often contain more information than what was initially discovered (26) and that FAIR datasets are a rich resource for novel findings. Additionally, rescoring can be used to align and compare the results obtained from different database search engines (16). Considering the large amounts of data made available via ProteomicsDB, we decided to integrate the rescoring workflow directly into ProteomicsDB to enable the automatic re-processing of any FAIR dataset. The workflow (Figure 6A) can be triggered on datasets that have an associated ProteomeXchange (27) identifier. The associated raw mass spectrometry files are then automatically downloaded from PRIDE (28). Together with the reconstructed database search engine results from ProteomicsDB, a regular rescoring by Prosit is triggered. The percolator results are then imported back into ProteomicsDB. This does not overwrite any data of the original search results, and during false discovery rate (FDR) estimation either the original search engine scores or the intensity-based scores from Prosit are used.
Figure 6.

Integration of Prosit into ProteomicsDB. (A) Depiction of the workflow implemented to enable automatic rescoring of projects in ProteomicsDB. Raw mass spectrometry data are downloaded from PRIDE. The rescoring is performed on the database search results stored in ProteomicsDB by retrieving predictions from Prosit. The resulting scores are merged by percolator and imported into ProteomicsDB where the picked protein approach is used for FDR estimation. (B) The number of proteins (right) and peptides (left) identified with (blue) and without (red) rescoring at an estimated PSM, peptide and protein FDR of 1% for 30 tissues from Wang et al. (29). (C) Distribution of target and decoy Q-scores of proteins supported by peptide identifications for all mouse proteins in ProteomicsDB. The example highlights the q-value of the Pyruvate kinase PKM (P52480).

As a proof of principle, we rescored 30 tissues of the data published by Wang et al. (29) in which the proteomes and transcriptomes of healthy human tissues were characterized. When analyzing each tissue separately, on average 8289 (±1126 standard deviation, SD) proteins were identified without rescoring (Figure 6B). The rescoring approach identified on average 8788 (±1088 SD) proteins across the different tissues. This is equal to an average relative increase of 6%. The largest benefit we observed was for bone marrow with a relative increase of 13%. The data for the small intestine benefited least from the rescoring but still showed an increase in the number of identified proteins by ∼4%. The effect on peptide level was even more pronounced. The number of identified peptides increased on average by 16% from 71 631 (±22 216 SD) to 82 165 (±22 209 SD). The tissues that benefited the most and the least on peptide level were bone marrow and brain with an increase of 40% and 7%, respectively. 
The large effects seen in bone marrow on peptide and protein level are most likely due to the overall lower number of identifications in this tissue. The biggest relative effect was observed for tissues with the smallest number of identified peptides without rescoring. This is consistent with previous observations that rescoring is most beneficial when the identification rate is unexpectedly low, likely due to a strong overlap in target and decoy matches (15). In order to safely allow the combination of rescored and non-rescored data, we modified the FDR estimation procedure implemented in ProteomicsDB. As described earlier (30), we utilize Q-scores (-log10 q-values) in order to combine results from different result sets. Figure 6C shows the Q-score distribution of target and decoy proteins. Here, the mouse data were chosen because of their high ratio of rescored data. The high degree of overlap between the number of estimated false positives (decoys) and likely incorrect targets in the low-scoring region suggests that no bias is visible for proteins being supported by either rescored data or non-rescored data. This is further supported by the estimated distribution of true positives (target-decoy) that does not show any bimodality, suggesting that the decoy distribution accurately resembles the distribution of false matches in the target database. The systematic rescoring of datasets in ProteomicsDB is only possible due to resources such as PRIDE, which enable the findability, accessibility, interoperability and reusability of raw mass spectrometry files. With the full integration of the rescoring approach into ProteomicsDB, the number of peptides and the confidence in their identification can be increased. With the ever-growing amount of data available in ProteomicsDB, accurately assessing the confidence of peptide spectrum matches will remain a challenge that requires regular checks to assure high overall data quality.
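To make the relation between q-values and Q-scores concrete, the following Python sketch computes q-values with a generic target-decoy procedure and converts them to Q-scores as -log10 of the q-value. It is a simplified textbook illustration under stated assumptions and not the picked protein approach or the exact procedure implemented in ProteomicsDB. The `floor` parameter is an assumption added here to keep the logarithm finite when a q-value is zero.

```python
import math


def q_values(scores, is_decoy):
    """Generic target-decoy q-values, returned in input order.

    Higher scores are assumed to be better. The FDR at each rank is
    estimated as (#decoys) / (#targets) among all better-or-equal ranks;
    the q-value is the smallest FDR at this rank or any worse rank.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdrs = []
    targets = decoys = 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    qvals = [0.0] * len(scores)
    running_min = float("inf")
    # Walk from the worst rank to the best to enforce monotonicity.
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        qvals[order[rank]] = running_min
    return qvals


def q_score(q_value, floor=1e-10):
    """Q-score as -log10(q-value); `floor` (assumed) avoids log10(0)."""
    return -math.log10(max(q_value, floor))


if __name__ == "__main__":
    scores = [10.0, 9.0, 8.0, 7.0]          # made-up example scores
    decoy_flags = [False, False, True, False]
    for q in q_values(scores, decoy_flags):
        print(round(q, 3), round(q_score(q), 3))
```

Because a Q-score is derived from a q-value rather than from a raw search engine score, results scored on different scales, such as rescored and non-rescored data, can be placed on one common axis.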

Increasing the findability of aggregated data by ProteomicsDB

ProteomicsDB is the central point of access to aggregated information (e.g. protein expression) for a majority of its stored datasets and thereby fosters their FAIRness. Over the last two years, many additional datasets were added to ProteomicsDB (Figure 7). We imported proteomics data from 32 projects investigating different human biology (29,31–65) that represent data on 40 new tissues and cell lines. In total, over 57 million experimental spectra and >500 thousand quantitative data points were added to ProteomicsDB. Considering the large amount of data previously available in ProteomicsDB, the effect on the number of identified proteins and genes is nonetheless substantial, raising the confidence of 1281 protein isoforms and 878 genes to meet the <1% FDR criterion.
Figure 7.

New data added to ProteomicsDB. (A) Expression bodymap (left) of rice illustrated on the example for Phosphoglycerate kinase (A0A0P0WP33). The individual expression values are depicted in the barplot (right). (B) Venn diagram showing the overlap of human genes, for which proteomics, transcriptomics or biochemical assay data is available in ProteomicsDB. (C) Venn diagram showing the overlap of human tissues, cell lines and body fluids for which proteomics, transcriptomics or cell viability assay data are available in ProteomicsDB. (D) Barplot showing the increase in data across the depicted categories (y-axis) from 2019 to 2021.

In particular, the FAIRness of datasets reporting aggregated data beyond protein expression values (e.g. melting curves or dose–response curves) benefits from ProteomicsDB because even fewer resources exist for those. Most often such data are only available in the supplement of the original publication, hampering FAIRness. Recently, we added protein-drug binding data covering a new class of proteins, histone deacetylases (HDACs). The inhibition of HDACs has shown promise as a therapeutic option in oncology and other conditions such as Duchenne Muscular Dystrophy (66). We imported data for 53 HDAC inhibitors covering 14 target proteins, totaling 735 HDAC dose–response curves (67). Most notably, we extended ProteomicsDB to support the storage and visualization of data for a new organism, Oryza sativa ssp. Japonica (rice) (Figure 7A). All functionalities of ProteomicsDB readily transfer to new organisms. For example, the visualization of expression values on a ‘bodymap’ (Figure 7A) only requires the addition of a new organism visualization, while the data retrieval, mapping and coloring of tissues are implemented generically. The imported data cover 28 rice tissues. In total, >4 million experimental spectra were imported, resulting in the confident identification of close to 170 thousand distinct peptides, of which >150 thousand are unique on gene level.
Due to the imported data, 2621 of the 4051 annotated rice genes are confidently identified, resulting in a coverage of 64%. For protein isoforms, 13 742 of the 43 671 annotated were identified, resulting in an isoform coverage of 31%.
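The coverage figures above follow directly from the reported counts. The short sketch below reproduces the arithmetic; the function name `coverage_percent` is hypothetical, and the counts are taken verbatim from the text.

```python
def coverage_percent(identified, annotated):
    """Return the percentage of annotated entries that were identified."""
    if annotated <= 0:
        raise ValueError("annotated must be positive")
    return 100.0 * identified / annotated


if __name__ == "__main__":
    # Counts as reported in the text for rice genes and protein isoforms.
    gene_cov = coverage_percent(2621, 4051)
    isoform_cov = coverage_percent(13742, 43671)
    print(f"gene coverage: {gene_cov:.1f}%")
    print(f"isoform coverage: {isoform_cov:.1f}%")
```

The computed values are slightly above the whole-number percentages quoted in the text, which appear to be rounded down.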

FUTURE DIRECTIONS

The updates introduced over the last two years provide a solid foundation for turning ProteomicsDB into a FAIR resource for life science research. There are three specific objectives we aim to support with this. First, foster data re-use for wet- and dry-lab researchers and allow them to utilize and benefit from the wealth of data available. Second, share our efforts in developing modern and easy-to-use web applications. Third, switch the development of ProteomicsDB to a community-driven effort. For this purpose, we are also currently developing a community portal within ProteomicsDB to allow users to share and discuss ideas about new visualizations and features. At the time of writing, a direct line of communication between users and the current developers has been established via GitHub, where users can report discovered bugs or request new features. Ultimately, the availability of a comprehensive API and open-source UI may lead to external developers contributing novel tools and analytics to ProteomicsDB. The integration of Prosit into ProteomicsDB enables the rescoring of all data stored in ProteomicsDB. On individual datasets, we observed an average increase in the number of identified peptides by 16% and proteins by 6%. When performed on all data, this may increase the coverage of ProteomicsDB substantially and increase the quantitative precision by increasing the number of observed peptides used to quantify each protein. In addition, this allows us to combine multiple database search engine results across and within datasets and will eventually enable us to integrate the results of novel search engines. A strong focus of the next years will be on the finalization of the new interface, as well as the integration of substantially more data. In particular, the extension to support the storage, visualization and integration of data from experiments that investigated post-translational modifications will be of high priority.
For this, new views and visualizations are required, which can be developed much faster following the migration to the new reference architecture and Vue.js. We expect that the publicly available API and open-source implementation of the UI will facilitate the development of novel applications and analytics. We further envisage that ProteomicsDB can be made available as private instances for research institutions, consortia or individual labs.

DATA AVAILABILITY

ProteomicsDB is available at https://www.ProteomicsDB.org. Protvista-proteomicsdb is available at https://github.com/wilhelm-lab/protvista-proteomicsdb. Proteomicsdb-wrappers is available at https://github.com/wilhelm-lab/proteomicsdb-wrappers. Proteomicsdb-components is available at https://github.com/wilhelm-lab/proteomicsdb-components. Proteomicsdb-views is available at https://github.com/wilhelm-lab/proteomicsdb-views.
  61 in total

1.  Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning.

Authors:  Siegfried Gessulat; Tobias Schmidt; Daniel Paul Zolg; Patroklos Samaras; Karsten Schnatbaum; Johannes Zerweck; Tobias Knaute; Julia Rechenberger; Bernard Delanghe; Andreas Huhmer; Ulf Reimer; Hans-Christian Ehrlich; Stephan Aiche; Bernhard Kuster; Mathias Wilhelm
Journal:  Nat Methods       Date:  2019-05-27       Impact factor: 28.547

2.  Simultaneous dissection and comparison of IL-2 and IL-15 signaling pathways by global quantitative phosphoproteomics.

Authors:  Nerea Osinalde; Virginia Sanchez-Quiles; Vyacheslav Akimov; Barbara Guerra; Blagoy Blagoev; Irina Kratchmarova
Journal:  Proteomics       Date:  2014-09-30       Impact factor: 3.984

3.  Integrated omic analysis of lung cancer reveals metabolism proteome signatures with prognostic impact.

Authors:  Lei Li; Yuhong Wei; Christine To; Chang-Qi Zhu; Jiefei Tong; Nhu-An Pham; Paul Taylor; Vladimir Ignatchenko; Alex Ignatchenko; Wen Zhang; Dennis Wang; Naoki Yanagawa; Ming Li; Melania Pintilie; Geoffrey Liu; Lakshmi Muthuswamy; Frances A Shepherd; Ming Sound Tsao; Thomas Kislinger; Michael F Moran
Journal:  Nat Commun       Date:  2014-11-28       Impact factor: 14.919

4.  In-Depth Cerebrospinal Fluid Quantitative Proteome and Deglycoproteome Analysis: Presenting a Comprehensive Picture of Pathways and Processes Affected by Multiple Sclerosis.

Authors:  Ann Cathrine Kroksveen; Astrid Guldbrandsen; Marc Vaudel; Ragnhild Reehorst Lereim; Harald Barsnes; Kjell-Morten Myhr; Øivind Torkildsen; Frode S Berven
Journal:  J Proteome Res       Date:  2016-10-26       Impact factor: 4.466

5.  Building ProteomeTools based on a complete synthetic human proteome.

Authors:  Daniel P Zolg; Mathias Wilhelm; Karsten Schnatbaum; Johannes Zerweck; Tobias Knaute; Bernard Delanghe; Derek J Bailey; Siegfried Gessulat; Hans-Christian Ehrlich; Maximilian Weininger; Peng Yu; Judith Schlegl; Karl Kramer; Tobias Schmidt; Ulrike Kusebauch; Eric W Deutsch; Ruedi Aebersold; Robert L Moritz; Holger Wenschuh; Thomas Moehring; Stephan Aiche; Andreas Huhmer; Ulf Reimer; Bernhard Kuster
Journal:  Nat Methods       Date:  2017-01-30       Impact factor: 28.547

6.  Defining the phospho-adhesome through the phosphoproteomic analysis of integrin signalling.

Authors:  Joseph Robertson; Guillaume Jacquemet; Adam Byron; Matthew C Jones; Stacey Warwood; Julian N Selley; David Knight; Jonathan D Humphries; Martin J Humphries
Journal:  Nat Commun       Date:  2015-02-13       Impact factor: 14.919

7.  Proteomic maps of breast cancer subtypes.

Authors:  Stefka Tyanova; Reidar Albrechtsen; Pauliina Kronqvist; Juergen Cox; Matthias Mann; Tamar Geiger
Journal:  Nat Commun       Date:  2016-01-04       Impact factor: 14.919

8.  The ProteomeXchange consortium in 2020: enabling 'big data' approaches in proteomics.

Authors:  Eric W Deutsch; Nuno Bandeira; Vagisha Sharma; Yasset Perez-Riverol; Jeremy J Carver; Deepti J Kundu; David García-Seisdedos; Andrew F Jarnuczak; Suresh Hewapathirana; Benjamin S Pullman; Julie Wertz; Zhi Sun; Shin Kawano; Shujiro Okuda; Yu Watanabe; Henning Hermjakob; Brendan MacLean; Michael J MacCoss; Yunping Zhu; Yasushi Ishihama; Juan A Vizcaíno
Journal:  Nucleic Acids Res       Date:  2020-01-08       Impact factor: 16.971

9.  The FAIR Guiding Principles for scientific data management and stewardship.

Authors:  Mark D Wilkinson; Michel Dumontier; I Jsbrand Jan Aalbersberg; Gabrielle Appleton; Myles Axton; Arie Baak; Niklas Blomberg; Jan-Willem Boiten; Luiz Bonino da Silva Santos; Philip E Bourne; Jildau Bouwman; Anthony J Brookes; Tim Clark; Mercè Crosas; Ingrid Dillo; Olivier Dumon; Scott Edmunds; Chris T Evelo; Richard Finkers; Alejandra Gonzalez-Beltran; Alasdair J G Gray; Paul Groth; Carole Goble; Jeffrey S Grethe; Jaap Heringa; Peter A C 't Hoen; Rob Hooft; Tobias Kuhn; Ruben Kok; Joost Kok; Scott J Lusher; Maryann E Martone; Albert Mons; Abel L Packer; Bengt Persson; Philippe Rocca-Serra; Marco Roos; Rene van Schaik; Susanna-Assunta Sansone; Erik Schultes; Thierry Sengstag; Ted Slater; George Strawn; Morris A Swertz; Mark Thompson; Johan van der Lei; Erik van Mulligen; Jan Velterop; Andra Waagmeester; Peter Wittenburg; Katherine Wolstencroft; Jun Zhao; Barend Mons
Journal:  Sci Data       Date:  2016-03-15       Impact factor: 6.444

10.  RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences.

Authors:  Stephen K Burley; Charmi Bhikadiya; Chunxiao Bi; Sebastian Bittrich; Li Chen; Gregg V Crichlow; Cole H Christie; Kenneth Dalenberg; Luigi Di Costanzo; Jose M Duarte; Shuchismita Dutta; Zukang Feng; Sai Ganesan; David S Goodsell; Sutapa Ghosh; Rachel Kramer Green; Vladimir Guranović; Dmytro Guzenko; Brian P Hudson; Catherine L Lawson; Yuhe Liang; Robert Lowe; Harry Namkoong; Ezra Peisach; Irina Persikova; Chris Randle; Alexander Rose; Yana Rose; Andrej Sali; Joan Segura; Monica Sekharan; Chenghua Shao; Yi-Ping Tao; Maria Voigt; John D Westbrook; Jasmine Y Young; Christine Zardecki; Marina Zhuravleva
Journal:  Nucleic Acids Res       Date:  2021-01-08       Impact factor: 16.971
