
IsoProt: A Complete and Reproducible Workflow To Analyze iTRAQ/TMT Experiments.

Johannes Griss1,2, Goran Vinterhalter3, Veit Schwämmle4.   

Abstract

Reproducibility has become a major concern in biomedical research. In proteomics, bioinformatic workflows can quickly consist of multiple software tools each with its own set of parameters. Their usage involves the definition of often hundreds of parameters as well as data operations to ensure tool interoperability. Hence, a manuscript's methods section is often insufficient to completely describe and reproduce a data analysis workflow. Here we present IsoProt: A complete and reproducible bioinformatic workflow deployed on a portable container environment to analyze data from isobarically labeled, quantitative proteomics experiments. The workflow uses only open source tools and provides a user-friendly and interactive browser interface to configure and execute the different operations. Once the workflow is executed, the results including the R code to perform statistical analyses can be downloaded as an HTML document providing a complete record of the performed analyses. IsoProt therefore represents a reproducible bioinformatics workflow that will yield identical results on any computer platform.


Keywords:  Docker; Jupyter; ProtProtocols; TMT; bioinformatics; iTRAQ; isobaric labeling; protocol; reproducibility; workflow

Year:  2019        PMID: 30855969      PMCID: PMC6456869          DOI: 10.1021/acs.jproteome.8b00968

Source DB:  PubMed          Journal:  J Proteome Res        ISSN: 1535-3893            Impact factor:   4.466


Introduction

Lack of reproducibility in general, and in bioinformatics workflows specifically, is a growing concern.[1] Bioinformatic workflows in proteomics experiments often consist of multiple software tools, each with its own set of parameters. Seemingly small changes to a workflow, such as altering details of the normalization method, can have dramatic effects on the final result. Because of the many steps and settings that make up a complex workflow, it is often impossible to fully document it in a research paper’s methods section. Additionally, finding and using the exact same software versions later on is often a major obstacle when replicating bioinformatic analyses. Older versions may no longer be compatible with the available operating system or may be altogether unavailable. Therefore, fully reproducible workflows should not only record the exact software versions and parameters but also preserve the specific software versions and ensure that they will produce the same results in different computing environments.

Several projects exist to create reproducible bioinformatic workflows. Biocontainers[2] provides Docker containers to make bioinformatic tools available in a standardized way. Docker containers are lightweight virtual machines that, in the case of Biocontainers, ensure that a given software version performs identically on any operating system supported by Docker. Therefore, users do not have to install any software but only download the respective container. Galaxy[3] is a web-based platform for biomedical research mainly focused on genomics. It contains thousands of tools that can be joined together to create workflows and also supports tools for proteomics analyses. KNIME (http://www.knime.com, KNIME AG) is another workflow software focused on data analysis in general. All OpenMS[4] nodes were recently integrated in KNIME, making it possible to build complete proteomics workflows with it. ProteomeDiscoverer (Thermo Fisher) is also a workflow system but specifically targets proteomics data analysis. Several academic research groups[4−6] are contributing to ProteomeDiscoverer, making it usable for a wide variety of proteomics workflows. Finally, to a certain extent MaxQuant[7] with Perseus[8] allows the user to create a complete analysis workflow in a single software.

Nevertheless, all of these existing solutions have shortcomings that prevent the creation of complete, reproducible workflows. Biocontainers is a platform to supply bioinformatic tools in a standardized fashion but has no functionality to combine these tools into workflows. KNIME and Galaxy are very powerful analysis platforms that can be adapted to a wide variety of data analysis problems. This functionality comes at the cost of high complexity, and many nonexpert users will find it difficult to adapt Galaxy and KNIME to their needs. Additionally, neither KNIME nor Galaxy contains methods to take a snapshot of the external tools used to actually process the data. ProteomeDiscoverer also depends on external nodes. Therefore, to fully replicate an existing workflow the user again has to take care of locating and installing the exact same versions of these nodes. Moreover, new ProteomeDiscoverer versions generally come with significant changes, which require nodes to be specifically developed for a given version. Nodes developed for one version of ProteomeDiscoverer are generally incompatible with newer ones. Therefore, none of these existing solutions fulfills all requirements of a completely reproducible workflow environment.

Isobaric labeling has become one of the most common methods for quantitative mass spectrometry based proteomics experiments. A major advantage is that it allows researchers to multiplex samples and thereby reduce instrument runtime and eliminate variability caused by the mass spectrometer itself. The two methods currently available for these experiments, tandem mass tag (TMT[9]) and multiplexed isobaric tagging technology for relative quantitation (iTRAQ[10]), differ essentially only in the reporter masses they generate and therefore do not require separate software tools. Even though isobaric labeling has become a standard method in many laboratories, dedicated, easy-to-use software solutions to analyze these data are still rare. This is particularly problematic when dealing with more complex experimental designs that include multiple runs on the mass spectrometer, such as multiple instances of differently labeled multiplexed samples. Existing dedicated software solutions, such as iQuant,[11] isobar,[12] MilQuant,[13] and IsobariQ,[14] all require identification results from specific search engines and do not support complex experimental designs with more than two treatment groups or samples split across multiple iTRAQ/TMT runs. Therefore, many research groups rely on unpublished in-house scripts to process their experiments, which greatly hampers reproducibility.

In an effort to simplify proteomics data analysis and provide fully reproducible data analysis workflows, we launched the ProtProtocols project (https://protprotocols.github.io) under the umbrella of the European Bioinformatics Community (EuBIC).[15] On the basis of the Biocontainers project,[2] the protocols are shipped in containerized Docker images that include all necessary software tools. Docker containers are lightweight virtual machines that encapsulate all the software required for the protocol to run. This ensures that the versions of all software used are linked to the protocol version and the user does not have to worry about installing any separate tools. Hence, 100% reproducibility can be achieved by using the same protocol version on any computer with a Docker environment. Here, we present IsoProt, which serves as a blueprint for the ProtProtocols concept. IsoProt is designed for the analysis of isobarically labeled experiments, which is one of the most commonly used methods for high-throughput proteomics. In addition to a user-friendly web interface, IsoProt provides accurate statistical analyses for a wide range of common experimental designs.

Experimental Procedures

Software Layout and Implementation

General Implementation

All software was installed in a Docker image to ensure full reproducibility on each computer system supported by Docker. To simplify the installation and usage of our protocols, we created the free, open-source “ProtProtocol docker-launcher” (https://github.com/ProtProtocols/docker-launcher). It provides an easy-to-use graphical user interface that can automatically install the protocol (once Docker is installed) and launch the image. As it is written in Java, it supports the major operating systems Windows, Mac OSX, and Linux. Therefore, many technical difficulties surrounding the use of Docker are hidden from the user. Detailed instructions on how to install and use all tools, as well as how to extend ProtProtocols, can be found at https://protprotocols.github.io/documentation/. The complete protocol is run through a Jupyter notebook (http://jupyter.org) corresponding to one web page in the browser. All relevant parameters can be set through common graphical user elements created through Jupyter widgets. Therefore, the user interface is highly similar to most available search engines. The complete source code as well as additional documentation of the protocol is freely available through https://protprotocols.github.io.

Proteomics Software

IsoProt handles the entire analysis pipeline from mass spectra given as peak lists to the set of differentially regulated proteins (Figure 1A). We used SearchGUI[16] and PeptideShaker[17] to perform peptide identification and validation, with MS-GF+[18] as the database search engine. Proteins are summarized and quantified by R scripts based on the MSnbase R library.[19] R scripts furthermore generate figures for quality control and perform statistical tests (limma library[20]) according to the experimental design.
Figure 1

(A) Scheme of the entire workflow including operations (ellipsoids) and data given by type and format (squares). The annotation form and terms of the workflow largely follow the EDAM ontology.[21] (B, C) Experimental designs and organization in folder structure for analysis in IsoProt.


Input Files and Parameters

Input Files

The only files required for the analysis are mass spectra as peak lists (MGF format) and a FASTA file containing the protein sequences; we recommend the UniProt version of the FASTA format. Databases can already contain decoy sequences (following the SearchGUI instructions, http://compomics.github.io/projects/searchgui.html); otherwise, the decoy database is created automatically. The files can be copied into the Docker file structure or directly mirrored onto the /data folder, as is done automatically by our docker-launcher application.

Analysis Parameters

All parameters required for the data analysis can be changed through a graphical user interface integrated into the Jupyter notebook. In the first section, the user has to set database search related parameters such as precursor and fragment ion tolerance, the FASTA sequence database to use, the labeling agent used, and the fixed and variable modifications to consider. On the basis of the selected labeling method and detected folder structure, the interface to enter the experimental design is generated. The protocol currently supports two setups: (1) all MGF files are placed in the input directory and are part of the same (fractionated) run (Figure 1B) or (2) MGF files from different runs are organized by placing them in different subdirectories (Figure 1C). Next, the experimental design user interface allows the user to enter names for the sample groups (for example “treatment” and “control”) and names for the samples (one name per channel and subdirectory) and to assign each sample to one of the groups. Most importantly, the protocol supports up to 20 sample groups and can thereby model complex experimental designs. Finally, the user is asked to enter parameters related to the analysis of the quantitative data. Once all required information is entered, the search and analysis are directly controlled through buttons in the user interface.
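The two supported folder layouts can be illustrated with a short Python sketch. This is not IsoProt's own code: the function name `detect_runs` and the use of "." as the label for a single run are assumptions made purely for illustration.

```python
from pathlib import Path
from typing import Dict, List


def _mgf_files(folder: Path) -> List[str]:
    """Return the sorted names of all MGF peak list files directly in `folder`."""
    return sorted(p.name for p in folder.iterdir()
                  if p.is_file() and p.suffix.lower() == ".mgf")


def detect_runs(input_dir: str) -> Dict[str, List[str]]:
    """Group MGF files into runs based on the folder layout (hypothetical helper).

    Setup 1: MGF files sit directly in `input_dir` -> one (fractionated) run.
    Setup 2: MGF files sit in subdirectories -> one run per subdirectory.
    """
    root = Path(input_dir)
    top_level = _mgf_files(root)
    if top_level:
        # Setup 1: everything belongs to the same run, labeled "." here.
        return {".": top_level}
    runs: Dict[str, List[str]] = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        files = _mgf_files(sub)
        if files:  # ignore subdirectories without peak lists
            runs[sub.name] = files
    return runs
```

With such a mapping, an experimental design form would need one sample name per channel for every key in the returned dictionary.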

Output Files and Quality Control

IsoProt provides figures and tables for the different steps of the analysis including peptide identifications, quantitative values of peptide-spectrum matches (PSMs), and proteins as well as a table for the statistical results from the significance analysis. Visual measures for quality control were implemented as R scripts and include total intensities of the reporter ion channels for each sample, violin plots at different stages of the analysis, principal component analysis, and volcano plots (Figure 2).
Figure 2

Examples of the visualization and diagnostic plots created by IsoProt based on the shipped example data. (A) The mass accuracy of all reporter ions is presented as a histogram. (B) Correlation of reporter intensities for all channels. (C) Distribution of estimated abundances on the spectrum level for all channels. (D) Distribution of protein abundances for all samples. (E) Principal component analysis of all samples based on the aggregated data highlighting the treatment groups. (F) Volcano plot for a quick visualization of quantitative data and statistical results.


Test Data Sets

To evaluate the performance of our analysis workflow, we processed the data from three publicly available data sets using the same search parameters as in the original studies. We downloaded the respective RAW files from PRIDE Archive[22] and converted them into the MGF file format using ProteoWizard’s msconvert tool[23] when no MGF peak list files were available.

Benchmark Data Set

D’Angelo et al. recently published a TMT benchmark data set containing an experiment where 12 human proteins were spiked into an Escherichia coli background[24] at various concentrations (PRIDE Archive identifier PXD005486). D’Angelo et al. used this data set to assess the number of proteins that were incorrectly identified as being regulated. As every protein was added at varying concentrations among the samples, a standard statistical analysis of the spiked-in proteins was not possible. Therefore, our analysis focuses on the accuracy of the derived quantitative estimates for the spiked proteins and the (unchanged) background E. coli proteins. The complete analysis was performed using IsoProt version 0.2. Spectra were identified using MS-GF+[18] through SearchGUI version 3.3.3.[16] The precursor tolerance was set to 20 ppm and the fragment tolerance to 0.03 Da. One missed cleavage was allowed. Carbamidomethylation, TMT 10-plex of K, and TMT 10-plex of peptide N-term were set as fixed modifications. Oxidation of M was set as variable modification. PSMs were filtered at a target false discovery rate (FDR) of 0.01 using the target-decoy approach. UniProt E. coli sequences (version August 2018) and the spiked human protein sequences, also from UniProt, were used for spectra identification. Quantitative analysis was done using the R Bioconductor package MSnbase version 2.7.1.[19] Protein summarization was performed using the “medpolish” method as implemented by MSnbase. Modified peptides were not used for quantitation. Only proteins with at least two identified peptides were accepted for further analysis. Differential expression was assessed using the R Bioconductor package limma version 3.34.[20]
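To make the “medpolish” summarization more concrete, the following Python sketch implements Tukey's median polish on a small table of log intensities (rows are PSMs of one protein, columns are reporter channels). It is a generic illustration and not the MSnbase implementation; the function names are hypothetical, and reading the protein profile as the overall effect plus the column effects is an assumption.

```python
from statistics import median
from typing import List, Tuple


def median_polish(table: List[List[float]], max_iter: int = 10,
                  tol: float = 1e-8) -> Tuple[float, List[float], List[float], List[List[float]]]:
    """Fit table[i][j] ~ overall + row[i] + col[j] + residual by alternating medians."""
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(r) for r in table]
    overall = 0.0
    row = [0.0] * n_rows
    col = [0.0] * n_cols
    for _ in range(max_iter):
        # Row sweep: move each row's median out of the residuals.
        row_delta = [median(r) for r in resid]
        for i in range(n_rows):
            row[i] += row_delta[i]
            resid[i] = [v - row_delta[i] for v in resid[i]]
        shift = median(col)
        col = [c - shift for c in col]
        overall += shift
        # Column sweep: move each column's median out of the residuals.
        col_delta = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col[j] += col_delta[j]
            for i in range(n_rows):
                resid[i][j] -= col_delta[j]
        shift = median(row)
        row = [r - shift for r in row]
        overall += shift
        # Stop once a full sweep no longer changes anything.
        if max(abs(d) for d in row_delta + col_delta) < tol:
            break
    return overall, row, col, resid


def protein_profile(table: List[List[float]]) -> List[float]:
    """Per-channel protein abundance, taken here as overall + column effect."""
    overall, _, col, _ = median_polish(table)
    return [overall + c for c in col]
```

Because medians are used in both sweeps, a single PSM with an aberrant profile ends up in the residuals instead of distorting the protein-level estimate.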

Cerebral Malaria Pathogenesis

The study uses TMT6 labeling to compare blood from mice at different stages of cerebral malaria (d3, ECM) with blood from noninfected mice (NI).[25] Four replicates of each of the three sample types were arranged in two TMT6 sets and run separately. This corresponds to a case similar to the one in Figure 1C, here with three conditions distributed over two separate runs on the mass spectrometer. Peak list data files (MGF file format) were downloaded from PRIDE Archive (PXD003772). The analysis was again performed using IsoProt version 0.2 (see above) with the precursor tolerance set to 10 ppm and the fragment tolerance to 0.05 Da. One missed cleavage was allowed. Carbamidomethylation, TMT 6-plex of K, and TMT 6-plex of peptide N-term were set as fixed modifications. Oxidation of M was set as variable modification. PSMs were filtered at a target FDR of 0.01 using the target-decoy approach. SwissProt sequences from mouse (January 2018) were used for spectra identification. Only proteins with at least two identified peptides were accepted for further analysis.
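Since the precursor tolerance is given in ppm while the fragment tolerance is given in Da, a small Python sketch may help to relate the two units. The helper names are hypothetical and the m/z values are arbitrary illustration values, not taken from the data sets.

```python
def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Relative mass error in parts per million (ppm)."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz: float, theoretical_mz: float, tol_ppm: float) -> bool:
    """True if the observed m/z lies within +/- tol_ppm of the theoretical m/z."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm


def ppm_to_da(mz: float, tol_ppm: float) -> float:
    """Absolute width (in m/z units) of a ppm tolerance at a given m/z."""
    return mz * tol_ppm / 1e6


# A 10 ppm window is narrower for light ions than for heavy ones,
# whereas a tolerance in Da has the same width at every m/z.
for mz in (400.0, 1000.0, 1600.0):
    print(f"m/z {mz:7.1f}: 10 ppm corresponds to +/- {ppm_to_da(mz, 10.0):.4f}")
```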

Nonmuscle Invasive and Muscle-Invasive Bladder Cancer

The study compares tumor tissue samples from nonmuscle invasive and muscle-invasive bladder cancer.[26] MGF files were downloaded from PRIDE Archive (PXD002170). The analysis was again performed using IsoProt version 0.2 (see above) with the precursor tolerance set to 10 ppm and the fragment tolerance to 0.05 Da. One missed cleavage was allowed. Carbamidomethylation, iTRAQ 8-plex of K, iTRAQ 8-plex of Y, and iTRAQ 8-plex of peptide N-term were set as fixed modifications. Oxidation of M was set as variable modification. PSMs were filtered at a target FDR of 0.01 using the target-decoy approach. SwissProt sequences from human (January 2017) were used for spectra identification. Only proteins with at least two identified peptides were accepted for further analysis.
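The target-decoy filtering used for all three data sets can be sketched in a few lines of Python. This is the generic textbook form of the approach (decoy hits divided by target hits above a score cutoff, converted to q-values) and is an assumption for illustration; it is not the validation procedure implemented in PeptideShaker.

```python
from typing import List, Tuple

PSM = Tuple[float, bool]  # (search engine score, is_decoy); higher score = better


def filter_psms(psms: List[PSM], fdr_threshold: float = 0.01) -> List[PSM]:
    """Keep target PSMs whose q-value is at or below `fdr_threshold`."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    fdrs: List[float] = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among all PSMs scoring at least as well as this one.
        fdrs.append(decoys / max(targets, 1))
    # q-value: the lowest FDR at which a PSM would still be accepted.
    qvalues = fdrs[:]
    for i in range(len(qvalues) - 2, -1, -1):
        qvalues[i] = min(qvalues[i], qvalues[i + 1])
    return [psm for psm, q in zip(ranked, qvalues)
            if not psm[1] and q <= fdr_threshold]
```

In this toy form, the first decoy hit in the ranked list already pushes the estimated FDR above 0.01 unless at least 100 target PSMs score better, which is why such a strict threshold requires a reasonably large number of identifications.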

Results

IsoProt allows users to run the full data analysis of iTRAQ/TMT experiments in a straightforward and reproducible way. The protocol supports different experimental designs including multiple runs on the mass spectrometer and multiple differently labeled samples. Additionally, the open layout of the protocol allows complex adjustments and modifications at all stages of the workflow.

A Fully Reproducible Environment

The protocol can be run on any computer with a functional Docker environment by just downloading and running the available Docker image. This is fully automated through our “ProtProtocol docker-launcher” tool (https://github.com/ProtProtocols/docker-launcher). Hence, the protocol avoids platform- and operating system-specific installation issues and provides identical results independent of the operating system, its configuration, and the computer hardware. Every IsoProt release has a stable version number that points to a specific Docker image. Therefore, by citing the IsoProt version number used, it will always be possible to exactly restore the analysis environment, including the versions of all software tools. Once the protocol has been executed, it is possible to save it, including all generated figures, as a standard HTML page. Therefore, the complete analysis workflow can easily be made available, for example, at the time of review, and be viewed with a standard web browser. Additionally, all user-entered parameters are stored in text files next to the analysis results, which can easily be reused for future projects (see https://protprotocols.github.io/documentation/isoprot/save_analysis for details). For an overview of the visualizations, see Figure 2.

Simple Example Workflow

IsoProt can be tested using an example data set that is small enough to run in under 10 min on a standard computer. The data set is part of the IsoProt Docker image, and the necessary parameter settings are preloaded when starting IsoProt. The database search via SearchGUI and validation via PeptideShaker result in a tab-delimited file containing detailed information on all PSMs. Search and output parameters are automatically saved for future reference. Additionally, a “methods” section is generated that can be included in a manuscript and describes all settings used.

Each spectrum file is processed separately to match and quantify PSMs that passed the identification FDR (default 0.01). The mass distribution of all matched fragment ions allows the user to check for critical channels with inefficient labeling (Figure 2A). All PSM quantifications are saved in a separate file (AllQuantPSMs.csv). The outputs of all files of each run on the mass spectrometer are merged, normalized, and visualized for quality control. Violin plots of normalized PSM intensities compare the intensity distributions (Figure 2C). Channels with different distributions can identify problematic samples or changes within the entire proteome. Six different histograms counting PSM, peptide, protein, and protein group numbers allow the user to assess protein coverage and uniqueness based on the available mass spectra. Similarity between samples is assessed through scatter plots comparing all quantified spectra from all ion channels (Figure 2B).

Using the default parameters, the PSMs are summarized to proteins using median summarization after outlier removal, requiring a minimum of 1 PSM per protein. In addition, the protocol supports iPQF,[27] mean expression, median expression (without outlier removal), and robust summarization as methods. A violin plot of protein ratios versus the mean of all channels shows whether the analyzed samples exhibit similar distributions on the protein level (Figure 2D). Quantifications from different runs (only one in the example) are merged and submitted to a principal component analysis (Figure 2E). This places all samples in a two-dimensional space and color codes different treatment groups. Studies where the samples of the different types are not placed as distinguishable groups are unlikely to provide differentially regulated proteins. Additionally, potential systematic biases can quickly be discovered using this plot.

The example set quantified a total of 221 protein groups. limma statistical tests did not find any regulated proteins with FDR < 0.05, which is in agreement with the original results. p-Values and false discovery rates (p-values corrected for multiple testing) are visualized in histograms, volcano plots (Figure 2F), and a figure counting the number of differentially regulated proteins over a range of FDRs. The latter can be used to identify a suitable combination of the confidence threshold and the number of significant proteins. It is advised to keep FDR < 0.1, as the number of false positives becomes critically high otherwise.
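The relation between p-values, corrected false discovery rates, and the number of significant proteins at different thresholds can be sketched as follows. The sketch assumes the Benjamini-Hochberg procedure, which is a common choice but is not explicitly named in the text; function names and example p-values are hypothetical.

```python
from typing import Dict, List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_significant(adjusted: List[float], thresholds: List[float]) -> Dict[float, int]:
    """Number of proteins with an adjusted p-value below each FDR threshold."""
    return {t: sum(q < t for q in adjusted) for t in thresholds}


pvals = [0.01, 0.04, 0.03, 0.5]  # made-up p-values for four proteins
fdr = benjamini_hochberg(pvals)
print(count_significant(fdr, [0.01, 0.05, 0.1]))
```

Plotting such counts against the threshold gives the kind of curve described above, from which a trade-off between confidence and the number of reported proteins can be read off.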

Performance Tests by Reprocessing Public Data

D’Angelo et al. performed a comparison of different approaches to analyze TMT data sets.[24] In their first data set, the authors spiked different concentrations of 12 human proteins into an E. coli background. They used this data set to assess the type-I error as the number of false positive proteins. Similar to the original study, we assigned the first five channels to one treatment group, and the second five channels to the second group. As expected, no proteins were identified as being significantly regulated. The estimated log-fold changes of the E. coli background proteins were all close to 0 (Figure 3A).
Figure 3

(A) Log-fold changes of the E. coli background proteins. This represents the expression of background proteins based on a comparison of the first five channels against the other five, similar to the approach by D’Angelo et al. As expected, the estimated log-fold changes are all closely centered around 0. (B) Observed bias and RMSE of estimated fold-changes of the D’Angelo et al. benchmark data set from our pipeline and the best-performing pipeline published by the authors. (C) Variation of ground truth proteins at different spike-in levels. Imputation by lowest value (green) leads to increased variation compared to no imputation (red), except for the lowest level.

Proteins were spiked twice using the same concentration in different channels and only once for the two highest concentrations. Therefore, only a single measurement, or at most two replicate measurements, are available when comparing two concentrations. Since this setup prevents a standard statistical evaluation, we focused on the accuracy of the estimated fold changes using the same error measurements as in the original manuscript. Specifically, we assessed the accuracy of our fold change estimates using the bias and the root-mean-square error (RMSE). Across most spiked fold-changes, we observed a comparable bias and RMSE (Figure 3B). For the highest spiked fold-change, we observed slightly higher average error rates than D’Angelo et al. This is most likely caused by the fact that D’Angelo et al. imputed missing values by taking the lowest observed intensity of the given PSM across all samples. Thereby, missing values were automatically interpreted as very low expression. As expected, the measured abundance of the lowest concentrations showed larger variation with several missing values. In our approach, these missing values were ignored, thus leading to less stable average fold-changes. Imputing missing values as D’Angelo et al. did naturally reduces this variation, leading to reduced error rates. However, when we, for example, estimated the fold change of the two highest protein concentrations (also a fold change of 2), the bias was 0 with an RMSE of 0.2, improving the error rates dramatically.

While D’Angelo et al.’s imputation approach is valid if values can be expected to be missing not at random (i.e., because of a concentration below the limit of detection), it is not valid for values missing (completely) at random (i.e., because of inefficient labeling).[28] Therefore, for the spiked proteins D’Angelo et al.’s approach should only have been applied to cases where the lowest amount of proteins was spiked. Since it is generally unknown why a value is missing in actual experiments, our pipeline does not use any imputation. Limma’s model treats these values as “missing at random”, which we feel is more appropriate for most biological studies. To estimate the effect of these different approaches, we calculated the variance of the spiked-in proteins as the sum of the absolute differences between the duplicate measurements. Independent of the protein summarization method used, imputing missing values increased the variance of all but the duplicates with zero concentration of the respective proteins (Figure 3C). In our opinion, this highlights the downside of using “blind” imputation for all missing values, as this can result in increased noise levels or bias in the data set. The complete output of our pipeline can be found in Supplementary File 1.
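As a minimal illustration of the two error measures, the following Python sketch computes bias and RMSE for a set of estimated log2 fold changes against the expected (spiked) value, skipping missing estimates instead of imputing them. The exact definitions used by D’Angelo et al. are assumed here to be the standard ones, and the numbers are invented for illustration.

```python
from math import log2, sqrt
from typing import List, Optional, Tuple


def bias_and_rmse(estimated: List[Optional[float]], expected: float) -> Tuple[float, float]:
    """Bias (mean error) and RMSE of estimates; None marks a missing value."""
    errors = [e - expected for e in estimated if e is not None]
    if not errors:
        raise ValueError("no non-missing estimates available")
    bias = sum(errors) / len(errors)
    rmse = sqrt(sum(err * err for err in errors) / len(errors))
    return bias, rmse


# Invented example: four proteins spiked at a fold change of 2 (log2 = 1),
# one of which could not be quantified in one of the channels.
estimates = [0.8, 1.2, None, 1.0]
bias, rmse = bias_and_rmse(estimates, expected=log2(2))
print(f"bias = {bias:.3f}, RMSE = {rmse:.3f}")
```

A bias near 0 with a nonzero RMSE, as in this toy case, indicates estimates that scatter around the true value without a systematic shift.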

Cerebral Malaria Pathogenesis

The authors investigated differences in the plasma proteome between healthy and malaria-infected mice (two stages). The available two TMT 6plex sets were considered to contain independent samples. IsoProt quantified more protein groups (324 versus 289) when requiring a minimum of 2 unique PSMs and an identification FDR < 1%. For the further comparison, we restricted the IsoProt output to the uniquely identified 214 proteins (no peptides shared with other proteins). In the original study, statistical testing was carried out separately for the two TMT runs, yielding a total of 54 (more precisely 43 as 11 were detected in both runs) proteins found to be differentially regulated between Plasmodium berghei ANKA (PbA)-infected (d8 postinfection, labeled ECM) and noninfected (labeled NI) mice (Mann–Whitney U test, p ≤ 0.001). Since the authors did not correct p-values for multiple testing, these results cannot be considered significant. We found a total of 41 differentially regulated proteins (FDR < 0.01) and an overlap of only 20 proteins with the original study. Given the different statistical procedures, we analyzed all proteins that were found differentially regulated by either one of the methods. All but four proteins found differentially regulated in the original study were quantified by IsoProt and showed similar abundances in both analyses (Figure 4A). Proteins only deemed significant in the original study were not found significant by IsoProt mostly due to low fold-changes (Figure 4B).
Figure 4

(A) Comparison of fold-changes of proteins differentially regulated in the original study (A1) and IsoProt results (A2). Proteins found differentially regulated in the original study were labeled red. (B) Abundance profiles of Retinol-binding protein 4 (Q00724) (B1) and disulfide-isomerase (P09103) (B2).

We further investigated the two proteins that differed most between the two types of analyses. Retinol-binding protein 4 (Q00724) was the protein with the lowest FDR among the proteins found differentially regulated only by IsoProt. Figure 4C shows the PSM measurements of this protein for the two TMT runs (scaled for better comparison). Summarized protein abundances (thick lines) by median summarization with outlier removal show that the PSMs of peptides with less differential behavior were removed. By merging the observations of the two TMT runs, IsoProt increases its statistical power and thus provides evidence for regulatory behavior of this protein. On the other hand, protein disulfide-isomerase (P09103) was the protein with the highest FDR (least significant) in IsoProt that was found significantly regulated in the original study (TMT-1, Figure 4D). Given high abundances in only one of the two ECM replicates in TMT-1, manual interpretation would discard this protein from being regulated (Figure 4D). The PSMs measured in the second run (TMT-2) confirm this observation. The complete output of our pipeline can be found in Supplementary File 1.

Nonmuscle Invasive and Muscle-Invasive Bladder Cancer

IsoProt quantified 1145 protein groups when restricting to a minimum of 2 unique peptides and 1% FDR, compared to 1092 in the original study (minimum of 2 peptides, Occam’s razor principle for peptide inference and 1% FDR). Both analyses had an overlap of 662 proteins. Although only the bioinformatics workflows differed, the mean log-fold changes of proteins between the two cancer subtypes were very different (Figure 5A, Pearson’s correlation of 0.78).
Figure 5

(A) Comparison of log-ratios between IsoProt output and original study. Pearson’s correlation between both quantifications: 0.79. (B–C) Volcano plots for results from statistical testing in the original study (B) and in IsoProt (C). Colored points correspond to proteins with an (uncorrected) p-value below 5% in the other study, respectively. (D–E) Distribution of relative protein abundances in original study (D) and IsoProt (E).

IsoProt found one differentially regulated protein (15-hydroxyprostaglandin dehydrogenase, FDR < 0.01) after correction for multiple testing, which was not carried out in the original study. In order to allow a comparison of both results, we therefore also used uncorrected p-values for the following analysis. This is not recommended, as it is prone to greatly overestimate the number of regulated proteins. When comparing these uncorrected p-values, the majority of “significant” proteins were different between the two studies (Figure 5B,C, colored points indicate p < 0.05 in the other respective study). This striking difference in the statistical results is due to the different normalization approaches used. Their effect can be seen in the distribution of protein abundances (Figure 5D,E). The authors of the original study normalized the ratios between cancer subtypes after protein summarization and averaging of replicates. The more common and in our opinion correct approach is to normalize the different channels (i.e., individual samples) on the (measured) PSM or (aggregated) peptide level prior to the aggregated analysis of these measurements on the protein level and, most importantly, prior to merging any independent (i.e., replicate) measurements. Strong deviations of individual channels which are visible on the peptide level were thus discarded in the original study. The complete output of our pipeline can be found in Supplementary File 1.
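The idea of normalizing individual channels on the PSM level before any summarization can be sketched as follows. The sketch uses a simple median centering of log intensities per channel; this particular method is an assumption chosen for illustration and is not necessarily the normalization applied by IsoProt. The function name and the values are hypothetical.

```python
from statistics import median
from typing import List, Optional

Row = List[Optional[float]]  # log intensities of one PSM across channels


def normalize_channels(log_intensities: List[Row]) -> List[Row]:
    """Center every channel (column) at a median of 0; None marks missing values."""
    n_channels = len(log_intensities[0])
    # One median per channel, computed over all PSMs with a measured value.
    medians = [median(row[j] for row in log_intensities if row[j] is not None)
               for j in range(n_channels)]
    return [[None if v is None else v - medians[j] for j, v in enumerate(row)]
            for row in log_intensities]


# Invented example: channel 2 was loaded with more material than channel 1.
psms = [[10.0, 11.0], [12.0, 13.0], [14.0, None]]
print(normalize_channels(psms))
```

Because the correction is applied per channel and before PSMs are combined into proteins, a channel that deviates as a whole is adjusted without averaging its deviation into the replicate means.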

Discussion

IsoProt shows how the ProtProtocols framework can be used to create user-friendly, reproducible bioinformatic workflows. IsoProt makes it simple to include the complete bioinformatic data processing workflow as a supplementary file, so that reviewers and other researchers can easily assess the methods used. Encapsulating protocols in Docker containers preserves the complete setup, including all software versions, which can be referenced through a single protocol version number. This allows anyone to replicate the results at any later stage without having to worry that older software might no longer work. Once a given version of the protocol is downloaded, users can be sure that it will behave in exactly the same way on all supported platforms.

The use of Docker makes the protocol highly portable. Docker currently supports Windows, Linux, and macOS, making our protocol truly multiplatform. Because the protocol can be installed through a single command, moving the setup from one machine to another is trivial. With our “ProtProtocol docker-launcher” tool, the protocol can even be installed with a single click. This should greatly reduce the effort of setting up a complex proteomics analysis environment. Unfortunately, Docker support for Windows is not yet fully stable, and several Windows users experienced issues when installing Docker that prevented them from using IsoProt. Even though this currently reduces the ease of use of ProtProtocols on Windows machines, we believe that this will quickly improve since Microsoft recently became an official partner of Docker.[29]

IsoProt’s performance was tested on three publicly available data sets. The results highlight that subtle differences in the data analysis can lead to considerable differences in the final results. Such differences can only be identified by reproducing the complete environment of the analysis workflow, which is very difficult when relying only on the information in a scientific paper. Thus, more complete and easily readable information on the workflow used and its parameters, or even the entire computational environment, will considerably improve paper reviews as well as the reproduction and discussion of results from published studies. Such workflows will further increase the quality and credibility of both scientific studies and the journals that publish them. IsoProt enables users to easily provide such complete information on their analysis. Our approach facilitates comparison with other data analysis pipelines and testing of robustness to parameter changes with minimal effort, requiring only peak list files, their relation to the experimental design, and the main parameters for identification and quantification.

All of these developments are available as free and open-source software. We therefore encourage other researchers to use the ProtProtocol infrastructure as a starting point to develop their own analysis workflows and make them available to the community. All our tools are modularized and prepared to support and simplify such external developments. Since Docker has become an industry standard for containerized applications, long-term support seems to be guaranteed for these developments. In summary, we developed a user-friendly environment for fully reproducible data analysis and exemplified its use through a complete workflow for the analysis of data from isobarically labeled mass spectrometry experiments.
  28 in total

1.  MilQuant: a free, generic software tool for isobaric tagging-based quantitation.

Authors:  Xiao Zou; Minzhi Zhao; Hongyan Shen; Xuyang Zhao; Yuanpeng Tong; Qingsong Wang; Shicheng Wei; Jianguo Ji
Journal:  J Proteomics       Date:  2012-07-10       Impact factor: 4.044

2.  The generating function of CID, ETD, and CID/ETD pairs of tandem mass spectra: applications to database search.

Authors:  Sangtae Kim; Nikolai Mischerikow; Nuno Bandeira; J Daniel Navarro; Louis Wich; Shabaz Mohammed; Albert J R Heck; Pavel A Pevzner
Journal:  Mol Cell Proteomics       Date:  2010-09-09       Impact factor: 5.911

3.  MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification.

Authors:  Jürgen Cox; Matthias Mann
Journal:  Nat Biotechnol       Date:  2008-11-30       Impact factor: 54.908

4.  PeptideShaker enables reanalysis of MS-derived proteomics data sets.

Authors:  Marc Vaudel; Julia M Burkhart; René P Zahedi; Eystein Oveland; Frode S Berven; Albert Sickmann; Lennart Martens; Harald Barsnes
Journal:  Nat Biotechnol       Date:  2015-01       Impact factor: 54.908

5.  SearchGUI: A Highly Adaptable Common Interface for Proteomics Search and de Novo Engines.

Authors:  Harald Barsnes; Marc Vaudel
Journal:  J Proteome Res       Date:  2018-05-25       Impact factor: 4.466

6.  OpenMS: a flexible open-source software platform for mass spectrometry data analysis.

Authors:  Hannes L Röst; Timo Sachsenberg; Stephan Aiche; Chris Bielow; Hendrik Weisser; Fabian Aicheler; Sandro Andreotti; Hans-Christian Ehrlich; Petra Gutenbrunner; Erhan Kenar; Xiao Liang; Sven Nahnsen; Lars Nilse; Julianus Pfeuffer; George Rosenberger; Marc Rurik; Uwe Schmitt; Johannes Veit; Mathias Walzer; David Wojnar; Witold E Wolski; Oliver Schilling; Jyoti S Choudhary; Lars Malmström; Ruedi Aebersold; Knut Reinert; Oliver Kohlbacher
Journal:  Nat Methods       Date:  2016-08-30       Impact factor: 28.547

7.  Statistical Models for the Analysis of Isobaric Tags Multiplexed Quantitative Proteomics.

Authors:  Gina D'Angelo; Raghothama Chaerkady; Wen Yu; Deniz Baycin Hizal; Sonja Hess; Wei Zhao; Kristen Lekstrom; Xiang Guo; Wendy I White; Lorin Roskos; Michael A Bowen; Harry Yang
Journal:  J Proteome Res       Date:  2017-08-18       Impact factor: 4.466

8.  limma powers differential expression analyses for RNA-sequencing and microarray studies.

Authors:  Matthew E Ritchie; Belinda Phipson; Di Wu; Yifang Hu; Charity W Law; Wei Shi; Gordon K Smyth
Journal:  Nucleic Acids Res       Date:  2015-01-20       Impact factor: 16.971

9.  2016 update of the PRIDE database and its related tools.

Authors:  Juan Antonio Vizcaíno; Attila Csordas; Noemi del-Toro; José A Dianes; Johannes Griss; Ilias Lavidas; Gerhard Mayer; Yasset Perez-Riverol; Florian Reisinger; Tobias Ternent; Qing-Wei Xu; Rui Wang; Henning Hermjakob
Journal:  Nucleic Acids Res       Date:  2015-11-02       Impact factor: 16.971

10.  BioContainers: an open-source and community-driven framework for software standardization.

Authors:  Felipe da Veiga Leprevost; Björn A Grüning; Saulo Alves Aflitos; Hannes L Röst; Julian Uszkoreit; Harald Barsnes; Marc Vaudel; Pablo Moreno; Laurent Gatto; Jonas Weber; Mingze Bai; Rafael C Jimenez; Timo Sachsenberg; Julianus Pfeuffer; Roberto Vera Alvarez; Johannes Griss; Alexey I Nesvizhskii; Yasset Perez-Riverol
Journal:  Bioinformatics       Date:  2017-08-15       Impact factor: 6.937
