Johannes Griss1,2, Goran Vinterhalter3, Veit Schwämmle4. 1. EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, CB10 1SD Hinxton, Cambridge, United Kingdom. 2. Department of Dermatology, Medical University of Vienna, Währinger Gürtel 18-20, 1090 Vienna, Austria. 3. Faculty of Mathematics, University of Belgrade, Studentski trg 16, 11001 Belgrade, Serbia. 4. Department for Biochemistry and Molecular Biology, University of Southern Denmark, Campusvej 55, 5230 Odense, Denmark.
Abstract
Reproducibility has become a major concern in biomedical research. In proteomics, bioinformatic workflows can quickly consist of multiple software tools each with its own set of parameters. Their usage involves the definition of often hundreds of parameters as well as data operations to ensure tool interoperability. Hence, a manuscript's methods section is often insufficient to completely describe and reproduce a data analysis workflow. Here we present IsoProt: A complete and reproducible bioinformatic workflow deployed on a portable container environment to analyze data from isobarically labeled, quantitative proteomics experiments. The workflow uses only open source tools and provides a user-friendly and interactive browser interface to configure and execute the different operations. Once the workflow is executed, the results including the R code to perform statistical analyses can be downloaded as an HTML document providing a complete record of the performed analyses. IsoProt therefore represents a reproducible bioinformatics workflow that will yield identical results on any computer platform.
Lack of reproducibility
in general, and in bioinformatics workflows
specifically, is a growing concern.[1] Bioinformatic
workflows in proteomics experiments often consist of multiple software
tools, each with its own set of parameters. Seemingly small changes
to a workflow, such as using different normalization method details,
can have dramatic effects on the final result. Due to the many steps
and settings that make a complex workflow, it is often impossible
to fully document it in a research paper’s methods section.
Additionally, finding and using the exact same software versions later
on often represents a major obstacle when replicating bioinformatic
analyses. Older versions may no longer be compatible with the available
operating system or are just altogether unavailable. Therefore, fully
reproducible workflows should not only record the exact software versions
and parameters, but also preserve specific software versions and ensure
that they will produce the same results in different computing environments.

Several projects exist to create reproducible bioinformatic workflows.
Biocontainers[2] provides Docker containers
to make bioinformatic tools available in a standardized way. Docker
containers are lightweight virtual machines that, in the case of Biocontainers,
ensure that a given software version performs identically on any operating
system supported by Docker. Therefore, users do not have to install
any software but only download the respective container. Galaxy[3] is a web-based platform for biomedical research
mainly focused on genomics. It contains thousands of tools that can
be joined together to create workflows and also supports tools for
proteomics analyses. KNIME (http://www.knime.com, KNIME AG) is another workflow software focused on data analysis
in general. All OpenMS[4] nodes were recently
integrated in KNIME, making it possible to build complete proteomics
workflows with it. ProteomeDiscoverer (Thermo Fisher) is also a workflow
system but specifically targeting proteomics data analysis. Several
academic research groups[4−6] are contributing to ProteomeDiscoverer
making it usable for a wide variety of proteomics workflows. Finally,
to a certain extent MaxQuant[7] with Perseus[8] allows the user to create a complete analysis
workflow in a single software.

Nevertheless, all of these existing
solutions have shortcomings
that prevent the creation of complete, reproducible workflows. Biocontainers
is a platform to supply bioinformatic tools in a standardized fashion
but has no functionality to combine these tools into workflows. KNIME
and Galaxy are very powerful analysis platforms that can be adapted
to a wide variety of data analysis problems. This functionality comes
at the cost of high complexity, and many nonexpert users will find
it difficult to adapt Galaxy and KNIME to their needs. Additionally,
neither KNIME nor Galaxy provides a way to take a snapshot of
the external tools used to actually process the data. ProteomeDiscoverer
also depends on external nodes. Therefore, to fully replicate an existing
workflow the user again has to take care of locating and installing
the exact same versions of these nodes. Moreover, new ProteomeDiscoverer
versions generally come with significant changes, which require nodes
to be specifically developed for a given version. Nodes developed
for one version of ProteomeDiscoverer are generally incompatible with
newer ones. Therefore, none of these existing solutions fulfill all
requirements of a completely reproducible workflow environment.

Isobaric labeling has become one of the most common methods for
quantitative mass spectrometry-based proteomics experiments. A major
advantage is that it allows researchers to multiplex samples and thereby
reduce instrument runtime and eliminate variability caused by the
mass spectrometer itself. The two methods currently available for
these experiments, tandem mass tag (TMT[9]) and multiplexed isobaric tagging technology for relative quantitation
(iTRAQ[10]) basically only differ in the
reporter masses they generate but do not require dedicated software
tools.

Even though isobaric labeling has become a standard method
in many
laboratories, dedicated, easy-to-use software solutions to analyze
these data are still rare. This is particularly problematic when dealing
with more complex experimental designs that include multiple runs
on the mass spectrometer, such as multiple instances of differently
labeled multiplexed samples. Existing dedicated software solutions,
such as iQuant,[11] isobar,[12] MilQuant,[13] and IsobariQ[14] all require identification results from specific
search engines and do not support complex experimental designs with
more than two treatment groups or samples split across multiple iTRAQ/TMT
runs. Therefore, many research groups rely on unpublished in-house
scripts to process their experiments, which greatly hampers reproducibility.

In an effort to simplify proteomics data analysis and provide fully
reproducible data analysis workflows, we launched the ProtProtocols
project (https://protprotocols.github.io) under the umbrella of the European Bioinformatics Community (EuBIC).[15] On the basis of the Biocontainers project,[2] the protocols are shipped in containerized Docker
images that include all necessary software tools. Docker containers
are lightweight virtual machines that encapsulate all the software
required for the protocol to run. This ensures that the version of
all used software is linked to the protocol version and the user does
not have to worry about installing any separate tools. Hence, 100%
reproducibility can be achieved by using the same protocol version
on any computer with a Docker environment.

Here, we present
IsoProt, which serves as a blueprint for the ProtProtocol
concept. IsoProt is designed for the analysis of isobarically labeled
experiments, which is one of the most commonly used methods for high-throughput
proteomics. Next to a user-friendly web interface, IsoProt provides
accurate statistical analyses for a wide range of common experimental
designs.
Experimental Procedures
Software Layout and Implementation
General Implementation
All software was installed in
a Docker image to ensure full reproducibility on each computer system
supported by Docker. To simplify the installation and usage of our
protocols, we created the free, open-source “ProtProtocol docker-launcher”
(https://github.com/ProtProtocols/docker-launcher). It provides an easy-to-use graphical user interface that can automatically
install the protocol (once Docker is installed) and launch the image.
As it is written in Java, it supports the major operating systems
Windows, Mac OSX, and Linux. Therefore, many technical difficulties
surrounding the use of Docker are hidden from the user. Detailed instructions
on how to install and use all tools, as well as how to extend ProtProtocols,
can be found at https://protprotocols.github.io/documentation/.

The complete protocol is run through a Jupyter notebook (http://jupyter.org) corresponding
to one web page in the browser. All relevant parameters can be set
through common graphical user elements created through Jupyter widgets.
Therefore, the user interface is highly similar to most available
search engines. The complete source code as well as additional documentation
of the protocol is freely available through https://protprotocols.github.io.
Proteomics Software
IsoProt handles the entire analysis
pipeline from mass spectra given as peak lists to the set of differentially
regulated proteins (Figure 1A). We used SearchGUI[16] and PeptideShaker[17] to perform peptide identification and validation,
with MS-GF+[18] as a database search engine.
Proteins are summarized and quantified by R scripts based on the MSnbase
R library.[19] R scripts furthermore generate
figures for quality control and perform statistical tests (LIMMA library[20]) according to the experimental design.
Figure 1
(A) Scheme
of the entire workflow including operations (ellipsoids)
and data given by type and format (squares). The annotation form and
terms of the workflow follow to a large extent the EDAM ontology.[21] (B, C) Experimental designs and organization
in folder structure for analysis in IsoProt.
Input Files and Parameters
Input Files
The
only files required for the analysis
are mass spectra as peak lists (MGF format) and a FASTA file containing
the protein sequences; we recommend the UniProt version of the
FASTA format. Databases can already contain decoy sequences (following
the SearchGUI instructions, http://compomics.github.io/projects/searchgui.html); otherwise, the decoy database is created automatically. The files
can be copied into the Docker file structure or directly mirrored
onto the /data folder as automatically done by our
docker-launcher application.
Analysis Parameters
All parameters required for the
data analysis can be changed through a graphical user interface integrated
into the Jupyter notebook. In the first section, the user has to set
database search related parameters such as precursor and fragment
ion tolerance, the FASTA sequence database to use, the labeling agent
used, and the fixed and variable modifications to consider.

On the basis of the selected labeling method and detected folder
structure, the interface to enter the experimental design is generated.
The protocol currently supports two setups: (1) all MGF files are
placed in the input directory and are part of the same (fractionated)
run (Figure 1B) or
(2) MGF files from different runs are organized by placing them in
different subdirectories (Figure 1C). Next, the experimental design user interface allows
the user to enter names for the sample groups (for example “treatment”
and “control”) and names for the samples (one name per
channel and subdirectory) and assign each sample to one of the groups.
Most importantly, the protocol supports up to 20 sample groups and
can thereby model complex experimental designs.

Finally, the
user is asked to enter parameters related to the analysis
of the quantitative data. Once all required information is entered,
the search and analysis are directly controlled through buttons in
the user interface.
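The two supported input layouts can be illustrated with a short sketch. The Python below is not taken from IsoProt; the function name `group_mgf_by_run` and the label `"run_1"` for the flat layout are hypothetical and serve only to show how a folder structure could be mapped to runs.

```python
from pathlib import Path


def group_mgf_by_run(input_dir):
    """Map run names to sorted lists of MGF file names.

    Layout 1: MGF files directly in input_dir form a single run.
    Layout 2: each subdirectory holding MGF files is treated as one run.
    """
    root = Path(input_dir)
    top_level = sorted(p.name for p in root.glob("*.mgf"))
    if top_level:
        # Hypothetical label for the single (fractionated) run layout.
        return {"run_1": top_level}
    runs = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(f.name for f in sub.glob("*.mgf"))
        if files:
            runs[sub.name] = files
    return runs
```

In this sketch, files found directly in the input directory take precedence over subdirectories, which is a design choice made here for simplicity and not a documented behavior of the protocol.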
Output Files and Quality Control
IsoProt provides figures
and tables for the different steps of the analysis including peptide
identifications, quantitative values of peptide-spectrum matches (PSMs),
and proteins as well as a table for the statistical results from the
significance analysis. Visual measures for quality control were implemented
as R scripts and include total intensities of the reporter ion channels
for each sample, violin plots at different stages of the analysis,
principal component analysis, and volcano plots (Figure 2).
Figure 2
Examples of the visualization
and diagnostic plots created by IsoProt
based on the shipped example data. (A) The mass accuracy of all reporter
ions is presented as a histogram. (B) Correlation of reporter intensities
for all channels. (C) Distribution of estimated abundances on the
spectrum level for all channels. (D) Distribution of protein abundances
for all samples. (E) Principal component analysis of all samples based
on the aggregated data highlighting the treatment groups. (F) Volcano
plot for a quick visualization of quantitative data and statistical
results.
Test Data Sets
To evaluate the performance of our analysis
workflow, we processed the data from three publicly available data
sets using the same search parameters as in the original studies.
We downloaded the respective RAW files from PRIDE Archive[22] and converted them into the MGF file format
using ProteoWizard’s msconvert tool[23] when no MGF peak list files were available.
Benchmark Data Set
D’Angelo et al. recently
published a TMT benchmark data set containing an experiment where
12 human proteins were spiked into an Escherichia coli background[24] using various concentrations
(PRIDE Archive identifier PXD005486). D’Angelo et al. used
this data set to assess the number of proteins that were incorrectly
identified as being regulated. As every protein was added using varying
concentrations among the samples, a standard statistical analysis
of the spiked-in proteins was not possible. Therefore, our analysis
focuses on the accuracy of the derived quantitative estimates for
the spiked proteins and the (unchanged) background E. coli proteins.

The complete analysis was performed using IsoProt
version 0.2. Spectra were identified using MS-GF+[18] through SearchGUI version 3.3.3.[16] The precursor tolerance was set to 20 ppm and the fragment tolerance
to 0.03 Da. One missed cleavage was allowed. Carbamidomethylation,
TMT 10-plex of K, and TMT 10-plex of peptide N-term were set as fixed
modifications. Oxidation of M was set as variable modification. PSMs
were filtered at a target false discovery rate (FDR) of 0.01 using
the target-decoy approach. UniProt E. coli sequences
(version August 2018) and the spiked human protein sequences, also
from UniProt, were used for spectra identification.

Quantitative
analysis was done using the R Bioconductor package
MSnbase version 2.7.1.[19] Protein summarization
was performed using the “medpolish” method as implemented
by MSnbase. Modified peptides were not used for quantitation. Only
proteins with at least two identified peptides were accepted for further
analysis. Differential expression was assessed using the R Bioconductor
package limma version 3.34.[20]
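As an illustration of the target-decoy filtering step mentioned above, the following sketch keeps the largest set of top-scoring PSMs whose estimated FDR (decoy hits divided by target hits) stays at or below a threshold. It is a simplified example under stated assumptions and not the PeptideShaker implementation: the function name, the `(score, is_decoy)` tuple layout, and the convention that higher scores are better are all hypothetical.

```python
def filter_psms_by_fdr(psms, max_fdr=0.01):
    """Return target PSMs accepted at the given estimated FDR.

    psms: iterable of (score, is_decoy) tuples; a higher score is assumed
    to indicate a better match. At each score cutoff the FDR is estimated
    as (number of decoys) / (number of targets).
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best_cutoff = 0  # number of ranked PSMs to keep
    for i, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest cutoff that still satisfies the threshold.
        if targets and decoys / targets <= max_fdr:
            best_cutoff = i
    # Decoys above the cutoff are dropped from the reported list.
    return [p for p in ranked[:best_cutoff] if not p[1]]
```

Calling `filter_psms_by_fdr(psms, max_fdr=0.01)` mirrors the 0.01 target FDR used in the reanalyses, under the assumptions listed above.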
Cerebral Malaria Pathogenesis
The study uses TMT6 labeling
to compare mouse blood with different stages of cerebral malaria (d3,
ECM) to noninfected mice (NI).[25] Four replicates
of each of the three sample types were arranged in two TMT6 sets and
run separately. This corresponds to a case similar to Figure 1C, with three conditions
distributed over two separate runs on the mass spectrometer.
Peak list data files (MGF file format) were downloaded from PRIDE
Archive (PXD003772).

The analysis was again performed using
IsoProt version 0.2 (see above) with the precursor tolerance set to
10 ppm and the fragment tolerance to 0.05 Da. One missed cleavage
was allowed. Carbamidomethylation, TMT 6-plex of K, and TMT 6-plex of
peptide N-term were set as fixed modifications. Oxidation of M was
set as variable modification. PSMs were filtered at a target FDR of
0.01 using the target-decoy approach. SwissProt sequences from mouse
(January 2018) were used for spectra identification. Only proteins
with at least two identified peptides were accepted for further analysis.
Nonmuscle Invasive and Muscle-Invasive Bladder Cancer
The
study compares tumor tissue samples from nonmuscle invasive and
muscle-invasive bladder cancer.[26] MGF files
were downloaded from PRIDE Archive (PXD002170).

The analysis
was again performed using IsoProt version 0.2 (see above) with the
precursor tolerance set to 10 ppm and the fragment tolerance to 0.05
Da. One missed cleavage was allowed. Carbamidomethylation, iTRAQ
8-plex of K, iTRAQ 8-plex of Y, and iTRAQ 8-plex of peptide N-term were
set as fixed modifications. Oxidation of M was set as variable modification.
PSMs were filtered at a target FDR of 0.01 using the target-decoy
approach. SwissProt sequences from human (January 2017)
were used for spectra identification. Only proteins with at least
two identified peptides were accepted for further analysis.
Results
IsoProt allows users to run the full data analysis
of iTRAQ/TMT
experiments in a straightforward and reproducible way. The protocol
supports different experimental designs including multiple runs on
the mass spectrometer and differently labeled multiple samples. Additionally,
the open layout of the protocol allows complex adjustments and modifications
at all stages of the workflow.
A Fully Reproducible Environment
The protocol can be
run on any computer with a functional Docker environment, by just
downloading and running the available Docker image. This is fully
automated through our “ProtProtocol docker-launcher”
tool (https://github.com/ProtProtocols/docker-launcher). Hence, the
protocol avoids all possible platform- and operating system-specific
installation issues and provides identical results independent of
operating system, its configuration, and computer hardware.

Every IsoProt release has a stable version number that points to
a specific Docker image. Therefore, by citing the used IsoProt version
number, it will always be possible to exactly restore the used analysis
environment, including the versions of all used software tools. Once
the protocol has been executed, it is possible to save it, including
all generated figures, as a standard HTML page. Therefore, the complete
analysis workflow can be easily made available, for example, at the
time of review, and be viewed with a standard web browser. Additionally,
all user-entered parameters are stored in text files next to the analysis
results, which can easily be reused for future projects (see https://protprotocols.github.io/documentation/isoprot/save_analysis for details). For an overview of the visualizations, see Figure 2.
Simple Example Workflow
IsoProt can be tested using
an example data set that is small enough to run in under 10 min on
a standard computer. The data set is part of the IsoProt Docker image,
and the necessary parameter settings are preloaded when starting IsoProt.
The database search via SearchGUI and validation via PeptideShaker
result in a tab-delimited file containing detailed information on
all PSMs. Search and output parameters are automatically saved for
future reference. Additionally, a “methods” section
is generated that can be included in a manuscript and describes all
used settings. Each spectrum file is processed separately to match
and quantify PSMs that passed the identification FDR (default 0.01).
The mass distribution of all matched fragment ions allows control
for critical channels with inefficient labeling (Figure 2A). All PSM quantifications
are saved in a separate file (AllQuantPSMs.csv).

The outputs of all files of each run on the mass spectrometer are merged, normalized,
and visualized for quality control. Violin plots of normalized PSM
intensities compare the intensity distributions (Figure 2C). Channels with different
distributions can identify problematic samples or changes within the
entire proteome. Six different histograms counting PSM, peptide, protein,
and protein group numbers allow determining protein coverage and
uniqueness by the available mass spectra. Similarity between samples
is assessed through scatter plots comparing all quantified spectra
from all ion channels (Figure 2B).

Using the default parameters, the PSMs are summarized
to proteins
using median summarization after outlier removal requiring a minimum
of 1 PSM per protein. In addition, the protocol supports iPQF,[27] mean expression, median expression (without
outlier removal), and robust summarization as methods. A violin plot
of protein ratios versus mean of all channels shows whether the analyzed
samples exhibit similar distributions on the protein level (Figure 2D).

Quantifications
from different runs (only one in the example) are
merged and submitted to a principal component analysis (Figure 2E). This places all samples
in a two-dimensional space and color codes different treatment groups.
Studies where the samples of the different types are not placed as
distinguishable groups are unlikely to provide differentially regulated
proteins. Additionally, potential systematic biases can quickly be
discovered using this plot.

In the example data set, a total
of 221 protein groups were quantified. LIMMA
statistical tests did not find any regulated proteins with FDR <
0.05, which is in agreement with the original results. p-Values and false discovery rates (p-values corrected
for multiple testing) are visualized in histograms, volcano plots
(Figure 2F), and a
figure counting the number of differentially regulated proteins over
a range of FDRs. The latter can be used to identify a suitable combination
of the confidence threshold and the number of significant proteins.
It is advised to keep FDR < 0.1 as the number of false positives
becomes critically high otherwise.
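The relationship between p-values and FDRs can be made concrete with a small sketch. It assumes the Benjamini-Hochberg procedure, a widely used correction for multiple testing; the text above does not state which correction IsoProt applies, so this is an illustration and not a description of the pipeline, and `benjamini_hochberg` is a hypothetical helper name.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def count_significant(p_values, fdr_threshold):
    """Count tests whose adjusted p-value falls below the threshold."""
    return sum(q < fdr_threshold for q in benjamini_hochberg(p_values))
```

Evaluating `count_significant` over a range of thresholds gives the kind of curve described above, that is, the number of differentially regulated proteins as a function of the chosen FDR.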
Performance Tests by Reprocessing Public Data
D’Angelo
et al. performed
a comparison of different approaches to analyze TMT data sets.[24] In their first data set, the authors spiked
different concentrations of 12 human proteins into an E. coli background. They used this data set to assess the type-I error as
the number of false positive proteins. Similar to the original study,
we assigned the first five channels to one treatment group, and the
second five channels to the second group. As expected, no proteins
were identified as being significantly regulated. The estimated log-fold
changes of the E. coli background proteins were all
close to 0 (Figure 3A).
Figure 3
(A) Log-fold changes of the E. coli background
proteins. This represents the expression of background proteins based
on a comparison of the first five channels against the other five
ones similar to the approach by D’Angelo et al. As expected,
the estimated log-fold changes are all closely centered around 0.
(B) Observed bias and RMSE of estimated fold-changes of the D’Angelo
et al. benchmark data set from our pipeline and the best-performing
pipeline published by the authors. (C) Variation of ground truth proteins
at different spike-in levels. Imputation by lowest value (green) leads
to increased variation compared to no imputation (red), except for
the lowest level.
Proteins were spiked
twice using the same concentration in different
channels and only once for the two highest concentrations. Therefore,
only a single, or two replicate measurements at maximum are available
when comparing two concentrations. Since this setup prevents a standard
statistical evaluation, we focused on the accuracy of the estimated
fold changes using the same error measurements as in the original
manuscript. Similarly, we assessed the accuracy of our fold change
estimates using the bias and the root-mean-square error (RMSE). Across
most spiked fold-changes, we observed a comparable bias and RMSE (Figure 3B).

For the
highest spiked fold-change, we observed slightly higher
average error rates than D’Angelo et al. This is most likely
caused by the fact that D’Angelo et al. imputed missing values
by taking the lowest observed intensity of the given PSM across all
samples. Thereby, missing values were automatically interpreted as
very low expression. As expected, the measured abundance of the lowest
concentrations showed larger variation with several missing values.
In our approach, these missing values were ignored, leading to
less stable average fold-changes. Imputing missing values as D’Angelo
et al. did naturally reduces this variation and thus the error
rates. However, when we, for example, estimated the fold change between
the two highest protein concentrations (also a fold change of 2),
the bias was 0 with an RMSE of 0.2, a dramatic improvement in the error rates.

While D’Angelo et al.’s imputation approach is valid
if values can be expected to be missing not at random (i.e., because
of a concentration below the limit of detection), it is not valid
for values missing (completely) at random (i.e., because of inefficient
labeling).[28] Therefore, for the spiked
proteins D’Angelo et al.’s approach should only have
been applied to cases where the lowest amounts of protein were spiked.
Since it is generally unknown why a value is missing in actual experiments,
our pipeline does not use any imputation. Limma’s model treats
these values as “missing at random”, which we feel is
more appropriate for most biological studies.

To estimate the
effect of these different approaches, we calculated
the variance of the spiked-in proteins as the sum of the absolute
difference between the duplicate measurements. Independent of the
used protein summarization method, imputing missing values increased
the variance of all but the duplicates with zero concentration of
the respective proteins (Figure 3C). In our opinion, this highlights the downside of
using “blind” imputation for all missing values as this
can result in increased noise levels or bias in the data set. The
complete output of our pipeline can be found in Supplementary File 1.
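The two error measures used in this comparison can be written out explicitly. The sketch below assumes that bias is the mean difference between estimated and expected log2 fold changes and that RMSE is the square root of the mean squared difference; the exact definitions in the original benchmark may differ in detail, and the function name is hypothetical.

```python
import math


def bias_and_rmse(estimated_log2fc, expected_log2fc):
    """Return (bias, RMSE) of estimated log2 fold changes.

    estimated_log2fc: per-protein estimates for one spiked fold change.
    expected_log2fc: the known (spiked) log2 fold change.
    """
    if not estimated_log2fc:
        raise ValueError("at least one estimate is required")
    errors = [e - expected_log2fc for e in estimated_log2fc]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(err ** 2 for err in errors) / len(errors))
    return bias, rmse
```

A bias near 0 with a nonzero RMSE, as in the example of the two highest concentrations, means the estimates scatter symmetrically around the expected value.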
Cerebral Malaria Pathogenesis
The authors investigated
differences in the plasma proteome between healthy and malaria-infected mice (two stages). The available two TMT 6-plex sets were considered
to contain independent samples. IsoProt quantified more protein groups
(324 versus 289) when requiring a minimum of 2 unique PSMs and an
identification FDR < 1%. For the further comparison, we restricted
the IsoProt output to the uniquely identified 214 proteins (no peptides
shared with other proteins).

In the original study, statistical
testing was carried out separately for the two TMT runs, yielding
a total of 54 (more precisely 43 as 11 were detected in both runs)
proteins found to be differentially regulated between Plasmodium
berghei ANKA (PbA)-infected (d8 postinfection, labeled ECM)
and noninfected (labeled NI) mice (Mann–Whitney U test, p ≤ 0.001). Since the authors did
not correct p-values for multiple testing, these
results cannot be considered significant. We found a total of 41 differentially
regulated proteins (FDR < 0.01) and an overlap of only 20 proteins
with the original study.

Given the different statistical procedures,
we analyzed all proteins
that were found differentially regulated by either one of the methods.
All but four proteins found differentially regulated in the original
study were quantified by IsoProt and showed similar abundances in
both analyses (Figure 4A). Proteins only deemed significant in the original study were not
found significant by IsoProt mostly due to low fold-changes (Figure 4B).
Figure 4
(A) Comparison of fold-changes
of proteins differentially regulated
in the original study (A1) and IsoProt results (A2). Proteins found
differentially regulated in the original study were labeled red. (B)
Abundance profiles of Retinol-binding protein 4 (Q00724) (B1) and
disulfide-isomerase (P09103) (B2).
We further investigated the two proteins that mostly differed
between
the two types of analyses. Retinol-binding protein 4 (Q00724) was
the protein with the lowest FDR within the proteins found differentially
regulated only by IsoProt. Figure 4C shows PSM measurements for the 2 TMT runs of this
protein (scaled for better comparison). Summarized protein abundances
(thick lines) by median summarization with outlier removal show that
the PSMs of peptides with less differential behavior were removed.
By merging the observations of the two TMT runs, IsoProt increases
its statistical power and thus provides evidence for regulatory behavior
of this protein.

On the other hand, protein disulfide-isomerase
(P09103)
was the protein with the highest FDR (least significant) in IsoProt
that was found significantly regulated in the original study (TMT-1, Figure 4D). Given only high
abundances in one of the two ECM replicates in TMT-1, manual interpretation
would discard this protein from being regulated (Figure 4D). The PSMs measured in the
second run (TMT-2) confirm this observation. The complete output of
our pipeline can be found in Supplementary File 1.
Nonmuscle Invasive and Muscle-Invasive Bladder Cancer
IsoProt quantified 1145 protein groups when restricting the results to a minimum of 2 unique peptides and 1% FDR, compared to 1092 in the original study (minimum of 2 peptides, Occam's razor principle for peptide inference, and 1% FDR). The two analyses overlapped in 662 proteins. Although the two analyses differed only in their bioinformatics workflows, the mean log-fold changes of proteins between the two cancer subtypes were very different (Figure 5A, Pearson’s correlation of 0.78).
Figure 5
(A) Comparison of log-ratios between IsoProt output and original study. Pearson’s correlation between both quantifications: 0.79. (B–C) Volcano plots for results from statistical testing in the original study (B) and in IsoProt (C). Colored points correspond to proteins with an (uncorrected) p-value below 5% in the respective other study. (D–E) Distribution of relative protein abundances in the original study (D) and IsoProt (E).
IsoProt found one differentially regulated protein (15-hydroxyprostaglandin dehydrogenase, FDR < 0.01) after correction for multiple testing, which was not carried out in the original study. To allow a comparison of both results, we therefore also used uncorrected p-values for the following analysis. This is not recommended, as it is prone to greatly overestimating the number of regulated proteins. When comparing these uncorrected p-values, the majority of “significant” proteins differed between the two studies (Figure 5B,C; colored points indicate p < 0.05 in the respective other study).
This striking difference in the statistical results is due to the different
normalization approaches used. Their effect can be seen in the distribution
of protein abundances (Figure 5D,E). The authors of the original study normalized the ratios
between cancer subtypes after protein summarization and averaging
of replicates. The more common and, in our opinion, correct approach
is to normalize the different channels (i.e., individual samples)
at the (measured) PSM or (aggregated) peptide level prior to the aggregated
analysis of these measurements at the protein level and, most importantly,
prior to merging any independent (i.e., replicate) measurements. Strong
deviations of individual channels, which are visible at the peptide
level, were thus discarded in the original study. The complete output
of our pipeline can be found in Supplementary File 1.
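The order of operations argued for above (normalize channels at the PSM level first, summarize to proteins afterwards) can be sketched as follows. This is an illustration under stated assumptions, not IsoProt's implementation: the normalization shown (subtracting each channel's median log2 intensity) and the plain median summarization are common choices assumed here, and all function names are hypothetical.

```python
from collections import defaultdict
from statistics import median


def normalize_channels(psms):
    """Subtract each channel's median log2 intensity from all PSMs.

    `psms` is a list of (protein_id, [log2 intensity per channel]) tuples.
    Assumption: median centering; the source does not name the method.
    """
    n_channels = len(psms[0][1])
    channel_medians = [
        median(row[c] for _, row in psms) for c in range(n_channels)
    ]
    return [
        (prot, [v - m for v, m in zip(row, channel_medians)])
        for prot, row in psms
    ]


def summarize_proteins(psms):
    """Median of the PSM values per protein and channel."""
    grouped = defaultdict(list)
    for prot, row in psms:
        grouped[prot].append(row)
    return {
        prot: [median(ch) for ch in zip(*rows)]
        for prot, rows in grouped.items()
    }


# The second channel carries a constant offset of +1 (e.g., unequal
# sample loading); only P3 differs between the channels beyond that.
psms = [
    ("P1", [10.0, 11.0]), ("P1", [10.0, 11.0]),
    ("P2", [12.0, 13.0]), ("P2", [12.0, 13.0]),
    ("P3", [14.0, 16.0]),
]
print(summarize_proteins(normalize_channels(psms)))
```

Because the channel offset is removed before any aggregation, deviations of individual channels remain visible at the PSM level and do not leak into the protein-level ratios.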
Discussion
IsoProt shows how the ProtProtocols framework can be used to create user-friendly, reproducible bioinformatic workflows. IsoProt makes it simple to include the complete bioinformatic data processing workflow as a supplementary file. Thereby, reviewers and other researchers can easily assess the methods used.
Encapsulating protocols into Docker containers preserves the complete setup, including all software versions, which can be referenced through a single protocol version number. This allows anyone to replicate the results at any later stage without having to worry that older software might no longer work. Once a given version of the protocol is downloaded, users can be sure that it will behave in exactly the same way on all supported platforms.
The use of Docker makes
the protocol highly portable. Docker currently
supports Windows, Linux, and Mac OS, making our protocol truly multiplatform.
The fact that the protocol can be installed through a single command
makes it trivial to move the setup from one machine to another. With
our “ProtProtocol docker-launcher” tool, the protocol
can even be installed with the click of a single button. This should
greatly reduce the effort in setting up a complex proteomics analysis
environment. Unfortunately, Docker support for Windows is not yet
fully stable, and several Windows users experienced issues
when installing Docker that prevented them from using IsoProt. Even
though this currently reduces the ease of use of ProtProtocols on
Windows machines, we believe that this will quickly be improved since
Microsoft recently became an official partner of Docker.[29]
IsoProt’s performance was tested
on three publicly available
data sets. The results highlight that subtle differences in the data
analysis can lead to considerable differences in the final results.
Such differences can only be identified by reproducing the complete
environment of the analysis workflow, something that is very difficult
to realize when only relying on information from a scientific paper.
Thus, more complete and easily readable information on the workflow used
and its parameters, or even the entire computational environment,
will considerably improve paper reviews as well as the reproduction and
discussion of results from already published studies. Such workflows
will further increase the quality and credibility of both scientific studies
and the journals presenting them. IsoProt enables users to easily provide
such complete information on their analysis. Our approach facilitates
comparison with other data analysis pipelines or testing of robustness
to parameter changes with minimal effort, requiring only peak list
files, their relation to the experimental design, and the main parameters
for identification and quantification.
All of these developments
are available as free and open-source
software. Thereby, we encourage other researchers to use the ProtProtocol
infrastructure as starting point to develop their own analysis workflows
and make them available to the community. All our tools are modularized
and prepared to support and simplify such external developments. Since
Docker has become an industry standard for containerized applications,
long-term support seems to be guaranteed for these developments.
In summary, we developed a user-friendly environment for fully
reproducible data analysis and exemplified its use through a complete
workflow for the analysis of data from isobarically labeled mass spectrometry
experiments.