High-throughput omics data often contain systematic biases introduced during various steps of sample processing and data generation. As the source of these biases is usually unknown, it is difficult to select an optimal normalization method for a given data set. To facilitate this process, we introduce the open-source tool "Normalyzer". It normalizes the data with 12 different normalization methods and generates a report with several quantitative and qualitative plots for comparative evaluation of different methods. The usefulness of Normalyzer is demonstrated with three different case studies from quantitative proteomics and transcriptomics. The results from these case studies show that the choice of normalization method strongly influences the outcome of downstream quantitative comparisons. Normalyzer is an R package and can be used locally or through the online implementation at http://quantitativeproteomics.org/normalyzer .
High-throughput technologies such as DNA
microarrays and mass spectrometry (MS) generate vast amounts of information-rich
transcriptomics, proteomics, and metabolomics data. These technologies
have made significant progress in the past decade enabling detection
and expression level quantification of thousands of genes, proteins,
and metabolites in biological samples. Technical advancement of MS-based
instruments in recent years has increased the detection accuracy and
reduced the data generation time. This enables accurate detection
and quantitative comparison of thousands of proteins from two to several
samples at a time. However, high-throughput omics data often contain
systematic biases introduced during various steps of sample processing
and data generation. Failing to account for these biases could lead
to misleading conclusions from quantitative analysis. Data normalization,
if properly done, reduces systematic biases and is thus necessary
prior to any downstream quantitative analysis. Different normalization
methods address systematic biases in the data differently, and thus
choosing an optimal normalization method for a given data set is critical.
As the source of systematic bias in the data is usually unknown, an
exhaustive comparative evaluation of both un-normalized data and the
data normalized through different methods is required to select a
suitable normalization method. For a detailed review on normalization
of label-free proteomics data, refer to Karpievitch et al.[1]

Different normalization methods for omics
data have been evaluated recently, and it is apparent that different
methods produce considerably different results.[2−8] Callister et al. evaluated four different normalization methods
for label-free proteomics data and concluded that methods based on
linear regression performed best but suggested that further investigation
is needed.[5] Kultima et al.[3] proposed a new normalization method, RegrRun, which performed
best among 10 different methods on peptidomics data. Choe et al.[2] evaluated four different normalization methods
for DNA-microarray data and concluded that the LOESS method performed
best. Lyutvinskiy et al.[9] developed
a normalization strategy for label-free proteomics data to account
for fluctuations in the electrospray ionization in the time domain.
Wang et al.[10] hypothesized that the missing
values in the proteomics data set are non-random and proposed a two-step
approach where data are first normalized by top 80 order statistics
to estimate a scaling factor for each sample, followed by missing
value imputation taking into account the scaling factor for each sample.
Webb-Robertson et al.[8] proposed a statistical
selection strategy called SPANS based on Rank Invariant Peptides and
provided a tool to evaluate different peptide selection methods for
normalization and subsequent normalization with emphasis on possible
bias introduced by the normalization. It is thus apparent that suitability
of a normalization method is dependent on the intrinsic characteristics
of the data.

Evaluation of data normalization can be done both
quantitatively and qualitatively. Quantitative analysis is mainly
based on the measure of dispersion around the mean within and between
groups. The most common quantitative measures are standard deviation
(SD), coefficient of variation (CV), median absolute deviation (MAD),
and pooled estimate of variance (PEV). SD is expressed in the units of
the data and is interpreted relative to the sample mean, making it difficult
to compare samples with differing means. Measuring PEV, which pools the
variance within groups, could be an alternative for comparisons. CV measures
variation as a percentage of mean and thus can be expressed independently
of the mean, making it easier to compare variability between samples.
However, CV is highly sensitive when the sample mean is close to zero
as even low variation could produce high CV. Moreover, SD, PEV, and
CV are sensitive to outliers. MAD measures the median of the absolute
deviations around the sample median and thus is more robust and less
sensitive to outliers. These methods were used previously for normalization
evaluation of omics data.[3,5,7] Qualitative evaluation can be based on boxplots, MA plots, dendrograms,
or correlation plots. Optimally, a normalization method for a given
data set should be selected on the basis of both quantitative and
qualitative evaluation measures and by further analysis of previously
known housekeeping genes or proteins.

Here, we introduce Normalyzer,
a new tool developed to evaluate the suitability of different normalization
methods for a given data set based on commonly used quantitative and
qualitative parameters. Normalyzer can be used for normalizing data
from DNA microarrays, label-free proteomics, metabolomics, targeted
mass spectrometry, or quantitative RT-PCR as long as the data are
approximately normally distributed and are formatted as per the requirements.
Normalyzer is fully automated and outputs normalized data from 12
different normalization methods along with an evaluation report. It
is an open-source tool and can be run online with a user-friendly
interface or can be installed locally as an R-package. Here, the usability
of Normalyzer is demonstrated with three different case studies.
Methods
Implementation
Normalyzer is implemented in R using Bioconductor[11] packages. The Normalyzer R-package can be downloaded from
(http://quantitativeproteomics.org/normalyzer) and can
be installed locally with R (version 3.0). Installation and usage
instructions can be found at the above URL. An online service with
a graphical user interface is also provided at the Web site.
Data Requirements
Normalyzer accepts data with raw intensities in a tab-separated
format. The raw data should not be in logarithmic scale. Any number
of row and column annotations can be included if labeled accordingly.
The data set should be relatively large, preferably at least a few
hundred variables, and the observations need to contain replicate
groupings to enable normalization evaluation. The data can be read
in from a text file or as a data frame to facilitate inclusion of
Normalyzer in existing pipelines.

A challenge with shotgun proteomics
data is the occurrence of missing values due either to peptide quantities
being below the detection limit or to other technical issues. Imputation
of missing values could in some cases lead to erroneous results and
thus should be done with caution.
Normalization Methods
Several popular normalization methods are included, such as total
intensity (TI), median intensity (MedI), average intensity (AI), quantile
(preprocessCore package),[12] NormFinder[13] (NF), Variance Stabilizing Normalization (VSN,
vsn package),[14] Robust Linear Regression
(RLR), and LOESS (limma package).[15] These
methods are implemented as global normalization methods (denoted by
‘G’). Furthermore, VSN, LOESS, and RLR are also implemented
as local methods (denoted by ‘R’) wherein the replicate
groups are normalized separately. For computational reasons, NormFinder
is automatically turned off for data sets where the number of variables
with non-missing values is higher than 1000. Missing values (denoted
NA) are tolerated differently by different normalization methods.
Missing values are excluded during the log2 transformation,
TI, MedI, AI, and RLR normalization; thus, NAs remain NAs even after
normalization and only numerical data are normalized. For determining
the control variables by NormFinder, only variables with no missing
values are considered. For LOESS and VSN normalization, the data set
is processed as-is and all warnings (if any) generated during LOESS
and VSN normalization are saved to the warnings file. The data normalized
by these methods are then evaluated both quantitatively and qualitatively.
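The handling of missing values described above can be illustrated with a minimal Python sketch of a global median-based normalization on log2 data. This is an illustration under stated assumptions, not the Normalyzer implementation: the function names, the use of `None` for NA, and the choice of shifting every sample to the mean of the sample medians are all hypothetical.

```python
import math
from statistics import mean, median


def log2_transform(sample):
    """Log2-transform raw intensities; missing values (None) stay missing."""
    return [None if v is None else math.log2(v) for v in sample]


def median_normalize(samples):
    """Shift each log2-transformed sample so that all sample medians agree.

    Assumption: the common target is the mean of the sample medians.
    Missing values are skipped when computing medians and remain None.
    """
    logged = [log2_transform(s) for s in samples]
    medians = [median(v for v in s if v is not None) for s in logged]
    target = mean(medians)
    return [
        [None if v is None else v - m + target for v in s]
        for s, m in zip(logged, medians)
    ]


# Two hypothetical samples (raw, non-logarithmic intensities);
# the first sample has one missing value.
raw = [
    [100.0, 200.0, None, 400.0],
    [200.0, 400.0, 600.0, 800.0],
]
normalized = median_normalize(raw)
```

After normalization both samples share the same median, and the missing value in the first sample is still missing. Raw intensities of zero would need separate handling, since their logarithm is undefined.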
Evaluation Measures
To aid in the selection of an optimal
normalization method, different quantitative and qualitative statistical
measures are considered. The results from these measures are plotted
and saved to the report. Measures include total intensity, total missing
values, Pooled intragroup Coefficient of Variation (PCV), Pooled intragroup
Median Absolute Deviation (PMAD), Pooled intragroup estimate of variance
(PEV), stable variables plot, CV-intensity plot, dendrograms, Pearson
and Spearman correlation, MA-plots,[16] boxplots,
density plots, Q-Q plots, Multidimensional scaling (MDS) plots, meanSD
plot, and Relative Log Expression (RLE) plots as illustrated in Figure 1.
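As a rough illustration of the pooled intragroup measures listed above, the following Python sketch computes a dispersion measure per variable within each replicate group, averages it over variables, and then averages over groups. The pooling scheme, the function names, and the example values are assumptions made for this sketch; the exact formulas used by Normalyzer may differ, and missing values are not handled here.

```python
from statistics import mean, median, stdev, variance


def cv_percent(values):
    """Coefficient of variation as a percentage of the mean."""
    return 100.0 * stdev(values) / mean(values)


def mad(values):
    """Median of the absolute deviations around the median."""
    m = median(values)
    return median(abs(v - m) for v in values)


def pooled(measure, data, groups):
    """Average `measure` over variables within each group, then over groups.

    data:   rows are variables, columns are samples
    groups: maps a group name to the column indices of its replicates
    """
    per_group = []
    for columns in groups.values():
        per_variable = [measure([row[c] for c in columns]) for row in data]
        per_group.append(mean(per_variable))
    return mean(per_group)


# Hypothetical log2 intensities: two variables, two groups of three replicates.
data = [
    [10.0, 10.2, 9.8, 12.0, 12.1, 11.9],
    [8.0, 8.1, 7.9, 8.0, 8.3, 7.7],
]
groups = {"A": [0, 1, 2], "B": [3, 4, 5]}

pcv = pooled(cv_percent, data, groups)   # pooled intragroup CV (%)
pmad = pooled(mad, data, groups)         # pooled intragroup MAD
pev = pooled(variance, data, groups)     # pooled variance (equal group sizes)
```

With equal group sizes, averaging the per-group variances is equivalent to the usual pooled variance; with unequal group sizes a weighting by degrees of freedom would be needed.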
Figure 1
Normalyzer workflow highlighting types of input data,
normalization, analysis methods, and final output.
Case Studies
To evaluate the performance
of Normalyzer, three different data sets with varying characteristics
were selected. Case studies 1 and 2 contain benchmark data generated
by spiked-in variables at varying concentrations, but with controlled
background and negligible biological variation, whereas case study
3 contains experimental data with considerable biological variation.
Case Study 1: LC–MS/MS Proteomics Benchmark Data
A previously
published shotgun proteomics data set[17] was used in this case study. The samples consist of 48 human proteins
(UPS1, Sigma) spiked-in at five different known concentrations (0.25,
0.74, 2.2, 6.7, and 20 fmol/μL) in a standard yeast lysate.
The raw data (OrbitrapO@65) were downloaded from the CPTAC data portal
and were converted to mzML with MS Numpress compressed binaries (https://github.com/ms-numpress/ms-numpress) and MGF using Proteowizard.[18] The files were processed in the Proteios Software
Environment (ProSE)[19] through a label-free
quantitative workflow described previously.[20] MS/MS identification was performed in Mascot Server 2.4.1 (http://www.matrixscience.com) with a database consisting of S. cerevisiae proteins from SwissProt (downloaded 20 October
2009) and the protein sequences found in the Sigma UPS1 protein set,
concatenated with an equal size decoy database. Match tolerances were
7 ppm for precursors and 0.5 Da for fragments. Carbamidomethylation
of cysteine was used as fixed modification setting and oxidation of
methionine as variable, and one missed cleavage was allowed. The resulting
data set with raw intensities for 36,484 features was used for evaluation
of normalization methods in Normalyzer.
Case Study 2: DNA Microarray Benchmark Data
A previously published benchmark data set
with 3,860 spiked-in cRNAs generated with Affymetrix GeneChips[2] was used to evaluate Normalyzer performance on
array data. The samples consist of 1,309 cRNAs spiked in at differing
concentrations between S and C samples and 2,551 cRNAs spiked in at
identical relative concentrations. The S and C samples were hybridized
in triplicate to Affymetrix GeneChips (six arrays). The raw data were
downloaded and preprocessed by MAS5 in R/Bioconductor.[11,21] Filtering of probe sets to retain those with more than one present
call in six samples resulted in a final data set with 4,156 probe
sets that was used in Normalyzer.
Case Study 3: LC–MS/MS Proteomics Biological Data
Shotgun proteomics data generated
from the secreted protein fraction of P. infestans-infected leaves of three potato (S. tuberosum)
cultivars from a previous study (Ali et al., submitted, ProteomeXchange
DOI 10.6019/PXD000435) were used as the third case study. The data set consists
of label-free quantitative mass spectrometry data with up to five
replicates collected just before infection and at three different
time points post-infection. Sample processing was conducted essentially
as described previously,[22] and the data
were processed as in Sandin et al.[20] with
msInspect peptide feature detection.[23] The
extracted and aligned features were used for the present study. Singly
charged features and features with missing values in more than 40
samples were excluded. The data with raw intensities for 16,896 features
from 60 samples were normalized in Normalyzer.
Results and Discussion
The aim of Normalyzer is to aid in the selection
of an optimal normalization method for a given data set based on quantitative
and qualitative aspects of data variability. Normalyzer can be run
either online with a graphical user interface or offline as an R package.
Any type of omics data is supported as long as the basic data requirements
are fulfilled. Normalyzer evaluates the suitability of 12 normalization
methods for the uploaded data using quantitative and qualitative parameters
(Figure 1). It should be noted that most normalization
methods assume that the majority of variables are relatively stable
between samples, and data that do not fulfill this requirement could
be biased after global normalization. Therefore, methods are also
implemented to normalize locally within replicate groups. These methods
are denoted ‘R’ in the report, while methods denoted
‘G’ are global. However, for most data sets global normalization
should be the first choice, since local normalization may skew group
comparisons.
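The trade-off between global and local normalization can be made concrete with a small Python sketch. Median centering is used here only as a simple stand-in; in Normalyzer the local variants are VSN, LOESS, and RLR. All function names and values are hypothetical.

```python
from statistics import mean, median


def center(samples):
    """Shift each sample so its median equals the mean of the sample medians."""
    medians = [median(s) for s in samples]
    target = mean(medians)
    return [[v - m + target for v in s] for s, m in zip(samples, medians)]


def normalize_global(samples):
    """Center all samples against one common target."""
    return center(samples)


def normalize_local(samples, groups):
    """Center samples separately within each replicate group."""
    result = [None] * len(samples)
    for indices in groups.values():
        centered = center([samples[i] for i in indices])
        for i, sample in zip(indices, centered):
            result[i] = sample
    return result


def group_gap(samples, groups):
    """Mean sample median of group B minus that of group A."""
    level = {
        name: mean(median(samples[i]) for i in indices)
        for name, indices in groups.items()
    }
    return level["B"] - level["A"]


# Hypothetical log2 data (one list per sample); group B sits about
# one log2 unit above group A for every variable.
samples = [
    [10.0, 11.0, 12.0],  # A1
    [10.2, 11.2, 12.2],  # A2
    [11.0, 12.0, 13.0],  # B1
    [11.3, 12.3, 13.3],  # B2
]
groups = {"A": [0, 1], "B": [2, 3]}

gap_global = group_gap(normalize_global(samples), groups)
gap_local = group_gap(normalize_local(samples, groups), groups)
```

In this toy example the global variant removes the offset between the groups, while the local variant leaves it in place. If the offset is technical, the local result carries that bias into the group comparison; if it reflects a real shift in most variables, the global result removes genuine signal.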
Output
The output from Normalyzer is a report with
quantitative and qualitative evaluation measures of the normalization
outcome. The total missing value plot and the total intensity plot
summarize raw data characteristics and together with the MDS plot
can be used to identify outlier samples due to sample degradation
or other reasons. The PCV, PMAD, and PEV plots represent variability
within replicates and help in the selection of normalization methods
based on low intragroup variability. The variability within replicates
indicates whether the replicates are well correlated but does not capture
global alignment. In the stable variables plot, the global variance of
the 5% least differentially expressed (DE) variables is plotted against %PCV relative to the log2 transformed data. This plot helps in the exploration of both inter- and intragroup
variance in the data, for detection of possible bias introduced during
normalization, as normalization should not introduce variation in
these variables.[8] Qualitative plots such
as boxplots, MA plots, dendrograms, correlation plots, meanSD plot,
MDS, and RLE plots explore data from all samples and guide in the
method selection process. The data normalized by different methods
are also exported together with the report and are ready for postnormalization
analysis. Additional documentation including a flow chart for the
decision-making process and a detailed explanation of various methods
can be downloaded from the Normalyzer homepage.
Evaluation of Normalyzer
Features and analytical capabilities of Normalyzer
were evaluated by three different case studies. Normalyzer reports
from the three case studies are in the Supporting
Information.
Case Study 1
The processed data
set contains 36,484 features from a reconstituted yeast proteomics
standard spiked-in with different levels of the Sigma UPS1 equimolar
protein standard. From the Normalyzer report, it is apparent that
there is an almost 3-fold difference in the total intensity between
samples (Figure 2a). This suggests technical
variation in the data set, and thus, normalization of the data set
is necessary prior to quantitative analysis. The Normalyzer report
showed that there was a decrease in PCV by 30–40% in the normalized
data sets compared to un-normalized log2 transformed data
(Figure 2b). Among the global normalization
methods, relative-PCV was lowest (59%) in LOESS-G and VSN-G normalized
data.
Figure 2
Case study 1. Benchmark data generated by shotgun proteomics. (a)
Summed raw intensity from all peptides in each sample. (b) Relative
pooled intragroup coefficient of variation (PCV). For percentage estimation,
PCV in the un-normalized log2 transformed data is considered
as 100%. (c) Mean R² values generated from observed and
theoretical values for the UPS1 peptides in the dilution series. (d)
Receiver operating characteristics (ROC) curves generated from the
UPS1 proteins from differently normalized data sets with one-way ANOVA.
UPS1 proteins were considered true positives, and the background proteins
were considered true negatives.
Out of 36,484 peptides, 304 peptides were from 43 Sigma UPS1
proteins. The UPS1 proteins were spiked in at known absolute concentrations
in a dilution series (0.25, 0.74, 2.2, 6.7, and 20 fmol/μL).
Thus, the spike-in protein set can be used for estimating the correlation
between observed and theoretical log2 transformed peptide
intensities. The mean coefficient of determination (R²)
estimated from the un-normalized log2 transformed data
and the theoretical values was 0.81, while the mean R² of
LOESS-G, RLR-G, and VSN-G was >0.9 (Figure 2c). LOESS-G normalized data had the highest mean R² of
0.92, which supports the results from Normalyzer. Receiver operating
characteristic (ROC) curves (Figure 2d) generated
from the detected UPS1 proteins corroborate the suitability of LOESS-G
normalization for this data set. From the results it is clear that
all normalization methods performed better than just log2 transformation, and also that the choice of normalization method
was of importance for this data set.
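The comparison of observed and theoretical values in the dilution series can be sketched as follows in Python. The concentrations are those given in the text; the observed intensities are invented for illustration, and the exact per-peptide procedure used in the study is not specified, so this simple linear-regression R² is an assumption.

```python
import math

# Spike-in concentrations from the dilution series (fmol/uL).
CONCENTRATIONS = [0.25, 0.74, 2.2, 6.7, 20.0]


def r_squared(x, y):
    """Coefficient of determination of a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


theoretical = [math.log2(c) for c in CONCENTRATIONS]

# Hypothetical mean log2 intensities of one UPS1 peptide at each concentration.
observed = [20.1, 21.5, 23.2, 24.9, 26.3]

r2 = r_squared(theoretical, observed)
```

Averaging such R² values over all spiked-in peptides, once per normalization method, gives a figure of merit comparable in spirit to the mean R² reported above.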
Case Study 2
To
test the applicability of Normalyzer on DNA microarray data, a previously
published Affymetrix microarray data set with 4,156 probe sets[2] was analyzed using Normalyzer. Results based
on PCV suggested VSN-G and VSN-R as the best-performing methods for normalizing
this data set (Figure 3a). However, the meanSD
plot in the report showed that VSN-R normalized data contained bias
introduced during the normalization step, leaving VSN-G as the preferred
normalization method (Figure 3b). As
the data set was generated from spiked-in transcript levels, it was
possible to calculate the ROC curve as an orthogonal evaluation of
the normalization outcome (Figure 3c). The
results from Normalyzer strongly support VSN-G normalization for this
data set, and this is well in line with the orthogonal ROC calculations.
Interestingly, in this analysis, the LOESS normalization method used
in the original paper was not ranked the best, and this highlights
the benefit of evaluating different normalization methods for any
given data set.
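When the true status of each variable is known, as in spike-in data, an ROC curve can be derived from the p-values of a statistical test. The Python sketch below sweeps the p-value threshold and integrates the curve with the trapezoidal rule. The p-values and labels are invented, ties between p-values are not treated specially, and the function names are hypothetical.

```python
def roc_points(p_values, is_spiked):
    """Return (false positive rate, true positive rate) pairs.

    Variables are called significant in order of increasing p-value;
    spiked-in variables count as true positives, the rest as true negatives.
    """
    ranked = sorted(zip(p_values, is_spiked))
    n_pos = sum(is_spiked)
    n_neg = len(is_spiked) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, spiked in ranked:
        if spiked:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical p-values from one normalization method.
p_values = [0.001, 0.004, 0.03, 0.2, 0.5, 0.8]
is_spiked = [True, True, False, True, False, False]

auc = area_under_curve(roc_points(p_values, is_spiked))
```

Repeating this for the p-values obtained after each normalization method gives one curve and one area per method, which can then be compared.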
Figure 3
Case study 2. Benchmark data generated by Affymetrix microarray.
(a) Percent PCV averaged over all groups. For percentage estimation,
variability in un-normalized log2 transformed data is considered
as 100%. (b) MeanSD plot of VSN-G and VSN-R normalized data. (c) ROC
curves generated from the spiked-in probe sets from differently normalized
data sets with one-way ANOVA.
Case Study 3
Finally, Normalyzer was used to select a normalization
method for a shotgun proteomics data set with large intragroup variation
in protein content and signal. Among the three case studies, this
data set shows the highest variation in the total intensity within
replicates (Figure 4a) and missing values (Figure 4b), indicating a clear need for normalization. Overall,
the replicate samples in the normalized data sets had reduced variance
compared to the log2 transformed data, and among the global
normalization methods, LOESS-G, MedI-G, and Quantile normalized data
had the lowest relative-PCV (Figure 4c). Further
analysis of the RLE plots from the Normalyzer report indicates that
samples in LOESS-G normalized data are centered better than those in the MedI-G
and Quantile normalized data sets (Figure 4d).
Thus, for this data set, LOESS-G normalization could be an optimal
normalization method. As there was no a priori information
regarding expected sample protein content, we evaluated the data set
using standard statistical methods for quantitative comparisons. Both
one-way ANOVA (Figure 4e) and Kruskal–Wallis
test (Figure 4f) showed that LOESS-G normalized
data contained a higher number of significantly differentially expressed
peptides compared to un-normalized log2 transformed data.
Indeed, the number of peptides passing the statistical tests as significantly
regulated at a constant false discovery rate varied considerably between
the normalization strategies. This highlights the need for selection
of an appropriate normalization strategy, as downstream processing
will be significantly affected by the choice.
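The centering assessed in the RLE plots can be approximated numerically. In the usual definition of relative log expression, the median of each variable across all samples is subtracted from its log-scale values, and the resulting per-sample distributions are inspected; well-normalized samples are centered near zero. The Python sketch below follows that definition with invented values and hypothetical function names; it is not taken from the Normalyzer code.

```python
from statistics import median


def relative_log_expression(data):
    """Subtract each variable's median across samples from its values.

    data: rows are variables, columns are samples (log2 scale).
    """
    result = []
    for row in data:
        row_median = median(row)
        result.append([v - row_median for v in row])
    return result


def sample_medians(matrix):
    """Median of each column (sample) of a row-major matrix."""
    return [median(column) for column in zip(*matrix)]


# Hypothetical log2 data: the second sample is shifted up by 0.5 and
# the third sample down by 0.5 relative to the first.
data = [
    [10.0, 10.5, 9.5],
    [12.0, 12.5, 11.5],
    [8.0, 8.5, 7.5],
]

rle_medians = sample_medians(relative_log_expression(data))
```

Here the per-sample medians of the RLE values recover the offsets of 0, 0.5, and -0.5, which is the kind of deviation from zero that a boxplot-style RLE plot makes visible.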
Figure 4
Case study 3. Biological
data generated by shotgun proteomics from P. infestans-infected potato leaves. (a) Summed raw intensity from all peptides
in each sample. (b) Summed missing values in samples. (c) Relative
PCV. (d) RLE plots for selected data sets. (e) One-way ANOVA (FDR
< 0.05) and (f) Kruskal–Wallis test for statistical significance
(FDR < 0.05).
Conclusion
In conclusion, the effectiveness of normalization methods is dependent
on the data, and extensive evaluation of different methods is necessary
before choosing a method. Normalyzer is developed to aid in this selection
process. The Normalyzer report is designed to help users narrow down
the normalization methods. As seen in case study 2, normalization
methods could sometimes be prone to overfitting, introducing additional
bias to the data. Thus, while evaluating normalization methods, equal
importance should be given to quantitative and qualitative plots and
also to the existing knowledge on housekeeping genes or protein expression
levels. As the tool is open-source, new normalization methods can
be added, and the tool can be modified further for compatibility with existing
pipelines. It can also be run in parallel with the SPANS[8] tool to further evaluate peptide selection for
normalization.

While the present version of Normalyzer incorporates
normalization methods for log-normally distributed data, the framework
can readily be extended with other normalization methods that are
better suited for count data from RNA-seq experiments. We thus believe
that Normalyzer will guide researchers in selecting the most appropriate
normalization method for their omics data sets.