Sanjeevan Jahagirdar1, Edoardo Saccenti1. 1. Laboratory of Systems and Synthetic Biology, Wageningen University & Research, Stippeneng 4, 6708 WE Wageningen, The Netherlands.
Abstract
Networks and network analyses are fundamental tools of systems biology. Networks are built by inferring pair-wise relationships among biological entities from a large number of samples such that subject-specific information is lost. The possibility of constructing these sample (individual)-specific networks from single molecular profiles might offer new insights in systems and personalized medicine and as a consequence is attracting more and more research interest. In this study, we evaluated and compared LIONESS (Linear Interpolation to Obtain Network Estimates for Single Samples) and ssPCC (single sample network based on Pearson correlation) in the metabolomics context of metabolite-metabolite association networks. We illustrated and explored the characteristics of these two methods on (i) simulated data, (ii) data generated from a dynamic metabolic model to simulate real-life observed metabolite concentration profiles, and (iii) 22 metabolomic data sets and (iv) we applied single sample network inference to a study case pertaining to the investigation of necrotizing soft tissue infections to show how these methods can be applied in metabolomics. We also proposed some adaptations of the methods that can be used for data exploration. Overall, despite some limitations, we found single sample networks to be a promising tool for the analysis of metabolomics data.
Networks and network analyses are fundamental tools of systems biology. Networks are built by inferring pair-wise relationships among biological entities from a large number of samples such that subject-specific information is lost. The possibility of constructing these sample (individual)-specific networks from single molecular profiles might offer new insights in systems and personalized medicine and as a consequence is attracting more and more research interest. In this study, we evaluated and compared LIONESS (Linear Interpolation to Obtain Network Estimates for Single Samples) and ssPCC (single sample network based on Pearson correlation) in the metabolomics context of metabolite-metabolite association networks. We illustrated and explored the characteristics of these two methods on (i) simulated data, (ii) data generated from a dynamic metabolic model to simulate real-life observed metabolite concentration profiles, and (iii) 22 metabolomic data sets and (iv) we applied single sample network inference to a study case pertaining to the investigation of necrotizing soft tissue infections to show how these methods can be applied in metabolomics. We also proposed some adaptations of the methods that can be used for data exploration. Overall, despite some limitations, we found single sample networks to be a promising tool for the analysis of metabolomics data.
Humans
exhibit great phenotypic diversity in both healthy and pathophysiological
conditions as a result of molecular regulatory and metabolic systems
underlying the functioning of living organisms. It is now widely recognized
that phenotypic diversity cannot be understood and characterized by
analyzing single molecular markers such as genes, metabolites, or
proteins alone: what is relevant is the complex web of interactions
underlying the molecular mechanisms maintaining the functioning of
the organism.[1−3]These molecular interactions are well captured
and modeled using
the formalism of network inference and analysis,[4−6] where molecular
entities such as genes, proteins, and metabolites are represented
as nodes and their mutual relationships as edges, which can be different
in nature, representing physical interactions, as in protein–protein
interaction networks, regulation, as in gene regulatory networks,
or similar concentration patterns, as in metabolite–metabolite
association networks.[7]It has been
shown that network-based biomarkers, for example, sub-network
markers,[8] network biomarkers,[9] and edge biomarkers[10] are superior to the traditional single-molecule biomarkers for accurately
characterizing disease states due to their additional information
on interactions and networks.In the quest for personalized
medicine,[11] it is of paramount importance
to elucidate the molecular mechanisms
which underlay the subject-specific response to pathophysiological
stimuli, resulting from the dysfunction of individual-specific networks/systems
rather than just the malfunction of a singular biological entity.
In this light, networks and network analyses have the potential of
being pivotal in personalized medicine if there exists the possibility
of their extension from a population level to the individual-specific
level.However, since several samples are required to define
the associations
(like in the form of correlations[12]) among
molecular elements, such as metabolites or genes, there exists no
straightforward approach to infer an individual-specific network by
profiling metabolite concentrations or gene expression from a single
sample. If such an approach was demonstrated, it would be a very desirable
situation due to the fact that it is rarely possible to obtain multiple
samples from the same subjects, given the necessity of designing complex
and expensive longitudinal studies. On the contrary, a single bio-fluid
sample (such as blood and urine) is usually easy to obtain even in
common clinical practice.There is growing research interest
in the possibility of the construction
of such individual-specific networks by expression profiling of a
single sample and several methods have been proposed.[13−16]Here, we present a comparative review of two methods for single
sample network inference with the aim of evaluating their possible
application to metabolomics data to obtain metabolite–metabolite
single-sample association networks. We focused on LIONESS proposed
by Kuijjer et al.(14) and
ssPCC by Liu et al.(13) We
chose these two approaches among others since they adopt similar albeit
different philosophies and thus are directly comparable, both are
(or may be) based on correlations, and are easy to implement.We analyzed these two methods
for their ability to produce single
sample or sample-specific networks from metabolite concentrations,
and we explored and compared their characteristics on data generated
fromNumerical
simulationsA dynamic
metabolic model22
publicly available metabolomic
data sets.We then applied the two methods
on a study case pertaining to the
metabolomics investigation of necrotizing soft tissue infections (NSTIs)[17] in order to showcase the deployment of single
sample network inference in real-life metabolomics applications. Additionally,
we suggest some potentially new use of single sample networks for
sample exploration and classification.
Material and Methods
Basics
of Networks
A network is a graphical representation
of relationships among objects. A network consists of nodes which
represent biological features (genes, proteins, and metabolites) connected
by links or edges which represent pair-wise relationships between
the biological features.This representation shifts the focus
toward the relationships among biological entities rather than on
their levels; in this light, network and network analysis are fundamental
tools from the systems biology toolbox to investigate and understand
metabolomics data.[12] When the nodes are
metabolites, the networks can be termed metabolite–metabolite
association networks.
Methods for Single Sample Networks Inference
LIONESS:
Linear Interpolation to Obtain Network Estimates for
Single Samples
LIONESS is an approach developed by Kuijjer et al. in the context of gene regulatory networks.[14,18]This approach starts by considering a n × m data matrix X(α) and the
corresponding m × m network E(α) (i.e,. the so-called aggregate network) with edges e between nodes i and j and the network E(α– constructed from the (n –
1) × m data matrix X(α–, that is a matrix with all but the qth sample, which we refer to as the q-sample for
the sake of simplicity. A graphical illustration of the LIONESS procedure
is given in Figure .
Figure 1
Visual illustration of the Lioness estimation of a single sample
based on two aggregate network models, one reconstructed using all
biological samples in a given data set and the other using all except
the sample of interest (q, q-sample).
Figure reproduced and adapted from the original publication,[14] published under Creative Commons license CC
BY-NC-ND 4.0.
Visual illustration of the Lioness estimation of a single sample
based on two aggregate network models, one reconstructed using all
biological samples in a given data set and the other using all except
the sample of interest (q, q-sample).
Figure reproduced and adapted from the original publication,[14] published under Creative Commons license CC
BY-NC-ND 4.0.LIONESS assumes that the aggregate
network E(α) built from n samples is the mean of networks constructed
from every single sample from the data set X(α) containing n samples. This assumption is then extrapolated
to define the edge e(α) in E(α) as the linear combination of the weights of
that edge across a set of n networkswhere w( represents the relative contribution of each single sample
network to the aggregate network andSimilarly, for the network E(α– constructed from all but the q-sample, the edge between eα– is defined aswhereFrom eqs and 3, the authors definedas long as the
assumption holds that every
sample makes an equal proportional contribution to the aggregate networks E(α), which makes w(α) constant.Combining eqs and 3 and solving for the edge e( for the q-sample gives the general
LIONESS equationwhich defines the edge between node i and j of the single sample network for
the q-sample. The term 1/w(α) gives the weight of each sample, and can be set to n if all samples are given the same weight, obtainingwhich
will be used throughout this study.
In matrix notation, eq becomesThe LIONESS equation does not depend on the particular methods
used to infer the networks which can be estimated with any approach
or different association measures; the most common approach is to
use correlation, but mutual information can also be used. However,
the authors reported sub-optimal performance when using mutual information
to measure associations and single sample edges. In a related study,
we also reported the sub-optimal performance of mutual information
for metabolite association network estimation.[19] For this reason, we will focus on Pearson’s correlation
also because this will allow direct comparison with the ssPCC method
(see the section ssPCC: Single Sample Network
Based on Pearson’s Correlation). In this case, the LIONESS eq becomeswhere r is the Pearson correlation
between variable (metabolite) i and j. A summary of the notation used
is given in Table .
Table 1
Summary of the Notation Used in the
Paper to Define the Lioness and ssPCC Edges
original notation
definition
alternative
notation
correlation notation
definition
ssPCC
network built using all samples in the
reference data set
PCCn
rij(n)
aggregated network built using all samples in the reference
plus the q-sample
PCCn+1
rij(n+q)
single sample
network for the q sample
ΔPCCn
PCCn+1 – PCCn
rij(q)
rij(n+q) – rij(q)
Lioness
aggregated network built using all
samples
eij(α)
eij(n)
rij(α)
network built using all samples but the q-sample
eij(α–q)
eij(n–q)
rij(α–q)
single sample network for the q-sample
eij(q)
n(eij(α) – eij(α–q)) + eij(α–q)
rij(q)
n(rij(α) – rij(α–q)) + rij(α–q)
Choice of the Aggregate Network
The LIONESS algorithm
outputs a single sample network for each sample in a given data set
given an aggregate network. If the data set contains n1 samples from group 1 and n2 samples from group 2, there is the legitimate question of whether
to build the aggregate network using all n1 + n2 samples or to build two different
separate reference networks, one for group 1 and one for group 2.
In the original paper,[14] the authors investigate
the use of non-homogeneous background (page 13 of the Supporting Information[14]) and reported minimal differences. However,
in this study we explored both implementations that we dubbed, for
convenience, as LIONESS single (LIONESS-S) and LIONESS double (LIONESS-D)LIONESS-S
Consider all samples to build
the aggregated network and build single sample networks referring
to the pool of all samples, orLIONESS-D Consider two different aggregate
networks E1(α) and E2(α) from the two group samples and use
them to build two sets of single sample networks, one for group 1
and one for group 2.
ssPCC: Single Sample Network
Based on Pearson’s Correlation
The single sample network
based on Pearson’s correlation
(which we abbreviate as ssPCC) was proposed by Liu et al.(13) for building sample specific networks
in the context of gene regulatory networks for disease characterization.
As such, it relies on the availability of a n × mX set of reference
or control samples to contrast a set of case (possibly disease, in
general from a different condition) q-samples. The
ssPCC aims to define the single sample network specific to the q-sample(s).Using the same notation used in the original
publication, the single specific network for the q-samples obtained using as reference the n samples
in X is given bywhere PCC is
the Pearson’s correlation matrix[20] calculated from the reference set X, and PCC is the correlation
matrix calculated from the (n + 1) × m set made of X + the q-sample. The PCC is referred to as the “Reference network”, while PCC is referred to as the “Perturbed
network”. Thus, the single sample network for the q-sample is considered to be the perturbation of the correlation of
the m variables in X caused by the addition of the q-sample which
comes from a different population. A graphical illustration of the
ssPCC procedure is given in Figure .
Figure 2
Visual illustration of the ssPCC estimation of single
sample network.
(A) For a group of n reference samples, a reference
network PCC can be constructed using
Pearson’s correlations. (B) A new sample q is added and the perturbed network PCC with this additional sample is calculated. The difference between
PCC and PCC is due to sample q. (C) Differential network ΔPCC is constructed taking the difference between
perturbed and the reference network PCCn + 1 – PCC.
Figure reproduced and adapted from the original publication,[13] published under Creative Commons license CC
BY-NC-ND.
Visual illustration of the ssPCC estimation of single
sample network.
(A) For a group of n reference samples, a reference
network PCC can be constructed using
Pearson’s correlations. (B) A new sample q is added and the perturbed network PCC with this additional sample is calculated. The difference between
PCC and PCC is due to sample q. (C) Differential network ΔPCC is constructed taking the difference between
perturbed and the reference network PCCn + 1 – PCC.
Figure reproduced and adapted from the original publication,[13] published under Creative Commons license CC
BY-NC-ND.Using an edge notation similar
to the one used for LIONESS, the
single sample network for the q-sample can be rewritten
aswhere the superscript n + q indicates the addition of the sample q to the n samples of the reference matrix X and r indicates Pearson’s sample correlation
between variables i and j. A summary
of the notation used is given in Table .The authors proposed to assess the significance
of an edge in the
single sample network by means of permutation but found that the procedure
could be conveniently substituted with a Z-test which
is much faster and gives equivalent results (see the section in the
original paper[13]). They propose the following Z statistic testwhich, considering eq , can be conveniently
rewritten using the
edge notation in terms of the sample correlation between variable i and jThe Z-statistic
is then confronted with the critical
values of a standard normal distribution to assess significance.
ssPCC
for a two Group Case
The ssPCC algorithm outputs
single sample networks only for the case group and not for the reference
group. This setting does not allow, per se, to build single sample
networks for all samples (i.e., case and reference
samples) as in LIONESS. We attempted to bypass this limitation by
building single sample networks also for the reference data set by
contrasting each sample in the reference data set against the remaining
samples, that is, considering each reference sample as a q-sample.
Data Simulations
Numerical Simulations
Simulation
Scheme 1
We simulate n ×
2 reference data set X by
sampling from a bivariate normal distributionwith population μ0 = (0,
0) andwhere ρ0 is the population
(expected) value of r( (i.e., of PCC) from ssPCC eqs and 10.We let ρ0 vary over the values −0.9,
−0.7, −0.5, −0.3, 0, 0.3, 0.5, 0.7, and 0.9 to
define 9 different reference covariance/correlation structures. The q-sample for which the single sample network is sought using
ssPCC, that is, the sample to be added to X to calculate PCC is
also drawn from a bivariate normal distribution withwith population
mean μ = (0, 0) and correlation
matrixWe let ρ vary in the range [0,
+1] in increments of 0.1 to define 11 different covariance/correlation
structures for variables x and y. We take the differenceas a measure of the perturbation
effect when
a sample q is added to X to estimate r( (i.e., PCC): when ρ =
ρ0, the q-samples and the reference
samples come from the same distribution, which implies that there
is no perturbation, hence the expected value of r( and r( is the same. As ρ increases with respect to ρ0, the perturbation
increases, and in consequence ΔPCC also increases.
Simulation Scheme 2
This simulation
is similar to the Simulation Scheme 1. The only difference is
thatthat is
the q-samples come
from a population with both different mean and correlation structure.
Simulation Scheme 3
We generated m × m (with m = 20) random
correlation matrices Σ (with elements
ρ ≥ 0 and ρ ≠ ρ for all possible variable pairs) satisfying
the propertyThis was achieved using the vine method[21,22] by sampling from a beta distribution Beta(α,β). The
variance σ2 of the beta distribution was set to 0.1,
and the mean μ was numerically optimized to have the sampled
data obtain the required average correlation ρ equal to 0.1
to 0.9 in steps of 0.1, within a 5% precision. The mean μ and
variance σ2 link the α and β parameters
by the relationshipsThe optimized μopt values
(0.113, 0.116, 0.123,
0.135, 0.163, 0.201, 0.262, and 0.382) were used to calculate the
α and β shape parameters to be passed to the algorithm.The correlation matrices Σ were
used to simulate multivariate normally distributed data N(μ,Σ) to be
used as a reference and as q-samples. In particular,
data with an average correlation of 0.6 and 0.9 were used as reference
data sets and data with an average correlation in the range 0.1–0.9
were used as q-samples.
Multivariate
Data Simulation
The covariance/correlation
matrices defined in the Simulation Schemes 1–3 were used to
generate normally distributed multivariate data using the Matlab mvnrnd
function.
Data Simulation Using a Dynamic Metabolic Model
To
generate metabolites with plausible concentration patterns as observed
in metabolomics data, we used a dynamic metabolic model, as described
in ref (19). For the
sake of completeness, we report here the full simulation strategy.The model describes the activation of NF-κβ complex
(nuclear factor kappa-light-chain-enhancer of activated B cells) and
the corresponding response of the intracellular signaling pathway
when exposed to lipo-polysaccharide that activates an inflammatory
response. It consists of 59 ordinary differential equations recounting
the reactions involving 35 metabolites. The model was obtained from
the BioModels database[23] (www.ebi.ac.uk/biomodels/) with accession number BIOMD0000000489. Full details on the model
building and accessory files can be found in the original publication.[24]The model dynamics were constructed and
simulated using ordinary
differential equations representing 3 types of reactions, namelyReversible reactions
using mass action
lawIrreversible reactions
using mass
action lawEnzymatic
reactions using Michaelis–Menten
kinetics.We give three reactions used
in the model to showcase examples
of ordinary differential equations corresponding to each type of reaction
and the kinetics involved.where IRF3 (Interferon Regulatory
Factor 3),
IKK(IkB kinase), and TAK1:TAB1:TAB2:TRAF6 (complex of mitogen-activated
protein kinase kinases) are three of the metabolites/compounds utilized
in the NF-κβ activation and corresponding signaling pathway,
[P] represents the addition of phosphoryl group via the process of
phosphorylation, [metabolite] represents the concentration of the
metabolites, and i is the reaction number. For detailed description
and information on all metabolites and reactions involved in the NF-κβ
model, we refer the reader to the original publication and its Supporting
Information.[24]Subject-specific concentration
profiles were obtained by varying
the kinetic constants Km, k, kf, and kr for all of the 59 reactions and the initial concentrations c for the 4 metabolites with
non-null initial concentrations in order to generate subject-specific
profiles. All these constants, Km, k, kf, kr, and c were varied between bounds (a, b) ± 10% of the original values (see eq and following) presented in the original
publication. This was achieved by sampling from a uniform distribution U(a,b) to obtain values
for each subject. For the j-th individual, the values kf, kr, k, km, and c for the i-th reaction were defined asThe rationale
of this approach is that models with different parameters
will produce different metabolite profiles, like those observed when
sampling different subjects in real-life metabolomics experiments.Using this approach, we generated 500 individual profiles from
which we built data sets of different sizes by random sampling.To mimic different conditions, we introduced perturbations to the
model by manipulating the kinetic constants in the following mannerhere ϵ
is used as a scaling parameter,
the same for all reactions. The value of ϵ was varied over the
values 1/10, 1/5, 1/3, 1/2, 1/1.5, 1, 1.5, 2, 3, 5, and 10 which were
used to create subject-specific profiles in a similar manner as described
above.Using the same ϵ for all reactions allows us to
investigate
the performance of the SSN method as a function of the perturbation.
From the pool of 500 samples, we randomly sampled (with replacement)
subsets of different sizes (n = 10, 25, 50, 100,
250, and 500).
Power of the ssPCC Test
We investigated
the actual
power of the test (i.e., the probability of rejecting
the null hypothesis H0 when actually false)
by means of the Simulation Schemes 1 and 2.For each combination
of Σ and Σ, we generated a n × 2 reference data
set X from which we calculated
PCC and k q-samples
to obtain k values of ΔPCC which were tested for significance at the 0.05 level and recorded
how many times H0 was correctly rejected
to calculate the actual power of the test. In our Simulation Scheme 1, the Null hypothesis H0 is always false, except when Σ = Σ. The overall procedure was repeated 1000 times, and the results
were averaged over the repetitions. The actual power was calculated
for n = 25, 250, and 25,000.The relative frequency
of the rejection of H0 when Σ = Σ (i.e.,H0 true) is the actual α level
of test, that is, the actual false positive rate (Type I error).
Principal Component Analysis
We explored differences
among single sample networks with principal component analysis (PCA).
Each single sample matrix was vectorized and PCA was applied on the
edges to investigate the patterns of similarity/difference among the
networks. Networks were vectorized by taking only the upper diagonal
part of the network (given the symmetry) so that only edges are considered
instead of m2. Every m × m single sample network was then collapsed
to a vector, and the different networks were
then collected in a matrix form suitable for PCA.
Random Forest
Prediction Models
Random Forest[25] was used to build classification models to explore
the use of single sample network edges in a prediction context, that
is, to explore whether the use of the edge weights between pair of
metabolites possesses higher predictive power than the original metabolite
concentrations. We focused on two-group scenarios, which are also
the most commonly encountered in metabolomics applications, applying
this approach to several public metabolomics data sets.We built
single sample networks using both LIONESS implementations (Single
and Double, see the section Choice of the Aggregate
Network) and ssPCC (see the section ssPCC for a Two Group Case).The single sample networks
were processed for Random Forest as
described in the case of PCA (see the section Principal
Component Analysis).We used the standard Breiman’s
Random Forest implementation
which uses the Gini impurity as loss.[26] We set the number of trees to 1000 and used the default value of (where p is the number
of variables) for the “mtry” parameter. We used a 2/3
+ 1/3 data split (training + validation) to obtain an unbiased estimation
of the classification. We took into account data unbalance using the
“strata” option. Each model fitting was repeated 100
times to take into account the variability due to the resampling step
used by the RF algorithm to randomly select the same number of subjects
from each group and so to build the model on balanced data. The resampling
was nested within the cross-validation step used to assess the quality
of the prediction models. All results are given as the arithmetic
mean over 100 iterations.
Pathway Enrichment Analysis
Pathway
enrichment analysis
was performed using the built-in function available in the MetaboAnalyst
4.0[27] online server (www.metaboanalyst.ca) using
the hypergeometric test. The Benjamini–Hochberg method was
used for the false discovery rate (fdr) correction.[28] We considered significantly enriched those pathways with
fdr < 0.01.
Experimental Data
Metabolomic Study Case
As a study case, we considered
a data set from a metabolomic investigation of NSTIs. The data set
consists of plasma metabolite profiles acquired via GC–MS on
34 NSTI patients enrolled in the INFECT project (Clinicaltrials.gov, NCT01790698).
In addition, 24 patients with no known infections were included as
controls.The patients had NSTI of different microbial etiology
and were classified into polymicrobial and monomicrobial NSTI.This data is available at the NIH Common Fund’s National
Metabolomics Data Repository (NMDR) website, the Metabolomics Workbench
(www.metabolomicsworkbench.org) with Project ID ST00127 where it can be accessed via the Project
DOI (DOI: 10.21228/M85H5H). We refer to the original publication[17] for more details on the study design, sample
collection, and processing, GC–MS experiments, and clinical
information.
Compendium of Publicly Available Metabolomics
Data
To explore further the characteristics of the two methods
for inference
of single sample networks, we used a compendium of 22 data sets that
we compiled for a previous study.[19]Ten of these data sets were derived from the compendium assembled
by Mendez et al.(29) including
studies representative of three of the most frequent metabolomic experimental
platforms (nuclear magnetic resonance, NMR; gas chromatography mass
spectrometry, GC–MS; and liquid chromatography mass spectrometry,
LC–MS) concerning the metabolomic profiling of varying biofluids
(serum, plasma, urine, and feces). All studies follow a two-group
design (either as primary design or secondary result of the publication
or as a subset of a multi-class study) and have varying sample sizes
and number of metabolites (variables) present. We have made use of
the processed and cleaned data available in Mendez et al.(29) to which we refer the reader for more
detailed information on the processing and cleaning of data sets.
Metabolites having missing values were either deleted (e.g., dataset MTBLS136) or imputed using the Random Forest procedure
implemented in the R package missForest[30] (e.g., data set ST001047). We also included other
data sets derived from tissue (fat), plant, and fruit extracts along
with microbiome data (16S sequencing) and other chemical-based assays
on various fluids such as coffee, wine, and oil and finally two transcriptomic
data sets. Relevant references and attributes for all data sets are
shown in Table .
Table 3
Random
Forest Classifications of 22
Case-Control Metabolomics Data Sets Using Metabolite Concentrations
and Single Sample Network Edges as Described in the Section Random
Forest Prediction Modelsa
Abbreviations:
CD: Crohn’s
disease. UCL=: Ulcerative colitis. Assay stands for chemical assay.
The % of edges associated with differential
expressed metabolites is shown together with the average ssPCC edges
and with the average correlation of the reference network calculated
on 10 case-control metabolomics data sets. More information on the
data can be found in Table .Abbreviations:
CD: Crohn’s
disease. UCL=: Ulcerative colitis. Assay stands for chemical assay.
Software
Calculations
were performed using R,[31] Matlab,[32] and Python.[33] Our
R implementations for LIONESS and ssPCC
are available at www.systemsbiology.nl under the software tab. Original R package for LIONESS by Kuijjer et al.(18) can be also obtained
at github.com/kuijjerlab/lionessR and bioconductor.org/packages/lionessR.
Results and Discussion
We begin
by noticing that LIONESS and ssPCC are not context or
data dependent or depending on how the networks are inferred. Both
methods have been originally applied to gene regulatory networks but
the statistical framework is totally general; they are both based
on “manipulation” of correlations but the way the correlations
are calculated and manipulated is totally independent of their origin.
There is nothing in how the methods are formulated that is specifically
depending on or descending from the correlations originating from
gene regulation patterns. The two frameworks are fully generalizable
to different biological contexts and applications. Here, we explore
their applicability to metabolite–metabolite correlation networks;
different approaches can be used to calculate the reference networks
depending on the applications, but the way the single sample networks
are obtained does not depend on the application.We investigated the actual
power of the ssPCC test (i.e., the probability of
rejecting a null hypothesis when actually false) using the Simulation
Scheme 1 described in the Material and Methods section.As shown in Figure we observed limited power when the reference samples
and the q-samples are drawn from a multivariate normal
distribution with different population covariance–correlation
matrices but with the same population means (Simulation 1). The results
shown are for n = 250.
Figure 3
Actual power of the ssPCC
proposed testing procedure based on a Z-test (see
the section ssPCC: Single
Sample Network Based on Pearson’s Correlation and eq ) as a function of the
effect size defined in the section Power of the
ssPCC Test using the Simulation Scheme 1. The nine panels (A–I)
correspond to the different correlation ρ0 of the
ssPCC reference network (PCC, see eq ): the values are −0.9,
−0.7, −0.5, −0.3, 0, 0.3, 0.5, 0.7, and 0.9,
respectively. For each value of ρ0, we let ρ (see eq ) to vary in the range [0:1] with increments of 0.1.
The effect size is defined as ρ – ρ0, so that for each ρ0 there are 11 effect size values, indicated on the y-axis of the bar plots. Results are shown for sample size n = 1. The actual power is calculated over 1000 tests. The
vertical dashed red lines indicate the 5% level.
Actual power of the ssPCC
proposed testing procedure based on a Z-test (see
the section ssPCC: Single
Sample Network Based on Pearson’s Correlation and eq ) as a function of the
effect size defined in the section Power of the
ssPCC Test using the Simulation Scheme 1. The nine panels (A–I)
correspond to the different correlation ρ0 of the
ssPCC reference network (PCC, see eq ): the values are −0.9,
−0.7, −0.5, −0.3, 0, 0.3, 0.5, 0.7, and 0.9,
respectively. For each value of ρ0, we let ρ (see eq ) to vary in the range [0:1] with increments of 0.1.
The effect size is defined as ρ – ρ0, so that for each ρ0 there are 11 effect size values, indicated on the y-axis of the bar plots. Results are shown for sample size n = 1. The actual power is calculated over 1000 tests. The
vertical dashed red lines indicate the 5% level.The power of the test is heavily dependent on the correlation among x and y in the reference set, that is,
on PCC; the maximum power achieved is
slightly higher than 60% when r( = 0.9(PCC) (panel A) and reaches
its minimum when r( is below 0.5
(panels E–G).Unfortunately, this is the most common
case in metabolomics studies:
metabolomics correlations are mostly positive and smaller than 0.6
as also shown in Table where the distribution characteristics of correlation values (that
is, r(, PCC are given for 10 case-control metabolomics data
sets. Thus the most common situation observed in real-life metabolomics
studies is the situation in which the test has less power to detect
differences between the q-sample and the reference
set.
Table 2
Association Between Metabolite Differential
Expression and Single Sample Edges Found to Be Significant with the
Proposed ssPCC Testa
% SSN edges
associated with DE metabolites
average
ssPCC SSN edge
correlation
in reference Data
data set
mean
95%
CI
mean
95%
CI
mean
95%
CI
1
MTBLS136
56.1
39.6
77.8
0.01
0.01
0.01
0.03
–0.17
0.43
2
MTBLS161
82.3
42.6
98.3
0.02
0.00
0.12
0.35
–0.38
0.79
3
MTBLS404
67.9
0.0
97.3
0.01
0.00
0.05
0.30
–0.05
0.73
4
MTBLS547
76.3
0.0
100.0
0.01
0.00
0.07
0.29
–0.22
0.84
5
MTBLS90
84.7
69.3
97.8
0.01
0.00
0.01
0.06
–0.18
0.62
6
MTBLS92
85.0
65.3
98.3
0.01
0.00
0.03
0.21
–0.09
0.80
7
ST000369
39.5
14.6
68.1
0.11
0.08
0.17
0.03
–0.31
0.50
8
ST000369
54.1
32.0
77.8
0.02
0.01
0.08
0.03
–0.29
0.45
9
ST001000
75.3
0.0
98.2
0.02
0.00
0.09
0.14
–0.28
0.95
10
ST001047
75.3
0.0
98.2
0.02
0.00
0.09
0.14
–0.28
0.95
The % of edges associated with differential
expressed metabolites is shown together with the average ssPCC edges
and with the average correlation of the reference network calculated
on 10 case-control metabolomics data sets. More information on the
data can be found in Table .
The power of the test depends of course on n and
should tend to 1 as n → ∞. However,
we did not observe any strong dependence on n: for n = 25, the maximum power was ≈64% which increased
to ≈66% for n = 25,000 when PCC = 0.9, which indicates that the size of the reference
data set has little influence on the actual power of the test.The limited power of the test under Simulation
Scheme 1 can be understood by considering eq in the original publication.The authors
derived an interesting relationship (in the case n ≫ 1) linking the ΔPCC,
that is, r( and the difference
of the level of variable i and j measured on the q-sample with respect to the average
level of the same variables in the reference datawhereSeveral interesting observations
can be derived from eq :ΔPCC → 0 if Δx and Δy are zero, that is if the q sample is from a population
with the same average level (μ =
μ) of the reference population
as in Simulation 1.PCC does
not appear in the (re)definition of ΔPCC: only the difference in the levels of X and Y with respect to the correlation of X and Y in the reference define ΔPCC.This explains the very limited
power observed in Figure . If the reference samples
and the q-samples are drawn from a multivariate distribution
with the same population, it means that both Δx and Δy tend to 0 and ΔPCC tends to zero even if PCC ≠ 0.which does not
depend explicitly on n, since PCC is fixed a priori,
this explains why increasing the dimensionality of the reference data
set X has little influence
on the power of the test.orIf the number of samples in the reference
data set n is very large, adding the q sample has practically no influence: it is not the perturbation
of the correlation (i.e., PCC – PCC) that it is tested,
but some function of the differences of X and Y with respect to the average values in reference data.
This observation is supported by empirical evidence shown in Table where SSN networks
were built for 10 case-control metabolomics data sets: the vast majority
of significant SSN edges are associated with metabolites whose concentrations
are significantly different between the two conditions.If we plug eq in eq , we obtain the following expression for the Z-statisticΔPCC can be
different from zero also when PCC is
zero, that is when the reference samples and the q-samples are from populations with the same covariance–correlation
structure; this happens when Δx, Δy ≠ 0. This explains the slightly inflated Type I
error observed in Figure .
Figure 4
Actual
Type I error (α) of the ssPCC proposed testing procedure
based on a Z-test (see the section ssPCC: Single Sample Network Based on Pearson’s Correlation and eq ) as a function
of the effect size defined in the section Power
of the ssPCC Test. The nominal 5% level is indicated by the
vertical dashed red line. The six panels (A–F) correspond to
the different sample sizes used, which are 10, 50, 100, 500, 1000,
and 10,000, respectively. The actual α is calculated over 1000
tests.
ΔPCC can be zero even if Δx, Δy, and PCC are all different from zero.
This can happen, for instance, whenActual
Type I error (α) of the ssPCC proposed testing procedure
based on a Z-test (see the section ssPCC: Single Sample Network Based on Pearson’s Correlation and eq ) as a function
of the effect size defined in the section Power
of the ssPCC Test. The nominal 5% level is indicated by the
vertical dashed red line. The six panels (A–F) correspond to
the different sample sizes used, which are 10, 50, 100, 500, 1000,
and 10,000, respectively. The actual α is calculated over 1000
tests.
Relationship between LIONESS
and ssPCC Single Sample Networks
If a correlation is used
as a measure of variable association,
the two methods are functionally related, and from the definition
of the ssPCC and LIONESS edges, it follows that the latter can be
written as a function of the ssPCC edges. In particular, when the
data set X(α) used to build the aggregated
network in LIONESS is equivalent to the reference data set used in
ssPCC, there is an almost perfect linear relationship between the
edges of the q-sample network estimated using LIONESS
and using ssPCC, as can be seen in Figure panel A. This relationship deteriorates
when X(α) is not equal to X, that is when there is more than one
sample belonging to a group different from the reference (Figure panel B). Note the
different scale of the edge weights: for ssPCC, the edges are bounded
between −2 and +2, being defined as the difference between
two correlations; and for LIONESS, if correlations are used, the edges
are bounded between 1 – 2n and 2n – 1.
Figure 5
Comparison of the Lioness and ssPCC single sample edges.
(A) Scatter
plot of the edges for a q-sample obtained when the
data set X(α) used to build the aggregated
network in Lioness is equivalent to the reference data set used in
ssPCC. (B) Scatter plot of the edges for a q-sample
obtained when X(α) is not equal to X, i.e., when
there is more than one sample belonging to a group different from
the reference. Note the different scale of the edge weights.
Comparison of the Lioness and ssPCC single sample edges.
(A) Scatter
plot of the edges for a q-sample obtained when the
data set X(α) used to build the aggregated
network in Lioness is equivalent to the reference data set used in
ssPCC. (B) Scatter plot of the edges for a q-sample
obtained when X(α) is not equal to X, i.e., when
there is more than one sample belonging to a group different from
the reference. Note the different scale of the edge weights.A relationship similar to eq can be derived also for LIONESS. It is enough
to note
that r( is actually the
correlation calculated using all samples (thus including the q-samples i.e., r(α)) in LIONESS, and PCC is the correlation
calculated using all but the q-sample, that is r(α–. The LIONESS
edge can be rewritten asThis
also establishes that the edges of a single sample network
estimated with LIONESS are a function of both the perturbation of
the correlation and the difference between the levels of the q-sample and the mean of the remaining n – 1 samples.
Comparison of ssPCC and LIONESS on Simulated
Data
Simulated Data with Different Levels of Correlation
The first simulation entails the case of the reference data (m = 20 variables) with zero mean and average correlation
equal to 0.6; this value was chosen because most parts of the observed
metabolomic correlations are smaller than 0.6.[34] The q-sample comes from data with an average
correlation between 0 and 0.9 and a mean equal to 0 or 10. The results
are shown in Figure , panels A–D. As can be seen, there is no obvious separation
of the single sample networks as a function of the effect size; although,
a slight separation appears when the mean of the reference and q-sample differ, especially between single sample networks
from a sample with extreme average correlation (i.e., 0 and 0.9).
Figure 6
Scatter plot from PCA of the single sample networks obtained
with
LIONESS-S (left column) and ssPCC (right column). Panels (A–D)
refer to simulation with reference data (20 variables) with average
correlation equal to 0.6, while the q-sample is from
a population with correlation between 0 and 0.9 and mean equal to
0 (panels A,B) or 10 (panels C,D). Panels (E–H) refer to simulation
with reference data with average correlation 0.9, while the q-sample is from a population with correlation between 0
and 0.9 and mean equal to 0 [i.e., data from simulation
1, panels (E,F)] or 10 [i.e., data from simulation
2, panels (G,H)]. Each point in the PCA plot is a vectorized version
of the single sample networks, color-coded according to the population
correlation from which the q-samples are sampled.
Scatter plot from PCA of the single sample networks obtained
with
LIONESS-S (left column) and ssPCC (right column). Panels (A–D)
refer to simulation with reference data (20 variables) with average
correlation equal to 0.6, while the q-sample is from
a population with correlation between 0 and 0.9 and mean equal to
0 (panels A,B) or 10 (panels C,D). Panels (E–H) refer to simulation
with reference data with average correlation 0.9, while the q-sample is from a population with correlation between 0
and 0.9 and mean equal to 0 [i.e., data from simulation
1, panels (E,F)] or 10 [i.e., data from simulation
2, panels (G,H)]. Each point in the PCA plot is a vectorized version
of the single sample networks, color-coded according to the population
correlation from which the q-samples are sampled.When the same exercise is repeated with reference
data with an
average correlation equal to 0.9 (Figure , panel E–H), the results are similar
with a slight separation emerging only when the reference samples
and the q-samples come from populations with different
means.
Comparison on NF-κβ Dynamic Model Data
We compared the LIONESS and ssPCC on data simulated from a dynamic
model for the NF-κβ pathway where model parameters were
manipulated to introduce increasing levels of perturbation with respect
to the original unperturbed model; this was accomplished by multiplying/dividing
the kinetic parameter (see eqs with ϵ = 1, 1.5, 2, 3, 5, and 10). Results are shown
in Figure ; n = 50 samples were considered for each configuration.
Figure 7
Scatter
plot from PCA of the single sample networks obtained with
LIONESS-S (left column) and ssPCC (right column) on data simulated
using a dynamic model of the NF-kB metabolic pathway. Panels (A,B)
give results on data obtained by multiplying the kinetic parameters
by a factor ϵ = 1, 1.5, 2, 3, 5, and 10; see eq . Panels (C,D) give results on
data obtained by multiplying the kinetic parameters by a factor ϵ
= 1/10, 1/5, 1/3, 1/2, and 1/1.5). Each point in the PCA plot is a
vectorized version of the single sample network, color-coded by the
magnitude of the perturbation. For each ϵ configuration, n = 50 samples were considered.
Scatter
plot from PCA of the single sample networks obtained with
LIONESS-S (left column) and ssPCC (right column) on data simulated
using a dynamic model of the NF-kB metabolic pathway. Panels (A,B)
give results on data obtained by multiplying the kinetic parameters
by a factor ϵ = 1, 1.5, 2, 3, 5, and 10; see eq . Panels (C,D) give results on
data obtained by multiplying the kinetic parameters by a factor ϵ
= 1/10, 1/5, 1/3, 1/2, and 1/1.5). Each point in the PCA plot is a
vectorized version of the single sample network, color-coded by the
magnitude of the perturbation. For each ϵ configuration, n = 50 samples were considered.In the case of the multiplicative perturbation of the model, networks
tend to cluster according to the level of perturbation with highly
perturbed networks clustering away from those corresponding to low
perturbation. This is particularly evident for ssPCC-derived networks
with very clear separation among the clusters (Figure panel B); however, the separation among
LIONESS-based networks is much less evident (panel A).For data
from the perturbed model obtained by dividing the kinetic
constants (see eq with ϵ = 1/10, 1/5, 1/3, 1/2, and 1/1.5), the single sample
networks obtained with LIONESS are not resolved, and it is not possible
to distinguish among the different groups corresponding to the different
perturbation levels (Figure , panel C). On the contrary, the networks obtained with ssPCC
are very well resolved, and clear differences appear among the groups
(Figure , panel D).
Use of Single Sample Edges for Group Prediction and Classification
We explored the potential of single sample network edges for classification
purposes by replacing actual observed metabolite concentrations with
the pairwise edges, the rationale being that, as follows from eqs and 32, single-sample edges are a function of both difference in
correlation and in level and thus, in some case, can bear more information
than level alone.To this scope, we compared the accuracy of
Random Forest classification models on 25 publicly available data
sets (see the section Compendium of Publicly
Available Metabolomics Data). For each data set, we built four
different Random Forest classification models using:Original concentration/abundance
profiles.The edges
of the single sample networks
built using ssPCC.The
edges of the single sample networks
built using LIONESS and all samples to build the aggregate network
(LIONESS-D).The edges
of the single sample networks
built using LIONESS and only group-specific samples to build the aggregate
network (LIONESS-S).In total, we have
three ways to build single sample network edges
for a two-class problem to be used for classification purposes.The results are given in Table . Under the assumption that single sample edges hold
more information content than the concentration profiles, we expected
the Random Forest models built on edges to have higher accuracy than
those built on the original concentration values.We observed
that in general, classification accuracy is already
very high when using the original values. However, in 13 cases out
of 22, the use of ssPCC edges gives better (or equal) classification
accuracy.For some data sets, the use of single-sample edges
in place of
concentrations resulted in better discrimination. For instance, for
data 13 the accuracy increases from 89.9 to 100% and better discrimination
between cases and controls is also obtained for data sets 1, 3, 4,
5, 12, 15, 16, 17, 18, 20, and 22. In other cases, a reduction in
accuracy can be observed.When using the LIONESS-D approach
(akin to the strategy we devised
for ssPCC), the accuracy was better only in 5 cases out of 22, while
when using the LIONESS-S accuracy was better only in one case (data
set 19): in all other cases the accuracy was remarkably lower.When Random Forest was run on edges from LIONESS-S, discrimination
accuracy was lower with the sole exception of data set 19. This can
be explained by the process of separating case and control data before
running the LIONESS algorithm. In this format, the matrix E is built as a function of E(α) as shown in eq ; however, here all samples in E(α) come from the same classification group causing
the variation between single samples to be much lower due to the lower
difference in Eα and E(α– as compared to that
in LIONESS-D.There is a large difference depending on how the
aggregate network
is built. In the original publication,[14] the authors discussed the problem of how to build the aggregate
network when in the presence of samples from non-homogeneous populations
(basically considering the two approaches used here). They found minimal
difference (verbatim) in the accuracy of the reconstruction of the
single sample networks but did not explore the edge reconstruction
in a discriminant/classification setting.We can also comment
that calculating correlations from a non-homogeneous
population is not statistically a sound strategy since sample correlation
must be calculated from samples drawn from the same population. It
is simple to show with simulations that if half of the samples come
from a population with correlation ρ = 1 and half from a population
with ρ = 0, the sample correlation will be around 0.5 and this
will lower the single sample edge.Classification models built
from ssPCC and LIONESS-D edges often
yielded similar albeit lower accuracy. This is not too unexpected
as when LIONESS is expressed as a function of ssPCC, as shown in eq , it descends that the
LIONESS single sample edge also depends on Δx and Δx which are the deviations of the q-sample from the average of the reference data. In the LIONESS-D
approach, the deviations Δx and Δx are calculated from each q-sample from data sets
that are homogeneous to the q-sample and thus can
be expected to be small, and this also lowers the edges’ values.
This is not the case for all ssPCC edges; in our modification, the
deviations for q-samples in the control group are
estimated from samples homogeneous to the q-samples,
but for the control group, the q-samples are not
homogeneous to the reference (which is made from control samples)
and this makes the edges larger. This can explain the markedly different
behavior of the different single sample edges when used for discrimination
between two groups.We shall conclude by remarking that in a
classification setting
the use of single sample edges as derived from ssPCC and LIONESS-D
can be used only in an exploratory or confirmatory setting but not
to predict new, unknown, q-samples; the network must
be constructed by contrasting the unknown sample either with the case
or the control group without knowing to which group the q-sample actually belongs. Unknown q-samples can
be predicted using the edges from LIONESS when the aggregates network
is constructed using all samples from both groups simultaneously.
Metabolomics Case Study: NSTIs
In order to delve deeper
into the characteristics of the two methods and to investigate whether
(possibly) new biological information can be gained from the use of
single sample network, we analyzed in detail metabolomics data concerning
metabolite plasma profiles collected from patients suffering from
NSTIs, fast-spreading, aggressive bacterial infections associated
with high morbidity and mortality.[35,36] The study
comprised 34 NSTI patients and 24 surgery patients with no known infection
or morbidity acting as controls. This data has been previously analyzed
using standard statistical (univariate) approaches and differential
correlation analysis.[17]
Single Sample Network Analysis
We began by building
the aggregate reference networks for ssPCC and LIONESS: we used the
LIONESS-S approach here as referenced in section LIONESS: Linear Interpolation
to Obtain Network Estimates for Single Samples, as this is the approach
put forth in the original publication.[14] The reference networks are given in Figures A and 10A, respectively.
Figure 9
(A) Aggregate (reference) network for Lioness (B) Lioness
single
sample network for NSTI patient n. 24. The top (in
absolute value) 0.05% edges are shown. The link width is proportional
to the edge weight.
Figure 10
(A) Aggregate (reference)
network for ssPCC (B) ssPCC single sample
network for NSTI patient n. 24. The top (in absolute
value) 0.05% edges are shown. The link width is proportional to the
edge weight.
Clustering dendrogram of the single sample
networks for the metabolomics
study case. The single samples for the subjects affected by NSTIs
are shown. Samples are color-coded by distance/similarity: red, for
LIONESS (using LIONESS-S implementation) and blue for ssPCC single
sample networks. Distances are calculated on the vectorized networks.(A) Aggregate (reference) network for Lioness (B) Lioness
single
sample network for NSTI patient n. 24. The top (in
absolute value) 0.05% edges are shown. The link width is proportional
to the edge weight.(A) Aggregate (reference)
network for ssPCC (B) ssPCC single sample
network for NSTI patient n. 24. The top (in absolute
value) 0.05% edges are shown. The link width is proportional to the
edge weight.As expected, the two aggregate
results/reference networks are rather
different with different relevant patterns of correlations. We recall
that in this case, for ssPCC the reference network is the correlation
matrix obtained from the control group, while for LIONESS, the aggregate
is obtained from the correlation of the complete data set (NSTI +
control samples).As can be seen, there are obvious differences.
While the ssPCC
reference network seems to cluster around Maltose, the LIONESS aggregate
network seems to cluster around valine and ribitol.We performed
pathway analysis on the top 25 perturbed edges for
both networks with the aim of assessing network properties and identifying
structural and functional units in the metabolic networks.[37] We found significant enrichment for aminoacyl-tRNA
biosynthesis (P-value = 3.29 × 10–5, fdr = 0.0028) and lysine degradation pathways (P-value = 1.33 × 10–4, fdr = 0.0056) for the
LIONESS aggregate network (Figure A) and significant enrichment for the aminoacyl-tRNA
biosynthesis (P-value = 2.45 × 10–5, fdr = 0.002) and valine, leucine, and isoleucine biosynthesis pathways
(P-value = 1.15 × 10–4, fdr
= 0.005) for the ssPCC reference network (Figure A). These results indicate that the single
sample networks are going to be constructed against a background that
may encode for different biological phenomena and as such, they will
bear different biological information. This can be seen also from
the results shown in Figure where ssPCC and LIONESS edges are contrasted: two different
reference networks result in related but different single sample edges.It is interesting to note that no significant enrichment (after
adjustment for multiple corrections) was found when pathway enrichment
analysis was performed only on differentially abundant metabolites
(see Table in Afzal et al.(17)). This is a clear indication
that information on metabolic rewiring and/or disruption is reflected
not only in a change in metabolite levels but also in changes in the
correlations between metabolite concentrations. In this case, single
sample edges carry more information about changes in metabolism in
NSTI than simple metabolite abundances.We constructed the single
sample networks for the 34 NSTI patients;
as shown in Figure the single sample networks obtained using ssPCC and LIONESS are
markedly different, confirming what was observed using simulated data
(see the sections Numerical Simulations and Data simulation using a dynamic metabolic model and Figures and 7). LIONESS and ssPCC networks cluster separately and, in general,
LIONESS sample edges show higher variability than the corresponding
ssPCC edges. In particular, there is a group of networks (corresponding
to samples 5, 6, 10, 15, 20, 28, and 29) that are markedly different
from the others. This is particularly evident for the LIONESS edges.
All samples belong to patients with concurrent comorbidities with
NSTI; all these patients are female, except patients 28 and 29.
Figure 8
Clustering dendrogram of the single sample
networks for the metabolomics
study case. The single samples for the subjects affected by NSTIs
are shown. Samples are color-coded by distance/similarity: red, for
LIONESS (using LIONESS-S implementation) and blue for ssPCC single
sample networks. Distances are calculated on the vectorized networks.
Moreover, the ssPCC edges of sample 18 are more similar to LIONESS
edges than to the other ssPCC samples.We then focused on single
sample networks built from the same sample
profile using the two methods. We built correlation matrices among
ssPCC and LIONESS edges and selected the NSTI patient for whom the
single sample networks obtained with ssPCC and LIONESS were most different
(i.e., the least correlated). We recovered from this
analysis the single sample network for sample 24 who is a NSTI patient
having a polymicrobial etiology. The corresponding single sample networks
for ssPCC and LIONESS are given in Figures B and 10B where only
the largest edges are shown (see figure caption for more details).As can be seen, there are obvious differences. In the LIONESS single
sample network (Figure B), the edges connecting acetaminophen and glucuronic acid are the
most disrupted. In the ssPCC single sample network (Figure B), the edges connecting acetaminophen,
α-tocopherol (vitamin E), maltose, and proline are altered.
In the original publication,[17] standard
differential network analysis was performed to compare metabolite–metabolite
connectivity in NSTI and surgery control (see Table in Afzal et al.[17]); glucuronic acid, and maltose were among the most differentially
connected metabolites but not acetaminophen and α-tocopherol.
Table 4
Accuracy of the Random Forest Models
Constructed Using the Concentration/Abundance Profiles and the Edges
of Single Sample Networks Built from the Metabolomics Data Set Investigating
NSTIs
classification
accuracy
Model
Conc
ssPCC
LIONESS-S
LIONESS-D
NSTI vs Controls
94.7
98.3
98.3
100.0
mono vs poly
87.1
82.5
82.5
82.5
StreptococcusvsStaphylococcus
85.6
75.9
75.9
79.3
Univariate Analysis of
the Single Sample Network Edges
Furthermore, we compared
the edges of LIONESS and ssPCC single sample
networks between NSTI patients and controls using a t-test and we compared the results with those obtained in the original
publication using the metabolite concentrations. In particular, we
compared NSTI and control group on:Metabolite concentrationsMetabolite–metabolite single
sample network edges defined using ssPCCMetabolite–metabolite single
sample network edges defined using LIONESS-SMetabolite–metabolite single
sample network edges defined using LIONESS-D.Results are given in Table that contains the top 10 metabolites/edges
(sorted by P-value). In general, we observed that
metabolites whose concentration is different between NSTI patients
and controls are in general those whose single sample edges are found
to discriminate among the two groups.
Table 6
Results of t-Test
on the Single Sample Edges Obtained Using Lioness (Two Implementations,
Mono and Double) and ssPCC Together with Results from a t-Test on the Metabolite Concentrations (Column “Conc”)a
Concentrations
ssPCC edges
LIONESS-S edges
LIONESS-D edges
1
1-stearoyl sn-glycro-3-phosphocholine
RI 2997 m/z 184 1 m/z 184 & RI 2354 m/z 290 1 m/z 522
cystathionine & glutaric acid
RI 2997 m/z 184 1 m/z 184 & RI
2354 m/z 290 1 m/z 522
2
1-palmitoyl sn-glycero-3-phosphocholine
RI 2997 m/z 184 1 m/z 184 & RI 1232 m/z 174 1 m/z 175
cystathionine &
naproxen tms
RI 2997 m/z 184 1 m/z 184 & RI 1232 m/z 174 1 m/z 175
The top 10 metabolites and metabolite–metabolite
edges are shown in decreasing order of importance (given by the Mean
Decrease Gini Index).The top 10 edges (sorted by P-values) are given.
Random Forest Analysis
on Single Sample Network Edges
Following the same strategy
outlined in Section Random Forest Prediction Models, we performed Random Forest
classifications to analyze the prediction accuracy of single sample
edges in comparison with the simple correlations. In particular, Random
Forest classification models were built for the following two-group
comparisons:NSTI (n = 34) versus
controls (n = 24)Mono-microbial NSTI (n = 26) versus
poly-microbial NSTI (n = 7)Streptococcus NSTI
(n = 20) versus Staphylococcus
aureus NSTI (n = 8)The models were built using the original concentration/abundance
profiles, the edges of the single sample networks were built using
ssPCC, the edges of the single sample networks were built using LIONESS-S,
and the edges of the single sample networks were built using LIONESS-D.
Classification accuracies are given in Table . Regarding the comparison between NSTI and
controls, the use of the single sample network edges increases the
accuracy (up to 100%), although the use of simple concentrations gives
excellent classification (94.7%). The logic behind the use of single
sample network edges is that additional information is contained in
the relationships (or disruption thereof) among (pairs of) metabolites
that is contained in, or is additional to, the metabolite levels and
thus better accuracy should be, in principle, obtained. However, this
is not always the case. For the comparison between mono versus poly
microbial infection and Streptococcus versus Staphylococcus infection,
the use of single sample network results in decreased accuracy. This
may well depend on the limited sample size used to build the aggregated/reference
networks that can lead to instability in the estimation of these networks[12] and, as a consequence, low-quality estimation
of the single sample networks. We also compared the edge and metabolite
importance in the Random Forest models using the mean decrease Gini
index as the importance measure. The top metabolites and edges from
the models are shown in Table . The metabolites whose edges mostly contribute to the separation
between NSTI and controls tend to be the metabolites whose concentration
is also different between the two groups, confirming that the single
sample edges bear content of both concentration and pairwise relationships
as also discussed in the section ssPCC: Single
Sample Network Based on Pearson’s Correlation.
Table 5
Random
Forest Classification of NSTI
Patients and Controls Using Metabolite Concentrations and ssPCC and
LIONESS Single-Sample Network Edgesa
The top 10 metabolites and metabolite–metabolite
edges are shown in decreasing order of importance (given by the Mean
Decrease Gini Index).
Conclusions
In this study, we investigated and assessed
the utility of two
methods for the inference of single sample networks in a metabolomics
context: LIONESS (Linear Interpolation to Obtain Network Estimates
for Single Samples)[14] and ssPCC (single
sample network based on Pearson’s correlation).[13]The two methods are functionally related
and when compared on simulated
data with different correlative properties, we found both methods
to have limited ability to describe different situations. However,
on data from a NF-kB dynamic metabolic model, we found that only the
ssPCC single sample networks are able to describe different situations
arising from perturbations of the model.We found the statistical
procedure proposed in ssPCC to have limited
power and to be heavily dependent on the particular reference networks
and of little utility for practical applications.We explore
the potential of single sample edges to be used in place
of concentration to discriminate between groups in a case–control
scenario. To this scope, we used two different implementations of
LIONESS and proposed a work-around to adapt ssPCC to this scenario.
Using Random Forest as a classification algorithm, we found that in
13 cases out of 22 the use of ssPCC edges gives better (or equal)
classification accuracy than (to) the use of actual metabolite concentrations,
while, overall, the use of LIONESS sample resulted in worse prediction
models.We found that single sample networks built off of a
control group
(like ssPCC and LIONESS-S) yield better results than those built off
of similar samples in a classification setting; however, this approach
does not allow for generalization and as such should be used as an
exploratory tool.We finally applied the two methods to analyze
a metabolomics study
pertaining to NSTIs.NSTI-specific single sample networks obtained
with the two methods
are markedly different and are likely to describe different ongoing
biological processes. We found that single sample edges, either from
ssPCC or LIONESS, gave better prediction results in distinguishing
between NSTI patients and controls but not in other comparisons aimed
to distinguish between disease etiology. In general, ssPCC edges found
to be important in discriminating groups involve metabolite pairs
that are found important when comparing groups with a standard t-test performed on concentrations.There are still
some aspects of single sample networks to be elucidated,
the most important being a proper statistical validation for both
ssPCC and LIONESS.We used both methods taking Pearson’s
correlation as an
association measure and building networks using the standard correlation
matrix, but the two methods are general. Other measures of association
can be used, such as partial correlation, as proposed by the authors
of ssPCC,[38] or mutual information, although
the latter measure does not give any improvements, at least for what
concerns metabolomics data.[19] In addition,
more sophisticated approaches can be used for network inference as
proposed by Kuijjer et al.(14)Overall, we think single sample networks seem to be a promising
tool for the analysis of metabolomics data. For instance, we have
shown that single sample edges provide more information than simple
concentrations for pathway analysis since they encode both levels
and relationships between metabolites. Single sample networks can
be used as exploratory tools and can be paired with standard univariate
and multivariate tools.
Authors: P Andrew Futreal; Lachlan Coin; Mhairi Marshall; Thomas Down; Timothy Hubbard; Richard Wooster; Nazneen Rahman; Michael R Stratton Journal: Nat Rev Cancer Date: 2004-03 Impact factor: 60.716
Authors: Antonio Rosato; Leonardo Tenori; Marta Cascante; Pedro Ramon De Atauri Carulla; Vitor A P Martins Dos Santos; Edoardo Saccenti Journal: Metabolomics Date: 2018-02-27 Impact factor: 4.290