Literature DB >> 33267585

Evaluation of Single Sample Network Inference Methods for Metabolomics-Based Systems Medicine.

Sanjeevan Jahagirdar¹, Edoardo Saccenti¹.

Abstract

Networks and network analyses are fundamental tools of systems biology. Networks are built by inferring pair-wise relationships among biological entities from a large number of samples such that subject-specific information is lost. The possibility of constructing these sample (individual)-specific networks from single molecular profiles might offer new insights in systems and personalized medicine and as a consequence is attracting more and more research interest. In this study, we evaluated and compared LIONESS (Linear Interpolation to Obtain Network Estimates for Single Samples) and ssPCC (single sample network based on Pearson correlation) in the metabolomics context of metabolite-metabolite association networks. We illustrated and explored the characteristics of these two methods on (i) simulated data, (ii) data generated from a dynamic metabolic model to simulate real-life observed metabolite concentration profiles, and (iii) 22 metabolomic data sets and (iv) we applied single sample network inference to a study case pertaining to the investigation of necrotizing soft tissue infections to show how these methods can be applied in metabolomics. We also proposed some adaptations of the methods that can be used for data exploration. Overall, despite some limitations, we found single sample networks to be a promising tool for the analysis of metabolomics data.

Entities: Chemical Disease Gene Species

Keywords: biological networks; correlation; necrotizing soft tissue infections; network inference

Mesh：

Year: 2020 PMID： 33267585 PMCID： PMC7786380 DOI： 10.1021/acs.jproteome.0c00696

Source DB: PubMed Journal: J Proteome Res ISSN： 1535-3893 Impact factor: 4.466

Introduction

Humans exhibit great phenotypic diversity in both healthy and pathophysiological conditions as a result of molecular regulatory and metabolic systems underlying the functioning of living organisms. It is now widely recognized that phenotypic diversity cannot be understood and characterized by analyzing single molecular markers such as genes, metabolites, or proteins alone: what is relevant is the complex web of interactions underlying the molecular mechanisms maintaining the functioning of the organism.[1−3] These molecular interactions are well captured and modeled using the formalism of network inference and analysis,[4−6] where molecular entities such as genes, proteins, and metabolites are represented as nodes and their mutual relationships as edges, which can be different in nature, representing physical interactions, as in protein–protein interaction networks, regulation, as in gene regulatory networks, or similar concentration patterns, as in metabolite–metabolite association networks.[7] It has been shown that network-based biomarkers, for example, sub-network markers,[8] network biomarkers,[9] and edge biomarkers[10] are superior to the traditional single-molecule biomarkers for accurately characterizing disease states due to their additional information on interactions and networks. In the quest for personalized medicine,[11] it is of paramount importance to elucidate the molecular mechanisms which underlay the subject-specific response to pathophysiological stimuli, resulting from the dysfunction of individual-specific networks/systems rather than just the malfunction of a singular biological entity. In this light, networks and network analyses have the potential of being pivotal in personalized medicine if there exists the possibility of their extension from a population level to the individual-specific level. However, since several samples are required to define the associations (like in the form of correlations[12]) among molecular elements, such as metabolites or genes, there exists no straightforward approach to infer an individual-specific network by profiling metabolite concentrations or gene expression from a single sample. If such an approach was demonstrated, it would be a very desirable situation due to the fact that it is rarely possible to obtain multiple samples from the same subjects, given the necessity of designing complex and expensive longitudinal studies. On the contrary, a single bio-fluid sample (such as blood and urine) is usually easy to obtain even in common clinical practice. There is growing research interest in the possibility of the construction of such individual-specific networks by expression profiling of a single sample and several methods have been proposed.[13−16] Here, we present a comparative review of two methods for single sample network inference with the aim of evaluating their possible application to metabolomics data to obtain metabolite–metabolite single-sample association networks. We focused on LIONESS proposed by Kuijjer et al.(14) and ssPCC by Liu et al.(13) We chose these two approaches among others since they adopt similar albeit different philosophies and thus are directly comparable, both are (or may be) based on correlations, and are easy to implement. We analyzed these two methods for their ability to produce single sample or sample-specific networks from metabolite concentrations, and we explored and compared their characteristics on data generated from Numerical simulations A dynamic metabolic model 22 publicly available metabolomic data sets. We then applied the two methods on a study case pertaining to the metabolomics investigation of necrotizing soft tissue infections (NSTIs)[17] in order to showcase the deployment of single sample network inference in real-life metabolomics applications. Additionally, we suggest some potentially new use of single sample networks for sample exploration and classification.

Material and Methods

Basics of Networks

A network is a graphical representation of relationships among objects. A network consists of nodes which represent biological features (genes, proteins, and metabolites) connected by links or edges which represent pair-wise relationships between the biological features. This representation shifts the focus toward the relationships among biological entities rather than on their levels; in this light, network and network analysis are fundamental tools from the systems biology toolbox to investigate and understand metabolomics data.[12] When the nodes are metabolites, the networks can be termed metabolite–metabolite association networks.

Methods for Single Sample Networks Inference

LIONESS: Linear Interpolation to Obtain Network Estimates for Single Samples

LIONESS is an approach developed by Kuijjer et al. in the context of gene regulatory networks.[14,18] This approach starts by considering a n × m data matrix X(α) and the corresponding m × m network E(α) (i.e,. the so-called aggregate network) with edges e between nodes i and j and the network E(α– constructed from the (n – 1) × m data matrix X(α–, that is a matrix with all but the qth sample, which we refer to as the q-sample for the sake of simplicity. A graphical illustration of the LIONESS procedure is given in Figure .

Figure 1

Visual illustration of the Lioness estimation of a single sample based on two aggregate network models, one reconstructed using all biological samples in a given data set and the other using all except the sample of interest (q, q-sample). Figure reproduced and adapted from the original publication,[14] published under Creative Commons license CC BY-NC-ND 4.0. LIONESS assumes that the aggregate network E(α) built from n samples is the mean of networks constructed from every single sample from the data set X(α) containing n samples. This assumption is then extrapolated to define the edge e(α) in E(α) as the linear combination of the weights of that edge across a set of n networkswhere w( represents the relative contribution of each single sample network to the aggregate network and Similarly, for the network E(α– constructed from all but the q-sample, the edge between eα– is defined aswhere From eqs and 3, the authors definedas long as the assumption holds that every sample makes an equal proportional contribution to the aggregate networks E(α), which makes w(α) constant. Combining eqs and 3 and solving for the edge e( for the q-sample gives the general LIONESS equationwhich defines the edge between node i and j of the single sample network for the q-sample. The term 1/w(α) gives the weight of each sample, and can be set to n if all samples are given the same weight, obtainingwhich will be used throughout this study. In matrix notation, eq becomes The LIONESS equation does not depend on the particular methods used to infer the networks which can be estimated with any approach or different association measures; the most common approach is to use correlation, but mutual information can also be used. However, the authors reported sub-optimal performance when using mutual information to measure associations and single sample edges. In a related study, we also reported the sub-optimal performance of mutual information for metabolite association network estimation.[19] For this reason, we will focus on Pearson’s correlation also because this will allow direct comparison with the ssPCC method (see the section ssPCC: Single Sample Network Based on Pearson’s Correlation). In this case, the LIONESS eq becomeswhere r is the Pearson correlation between variable (metabolite) i and j. A summary of the notation used is given in Table .

Table 1

Summary of the Notation Used in the Paper to Define the Lioness and ssPCC Edges

		original notation	definition	alternative notation	correlation notation	definition
ssPCC	network built using all samples in the reference data set	PCC_n			r_ij⁽ⁿ⁾
	aggregated network built using all samples in the reference plus the q-sample	PCC_n+1			r_ij^(n+q)
	single sample network for the q sample	ΔPCC_n	PCC_n+1 – PCC_n		r_ij^(q)	r_ij^(n+q) – r_ij^(q)
Lioness	aggregated network built using all samples	e_ij^(α)		e_ij⁽ⁿ⁾	r_ij^(α)
	network built using all samples but the q-sample	e_ij^(α–q)		e_ij^(n–q)	r_ij^(α–q)
	single sample network for the q-sample	e_ij^(q)	n(e_ij^(α) – e_ij^(α–q)) + e_ij^(α–q)		r_ij^(q)	n(r_ij^(α) – r_ij^(α–q)) + r_ij^(α–q)

Choice of the Aggregate Network

The LIONESS algorithm outputs a single sample network for each sample in a given data set given an aggregate network. If the data set contains n1 samples from group 1 and n2 samples from group 2, there is the legitimate question of whether to build the aggregate network using all n1 + n2 samples or to build two different separate reference networks, one for group 1 and one for group 2. In the original paper,[14] the authors investigate the use of non-homogeneous background (page 13 of the Supporting Information[14]) and reported minimal differences. However, in this study we explored both implementations that we dubbed, for convenience, as LIONESS single (LIONESS-S) and LIONESS double (LIONESS-D) LIONESS-S Consider all samples to build the aggregated network and build single sample networks referring to the pool of all samples, or LIONESS-D Consider two different aggregate networks E1(α) and E2(α) from the two group samples and use them to build two sets of single sample networks, one for group 1 and one for group 2.

ssPCC: Single Sample Network Based on Pearson’s Correlation

The single sample network based on Pearson’s correlation (which we abbreviate as ssPCC) was proposed by Liu et al.(13) for building sample specific networks in the context of gene regulatory networks for disease characterization. As such, it relies on the availability of a n × mX set of reference or control samples to contrast a set of case (possibly disease, in general from a different condition) q-samples. The ssPCC aims to define the single sample network specific to the q-sample(s). Using the same notation used in the original publication, the single specific network for the q-samples obtained using as reference the n samples in X is given bywhere PCC is the Pearson’s correlation matrix[20] calculated from the reference set X, and PCC is the correlation matrix calculated from the (n + 1) × m set made of X + the q-sample. The PCC is referred to as the “Reference network”, while PCC is referred to as the “Perturbed network”. Thus, the single sample network for the q-sample is considered to be the perturbation of the correlation of the m variables in X caused by the addition of the q-sample which comes from a different population. A graphical illustration of the ssPCC procedure is given in Figure .

Figure 2

Visual illustration of the ssPCC estimation of single sample network. (A) For a group of n reference samples, a reference network PCC can be constructed using Pearson’s correlations. (B) A new sample q is added and the perturbed network PCC with this additional sample is calculated. The difference between PCC and PCC is due to sample q. (C) Differential network ΔPCC is constructed taking the difference between perturbed and the reference network PCCn + 1 – PCC. Figure reproduced and adapted from the original publication,[13] published under Creative Commons license CC BY-NC-ND. Using an edge notation similar to the one used for LIONESS, the single sample network for the q-sample can be rewritten aswhere the superscript n + q indicates the addition of the sample q to the n samples of the reference matrix X and r indicates Pearson’s sample correlation between variables i and j. A summary of the notation used is given in Table . The authors proposed to assess the significance of an edge in the single sample network by means of permutation but found that the procedure could be conveniently substituted with a Z-test which is much faster and gives equivalent results (see the section in the original paper[13]). They propose the following Z statistic testwhich, considering eq , can be conveniently rewritten using the edge notation in terms of the sample correlation between variable i and j The Z-statistic is then confronted with the critical values of a standard normal distribution to assess significance.

ssPCC for a two Group Case

The ssPCC algorithm outputs single sample networks only for the case group and not for the reference group. This setting does not allow, per se, to build single sample networks for all samples (i.e., case and reference samples) as in LIONESS. We attempted to bypass this limitation by building single sample networks also for the reference data set by contrasting each sample in the reference data set against the remaining samples, that is, considering each reference sample as a q-sample.

Data Simulations

Numerical Simulations

Simulation Scheme 1

We simulate n × 2 reference data set X by sampling from a bivariate normal distributionwith population μ0 = (0, 0) andwhere ρ0 is the population (expected) value of r( (i.e., of PCC) from ssPCC eqs and 10. We let ρ0 vary over the values −0.9, −0.7, −0.5, −0.3, 0, 0.3, 0.5, 0.7, and 0.9 to define 9 different reference covariance/correlation structures. The q-sample for which the single sample network is sought using ssPCC, that is, the sample to be added to X to calculate PCC is also drawn from a bivariate normal distribution withwith population mean μ = (0, 0) and correlation matrix We let ρ vary in the range [0, +1] in increments of 0.1 to define 11 different covariance/correlation structures for variables x and y. We take the differenceas a measure of the perturbation effect when a sample q is added to X to estimate r( (i.e., PCC): when ρ = ρ0, the q-samples and the reference samples come from the same distribution, which implies that there is no perturbation, hence the expected value of r( and r( is the same. As ρ increases with respect to ρ0, the perturbation increases, and in consequence ΔPCC also increases.

Simulation Scheme 2

This simulation is similar to the Simulation Scheme 1. The only difference is thatthat is the q-samples come from a population with both different mean and correlation structure.

Simulation Scheme 3

We generated m × m (with m = 20) random correlation matrices Σ (with elements ρ ≥ 0 and ρ ≠ ρ for all possible variable pairs) satisfying the property This was achieved using the vine method[21,22] by sampling from a beta distribution Beta(α,β). The variance σ2 of the beta distribution was set to 0.1, and the mean μ was numerically optimized to have the sampled data obtain the required average correlation ρ equal to 0.1 to 0.9 in steps of 0.1, within a 5% precision. The mean μ and variance σ2 link the α and β parameters by the relationships The optimized μopt values (0.113, 0.116, 0.123, 0.135, 0.163, 0.201, 0.262, and 0.382) were used to calculate the α and β shape parameters to be passed to the algorithm. The correlation matrices Σ were used to simulate multivariate normally distributed data N(μ,Σ) to be used as a reference and as q-samples. In particular, data with an average correlation of 0.6 and 0.9 were used as reference data sets and data with an average correlation in the range 0.1–0.9 were used as q-samples.

Multivariate Data Simulation

The covariance/correlation matrices defined in the Simulation Schemes 1–3 were used to generate normally distributed multivariate data using the Matlab mvnrnd function.

Data Simulation Using a Dynamic Metabolic Model

To generate metabolites with plausible concentration patterns as observed in metabolomics data, we used a dynamic metabolic model, as described in ref (19). For the sake of completeness, we report here the full simulation strategy. The model describes the activation of NF-κβ complex (nuclear factor kappa-light-chain-enhancer of activated B cells) and the corresponding response of the intracellular signaling pathway when exposed to lipo-polysaccharide that activates an inflammatory response. It consists of 59 ordinary differential equations recounting the reactions involving 35 metabolites. The model was obtained from the BioModels database[23] (www.ebi.ac.uk/biomodels/) with accession number BIOMD0000000489. Full details on the model building and accessory files can be found in the original publication.[24] The model dynamics were constructed and simulated using ordinary differential equations representing 3 types of reactions, namely Reversible reactions using mass action law Irreversible reactions using mass action law Enzymatic reactions using Michaelis–Menten kinetics. We give three reactions used in the model to showcase examples of ordinary differential equations corresponding to each type of reaction and the kinetics involved.where IRF3 (Interferon Regulatory Factor 3), IKK(IkB kinase), and TAK1:TAB1:TAB2:TRAF6 (complex of mitogen-activated protein kinase kinases) are three of the metabolites/compounds utilized in the NF-κβ activation and corresponding signaling pathway, [P] represents the addition of phosphoryl group via the process of phosphorylation, [metabolite] represents the concentration of the metabolites, and i is the reaction number. For detailed description and information on all metabolites and reactions involved in the NF-κβ model, we refer the reader to the original publication and its Supporting Information.[24] Subject-specific concentration profiles were obtained by varying the kinetic constants Km, k, kf, and kr for all of the 59 reactions and the initial concentrations c for the 4 metabolites with non-null initial concentrations in order to generate subject-specific profiles. All these constants, Km, k, kf, kr, and c were varied between bounds (a, b) ± 10% of the original values (see eq and following) presented in the original publication. This was achieved by sampling from a uniform distribution U(a,b) to obtain values for each subject. For the j-th individual, the values kf, kr, k, km, and c for the i-th reaction were defined as The rationale of this approach is that models with different parameters will produce different metabolite profiles, like those observed when sampling different subjects in real-life metabolomics experiments. Using this approach, we generated 500 individual profiles from which we built data sets of different sizes by random sampling. To mimic different conditions, we introduced perturbations to the model by manipulating the kinetic constants in the following mannerhere ϵ is used as a scaling parameter, the same for all reactions. The value of ϵ was varied over the values 1/10, 1/5, 1/3, 1/2, 1/1.5, 1, 1.5, 2, 3, 5, and 10 which were used to create subject-specific profiles in a similar manner as described above. Using the same ϵ for all reactions allows us to investigate the performance of the SSN method as a function of the perturbation. From the pool of 500 samples, we randomly sampled (with replacement) subsets of different sizes (n = 10, 25, 50, 100, 250, and 500).

Power of the ssPCC Test

We investigated the actual power of the test (i.e., the probability of rejecting the null hypothesis H0 when actually false) by means of the Simulation Schemes 1 and 2. For each combination of Σ and Σ, we generated a n × 2 reference data set X from which we calculated PCC and k q-samples to obtain k values of ΔPCC which were tested for significance at the 0.05 level and recorded how many times H0 was correctly rejected to calculate the actual power of the test. In our Simulation Scheme 1, the Null hypothesis H0 is always false, except when Σ = Σ. The overall procedure was repeated 1000 times, and the results were averaged over the repetitions. The actual power was calculated for n = 25, 250, and 25,000. The relative frequency of the rejection of H0 when Σ = Σ (i.e.,H0 true) is the actual α level of test, that is, the actual false positive rate (Type I error).

Principal Component Analysis

We explored differences among single sample networks with principal component analysis (PCA). Each single sample matrix was vectorized and PCA was applied on the edges to investigate the patterns of similarity/difference among the networks. Networks were vectorized by taking only the upper diagonal part of the network (given the symmetry) so that only edges are considered instead of m2. Every m × m single sample network was then collapsed to a vector, and the different networks were then collected in a matrix form suitable for PCA.

Random Forest Prediction Models

Random Forest[25] was used to build classification models to explore the use of single sample network edges in a prediction context, that is, to explore whether the use of the edge weights between pair of metabolites possesses higher predictive power than the original metabolite concentrations. We focused on two-group scenarios, which are also the most commonly encountered in metabolomics applications, applying this approach to several public metabolomics data sets. We built single sample networks using both LIONESS implementations (Single and Double, see the section Choice of the Aggregate Network) and ssPCC (see the section ssPCC for a Two Group Case). The single sample networks were processed for Random Forest as described in the case of PCA (see the section Principal Component Analysis). We used the standard Breiman’s Random Forest implementation which uses the Gini impurity as loss.[26] We set the number of trees to 1000 and used the default value of (where p is the number of variables) for the “mtry” parameter. We used a 2/3 + 1/3 data split (training + validation) to obtain an unbiased estimation of the classification. We took into account data unbalance using the “strata” option. Each model fitting was repeated 100 times to take into account the variability due to the resampling step used by the RF algorithm to randomly select the same number of subjects from each group and so to build the model on balanced data. The resampling was nested within the cross-validation step used to assess the quality of the prediction models. All results are given as the arithmetic mean over 100 iterations.

Pathway Enrichment Analysis

Pathway enrichment analysis was performed using the built-in function available in the MetaboAnalyst 4.0[27] online server (www.metaboanalyst.ca) using the hypergeometric test. The Benjamini–Hochberg method was used for the false discovery rate (fdr) correction.[28] We considered significantly enriched those pathways with fdr < 0.01.

Experimental Data

Metabolomic Study Case

As a study case, we considered a data set from a metabolomic investigation of NSTIs. The data set consists of plasma metabolite profiles acquired via GC–MS on 34 NSTI patients enrolled in the INFECT project (Clinicaltrials.gov, NCT01790698). In addition, 24 patients with no known infections were included as controls. The patients had NSTI of different microbial etiology and were classified into polymicrobial and monomicrobial NSTI. This data is available at the NIH Common Fund’s National Metabolomics Data Repository (NMDR) website, the Metabolomics Workbench (www.metabolomicsworkbench.org) with Project ID ST00127 where it can be accessed via the Project DOI (DOI: 10.21228/M85H5H). We refer to the original publication[17] for more details on the study design, sample collection, and processing, GC–MS experiments, and clinical information.

Compendium of Publicly Available Metabolomics Data

To explore further the characteristics of the two methods for inference of single sample networks, we used a compendium of 22 data sets that we compiled for a previous study.[19] Ten of these data sets were derived from the compendium assembled by Mendez et al.(29) including studies representative of three of the most frequent metabolomic experimental platforms (nuclear magnetic resonance, NMR; gas chromatography mass spectrometry, GC–MS; and liquid chromatography mass spectrometry, LC–MS) concerning the metabolomic profiling of varying biofluids (serum, plasma, urine, and feces). All studies follow a two-group design (either as primary design or secondary result of the publication or as a subset of a multi-class study) and have varying sample sizes and number of metabolites (variables) present. We have made use of the processed and cleaned data available in Mendez et al.(29) to which we refer the reader for more detailed information on the processing and cleaning of data sets. Metabolites having missing values were either deleted (e.g., dataset MTBLS136) or imputed using the Random Forest procedure implemented in the R package missForest[30] (e.g., data set ST001047). We also included other data sets derived from tissue (fat), plant, and fruit extracts along with microbiome data (16S sequencing) and other chemical-based assays on various fluids such as coffee, wine, and oil and finally two transcriptomic data sets. Relevant references and attributes for all data sets are shown in Table .

Table 3

Random Forest Classifications of 22 Case-Control Metabolomics Data Sets Using Metabolite Concentrations and Single Sample Network Edges as Described in the Section Random Forest Prediction Modelsa

								classification accuracy
	study ID	refs	platform	type	obs	var	design	conc	ssPCC	LIONESS-S	LIONESS-D
1	KODAMA	(39)	NMR	urine	80 (40/40)	490	subject (A/B)	96.9	100.0	87.5	100.0
2	MTBLS123	(40)	NMR	urine	151 (79/72)	63	shock (pre/post)	99.1	97.4	56.3	81.5
3	MTBLS136	(41)	LC–MS	serum	668 (337/331)	371	postmenopausal hormone (estrogen/estrogen + progesterone)	99.1	99.6	59.4	89.8
4	MTBLS161	(42)	NMR	serum	59 (34/25)	30	chronic fatigue syndrome (case/control)	96.0	96.6	67.8	86.4
5	MTBLS404	(43)	LC–MS	urine	184 (101/83)	120	sex (M/F)	98.3	100.0	67.4	100.0
6	MTBLS547	(44)	LC–MS	caecal	97 (46/51)	35	high fat diet (case/control)	99.9	97.9	84.5	94.8
7	MTBLS90	(45)	LC–MS	plasma	968 (485/483)	189	sex (M/F)	99.4	90.8	63.6	91.3
8	MTBLS92	(46)	LC–MS	plasma	253 (142/111)	138	breast cancer chemotherapy (before/after)	98.3	92.5	62.7	89.3
9	pgmm	(47)	assay	oil	50 (25/25)	7	region (A/B)	100.0	97.3	92.0	98.7
10	pgmm	(48)	assay	coffee	43 (36/7)	12	variety (Arabica/Robusta)	100.0	97.7	95.3	83.7
11	pgmm	(49)	assay	wine	130 (59/71)	27	type (Barolo/Grignolino)	100.0	100.0	81.5	94.6
12	ST000061		GC–MS	tissue	118 (59/59)	157	subcutaeus/visceral fat	94.7	99.1	78.6	87.2
13	ST000369	(50)	GC–MS	serum	80 (49/31)	181	adenocarcinoma (case/control)	89.9	100.0	55.0	80.0
14	ST000496	(51)	GC–MS	saliva	100 (50/50)	69	debridement (pre/post)	99.3	96.0	63.0	91.0
15	ST001000	(52)	LC–MS	stool	121 (68/53)	124	inflammatory bowel diseases (CD/UC)	91.7	98.3	66.9	96.7
16	ST001047	(53)	NMR	urine	83 (43/40)	149	gastric cancer (gastric cancer/healthy)	93.4	100.0	61.4	88.0
17	ST001243	(54)	GC–MS	plasma	98 (48/50)	69	trisomy 21 (yes/no)	99.0	100.0	79.2	91.7
18		(55)	NMR	urine	50 (25/25)	200	cachexia (case/control)	92.2	93.5	70.1	94.8
19		(55)	NMR	urine	77 (47/30)	63		94.4	94.0	96.0	98.0
20		(55)	NMR	urine	60 (30/30)	63		99.4	96.2	67.4	86.6
21		(56)	GC–MS	urine	301 (129/172)	324	sex (M/F)	98.2	99.0	72.4	90.0
22		(57)	GC–MS	plant	70 (35/35)	67	light/dark	91.5	94.3	62.9	84.3

Abbreviations: CD: Crohn’s disease. UCL=: Ulcerative colitis. Assay stands for chemical assay.

The % of edges associated with differential expressed metabolites is shown together with the average ssPCC edges and with the average correlation of the reference network calculated on 10 case-control metabolomics data sets. More information on the data can be found in Table . Abbreviations: CD: Crohn’s disease. UCL=: Ulcerative colitis. Assay stands for chemical assay.

Software

Calculations were performed using R,[31] Matlab,[32] and Python.[33] Our R implementations for LIONESS and ssPCC are available at www.systemsbiology.nl under the software tab. Original R package for LIONESS by Kuijjer et al.(18) can be also obtained at github.com/kuijjerlab/lionessR and bioconductor.org/packages/lionessR.

Results and Discussion

We begin by noticing that LIONESS and ssPCC are not context or data dependent or depending on how the networks are inferred. Both methods have been originally applied to gene regulatory networks but the statistical framework is totally general; they are both based on “manipulation” of correlations but the way the correlations are calculated and manipulated is totally independent of their origin. There is nothing in how the methods are formulated that is specifically depending on or descending from the correlations originating from gene regulation patterns. The two frameworks are fully generalizable to different biological contexts and applications. Here, we explore their applicability to metabolite–metabolite correlation networks; different approaches can be used to calculate the reference networks depending on the applications, but the way the single sample networks are obtained does not depend on the application. We investigated the actual power of the ssPCC test (i.e., the probability of rejecting a null hypothesis when actually false) using the Simulation Scheme 1 described in the Material and Methods section. As shown in Figure we observed limited power when the reference samples and the q-samples are drawn from a multivariate normal distribution with different population covariance–correlation matrices but with the same population means (Simulation 1). The results shown are for n = 250.

Figure 3

Actual power of the ssPCC proposed testing procedure based on a Z-test (see the section ssPCC: Single Sample Network Based on Pearson’s Correlation and eq ) as a function of the effect size defined in the section Power of the ssPCC Test using the Simulation Scheme 1. The nine panels (A–I) correspond to the different correlation ρ0 of the ssPCC reference network (PCC, see eq ): the values are −0.9, −0.7, −0.5, −0.3, 0, 0.3, 0.5, 0.7, and 0.9, respectively. For each value of ρ0, we let ρ (see eq ) to vary in the range [0:1] with increments of 0.1. The effect size is defined as ρ – ρ0, so that for each ρ0 there are 11 effect size values, indicated on the y-axis of the bar plots. Results are shown for sample size n = 1. The actual power is calculated over 1000 tests. The vertical dashed red lines indicate the 5% level. The power of the test is heavily dependent on the correlation among x and y in the reference set, that is, on PCC; the maximum power achieved is slightly higher than 60% when r( = 0.9(PCC) (panel A) and reaches its minimum when r( is below 0.5 (panels E–G). Unfortunately, this is the most common case in metabolomics studies: metabolomics correlations are mostly positive and smaller than 0.6 as also shown in Table where the distribution characteristics of correlation values (that is, r(, PCC are given for 10 case-control metabolomics data sets. Thus the most common situation observed in real-life metabolomics studies is the situation in which the test has less power to detect differences between the q-sample and the reference set.

Table 2

Association Between Metabolite Differential Expression and Single Sample Edges Found to Be Significant with the Proposed ssPCC Testa

		% SSN edges associated with DE metabolites			average ssPCC SSN edge			correlation in reference Data
	data set	mean	95% CI		mean	95% CI		mean	95% CI
1	MTBLS136	56.1	39.6	77.8	0.01	0.01	0.01	0.03	–0.17	0.43
2	MTBLS161	82.3	42.6	98.3	0.02	0.00	0.12	0.35	–0.38	0.79
3	MTBLS404	67.9	0.0	97.3	0.01	0.00	0.05	0.30	–0.05	0.73
4	MTBLS547	76.3	0.0	100.0	0.01	0.00	0.07	0.29	–0.22	0.84
5	MTBLS90	84.7	69.3	97.8	0.01	0.00	0.01	0.06	–0.18	0.62
6	MTBLS92	85.0	65.3	98.3	0.01	0.00	0.03	0.21	–0.09	0.80
7	ST000369	39.5	14.6	68.1	0.11	0.08	0.17	0.03	–0.31	0.50
8	ST000369	54.1	32.0	77.8	0.02	0.01	0.08	0.03	–0.29	0.45
9	ST001000	75.3	0.0	98.2	0.02	0.00	0.09	0.14	–0.28	0.95
10	ST001047	75.3	0.0	98.2	0.02	0.00	0.09	0.14	–0.28	0.95

The power of the test depends of course on n and should tend to 1 as n → ∞. However, we did not observe any strong dependence on n: for n = 25, the maximum power was ≈64% which increased to ≈66% for n = 25,000 when PCC = 0.9, which indicates that the size of the reference data set has little influence on the actual power of the test. The limited power of the test under Simulation Scheme 1 can be understood by considering eq in the original publication. The authors derived an interesting relationship (in the case n ≫ 1) linking the ΔPCC, that is, r( and the difference of the level of variable i and j measured on the q-sample with respect to the average level of the same variables in the reference datawhere Several interesting observations can be derived from eq : ΔPCC → 0 if Δx and Δy are zero, that is if the q sample is from a population with the same average level (μ = μ) of the reference population as in Simulation 1. PCC does not appear in the (re)definition of ΔPCC: only the difference in the levels of X and Y with respect to the correlation of X and Y in the reference define ΔPCC. This explains the very limited power observed in Figure . If the reference samples and the q-samples are drawn from a multivariate distribution with the same population, it means that both Δx and Δy tend to 0 and ΔPCC tends to zero even if PCC ≠ 0.which does not depend explicitly on n, since PCC is fixed a priori, this explains why increasing the dimensionality of the reference data set X has little influence on the power of the test.or If the number of samples in the reference data set n is very large, adding the q sample has practically no influence: it is not the perturbation of the correlation (i.e., PCC – PCC) that it is tested, but some function of the differences of X and Y with respect to the average values in reference data. This observation is supported by empirical evidence shown in Table where SSN networks were built for 10 case-control metabolomics data sets: the vast majority of significant SSN edges are associated with metabolites whose concentrations are significantly different between the two conditions. If we plug eq in eq , we obtain the following expression for the Z-statistic ΔPCC can be different from zero also when PCC is zero, that is when the reference samples and the q-samples are from populations with the same covariance–correlation structure; this happens when Δx, Δy ≠ 0. This explains the slightly inflated Type I error observed in Figure .

Figure 4

Actual Type I error (α) of the ssPCC proposed testing procedure based on a Z-test (see the section ssPCC: Single Sample Network Based on Pearson’s Correlation and eq ) as a function of the effect size defined in the section Power of the ssPCC Test. The nominal 5% level is indicated by the vertical dashed red line. The six panels (A–F) correspond to the different sample sizes used, which are 10, 50, 100, 500, 1000, and 10,000, respectively. The actual α is calculated over 1000 tests.

ΔPCC can be zero even if Δx, Δy, and PCC are all different from zero. This can happen, for instance, when Actual Type I error (α) of the ssPCC proposed testing procedure based on a Z-test (see the section ssPCC: Single Sample Network Based on Pearson’s Correlation and eq ) as a function of the effect size defined in the section Power of the ssPCC Test. The nominal 5% level is indicated by the vertical dashed red line. The six panels (A–F) correspond to the different sample sizes used, which are 10, 50, 100, 500, 1000, and 10,000, respectively. The actual α is calculated over 1000 tests.

Relationship between LIONESS and ssPCC Single Sample Networks

If a correlation is used as a measure of variable association, the two methods are functionally related, and from the definition of the ssPCC and LIONESS edges, it follows that the latter can be written as a function of the ssPCC edges. In particular, when the data set X(α) used to build the aggregated network in LIONESS is equivalent to the reference data set used in ssPCC, there is an almost perfect linear relationship between the edges of the q-sample network estimated using LIONESS and using ssPCC, as can be seen in Figure panel A. This relationship deteriorates when X(α) is not equal to X, that is when there is more than one sample belonging to a group different from the reference (Figure panel B). Note the different scale of the edge weights: for ssPCC, the edges are bounded between −2 and +2, being defined as the difference between two correlations; and for LIONESS, if correlations are used, the edges are bounded between 1 – 2n and 2n – 1.

Figure 5

Comparison of the Lioness and ssPCC single sample edges. (A) Scatter plot of the edges for a q-sample obtained when the data set X(α) used to build the aggregated network in Lioness is equivalent to the reference data set used in ssPCC. (B) Scatter plot of the edges for a q-sample obtained when X(α) is not equal to X, i.e., when there is more than one sample belonging to a group different from the reference. Note the different scale of the edge weights. A relationship similar to eq can be derived also for LIONESS. It is enough to note that r( is actually the correlation calculated using all samples (thus including the q-samples i.e., r(α)) in LIONESS, and PCC is the correlation calculated using all but the q-sample, that is r(α–. The LIONESS edge can be rewritten as This also establishes that the edges of a single sample network estimated with LIONESS are a function of both the perturbation of the correlation and the difference between the levels of the q-sample and the mean of the remaining n – 1 samples.

Comparison of ssPCC and LIONESS on Simulated Data

Simulated Data with Different Levels of Correlation

The first simulation entails the case of the reference data (m = 20 variables) with zero mean and average correlation equal to 0.6; this value was chosen because most parts of the observed metabolomic correlations are smaller than 0.6.[34] The q-sample comes from data with an average correlation between 0 and 0.9 and a mean equal to 0 or 10. The results are shown in Figure , panels A–D. As can be seen, there is no obvious separation of the single sample networks as a function of the effect size; although, a slight separation appears when the mean of the reference and q-sample differ, especially between single sample networks from a sample with extreme average correlation (i.e., 0 and 0.9).

Figure 6

Scatter plot from PCA of the single sample networks obtained with LIONESS-S (left column) and ssPCC (right column). Panels (A–D) refer to simulation with reference data (20 variables) with average correlation equal to 0.6, while the q-sample is from a population with correlation between 0 and 0.9 and mean equal to 0 (panels A,B) or 10 (panels C,D). Panels (E–H) refer to simulation with reference data with average correlation 0.9, while the q-sample is from a population with correlation between 0 and 0.9 and mean equal to 0 [i.e., data from simulation 1, panels (E,F)] or 10 [i.e., data from simulation 2, panels (G,H)]. Each point in the PCA plot is a vectorized version of the single sample networks, color-coded according to the population correlation from which the q-samples are sampled.

Comparison on NF-κβ Dynamic Model Data

We compared the LIONESS and ssPCC on data simulated from a dynamic model for the NF-κβ pathway where model parameters were manipulated to introduce increasing levels of perturbation with respect to the original unperturbed model; this was accomplished by multiplying/dividing the kinetic parameter (see eqs with ϵ = 1, 1.5, 2, 3, 5, and 10). Results are shown in Figure ; n = 50 samples were considered for each configuration.

Figure 7

Scatter plot from PCA of the single sample networks obtained with LIONESS-S (left column) and ssPCC (right column) on data simulated using a dynamic model of the NF-kB metabolic pathway. Panels (A,B) give results on data obtained by multiplying the kinetic parameters by a factor ϵ = 1, 1.5, 2, 3, 5, and 10; see eq . Panels (C,D) give results on data obtained by multiplying the kinetic parameters by a factor ϵ = 1/10, 1/5, 1/3, 1/2, and 1/1.5). Each point in the PCA plot is a vectorized version of the single sample network, color-coded by the magnitude of the perturbation. For each ϵ configuration, n = 50 samples were considered. In the case of the multiplicative perturbation of the model, networks tend to cluster according to the level of perturbation with highly perturbed networks clustering away from those corresponding to low perturbation. This is particularly evident for ssPCC-derived networks with very clear separation among the clusters (Figure panel B); however, the separation among LIONESS-based networks is much less evident (panel A). For data from the perturbed model obtained by dividing the kinetic constants (see eq with ϵ = 1/10, 1/5, 1/3, 1/2, and 1/1.5), the single sample networks obtained with LIONESS are not resolved, and it is not possible to distinguish among the different groups corresponding to the different perturbation levels (Figure , panel C). On the contrary, the networks obtained with ssPCC are very well resolved, and clear differences appear among the groups (Figure , panel D).

Use of Single Sample Edges for Group Prediction and Classification

We explored the potential of single sample network edges for classification purposes by replacing actual observed metabolite concentrations with the pairwise edges, the rationale being that, as follows from eqs and 32, single-sample edges are a function of both difference in correlation and in level and thus, in some case, can bear more information than level alone. To this scope, we compared the accuracy of Random Forest classification models on 25 publicly available data sets (see the section Compendium of Publicly Available Metabolomics Data). For each data set, we built four different Random Forest classification models using: Original concentration/abundance profiles. The edges of the single sample networks built using ssPCC. The edges of the single sample networks built using LIONESS and all samples to build the aggregate network (LIONESS-D). The edges of the single sample networks built using LIONESS and only group-specific samples to build the aggregate network (LIONESS-S). In total, we have three ways to build single sample network edges for a two-class problem to be used for classification purposes. The results are given in Table . Under the assumption that single sample edges hold more information content than the concentration profiles, we expected the Random Forest models built on edges to have higher accuracy than those built on the original concentration values. We observed that in general, classification accuracy is already very high when using the original values. However, in 13 cases out of 22, the use of ssPCC edges gives better (or equal) classification accuracy. For some data sets, the use of single-sample edges in place of concentrations resulted in better discrimination. For instance, for data 13 the accuracy increases from 89.9 to 100% and better discrimination between cases and controls is also obtained for data sets 1, 3, 4, 5, 12, 15, 16, 17, 18, 20, and 22. In other cases, a reduction in accuracy can be observed. When using the LIONESS-D approach (akin to the strategy we devised for ssPCC), the accuracy was better only in 5 cases out of 22, while when using the LIONESS-S accuracy was better only in one case (data set 19): in all other cases the accuracy was remarkably lower. When Random Forest was run on edges from LIONESS-S, discrimination accuracy was lower with the sole exception of data set 19. This can be explained by the process of separating case and control data before running the LIONESS algorithm. In this format, the matrix E is built as a function of E(α) as shown in eq ; however, here all samples in E(α) come from the same classification group causing the variation between single samples to be much lower due to the lower difference in Eα and E(α– as compared to that in LIONESS-D. There is a large difference depending on how the aggregate network is built. In the original publication,[14] the authors discussed the problem of how to build the aggregate network when in the presence of samples from non-homogeneous populations (basically considering the two approaches used here). They found minimal difference (verbatim) in the accuracy of the reconstruction of the single sample networks but did not explore the edge reconstruction in a discriminant/classification setting. We can also comment that calculating correlations from a non-homogeneous population is not statistically a sound strategy since sample correlation must be calculated from samples drawn from the same population. It is simple to show with simulations that if half of the samples come from a population with correlation ρ = 1 and half from a population with ρ = 0, the sample correlation will be around 0.5 and this will lower the single sample edge. Classification models built from ssPCC and LIONESS-D edges often yielded similar albeit lower accuracy. This is not too unexpected as when LIONESS is expressed as a function of ssPCC, as shown in eq , it descends that the LIONESS single sample edge also depends on Δx and Δx which are the deviations of the q-sample from the average of the reference data. In the LIONESS-D approach, the deviations Δx and Δx are calculated from each q-sample from data sets that are homogeneous to the q-sample and thus can be expected to be small, and this also lowers the edges’ values. This is not the case for all ssPCC edges; in our modification, the deviations for q-samples in the control group are estimated from samples homogeneous to the q-samples, but for the control group, the q-samples are not homogeneous to the reference (which is made from control samples) and this makes the edges larger. This can explain the markedly different behavior of the different single sample edges when used for discrimination between two groups. We shall conclude by remarking that in a classification setting the use of single sample edges as derived from ssPCC and LIONESS-D can be used only in an exploratory or confirmatory setting but not to predict new, unknown, q-samples; the network must be constructed by contrasting the unknown sample either with the case or the control group without knowing to which group the q-sample actually belongs. Unknown q-samples can be predicted using the edges from LIONESS when the aggregates network is constructed using all samples from both groups simultaneously.

Metabolomics Case Study: NSTIs

In order to delve deeper into the characteristics of the two methods and to investigate whether (possibly) new biological information can be gained from the use of single sample network, we analyzed in detail metabolomics data concerning metabolite plasma profiles collected from patients suffering from NSTIs, fast-spreading, aggressive bacterial infections associated with high morbidity and mortality.[35,36] The study comprised 34 NSTI patients and 24 surgery patients with no known infection or morbidity acting as controls. This data has been previously analyzed using standard statistical (univariate) approaches and differential correlation analysis.[17]

Single Sample Network Analysis

We began by building the aggregate reference networks for ssPCC and LIONESS: we used the LIONESS-S approach here as referenced in section LIONESS: Linear Interpolation to Obtain Network Estimates for Single Samples, as this is the approach put forth in the original publication.[14] The reference networks are given in Figures A and 10A, respectively.

Figure 9

(A) Aggregate (reference) network for Lioness (B) Lioness single sample network for NSTI patient n. 24. The top (in absolute value) 0.05% edges are shown. The link width is proportional to the edge weight.

Figure 10

(A) Aggregate (reference) network for ssPCC (B) ssPCC single sample network for NSTI patient n. 24. The top (in absolute value) 0.05% edges are shown. The link width is proportional to the edge weight.

Clustering dendrogram of the single sample networks for the metabolomics study case. The single samples for the subjects affected by NSTIs are shown. Samples are color-coded by distance/similarity: red, for LIONESS (using LIONESS-S implementation) and blue for ssPCC single sample networks. Distances are calculated on the vectorized networks. (A) Aggregate (reference) network for Lioness (B) Lioness single sample network for NSTI patient n. 24. The top (in absolute value) 0.05% edges are shown. The link width is proportional to the edge weight. (A) Aggregate (reference) network for ssPCC (B) ssPCC single sample network for NSTI patient n. 24. The top (in absolute value) 0.05% edges are shown. The link width is proportional to the edge weight. As expected, the two aggregate results/reference networks are rather different with different relevant patterns of correlations. We recall that in this case, for ssPCC the reference network is the correlation matrix obtained from the control group, while for LIONESS, the aggregate is obtained from the correlation of the complete data set (NSTI + control samples). As can be seen, there are obvious differences. While the ssPCC reference network seems to cluster around Maltose, the LIONESS aggregate network seems to cluster around valine and ribitol. We performed pathway analysis on the top 25 perturbed edges for both networks with the aim of assessing network properties and identifying structural and functional units in the metabolic networks.[37] We found significant enrichment for aminoacyl-tRNA biosynthesis (P-value = 3.29 × 10–5, fdr = 0.0028) and lysine degradation pathways (P-value = 1.33 × 10–4, fdr = 0.0056) for the LIONESS aggregate network (Figure A) and significant enrichment for the aminoacyl-tRNA biosynthesis (P-value = 2.45 × 10–5, fdr = 0.002) and valine, leucine, and isoleucine biosynthesis pathways (P-value = 1.15 × 10–4, fdr = 0.005) for the ssPCC reference network (Figure A). These results indicate that the single sample networks are going to be constructed against a background that may encode for different biological phenomena and as such, they will bear different biological information. This can be seen also from the results shown in Figure where ssPCC and LIONESS edges are contrasted: two different reference networks result in related but different single sample edges. It is interesting to note that no significant enrichment (after adjustment for multiple corrections) was found when pathway enrichment analysis was performed only on differentially abundant metabolites (see Table in Afzal et al.(17)). This is a clear indication that information on metabolic rewiring and/or disruption is reflected not only in a change in metabolite levels but also in changes in the correlations between metabolite concentrations. In this case, single sample edges carry more information about changes in metabolism in NSTI than simple metabolite abundances. We constructed the single sample networks for the 34 NSTI patients; as shown in Figure the single sample networks obtained using ssPCC and LIONESS are markedly different, confirming what was observed using simulated data (see the sections Numerical Simulations and Data simulation using a dynamic metabolic model and Figures and 7). LIONESS and ssPCC networks cluster separately and, in general, LIONESS sample edges show higher variability than the corresponding ssPCC edges. In particular, there is a group of networks (corresponding to samples 5, 6, 10, 15, 20, 28, and 29) that are markedly different from the others. This is particularly evident for the LIONESS edges. All samples belong to patients with concurrent comorbidities with NSTI; all these patients are female, except patients 28 and 29.

Figure 8

Moreover, the ssPCC edges of sample 18 are more similar to LIONESS edges than to the other ssPCC samples. We then focused on single sample networks built from the same sample profile using the two methods. We built correlation matrices among ssPCC and LIONESS edges and selected the NSTI patient for whom the single sample networks obtained with ssPCC and LIONESS were most different (i.e., the least correlated). We recovered from this analysis the single sample network for sample 24 who is a NSTI patient having a polymicrobial etiology. The corresponding single sample networks for ssPCC and LIONESS are given in Figures B and 10B where only the largest edges are shown (see figure caption for more details). As can be seen, there are obvious differences. In the LIONESS single sample network (Figure B), the edges connecting acetaminophen and glucuronic acid are the most disrupted. In the ssPCC single sample network (Figure B), the edges connecting acetaminophen, α-tocopherol (vitamin E), maltose, and proline are altered. In the original publication,[17] standard differential network analysis was performed to compare metabolite–metabolite connectivity in NSTI and surgery control (see Table in Afzal et al.[17]); glucuronic acid, and maltose were among the most differentially connected metabolites but not acetaminophen and α-tocopherol.

Table 4

Accuracy of the Random Forest Models Constructed Using the Concentration/Abundance Profiles and the Edges of Single Sample Networks Built from the Metabolomics Data Set Investigating NSTIs

	classification accuracy
Model	Conc	ssPCC	LIONESS-S	LIONESS-D
NSTI vs Controls	94.7	98.3	98.3	100.0
mono vs poly	87.1	82.5	82.5	82.5
StreptococcusvsStaphylococcus	85.6	75.9	75.9	79.3

Univariate Analysis of the Single Sample Network Edges

Furthermore, we compared the edges of LIONESS and ssPCC single sample networks between NSTI patients and controls using a t-test and we compared the results with those obtained in the original publication using the metabolite concentrations. In particular, we compared NSTI and control group on: Metabolite concentrations Metabolite–metabolite single sample network edges defined using ssPCC Metabolite–metabolite single sample network edges defined using LIONESS-S Metabolite–metabolite single sample network edges defined using LIONESS-D. Results are given in Table that contains the top 10 metabolites/edges (sorted by P-value). In general, we observed that metabolites whose concentration is different between NSTI patients and controls are in general those whose single sample edges are found to discriminate among the two groups.

Table 6

Results of t-Test on the Single Sample Edges Obtained Using Lioness (Two Implementations, Mono and Double) and ssPCC Together with Results from a t-Test on the Metabolite Concentrations (Column “Conc”)a

	Concentrations	ssPCC edges	LIONESS-S edges	LIONESS-D edges
1	1-stearoyl sn-glycro-3-phosphocholine	RI 2997 m/z 184 1 m/z 184 & RI 2354 m/z 290 1 m/z 522	cystathionine & glutaric acid	RI 2997 m/z 184 1 m/z 184 & RI 2354 m/z 290 1 m/z 522
2	1-palmitoyl sn-glycero-3-phosphocholine	RI 2997 m/z 184 1 m/z 184 & RI 1232 m/z 174 1 m/z 175	cystathionine & naproxen tms	RI 2997 m/z 184 1 m/z 184 & RI 1232 m/z 174 1 m/z 175
3	isomaltose meox	alpha tocopherol & RI 2997 m/z 184 1 m/z 184	lidocaine tms & glutaric acid	alpha tocopherol & RI 2997 m/z 184 1 m/z 184
4	RI 1416 m/z 218 1 m/z 162	alpha tocopherol & tryptophan	cystathionine & lidocaine tms	α-tocopherol & tryptophan
5	α-tocopherol	1-palmitoyl sn-glycero-3-phosphocholine & maltose meox2	pseudo uridine penta & threonic acid	1-palmitoyl sn-glycero-3-phosphocholine & maltose meox2
6	tryptophan	isomaltose meox & RI 1232 m/z 174 1 m/z 175	3-hydroxyisovaleric acid & beta alanine	isomaltose meox & RI 1232 m/z 174 1 m/z 175
7	ribose meox	1-palmitoyl sn-glycero-3-phosphocholine & maltose meox1	2,3-dihydroxybutanoic acid & glycine	1-palmitoyl sn-glycero-3-phosphocholine & maltose meox1
8	glucoronic acid	RI 2997 m/z 184 1 m/z 184 & fumaric acid	heptadecanoic acid & asparagine	RI 2997 m/z 184 1 m/z 184 & fumaric acid
9	RI 1427 m/z 189 1	citric acid & glutamine-2	octadecanoic acid & asparagine	citric acid & glutamine-2
10	RI 2354 m/z 290 1 m/z 522	alpha tocopherol & RI 1416 m/z 218 1 m/z 162	octadecenoic acid-9-(z) & asparagine	alpha tocopherol & RI 1416 m/z 218 1 m/z 162

The top 10 edges (sorted by P-values) are given.

The top 10 metabolites and metabolite–metabolite edges are shown in decreasing order of importance (given by the Mean Decrease Gini Index). The top 10 edges (sorted by P-values) are given.

Random Forest Analysis on Single Sample Network Edges

Following the same strategy outlined in Section Random Forest Prediction Models, we performed Random Forest classifications to analyze the prediction accuracy of single sample edges in comparison with the simple correlations. In particular, Random Forest classification models were built for the following two-group comparisons: NSTI (n = 34) versus controls (n = 24) Mono-microbial NSTI (n = 26) versus poly-microbial NSTI (n = 7) Streptococcus NSTI (n = 20) versus Staphylococcus aureus NSTI (n = 8) The models were built using the original concentration/abundance profiles, the edges of the single sample networks were built using ssPCC, the edges of the single sample networks were built using LIONESS-S, and the edges of the single sample networks were built using LIONESS-D. Classification accuracies are given in Table . Regarding the comparison between NSTI and controls, the use of the single sample network edges increases the accuracy (up to 100%), although the use of simple concentrations gives excellent classification (94.7%). The logic behind the use of single sample network edges is that additional information is contained in the relationships (or disruption thereof) among (pairs of) metabolites that is contained in, or is additional to, the metabolite levels and thus better accuracy should be, in principle, obtained. However, this is not always the case. For the comparison between mono versus poly microbial infection and Streptococcus versus Staphylococcus infection, the use of single sample network results in decreased accuracy. This may well depend on the limited sample size used to build the aggregated/reference networks that can lead to instability in the estimation of these networks[12] and, as a consequence, low-quality estimation of the single sample networks. We also compared the edge and metabolite importance in the Random Forest models using the mean decrease Gini index as the importance measure. The top metabolites and edges from the models are shown in Table . The metabolites whose edges mostly contribute to the separation between NSTI and controls tend to be the metabolites whose concentration is also different between the two groups, confirming that the single sample edges bear content of both concentration and pairwise relationships as also discussed in the section ssPCC: Single Sample Network Based on Pearson’s Correlation.

Table 5

Random Forest Classification of NSTI Patients and Controls Using Metabolite Concentrations and ssPCC and LIONESS Single-Sample Network Edgesa

	Concentrations	ssPCC edges	LIONESS-S edges	LIONESS-D edges
1	1-palmitoyl sn-glycero-3-phosphocholine	1-stearoyl sn-glycero-3-phosphocholine & 1-palmitoyl sn-glycero-3-phosphocholine	lidocaine tms & glutaric acid	alpha tocopherol & RI 2997 m/z 184 1 m/z184
2	1-stearoyl sn-glycero-3-phosphocholine	RI 2997 m/z 184 1 m/z184 & RI 2354 m/z 290 1 m/z522	Naproxen tms & Lidocaine tms	RI 2997 m/z 184 1 m/z184 & maltose meox1
3	RI 2997 m/z 184 1 m/z184	1-palmitoyl sn-glycero-3-phosphocholine & RI 2354 m/z 290 1 m/z522	pseudo uridine penta & threonic acid	RI 2997 m/z 184 1 m/z184 & RI 2354 m/z 290 1 m/z522
4	RI 1416 m/z 218 1 m/z162	1-palmitoyl sn-glycero-3-phosphocholine & Citric acid	cystathionine & lidocaine tms	RI 2354 m/z 290 1 m/z522 & tryptophan
5	isomaltose meox	1-stearoyl sn-glycero-3-phosphocholine & RI 2354 m/z 290 1 m/z522	naproxen tms & glycine	1-stearoyl sn-glycero-3-phosphocholine & 1-palmitoyl sn-glycero-3-phosphocholine
6	pibose meox	1-palmitoyl sn-glycero-3-phosphocholine & RI 2997 m/z 184 1 m/z184	naproxen tms & glutamine	1-palmitoyl sn-glycero-3-phosphocholine & RI 2354 m/z 290 1 m/z522
7	citric acid	1-stearoyl sn-glycero-3-phosphocholine & citric acid	cystathionine & glutaric acid	RI 2354 m/z 290 1 m/z522 & RI 1416 m/z 218 1 m/z162
8	tryptophan	alpha tocopherol & tryptophan	naproxen tms & glutaric acid	RI 2354 m/z 290 1 m/z522 & citric acid
9	alpha tocopherol	1-stearoyl sn-glycero-3-phosphocholine & tryptophan	octadecenoic acid-9-(z) & asparagine-2	alpha tocopherol–tryptophan
10	RI 2354 m/z 290 1 m/z522	tryptophan & RI 1416 m/z 218 1 m/z162	naproxen tms & lysine	RI 1416 m/z 218 1 m/z162 & glyceric acid

The top 10 metabolites and metabolite–metabolite edges are shown in decreasing order of importance (given by the Mean Decrease Gini Index).

Conclusions

In this study, we investigated and assessed the utility of two methods for the inference of single sample networks in a metabolomics context: LIONESS (Linear Interpolation to Obtain Network Estimates for Single Samples)[14] and ssPCC (single sample network based on Pearson’s correlation).[13] The two methods are functionally related and when compared on simulated data with different correlative properties, we found both methods to have limited ability to describe different situations. However, on data from a NF-kB dynamic metabolic model, we found that only the ssPCC single sample networks are able to describe different situations arising from perturbations of the model. We found the statistical procedure proposed in ssPCC to have limited power and to be heavily dependent on the particular reference networks and of little utility for practical applications. We explore the potential of single sample edges to be used in place of concentration to discriminate between groups in a case–control scenario. To this scope, we used two different implementations of LIONESS and proposed a work-around to adapt ssPCC to this scenario. Using Random Forest as a classification algorithm, we found that in 13 cases out of 22 the use of ssPCC edges gives better (or equal) classification accuracy than (to) the use of actual metabolite concentrations, while, overall, the use of LIONESS sample resulted in worse prediction models. We found that single sample networks built off of a control group (like ssPCC and LIONESS-S) yield better results than those built off of similar samples in a classification setting; however, this approach does not allow for generalization and as such should be used as an exploratory tool. We finally applied the two methods to analyze a metabolomics study pertaining to NSTIs. NSTI-specific single sample networks obtained with the two methods are markedly different and are likely to describe different ongoing biological processes. We found that single sample edges, either from ssPCC or LIONESS, gave better prediction results in distinguishing between NSTI patients and controls but not in other comparisons aimed to distinguish between disease etiology. In general, ssPCC edges found to be important in discriminating groups involve metabolite pairs that are found important when comparing groups with a standard t-test performed on concentrations. There are still some aspects of single sample networks to be elucidated, the most important being a proper statistical validation for both ssPCC and LIONESS. We used both methods taking Pearson’s correlation as an association measure and building networks using the standard correlation matrix, but the two methods are general. Other measures of association can be used, such as partial correlation, as proposed by the authors of ssPCC,[38] or mutual information, although the latter measure does not give any improvements, at least for what concerns metabolomics data.[19] In addition, more sophisticated approaches can be used for network inference as proposed by Kuijjer et al.(14) Overall, we think single sample networks seem to be a promising tool for the analysis of metabolomics data. For instance, we have shown that single sample edges provide more information than simple concentrations for pathway analysis since they encode both levels and relationships between metabolites. Single sample networks can be used as exploratory tools and can be paired with standard univariate and multivariate tools.

41 in total

Review 1. Network biology: understanding the cell's functional organization.

Authors: Albert-László Barabási; Zoltán N Oltvai
Journal: Nat Rev Genet Date: 2004-02 Impact factor: 53.242

2. MissForest--non-parametric missing value imputation for mixed-type data.

Authors: Daniel J Stekhoven; Peter Bühlmann
Journal: Bioinformatics Date: 2011-10-28 Impact factor: 6.937

3. Effects of Sample Size and Dimensionality on the Performance of Four Algorithms for Inference of Association Networks in Metabonomics.

Authors: Maria Suarez-Diez; Edoardo Saccenti
Journal: J Proteome Res Date: 2015-10-30 Impact factor: 4.466

Review 4. Interactome networks and human disease.

Authors: Marc Vidal; Michael E Cusick; Albert-László Barabási
Journal: Cell Date: 2011-03-18 Impact factor: 41.582

5. Predictors of mortality and limb loss in necrotizing soft tissue infections.

Authors: Daniel A Anaya; Kerry McMahon; Avery B Nathens; Stephen R Sullivan; Hugh Foy; Eileen Bulger
Journal: Arch Surg Date: 2005-02

6. Serum metabolomic profiles associated with postmenopausal hormone use.

Authors: Victoria L Stevens; Ying Wang; Brian D Carter; Mia M Gaudet; Susan M Gapstur
Journal: Metabolomics Date: 2018-07-06 Impact factor: 4.290

Review 7. A census of human cancer genes.

Authors: P Andrew Futreal; Lachlan Coin; Mhairi Marshall; Thomas Down; Timothy Hubbard; Richard Wooster; Nazneen Rahman; Michael R Stratton
Journal: Nat Rev Cancer Date: 2004-03 Impact factor: 60.716

8. Distinct signatures of dental plaque metabolic byproducts dictated by periodontal inflammatory status.

Authors: Akito Sakanaka; Masae Kuboniwa; Ei Hashino; Takeshi Bamba; Eiichiro Fukusaki; Atsuo Amano
Journal: Sci Rep Date: 2017-02-21 Impact factor: 4.379

Review 9. From correlation to causation: analysis of metabolomics data using systems biology approaches.

Authors: Antonio Rosato; Leonardo Tenori; Marta Cascante; Pedro Ramon De Atauri Carulla; Vitor A P Martins Dos Santos; Edoardo Saccenti
Journal: Metabolomics Date: 2018-02-27 Impact factor: 4.290

10. On the Use of Correlation and MI as a Measure of Metabolite-Metabolite Association for Network Differential Connectivity Analysis.

Authors: Sanjeevan Jahagirdar; Edoardo Saccenti
Journal: Metabolites Date: 2020-04-24

5 in total

1. Analysis of host-pathogen gene association networks reveals patient-specific response to streptococcal and polymicrobial necrotising soft tissue infections.

Authors: Sanjeevan Jahagirdar; Lorna Morris; Nirupama Benis; Oddvar Oppegaard; Mattias Svenson; Ole Hyldegaard; Steinar Skrede; Anna Norrby-Teglund; Vitor A P Martins Dos Santos; Edoardo Saccenti
Journal: BMC Med Date: 2022-05-04 Impact factor: 11.150

Review 2. New software tools, databases, and resources in metabolomics: updates from 2020.

Authors: Biswapriya B Misra
Journal: Metabolomics Date: 2021-05-11 Impact factor: 4.290

3. Analysis of metabolic dynamics during drought stress in Arabidopsis plants.

Authors: Fidel Lozano-Elena; Norma Fàbregas; Veredas Coleto-Alcudia; Ana I Caño-Delgado
Journal: Sci Data Date: 2022-03-21 Impact factor: 6.444

4. Exploration of Blood Lipoprotein and Lipid Fraction Profiles in Healthy Subjects through Integrated Univariate, Multivariate, and Network Analysis Reveals Association of Lipase Activity and Cholesterol Esterification with Sex and Age.

Authors: Yasmijn Balder; Alessia Vignoli; Leonardo Tenori; Claudio Luchinat; Edoardo Saccenti
Journal: Metabolites Date: 2021-05-18

Review 5. Defining Blood Plasma and Serum Metabolome by GC-MS.

Authors: Olga Kiseleva; Ilya Kurbatov; Ekaterina Ilgisonis; Ekaterina Poverennaya
Journal: Metabolites Date: 2021-12-24

5 in total