Literature DB >> 28823158

Improving Visualization and Interpretation of Metabolome-Wide Association Studies: An Application in a Population-Based Cohort Using Untargeted ¹H NMR Metabolic Profiling.

Raphaële Castagné, Claire Laurence Boulangé¹, Ibrahim Karaman, Gianluca Campanella, Diana L Santos Ferreira, Manuja R Kaluarachchi¹, Benjamin Lehne, Alireza Moayyeri², Matthew R Lewis¹, Konstantina Spagou¹, Anthony C Dona¹, Vangelis Evangelos³, Russell Tracy⁴, Philip Greenland⁵, John C Lindon^1,6, David Herrington⁷, Timothy M D Ebbels^1,6, Paul Elliott, Ioanna Tzoulaki, Marc Chadeau-Hyam.

Abstract

1H NMR spectroscopy of biofluids generates reproducible data allowing detection and quantification of small molecules in large population cohorts. Statistical models to analyze such data are now well-established, and the use of univariate metabolome wide association studies (MWAS) investigating the spectral features separately has emerged as a computationally efficient and interpretable alternative to multivariate models. The MWAS rely on the accurate estimation of a metabolome wide significance level (MWSL) to be applied to control the family wise error rate. Subsequent interpretation requires efficient visualization and formal feature annotation, which, in-turn, call for efficient prioritization of spectral variables of interest. Using human serum 1H NMR spectroscopic profiles from 3948 participants from the Multi-Ethnic Study of Atherosclerosis (MESA), we have performed a series of MWAS for serum levels of glucose. We first propose an extension of the conventional MWSL that yields stable estimates of the MWSL across the different model parameterizations and distributional features of the outcome. We propose both efficient visualization methods and a strategy based on subsampling and internal validation to prioritize the associations. Our work proposes and illustrates practical and scalable solutions to facilitate the implementation of the MWAS approach and improve interpretation in large cohort studies.

Entities: Chemical

Keywords: MESA; cohort studies; full resolution 1H NMR; high-throughput analysis; metabolic profiling; metabolome wide association study; molecular epidemiology; multiple testing correction; results visualization and prioritization; significance level

Mesh：

Substances：
Blood Glucose

Year: 2017 PMID： 28823158 PMCID： PMC5633829 DOI： 10.1021/acs.jproteome.7b00344

Source DB: PubMed Journal: J Proteome Res ISSN： 1535-3893 Impact factor: 4.466

Introduction

Over the past 15 years, improvements in high-throughput technologies have accelerated the simultaneous measurement of large numbers of metabolites in a single sample using NMR and mass spectrometry (MS), which are both widely used to characterize a biofluid using either untargeted or targeted profiling approaches.[1,2] Both technologies have emerged as efficient tools for identifying biomarkers of exposures as well as early disease manifestations, hence informing on molecular mechanisms involved in pathogenesis (e.g., in cancer, diabetes, cardiovascular, and neurological diseases).[3−6] Metabolic phenotyping uses robust and reliable analytical methods[7,8] that are ideally suited for untargeted profiling since prior knowledge about compounds present in a sample is not required. Such an agnostic discovery approach can inform complementary targeted methods by identifying novel chemical compounds that may contribute to the molecular pathways involved in complex phenotypes.[8−10] These screening exercises call upon the efficient analyses of high-dimensional data whose exploration poses complex methodological and interpretation problems.[11] Typically, untargeted NMR experiments in a large population study generate tens of thousands of data points for thousands of individuals. By adopting univariate approaches, the concept of the metabolome wide association study (MWAS)[12] has been proposed to analyze these data, and various statistical approaches to perform such analyses were established and have been explored and reviewed.[13−17] The complex correlation structures existing in metabolic profiles result in partially redundant signals across the metabolic spectra. These need to be accounted for while correcting for multiple testing, for instance by using a permutation-based procedure to derive the metabolome-wide significance level (MWSL) controlling the family wise error rate (FWER).[13] However, the comparative applicability of these approaches, as well as the visualization of the results they produce, has not been comprehensively explored. Here we provide, using a real-life example from the COMBI-BIO project, an extension of the MWAS methodology we initially developed using simulated data sets in a class discrimination context[13] and further extended to accommodate continuous outcomes. The current extension includes the computation of a stable and reproducible statistical significance threshold and proposes ways to estimate consistently across different types of NMR spectra as well as an internal validation procedure to assess the robustness of candidate associations. Our data set comprises two versions of 1H NMR nontargeted metabolic profiles in 3948 participants from the Multiethnic Study of Atherosclerosis (MESA) cohort.[18] As a proof-of-principle example, we focus on identifying the NMR spectral features associated with fasting blood serum glucose, which was measured by an independent technique (glucose oxidase method). By examining the full NMR spectrum, our analysis was designed to identify spectral regions corresponding to other metabolites beyond glucose itself that also vary with fasting serum glucose (1H NMR glucose associated peaks).

Materials and Methods

Study Population and Sample Selection

The MESA cohort has been described elsewhere[18] and includes 6814 participants (53% females, 47% males) aged 44–84 years (mean = 62 years) from four different ethnic groups: Chinese-American (n = 803), African-American (n = 1893), Hispanic (n = 1496), and Caucasian (n = 2622), all recruited between 2000 and 2002 at clinical centers in the United States. Participants were free of symptomatic cardiovascular disease at baseline, and demographic, medical history, anthropometric, lifestyle data, and serum samples were collected during the first examination (July 2000–August 2002), together with information on lipid or blood pressure treatment, and diabetes, and measures of systolic blood pressure. Serum samples were stored at −80 °C after collection. At enrolment, high density lipoprotein (HDL-C) was measured in EDTA plasma on the Roche/Hitachi 911 Automatic Analyzer (Roche Diagnostics Corporation, Indianapolis, IN), and low density lipoprotein (LDL-C) was calculated using the Friedewald equation,[19] together with fasting serum glucose using the glucose oxidase method on the Vitros analyzer (Johnson and Johnson Clinical Diagnostics). Ethical approval was obtained by local ethical review boards, and subsequent analysis was conducted in full accordance with the ethical approval obtained.

Samples Preparation and 1H NMR Spectroscopic Acquisition

The full sample preparation and quality control procedure have been extensively described elsewhere.[20] Briefly, serum samples were thawed on the day prior to analysis, and 300 μL of each sample was mixed with 300 μL of phosphate buffer (NaHPO4, 0.075M, pH = 7.4). Samples were processed in two phases (each corresponding to a separate analytical batch). Eppendorfs were used for phase 1, while 96-well plates were used for phase 2. After centrifugation (12 000g at 4 °C for 5 min), 550 μL of each sample-buffer mixture was manually transferred into SampleJet 5 mm diameter NMR tubes and kept at 4 °C until analysis. Different types of quality control (QC) samples were used for each phase as described elsewhere.[20] All QC pools were aliquoted in 350 μL and stored at −80 °C prior to analysis. 1H NMR spectra were acquired using a Bruker DRX600 spectrometer (Bruker Biospin, Rheinstetten, Germany) operating at 600 MHz. A standard water suppressed one-dimensional spectrum (usually termed NOESY) and a Carr–Purcell–Meiboom–Gill (CPMG) spectrum were obtained for each sample.[20]

1H NMR Metabolite Profiling

The metabolite profiling workflow has been detailed elsewhere.[20] For each biosample, two NMR profiles were generated: (i) a 1-D spectrum (NOESY) showing resonances from all proton-containing molecules in the sample, including broad, largely undefined bands from serum proteins, sharper and well-defined bands from serum lipoproteins (with some classification into their main groups), and sharp peaks from a range of small molecule metabolites such as amino acids, simple carbohydrates, organic acids, organic bases, and a number of osmolytes, and (ii) a Carr–Purcell–Meiboom–Gill (CPMG) spectrum that attenuates the peaks from the macromolecules and allows better definition of the small molecules. For both CPMG and NOESY NMR data, in-house written MATLAB (Mathworks Inc., USA) routines were utilized for phasing and baseline correction. Prior to spectral peak alignment, the region δ 4.400–5.100 corresponding to the H2O resonance was removed. Spectral peak alignment was performed by the Recursive Segment-wise Peak Alignment (RSPA) algorithm.[21] Regions where the peaks of different suspected contaminations (i.e., methanol) occurred were removed from the whole spectra (δ 1.180–1.240, δ 2.244–2.261 and δ 3.660–3.710). The remaining spectral regions were normalized by probabilistic quotient normalization using the median spectrum as the reference.[22] The normalized high resolution spectra contained 30 590 data points (variables) for both CPMG and NOESY data sets. To ensure comparability across batches (i.e., across studies and phases), each variable was mean-centered[23] (we will hereafter refer to this as mean corrected intensities). Score plots of the first few principal components for the resulting data set were finally used to identify and remove potential outlier samples (N = 3 for the CPMG and N = 4 for the NOESY data) from further analyses. The MESA study population was further examined for potential outliers using score plots from the first two principal components (Figure S-1), which showed strong homogeneity in the study participants.

Metabolite Assignments

Selected samples were analyzed using a range of 2-D NMR experiments such as total correlation spectroscopy (TOCSY) or heteronuclear single quantum coherence (HSQC) to aid molecular identification. We used approaches such as statistical total correlation spectroscopy (STOCSY) and STORM to help constrain possible molecular structures.[24,25] Peak identification in the 1H NMR data was supported by a semiautomatic clustering of the full resolution 1H NMR spectra (30 590 data points) using statistical recoupling of variables,[26] where the algorithm defines a cluster as containing 10 or more variables. Each cluster is subsequently checked by NMR experts to improve the data point grouping and identify peak overlaps. From our data, 136 and 159 clusters were identified in 1H NMR NOESY and CPMG data, respectively, each of them corresponding to a single or a group of peaks. Resulting spectral information was also compared to the literature[27,28] and to existing databases such as the Human Metabolome Database[29] (HMDB, http://www.hmdb.ca/) and the Biological Magnetic Resonance Data Bank[30] (BMRB, http://www.bmrb.wisc.edu). Resulting sets of NMR features were ultimately confirmed through spiking experiments using commercial standards. The level of assignment (LoA) we used was adapted from Sumner et al.[31] NMR features were also characterized by their peak multiplicity: singlet (s), doublet (d), triplet (t), quartet (q), doublet of doublet (dd), multiplet (m), broad peak (b), and noise (n). Ninety-two clusters were assigned to 44 unique metabolites for the NOESY data set, and 91 clusters were assigned to 48 unique metabolites for the CPMG data set.

Metabolome-Wide Association Study (MWAS)

Figure presents our analytical workflow. We adopted a MWAS approach using univariate linear regression models to systematically screen the 30 590 data points assayed in both NOESY and CPMG profiles.

Figure 1

Analytical workflow.

Analytical workflow. For a given data point, the linear model can be formulated aswhere X represents the normalized NMR intensity, Y is the outcome variable; here the log10 transformed blood glucose concentration and ε is the error term for an observation i. α is the intercept of the model, and ß1 measures how strongly each data point influences the outcome variable. FE is a vector of fixed effect observations for individual i and the vector B2 compiles the regression coefficients for each adjustment covariate; for model 1, age, gender, phase, and ethnicity, and for model 2, we additionally correct for LDL and HDL cholesterol, lipids and blood pressure treatment, systolic blood pressure, smoking status, and diabetes. Most of the recently published MWAs adopted a Bonferroni correction for multiple testing,[32−34] which ignores correlation across variables and therefore does not account for the (partial) redundancy across the statistical test performed. This may lead to an overly conservative multiple testing correction. To take into account the high degree of correlation in spectral data and prevent overcorrection for multiple testing, one computationally efficient method relies on the estimation of the effective number of tests (ENT), as defined by the virtual number of independent tests that are performed across the actual p tests performed. ENT measures the level of correlation within the spectral data and can be estimated through spectral decomposition of the correlation matrix of predictors.[35] From this estimate, the per-test significance level ensuring a FWER control can be defined as the Bonferroni-corrected threshold corresponding to that number of independent tests. However, this approach remains of limited use in real-life metabolomic data sets, notably because, for numerical reasons, the ENT is upper-bounded by the number of observations.[36] As a scalable alternative, we used a permutation-based method to estimate the metabolome wide significance level (MWSL or α′)[13] in which the outcome (glucose levels) is randomly shuffled across observations. Each permutation mimics the null hypothesis of no association, and we performed for each permuted data set a MWAS using the linear regression model described above. In that setting, and for a given permuted data set, the minimum p-value across all variables (denoted q) represents the largest significance level to be considered to ensure no false positive findings, and the MWSL α′ controlling the FWER at a level α can be derived from the distribution of q across the N (set here to 10 000) permutations. The effective number of tests (ENT) is then defined as the number of independent tests that would be required to obtain the same significance level using a Bonferroni correction: ENT= α/α′ and measures the level of correlation across the p tests performed. To assess the robustness of the ENT estimates and to circumvent the strong assumptions from the generalized linear model on data structure (independence of each data points, distribution of the residuals, variance structure, and linear relationship between response and predictors), we ran our permutation-based procedure for both data sets (NOESY and CPMG) for both models (1 and 2) and used different transformations of the glucose distribution: raw concentrations, truncated concentrations (129 outlying observations with more than 2 standard deviations away from the mean glucose level were discarded), and log10-transformed glucose levels (Figure S-2). To formally assess the sensitivity of our MWSL estimates to distributional features of the outcome, we also simulated continuous responses from gamma and several Gaussian distributions and compared resulting MWSL estimates to those obtained using measured glucose levels.

Sensitivity, Stability Analyses, and Results Prioritization

We performed further sensitivity analyses to assess the stability of the candidate associations we identified. These included a cross-validation procedure based on the independent subsampling (N = 100 times) of discovery (containing 80% of the observations) and replication (comprising the remaining 20% observation) data sets. For each 80:20 discovery and replication split, we performed a MWAS and used the MWSL to identify the candidate associations in the discovery set. For each discovery set (80% of the observations), the number of independent signals among the candidate associations was approximated by the number of principal components needed to explain more than 99% of their variance. That number of PCs was then used to compute the Bonferroni corrected MWSL used in the replication set.[37] Our strategy could be summarized as follows for a given split: MWAS: identify (N0) candidate associations in the discovery set with discovery p-value < MWSL. Estimate the number of independent signals these N0 correspond to run a PCA on the XN0, the matrix combining the N0 data points declared significant in the discovery set (a random 80% subsample or the full study population), and identify NPC, the number of PC’s needed to explain more than 99% of the variance in XN0. Replication: identify from the candidate signals (step 1) those replicating in the 20% replication set at a Bonferroni-corrected significance level accounting for NPC tests, α/ NPC, setting α = 5%. Steps 1–3 are repeated across the 100 independent splits and the proportion a given signal is identified in the discovery set, and replicated in the validation set is reported. Statistical analyses were all performed using R v3.1.2.[38]

Results and Discussion

A total of 3948 individuals from the MESA cohort were included in the analysis, and for each individual, 30 590 serum NMR features were measured for both NOESY and CPMG spectra. The characteristics of the study population are summarized in Table .

Table 1

Summary Characteristics of the Study Population: Multiethnic Study of Atherosclerosis Cohort

		N	% or mean (sd)
gender	men	1951	49.4
gender	women	1997	50.6
age (y)	all	3948	62.9 (10.3)
phase	1	1976	50
phase	2	1972	50
ethnicity	Caucasian	1521	38.5
	Hispanic	926	23.4
	African-American	968	24.5
	Chinese-American	533	13.5
body mass index (kg/m²)	all	3948	28.2 (5.4)
glucose (mg/dL)	all	3945	98.3 (31.1)
glucose (mg/dL)	missing	3
LDL cholesterol (mg/dL)	all	3884	117.3 (31.4)
LDL cholesterol (mg/dL)	missing	64
HDL cholesterol (mg/dL)	all	3942	50.7 (14.7)
HDL cholesterol (mg/dL)	missing	6
systolic blood pressure (mmHg)	all	3948	127.1 (21.3)
height (cm)	all	3948	166.4 (10.2)
diabetes	no	3387	85.8
diabetes	yes	561	14.2
lipids treatment	no	3286	83.2
lipids treatment	yes	662	16.8
blood pressure treatment	no	2449	62.1
blood pressure treatment	yes	1497	37.9
smoking	never	1988	50.4
	former	483	12.2
	current	1461	37
	missing	16	0.4

We first explored the sensitivity of the MWSL estimates to (i) the parametrization of the statistical model, (ii) the type of NMR data under investigation, and (iii) distributional features of the outcome of interest. As summarized in Table , we then ran the permutation procedure for two different models (Models 1 and 2), for both types of NMR data sets (NOESY and CPMG) and 3 versions of glucose blood concentrations: raw, log10 transformed and truncated (removing 129 outlying observations which were outside the 2 standard deviation range from the mean glucose level). Across all models investigated, MWSL estimates for CPMG are more stringent than for NOESY spectra and, correspondingly, ENT estimates for CPMG are much greater than those for NOESY data (ranging from 17 610 to 122 460 and from 3680 to 17 010 for CPMG and NOESY, respectively). This suggests stronger correlations within NOESY data, which is plausible given that NOESY NMR spectra contain stronger broad peaks from proteins and lipoproteins than CPMG spectra, such that data point intensities are highly correlated. As summarized in Table , irrespective of the type of spectrum, MWSL estimates (and corresponding ENT) seem to be only marginally affected by the number of confounders included in the model (i.e., comparing models 1 and 2). However, a systematic but moderate increase in the ENT is observed for the fully adjusted model (Model 2). This can be explained by the fact that the fully adjusted model removes spectral signals relating to the components of the Framingham risk scores (such as the conventionally measured lipoprotein levels), and these confounders were highly likely to be driving correlations across some data points.

Table 2

Significance Threshold α′ and Effective Number of Test (ENT) Based on a Bonferroni Correctionab

				NOESY (30 590 variables)		CPMG (30 590 variables)
phenotype	N	model		MWSL (FWER = 5%)		MWSL (FWER = 5%)
glucose	3945	1	α′ (×10^–5)	0.31 (0.26; 0.35)		0.04 (0.03; 0.05)
	3945	1	ENT (×10³)	16.33 (14.42; 19.51)	53.39	122.46 (108.67; 143.67)	400.32
	3866	2	α′ (×10^–5)	0.29 (0.25; 0.32)		0.04 (0.04; 0.05)
	3866	2	ENT (×10³)	17.01 (15.57; 19.88)	55.60	121.64 (107.21; 142.23)	397.64
log₁₀(glucose)	3945	1	α′ (×10^–5)	1.09 (0.96; 1.17)		0.22 (0.20; 0.23)
	3945	1	ENT (×10³)	4.59 (4.28; 5.22)	15.01	23.1 (21.7; 24.4)	75.37
	3866	2	α′ (×10^–5)	1.05 (0.96; 1.13)		0.21 (0.19; 0.22)
	3866	2	ENT (×10³)	4.75 (4.43; 5.02)	15.52	23.6 (22.7; 26.2)	77.07
glucose without outliers	3816	1	α′ (×10^–5)	1.36 (1.25; 1.45)		0.28 (0.26; 0.29)
	3816	1	ENT (×10³)	3.68 (3.45; 4.01)	12.02	18.02 (17.36; 19.07)	58.92
	3743	2	α′ (×10^–5)	1.06 (1.00; 1.10)		0.28 (0.27; 0.30)
	3743	2	ENT (×10³)	4.70 (4.53; 4.99)	15.36	17.61 (16.74; 18.74)	57.57

Bold figures are the ratios of effective/actual number of tests.

95% confidence intervals are given in parentheses. Figures are based on 10 000 permutations for each model (1 and 2) and given for the glucose, log10(glucose) and the glucose after outliers exclusion (N = 129 excluded). Bold figures are the ratios of effective/actual number of tests. For CPMG data, estimates of the MWSL using raw glucose concentrations led to an estimated ENT greater than the actual number of tests (ratio around 4.00). This might be attributed to the parametric assumption in generalized linear regression model (e.g., normality of the outcome and equal variance) underlying our permutation procedure being violated. This is supported by the distribution of the blood levels of glucose, which is right skewed, with several outlying observations (Figure S-2A). To account for this asymmetrical distribution, we first applied a log10 transformation to glucose levels (Figure S-2B). Although the transformed distribution remains right skewed, corresponding estimates of the MWSL were less stringent, and the effective to actual number of tests ratio dropped to around 0.75 for CPMG (and to 0.15 for NOESY). To remove the influence of outlying observations, we truncated the glucose distribution and removed observations more than 2 standard deviations away from the mean glucose level (Figure S-2C). While only a small number of observations were discarded (N = 129), this removal strongly impacted the MWSL estimates for CPMG data, and the effective to actual number of test ratio dropped to less than 60% for both models. These results suggest that MWSL estimates are sensitive to the shape of the distribution of the continuous outcome under investigation, and are specifically affected by both the relative weight of its tail, and by the presence of outlying observations. To formally assess the sensitivity of our MWSL estimates to the parametric form of the response variable, we ran a series of sensitivity analyses where we randomly sampled for each participant (and for each permutation) the glucose levels from (i) a Gamma distribution fitted on the measured glucose levels (shape = 15.90, scale = 6.18), and (ii) several Gaussian distributions (mean = 0, sd = 1; mean = 0, sd = 10; mean = 0, sd = 100, mean = mean(glucose), sd = sd(glucose). By construction, the Gamma-distributed response did not include outlying observations, but featured an inflated right tail, which provided less stringent MWSL for both NOESY. Results from the normally distributed outcomes showed consistent MWSL estimates and did not seem to be strongly affected by the parameters choices defining the Gaussian distribution (Table S-1, Figure S-3). Since the numbers of associated variables were only marginally affected by the way the ENT was computed in our example, we took forward the MWSL estimated from the Gaussian simulated outcome (mean = 0, sd = 1), which also seemed to provide the best balance between generalizability and simplicity (Figures S-3 and S-4). We further compared the proportion of associated variables from different multiple testing correction strategies including Bonferroni and FWER control, and Benjamini-Hochberg[39] false discovery rate (BH-FDR) procedure (Figure S-5). In all scenario considered, the largest proportion of associated variables was always observed when using a BH-FDR procedure, and the smallest proportion of associated variables was always observed when using a Bonferroni procedure (Figure S-5).

Metabolome Wide Association Study of Glucose: Visualization and Prioritization

We performed the MWAS of blood levels of glucose on both CPMG and NOESY spectra using both models and setting the MWSL to that estimated using the Gaussian simulated outcome (mean = 0, sd = 1, Table S-1). As expected, irrespective of the model and of the type of spectrum, a very large number of spectral features were found associated with blood concentration of glucose (Table ): for NOESY data, 72% and 44% of the spectral variables were found significantly associated with glucose level for models 1 and 2, respectively. These proportions were 40% and 16% for models 1 and 2, respectively, in CPMG data.

Table 3

Number and Percentage of Associated Variables for Models 1 and 2 for Both NOESY and CPMG by Class of Metabolitea

		NOESY			CPMG
model		number of data points per group(%)	number of significant data points (%)	number of significant data points after results prioritization (%)	number of data points per group (%)	number of significant data points (%)	number of significant data points after results prioritization (%)
1	allb	30 590 (100)	22 066 (72)	18 340 (83)	30 590 (100)	12 303 (40)	9920 (81)
	amino-acidsc	4786 (15.65)	3653 (76.3)	3222 (88.2)	4524 (14.79)	2780 (61.5)	2078 (74.7)
	carbohydratesc	2394 (7.83)	2204 (92.1)	2192 (99.5)	1826 (5.97)	1763 (96.5)	1734 (98.4)
	drug derivativesc	191 (0.62)	191 (100)	116 (60.7)	81 (0.26)	30 (37)	26 (86.7)
	lipidsc	4085 (13.35)	2815 (68.9)	2569 (91.3)	3906 (12.77)	2772 (71)	2530 (91.3)
	nucleosidesc	116 (0.38)	41 (35.3)	35 (85.4)	103 (0.34)	13 (12.6)	1 (7.7)
	organic acidsc	1383 (4.52)	949 (68.6)	838 (88.3)	864 (2.82)	390 (45.1)	266 (68.2)
	othersc	377 (1.23)	229 (60.7)	173 (75.5)	363 (1.19)	261 (71.9)	237 (90.8)
	proteinsc	549 (1.79)	542 (98.7)	500 (92.3)	499 (1.63)	499 (100)	494 (99)
	unassigned	15 978 (52.23)	11 081 (69.4)	8472 (76.5)	10 315 (33.72)	3414 (33.1)	2252 (66)
2	allb	30 590 (100)	13 449 (44)	9610 (71)	30 590 (100)	4909 (16)	3355 (68)
	amino-acidsc	4786 (15.65)	3123 (65.3)	2168 (69.4)	4524 (14.79)	1048 (23.2)	602 (57.4)
	carbohydratesc	2394 (7.83)	2079 (86.8)	1906 (91.7)	1826 (5.97)	1695 (92.8)	1653 (97.5)
	drug derivativesc	191 (0.62)	65 (34)	36 (55.4)	81 (0.26)	24 (29.6)	17 (70.8)
	lipidsc	4085 (13.35)	1172 (28.7)	176 (15)	3906 (12.77)	263 (6.7)	116 (44.1)
	nucleosidesc	116 (0.38)	7 (6)	0 (0)	103 (0.34)	0(−)	0(−)
	organic acidsc	1383 (4.52)	798 (57.7)	462 (57.9)	864 (2.82)	114 (13.2)	18 (15.8)
	othersc	377 (1.23)	112 (29.7)	58 (51.8)	363 (1.19)	87 (24)	35 (40.2)
	proteinsc	549 (1.79)	529 (96.4)	400 (75.6)	499 (1.63)	407 (81.6)	352 (86.5)
	unassigned	15 978 (52.23)	5421 (33.9)	4317 (79.6)	10 315 (33.72)	1067 (10.3)	413 (38.7)

Results are also given after results prioritization: variables identified in the discovery set and replicated in the validation set in at least one split across the 100 splits (see Methods).

Figures are given for the whole spectra (N = 30 590 variables) including the unassigned regions.

Figures are based on the achieved NMR assignment: not all variables have been assigned in the spectra.

Results are also given after results prioritization: variables identified in the discovery set and replicated in the validation set in at least one split across the 100 splits (see Methods). Figures are given for the whole spectra (N = 30 590 variables) including the unassigned regions. Figures are based on the achieved NMR assignment: not all variables have been assigned in the spectra. Analyses by classes of metabolite for model 1 revealed that nearly all variables assigned to drug derivative (100%), proteins (98.7%), and carbohydrates (92.1%) were associated with glucose in NOESY. Similarly, all variables assigned to proteins (100%), carbohydrates (96.5%), and others (71.9%, which include choline and glycerol) were associated with glucose in CPMG. The proportion of unassigned associated variables was higher for NOESY (69.4%) compared to CPMG (33.9%). Adjusting for Framingham risk score (FRS) variables in model 2 reduced the total number of associations for both NOESY and CPMG spectra. This reduction was mostly observed for the lipids class (NOESY, 68.9% vs 28.7%; CPMG, 71% vs 6.7%) and the “others” class (NOESY, 60.7% vs 29.7%; CPMG, 71.9% vs 24%). As expected, the proportion of carbohydrate associated variables was one of the least affected classes by the FRS adjustment (NOESY, 92.1% vs 86.8%; CPMG, 96.5% vs 92.8%). The utility of the MWAS approach in metabolic phenotyping relies on explicit visualization of the results. One primary output to help identify relevant spectral regions borrows from the field of genome-wide association studies and reports for each spectral variable the–log10 p-value multiplied by the sign of the corresponding regression coefficient. The resulting signed Manhattan plots (Figure and Figure S-6 for CPMG and NOESY, respectively) offer a global view of the spectral regions associated with the outcome of interest, and can be further informed by the annotation of spectral features. Assigned metabolites found associated with conventional serum glucose measurements are reported in Tables S-2 and S-3 for CPMG and NOESY, respectively. For CPMG, 47 unique metabolites were associated with glucose for model 1, of which 34 were still associated after controlling for the FRS variable (44 and 41 for NOESY).

Figure 2

Metabolome wide study of glucose (model 2). This Manhattan plot shows the analysis of the 30 590 CPMG features. The signed negative log10 p-value is plotted against the chemical shift in ppm. To ease the visualization, all log p-value ≤ 10–30 were set to 1 × 10–30. The horizontal dashed line indicates the α′ per-test significance level controlling the FWER at a 5% level using the Gaussian simulated outcome. Data points are colored by class of metabolites. Components were: 1, L1; 2, L2; 3, isoleucine; 4, leucine, isoleucine; 5, leucine; 6, valine; 7, L3; 8, lactate; 9, alanine; 10, L4; 11, arginine; 12, lysine; 13, acetate; 14, L5; 15, acetylglycoproteins; 16, methionine; 17, glutamate; 18, glutamine; 19, L6; 20, 3-hydroxybutyrate; 21, pyruvate; 22, pyroglutamate; 23, citrate; 24, L7; 25, aspartate; 26, albumin; 27, creatine; 28, creatinine; 29, ornithine, tyrosine; 30, ornithine; 31, phenylalanine; 32, tyrosine; 33, choline; 34, beta-glucose; 35, proline; 36, alpha-glucose, beta-glucose; 37, alpha-glucose; 38, glycine; 39, glycerol; 40, mannose; 41, glyceryl groups of lipids; 42, APAP glucuronide; 43, L8; 44, uridine; 45, 1-Methylhistidine; 46, histidine; 47, 3-methylhistidine; 48, formate. High-resolution visualization is also key to enable peak validation and subsequent annotation through the inspection of the shape and multiplicity of the spectral features in the neighborhood of the associated regions. Such visualization could be provided by regional plots as exemplified in Figure , which focuses on the glutamine region ([2.4535–2.5045] ppm). The upper panel represents the high resolution (unsigned) Manhattan plot where the −log10p-value (left Y axis) measuring the strength of association each between peak height and the log10 transformed blood glucose level is represented by a triangle (down pointing for negative associations). For the region under investigation, we define the reference feature as the strongest association (here 2.47175 ppm, p-value <2.10–16, represented in gray) and color-code the pairwise correlation of each spectral variable with this reference. The mean-corrected intensity (i.e., residuals removing the effect of possible confounders: phase and cohort, see Methods) is also represented (right Y axis) for the samples in the fifth (blue line) and 95th (green line) quantiles of the residuals log10 glucose levels after controlling for the FRS variables.

Figure 3

CPMG-model 2 regional association plots with log10 (glucose) for the glutamine. In the upper plot, the −log10 p-value for the features at two regions are shown on each plot. Features are colored based on their correlation with the gray hit that has the smallest p-value in the region. The lines show the mean corrected intensity (i.e., residuals removing the linear effect of the phase and the cohort) in the 5% of samples with high residual glucose in green and 5% of the samples with low residual glucose in blue. The bottom plot shows the mean spectral intensity in MESA phase 1 (plain line) and in MESA phase 2 (dashed line). Green circles indicate the proportion of replication after results prioritization. From this plot, it is clear that there are strong, and exclusively positive correlation coefficients among 1H NMR data points throughout the region as indicated by the color-coded pairwise correlation coefficients (75% are >0.77). As expected, because of the local correlation structures in NMR spectra (caused by the finite peak width and the resonance being split by the J couplings), data points neighboring the reference signal exhibit higher correlation levels. More distant spectral variables may also show strong correlation with the reference peak (e.g., 2.5045, Spearman correlation = 0.26; 2.4535, Spearman correlation = 0.30), and irrespective of their location in the spectrum, intensities that are the most correlated to the reference peak exhibit the strongest associations with glucose levels. This arises because a given metabolite often has several NMR peaks and this procedure offers a way of finding such linked resonances as an aid to molecular annotation. For these variables (dark orange and red triangles), the difference in the mean corrected intensity in participants with the highest and lowest glucose levels are larger compared to the spectral variables less correlated to the reference peak (yellow triangles). To get further insights into the nature (shape and multiplicity) of the variables found associated with glucose we represent the mean spectrum in the bottom panel (left Y axis). Mean spectra are plotted for each of the analytical phases to assess both potential technically induced bias across experimental batches, and possible population heterogeneity across samples assayed in each phase. While the mean spectrum from phase 2 appears to have more marked variables (with higher modes), both data sets yield consistent subspectra in terms of alignment, multiplicity and overall peak shape. To accommodate the expected large number of associations with glucose levels, efficient signal prioritization strategies are key. One established way to identify robust and replicable associations is to seek external replication in an independent data set. However, owing to the specificity of experimental protocols and the nature of the measurements (which are relative and not absolute concentrations), external validation is challenging in metabolic phenotyping, but could however be sought for using either standardized or annotated data. As a workable alternative and complementary approach prioritizing relevant metabolic features, we propose to seek for internal validation and randomly split the study population in a discovery (80% of the full population) and a validation (the remaining 20%) set. The robustness of an association identified in the full population is then quantified by the number of times discovered and replicated over the (N = 100) independent splits. This proportion is represented on the regional plot (green dots on the bottom panel of Figure ). Owing to the strong signals we identify in our glucose MWAS, we report very high replication proportions in glucose-associated regions. As a first approach, prioritized associations were defined as those discovered and validated in at least one split (i.e., with a proportion of replication >0). As illustrated in Table and in Figures and S-7 for CPMG and NOESY, this prioritization strategy reduced the number of significant associations for model 2 by almost 50 and 30%, respectively. As illustrated in Figure , the associated spectral variables are strongly clustered and define clear regions that are associated with the blood levels of glucose. Our prioritization strategy identifies overall the same regions along the spectra but selects preferentially the strongest associated variables within each region and regions that have been assigned. This is even more apparent when more stringent prioritization strategies, for instance, in Figure we plot the signed Manhattan plot for the features discovered and replicated in a least half of the splits (in red).

Figure 4

Comparison of results from the analysis in MESA to those from the 80:20 split strategy. Results are presented for the CPMG (N = 30 590 data points) metabolome wide association study of glucose using model 2. To ease the visualization, all p-value ≤ 10–30 were set to 1 × 10–30. The −log10(p-value) is signed by the direction of the effect size estimate and is plotted against the chemical shift. The horizontal dashed line indicates the per-test significance level controlling the FWER at a 5% level. Variables found from the analyses in MESA are presented in black, those discovered and replicated at least once across the 100 splits are presented in green and those discovered and replicated in 50% of the split are presented in red. Components were: 1, L1; 2, L2; 3, isoleucine; 4, leucine, isoleucine; 5, leucine; 6, valine; 7, L3; 8, lactate; 9, alanine; 10, L4; 11, arginine; 12, lysine; 13, acetate; 14, L5; 15, acetylglycoproteins; 16, methionine; 17, glutamate; 18, glutamine; 19, L6; 20, 3-hydroxybutyrate; 21, pyruvate; 22, pyroglutamate; 23, citrate; 24, L7; 25, aspartate; 26, albumin; 27, creatine; 28, creatinine; 29, ornithine, tyrosine; 30, ornithine; 31, phenylalanine; 32, tyrosine; 33, choline; 34, beta-glucose; 35, proline; 36, alpha-glucose, beta-glucose; 37, alpha-glucose; 38, glycine; 39, glycerol; 40, mannose; 41, glyceryl groups of lipids; 42, APAP glucuronide; 43, L8; 44, uridine; 45, 1-methylhistidine; 46, histidine; 47, 3-methylhistidine; 48, formate.

Conclusion

The identification and interpretation of metabolic features that contribute to a physiological and/or disease-induced outcome from full resolution NMR data is challenging for many reasons related to the dimensionality and complexity of metabolic profiles. The main aim of the present work is to address some of these issues through the development of a robust multiple testing strategy, intuitive visualizations and objective ways to prioritize results. The MWAS approach requires accurate correction for the large number of tests performed, and therefore needs to appropriately account for the strong and complex correlation structures within the NMR spectra. All methods for performing multiple testing correction assume a valid statistical model that captures dependencies in the data. While approaches controlling the FDR usually provide less stringent multiple testing correction, these have been reported to misperform in cases of high correlations among predictors.[40] As an extension of the MWAS approach to accommodate continuous variables in a regression context, we proposed an estimation of the metabolome wide significance level adopting the same permutation strategy and investigated the sensitivity of the estimates to the data structure and to the model parametrization. Our results suggest that in the case of highly correlated variables (spectral variables) that are strongly associated with an outcome (here glucose levels), permutations do not succeed in destroying the predictor-outcome relationship, hence yielding ENT estimates greater than the actual number of tests performed. Sensitivity analyses removing extreme values, and/or log transforming glucose levels showed that outlying observations are driving this unexpected estimate of the ENT. In the presence of strong correlation (i) between blood glucose and most of the assayed metabolites, (ii) among the metabolites, and (iii) between metabolites and adjustment variables, assumptions (e.g., observations exchangeability under the null) on which permutation inferences are based may be violated, and especially in the presence of strong outlying observations. From our data, MWSL estimates appeared robust to model parametrization (i.e., marginally affected by the set of confounding variables considered), but clearly depended on the correlation structure in the data, which are population and platform specific. This suggests that the MWSL should be tailored and re-estimated for each data set. MWSL estimates were also found to be sensitive to the distribution of the outcome and especially in the case of heavy tailed distributions due to the presence of outlying observations. While numerical transformations and truncation could be a way forward, one more general and conservative option could be to calculate the MWSL once using a virtual predictor sampled from a Gaussian distribution (Figure S-4). Using this MWSL approach, we propose extensions of existing visualization tools for the MWAS. This includes full resolution signed Manhattan plots included functional annotation and higher resolution regional plots displaying the per-variable strength of association as well as spectral summary features including correlation patterns across spectral variables. In our proof-of-principle example, we chose glucose as the outcome of interest which defined a challenging context in terms of results interpretability as a very large number of strong associations were identified and the assessment of their relevance went beyond the observed strong positive correlation for the expected glucose peaks along the 1H NMR spectra. This called for the definition of strong result prioritization strategy, which, in less extreme situations, is also critical to identify the most relevant associations that are worth dedicating resources for molecular assignment. Our approach relies on subsampling strategy where discovery (80%)-replication (20%) splits are used to identify associations that internally replicate. This strategy was found to be efficient to prioritize the most robust associations, which were not only those with the strongest p-values. We performed our internal validation at the metabolome-wide level, which was computationally intensive. To scale this approach in real-life studies, one alternative would consist in restricting the discovery-replication split strategy for the MWAS candidates that have been identified in the full population. The overall analytical strategy presented here provides a general framework for the analysis of cohort studies where large number of samples are profiled using untargeted technologies (e.g., high-resolution NMR, mass spectrometry). Overall, we believe the strategy and approach we present are generalizable and scalable and may therefore be relevant to aid the MWAS approach, particularly improving the interpretation of results.

37 in total

1. Statistical total correlation spectroscopy: an exploratory approach for latent biomarker identification from metabolic 1H NMR data sets.

Authors: Olivier Cloarec; Marc-Emmanuel Dumas; Andrew Craig; Richard H Barton; Johan Trygg; Jane Hudson; Christine Blancher; Dominique Gauguier; John C Lindon; Elaine Holmes; Jeremy Nicholson
Journal: Anal Chem Date: 2005-03-01 Impact factor: 6.986

2. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics.

Authors: Juliane Schäfer; Korbinian Strimmer
Journal: Stat Appl Genet Mol Biol Date: 2005-11-14

3. Metabolic phenotyping in health and disease.

Authors: Elaine Holmes; Ian D Wilson; Jeremy K Nicholson
Journal: Cell Date: 2008-09-05 Impact factor: 41.582

4. Subset optimization by reference matching (STORM): an optimized statistical approach for recovery of metabolic biomarker structural information from 1H NMR spectra of biofluids.

Authors: Joram M Posma; Isabel Garcia-Perez; Maria De Iorio; John C Lindon; Paul Elliott; Elaine Holmes; Timothy M D Ebbels; Jeremy K Nicholson
Journal: Anal Chem Date: 2012-12-04 Impact factor: 6.986

5. Estimation of the concentration of low-density lipoprotein cholesterol in plasma, without use of the preparative ultracentrifuge.

Authors: W T Friedewald; R I Levy; D S Fredrickson
Journal: Clin Chem Date: 1972-06 Impact factor: 8.327

Review 6. A practical guide to metabolomic profiling as a discovery tool for human heart disease.

Authors: Lisa C Heather; Xinzhu Wang; James A West; Julian L Griffin
Journal: J Mol Cell Cardiol Date: 2012-12-08 Impact factor: 5.000

Review 7. Metabolic phenotyping in clinical and surgical environments.

Authors: Jeremy K Nicholson; Elaine Holmes; James M Kinross; Ara W Darzi; Zoltan Takats; John C Lindon
Journal: Nature Date: 2012-11-15 Impact factor: 49.962

8. Population structure and eigenanalysis.

Authors: Nick Patterson; Alkes L Price; David Reich
Journal: PLoS Genet Date: 2006-12 Impact factor: 5.917

9. HMDB: the Human Metabolome Database.

Authors: David S Wishart; Dan Tzur; Craig Knox; Roman Eisner; An Chi Guo; Nelson Young; Dean Cheng; Kevin Jewell; David Arndt; Summit Sawhney; Chris Fung; Lisa Nikolai; Mike Lewis; Marie-Aude Coutouly; Ian Forsythe; Peter Tang; Savita Shrivastava; Kevin Jeroncic; Paul Stothard; Godwin Amegbey; David Block; David D Hau; James Wagner; Jessica Miniaci; Melisa Clements; Mulu Gebremedhin; Natalie Guo; Ying Zhang; Gavin E Duggan; Glen D Macinnis; Alim M Weljie; Reza Dowlatabadi; Fiona Bamforth; Derrick Clive; Russ Greiner; Liang Li; Tom Marrie; Brian D Sykes; Hans J Vogel; Lori Querengesser
Journal: Nucleic Acids Res Date: 2007-01 Impact factor: 16.971

10. Global systems biology, personalized medicine and molecular epidemiology.

Authors: Jeremy K Nicholson
Journal: Mol Syst Biol Date: 2006-10-03 Impact factor: 11.429

3 in total

1. Serum metabolic signatures of coronary and carotid atherosclerosis and subsequent cardiovascular disease.

Authors: Ioanna Tzoulaki; Raphaële Castagné; Claire L Boulangé; Ibrahim Karaman; Elena Chekmeneva; Evangelos Evangelou; Timothy M D Ebbels; Manuja R Kaluarachchi; Marc Chadeau-Hyam; David Mosen; Abbas Dehghan; Alireza Moayyeri; Diana L Santos Ferreira; Xiuqing Guo; Jerome I Rotter; Kent D Taylor; Maryam Kavousi; Paul S de Vries; Benjamin Lehne; Marie Loh; Albert Hofman; Jeremy K Nicholson; John Chambers; Christian Gieger; Elaine Holmes; Russell Tracy; Jaspal Kooner; Philip Greenland; Oscar H Franco; David Herrington; John C Lindon; Paul Elliott
Journal: Eur Heart J Date: 2019-09-07 Impact factor: 29.983

2. Multiple-testing correction in metabolome-wide association studies.

Authors: Alina Peluso; Robert Glen; Timothy M D Ebbels
Journal: BMC Bioinformatics Date: 2021-02-12 Impact factor: 3.169

Review 3. Beyond genomics: understanding exposotypes through metabolomics.

Authors: Nicholas J W Rattray; Nicole C Deziel; Joshua D Wallach; Sajid A Khan; Vasilis Vasiliou; John P A Ioannidis; Caroline H Johnson
Journal: Hum Genomics Date: 2018-01-26 Impact factor: 4.639

3 in total