Raphaële Castagné, Claire Laurence Boulangé1, Ibrahim Karaman, Gianluca Campanella, Diana L Santos Ferreira, Manuja R Kaluarachchi1, Benjamin Lehne, Alireza Moayyeri2, Matthew R Lewis1, Konstantina Spagou1, Anthony C Dona1, Vangelis Evangelos3, Russell Tracy4, Philip Greenland5, John C Lindon1,6, David Herrington7, Timothy M D Ebbels1,6, Paul Elliott, Ioanna Tzoulaki, Marc Chadeau-Hyam. 1. Bioincubator Unit, Metabometrix Ltd , Bessemer Building, Prince Consort Road, South Kensington, London SW7 2BP U.K. 2. Farr Institute of Health Informatics Research, University College London Institute of Health Informatics , 222 Euston Road, NW1 2DA London, United Kingdom. 3. Department of Hygiene and Epidemiology, University of Ioannina Medical School , Ioannina 45110, Greece. 4. Department of Pathology and Laboratory Medicine, University of Vermont Larner College of Medicine , Burlington, Vermont 05405, United States. 5. Department of Preventive Medicine and the Institute for Public Health and Medicine, Northwestern University , Chicago, Illinois 60611, United States. 6. Computational and Systems Medicine, Department of Surgery and Cancer, Faculty of Medicine, Imperial College London , Sir Alexander Fleming Building, South Kensington, SW7 2AZ London, United Kingdom. 7. Section on Cardiovascular Medicine, Department of Internal Medicine, Wake Forest University School of Medicine , Medical Center Boulevard, Winston-Salem, North Carolina 27157, United States.
Abstract
1H NMR spectroscopy of biofluids generates reproducible data allowing detection and quantification of small molecules in large population cohorts. Statistical models to analyze such data are now well-established, and the use of univariate metabolome wide association studies (MWAS) investigating the spectral features separately has emerged as a computationally efficient and interpretable alternative to multivariate models. The MWAS rely on the accurate estimation of a metabolome wide significance level (MWSL) to be applied to control the family wise error rate. Subsequent interpretation requires efficient visualization and formal feature annotation, which, in-turn, call for efficient prioritization of spectral variables of interest. Using human serum 1H NMR spectroscopic profiles from 3948 participants from the Multi-Ethnic Study of Atherosclerosis (MESA), we have performed a series of MWAS for serum levels of glucose. We first propose an extension of the conventional MWSL that yields stable estimates of the MWSL across the different model parameterizations and distributional features of the outcome. We propose both efficient visualization methods and a strategy based on subsampling and internal validation to prioritize the associations. Our work proposes and illustrates practical and scalable solutions to facilitate the implementation of the MWAS approach and improve interpretation in large cohort studies.
1H NMR spectroscopy of biofluids generates reproducible data allowing detection and quantification of small molecules in large population cohorts. Statistical models to analyze such data are now well-established, and the use of univariate metabolome wide association studies (MWAS) investigating the spectral features separately has emerged as a computationally efficient and interpretable alternative to multivariate models. The MWAS rely on the accurate estimation of a metabolome wide significance level (MWSL) to be applied to control the family wise error rate. Subsequent interpretation requires efficient visualization and formal feature annotation, which, in-turn, call for efficient prioritization of spectral variables of interest. Using human serum 1H NMR spectroscopic profiles from 3948 participants from the Multi-Ethnic Study of Atherosclerosis (MESA), we have performed a series of MWAS for serum levels of glucose. We first propose an extension of the conventional MWSL that yields stable estimates of the MWSL across the different model parameterizations and distributional features of the outcome. We propose both efficient visualization methods and a strategy based on subsampling and internal validation to prioritize the associations. Our work proposes and illustrates practical and scalable solutions to facilitate the implementation of the MWAS approach and improve interpretation in large cohort studies.
Entities:
Keywords:
MESA; cohort studies; full resolution 1H NMR; high-throughput analysis; metabolic profiling; metabolome wide association study; molecular epidemiology; multiple testing correction; results visualization and prioritization; significance level
Over the past 15 years,
improvements in high-throughput technologies have accelerated the
simultaneous measurement of large numbers of metabolites in a single
sample using NMR and mass spectrometry (MS), which are both widely
used to characterize a biofluid using either untargeted or targeted
profiling approaches.[1,2] Both technologies have emerged
as efficient tools for identifying biomarkers of exposures as well
as early disease manifestations, hence informing on molecular mechanisms
involved in pathogenesis (e.g., in cancer, diabetes, cardiovascular,
and neurological diseases).[3−6] Metabolic phenotyping uses robust and reliable analytical
methods[7,8] that are ideally suited for untargeted profiling
since prior knowledge about compounds present in a sample is not required.
Such an agnostic discovery approach can inform complementary targeted
methods by identifying novel chemical compounds that may contribute
to the molecular pathways involved in complex phenotypes.[8−10] These screening exercises call upon the efficient analyses of high-dimensional
data whose exploration poses complex methodological and interpretation
problems.[11] Typically, untargeted NMR experiments
in a large population study generate tens of thousands of data points
for thousands of individuals. By adopting univariate approaches, the
concept of the metabolome wide association study (MWAS)[12] has been proposed to analyze these data, and
various statistical approaches to perform such analyses were established
and have been explored and reviewed.[13−17] The complex correlation structures existing in metabolic
profiles result in partially redundant signals across the metabolic
spectra. These need to be accounted for while correcting for multiple
testing, for instance by using a permutation-based procedure to derive
the metabolome-wide significance level (MWSL) controlling the family
wise error rate (FWER).[13] However, the
comparative applicability of these approaches, as well as the visualization
of the results they produce, has not been comprehensively explored.
Here we provide, using a real-life example from the COMBI-BIO project,
an extension of the MWAS methodology we initially developed using
simulated data sets in a class discrimination context[13] and further extended to accommodate continuous outcomes.
The current extension includes the computation of a stable and reproducible
statistical significance threshold and proposes ways to estimate consistently
across different types of NMR spectra as well as an internal validation
procedure to assess the robustness of candidate associations. Our
data set comprises two versions of 1H NMR nontargeted metabolic
profiles in 3948 participants from the Multiethnic Study of Atherosclerosis
(MESA) cohort.[18] As a proof-of-principle
example, we focus on identifying the NMR spectral features associated
with fasting blood serum glucose, which was measured by an independent
technique (glucose oxidase method). By examining the full NMR spectrum,
our analysis was designed to identify spectral regions corresponding
to other metabolites beyond glucose itself that also vary with fasting
serum glucose (1H NMR glucose associated peaks).
Materials
and Methods
Study Population and Sample Selection
The MESA cohort
has been described elsewhere[18] and includes
6814 participants (53% females, 47% males) aged 44–84 years
(mean = 62 years) from four different ethnic groups: Chinese-American
(n = 803), African-American (n =
1893), Hispanic (n = 1496), and Caucasian (n = 2622), all recruited between 2000 and 2002 at clinical
centers in the United States. Participants were free of symptomatic
cardiovascular disease at baseline, and demographic, medical history,
anthropometric, lifestyle data, and serum samples were collected during
the first examination (July 2000–August 2002), together with
information on lipid or blood pressure treatment, and diabetes, and
measures of systolic blood pressure. Serum samples were stored at
−80 °C after collection. At enrolment, high density lipoprotein
(HDL-C) was measured in EDTA plasma on the Roche/Hitachi 911 Automatic
Analyzer (Roche Diagnostics Corporation, Indianapolis, IN), and low
density lipoprotein (LDL-C) was calculated using the Friedewald equation,[19] together with fasting serum glucose using the
glucose oxidase method on the Vitros analyzer (Johnson and Johnson
Clinical Diagnostics). Ethical approval was obtained by local ethical
review boards, and subsequent analysis was conducted in full accordance
with the ethical approval obtained.
Samples Preparation and 1H NMR Spectroscopic Acquisition
The full sample preparation
and quality control procedure have
been extensively described elsewhere.[20] Briefly, serum samples were thawed on the day prior to analysis,
and 300 μL of each sample was mixed with 300 μL of phosphate
buffer (NaHPO4, 0.075M, pH = 7.4). Samples were processed in two phases
(each corresponding to a separate analytical batch). Eppendorfs were
used for phase 1, while 96-well plates were used for phase 2. After
centrifugation (12 000g at 4 °C for 5
min), 550 μL of each sample-buffer mixture was manually transferred
into SampleJet 5 mm diameter NMR tubes and kept at 4 °C until
analysis. Different types of quality control (QC) samples were used
for each phase as described elsewhere.[20] All QC pools were aliquoted in 350 μL and stored at −80
°C prior to analysis. 1H NMR spectra were acquired
using a Bruker DRX600 spectrometer (Bruker Biospin, Rheinstetten,
Germany) operating at 600 MHz. A standard water suppressed one-dimensional
spectrum (usually termed NOESY) and a Carr–Purcell–Meiboom–Gill
(CPMG) spectrum were obtained for each sample.[20]
1H NMR Metabolite Profiling
The metabolite
profiling workflow has been detailed elsewhere.[20] For each biosample, two NMR profiles were generated: (i)
a 1-D spectrum (NOESY) showing resonances from all proton-containing
molecules in the sample, including broad, largely undefined bands
from serum proteins, sharper and well-defined bands from serum lipoproteins
(with some classification into their main groups), and sharp peaks
from a range of small molecule metabolites such as amino acids, simple
carbohydrates, organic acids, organic bases, and a number of osmolytes,
and (ii) a Carr–Purcell–Meiboom–Gill (CPMG) spectrum
that attenuates the peaks from the macromolecules and allows better
definition of the small molecules.For both CPMG and NOESY NMR
data, in-house written MATLAB (Mathworks Inc., USA) routines were
utilized for phasing and baseline correction. Prior to spectral peak
alignment, the region δ 4.400–5.100 corresponding to
the H2O resonance was removed. Spectral peak alignment
was performed by the Recursive Segment-wise Peak Alignment (RSPA)
algorithm.[21] Regions where the peaks of
different suspected contaminations (i.e., methanol) occurred were
removed from the whole spectra (δ 1.180–1.240, δ
2.244–2.261 and δ 3.660–3.710). The remaining
spectral regions were normalized by probabilistic quotient normalization
using the median spectrum as the reference.[22] The normalized high resolution spectra contained 30 590 data
points (variables) for both CPMG and NOESY data sets. To ensure comparability
across batches (i.e., across studies and phases), each variable was
mean-centered[23] (we will hereafter refer
to this as mean corrected intensities). Score plots of the first few
principal components for the resulting data set were finally used
to identify and remove potential outlier samples (N = 3 for the CPMG and N = 4 for the NOESY data)
from further analyses. The MESA study population was further examined
for potential outliers using score plots from the first two principal
components (Figure S-1), which showed strong
homogeneity in the study participants.
Metabolite Assignments
Selected samples were analyzed
using a range of 2-D NMR experiments such as total correlation spectroscopy
(TOCSY) or heteronuclear single quantum coherence (HSQC) to aid molecular
identification. We used approaches such as statistical total correlation
spectroscopy (STOCSY) and STORM to help constrain possible molecular
structures.[24,25] Peak identification in the 1H NMR data was supported by a semiautomatic clustering of
the full resolution 1H NMR spectra (30 590 data
points) using statistical recoupling of variables,[26] where the algorithm defines a cluster as containing 10
or more variables. Each cluster is subsequently checked by NMR experts
to improve the data point grouping and identify peak overlaps. From
our data, 136 and 159 clusters were identified in 1H NMR
NOESY and CPMG data, respectively, each of them corresponding to a
single or a group of peaks. Resulting spectral information was also
compared to the literature[27,28] and to existing databases
such as the Human Metabolome Database[29] (HMDB, http://www.hmdb.ca/) and the Biological Magnetic Resonance Data Bank[30] (BMRB, http://www.bmrb.wisc.edu). Resulting sets of NMR features were ultimately confirmed through
spiking experiments using commercial standards. The level of assignment
(LoA) we used was adapted from Sumner et al.[31] NMR features were also characterized by their peak multiplicity:
singlet (s), doublet (d), triplet (t), quartet (q), doublet of doublet
(dd), multiplet (m), broad peak (b), and noise (n). Ninety-two clusters
were assigned to 44 unique metabolites for the NOESY data set, and
91 clusters were assigned to 48 unique metabolites for the CPMG data
set.
Metabolome-Wide Association Study (MWAS)
Figure presents our analytical workflow.
We adopted a MWAS approach using univariate linear regression models
to systematically screen the 30 590 data points assayed in
both NOESY and CPMG profiles.
Figure 1
Analytical workflow.
Analytical workflow.For a given data point, the linear model can be formulated
aswhere X represents the normalized NMR intensity, Y is the outcome variable; here the log10 transformed blood glucose concentration and ε is the error term for an observation i. α is the intercept of the model, and ß1 measures how strongly each data point influences the outcome
variable. FE is a vector of fixed effect
observations for individual i and the vector B2 compiles the regression coefficients for each adjustment
covariate; for model 1, age, gender, phase, and ethnicity, and for
model 2, we additionally correct for LDL and HDL cholesterol, lipids
and blood pressure treatment, systolic blood pressure, smoking status,
and diabetes.Most of the recently published MWAs adopted a
Bonferroni correction
for multiple testing,[32−34] which ignores correlation across variables and therefore
does not account for the (partial) redundancy across the statistical
test performed. This may lead to an overly conservative multiple testing
correction. To take into account the high degree of correlation in
spectral data and prevent overcorrection for multiple testing, one
computationally efficient method relies on the estimation of the effective
number of tests (ENT), as defined by the virtual number of independent
tests that are performed across the actual p tests
performed. ENT measures the level of correlation within the spectral
data and can be estimated through spectral decomposition of the correlation
matrix of predictors.[35] From this estimate,
the per-test significance level ensuring a FWER control can be defined
as the Bonferroni-corrected threshold corresponding to that number
of independent tests. However, this approach remains of limited use
in real-life metabolomic data sets, notably because, for numerical
reasons, the ENT is upper-bounded by the number of observations.[36]As a scalable alternative, we used a permutation-based
method to
estimate the metabolome wide significance level (MWSL or α′)[13] in which the outcome (glucose levels) is randomly
shuffled across observations. Each permutation mimics the null hypothesis
of no association, and we performed for each permuted data set a MWAS
using the linear regression model described above. In that setting,
and for a given permuted data set, the minimum p-value
across all variables (denoted q) represents the largest
significance level to be considered to ensure no false positive findings,
and the MWSL α′ controlling the FWER at a level α
can be derived from the distribution of q across
the N (set here to 10 000) permutations.The effective number of tests (ENT) is then defined as the number
of independent tests that would be required to obtain the same significance
level using a Bonferroni correction: ENT= α/α′
and measures the level of correlation across the p tests performed.To assess the robustness of the ENT estimates
and to circumvent
the strong assumptions from the generalized linear model on data structure
(independence of each data points, distribution of the residuals,
variance structure, and linear relationship between response and predictors),
we ran our permutation-based procedure for both data sets (NOESY and
CPMG) for both models (1 and 2) and used different transformations
of the glucose distribution: raw concentrations, truncated concentrations
(129 outlying observations with more than 2 standard deviations away
from the mean glucose level were discarded), and log10-transformed
glucose levels (Figure S-2).To formally
assess the sensitivity of our MWSL estimates to distributional
features of the outcome, we also simulated continuous responses from
gamma and several Gaussian distributions and compared resulting MWSL
estimates to those obtained using measured glucose levels.
Sensitivity,
Stability Analyses, and Results Prioritization
We performed
further sensitivity analyses to assess the stability
of the candidate associations we identified. These included a cross-validation
procedure based on the independent subsampling (N = 100 times) of discovery (containing 80% of the observations) and
replication (comprising the remaining 20% observation) data sets.For each 80:20 discovery and replication split, we performed a MWAS
and used the MWSL to identify the candidate associations in the discovery
set. For each discovery set (80% of the observations), the number
of independent signals among the candidate associations was approximated
by the number of principal components needed to explain more than
99% of their variance. That number of PCs was then used to compute
the Bonferroni corrected MWSL used in the replication set.[37]Our strategy could be summarized as follows
for a given split:MWAS: identify (N0) candidate associations
in the discovery set with discovery p-value <
MWSL.Estimate the
number of independent
signals these N0 correspond to run a PCA
on the XN0, the matrix combining the N0 data points declared significant in the discovery
set (a random 80% subsample or the full study population), and identify NPC, the number of PC’s needed to explain
more than 99% of the variance in XN0.Replication: identify
from the candidate
signals (step 1) those replicating in the 20% replication set at a
Bonferroni-corrected significance level accounting for NPC tests, α/ NPC, setting
α = 5%.Steps 1–3 are repeated
across the 100 independent splits
and the proportion a given signal is identified in the discovery set,
and replicated in the validation set is reported.Statistical
analyses were all performed using R v3.1.2.[38]
Results and Discussion
A total of 3948 individuals
from the MESA cohort were included
in the analysis, and for each individual, 30 590 serum NMR
features were measured for both NOESY and CPMG spectra. The characteristics
of the study population are summarized in Table .
Table 1
Summary Characteristics
of the Study
Population: Multiethnic Study of Atherosclerosis Cohort
N
% or mean (sd)
gender
men
1951
49.4
women
1997
50.6
age (y)
all
3948
62.9 (10.3)
phase
1
1976
50
2
1972
50
ethnicity
Caucasian
1521
38.5
Hispanic
926
23.4
African-American
968
24.5
Chinese-American
533
13.5
body mass index (kg/m2)
all
3948
28.2 (5.4)
glucose (mg/dL)
all
3945
98.3 (31.1)
missing
3
LDL
cholesterol (mg/dL)
all
3884
117.3 (31.4)
missing
64
HDL cholesterol (mg/dL)
all
3942
50.7 (14.7)
missing
6
systolic blood pressure
(mmHg)
all
3948
127.1 (21.3)
height (cm)
all
3948
166.4 (10.2)
diabetes
no
3387
85.8
yes
561
14.2
lipids treatment
no
3286
83.2
yes
662
16.8
blood pressure
treatment
no
2449
62.1
yes
1497
37.9
smoking
never
1988
50.4
former
483
12.2
current
1461
37
missing
16
0.4
We first explored the sensitivity of the MWSL estimates
to (i)
the parametrization of the statistical model, (ii) the type of NMR
data under investigation, and (iii) distributional features of the
outcome of interest. As summarized in Table , we then ran the permutation procedure for
two different models (Models 1 and 2), for both types of NMR data
sets (NOESY and CPMG) and 3 versions of glucose blood concentrations:
raw, log10 transformed and truncated (removing 129 outlying observations
which were outside the 2 standard deviation range from the mean glucose
level). Across all models investigated, MWSL estimates for CPMG are
more stringent than for NOESY spectra and, correspondingly, ENT estimates
for CPMG are much greater than those for NOESY data (ranging from
17 610 to 122 460 and from 3680 to 17 010 for
CPMG and NOESY, respectively). This suggests stronger correlations
within NOESY data, which is plausible given that NOESY NMR spectra
contain stronger broad peaks from proteins and lipoproteins than CPMG
spectra, such that data point intensities are highly correlated. As
summarized in Table , irrespective of the type of spectrum, MWSL estimates (and corresponding
ENT) seem to be only marginally affected by the number of confounders
included in the model (i.e., comparing models 1 and 2). However, a
systematic but moderate increase in the ENT is observed for the fully
adjusted model (Model 2). This can be explained by the fact that the
fully adjusted model removes spectral signals relating to the components
of the Framingham risk scores (such as the conventionally measured
lipoprotein levels), and these confounders were highly likely to be
driving correlations across some data points.
Table 2
Significance Threshold
α′
and Effective Number of Test (ENT) Based on a Bonferroni Correctionab
NOESY (30 590 variables)
CPMG (30 590 variables)
phenotype
N
model
MWSL (FWER = 5%)
MWSL (FWER = 5%)
glucose
3945
1
α′
(×10–5)
0.31 (0.26; 0.35)
0.04 (0.03; 0.05)
ENT (×103)
16.33 (14.42;
19.51)
53.39
122.46 (108.67;
143.67)
400.32
3866
2
α′
(×10–5)
0.29 (0.25; 0.32)
0.04 (0.04; 0.05)
ENT (×103)
17.01 (15.57;
19.88)
55.60
121.64 (107.21;
142.23)
397.64
log10(glucose)
3945
1
α′ (×10–5)
1.09 (0.96; 1.17)
0.22 (0.20; 0.23)
ENT
(×103)
4.59 (4.28; 5.22)
15.01
23.1 (21.7; 24.4)
75.37
3866
2
α′ (×10–5)
1.05 (0.96; 1.13)
0.21 (0.19; 0.22)
ENT
(×103)
4.75 (4.43; 5.02)
15.52
23.6 (22.7; 26.2)
77.07
glucose
without
outliers
3816
1
α′ (×10–5)
1.36 (1.25; 1.45)
0.28 (0.26;
0.29)
ENT (×103)
3.68 (3.45; 4.01)
12.02
18.02 (17.36; 19.07)
58.92
3743
2
α′ (×10–5)
1.06 (1.00; 1.10)
0.28 (0.27;
0.30)
ENT (×103)
4.70 (4.53; 4.99)
15.36
17.61 (16.74; 18.74)
57.57
95% confidence intervals are
given in parentheses. Figures are based on 10 000 permutations
for each model (1 and 2) and given for the glucose, log10(glucose)
and the glucose after outliers exclusion (N = 129
excluded).
Bold figures
are the ratios of effective/actual
number of tests.
95% confidence intervals are
given in parentheses. Figures are based on 10 000 permutations
for each model (1 and 2) and given for the glucose, log10(glucose)
and the glucose after outliers exclusion (N = 129
excluded).Bold figures
are the ratios of effective/actual
number of tests.For CPMG data,
estimates of the MWSL using raw glucose concentrations
led to an estimated ENT greater than the actual number of tests (ratio
around 4.00). This might be attributed to the parametric assumption
in generalized linear regression model (e.g., normality of the outcome
and equal variance) underlying our permutation procedure being violated.
This is supported by the distribution of the blood levels of glucose,
which is right skewed, with several outlying observations (Figure S-2A).To account for this asymmetrical
distribution, we first applied
a log10 transformation to glucose levels (Figure S-2B). Although the transformed distribution remains
right skewed, corresponding estimates of the MWSL were less stringent,
and the effective to actual number of tests ratio dropped to around
0.75 for CPMG (and to 0.15 for NOESY). To remove the influence of
outlying observations, we truncated the glucose distribution and removed
observations more than 2 standard deviations away from the mean glucose
level (Figure S-2C). While only a small
number of observations were discarded (N = 129),
this removal strongly impacted the MWSL estimates for CPMG data, and
the effective to actual number of test ratio dropped to less than
60% for both models.These results
suggest that MWSL estimates are sensitive to the
shape of the distribution of the continuous outcome under investigation,
and are specifically affected by both the relative weight of its tail,
and by the presence of outlying observations. To formally assess the
sensitivity of our MWSL estimates to the parametric form of the response
variable, we ran a series of sensitivity analyses where we randomly
sampled for each participant (and for each permutation) the glucose
levels from (i) a Gamma distribution fitted on the measured glucose
levels (shape = 15.90, scale = 6.18), and (ii) several Gaussian distributions
(mean = 0, sd = 1; mean = 0, sd = 10; mean = 0, sd = 100, mean = mean(glucose),
sd = sd(glucose). By construction, the Gamma-distributed response
did not include outlying observations, but featured an inflated right
tail, which provided less stringent MWSL for both NOESY. Results from
the normally distributed outcomes showed consistent MWSL estimates
and did not seem to be strongly affected by the parameters choices
defining the Gaussian distribution (Table S-1, Figure S-3). Since the numbers of associated variables were
only marginally affected by the way the ENT was computed in our example,
we took forward the MWSL estimated from the Gaussian simulated outcome
(mean = 0, sd = 1), which also seemed to provide the best balance
between generalizability and simplicity (Figures S-3 and S-4). We further compared the proportion of associated
variables from different multiple testing correction strategies including
Bonferroni and FWER control, and Benjamini-Hochberg[39] false discovery rate (BH-FDR) procedure (Figure S-5). In all scenario considered, the largest proportion
of associated variables was always observed when using a BH-FDR procedure,
and the smallest proportion of associated variables was always observed
when using a Bonferroni procedure (Figure S-5).
Metabolome Wide Association Study of Glucose: Visualization
and Prioritization
We performed the MWAS of blood levels
of glucose on both CPMG and NOESY spectra using both models and setting
the MWSL to that estimated using the Gaussian simulated outcome (mean
= 0, sd = 1, Table S-1). As expected, irrespective
of the model and of the type of spectrum, a very large number of spectral
features were found associated with blood concentration of glucose
(Table ): for NOESY
data, 72% and 44% of the spectral variables were found significantly
associated with glucose level for models 1 and 2, respectively. These
proportions were 40% and 16% for models 1 and 2, respectively, in
CPMG data.
Table 3
Number and Percentage of Associated
Variables for Models 1 and 2 for Both NOESY and CPMG by Class of Metabolitea
NOESY
CPMG
model
number of
data points per group(%)
number of
significant data points (%)
number of
significant data points after results prioritization (%)
number of
data points per group (%)
number of
significant data points (%)
number of
significant data points after results prioritization (%)
1
allb
30 590 (100)
22 066 (72)
18 340 (83)
30 590 (100)
12 303 (40)
9920 (81)
amino-acidsc
4786 (15.65)
3653 (76.3)
3222 (88.2)
4524 (14.79)
2780 (61.5)
2078 (74.7)
carbohydratesc
2394 (7.83)
2204 (92.1)
2192 (99.5)
1826 (5.97)
1763 (96.5)
1734 (98.4)
drug derivativesc
191 (0.62)
191 (100)
116 (60.7)
81 (0.26)
30 (37)
26 (86.7)
lipidsc
4085 (13.35)
2815 (68.9)
2569 (91.3)
3906 (12.77)
2772 (71)
2530 (91.3)
nucleosidesc
116 (0.38)
41 (35.3)
35 (85.4)
103 (0.34)
13 (12.6)
1 (7.7)
organic acidsc
1383 (4.52)
949 (68.6)
838 (88.3)
864 (2.82)
390 (45.1)
266 (68.2)
othersc
377 (1.23)
229 (60.7)
173 (75.5)
363 (1.19)
261 (71.9)
237 (90.8)
proteinsc
549 (1.79)
542 (98.7)
500 (92.3)
499 (1.63)
499 (100)
494 (99)
unassigned
15 978 (52.23)
11 081 (69.4)
8472 (76.5)
10 315 (33.72)
3414 (33.1)
2252 (66)
2
allb
30 590 (100)
13 449 (44)
9610 (71)
30 590 (100)
4909 (16)
3355 (68)
amino-acidsc
4786 (15.65)
3123 (65.3)
2168 (69.4)
4524 (14.79)
1048 (23.2)
602 (57.4)
carbohydratesc
2394 (7.83)
2079 (86.8)
1906 (91.7)
1826 (5.97)
1695 (92.8)
1653 (97.5)
drug derivativesc
191 (0.62)
65 (34)
36 (55.4)
81 (0.26)
24 (29.6)
17 (70.8)
lipidsc
4085 (13.35)
1172 (28.7)
176 (15)
3906 (12.77)
263 (6.7)
116 (44.1)
nucleosidesc
116 (0.38)
7 (6)
0 (0)
103 (0.34)
0(−)
0(−)
organic acidsc
1383 (4.52)
798 (57.7)
462 (57.9)
864 (2.82)
114 (13.2)
18 (15.8)
othersc
377 (1.23)
112 (29.7)
58 (51.8)
363 (1.19)
87 (24)
35 (40.2)
proteinsc
549 (1.79)
529 (96.4)
400 (75.6)
499 (1.63)
407 (81.6)
352 (86.5)
unassigned
15 978 (52.23)
5421 (33.9)
4317 (79.6)
10 315 (33.72)
1067 (10.3)
413 (38.7)
Results are also given after
results prioritization: variables identified in the discovery set
and replicated in the validation set in at least one split across
the 100 splits (see Methods).
Figures are given for the whole
spectra (N = 30 590 variables) including the
unassigned regions.
Figures
are based on the achieved
NMR assignment: not all variables have been assigned in the spectra.
Results are also given after
results prioritization: variables identified in the discovery set
and replicated in the validation set in at least one split across
the 100 splits (see Methods).Figures are given for the whole
spectra (N = 30 590 variables) including the
unassigned regions.Figures
are based on the achieved
NMR assignment: not all variables have been assigned in the spectra.Analyses by classes of metabolite for model 1 revealed
that nearly all variables assigned to drug derivative (100%), proteins
(98.7%), and carbohydrates (92.1%) were associated with glucose in
NOESY. Similarly, all variables assigned to proteins (100%), carbohydrates
(96.5%), and others (71.9%, which include choline and glycerol) were
associated with glucose in CPMG. The proportion of unassigned associated
variables was higher for NOESY (69.4%) compared to CPMG (33.9%).Adjusting for Framingham risk score (FRS) variables in model 2
reduced the total number of associations for both NOESY and CPMG spectra.
This reduction was mostly observed for the lipids class (NOESY, 68.9%
vs 28.7%; CPMG, 71% vs 6.7%) and the “others” class
(NOESY, 60.7% vs 29.7%; CPMG, 71.9% vs 24%). As expected, the proportion
of carbohydrate associated variables was one of the least affected
classes by the FRS adjustment (NOESY, 92.1% vs 86.8%; CPMG, 96.5%
vs 92.8%).The utility of the MWAS approach
in metabolic phenotyping relies
on explicit visualization of the results. One primary output to help
identify relevant spectral regions borrows from the field of genome-wide
association studies and reports for each spectral variable the–log10 p-value multiplied by the sign of the corresponding regression
coefficient. The resulting signed Manhattan plots (Figure and Figure S-6 for CPMG and NOESY, respectively) offer a global view of
the spectral regions associated with the outcome of interest, and
can be further informed by the annotation of spectral features. Assigned
metabolites found associated with conventional serum glucose measurements
are reported in Tables S-2 and S-3 for
CPMG and NOESY, respectively. For CPMG, 47 unique metabolites were
associated with glucose for model 1, of which 34 were still associated
after controlling for the FRS variable (44 and 41 for NOESY).
Figure 2
Metabolome
wide study of glucose (model 2). This Manhattan plot
shows the analysis of the 30 590 CPMG features. The signed
negative log10 p-value is plotted against the chemical
shift in ppm. To ease the visualization, all log p-value ≤ 10–30 were set to 1 × 10–30. The horizontal dashed line indicates the α′
per-test significance level controlling the FWER at a 5% level using
the Gaussian simulated outcome. Data points are colored by class of
metabolites. Components were: 1, L1; 2, L2; 3, isoleucine; 4, leucine,
isoleucine; 5, leucine; 6, valine; 7, L3; 8, lactate; 9, alanine;
10, L4; 11, arginine; 12, lysine; 13, acetate; 14, L5; 15, acetylglycoproteins;
16, methionine; 17, glutamate; 18, glutamine; 19, L6; 20, 3-hydroxybutyrate;
21, pyruvate; 22, pyroglutamate; 23, citrate; 24, L7; 25, aspartate;
26, albumin; 27, creatine; 28, creatinine; 29, ornithine, tyrosine;
30, ornithine; 31, phenylalanine; 32, tyrosine; 33, choline; 34, beta-glucose;
35, proline; 36, alpha-glucose, beta-glucose; 37, alpha-glucose; 38,
glycine; 39, glycerol; 40, mannose; 41, glyceryl groups of lipids;
42, APAP glucuronide; 43, L8; 44, uridine; 45, 1-Methylhistidine;
46, histidine; 47, 3-methylhistidine; 48, formate.
Metabolome
wide study of glucose (model 2). This Manhattan plot
shows the analysis of the 30 590 CPMG features. The signed
negative log10 p-value is plotted against the chemical
shift in ppm. To ease the visualization, all log p-value ≤ 10–30 were set to 1 × 10–30. The horizontal dashed line indicates the α′
per-test significance level controlling the FWER at a 5% level using
the Gaussian simulated outcome. Data points are colored by class of
metabolites. Components were: 1, L1; 2, L2; 3, isoleucine; 4, leucine,
isoleucine; 5, leucine; 6, valine; 7, L3; 8, lactate; 9, alanine;
10, L4; 11, arginine; 12, lysine; 13, acetate; 14, L5; 15, acetylglycoproteins;
16, methionine; 17, glutamate; 18, glutamine; 19, L6; 20, 3-hydroxybutyrate;
21, pyruvate; 22, pyroglutamate; 23, citrate; 24, L7; 25, aspartate;
26, albumin; 27, creatine; 28, creatinine; 29, ornithine, tyrosine;
30, ornithine; 31, phenylalanine; 32, tyrosine; 33, choline; 34, beta-glucose;
35, proline; 36, alpha-glucose, beta-glucose; 37, alpha-glucose; 38,
glycine; 39, glycerol; 40, mannose; 41, glyceryl groups of lipids;
42, APAP glucuronide; 43, L8; 44, uridine; 45, 1-Methylhistidine;
46, histidine; 47, 3-methylhistidine; 48, formate.High-resolution visualization is also key to enable
peak validation
and subsequent annotation through the inspection of the shape and
multiplicity of the spectral features in the neighborhood of the associated
regions. Such visualization could be provided by regional plots as
exemplified in Figure , which focuses on the glutamine region ([2.4535–2.5045] ppm).
The upper panel represents the high resolution (unsigned) Manhattan
plot where the −log10p-value (left
Y axis) measuring the strength of association each between peak height
and the log10 transformed blood glucose level is represented
by a triangle (down pointing for negative associations). For the region
under investigation, we define the reference feature as the strongest
association (here 2.47175 ppm, p-value <2.10–16, represented in gray) and color-code the pairwise
correlation of each spectral variable with this reference. The mean-corrected
intensity (i.e., residuals removing the effect of possible confounders:
phase and cohort, see Methods) is also represented
(right Y axis) for the samples in the fifth (blue line) and 95th (green
line) quantiles of the residuals log10 glucose levels after
controlling for the FRS variables.
Figure 3
CPMG-model 2 regional association plots
with log10 (glucose)
for the glutamine. In the upper plot, the −log10 p-value for the features at two regions are shown on each plot. Features
are colored based on their correlation with the gray hit that has
the smallest p-value in the region. The lines show
the mean corrected intensity (i.e., residuals removing the linear
effect of the phase and the cohort) in the 5% of samples with high
residual glucose in green and 5% of the samples with low residual
glucose in blue. The bottom plot shows the mean spectral intensity
in MESA phase 1 (plain line) and in MESA phase 2 (dashed line). Green
circles indicate the proportion of replication after results prioritization.
CPMG-model 2 regional association plots
with log10 (glucose)
for the glutamine. In the upper plot, the −log10 p-value for the features at two regions are shown on each plot. Features
are colored based on their correlation with the gray hit that has
the smallest p-value in the region. The lines show
the mean corrected intensity (i.e., residuals removing the linear
effect of the phase and the cohort) in the 5% of samples with high
residual glucose in green and 5% of the samples with low residual
glucose in blue. The bottom plot shows the mean spectral intensity
in MESA phase 1 (plain line) and in MESA phase 2 (dashed line). Green
circles indicate the proportion of replication after results prioritization.From this plot, it is clear that
there are strong, and exclusively
positive correlation coefficients among 1H NMR data points
throughout the region as indicated by the color-coded pairwise correlation
coefficients (75% are >0.77). As expected, because of the local
correlation
structures in NMR spectra (caused by the finite peak width and the
resonance being split by the J couplings), data points neighboring
the reference signal exhibit higher correlation levels. More distant
spectral variables may also show strong correlation with the reference
peak (e.g., 2.5045, Spearman correlation = 0.26; 2.4535, Spearman
correlation = 0.30), and irrespective of their location in the spectrum,
intensities that are the most correlated to the reference peak exhibit
the strongest associations with glucose levels. This arises because
a given metabolite often has several NMR peaks and this procedure
offers a way of finding such linked resonances as an aid to molecular
annotation. For these variables (dark orange and red triangles), the
difference in the mean corrected intensity in participants with the
highest and lowest glucose levels are larger compared to the spectral
variables less correlated to the reference peak (yellow triangles).To get further insights into the nature (shape and multiplicity)
of the variables found associated with glucose we represent the mean
spectrum in the bottom panel (left Y axis). Mean spectra are plotted
for each of the analytical phases to assess both potential technically
induced bias across experimental batches, and possible population
heterogeneity across samples assayed in each phase. While the mean
spectrum from phase 2 appears to have more marked variables (with
higher modes), both data sets yield consistent subspectra in terms
of alignment, multiplicity and overall peak shape.To accommodate
the expected large number of associations with glucose
levels, efficient signal prioritization strategies are key. One established
way to identify robust and replicable associations is to seek external
replication in an independent data set. However, owing to the specificity
of experimental protocols and the nature of the measurements (which
are relative and not absolute concentrations), external validation
is challenging in metabolic phenotyping, but could however be sought
for using either standardized or annotated data. As a workable alternative
and complementary approach prioritizing relevant metabolic features,
we propose to seek for internal validation and randomly split the
study population in a discovery (80% of the full population) and a
validation (the remaining 20%) set. The robustness of an association
identified in the full population is then quantified by the number
of times discovered and replicated over the (N =
100) independent splits. This proportion is represented on the regional
plot (green dots on the bottom panel of Figure ). Owing to the strong signals we identify
in our glucose MWAS, we report very high replication proportions in
glucose-associated regions. As a first approach, prioritized associations
were defined as those discovered and validated in at least one split
(i.e., with a proportion of replication >0). As illustrated in Table and in Figures and S-7 for CPMG and NOESY, this prioritization strategy reduced the number
of significant associations for model 2 by almost 50 and 30%, respectively.
As illustrated in Figure , the associated spectral variables are strongly clustered
and define clear regions that are associated with the blood levels
of glucose. Our prioritization strategy identifies overall the same
regions along the spectra but selects preferentially the strongest
associated variables within each region and regions that have been
assigned. This is even more apparent when more stringent prioritization
strategies, for instance, in Figure we plot the signed Manhattan plot for the features
discovered and replicated in a least half of the splits (in red).
Figure 4
Comparison
of results from the analysis in MESA to those from the
80:20 split strategy. Results are presented for the CPMG (N = 30 590 data points) metabolome wide association
study of glucose using model 2. To ease the visualization, all p-value ≤ 10–30 were set to 1 ×
10–30. The −log10(p-value)
is signed by the direction of the effect size estimate and is plotted
against the chemical shift. The horizontal dashed line indicates the
per-test significance level controlling the FWER at a 5% level. Variables
found from the analyses in MESA are presented in black, those discovered
and replicated at least once across the 100 splits are presented in
green and those discovered and replicated in 50% of the split are
presented in red. Components were: 1, L1; 2, L2; 3, isoleucine; 4,
leucine, isoleucine; 5, leucine; 6, valine; 7, L3; 8, lactate; 9,
alanine; 10, L4; 11, arginine; 12, lysine; 13, acetate; 14, L5; 15,
acetylglycoproteins; 16, methionine; 17, glutamate; 18, glutamine;
19, L6; 20, 3-hydroxybutyrate; 21, pyruvate; 22, pyroglutamate; 23,
citrate; 24, L7; 25, aspartate; 26, albumin; 27, creatine; 28, creatinine;
29, ornithine, tyrosine; 30, ornithine; 31, phenylalanine; 32, tyrosine;
33, choline; 34, beta-glucose; 35, proline; 36, alpha-glucose, beta-glucose;
37, alpha-glucose; 38, glycine; 39, glycerol; 40, mannose; 41, glyceryl
groups of lipids; 42, APAP glucuronide; 43, L8; 44, uridine; 45, 1-methylhistidine;
46, histidine; 47, 3-methylhistidine; 48, formate.
Comparison
of results from the analysis in MESA to those from the
80:20 split strategy. Results are presented for the CPMG (N = 30 590 data points) metabolome wide association
study of glucose using model 2. To ease the visualization, all p-value ≤ 10–30 were set to 1 ×
10–30. The −log10(p-value)
is signed by the direction of the effect size estimate and is plotted
against the chemical shift. The horizontal dashed line indicates the
per-test significance level controlling the FWER at a 5% level. Variables
found from the analyses in MESA are presented in black, those discovered
and replicated at least once across the 100 splits are presented in
green and those discovered and replicated in 50% of the split are
presented in red. Components were: 1, L1; 2, L2; 3, isoleucine; 4,
leucine, isoleucine; 5, leucine; 6, valine; 7, L3; 8, lactate; 9,
alanine; 10, L4; 11, arginine; 12, lysine; 13, acetate; 14, L5; 15,
acetylglycoproteins; 16, methionine; 17, glutamate; 18, glutamine;
19, L6; 20, 3-hydroxybutyrate; 21, pyruvate; 22, pyroglutamate; 23,
citrate; 24, L7; 25, aspartate; 26, albumin; 27, creatine; 28, creatinine;
29, ornithine, tyrosine; 30, ornithine; 31, phenylalanine; 32, tyrosine;
33, choline; 34, beta-glucose; 35, proline; 36, alpha-glucose, beta-glucose;
37, alpha-glucose; 38, glycine; 39, glycerol; 40, mannose; 41, glyceryl
groups of lipids; 42, APAP glucuronide; 43, L8; 44, uridine; 45, 1-methylhistidine;
46, histidine; 47, 3-methylhistidine; 48, formate.
Conclusion
The identification and
interpretation of metabolic features that contribute to a physiological
and/or disease-induced outcome from full resolution NMR data is challenging
for many reasons related to the dimensionality and complexity of metabolic
profiles. The main aim of the present work is to address some of these
issues through the development of a robust multiple testing strategy,
intuitive visualizations and objective ways to prioritize results.
The MWAS approach requires accurate correction for the large number
of tests performed, and therefore needs to appropriately account for
the strong and complex correlation structures within the NMR spectra.
All methods for performing multiple testing correction assume a valid
statistical model that captures dependencies in the data. While approaches
controlling the FDR usually provide less stringent multiple testing
correction, these have been reported to misperform in cases of high
correlations among predictors.[40] As an
extension of the MWAS approach to accommodate continuous variables
in a regression context, we proposed an estimation of the metabolome
wide significance level adopting the same permutation strategy and
investigated the sensitivity of the estimates to the data structure
and to the model parametrization. Our results suggest that in the
case of highly correlated variables (spectral variables) that are
strongly associated with an outcome (here glucose levels), permutations
do not succeed in destroying the predictor-outcome relationship, hence
yielding ENT estimates greater than the actual number of tests performed.
Sensitivity analyses removing extreme values, and/or log transforming
glucose levels showed that outlying observations are driving this
unexpected estimate of the ENT. In the presence of strong correlation
(i) between blood glucose and most of the assayed metabolites, (ii)
among the metabolites, and (iii) between metabolites and adjustment
variables, assumptions (e.g., observations exchangeability under the
null) on which permutation inferences are based may be violated, and
especially in the presence of strong outlying observations. From our
data, MWSL estimates appeared robust to model parametrization (i.e.,
marginally affected by the set of confounding variables considered),
but clearly depended on the correlation structure in the data, which
are population and platform specific. This suggests that the MWSL
should be tailored and re-estimated for each data set. MWSL estimates
were also found to be sensitive to the distribution of the outcome
and especially in the case of heavy tailed distributions due to the
presence of outlying observations. While numerical transformations
and truncation could be a way forward, one more general and conservative
option could be to calculate the MWSL once using a virtual predictor
sampled from a Gaussian distribution (Figure S-4).Using this MWSL approach, we propose extensions of existing
visualization
tools for the MWAS. This includes full resolution signed Manhattan
plots included functional annotation and higher resolution regional
plots displaying the per-variable strength of association as well
as spectral summary features including correlation patterns across
spectral variables. In our proof-of-principle example, we chose glucose
as the outcome of interest which defined a challenging context in
terms of results interpretability as a very large number of strong
associations were identified and the assessment of their relevance
went beyond the observed strong positive correlation for the expected
glucose peaks along the 1H NMR spectra. This called for
the definition of strong result prioritization strategy, which, in
less extreme situations, is also critical to identify the most relevant
associations that are worth dedicating resources for molecular assignment.
Our approach relies on subsampling strategy where discovery (80%)-replication
(20%) splits are used to identify associations that internally replicate.
This strategy was found to be efficient to prioritize the most robust
associations, which were not only those with the strongest p-values. We performed our internal validation at the metabolome-wide
level, which was computationally intensive. To scale this approach
in real-life studies, one alternative would consist in restricting
the discovery-replication split strategy for the MWAS candidates that
have been identified in the full population. The overall analytical
strategy presented here provides a general framework for the analysis
of cohort studies where large number of samples are profiled using
untargeted technologies (e.g., high-resolution NMR, mass spectrometry).
Overall, we believe the strategy and approach we present are generalizable
and scalable and may therefore be relevant to aid the MWAS approach,
particularly improving the interpretation of results.
Authors: Olivier Cloarec; Marc-Emmanuel Dumas; Andrew Craig; Richard H Barton; Johan Trygg; Jane Hudson; Christine Blancher; Dominique Gauguier; John C Lindon; Elaine Holmes; Jeremy Nicholson Journal: Anal Chem Date: 2005-03-01 Impact factor: 6.986
Authors: Joram M Posma; Isabel Garcia-Perez; Maria De Iorio; John C Lindon; Paul Elliott; Elaine Holmes; Timothy M D Ebbels; Jeremy K Nicholson Journal: Anal Chem Date: 2012-12-04 Impact factor: 6.986
Authors: Jeremy K Nicholson; Elaine Holmes; James M Kinross; Ara W Darzi; Zoltan Takats; John C Lindon Journal: Nature Date: 2012-11-15 Impact factor: 49.962
Authors: David S Wishart; Dan Tzur; Craig Knox; Roman Eisner; An Chi Guo; Nelson Young; Dean Cheng; Kevin Jewell; David Arndt; Summit Sawhney; Chris Fung; Lisa Nikolai; Mike Lewis; Marie-Aude Coutouly; Ian Forsythe; Peter Tang; Savita Shrivastava; Kevin Jeroncic; Paul Stothard; Godwin Amegbey; David Block; David D Hau; James Wagner; Jessica Miniaci; Melisa Clements; Mulu Gebremedhin; Natalie Guo; Ying Zhang; Gavin E Duggan; Glen D Macinnis; Alim M Weljie; Reza Dowlatabadi; Fiona Bamforth; Derrick Clive; Russ Greiner; Liang Li; Tom Marrie; Brian D Sykes; Hans J Vogel; Lori Querengesser Journal: Nucleic Acids Res Date: 2007-01 Impact factor: 16.971
Authors: Ioanna Tzoulaki; Raphaële Castagné; Claire L Boulangé; Ibrahim Karaman; Elena Chekmeneva; Evangelos Evangelou; Timothy M D Ebbels; Manuja R Kaluarachchi; Marc Chadeau-Hyam; David Mosen; Abbas Dehghan; Alireza Moayyeri; Diana L Santos Ferreira; Xiuqing Guo; Jerome I Rotter; Kent D Taylor; Maryam Kavousi; Paul S de Vries; Benjamin Lehne; Marie Loh; Albert Hofman; Jeremy K Nicholson; John Chambers; Christian Gieger; Elaine Holmes; Russell Tracy; Jaspal Kooner; Philip Greenland; Oscar H Franco; David Herrington; John C Lindon; Paul Elliott Journal: Eur Heart J Date: 2019-09-07 Impact factor: 29.983
Authors: Nicholas J W Rattray; Nicole C Deziel; Joshua D Wallach; Sajid A Khan; Vasilis Vasiliou; John P A Ioannidis; Caroline H Johnson Journal: Hum Genomics Date: 2018-01-26 Impact factor: 4.639