
Optimized Phenotypic Biomarker Discovery and Confounder Elimination via Covariate-Adjusted Projection to Latent Structures from Metabolic Spectroscopy Data.

Joram M Posma, Isabel Garcia-Perez, Timothy M D Ebbels, John C Lindon, Jeremiah Stamler, Paul Elliott, Elaine Holmes, Jeremy K Nicholson.

Abstract

Metabolism is altered by genetics, diet, disease status, environment, and many other factors. Modeling any one of these is often done without considering the effects of the other covariates. Attributing differences in metabolic profile to one of these factors needs to be done while controlling for the metabolic influence of the rest. We describe here a data analysis framework and novel confounder-adjustment algorithm for multivariate analysis of metabolic profiling data. Using simulated data, we show that similar numbers of true associations and significantly fewer false positives are found compared to other commonly used methods. Covariate-adjusted projections to latent structures (CA-PLS) are exemplified here using a large-scale metabolic phenotyping study of two Chinese populations at different risks for cardiovascular disease. Using CA-PLS, we find that some previously reported differences are actually associated with external factors and discover a number of previously unreported biomarkers linked to different metabolic pathways. CA-PLS can be applied to any multivariate data where confounding may be an issue, and the confounder-adjustment procedure is translatable to other multivariate regression techniques.


Keywords: Monte Carlo cross-validation; biomarker discovery; chemometrics; confounder elimination; covariate adjustment; metabolic phenotyping; multivariate data analysis; random matrix theory; reanalysis; sampling bias

Substances: Biomarkers

Year: 2018 PMID: 29457906 PMCID: PMC5891819 DOI: 10.1021/acs.jproteome.7b00879

Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 4.466

Introduction

Human metabolic phenotypes, “metabotypes”,[1,2] are influenced by multiple interacting factors, such as dietary, environmental, genetic, and microbial variation,[2−6] and reflect the health status of an individual.[7] Metabotypes can be studied with metabonomics and metabolomics,[8,9] which utilize multivariate statistical methods to find relevant changes in metabolite profiles related to outcomes/responses. For this, urine and plasma/serum are the most desirable biofluids, as they can be (relatively) noninvasively obtained and they are not likely to be volume limited in humans.[10] Urine gives a homeostatic signature of all metabolic processes in a biological system, including genetic, diet, and gut microbial activity,[11] and thus, the variation in the urinary metabolic profile can be attributed to many factors other than disease risk, whereas the plasma/serum biological matrix holds information on physiological status at the specific sampling time.[7] The measurement of metabolites in biofluid samples is most commonly performed using 1H nuclear magnetic resonance (NMR) spectroscopy and mass spectrometry (MS), with the latter often preceded with a liquid/gas chromatography step for metabolite separation.[12] Both platforms yield data sets/matrices (X) with thousands to ten-thousands of variables (p) with often a much lower number of samples (n). This wealth of information comes at a price: there are likely spurious findings which will need to be controlled for (type-I errors) and findings attributable to other factors rather than the response (Y) (e.g., disease state). Discovering accurate information relating to etiology or pathogenesis depends on a number of aspects: correct metabolite assignment using proper analytical assays, use of appropriate statistical methods, validation of results and ensuring confounding factors do not influence the relation between X and Y. 
The latter is an important aspect in epidemiology; however, it is not common practice in metabonomics despite the fact that in molecular epidemiology studies there often is a wealth of meta-data available that are often gathered at the same time as sample collection.[2] In the gene expression and proteomic literature, there are methods that aim to separate known covariations (batch/population structure) in data,[13−15] correct for unknown confounders,[16−18] or adjust for both types.[19,20] A difference between the analysis of genomic/proteomic and metabonomic data is that for the former data analysis is often done univariately, whereas for metabonomics data are often analyzed using multivariate methods as these are able to capture metabolite-metabolite interactions part of potentially perturbed pathways. Therefore, confounder correction methods used in genomics/proteomics are not necessarily suitable for metabonomics. Some metabonomic studies[6,11,21,22] have aimed to adjust for confounders using multiple linear regression (MLR) by regressing Y on a matrix of the covariates (C) and a single variable i from the data set (X). This adjusts the contribution of X on Y because C is also included in the same model; however, this approach is univariate for X and thus does not capture metabolite-metabolite interactions. Other studies[23−25] regress each confounder on X and then compare the “significant” metabolites with those found from regressing Y on X. Last, there is orthogonal projections to latent structures[26] (OPLS), which is widely used in metabonomics and removes variation orthogonal to Y from X before calculating the regression coefficients. OPLS has been claimed to possibly correct for confounders;[27−29] however, confounders are not necessarily orthogonal to Y; thus, this method will not correct for all confounders. 
Nevertheless, the (O)PLS method is popular in metabonomics because it can deal with large p, a high degree of collinearity and only requires one parameter to be estimated.[30] Different methods exist that deal with confounder adjustment in kernel matrices[31,32] and those that take an unsupervised functional data analysis approach.[33] However, while inclusion in kernel matrices can be beneficial in terms of classification by including potential nonlinearity in the kernel transformation, it comes at a price as the variable importance is lost for biological interpretation. The naïve approach of concatenating C and X and regressing Y on this concatenated matrix using multivariate methods will, most likely, not properly adjust for confounders as the variables in X will dominate the model. Regularization approaches such as the lasso[34] and elastic net[35] can be used to circumvent this problem in forcing the majority of regression coefficients (β) to 0. However, the lasso model can contain at most n nonzero βs and is known to perform poorly when data sets consist of correlated variables (as with metabonomics data). Therefore, the inclusion of a variable from X in the lasso model does not indicate it is not associated with C. The elastic net regularizes βs while simultaneously including groups of correlated variables; however, attributing which variables are associated with confounders is challenging and, in addition, it is a computationally heavy approach. We propose here a new data analysis framework and algorithm to correct for known confounders using PLS (or orthogonal signal corrected PLS), called covariate-adjusted PLS (CA-PLS); however, in theory any multivariate regression method can be used instead of PLS. Our method mimics how MLR works in the univariate case, where C is used to counterweight X and not Y,[36] and still provides a level of variable importance.

Materials and Methods

INTERMAP

The INTERMAP study investigates dietary and other factors associated with blood pressure[37] (BP), the major modifiable risk factor underlying the worldwide epidemic of cardiovascular diseases[38] (CVDs). INTERMAP surveyed a total of 4680 men and women aged 40–59 from 17 population samples in four countries (People’s Republic of China (PRC), Japan, United Kingdom, and United States) at two time-points (“visits”). In this study, data from the three Chinese population samples were used to study the effect of potential confounders on metabolic profiles and compared to a previous study on these data done without any adjustment[39] (Yap et al.). These three (rural) populations are from two geographical locations, two from the north (Pinggu county, Beijing, and Yu county, Shanxi) and one from the south (Wuming county, Guangxi). Studies have shown that northern and southern Chinese are at different risks for CVD[40] and the metabolic profiles of these populations are different, with the two northern Chinese populations most similar to each other.[41] However, it is unclear whether other factors may be causing the differences in metabolic profiles instead of genetics and environment.

NMR Spectroscopy

Urine specimens were analyzed using a Bruker Avance-III spectrometer, operating at 600.29 MHz (1H), equipped with a 5 mm, TCI, Z-gradient Cryo-probe. 1H NMR spectra of urine were acquired using standard 1D pulse sequences with water presaturation during both the relaxation delay (RD = 2 s) and mixing time (tm = 150 ms).[42] The 90° pulse length was 10 μs and the total acquisition time 2.73 s. Per sample, 64 scans were collected into 32K data-points using a spectral width of 20 ppm. Free induction decays (FIDs) were multiplied by an exponential weighting function (corresponding to line broadening of 0.3 Hz) prior to Fourier-transformation. Spectra were referenced to an internal standard (trimethylsilyl-[2H4]-propionate, TSP), and baseline- and phase-corrected using in-house software. Spectral regions containing water/urea (δ6.4 to 4.5), TSP (δ0.2 to −0.2), and noise (δ0.5 to 0.2, δ−0.2 to −4.5, δ15.5 to 9.5) were removed prior to median-fold change normalization.[43] Remaining variables were binned to 7100 variables using bin widths of 0.001 ppm to down-sample the total number of variables (for computation) while still retaining peak shapes. A separate study[44] showed good analytical reproducibility of the data set with 96% of split pairs correctly identified. Metabolic outliers were defined as participants whose principal component analysis scores, for either visit, mapped outside Hotelling’s T2 95% confidence interval (CI95), and were excluded.[41] Subset optimization by reference matching[45] (STORM) was used to identify metabolites using the correlation structure of 1H NMR data. Localized clustering of small spectral regions was used for selecting appropriate reference spectra. Additionally, a Bruker compound library, internal databases, and extensive 2D NMR identification strategies[46] were used for identification of molecular species.

CA-PLS Framework and Algorithm

We use a version of the SIMPLS algorithm designed specifically for wide X matrices[47] (see Supporting Information). As with the original PLS algorithm, it can also be used for orthogonal signal corrected[48] (OSC) PLS, in which X is replaced by the OSC matrix Xosc. The framework (Figure 1, left panel) is designed to minimize the effect of sampling/selection bias and to avoid overfitting models. All calculations are performed within a Monte Carlo cross-validation[49] (MCCV) procedure with 1000 iterations. In each iteration we randomly partition the data into a model (M) set and a validation (V) set, using 1/7th of the data for V to mimic the partitioning of Yap et al. The important aspect of this framework is that V is completely set aside and thus not used in scaling, parameter estimation, or modeling in any way, to avoid biasing model prediction. The MCCV framework used here has previously been used in a repeated-measures design to show that dietary patterns could be predicted using analysis of a urine sample.[50]
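The partitioning described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `mccv_splits`, the seed handling, and the rounding of the validation-set size are assumptions; only the 1000 iterations and the 1/7 validation fraction come from the text.

```python
import random

def mccv_splits(n_samples, n_iter=1000, val_fraction=1 / 7, seed=0):
    """Yield (model_idx, validation_idx) index lists for Monte Carlo CV."""
    rng = random.Random(seed)
    n_val = max(1, round(n_samples * val_fraction))  # assumed rounding rule
    indices = list(range(n_samples))
    for _ in range(n_iter):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # V is held out entirely; only M is used for scaling, tuning and fitting.
        yield sorted(shuffled[n_val:]), sorted(shuffled[:n_val])

# Example: 70 samples, 5 random partitions with 10 samples held out each time.
splits = list(mccv_splits(n_samples=70, n_iter=5))
```

Because each iteration draws a fresh random partition, a given sample appears in V in only some iterations, which is what later allows per-sample prediction means and standard deviations to be computed.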

Figure 1

Data analysis framework and covariate-adjustment algorithm. Left panel shows the different stages of the data analysis and how the introduction of bias is avoided by carefully splitting and scaling the data before modeling. Right panel (cyan box) outlines the covariate-adjustment algorithm that is used in the cyan-colored boxes of the data analysis framework in the left panel. The green outline indicates the entire MCCV procedure, the red dashed box the regression analysis performed for each covariate, and the blue dotted boxes indicate a CV loop. Here, β are regression coefficients, RMT stands for random matrix theory (see Supporting Information for algorithm), and ◦ denotes an element-wise operation. See Supporting Information for a glossary of mathematical operations used here.

The first step of the framework is to find the optimal parameter settings (τ) using cross-model-validation[51] (CMV) by partitioning M into multiple training and test sets for modeling each covariate. τ is deliberately left generic, as the types and numbers of parameters differ for each regression method to which this framework can be extended: for example, the number of components (as used here) for PLS, λ for ridge regression[52] and the lasso,[34] and λ and α for the elastic net.[35] If the optimal number of components is k, then for PLS k components are calculated, whereas for OSC-PLS a model is calculated with one predictive and k–1 orthogonal components.[53] For each partitioning, the training set is autoscaled (mean-centered and divided by the standard deviation) to match Yap et al. The test set is scaled using the parameters from the training set to avoid introducing bias. Each covariate is also autoscaled to ensure that the regression coefficients (β) for the covariates are on the same scale.
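The scaling rule above (estimate the parameters on the training set, then reuse them on the test set) can be illustrated with a short sketch. The helper names `fit_autoscaler` and `apply_autoscaler` are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not specify it.

```python
import statistics

def fit_autoscaler(train_rows):
    """Return per-column means and SDs estimated from the training rows only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]  # sample SD (assumption)
    return means, sds

def apply_autoscaler(rows, means, sds):
    """Mean-center and scale rows using previously fitted parameters."""
    return [[(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
            for row in rows]

train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[4.0, 20.0]]
means, sds = fit_autoscaler(train)                  # fitted on training data only
train_scaled = apply_autoscaler(train, means, sds)
test_scaled = apply_autoscaler(test, means, sds)    # test data never refits the scaler
```

Refitting the scaler on the test rows would leak information about the held-out samples into the model, which is the bias the framework is designed to avoid.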
τ is found by evaluating the cross-validated error-of-prediction (Q2) (eq ), where Ŷ is the predicted response and Y̅ the mean response for the test set. If, for binary responses, the predicted value is bigger (in an absolute sense) than its true value, we do not penalize this “error” and replace Ŷ with its true value[54] (eq ); here “sgn” is the sign operator. The goodness-of-fit (R2) is calculated identically, except using training instead of test data. When τ is found, β is calculated using M. This process is repeated for each covariate. However, not all covariates may be accurately modeled using the data. To avoid adjusting for covariates that cannot be modeled properly, we place constraints on which covariates are adjusted for. We use a lower threshold (0.10) for Q2 and another threshold (0.25) for the robustness of cross-validation (RCV) (eq ) to avoid adjusting for covariates that do not generalize well. We propose to use the RCV because whether a model generalizes well is often determined by judging, highly subjectively, whether the Q2 is positive and “high enough”. RCV can be seen as a measure of how well the model generalizes with respect to the optimal fit. A permutation scheme can be used to find a suitable lower bound for RCV (analogous to the permutation test for Q2-values[55]); here we simply use a hard threshold for RCV. However, care must be taken with negative Q2-values when calculating the RCV, for instance by setting a lower limit of 0 for the Q2. Low (or negative) Q2-values indicate poor model predictive ability, and cases where R2 is high but Q2 is low mean the model is overfitting the data, which indicates that the correct τ has not been found. The next step of the framework and algorithm is to adjust the data for covariates that pass the thresholds (cyan-colored boxes in the left panel and pseudocode in the right panel of Figure 1) and to model Y on the adjusted data matrix.
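The equations referred to in this paragraph are not reproduced in this record, so the following is only a sketch under stated assumptions. Q2 and R2 use the standard 1 − SSres/SStot form implied by the definitions of Ŷ and Y̅. The overshoot rule assumes binary responses coded ±1. The RCV is assumed to be the ratio Q2/R2 with Q2 floored at 0, which is consistent with the values reported later (for example R2 = 0.72, Q2 = 0.68, RCV ≈ 0.95) but is not confirmed by the text. All function names are hypothetical.

```python
def clip_overshoot(y_true, y_pred):
    """For +/-1-coded responses, replace predictions that overshoot the true
    value on the correct side by the true value (assumed reading of the text)."""
    clipped = []
    for y, yhat in zip(y_true, y_pred):
        same_sign = (y > 0) == (yhat > 0)
        clipped.append(y if same_sign and abs(yhat) > abs(y) else yhat)
    return clipped

def explained_fraction(y_true, y_pred):
    """1 - SS_residual / SS_total: R2 on training data, Q2 on test data."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - yhat) ** 2 for y, yhat in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def rcv(r2, q2):
    """Assumed form of the RCV: Q2 (floored at 0) relative to R2."""
    return max(q2, 0.0) / r2

def covariate_passes(r2, q2, q2_min=0.10, rcv_min=0.25):
    """Thresholds from the text; the use of >= rather than > is an assumption."""
    return q2 >= q2_min and rcv(r2, q2) >= rcv_min

y_test = [1, 1, -1, -1]
y_hat = [1.4, 0.5, -0.8, 0.2]
q2 = explained_fraction(y_test, clip_overshoot(y_test, y_hat))
```

In this toy example the first prediction (1.4 for a true value of 1) is not penalized, whereas the last one (0.2 for a true value of −1) falls on the wrong side and contributes fully to the residual sum of squares.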
M is again split into training and test sets, which are scaled as before, and the data are then adjusted for covariates using the algorithm shown in the right panel of Figure 1. As covariates may be correlated, the resulting βs will also be correlated and are thus nonorthogonal. The βs of covariates must therefore be adjusted for their autocorrelation (rB) prior to adjusting the data. In this procedure, a Jacobian matrix is numerically computed from rB using random matrix theory[56] (RMT, see Supporting Information). The number of uncorrelated components (υ, line 1) is then defined as the number of eigenvalues from the decomposition of rB that are larger than the largest eigenvalue of the Jacobian matrix (see Supporting Information). Once υ is known, β is decorrelated: a new set of decorrelated βs (D) is obtained by decomposing the covariance of β (ΣB) and retaining the eigenvectors/eigenvalues that explain at least 95% of the total variance (subject to at least υ components being retained) (lines 2–4). Next, υ components that span the space of D (where υ ≤ rank(D) ≤ c) are sought in an iterative process and saved in the columns of W (lines 7–21). W is then normalized and transformed to span the space of β, resulting in uncorrelated regression coefficients (U) (lines 22–24). U is used to adjust X for the βs from C (lines 25–27). Note that the same U is used for all adjustments, so it need only be calculated once (only lines 25–27 need to be repeated for each new data matrix to replace “X”: M, M, M, V). The optimal number of components for Y (“τ”) is found using the adjusted training M matrices. Regression coefficients for Y (βY) are then found by adjusting M using U. Using βY (and U), the validation set V, which has not been used at any stage, can now be predicted free from bias. This entire process is repeated in the MCCV. To find the variable contributions across all models, we recalculate each βY 25 times by bootstrapping[57] Y and M.
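The retention rule for the decorrelated βs (keep enough eigenvalues to explain at least 95% of the variance, but never fewer than υ) can be sketched as below. This covers only the counting step: the eigenvalues and υ are taken as given inputs, since the RMT comparison that yields υ is described in the Supporting Information and is not reproduced here. The function name is hypothetical.

```python
def n_components_to_retain(eigenvalues, min_components, variance_fraction=0.95):
    """Smallest number of leading eigenvalues that explains at least
    `variance_fraction` of the total variance, with at least
    `min_components` (the value of upsilon) retained."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= variance_fraction and k >= min_components:
            return k
    return len(ordered)

# Toy eigenvalues: the first three explain 96% of the variance.
k_variance_only = n_components_to_retain([6.0, 3.0, 0.6, 0.4], min_components=1)
k_with_floor = n_components_to_retain([6.0, 3.0, 0.6, 0.4], min_components=4)
```

With these toy values the variance criterion alone keeps three components, while a floor of υ = 4 forces all four to be kept.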
Thus, after MCCV, two matrices of βs are obtained: those of each model (n = 1000) and those of the bootstrap models (n = 25 000). The mean of the model βs and the variance of the bootstrap βs are calculated, and from these t-scores, and subsequently P-values, are calculated for each variable in the multivariate model.[58] P-values are corrected for multiple testing using the False Discovery Rate[59] (Q-value). We allow 5% false discoveries. Only variables whose βs have the same sign and Q < 0.05 are considered to be consistently and similarly contributing (“significant”) in the MCCV. Precompiled code to run CA-PLS is available from the first author’s Web site.
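The multiple-testing step can be illustrated with a small sketch. It assumes the Benjamini-Hochberg step-up procedure for the False Discovery Rate; the cited reference may use a different estimator, and the function names are hypothetical. The selection helper applies the stated rule (Q < 0.05 and matching sign of β) across two sets of results.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted values (often called q-values) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def consistent_variables(betas_a, betas_b, q_a, q_b, alpha=0.05):
    """Indices of variables with Q < alpha in both result sets and equal sign."""
    return [i for i, (ba, bb, qa, qb) in enumerate(zip(betas_a, betas_b, q_a, q_b))
            if qa < alpha and qb < alpha and (ba > 0) == (bb > 0)]

q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
selected = consistent_variables([0.2, -0.1, 0.3], [0.1, 0.2, 0.4],
                                [0.01, 0.01, 0.20], [0.02, 0.01, 0.01])
```

In the selection example only the first variable is kept: the second changes sign between the two result sets, and the third fails the Q-value threshold in the first set.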

Variable Significance

Variable significance is shown in plots as , defining S as “significance”, β the regression coefficient, and q the Q-value for variable i. A variable has to be “significant” in both visits and have the same sign. The Supporting Information contains information about how we simulated data sets (Supplementary Figures 1 and 2) to show the difference between consistently and similarly contributing variables between (OSC)PLS and CA-(O)PLS. Calculations were performed in MATLAB (R2014a, The Mathworks, USA).
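The expression for S is missing from the sentence above and cannot be recovered from this record. A common convention for such plots, assumed here and not taken from the source, is a signed log-transformed Q-value, S_i = sgn(β_i) × (−log10 q_i), so that the direction of the association and its significance are shown on one axis. The function name and the numerical floor on q are hypothetical.

```python
import math

def signed_significance(beta, q, floor=1e-300):
    """Assumed form: sign of beta times -log10 of the Q-value."""
    sign = (beta > 0) - (beta < 0)
    return sign * -math.log10(max(q, floor))  # floor avoids log10(0)

s_up = signed_significance(0.4, 0.01)      # positive beta, q = 0.01
s_down = signed_significance(-0.3, 0.001)  # negative beta, q = 0.001
```

Under this assumed convention, variables with Q < 0.05 map to |S| greater than about 1.3.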

Results and Discussion

Method Comparison Using Simulated Data

We compare CA-(O)PLS with PLS and OSC-PLS for the simulated data sets (see Supporting Information) with confounders introduced into the data sets. Here, CA-(O)PLS adjusts either for confounder 1 (nonorthogonal to Y) or confounder 2 (almost orthogonal to Y). Supplementary Table 1 shows how the methods performed in finding consistently contributing variables associated with the case/control status for the data sets with an effect size of 1 (for inducing differences between groups). It shows the percentage of false negative (type-II error) and false positive (type-I error) findings. All methods yield between 1 and 3% type-II errors; the differences between them lie in the number of type-I errors. CA-(O)PLS (correcting for confounder 1) yields fewer type-I errors, 0.31% (CI95 [0.26, 0.36]), than the other methods (1.35–3.09%). As expected, OSC-PLS and CA-(O)PLS (for confounder 2) perform similarly; however, PLS yields fewer type-I errors, 1.35% (CI95 [1.26, 1.44]), than OSC-PLS/CA-(O)PLS (confounder 2) (3.09%/2.91%), which is surprising. However, Supplementary Table 1 shows that the OSC-PLS model finds more significant variables that correlate with case/control status, whereas PLS finds more significant variables that are uncorrelated with case/control status. Similar, but less pronounced, results were found for a data set with less overlap (effect size of 1.645) between groups (Supplementary Table 2).

Unadjusted Model

It has been shown that the prevalence of CVD in general, and hypertension (HBP) specifically, is higher in the north of PRC[40] and that northern and southern Han Chinese are genetically different.[60] We find significant differences (Supplementary Table 3) between our Chinese populations for a number of dietary, lifestyle, metabolic, and population risk factors for CVD for which we aim to adjust. However, we first compare the results from Yap et al., who used t-tests to test the significance of each 1H NMR variable, with results obtained using our framework and assessment of variable contributions. Figure 2a and b show the resulting MCCV score plots of the mean predictions. For the model of visit 1 (“model 1”), the R2 is 0.72 and the Q2 is 0.68, resulting in an RCV of 0.95. For the model of visit 2 (“model 2”), the R2, Q2, and RCV are 0.71, 0.67, and 0.95, respectively. This highlights the overall quality of the models. To indicate the spread of the predictions, we include the kernel density estimate (KDE) of predicted values for each class. To obtain the KDE, we calculate for each sample the mean and standard deviation of its prediction when it was part of a test set in MCCV. Summing the distribution of each sample per class then yields the KDE as shown. The local sharp peaks of the KDE indicate large between-person variability.
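The KDE construction described above (one distribution per sample, summed per class) can be sketched as follows. The sketch assumes each sample contributes a Gaussian with its own prediction mean and standard deviation, and it divides the sum by the number of samples so the class curve integrates to 1; both the Gaussian form and the normalization are assumptions, and the function names are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def class_density(grid, sample_means, sample_sds):
    """Sum of one Gaussian per sample, divided by the number of samples."""
    n = len(sample_means)
    return [sum(normal_pdf(x, m, s) for m, s in zip(sample_means, sample_sds)) / n
            for x in grid]

# Toy class: three samples with different prediction means and spreads.
grid = [i * 0.01 - 5.0 for i in range(1001)]
density = class_density(grid, [0.8, 1.0, 1.3], [0.1, 0.3, 0.05])
area = sum(density) * 0.01  # Riemann-sum check of the normalization
```

A sample whose predictions vary little across MCCV iterations (small standard deviation) contributes a narrow, tall peak, which is why sharp local peaks appear when individuals differ strongly from one another.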

Figure 2

Score plots of the MCCV models of predictive and first orthogonal components with kernel density estimate (KDE), R2 and Q2 shown for the predictive axis. North Chinese individuals (Beijing and Shanxi) are shown as red circles and south Chinese (Guangxi) as cyan crosses. (a) Unadjusted model of urine collection 1. (b) Unadjusted model of urine collection 2. (c) Covariate-adjusted model of urine collection 1. (d) Covariate-adjusted model of urine collection 2. Age, gender, BMI, (on medication for) HBP, smoking status, physical activity, Na/K ratio, and total intake of fats were adjusted for in the CA-(O)PLSDA models (c and d).

To analyze the sensitivity of the models, we used each model to predict the data set from the other visit, resulting in goodness-of-external-predictions of 0.61 for model 1 (predicting visit 2) and 0.64 for model 2 (predicting visit 1) (Supplementary Figure 3). While both data sets consist of the same individuals, the similarity between spectra from the same individuals across visits is not high, as indicated by an Rv-coefficient[61] of 0.31, where Rv = 1 indicates perfect similarity and Rv = 0 indicates dissimilarity. This is another reason why we use both data sets in determining consistently contributing variables: to avoid capturing visit-specific variability.

Adjusted Model

Next, we picked a number of significantly different or important factors from Supplementary Table 3 (age, gender, body mass index (BMI, kg × m–2), (on medication for) HBP, smoking status, physical activity, Na/K-ratio, and total intake of fats) to adjust for using our CA-(O)PLS algorithm. HBP/medication status was chosen over individual measurements of systolic/diastolic BP because both measurements of BP are lowered by medication while there still is an underlying condition. The CA-(O)PLS algorithm determines, as described, which covariates are modeled accurately enough to be included in the adjustment. After the adjustment procedure, the geographical location was modeled as a binary outcome variable. The resulting score plots (Figure 2c,d) of the MCCV predictions still show good separation; however, predictive ability is lower than for the unadjusted model, with Q2 values of 0.50 and 0.60 for urine collection 1 and 2, respectively. The resulting RCVs are 0.92 and 0.93, indicating that the validation procedure is robust. This is again also demonstrated by predicting the other visits, with goodness-of-external-predictions of 0.46 for model 1 (predicting visit 2) and 0.57 for model 2 (predicting visit 1) (Supplementary Figure 4). Interestingly, here the KDEs are smooth distributions of the two groups, caused by the removal of specific variation in the data related to covariates/potential confounders. While it may seem counterintuitive to obtain models with a (slightly) lower predictive ability after correcting for confounders, it is a logical consequence of the covariates correlating with the outcome. However, the drawbacks of models with a lower predictive value (due to correlation between the outcome variable and covariates) are more than made up for by improved interpretability, as the important variables relate only to the outcome and to the part of the data that is not affected by covariates. Figure 3 shows metabolites that consistently contribute to models.
Unidentified metabolites are only included if their STORM[45] pseudospectrum showed clear interpretable patterns (Supplementary Figure 6).

Figure 3

Top shows the average 1H NMR spectrum from the first visit. The bottom panel shows the variable contribution across MCCV models. Models were adjusted for age, gender, HBP/medication, BMI, physical activity, smoking status, Na/K-ratio, and total fat intake. Labels: 1, 2-oxoisocaproate; 2, leucine; 3, valine; 4, unknown (1.15(s), 3.49(d), 3.61(d), 3.67(m), 3.83(m)); 5, ethylglucuronide; 6, 2-hydroxyisobutyrate; 7, unknown (1.42(d), 1.46(d), 1.51(d)); 8, unknown (1.82(m), 3.52(s)); 9, N-acetyl-S-(1Z)-propenyl-cysteine-sulfoxide; 10, glutamine; 11, acetone; 12, unknown (2.32(d), 2.34(d), 2.38(d), 2.40(d), 3.52(m)); 13, prolinebetaine; 14, sarcosine; 15, dimethylglycine; 16, unknown (1.84(m), 2.78(m), 2.95(s), 3.36(m), 3.59(m), 3.62(m)); 17, creatine; 18, N6,N6,N6-trimethyllysine; 19, dimethylsulfone; 20, O-acetylcarnitine; 21, carnitine; 22, taurine; 23, 4-hydroxyhippurate; 24, 1-methylhistidine; 25, histidine; 26, tyrosine; 27, pseudouridine; 28, formate; 29, N-methylnicotinic acid. Supplementary Figure 5 shows the results for the unadjusted model.

Discriminatory Metabolites

We compare in Supplementary Table 4 the metabolites reported previously[39] and those found using the unadjusted and confounder-adjusted procedures. Our unadjusted procedure finds the same metabolites previously reported, plus a number of new associations. A large number of these metabolites are no longer significant after the confounder-adjustment and thus are likely related to one or more of the covariates. In every iteration of the MCCV, the covariates are remodeled and only included if they are sufficiently predictive. In theory this depends on the sampling of training/test sets; however, in practice we find a high consistency in which covariates could be accurately modeled. Gender and smoking status were modeled accurately for all models and Na/K-ratio in 94.4% (visit 1) and 100% (visit 2) of models. Age, fat intake, and HBP/medication could not be modeled accurately, and BMI and physical activity were only included in 4–11% of models. To give a rough estimate of the association between the covariates and the metabolites that are no longer significant after covariate-adjustment, we calculated a correlation network (Supplementary Figure 7). The correlations were adjusted for multiple testing using a Bonferroni-corrected threshold of P < 1.9 × 10^−4 for both visits. Dietary Na/K-ratio has a large number of correlations, which appear to have the inverse sign of the correlations of physical activity with metabolite levels. A logical reason is that Na/K-ratio and physical activity are themselves inversely associated. In general, south Chinese individuals are physically more active and consume more potassium and less sodium, and individuals who are more active have lower Na/K-ratios (Supplementary Table 3, Supplementary Figure 8).

Metabolic Reaction Network Analysis

Using metabolites differentially expressed between the Chinese populations, we constructed a condensed multicompartmental metabolic reaction network using MetaboNetworks.[62] It calculates the shortest paths (number of reactions) between metabolites, only considering reactions that can occur in the Homo sapiens supra-organism. We found a number of gut microbial cometabolites, and thus included reactions that can occur in species from the phyla firmicutes, bacteroidetes, alpha-proteobacteria, beta-proteobacteria, gamma-proteobacteria, delta-proteobacteria, and actinobacteria. These phyla make up 99% of phylotypes found in the human gut.[63] Figure 4 shows the resulting metabolic reaction network for the urinary metabolic differences in Chinese populations. Reactions between metabolites are indicated by a solid line for those that are spontaneous or due to human enzymes and by dotted lines for reactions that occur only in gut microbiota. The background colors indicate different types of conventional metabolic pathways. Figure 4 highlights the interconnectivity between many of the metabolites found to be differentially expressed between northern and southern Chinese.
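The shortest-path idea (fewest reactions between two metabolites) can be illustrated with a breadth-first search on a toy graph. This is not the MetaboNetworks implementation: the function name is hypothetical and the edge list below is a small illustrative example, not an extract from KEGG.

```python
from collections import deque

def shortest_reaction_path(network, source, target):
    """Breadth-first search; returns the metabolites along one shortest path."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbor in network.get(path[-1], []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no route between the two metabolites

# Toy undirected network (illustrative edges only).
edges = [("choline", "betaine"), ("betaine", "dimethylglycine"),
         ("dimethylglycine", "sarcosine"), ("sarcosine", "glycine"),
         ("sarcosine", "creatine")]
network = {}
for a, b in edges:
    network.setdefault(a, []).append(b)
    network.setdefault(b, []).append(a)

path = shortest_reaction_path(network, "choline", "sarcosine")
n_reactions = len(path) - 1
```

Breadth-first search suffices here because every reaction counts as one step; a weighted scheme would need a different algorithm such as Dijkstra's.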

Figure 4

Perturbations to a living system often instigate changes to multiple pathways simultaneously; we show here a condensed multicompartmental metabolic reaction network of the homeostatic urinary signature of differences between north and south Chinese individuals for the human supra-organism, created using MetaboNetworks. A link is shown between two metabolites if the reaction is listed in KEGG and can occur in Homo sapiens (solid lines) or the most abundant endosymbionts (dotted lines). Metabolites not connected in the network, and those not listed in KEGG, were connected to the closest related metabolite in the network, indicated by a dashed line. The background shading illustrates different types of metabolism based on the closest affinity with some overlap between groups. A table with full names for the abbreviated metabolite names can be found in Supplementary Table 5.

Branched-chain amino acids (BCAAs) and derivatives (leucine, valine, 2-oxoisocaproate) are found higher in the north compared to the south, which may indicate a difference in energy metabolism, potentially also reflected by the tricarboxylic acid (TCA) cycle intermediates citrate and succinate, and isoleucine found higher in the north without adjustment. Aside from BCAAs, amino acids histidine, tyrosine, and glutamine are also found higher in the north. Glutamine is involved in TCA anaplerotic metabolism and histidine in muscle metabolism. While tyrosine partly links into the TCA anaplerotic metabolism via a microbial conversion to pyruvate, it is an aromatic amino acid. Other aromatic compounds were found to be higher in the south before covariate adjustment (4-cresylsulfate, phenylacetylglutamine, hippurate, 4-hydroxyphenylacetate, 3-hydroxymandelate). The fact that 4-cresylsulfate, phenylacetylglutamine, and hippurate are no longer significant after adjustment is related to gender and body weight differences[11,64] (Supplementary Figure 7).
Another aromatic compound found higher in the south is 4-hydroxyhippurate, which has been linked to citrus fruit intake[65] and healthy eating in general.[50] We also find the most common biomarker of citrus fruit intake,[65] prolinebetaine, in higher concentrations in southern Chinese individuals. Aside from being a biomarker for citrus fruit intake, prolinebetaine is also considered an osmoprotectant, as are carnitine and trimethyllysine, which are also found in higher concentrations in the south. Also, the intracellular concentration of taurine (found higher in the south) increases when the extra-cellular fluid is hypertonic;[66] this may indicate that southern Chinese individuals (lower Na/K-ratios) are under less osmotic stress, reflected by the excretion of these metabolites. It should be noted, however, that the excretion of carnitine and O-acetylcarnitine is also linked to meat intake[67] and that taurine is a major metabolite in the conjugation of bile acids, which may be related to the higher intake of fats in the southern Chinese, indicating differences in lipid/fatty acid metabolism between the regions as well as being indicative of cataplerosis.[68] Acetone is a byproduct of the breakdown of lipids/fatty acids for energy release. It has been noted that incomplete fatty acid oxidation and fat excess in skeletal muscle tissue can perturb energy anaplerosis and cause diabetes.[69] Aside from taurine, two other sulfur-containing metabolites are found, dimethylsulfone and N-acetyl-S-(1Z)-propenyl-cysteine-sulfoxide. Both are biomarkers of onion consumption,[46,70] with the latter validated in a controlled clinical trial.[46,50] Another metabolite possibly linked to dietary intake is ethylglucuronide, which is a long-term marker of alcohol consumption and a component of rice wine.[71] N-methylnicotinic acid is a metabolite linked to many different sources, among which are coffee consumption[72] and peas,[46] and is a major metabolite of niacin (vitamin B3).
In the metabolic network (Figure ), there are multiple domains related to B-vitamins, such as thiamin (B1), pantothenate (B5), pyridoxal (B6), biotin (B7), folate (B9), and cobalamin (B12). These play roles in many processes, including lipid metabolism. Closely linked to the lipidic domain through choline metabolism are formate, dimethylglycine, and sarcosine. These were all found higher in the north and are part of 1-carbon metabolism. Sarcosine is also linked to creatine and urea formation via a microbial link. Creatine is involved in muscle metabolism, among many other processes, and, like 1-methylhistidine, is found higher in the south. This reflects differences in muscle metabolism between populations, possibly a long-term effect of physical activity (Supplementary Table 3). We also find 2-hydroxyisobutyrate, a product of n-butyrate-producing bacteria,[5] higher in the south. The same bacterium (F. prausnitzii) is associated with higher levels of β-aminoisobutyrate, taurine, and dimethylamine, and lower levels of lactate and glycine. With the exception of dimethylamine, these metabolites are similarly expressed in the southern Chinese in the unadjusted model, indicating a possible difference in n-butyrate-producing bacteria. Last, pseudouridine, a marker of tRNA turnover, is higher in the south.

Conclusion

Adjusting data for confounders may lead to a loss of predictive power; however, the number of spurious findings (type-I errors) is reduced, thereby greatly improving model interpretability. The CA-(O)PLS framework leads to more robust sets of biomarkers and more accurate predictions by (1) reducing sampling bias through independent scaling and MCCV, (2) optimizing parameter settings using CMV, (3) removing layers of confounding information from the data, and (4) evaluating variable importance across multiple models instead of calculating a single model.[53] We recommend testing whether each covariate can be modeled accurately before including it. Covariates that cannot be modeled accurately are not adjusted for and therefore do not influence the models. If this is the case for all covariates, CA-(O)PLS defaults to (OSC−)PLS. While many factors can be adjusted for simultaneously, this will ultimately lead to a loss of power, regardless of the analysis method (univariate/multivariate). However, including highly (anti)correlated covariates does not pose a problem for CA-(O)PLS, as it finds an orthonormal set of factors from the covariate models with which to adjust the data sets. The benefit of CA-PLS is that it adjusts the data directly, which allows a posteriori interpretation of the metabolic signatures associated with covariates, as opposed to other methods that work on a kernel matrix,[32] in which this information is lost. We showed that confounders that differ between northern and southern Chinese individuals influence metabolite associations. We find that some previously reported associations are primarily associated with potential confounders. The metabolites that we found to contribute consistently to the models highlight important underlying processes, most notably lipid, energy, and gut-microbial metabolism, which are potentially of interest in determining what drives the differences in prevalence of CVDs in the Chinese population.
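The idea of adjusting data with an orthonormal set of covariate factors can be illustrated with a minimal sketch. This is not the CA-PLS implementation described in the paper: the raw covariate vectors below stand in for the score vectors of the covariate models, the function names (`orthonormal_basis`, `adjust_for_covariates`) are hypothetical, and the orthonormalization uses a plain singular value decomposition, which is an assumption made for illustration. The simulated data are arbitrary.

```python
import numpy as np


def orthonormal_basis(scores, rel_tol=1e-10):
    """Orthonormal basis for the column space of mean-centered `scores`.

    `scores` is an (n_samples, n_covariates) array. In this sketch it holds
    raw covariate values; it is only a stand-in for covariate model scores.
    """
    centered = scores - scores.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    # Drop directions with negligible variance so that (near-)collinear
    # covariates do not contribute spurious factors.
    keep = s > rel_tol * s[0]
    return u[:, keep]


def adjust_for_covariates(X, scores):
    """Project the covariate-related factors out of the mean-centered data."""
    Xc = X - X.mean(axis=0)
    Q = orthonormal_basis(scores)
    return Xc - Q @ (Q.T @ Xc)


# Simulated example: two highly correlated covariates and one confounded variable.
rng = np.random.default_rng(0)
n, p = 60, 8
age = rng.normal(size=n)
weight = 0.95 * age + 0.05 * rng.normal(size=n)  # strongly correlated with age
scores = np.column_stack([age, weight])

X = rng.normal(size=(n, p))
X[:, 0] += 2.0 * age  # variable 0 is driven by the confounder

X_adj = adjust_for_covariates(X, scores)

# After adjustment, no variable retains a linear association with the covariates.
residual = (scores - scores.mean(axis=0)).T @ X_adj
print(np.abs(residual).max())
```

Because the factors are orthonormalized before they are removed, the two correlated covariates are handled as a shared subspace rather than being subtracted twice, and the adjusted matrix stays in the original variable space, so it can still be inspected variable by variable.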
The multivariate confounder-adjustment framework we describe is easily translatable to other multivariate regression techniques. Its potential benefit is not limited to metabolic phenotyping: in theory, it is applicable in any field where changes in collinear multivariate data can be attributed to confounders, for example, other “omics” technologies, drug discovery, ecology, and potentially finance.

54 in total

Review 1. The key role of anaplerosis and cataplerosis for citric acid cycle function.

Authors: Oliver E Owen; Satish C Kalhan; Richard W Hanson
Journal: J Biol Chem Date: 2002-06-26 Impact factor: 5.157

2. A metabonomic strategy for the detection of the metabolic effects of chamomile (Matricaria recutita L.) ingestion.

Authors: Yulan Wang; Huiru Tang; Jeremy K Nicholson; Peter J Hylands; J Sampson; Elaine Holmes
Journal: J Agric Food Chem Date: 2005-01-26 Impact factor: 5.279

3. Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics.

Authors: Frank Dieterle; Alfred Ross; Götz Schlotterbeck; Hans Senn
Journal: Anal Chem Date: 2006-07-01 Impact factor: 6.986

4. Systems biology: Metabonomics.

Authors: Jeremy K Nicholson; John C Lindon
Journal: Nature Date: 2008-10-23 Impact factor: 49.962

5. An integrated metabonomic approach to describe temporal metabolic disregulation induced in the rat by the model hepatotoxin allyl formate.

Authors: Ivan K S Yap; T Andrew Clayton; Huiru Tang; Jeremy R Everett; Gilles Hanton; Jean-Pierre Provost; Jean-Loic Le Net; Claude Charuel; John C Lindon; Jeremy K Nicholson
Journal: J Proteome Res Date: 2006-10 Impact factor: 4.466

6. Metabolic profiling and the metabolome-wide association study: significance level for biomarker identification.

Authors: Marc Chadeau-Hyam; Timothy M D Ebbels; Ian J Brown; Queenie Chan; Jeremiah Stamler; Chiang Ching Huang; Martha L Daviglus; Hirotsugu Ueshima; Liancheng Zhao; Elaine Holmes; Jeremy K Nicholson; Paul Elliott; Maria De Iorio
Journal: J Proteome Res Date: 2010-09-03 Impact factor: 4.466

7. An NMR-based metabonomic approach to investigate the biochemical consequences of genetic strain differences: application to the C57BL10J and Alpk:ApfCD mouse.

Authors: C L Gavaghan; E Holmes; E Lenz; I D Wilson; J K Nicholson
Journal: FEBS Lett Date: 2000-11-10 Impact factor: 4.124

Review 8. Nutrient-gene interaction: metabolic genotype-phenotype relationship.

Authors: Vay Liang W Go; Christine T H Nguyen; Diane M Harris; Wai-Nang Paul Lee
Journal: J Nutr Date: 2005-12 Impact factor: 4.798

9. Human metabolic phenotype diversity and its association with diet and blood pressure.

Authors: Elaine Holmes; Ruey Leng Loo; Jeremiah Stamler; Magda Bictash; Ivan K S Yap; Queenie Chan; Tim Ebbels; Maria De Iorio; Ian J Brown; Kirill A Veselkov; Martha L Daviglus; Hugo Kesteloot; Hirotsugu Ueshima; Liancheng Zhao; Jeremy K Nicholson; Paul Elliott
Journal: Nature Date: 2008-04-20 Impact factor: 49.962

10. Symbiotic gut microbes modulate human metabolic phenotypes.

Authors: Min Li; Baohong Wang; Menghui Zhang; Mattias Rantalainen; Shengyue Wang; Haokui Zhou; Yan Zhang; Jian Shen; Xiaoyan Pang; Meiling Zhang; Hua Wei; Yu Chen; Haifeng Lu; Jian Zuo; Mingming Su; Yunping Qiu; Wei Jia; Chaoni Xiao; Leon M Smith; Shengli Yang; Elaine Holmes; Huiru Tang; Guoping Zhao; Jeremy K Nicholson; Lanjuan Li; Liping Zhao
Journal: Proc Natl Acad Sci U S A Date: 2008-02-05 Impact factor: 11.205
