
Machine learning of human plasma lipidomes for obesity estimation in a large population cohort.

Mathias J Gerl¹, Christian Klose¹, Michal A Surma¹,², Celine Fernandez³, Olle Melander³,⁴, Satu Männistö⁵, Katja Borodulin⁶, Aki S Havulinna⁶,⁷, Veikko Salomaa⁶, Elina Ikonen⁸, Carlo V Cannistraci⁹,¹⁰,¹¹, Kai Simons¹,¹².

Abstract

Obesity is associated with changes in the plasma lipids. Although simple lipid quantification is routinely used, plasma lipids are rarely investigated at the level of individual molecules. We aimed at predicting different measures of obesity based on the plasma lipidome in a large population cohort using advanced machine learning modeling. A total of 1,061 participants of the FINRISK 2012 population cohort were randomly chosen, and the levels of 183 plasma lipid species were measured in a novel mass spectrometric shotgun approach. Multiple machine intelligence models were trained to predict obesity estimates, i.e., body mass index (BMI), waist circumference (WC), waist-hip ratio (WHR), and body fat percentage (BFP), and validated in 250 randomly chosen participants of the Malmö Diet and Cancer Cardiovascular Cohort (MDC-CC). Comparison of the different models revealed that the lipidome predicted BFP the best (R² = 0.73), based on a Lasso model. In this model, the strongest positive and the strongest negative predictor were sphingomyelin molecules, which differ by only 1 double bond, implying the involvement of an unknown desaturase in obesity-related aberrations of lipid metabolism. Moreover, we used this regression to probe the clinically relevant information contained in the plasma lipidome and found that the plasma lipidome also contains information about body fat distribution, because WHR (R² = 0.65) was predicted more accurately than BMI (R² = 0.47). These modeling results required full resolution of the lipidome to lipid species level, and the predicting set of biomarkers had to be sufficiently large. The power of the lipidomics association was demonstrated by the finding that the addition of routine clinical laboratory variables, e.g., high-density lipoprotein (HDL)- or low-density lipoprotein (LDL)-cholesterol, did not improve the model further. 
Correlation analyses of the individual lipid species, controlled for age and separated by sex, underscore the multiparametric and lipid species-specific nature of the correlation with the BFP. Lipidomic measurements in combination with machine intelligence modeling contain rich information about body fat amount and distribution beyond traditional clinical assays.


Substances:
Biomarkers
Sphingomyelins

Year: 2019 PMID: 31626640 PMCID: PMC6799887 DOI: 10.1371/journal.pbio.3000443

Source DB: PubMed Journal: PLoS Biol ISSN: 1544-9173 Impact factor: 8.029

Introduction

Obesity, the abnormal or excessive fat accumulation that may impair health [1], is associated with increased morbidity and mortality from diseases such as type 2 diabetes and cardiovascular disease [2, 3]. According to the World Health Organization, obesity has nearly tripled since 1975, with 39% of adults worldwide being overweight and 13% obese in 2016 [1]. Obesity can be estimated in a variety of ways: Most commonly, the body mass index (BMI), a ratio of body weight-for-height [4], is used as an indicator of general adiposity. It is convenient and simple but is associated with varying cardiovascular and metabolic manifestations across individuals. Although BMI largely increases as adiposity increases, it does not distinguish between fat and lean mass, and therefore, individuals with greater muscle mass will also have higher BMIs [5]. The waist-hip ratio (WHR) is an easily accessible measure of body fat distribution and consists of a comparison of waist and hip circumferences. Larger WHR indicates more intra-abdominal fat and is associated with higher risk for type 2 diabetes, cardiovascular disease, and mortality [6]. Similarly, waist circumference (WC) can be used and has been considered a more straightforward and reliable measure compared with WHR [7]. Furthermore, body fat percentage (BFP) is a measure of the proportion of adipose tissue in the body compared with lean mass and water [8] and is mostly determined using bioelectrical impedance in field methods. Bioelectrical impedance analysis is a repeatable, easy-to-use, and low-cost method for the estimation of BFP; however, its reliability can be influenced by various factors, including the equation used and the characteristics of the sample in which it has been validated [9]. BFP is associated with increased all-cause mortality independently of BMI and is often suggested to be a better estimation of adiposity than BMI for prognostic and exploratory purposes [10]. 
The human genetic predisposition to obesity is rather low. For example, a set of 97 genetic loci have been found associated with BMI, but they accounted for only 2.7% of BMI variation [11]. Similarly, a set of 12 loci explained 0.58% of the variance in BFP [12]. Thus, the genotype may not provide sufficient information for reliable risk assessment of obesity and associated outcomes, highlighting the need for more direct, phenotypic read-outs. Lipidomics is an omic science, which comprehensively measures the entirety of lipid molecules in a sample [13-15]—the lipid phenotype—and can be used to identify multiparametric biomarkers for disease detection, prediction, and patient stratification. For shotgun lipidomics, this can be obtained in a single mass spectrometric measurement after direct infusion of the sample. The plasma lipidome offers a plethora of information on lipids, the metabolism, and biological functions that are currently inaccessible to routine clinical lipid chemistry. This information can be used to obtain insights into many complex disease processes [16, 17]. The shotgun lipidomics technique, in which lipids are efficiently obtained from biological material by automated organic solvent extraction and measured quantitatively and reproducibly in an automated high throughput approach, allows fast screening of several thousand samples with high reproducibility [18], rendering this technology a promising tool for clinical risk assessment and precision biomedicine. Although first lipidomic biomarkers are entering the clinic [19, 20], certain analytical standards, such as intersite reproducibility, need to be established in order to make lipidomic measurements generally accepted in clinical settings [21, 22]. Here, we applied machine learning to model obesity estimates for a lipidomics data set of the large FINRISK 2012 population cohort comprising 1,061 plasma samples [23]. 
We identified a complex lipidomic signature for BFP and validated the model with an independent data set of the Malmö Diet and Cancer Cardiovascular Cohort (MDC-CC) comprising 250 randomly selected plasma lipidomes [24, 25] measured on the same platform [18]. We could predict BFP with an error of 8% of its full range and explain 73% of its variation based on age, sex, and the lipidome. This lipidomic signature of obesity outperforms classical clinical lipid measures and provides a fine-grained and quantitative molecular phenotype enabling stratification and identification of different obesity manifestations. Analyzing the plasma lipidome or the metabolome [26] to estimate obesity is of course much more complicated than direct measurement and not what we aimed for. Instead, we are investigating how the plasma lipids reflect metabolic status and whether the plasma lipidome can be used to predict health and disease. There is already ample evidence that the plasma lipidome changes in different disease states [16, 27], and here, we show that the plasma lipidome indeed gives information beyond obesity measures and classical clinical lipid parameters, such as triglycerides and cholesterol. We find that the lipidome gives information about the body fat distribution as measured by the WHR because a number of lipid species correlate with the WHR, even when controlled for BMI. Lipidomes show differences between the sexes, concerning lipid levels, lipid coefficients of variation, and correlations of lipid species with obesity measures. These correlation profiles were similar between the 3 obesity estimates but very different from those of lipids correlating with high-density lipoprotein (HDL) cholesterol, low-density lipoprotein (LDL) cholesterol, and triglyceride levels, indicating that these commonly used lipid markers only insufficiently capture molecular lipid metabolism. 
We discuss correlations with obesity measures and find that the lipids with the highest impact on our modeling algorithm are sphingomyelins containing 4E,14Z-sphingadiene. Finally, we look at the variation not explained by the BMI and BFP regressions and find it to be related to other clinical parameters, such as HDL and LDL cholesterol.

Results and discussion

We performed lipidomics analysis of 1,061 plasma samples of the FINRISK 2012 cohort (S2 Table shows clinical baseline characteristics). Plasma lipid species vary substantially between individuals and on a day-to-day basis [28, 29]. Coefficients of variation for each lipid subspecies showed population variations of 23% to 150% (S1A Fig), which is considerably larger than our 6.0% median technical coefficient of subspecies variation as assessed by reference samples (method precision). Low biological variation was found in lipid classes such as cholesterol (26%) and sphingomyelin (SM, median of 26%), whereas high variation was seen in dietary lipids like triacylglyceride (TAG) and diacylglyceride (DAG) species but also for phosphatidylethanolamine (PE) species. There are differences in variations between the sexes (S1B Fig), with TAGs varying more in males and SM varying more in females. Sex-specific differences are well documented in lipidomics studies [27, 30, 31].
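The comparison between population variation and technical variation above rests on the coefficient of variation (standard deviation divided by the mean). The following is a minimal sketch of that calculation, not the authors' pipeline; the function name and all amounts are made up for illustration.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the coefficient of variation (sample SD / mean) in percent."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; the coefficient of variation is undefined")
    return 100.0 * stdev(values) / m


# Hypothetical molar amounts (pmol) of a single lipid subspecies.
population = [120.0, 95.0, 180.0, 60.0, 150.0]   # one value per subject
reference = [100.0, 104.0, 97.0, 101.0, 98.0]    # repeated runs of a reference sample

biological_cv = coefficient_of_variation(population)  # variation across subjects
technical_cv = coefficient_of_variation(reference)    # method precision
```

A lipid is informative at the population level only if its variation across subjects clearly exceeds the technical variation, which is the comparison the paragraph above makes.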

Modeling obesity

Associations of BMI and obesity with lipidomes have been investigated before [27, 32], and a more detailed discussion can be found in the S1 Text. We proceeded to construct models predicting obesity from the lipidome of the FINRISK data set. Models were trained on lipid subspecies, including age and sex (S3 Fig) as covariables. Using a Lasso model [33] trained in a cross-validation loop, we first used BMI as our obesity measure and reached a mean absolute error (MAE) of 2.5 ± 0.18 and an explained variation of 47% (S7 Table). Then, using the same procedure, we analyzed how well the plasma lipidome predicts other obesity measures compared with the models we obtained for BMI, using a normalized MAE. On this comparable metric, BMI was outperformed (Fig 1A and S7 Table) by WC (MAE = 6.5 ± 0.59, 64% variation explained), WHR (MAE = 0.039 ± 0.0033, 65% variation explained), and BFP (MAE = 3.6 ± 0.33, 73% variation explained). This indicates that the lipidomic information about adiposity, as measured by WHR, WC, and BFP, is more precise than for BMI. Therefore, lipidomes contain information about the actual amount of body fat (BFP) and its distribution (WHR/WC). In the case of BFP, the high variation explained by the model is probably due to specific lipids released by the adipose tissue into the plasma. A similar notion has been reported in the case of branched-chain and aromatic amino acids [34].
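Because BMI, WC, WHR, and BFP are measured on different scales, their raw MAEs cannot be compared directly. The Fig 1 legend defines the normalized MAE (NMAE) as the MAE divided by the range from the 5th to the 95th percentile of the measure. The sketch below illustrates that definition only; the linear-interpolation percentile and the toy data are assumptions, and the study's exact percentile method is not stated here.

```python
def percentile(sorted_values, q):
    """Percentile q (0-100) of an ascending list, using linear interpolation."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def mae(observed, predicted):
    """Mean absolute error between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def nmae(observed, predicted):
    """MAE divided by the 5th-to-95th percentile range of the observed values."""
    ordered = sorted(observed)
    spread = percentile(ordered, 95) - percentile(ordered, 5)
    return mae(observed, predicted) / spread


# Toy data: observed values 0..100 and predictions that are always 9 units too high.
observed = [float(i) for i in range(101)]
predicted = [o + 9.0 for o in observed]
example_nmae = nmae(observed, predicted)  # MAE 9 over a percentile range of 90
```

Normalizing by a percentile range rather than the full range keeps a few extreme subjects from inflating the denominator.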

Fig 1

Regression of obesity measures by lipidome, age, and sex.


(A) The NMAE (MAE divided by the range from the 5th to 95th percentile; S2 Fig) of different obesity measures based on Lasso regression of molar amount data. Only subjects for whom all obesity measures were available were used. (B) MAE of BFP comparing different regression algorithms on molar lipid amount data (S7 Table). (C) Lasso based NMAE of BFP comparing direct molar amounts (pmol) to molar amounts standardized to the total lipid amount within a sample (mol%). A two-sided, unpaired Mann–Whitney U test resulted in a p-value of 0.99. (A–C) The summary statistics of a 5× repeated 10-fold cross validation of the FINRISK training data set (80% of the data set) are shown. (D) Quantile-quantile plot of the training residuals of the Lasso BFP model against a normal distribution. (E) Original BFP values (reference) in the FINRISK training, test, and the MDC-CC validation data set plotted against the prediction of Lasso regression based FINRISK training set. n signifies the number of samples in each set. (F) Histogram of fasting times of subjects in the FINRISK data set and scatter plot of the Lasso residuals against fasting time, including a linear model. The slope of the linear model had a p-value of 0.33. BFP, body fat percentage; BMI, body mass index; gbm, stochastic gradient boosting; lm, linear model; MAE, mean absolute error; MDC-CC, Malmö Diet and Cancer Cardiovascular Cohort; mol%, molar fraction; NMAE, normalized MAE; pls, partial least squares; pmol, picomol; rf, random forest; WC, waist circumference; WHR, waist-hip ratio.

We tested the presence of BFP-specific information in the lipidome by creating linear models for each lipid subspecies controlled for age and sex. This returned 141 significant lipid species after controlling for multiple testing. A substantial number of lipid species remained significant, even when the model was controlled for BMI (n = 82), WHR (n = 109), or BMI and WHR together (n = 52, S5 Table). 
A similar situation is found for WHR: linear models controlled for age and sex returned 134 WHR-associated lipid species, and similarly high numbers remained when the models were additionally controlled for BMI (n = 103), BFP (n = 90), or the combination of BMI and BFP (n = 93). As the relation of WHR and BFP with BMI seems nonlinear (S2 Fig), we also tested the relation using natural splines with similar results (S5 Table). All these results argue for a BFP- and WHR-specific but BMI-independent lipid biology captured by the human plasma lipidome, which is still largely unexplored.

Different BFP models and conditions

Six different models predicting BFP were trained and their parameters learned on 796 random training samples in a cross-validation loop (Fig 1B; results for WHR and BMI in S7 Table). Tree-based random forest [35] and stochastic gradient boosting [36] did not perform significantly better than an ordinary linear model [37] of all lipid predictors. Partial least squares [38], which is well suited for the multicollinearity characterizing lipidomic data sets, performed better, but the Lasso [33] and Cubist [39] models showed even better performance. The simple Lasso model fit the data as well as the Cubist model, and we used it for all remaining analyses because of its simplicity and interpretability. We also tested whether normalizing absolute lipid amounts to the total lipid amount in a sample (molar fraction [mol%]) would improve the fit by removing the influence of different lipid levels between samples. However, we found no evidence of this (Fig 1C).
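The mol% normalization tested above divides each lipid's molar amount by the total lipid amount of the same sample. A minimal sketch follows, assuming each sample is held as a name-to-pmol mapping; the function name, the two non-SM lipid names, and all amounts are hypothetical.

```python
def to_mol_percent(sample):
    """Convert a {lipid name: pmol} mapping to {lipid name: mol%} of the sample total."""
    total = sum(sample.values())
    if total <= 0:
        raise ValueError("total lipid amount must be positive")
    return {lipid: 100.0 * amount / total for lipid, amount in sample.items()}


# Hypothetical sample with made-up molar amounts (pmol).
sample = {"SM 34:1;2": 50.0, "CE 18:2;0": 300.0, "TAG 52:2;0": 150.0}
mol_pct = to_mol_percent(sample)
```

After this transformation every sample sums to 100, so differences in overall lipid level between samples are removed and only the composition remains.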

Description of the BFP model

The best-performing BFP Lasso model (MAE = 3.61 ± 0.33, variation explained = 73.2 ± 5%) resulted in 58 predictors, but there is also a Lasso model with slightly lower performance (MAE = 3.65 ± 0.33, variation explained = 72.9 ± 5.1%) that uses only 45 predictors and lies within 1 standard error of the best model (S4 Fig and S6 Table). The simpler multiparametric model based on 45 predictors is essentially a subset of the complex multiparametric model based on 58 predictors (Figs 2 and S5).
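Choosing a sparser model that lies within 1 standard error of the best cross-validated error is commonly called the one-standard-error rule. The sketch below shows the selection logic only; the function name and every number in the candidate list are invented for illustration and are not taken from S4 Fig or S6 Table.

```python
def one_standard_error_choice(candidates):
    """Pick the lowest-error model and the sparsest model within 1 SE of it.

    candidates: list of (n_predictors, cv_error_mean, cv_error_se) tuples.
    Returns (best, simplest).
    """
    best = min(candidates, key=lambda c: c[1])
    threshold = best[1] + best[2]
    eligible = [c for c in candidates if c[1] <= threshold]
    simplest = min(eligible, key=lambda c: c[0])
    return best, simplest


# Hypothetical regularization path: (number of predictors, CV MAE, standard error).
path = [
    (20, 4.10, 0.05),
    (45, 3.70, 0.05),
    (58, 3.66, 0.05),
    (90, 3.72, 0.05),
]
best_model, simplest_model = one_standard_error_choice(path)
```

With these made-up values the lowest error is reached with 58 predictors, while the 45-predictor model is the sparsest one whose error is statistically indistinguishable from it.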

Fig 2

Lasso model predictors.


Pearson correlation network of the lipid predictors of the best Lasso model predicting BFP with the lowest MAE and the model at 1 standard error distance (as in S4 Fig). A network cutoff of |r|>0.5 was used. Nodes are shaped as diamonds for predictors in both models and as circles if the predictor appears only in one model. Nodes are filled according to the β-coefficients of the model with the lowest MAE, with a gradient from blue to white for negative β-coefficients and a gradient from white to red for positive β-coefficients. Lipid labels are colored blue for negative β-coefficients and red for positive β-coefficients. Edge weights indicate the value of the correlation coefficient (r). All values of r are positive in this network. The data are reported in S6 Table, and β-coefficients are plotted in S5 Fig. BFP, body fat percentage; CE, cholesteryl ester; Chol, cholesterol; DAG, diacylglyceride; LPC, lysophosphatidylcholine; MAE, mean absolute error; PC, phosphatidylcholine; PE, phosphatidylethanolamine; PI, phosphatidylinositol; SM, sphingomyelin; TAG, triacylglyceride.

The Pearson correlation network of the predictors of both Lasso models (Fig 2) shows several interesting features. Within the common lipid predictors of both BFP Lasso models, SM 34:1;2 has the greatest negative and SM 34:2;2 the greatest positive lipid β-coefficients by far (S5 Fig and S6 Table), whereas both are correlated with each other in the correlation network within a cluster of other SM species. The additional double bond in SM 34:2;2 is likely due to an 18:2;2 long-chain base [40, 41], which is present in human plasma [41] and has been shown to be a 4E,14Z-sphingadiene [42], thus suggesting SM 18:2;2/16:0;0 as the subspecies for SM 34:2;2 in plasma [31]. SM 34:2;2 and further doubly unsaturated SMs, i.e., SM 36:2;2 and SM 38:2;2, correlate positively with BFP, especially in females (S11 Table), in which they also show higher levels (S3 Fig and S4 Table, [27, 31]). 
The 4E,14Z-sphingadiene is suggested to be produced by an unknown desaturase, which also creates the single 14Z double bond in 1-deoxysphingolipids [43]. Its supposed higher activity in females results in higher levels of the respective ceramides (Cers) and SMs [27]. As SM 34:1;2 has been reported to be >96% SM 18:1;2/16:0;0 in plasma [31], it is the occurrence of 4E,14Z-sphingadiene specifically in SMs with a 16:0;0 fatty acid that is the major correlation with BFP picked up by the Lasso models. Their significance is supported by the reduction in prediction power if the SM class is removed from the model (S8 Fig), the fact that SMs are a particularly stable lipid class in plasma (S1 Fig), and the recent report that long-chain base effects of plasma sphingolipids correlate with BMI [31]. How the balance between sphingosine and 4E,14Z-sphingadiene is mechanistically related to the overall metabolic status, and how useful it is as a general BFP biomarker, need to be further investigated. Associated with the SM cluster are multiple lipid predictors (cholesteryl ester [CE] 15:0;0, CE 17:0;0, and PC O-17:0;0/17:1;0) with odd chain fatty acids (Fig 2), which could be due to dairy consumption [44] or dietary fiber intake [45]. However, their association with SM and Cer species (Fig 2) might also indicate that these fatty acids are derived from hydroxylated fatty acids in glycosphingolipids or phytosphingosine [46] and therefore link the model to sphingolipids not measured in this study. Furthermore, we find a cluster of correlated lyso-lipids and of TAG species (Fig 2). TAGs with positive β-coefficients are largely consistent with common fatty markers [47]. A more detailed discussion of this observation and the association of product-to-precursor ratios of lipid metabolism enzymes to obesity measures is provided in the S1 Text. 
Although the Lasso models are dominated by 2 coefficients of the sphingadiene SMs, the error of the model increases significantly when fewer than 20 lipid predictors are used (S4 Fig), arguing that a single biomarker, or a small set of biomarkers, is not sufficient to predict and to faithfully capture the complex molecular scenario associated with obesity. In the FINRISK data set, fasting duration for subjects peaks around 5 hours (semifasting); however, we saw no trend in the model residuals with fasting length (Fig 1F), indicating that differences in fasting time do not have an impact on the accuracy of the prediction of BFP. This is likely because our model is based not only on diet-derived lipids, the levels of which vary acutely in the blood plasma; rather, the predictors of the model are spread across all lipid classes except for the single HexCer species (S12A Fig). For example, changes in the diet are reflected in serum TAGs within the first few hours, whereas serum CE and phospholipids reflect the last 3 to 6 weeks [48].
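The species names used in this section follow a compact shorthand. The sketch below parses species-level names such as SM 34:1;2 to make the "one double bond" difference between the two dominant predictors explicit. It assumes the pattern is class, total carbons, double bonds, and hydroxyl groups; the reading of the last number as a hydroxyl count is an assumption about the notation, and the parser deliberately does not handle subspecies names such as PC O-17:0;0/17:1;0.

```python
import re
from typing import NamedTuple


class LipidSpecies(NamedTuple):
    lipid_class: str
    carbons: int
    double_bonds: int
    hydroxyls: int  # assumed meaning of the number after the semicolon


# Species-level names only, e.g. "SM 34:1;2" or "CE 20:4;0".
_SPECIES = re.compile(r"^(?P<cls>[A-Za-z]+) (?P<c>\d+):(?P<db>\d+);(?P<oh>\d+)$")


def parse_species(name):
    """Parse a species-level shorthand name into its components."""
    match = _SPECIES.match(name)
    if match is None:
        raise ValueError(f"unsupported lipid name: {name!r}")
    return LipidSpecies(match["cls"], int(match["c"]), int(match["db"]), int(match["oh"]))


negative_predictor = parse_species("SM 34:1;2")
positive_predictor = parse_species("SM 34:2;2")
extra_double_bonds = positive_predictor.double_bonds - negative_predictor.double_bonds
```

Both names share the class and carbon count, so the only structural difference visible at this resolution is the additional double bond discussed above.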

Independent validation of the obesity model

Training of the BFP model on the FINRISK training set resulted in a cross-validation MAE of 3.61 ± 0.33 BFP units, which is about 8% of the BFP range (S2C Fig). The training error of the model was found at a MAE of 3.33 BFP units, and the mean error of the hold-out FINRISK test data was at 3.84 (Fig 1E and Table 1). We validated the FINRISK based BFP model in a second, independent data set (MDC-CC), the clinical baseline characteristics of which differ from the FINRISK data set (S2 Table).

Table 1

Reproducibility of the model.

Error type	Data set	n	MAE (BFP units)
Cross validation	FINRISK	796	3.61 ± 0.33
Training	FINRISK	796	3.33
Testing	FINRISK	206	3.84
Validation	MDC-CC	250	3.67

Abbreviations: BFP, body fat percentage; MAE, mean absolute error; MDC-CC, Malmö Diet and Cancer Cardiovascular Cohort


Models were trained on the FINRISK training data set in a cross-validation loop, which results in a BFP cross validation MAE. Fitting the model on all the training data, using the best performing parameter set, results in a training MAE, testing the model on the hold out test data gives the testing error, and applying the model to the independent MDC-CC data set results in the validation error. See S7 Table for results of all models.

This validation resulted in a MAE of 3.67, which is only slightly above the cross-validation error obtained with the FINRISK data set. The validation also confirms that the models obtained were independent of the fasting duration, because the participants from the MDC-CC cohort had fasted overnight. The MDC-CC validation data set was measured 2 years later than the FINRISK data set on the same platform, arguing for our shotgun lipidomic approach to be highly reproducible. Taken together, these results show that we have identified a robust BFP lipidomic signature (Fig 2 and S6 Table), which was validated in an entirely independent data set. It would be interesting to see whether the model is transferable to subjects from other geographic regions with different population structures and lifestyle habits, because both data sets used originate from northern European countries.

Comparison to a metabolomic obesity model

Recently, a metabolomic data set was used to model BMI using 49 selected metabolites. This study found that this set of metabolites explained 43% of BMI variation when age and sex were included [26]. If the model was extended to the full set of 650 metabolites measured in the study, 47% to 49% of the BMI variation could be explained. In both cases, a major fraction of the metabolites (47% and 40%, respectively) were associated with the lipid superpathway. Our BMI model, although similar in many modeling aspects, is exclusively based on shotgun lipidomics. With 75 predictors in a Lasso model, it explains 47% of the BMI variation, and a model with only 50 predictors resulted in 46.5% of the variation explained. Although the population, experimental set-up, and computational modeling in the metabolomic study and in our study are not directly comparable, this suggests that the data generated with our lipidome shotgun method provide predictions comparable in quality to those from the liquid chromatography-mass spectrometry (LC-MS) metabolomic data used in the above-mentioned study. However, the goal was achieved with a single measurement in a fully high-throughput assay. Therefore, shotgun mass spectrometry lipidomics, with its quantitative and straightforward approach, together with fast measurements, is reproducible, robust [18], and well prepared to be used in a routine clinical setting. Although the metabolome-derived model [26] explained only about 50% of the actual BMI variation, the metabolome-predicted BMI had improved features, such as better correlations with other clinically important variables, e.g., insulin resistance and HDL cholesterol levels. In addition, if the metabolome predicted a substantially higher BMI than the actual BMI of the subject, these subjects scored worse on a set of clinical health measures. If, however, the metabolome predicted a lower BMI than their actual BMI, the subjects scored better on the respective health measures. 
Because of uneven distribution of outliers in our models, we were not able to fully show the just described outlier characteristics as in Cirulli and colleagues [26]. However, when we restrict our models to a range of the FINRISK data set, in which both over- and underestimated outliers are present, we observe similar effects (S9 Fig). Still, the overpredicted outliers are in the range of low observed BMI and the underpredicted outliers in a range of high observed BMI. Therefore, the mean BMI of overpredicted samples (24.3 ± 2.0) is much lower than the mean BMI of underpredicted samples (26.9 ± 2.1). Despite this adverse setting, we could validate that individuals who had a lower BMI than predicted from the plasma lipidome had worse routine clinical laboratory values, e.g., HDL and LDL cholesterol, than those individuals whose actual BMI was higher than predicted (S9 Fig). Similar results were obtained for our BFP regression of female subjects (S10 Fig), and weaker trends were observed for male subjects (S11 Fig). Therefore, our results confirm the earlier outlier findings [26] but extend them to a lipidomic setting and also to BFP as an obesity measure. They support the conclusion that a multiparametric lipidomic estimate has a stronger predictive value on obesity than the classical predictors, improving on shortcomings in terms of total fat amount and distribution. They further show that lipidomics captures obesity-related metabolic aberrations more accurately than classical clinical parameters. The fact that outliers of the obesity predictions align with better or worse clinical laboratory values suggests overlapping markers of obesity and other diseases, e.g., a dysregulated lipid metabolism, which not only align with obesity but a range of diseases [47]. Although the lipid metabolism measured by the lipidome would be interpreted by the model as, e.g., a higher BFP, these markers might actually be hinting at other diseases. 
This unaccounted variation should be further explored.

Effect of input variables and level of lipidome resolution on BFP

To test the quality of the lipidomic predictions, we compared our results with predictions of BFP based on clinical parameters (Fig 3 and S8 Table). As a zero model, we used the mean of the BFP distribution with a MAE of 7.3 (Fig 3B, −Age, −Sex, No Lipids [L]) or 0% of variation explained (S7 Fig and S9 Table). Addition of routine clinical laboratory values (e.g., total cholesterol, triglycerides, LDL cholesterol, HDL cholesterol) to the model hardly improved the BFP prediction (Fig 3B [C], MAE = 7.1 or 8.0% of variation explained). Inclusion of additional variables, e.g., smoking status or blood pressure treatment alone (Fig 3B [A], MAE = 7.2 or 5.8% variation explained) or together with the routine clinical laboratory values (Fig 3B [C + A], MAE = 7.0 or 11% of variation explained) also did not improve prediction of BFP.
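The zero model used as a reference above predicts the mean of the outcome for every subject. The sketch below shows why such a model explains 0% of the variation while still having a well-defined MAE. It is an in-sample illustration with made-up BFP values; the explained-variation formula (1 minus residual over total sum of squares) is an assumption about how the percentages were computed.

```python
from statistics import mean


def mae(observed, predicted):
    """Mean absolute error between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def explained_variation(observed, predicted):
    """Fraction of variation explained: 1 - SS_residual / SS_total."""
    centre = mean(observed)
    ss_total = sum((o - centre) ** 2 for o in observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_residual / ss_total


# Hypothetical BFP values (percent body fat) for five subjects.
bfp = [20.0, 25.0, 30.0, 35.0, 40.0]

# Zero model: no predictors, every subject gets the mean BFP.
zero_prediction = [mean(bfp)] * len(bfp)
zero_mae = mae(bfp, zero_prediction)
zero_r2 = explained_variation(bfp, zero_prediction)
```

Any model with real predictors is then judged by how far it lowers the MAE below, and raises the explained variation above, this baseline.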

Fig 3

Effect of different input variables and lipidome detail on the BFP regression.


(A) Lipidome hierarchy. The FINRISK lipidome, used for modeling, can be aggregated into 4 categories, 14 lipid classes, 143 species, and 183 lipid species and subspecies (S6 Fig). (B) MAE cross-validation mean and standard deviation (n = 50) based on lipidomes of the FINRISK training data set (n = 796) are shown on the y-axis. Modeling was either done without age and sex as covariables (−Age −Sex) or with (+Age +Sex). Results are colored according to lipidome detail. Either no lipid information was used (No Lipids), or lipidome information was aggregated into lipid categories (Categories), lipid classes (Classes), and lipid species (Species). Subspecies denotes the highest structural resolution possible on the platform with a mix of species and subspecies. Variables in addition to the lipidome are shown on the x-axis: No additional input [L]; routine clinical laboratory variables [C]: total cholesterol, HDL cholesterol, LDL cholesterol, triglycerides, HDL to LDL ratio, total cholesterol to HDL ratio, triglycerides to HDL ratio; additional variables [A]: blood pressure treatment, lipid treatment, smoker, pregnant, fasting, prevalent diabetes, prevalent CVD, prevalent liver disease, prevalent coronary heart disease, prevalent stroke, systolic blood pressure, diastolic blood pressure; or the combination of clinical and additional variables [C + A]. Special points are the zero model (L, No Lipids, −Age −Sex), which does not use any predictors but returns the mean of the BFP variable, and the regression only based on age and sex, (L, No Lipids, +Age +Sex), both of which are used as references for BFP predictability without regression based on L, C, or A input (S8 Table). BFP, body fat percentage; CVD, Cardiovascular disease; HDL, high-density lipoprotein; LDL, low-density lipoprotein; MAE, mean absolute error.

We then assessed how increased structural resolution of the lipidome influenced the predictive outcome (Figs 3A and S6). 
Already including the total molar amounts of 4 plasma lipid categories [49], glycerolipids, glycerophospholipids, sphingolipids, and sterol lipids, improved prediction outcomes to a MAE of 6.7 to 7.1 (7.0%–18% variation explained; Categories). Because this enhancement is adding to the improvement obtained by clinical parameters alone, it shows that lipid category amounts add information not contained in the other variables. The next level of structural detail is that of plasma lipid classes (e.g., PC or PE). Addition of the total molar amounts of 14 lipid classes to the BFP model further improved the prediction to 6.0 to 6.1 or 27% to 32% variation explained (Classes). The biggest improvement of the model was obtained when information of molar amounts of individual lipid molecules (143 species or a mixture of 183 species and subspecies) was used, reaching a MAE of 4.8 ± 0.43 or 55% to 57% variation explained (Species/Subspecies). Therefore, molecular lipid information is clearly superior in predicting BFP over more aggregated measures, such as HDL cholesterol, LDL cholesterol, total cholesterol, lipid categories, or lipid classes. We confirmed that the prediction based on lipid subspecies was not improved by including information on classical clinical parameters. This is expected, since LDL cholesterol, HDL cholesterol, and triglycerides have very distinct correlation patterns with the lipid subspecies profile (S16 Fig) and are, therefore, already represented in the lipidome. The correlations are in line with reported relative amounts of lipids found in these lipoproteins [50]. We also observed multiple interesting lipid species-specific differences in these correlations, e.g., sex-specific signs for correlations of PE species with HDL cholesterol. In the case of the clinical triglycerides measurement, the major correlating lipid classes are expectedly TAG and DAG (S16 Fig). 
However, some highly unsaturated TAGs do not correlate strongly with the triglycerides value. Furthermore, sex-specific correlations of triglycerides are observed for cholesterol and ceramides, both of which show greater correlation coefficients in males than in females. This is in agreement with the results provided in previous studies [30, 51]. Contrary to BMI, BFP is strongly influenced by gender (S2 Fig), which is reflected by the improved prediction outcome after including age and sex variables into the model (S7 Fig). BFP predictions based on age and sex variables alone already have a MAE of 4.5 to 5.2 or 45% to 57% variation explained (Fig 3, +Age +Sex). When age and sex are considered, also routine clinical laboratory values result in improved BFP prediction (Fig 3, +Age +Sex: C, C+A). However, when the structural detail of lipid information is increased, predictions of BFP improved even further. For the model containing age, sex, and lipid subspecies, a MAE as low as 3.6 ± 0.33 or 73% variation explained was achieved. In this case, 62% of the variation not explained by age and sex is explained by the lipidome (S10 Table). These subspecies models are also not improved by the addition of clinical parameters or additional variables, which again shows that these parameters provide no additional information for BFP prediction. Similar models for BMI, WC, and WHR show comparable results (S7 Fig and S8 Table) with some variation on the magnitude of dependency on age and sex or classical clinical parameters. We also tested whether a lipid class was necessary for prediction or could be compensated for by the other lipid classes. The most important lipid class for predicting BFP was SM. Apart from PC, the only lipid class, the removal of which reduces model performance is SM (S8 Fig). This is in agreement with complex correlation patterns observed for SM species. 
As mentioned above, SM 34:1;2 is inversely correlated with BFP in males, whereas SM 34:2;2 is directly correlated in females (Fig 4), both with a similar correlation estimate and the greatest negative (SM 34:1;2) and positive (SM 34:2;2) β-coefficients in the Lasso models (S5 Fig).

Fig 4

Correlation of lipid subspecies with BFP.

Spearman correlation coefficients (ρ) and their 95% CI for each sex (male and female), adjusted for subject age, are shown for lipid subspecies. HC signifies the HexCer lipid class. Green ticks at the bottom indicate subspecies used in the model with the lowest MAE (n = 58), whereas yellow ticks indicate the model within 1 standard error of the lowest MAE (n = 45, S4 Fig). Correlations with Benjamini-Hochberg corrected p < 0.05 are shown with filled points, whereas correlations with p > 0.05 are shown as transparent open circles. Differences between male and female correlations were compared, and significant differences are indicated by an asterisk (*). A total of 202 lipid subspecies were tested in 1,005 subjects; of those, 53.5% (108) were significantly correlated with BFP in females and 65.3% (132) in males. Some lipid subspecies of interest have been labeled. Additional correlations are shown in S12 Fig, and correlation data are provided in S11 Table. BFP, body fat percentage; CI, confidence interval; MAE, mean absolute error.

We conclude that plasma lipidomes, measured by a single shotgun mass spectrometric analysis, have significantly more power to predict obesity than classically used clinical parameters and that it is the resolution to molecular detail at the subspecies level that provides the relevant information. It is to be expected that similar predictive information on other metabolic states (disease and lifestyle) is represented in the multiparametric lipidomics data.

Lipid correlations with BFP

Although the predictive algorithm (Lasso) selects features on the basis of overall prediction error, it cannot be used to identify the individual lipidomic features that correlate with BFP. Therefore, a comprehensive Spearman correlation analysis of lipid subspecies and additional features was performed for male and female subjects separately, including age as a covariable (Figs 4 and S12). The results showed that, of a total of 202 tested lipid subspecies, 53.5% (108) in females and 65.3% (132) in males correlated significantly with BFP after Benjamini-Hochberg correction for multiple hypothesis testing. The BFP to lipidome correlation profile is similar to the BMI (S13 Fig), WHR (S14 Fig), and WC (S15 Fig) profiles; however, the magnitude of the correlation coefficients reflects the respective MAEs in modeling (Fig 1A), with BFP showing the highest correlation coefficients and the lowest error. Lipidome correlations to HDL cholesterol, LDL cholesterol, or triglycerides (S16 Fig), on the other hand, show very different profiles. As expected from the prediction results, there is a lipid subspecies-specific effect, i.e., individual lipid species show proportional or inverse correlations despite being in the same lipid class. For example, CE 20:1;0 and CE 20:2;0 correlate inversely, whereas CE 20:3;0 and CE 20:4;0 correlate proportionally with BFP. All lipid classes are observed to contribute significant correlations (Fig 4 and S1 Text). These results suggest complex systemic perturbations of lipid metabolism in the obese state.
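The multiple-testing step described above can be illustrated with a minimal sketch of the Benjamini-Hochberg procedure. This is not the code used in the study (the analysis was done in R); the function names `benjamini_hochberg` and `count_significant` and the example p-values are hypothetical and serve only to show how adjusted p-values and the fraction of significant features are obtained.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the
    # adjusted values stay monotone (step-up procedure)
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def count_significant(p_values, alpha=0.05):
    """Number and fraction of tests with adjusted p-value below alpha."""
    adjusted = benjamini_hochberg(p_values)
    hits = sum(1 for q in adjusted if q < alpha)
    return hits, hits / len(p_values)


# Hypothetical raw p-values for five lipid subspecies
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
print(benjamini_hochberg(raw))
print(count_significant(raw))
```

In this toy example, two of the five raw p-values below 0.05 lose significance after correction, which is why the percentages reported above refer to corrected rather than raw p-values.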

Conclusion

We show that by exploiting the species diversity revealed by a single quantitative lipidomics measurement in a large population cohort, we can use machine learning to model and validate obesity estimates better than by using classical clinical parameters, such as total triglycerides and cholesterol. These results show that the molecular details of the plasma lipidome capture obesity-related metabolic aberrations more accurately than these classical clinical parameters. We further confirmed that outliers in the correlation between lipidomic profiles and obesity measures have clinical profiles that could predispose to later obesity-related noncommunicable diseases [26, 52]. The future challenge will be to use this technology to stratify obesity to accurately predict who will stay healthy and who will progress toward disease.

Materials and methods

FINRISK 2012 cohort

The National FINRISK Study is a Finnish population survey conducted every 5 years since 1972 [23]. Samples of the FINRISK 2012 survey underwent lipidomics measurements (1,141 randomly selected individuals), of which 1,061 were used (see S2 Table) after lipidomic quality control based on total lipid amount or disturbed lipid profile. FINRISK participants were advised to fast for at least 4 hours before the examination and to avoid heavy meals earlier during the day. Measurements were obtained as described by Borodulin and colleagues [23]. BFP in the FINRISK study was measured using a bioelectrical impedance device (Tanita TBF-300MA, Tanita Corporation, Tokyo, Japan). The FINRISK 2012 survey was approved by the Coordinating Ethical Committee of the Helsinki and Uusimaa Hospital District, the participants gave written informed consent, and the study was conducted following the principles of the Declaration of Helsinki. All data discussed in the paper can be made available to established researchers by a written application to the FINRISK Executive Board. The application portal is located at https://thl.fi/fi/tutkimus-ja-kehittaminen/tutkimukset-ja-hankkeet/finriski-tutkimus/tietoa-tutkijoille. More information can be obtained through info@med.lu.se.

The MDC-CC

MDC-CC is a Swedish cohort designed to study the epidemiology of carotid artery disease from 1991 through 1994 [24, 25]. The MDC-CC was approved by the Regional Board of Ethics in Lund, Dnr 2009/63, and all participants provided written informed consent. A total of 250 subjects were randomly selected as a validation data set, and clinical characteristics of the study samples are presented in S2 Table. MDC-CC participant plasma samples were collected after overnight fasting. Bioelectrical impedance analyzers (BIAs) were used to estimate body composition, and BFP was calculated using an algorithm according to procedures provided by the manufacturer (BIA 103, single-frequency analyzer, JRL Systems, Detroit, IL, USA) [8]. MDC-CC data discussed in the paper will be made available to readers based on a written application to the MDC-CC steering committee (info@med.lu.se).

Lipid nomenclature

Lipid molecules are identified as species or subspecies. Fragmentation of the lipid molecules in MSMS mode delivers subspecies information, i.e., the exact acyl chain (e.g., fatty acid) composition of the lipid molecule. MS only mode, acquiring data without fragmentation, cannot deliver this information and provides species information only. In that case, the sum of the carbon atoms and double bonds in the hydrocarbon moieties is provided. Lipid species are annotated according to their molecular composition as <lipid class> <sum of carbon atoms>:<sum of double bonds>;<sum of hydroxyl groups>. For example, PI 34:1;0 denotes phosphatidylinositol with a total length of its fatty acids equal to 34 carbon atoms, a total number of double bonds in its fatty acids equal to 1, and 0 hydroxylations. In the case of sphingolipids, SM 34:1;2 denotes a sphingomyelin species with a total of 34 carbon atoms, 1 double bond, and 2 hydroxyl groups in the ceramide backbone. Lipid subspecies annotation contains additional information on the exact identity of their acyl moieties and their sn-position (if available). For example, PI 18:1;0_16:0;0 denotes phosphatidylinositol with octadecenoic (18:1;0) and hexadecanoic (16:0;0) fatty acids, for which the exact position (sn-1 or sn-2) in relation to the glycerol backbone cannot be discriminated (underscore "_" separating the acyl chains). In contrast, PC O-18:1;0/16:0;0 denotes an ether-phosphatidylcholine, in which an alkyl chain with 18 carbon atoms and 1 double bond (O-18:1;0) is ether-bound to the sn-1 position of the glycerol and a hexadecanoic acid (16:0;0) is connected via an ester bond to the sn-2 position of the glycerol (slash "/" separating the chains signifies that the sn-position on the glycerol can be resolved). Lipid identifiers of the SwissLipids database [53] (http://www.swisslipids.org) are provided in S1 Table.
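To make the annotation scheme concrete, the following sketch parses names of the form used in this paper into their components and collapses a subspecies to its species-level sum composition. It is an illustration only: the names `Chain`, `Lipid`, `parse_lipid`, and `species_level` are hypothetical, and writing the species-level name of an ether lipid as "PC O-34:1;0" is our assumption based on the "PC O-" class label used elsewhere in the text.

```python
import re
from dataclasses import dataclass
from typing import List

# One hydrocarbon chain: optional ether prefix, carbons:double bonds;hydroxyls
CHAIN_RE = re.compile(r"^(O-)?(\d+):(\d+);(\d+)$")


@dataclass
class Chain:
    carbons: int
    double_bonds: int
    hydroxyls: int
    ether: bool


@dataclass
class Lipid:
    lipid_class: str
    chains: List[Chain]
    sn_resolved: bool  # True only when chains are separated by "/"


def parse_lipid(name: str) -> Lipid:
    """Parse e.g. 'SM 34:1;2', 'PI 18:1;0_16:0;0', or 'PC O-18:1;0/16:0;0'."""
    lipid_class, _, composition = name.strip().partition(" ")
    if not composition:
        raise ValueError(f"no composition found in {name!r}")
    chains = []
    for part in re.split(r"[/_]", composition):
        match = CHAIN_RE.match(part)
        if match is None:
            raise ValueError(f"cannot parse chain {part!r} in {name!r}")
        ether, carbons, double_bonds, hydroxyls = match.groups()
        chains.append(Chain(int(carbons), int(double_bonds),
                            int(hydroxyls), ether is not None))
    return Lipid(lipid_class, chains, "/" in composition)


def species_level(lipid: Lipid) -> str:
    """Collapse a subspecies to its sum composition (species level)."""
    carbons = sum(c.carbons for c in lipid.chains)
    double_bonds = sum(c.double_bonds for c in lipid.chains)
    hydroxyls = sum(c.hydroxyls for c in lipid.chains)
    prefix = "O-" if any(c.ether for c in lipid.chains) else ""
    return f"{lipid.lipid_class} {prefix}{carbons}:{double_bonds};{hydroxyls}"


print(species_level(parse_lipid("PI 18:1;0_16:0;0")))
print(parse_lipid("PC O-18:1;0/16:0;0").sn_resolved)
```

Collapsing subspecies in this way corresponds to moving one level up in the lipidome hierarchy (subspecies to species) that is compared in the prediction models.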

Analytical process design

Samples were divided into analytical batches of 84 samples each. Each batch was accompanied by a set of 4 blank samples (150 mM ammonium bicarbonate [in water]) and a set of 8 identical control reference samples (human blood plasma). These control samples, in groups of 1 blank and 2 reference samples, were distributed evenly across each batch and extracted and processed together with the study samples to control for background and intrarun reproducibility.

Lipid extraction for mass spectrometry lipidomics

Mass spectrometry–based lipid analysis was performed as described by Surma and colleagues [18]. For lipid extraction, an equivalent of 1 μL of undiluted plasma was used, and plasma lipids were extracted with methyl tert-butyl ether/methanol (7:2, V:V) [54]. Internal standards (Avanti Polar Lipids, Birmingham, AL) were premixed with the organic solvents mixture. The internal standards included known amounts of: cholesterol D6 (Chol), cholesterol ester 20:0 (CE), ceramide 18:1;2/17:0 (Cer), diacylglycerol 17:0/17:0 (DAG), phosphatidylcholine 17:0/17:0 (PC), phosphatidylethanolamine 17:0/17:0 (PE), lysophosphatidylcholine 12:0 (LPC), lysophosphatidylethanolamine 17:1 (LPE), triacylglycerol 17:0/17:0/17:0 (TAG), and sphingomyelin 18:1;2/12:0 (SM). After extraction, the organic phase was transferred to an infusion plate and dried in a speed vacuum concentrator. Dried extract was resuspended in 7.5 mM ammonium acetate in chloroform/methanol/propanol (1:2:4, V:V:V). All liquid handling steps were performed using Hamilton Robotics STARlet (Hamilton Robotics, Reno, NV) with the Anti Droplet Control feature for organic solvents pipetting. Chemicals and solvents of HPLC/LC-MS analytical grade were used (Merck, Darmstadt, Germany).

MS data acquisition

Samples were analyzed by direct infusion in a QExactive mass spectrometer (Thermo Scientific, Bremen, Germany) equipped with a TriVersa NanoMate ion source (Advion Biosciences, Ltd., Ithaca, NY). Samples were analyzed in both positive and negative ion modes with a resolution of R = 280,000 for MS and R = 17,500 for MSMS experiments in a single acquisition. MSMS was triggered by an inclusion list encompassing corresponding MS mass ranges scanned in 1 Da increments. Both MS and MSMS data were combined to monitor CE, DAG, and TAG ions as ammonium adducts; PC and PC O− as acetate adducts; and PE, PE O−, and PI as deprotonated anions. MS only was used to monitor LPE and LPE O− as deprotonated anions; Cer, SM, LPC, and LPC O− as acetate adducts; and cholesterol as ammonium adduct.

Postprocessing

Spectra were analyzed with in-house developed lipid identification software based on LipidXplorer [55, 56]. TAGs are quantified as species (e.g., TAG 48:0;0). Fatty acid amounts within TAG species are obtained by distributing the total species amount according to the fatty acid fragment intensities. Data postprocessing and normalization were performed using an in-house developed data management system. Only lipid identifications with a signal-to-noise ratio >5 and a signal intensity 5-fold higher than in corresponding blank samples were considered for further data analysis. For FINRISK, lipid amounts were corrected for batch variations using 8 reference samples per 96-well plate batch. An occupational threshold of 70% was applied to the data, keeping lipid species that were present in at least 70% of the subjects. The median coefficient of subspecies variation, as assessed by reference samples, was 5.96%. In the MDC-CC data set, batch correction was applied using 8 reference samples per 96-well plate. Amounts were also corrected for analytical drift if the p-value of the slope was below 0.05 with an R2 greater than 0.75 and the relative drift was above 5%. The median coefficient of subspecies variation, as assessed by reference samples, was 10.49%, and no occupational threshold was applied. For predictive modeling, lipid species were matched between the FINRISK and MDC-CC datasets.
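Two of the postprocessing steps above, the 70% occupational threshold and the median coefficient of variation in reference samples, can be sketched as follows. The in-house data management system is not public, so this is only an illustration under stated assumptions: the data layout (a dictionary mapping lipid names to lists of amounts, with `None` marking a lipid that was not detected), the function names, and the use of the sample standard deviation for the coefficient of variation are all ours.

```python
from statistics import mean, median, stdev


def apply_occupancy_threshold(table, min_fraction=0.70):
    """Keep lipids detected in at least `min_fraction` of the subjects.

    `table` maps a lipid name to a list of amounts, one per subject,
    where None marks a lipid that was not detected in that subject.
    """
    kept = {}
    for lipid, values in table.items():
        detected = sum(v is not None for v in values)
        if detected / len(values) >= min_fraction:
            kept[lipid] = values
    return kept


def coefficient_of_variation(values):
    """CV in percent: 100 * sample SD / mean over the detected values."""
    observed = [v for v in values if v is not None]
    return 100.0 * stdev(observed) / mean(observed)


def median_cv(reference_table):
    """Median CV across lipids, computed from repeated reference samples."""
    return median(coefficient_of_variation(v) for v in reference_table.values())


# Hypothetical amounts in 10 subjects; lipid B is detected in only 6 of 10
subjects = {
    "A": [1.0, 1.2, None, 0.9, 1.1, 1.0, 1.3, 0.8, 1.0, 1.1],
    "B": [0.2, None, None, 0.3, None, 0.2, 0.1, None, 0.2, 0.3],
}
print(sorted(apply_occupancy_threshold(subjects)))

# Hypothetical amounts of three lipids in three reference samples
references = {"A": [9.0, 10.0, 11.0], "B": [10.0, 10.0, 10.0], "C": [8.0, 10.0, 12.0]}
print(median_cv(references))
```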

Data analysis

Data were analyzed with R version 3.4.2 [37] using tidyverse packages [57]. For correlation analysis, outliers that were more than 4.5 interquartile ranges from the median of the lipid species were removed, whereas the full set was used for predictive modeling. The data set was split by sex, and for each subset, age-adjusted Spearman correlation coefficients (ρ) were calculated using the RVAideMemoire::pcor.test() function [58]. CIs were created using 1,000 bootstrap resamples. When testing whether male and female correlation coefficients were significantly different from each other, the cocor.indep.groups() function from the cocor package was used with default parameters. Correlation coefficients were considered significantly different if the confidence interval of the difference did not include zero [59, 60]. The correlation network was calculated with the stats::cor() function using the Pearson correlation method and pairwise complete observations. The network was visualized using Cytoscape version 3.5.0 [61]. For linear models of obesity measures with covariables, outliers that were more than 4.5 interquartile ranges from the median of the lipid species were removed. Regression models were created by the lm() function, and 95% CIs were calculated using the confint() function. Natural splines were created with the ns() function of the splines package. Degrees of freedom of the splines were analyzed in a 10-fold cross-validation loop with MAE as readout: no effect was determined for age, and 3 degrees of freedom were used as a default for the obesity estimate covariables. A coefficient of partial determination [62] was calculated as the proportion of the variation that cannot be explained in a reduced model but is explained by the full model, using the residual sum of squares (RSS) of the full model and the reduced model:

$$R^2_{\mathrm{partial}} = \frac{\mathrm{RSS}_{\mathrm{reduced}} - \mathrm{RSS}_{\mathrm{full}}}{\mathrm{RSS}_{\mathrm{reduced}}}$$
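As a worked illustration of the coefficient of partial determination, the sketch below computes it from the residual sums of squares of a reduced and a full model, using the standard definition (RSS of the reduced model minus RSS of the full model, divided by RSS of the reduced model). The function names and the numbers are hypothetical; in the study, the models themselves were fitted with lm() in R. The statement in the Results that 62% of the variation not explained by age and sex is explained by the lipidome is a quantity of this kind.

```python
def residual_sum_of_squares(observed, predicted):
    """Sum of squared differences between observations and predictions."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def partial_r2(rss_reduced, rss_full):
    """Share of the reduced model's unexplained variation that the
    additional predictors of the full model explain."""
    if rss_reduced <= 0:
        raise ValueError("the reduced model must leave some variation unexplained")
    return (rss_reduced - rss_full) / rss_reduced


# Hypothetical BFP values and fitted values of two nested models
observed = [20.0, 30.0, 40.0, 50.0]
fitted_reduced = [25.0, 25.0, 45.0, 45.0]  # e.g., age and sex only
fitted_full = [22.0, 28.0, 42.0, 48.0]     # e.g., age, sex, and lipids

rss_reduced = residual_sum_of_squares(observed, fitted_reduced)
rss_full = residual_sum_of_squares(observed, fitted_full)
print(rss_reduced, rss_full, partial_r2(rss_reduced, rss_full))
```

In this toy example, the reduced model leaves an RSS of 100 and the full model an RSS of 16, so the added predictors explain 84% of the variation that the reduced model could not.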

Predictive modeling

Cubist [39], Lasso [33], partial least squares [38], stochastic gradient boosting [36], random forest [35], and linear models were trained within the caret package, version 6.0–76 [63]. The input data were filtered to contain only complete measurements of all modeled variables (BFP, BMI, WHR, WC; n = 1,002) and randomly split into a training (80%, n = 796) and a test data set (20%, n = 206), which were used for all models. Models were trained using a 5× repeated 10-fold cross-validation loop. Within the cross-validation loop, data were centered and scaled (Z-score) to avoid predominance of the most abundant features, missing values were imputed by the median value of the predictor, and near zero-variance variables were removed [63]. The final model was fit on all training data. MAEs were calculated by predicting the training data (training MAE), hold-out test data (test MAE), and validation data (validation MAE).
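The structure of the resampling scheme can be sketched as follows. This is not the caret implementation used in the study: to stay self-contained, the sketch replaces the actual learners with an ordinary least-squares line on a single predictor, and all function names are hypothetical. What it is meant to show is that median imputation and Z-scoring are estimated on the training part of each fold only and then applied to the held-out part, and that 5 repeats of 10 folds yield 50 MAE values. A predictor that is constant within a training fold would cause a division by zero here, which is one reason near zero-variance variables have to be removed beforehand.

```python
import random
from statistics import mean, median, pstdev


def repeated_kfold(n, k=10, repeats=5, seed=1):
    """Yield (train_idx, test_idx) pairs for `repeats` x `k` folds."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        for fold in range(k):
            test = order[fold::k]  # every k-th shuffled index
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test


def fit_preprocessing(values):
    """Learn the imputation median and Z-score parameters from training data."""
    med = median(v for v in values if v is not None)
    filled = [med if v is None else v for v in values]
    return med, mean(filled), pstdev(filled)


def preprocess(values, params):
    """Impute missing values with the training median, then Z-score."""
    med, mu, sd = params
    return [((med if v is None else v) - mu) / sd for v in values]


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x (stand-in for the real learners)."""
    mx, my = mean(x), mean(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b


def cross_validated_mae(x, y, k=10, repeats=5, seed=1):
    """Mean and SD of the per-fold MAE, plus the number of folds."""
    fold_mae = []
    for train, test in repeated_kfold(len(y), k, repeats, seed):
        params = fit_preprocessing([x[i] for i in train])  # training data only
        x_train = preprocess([x[i] for i in train], params)
        x_test = preprocess([x[i] for i in test], params)
        a, b = fit_line(x_train, [y[i] for i in train])
        errors = [abs(y[i] - (a + b * xt)) for i, xt in zip(test, x_test)]
        fold_mae.append(mean(errors))
    return mean(fold_mae), pstdev(fold_mae), len(fold_mae)
```

Estimating the preprocessing parameters inside the loop keeps information from the held-out fold out of model fitting, so the cross-validation error is not optimistically biased.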

Product-to-precursor ratios

Fatty acid desaturase and elongase activities were estimated by calculating product-to-precursor ratios of sums of fatty acids in all lipids measured on the subspecies level (CE, DAG, TAG, LPC, LPE, PC, PC O−, PE, PE O−, PI) as described and discussed in work by Vessby and colleagues and Kjellqvist and colleagues [48, 64]. The following indices were used: The ratios of 16:1;0 to 16:0;0 (C16) and 18:1;0 to 18:0;0 (C18) were used to estimate the SCD1 Δ-9-desaturase (D9D) activity [48, 65, 66]. The ratio of 20:4;0 to 20:3;0 was used to estimate Δ-5-desaturase (D5D) activity [64, 66, 67], the ratio of 18:3;0 to 18:2;0 to estimate Δ-6-desaturase (D6D) activity [48, 66], the ratio of 20:3;0 to 18:3;0 to estimate ELOVL5 activity [64], and the ratio of 18:0;0 to 16:0;0 to estimate ELOVL6 activity [68, 69]. The ratio of 16:0;0 to 18:2;0 was used as the de novo lipogenesis (DNL) index [70].
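The indices listed above amount to a small lookup table of product and precursor fatty acids. The sketch below computes them from per-subject fatty acid sums; the dictionary layout, the function name, and the example amounts are hypothetical, while the index labels follow those used in the correlation figures (SCD-16, SCD-18, D5D, D6D, ELOVL5, ELOVL6, DNL).

```python
# Index name -> (product fatty acid, precursor fatty acid)
RATIOS = {
    "SCD-16": ("16:1;0", "16:0;0"),
    "SCD-18": ("18:1;0", "18:0;0"),
    "D5D": ("20:4;0", "20:3;0"),
    "D6D": ("18:3;0", "18:2;0"),
    "ELOVL5": ("20:3;0", "18:3;0"),
    "ELOVL6": ("18:0;0", "16:0;0"),
    "DNL": ("16:0;0", "18:2;0"),
}


def product_to_precursor(fatty_acid_sums):
    """Compute all indices for one subject.

    `fatty_acid_sums` maps a fatty acid (e.g. '16:0;0') to its summed
    amount across the lipid classes measured at the subspecies level.
    An index is None when either fatty acid is missing or the
    precursor amount is zero.
    """
    indices = {}
    for name, (product, precursor) in RATIOS.items():
        numerator = fatty_acid_sums.get(product)
        denominator = fatty_acid_sums.get(precursor)
        if numerator is None or not denominator:
            indices[name] = None
        else:
            indices[name] = numerator / denominator
    return indices


# Hypothetical summed amounts for one subject (arbitrary units)
example = {
    "16:0;0": 100.0, "16:1;0": 10.0, "18:0;0": 50.0, "18:1;0": 75.0,
    "18:2;0": 200.0, "18:3;0": 4.0, "20:3;0": 8.0, "20:4;0": 40.0,
}
print(product_to_precursor(example))
```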

Further discussion on lipid correlations with BFP.

BFP, body fat percentage. (PDF)

Lipid species coefficients of variation.

(A) Coefficient of variation for all lipid subspecies, clustered and colored by lipid class. The number of measurements per species is indicated as point size (n). (B) Coefficients of variation for all lipid species within each sex individually are shown on the x-axis. On the y-axis, Spearman correlations (ρ) to BFP are displayed as in Fig 4 in the main article. Coefficient of variation data are provided in S3 Table. BFP, body fat percentage. (PDF)

FINRISK dependent variables.

Characteristics of FINRISK variables: Distribution of (A) BMI, (B) WHR, and (C) BFP (fat mass [%]) in the data set. (A–C) The distribution’s mean, standard deviation, and number of subjects (n), excluding missing values, are shown in the upper left corner. Vertical dashed lines indicate the 5th and 95th percentile. Relationship of BMI to (D) WHR and (E) fat mass according to subject sex. Lines are based on local polynomial regression fitting (loess). BFP, body fat percentage; BMI, body mass index; WHR, waist-hip ratio. (PDF)

Sex differences of lipid species.

Volcano plot of sex differences between 571 female and 490 male subjects. P-values of the Mann–Whitney U test are displayed on the y-axis; fold changes of means, calculated as female (f) divided by male (m) lipid levels, are shown on the x-axis. Labels at the top also indicate the direction of the fold change. Points with outlines show significance <0.05 after Benjamini-Hochberg correction for multiple testing. Some of the most significant SM species containing 4E,14Z-sphingadiene are labeled. All data displayed are provided in S4 Table. SM, sphingomyelin. (PDF)

Lasso models.

MAE for BFP Lasso models with different numbers of predictors. The lowest MAE of 3.61 ± 0.33 was achieved by a fraction of 0.24, which used 58 lipid subspecies, whereas a model at 1 standard error distance with a MAE of 3.65 ± 0.33 used a fraction of 0.2 and 45 lipid predictors. The data are shown in S6 Table. BFP, body fat percentage; MAE, mean absolute error. (PDF)

β-coefficients of the predictors of the best Lasso model predicting BFP with the lowest MAE and the model at 1 standard error distance (as in S4 Fig) are displayed. The data are shown in S6 Table. BFP, body fat percentage; MAE, mean absolute error. (PDF)

Lipidome hierarchy.

The FINRISK lipidome can be aggregated into 4 categories, 14 lipid classes, 143 species, and 183 lipid species and subspecies. (PDF)

Effect of different input variables by R2.

Effect of different input variables and lipidome detail on the BFP, BMI, WC, and WHR regressions: R2 cross-validation mean and standard deviation (n = 50) based on the FINRISK training data set (n = 805) are shown on the y-axis. Zero models, which always give back the mean of the distribution, were set to 0. Modeling was done either without age and sex as covariables (−Age −Sex) or with (+Age +Sex). Results are colored according to lipidome detail. Either no lipid information was used (No Lipids), or lipidome information was aggregated into lipid categories (Categories), lipid classes (Classes), and lipid species (Species). Subspecies denotes the highest structural resolution possible on the platform with a mix of species and subspecies. Variables in addition to the lipidome input used are shown on the x-axis: No additional input [L]; routine clinical laboratory variables [C]: total cholesterol, HDL cholesterol, LDL cholesterol, triglycerides, HDL to LDL ratio, total cholesterol to HDL ratio, triglycerides to HDL ratio; additional variables [A]: blood pressure treatment, lipid treatment, smoker, pregnant, fasting, prevalent diabetes, prevalent CVD, prevalent liver disease, prevalent coronary heart disease, prevalent stroke, systolic blood pressure, diastolic blood pressure; or the combination of clinical and additional variables [C + A]. Special points are the zero model (L, No Lipids, −Age, −Sex), which does not use any predictors but returns the mean of the respective obesity measure variable, and the regression based only on age and sex (L, No Lipids, +Age, +Sex), both of which are used as references for predictability without lipid information. Values are listed in S9 Table. BFP, body fat percentage; BMI, body mass index; HDL, high-density lipoprotein; LDL, low-density lipoprotein; WC, waist circumference; WHR, waist-hip ratio. (PDF)

Dependency of the models on lipid classes.

(A) Lasso models for BFP, WHR, WC, and BMI were built on the full data set plus age and sex (indicated by −), or the lipids of the indicated class were removed (indicated by the class). R2 values are shown and sorted by decreasing BFP results. (B–D) Lasso models for BFP, WHR, and BMI were built on the lipids of one lipid class only plus age and sex. R2 values are shown and sorted by increasing results of the respective variable. As TAG and DAG are highly correlated, both were also removed together. BFP, body fat percentage; BMI, body mass index; DAG, diacylglyceride; TAG, triacylglyceride; WC, waist circumference; WHR, waist-hip ratio. (PDF)

Outlier analysis of BMI regression.

(A) Scatter plot of predicted BMI versus observed BMI. The dashed diagonal line represents a predicted BMI perfectly matching the observed BMI. Individual samples are shown as a scatter plot. Outliers are defined as the upper or lower 15% of the residual distribution. n shows the number of FINRISK samples used for this analysis. Samples that are not classified as outliers are colored as Normal (BMI < 25), Overweight (25 ≤ BMI < 30), or Obese (BMI ≥ 30). Horizontal dotted lines indicate samples used for analysis. (B) Samples used for analysis: Because samples classified as "pred < obs" and "pred > obs" are not evenly distributed along the observed BMI axis, the analysis is restricted to observed BMI 21.4 to 29.73, using the lower value of the obs. BMI of the "pred < obs" and the upper value of the "pred > obs" samples as borders for the restricted area. This selects a total of 699 of 1,002 (69.8%) samples, with 242 Normal, 311 Overweight, 104 "pred > obs," and 42 "pred < obs" samples. (C) Plotted variables are scaled to a common scale ranging from 0 to 1 and shown as a box plot. Outliers are plotted as black points. BMI, body mass index; obs, observed; pred, predicted. (PDF)

Outlier analysis of BFP regression for female samples.

(A) Scatter plot of predicted BFP versus observed BFP. The dashed diagonal line represents a predicted BFP perfectly matching the observed BFP. Individual samples are shown as a scatter plot. Outliers are defined as the upper or lower 15% of the residual distribution. n shows the number of FINRISK samples used for this analysis. Samples that are not classified as outliers are colored as Normal (BFP < 35) or Obese (BFP ≥ 35) [71, 72]. Horizontal dotted lines indicate samples used for analysis. (B) Samples used for analysis: Because samples classified as "pred < obs" and "pred > obs" are not evenly distributed along the observed BFP axis, the analysis is restricted to observed BFP 28 to 39, using the lower value of the obs. BFP of the "pred < obs" and the upper value of the "pred > obs" samples as approximate guides for the restricted area. The "pred > obs" value beyond obs. BFP of 40 was classified as too extreme and ignored. This selects a total of 274 of 534 (51.3%) samples, with 141 Normal, 74 Overweight, 26 "pred > obs," and 33 "pred < obs" samples. (C) Plotted variables are scaled to a common scale ranging from 0 to 1 and shown as a box plot. Outliers are plotted as black points. BFP, body fat percentage; obs, observed; pred, predicted. (PDF)

Outlier analysis of BFP regression for male samples.

(A) Scatter plot of predicted BFP versus observed BFP. The dashed diagonal line represents a predicted BFP perfectly matching the observed BFP. Individual samples are shown as a scatter plot. Outliers are defined as the upper or lower 15% of the residual distribution. n shows the number of FINRISK samples used for this analysis. Samples that are not classified as outliers are colored as Normal (BFP < 25) or Obese (BFP ≥ 25) [71, 72]. Horizontal dotted lines indicate samples used for analysis. (B) Samples used for analysis: Because samples classified as "pred < obs" and "pred > obs" are not evenly distributed along the observed BFP axis, the analysis is restricted to observed BFP 19.5 to 30.8, using the lower value of the obs. BFP of the "pred < obs" and the upper value of the "pred > obs" samples as approximate guides for the restricted area. This selects a total of 294 of 468 (62.8%) samples, with 134 Normal, 105 Overweight, 29 "pred > obs," and 26 "pred < obs" samples. (C) Plotted variables are scaled to a common scale ranging from 0 to 1 and shown as a box plot. Outliers are plotted as black points. BFP, body fat percentage; obs, observed; pred, predicted. (PDF)

Correlations with BFP.

Spearman correlation coefficients (ρ) and their 95% CI for each sex (male and female) and adjusted for subject age are shown for (A) lipid subspecies; HC signifies the HexCer lipid class. Green ticks at the bottom indicate that this subspecies was used in the model with the lowest MAE, whereas yellow ticks indicate the model within 1 standard error of the lowest MAE (S4 Fig). A total of 202 lipid subspecies were tested in 1,005 subjects; of those, 53.5% (108) were significantly correlated with BFP in females and 65.3% (132) in males. (B) Class sums; (C) fatty acid sums within the lipid classes CE, DAG, TAG, LPC, LPE, PC, PC O-, PE, PE O-, PI; (D) sums of TAG total lengths; (E) sums of TAG total double bonds; (F) sums of sphingolipid total lengths; (G) product-to-precursor fatty acid ratios for SCD1/Δ-9-desaturase (SCD-16, SCD-18), Δ-6-desaturase (D6D), Δ-5-desaturase (D5D), fatty acid elongases ELOVL5 and ELOVL6, and DNL index. Correlations with Benjamini-Hochberg corrected p < 0.05 are shown with filled points, whereas correlations with p > 0.05 are shown as transparent open circles. Differences between male and female correlations were compared, and significant differences are indicated by an asterisk (*). Results are provided in S11 Table. BFP, body fat percentage; CE, cholesteryl ester; DAG, diacylglyceride; LPC, lysophosphatidylcholine; LPE, lysophosphatidylethanolamine; MAE, mean absolute error; PC, phosphatidylcholine; PC O-, 1-O-alkyl- or 1-O-alkenyl-phosphatidylcholine; PE, phosphatidylethanolamine; PE O-, 1-O-alkyl- or 1-O-alkenyl-phosphatidylethanolamine; PI, phosphatidylinositol; TAG, triacylglyceride. (PDF)

Correlations with BMI.

Spearman correlation coefficients (ρ) and their 95% CI for each sex (male and female) and adjusted for subject age are shown for (A) lipid subspecies; HC signifies the HexCer lipid class. A total number of 202 lipid subspecies were tested in 1,061 subjects, of those 59.9% (121) were significantly correlated with BMI in females and 60.4% (122) in males. (B) Class sums; (C) fatty acid sums within the lipid classes CE, DAG, TAG, LPC, LPE, PC, PC O-, PE, PE O-, PI; (D) sums of TAG total lengths; (E) sums of TAG total double bonds; (F) sums of sphingolipid total lengths; (G) Product-to-precursor fatty acid ratios for SCD1/Δ-9-desaturase (SCD-16, SCD-18), Δ-6-desaturase (D6D), Δ-5-desaturase (D5D), fatty acid elongases ELOVL5 and ELOVL6, and DNL index. Correlations with Benjamini-Hochberg corrected p < 0.05 are shown with filled points, whereas correlations with p > 0.05 are shown with open circles and transparently. Differences between male and female correlations were compared and significant differences are indicated by an asterisk (*). Results are provided in S11 Table. BMI, body mass index; CE, cholesteryl ester; DAG, diacylglyceride; LPC, lysophosphatidylcholine; LPE, lysophosphatidylethanolamine; MAE, mean absolute error; PC, phosphatidylcholine; PC O-, 1-O-alkyl- or 1-O-alkenyl-phosphatidylcholine; PE, phosphatidylethanolamine; PE O-, 1-O-alkyl- or 1-O-alkenyl-phosphatidylethanolamine; PI, phosphatidylinositol; TAG, triacylglyceride. (PDF)

Correlations with WHR.

Spearman correlation coefficients (ρ) and their 95% CI for each sex (male and female) and adjusted for subject age are shown for (A) lipid subspecies; HC signifies the HexCer lipid class. A total number of 202 lipid subspecies were tested in 1,051 subjects, of those 49.5% (100) were significantly correlated with WHR in females and 58.4% (118) in males. (B) Class sums; (C) fatty acid sums within the lipid classes CE, DAG, TAG, LPC, LPE, PC, PC O-, PE, PE O-, PI; (D) sums of TAG total lengths; (E) sums of TAG total double bonds; (F) sums of sphingolipid total lengths; (G) Product-to-precursor fatty acid ratios for SCD1/Δ-9-desaturase (SCD-16, SCD-18), Δ-6-desaturase (D6D), Δ-5-desaturase (D5D), fatty acid elongases ELOVL5 and ELOVL6, and DNL index. Correlations with Benjamini-Hochberg corrected p < 0.05 are shown with filled points, whereas correlations with p > 0.05 are shown with open circles and transparently. Differences between male and female correlations were compared and significant differences are indicated by an asterisk (*). Results are provided in S11 Table. CE, cholesteryl ester; DAG, diacylglyceride; LPC, lysophosphatidylcholine; LPE, lysophosphatidylethanolamine; MAE, mean absolute error; PC, phosphatidylcholine; PC O-, 1-O-alkyl- or 1-O-alkenyl-phosphatidylcholine; PE, phosphatidylethanolamine; PE O-, 1-O-alkyl- or 1-O-alkenyl-phosphatidylethanolamine; PI, phosphatidylinositol; TAG, triacylglyceride; WHR, waist-hip ratio. (PDF)

Correlations with WC.

Spearman correlation coefficients (ρ) and their 95% CI for each sex (male and female) and adjusted for subject age are shown for (A) lipid subspecies; HC signifies the HexCer lipid class. A total number of 202 lipid subspecies were tested in 1,052 subjects, of those 59.4% (120) were significantly correlated with WC in females and 57.9% (117) in males. (B) Class sums; (C) fatty acid sums within the lipid classes CE, DAG, TAG, LPC, LPE, PC, PC O-, PE, PE O-, PI; (D) sums of TAG total lengths; (E) sums of TAG total double bonds; (F) sums of sphingolipid total lengths; (G) Product-to-precursor fatty acid ratios for SCD1/Δ-9-desaturase (SCD-16, SCD-18), Δ-6-desaturase (D6D), Δ-5-desaturase (D5D), fatty acid elongases ELOVL5 and ELOVL6, and DNL index. Correlations with Benjamini-Hochberg corrected p < 0.05 are shown with filled points, whereas correlations with p > 0.05 are shown with open circles and transparently. Differences between male and female correlations were compared and significant differences are indicated by an asterisk (*). Results are provided in S11 Table. CE, cholesteryl ester; DAG, diacylglyceride; LPC, lysophosphatidylcholine; LPE, lysophosphatidylethanolamine; MAE, mean absolute error; PC, phosphatidylcholine; PC O-, 1-O-alkyl- or 1-O-alkenyl-phosphatidylcholine; PE, phosphatidylethanolamine; PE O-, 1-O-alkyl- or 1-O-alkenyl-phosphatidylethanolamine; PI, phosphatidylinositol; TAG, triacylglyceride; WC, waist circumference. (PDF)

Lipid species correlations with HDL, LDL, and triglycerides.

Spearman correlation coefficients (ρ) and their 95% CI for each sex (male and female) and adjusted for subject age are shown for lipid species and (A) HDL, (B) LDL, and (C) triglycerides, as well as lipid class sums, and (D) HDL, (E) LDL, and (F) triglycerides; correlations with Benjamini-Hochberg corrected p < 0.05 are shown with filled points, whereas correlations with p > 0.05 are shown with open circles and transparently. Differences between male and female correlations were compared, and significant differences are indicated by an asterisk (*). HDL, high-density lipoprotein; LDL, low-density lipoprotein. (PDF)

Lipid identifiers.

Identifiers mapping the lipids used in this study to the SwissLipids database [53] (http://www.swisslipids.org) are provided. (XLSX)

Clinical baseline characteristics of the study populations.

Normally distributed variables are shown as mean with SD, and p-values were calculated using an ANOVA. The variables age and total triglycerides are shown as medians with IQR, and p-values were calculated using a Mann–Whitney U test. IQR, interquartile range. (XLSX)

Lipid coefficient of variation.

Coefficient of variation for all lipid subspecies as shown in S1 Fig. (XLSX)

Results of a Mann–Whitney U test of sex differences between 571 female and 490 male subjects as shown in S3 Fig. (XLSX)

Linear models for BFP and WHR with covariables.

The first sheet reports the "Number of significant lipids" for each model after correction for multiple testing. The other sheets report linear BFP or WHR models for each lipid subspecies (lipid) and covariables as indicated in the first line. Variables modeled with natural splines are shown within the ns() function with the degrees of freedom indicated (e.g., df = 3). Subspecies with Benjamini-Hochberg corrected p > 0.05 are listed. For each model, the following information for the subspecies is provided: beta: β-coefficient; CI low/CI high: lower and upper bound of the 95% CI; p-value; and BH: Benjamini-Hochberg corrected p-value. BFP, body fat percentage; WHR, waist-hip ratio. (XLSX)

Summary statistics and predictor β-coefficients are shown for the optimal model (lowest MAE) and the model within one standard error (lowest MAE + 1SE). The data are plotted in S4 Fig. MAE, mean absolute error. (XLSX)
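
The distinction between the optimal model (lowest MAE) and the model within one standard error (lowest MAE + 1SE) can be illustrated with a small sketch. It assumes the common convention that, among all penalty values whose cross-validated MAE is no worse than the minimum plus one standard error, the strongest penalty (and hence the sparsest model) is chosen; the source does not state this detail. The function name `one_se_choice` and all numbers are hypothetical.

```python
def one_se_choice(lambdas, mean_mae, se_mae):
    """Return (penalty with lowest MAE, penalty chosen by the 1SE rule)."""
    # Index of the candidate with the lowest cross-validated MAE.
    best = min(range(len(lambdas)), key=lambda i: mean_mae[i])
    threshold = mean_mae[best] + se_mae[best]
    # Candidates whose MAE is within one standard error of the minimum.
    eligible = [i for i in range(len(lambdas)) if mean_mae[i] <= threshold]
    # Assumption: prefer the strongest penalty among the eligible candidates.
    one_se = max(eligible, key=lambda i: lambdas[i])
    return lambdas[best], lambdas[one_se]


# Hypothetical cross-validation summary for five penalty values.
lambdas = [0.01, 0.03, 0.1, 0.3, 1.0]
mean_mae = [3.30, 3.21, 3.25, 3.40, 3.90]
se_mae = [0.06, 0.05, 0.05, 0.06, 0.08]

best_lambda, one_se_lambda = one_se_choice(lambdas, mean_mae, se_mae)
print(best_lambda, one_se_lambda)
```

With these made-up values the lowest MAE occurs at a penalty of 0.03, while the 1SE rule selects the stronger penalty 0.1, trading a small loss in accuracy for a model with fewer predictors.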

Reproducibility of all models on all obesity estimates using MAE, NMAE, and R2 as metrics.

Models were trained on the FINRISK training data set in a 5× repeated 10-fold cross-validation loop, which yields a cross-validation error. Fitting the models on all the training data, using the best-performing parameter set, yields a training error, whereas testing the models on the hold-out test data gives the testing error, and applying the model to the independent MDC-CC data set gives the validation error. Only subjects for whom all obesity measures were available were used (n = 796). MAE, mean absolute error; MDC-CC, Malmö Diet and Cancer Cardiovascular Cohort; NMAE, normalized MAE. (XLSX)
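
The structure of a 5× repeated 10-fold cross-validation loop can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: the data are synthetic, the "model" is a placeholder that predicts the training mean, and the helper names `repeated_kfold_indices` and `mean_absolute_error` are hypothetical. The point is only that 5 repeats of 10 folds give 50 error estimates, which are then summarized as a mean and standard deviation.

```python
import random
import statistics


def repeated_kfold_indices(n_samples, n_splits=10, n_repeats=5, seed=0):
    """Yield (train, test) index lists for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        for k in range(n_splits):
            test = indices[k::n_splits]  # every n_splits-th shuffled index
            test_set = set(test)
            train = [i for i in indices if i not in test_set]
            yield train, test


def mean_absolute_error(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


# Synthetic target values standing in for an obesity measure.
data_rng = random.Random(1)
y = [data_rng.gauss(30.0, 8.0) for _ in range(200)]

fold_mae = []
for train, test in repeated_kfold_indices(len(y)):
    # Placeholder model: predict the mean of the training fold.
    prediction = statistics.mean(y[i] for i in train)
    fold_mae.append(
        mean_absolute_error([y[i] for i in test], [prediction] * len(test))
    )

cv_mean = statistics.mean(fold_mae)
cv_sd = statistics.stdev(fold_mae)
print(len(fold_mae), round(cv_mean, 2), round(cv_sd, 2))
```

In a real analysis the placeholder would be replaced by the model being tuned, and the training, testing, and validation errors described above would be computed separately on the full training set, the hold-out set, and the independent cohort.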

Effect of different input variables: MAE values.

Effect of different input variables and lipidome detail on the BFP, BMI, WC, and WHR regressions. Modeling was either done without age and sex as covariables (−Age, −Sex) or with (+Age, +Sex). Results are shown according to lipidome detail (Level). Either no lipid information was used (No Lipids), or lipidome information was aggregated into lipid categories (Categories), lipid classes (Classes), or lipid species (Species). Subspecies denotes the highest structural resolution possible on the platform with a mix of species and subspecies. Variables in addition to the lipidome input used are: No additional input [L], Routine clinical laboratory variables [C], Additional variables [A], or the combination of clinical and additional variables [C+A]. See caption of S7 Fig for more details. MAE values are the cross-validation mean and standard deviation (n = 50) based on the FINRISK training data set; only subjects for whom all obesity measures were available were used (n = 796). BFP, body fat percentage; BMI, body mass index; MAE, mean absolute error; WC, waist circumference; WHR, waist-hip ratio. (XLSX)

Effect of different input variables: R2 values.

Effect of different input variables and lipidome detail on the BFP, BMI, WC, and WHR regressions. Modeling was either done without age and sex as covariables (−Age, −Sex) or with (+Age, +Sex). Results are shown according to lipidome detail (Level). Either no lipid information was used (No Lipids), or lipidome information was aggregated into lipid categories (Categories), lipid classes (Classes), or lipid species (Species). Subspecies denotes the highest structural resolution possible on the platform with a mix of species and subspecies. Variables in addition to the lipidome input used are: No additional input [L], Routine clinical laboratory variables [C], Additional variables [A], or the combination of clinical and additional variables [C+A]. See caption of S7 Fig for more details. R2 values are the cross-validation mean and standard deviation (n = 50) based on the FINRISK training data set; only subjects for whom all obesity measures were available were used (n = 796). BFP, body fat percentage; BMI, body mass index; WC, waist circumference; WHR, waist-hip ratio. (XLSX)

Effect of different input variables: Coefficients of partial determination (pR2).

Effect of different input variables and lipidome detail on the BFP, BMI, WC, and WHR regressions. Coefficients of partial determination (pR2) indicate the proportion of the variation left unexplained by a "No Lipids" model (S9 Table) that is explained once lipid information is added. Modeling was either done without age and sex as covariables (−Age, −Sex) or with (+Age, +Sex). Results are shown according to lipidome detail (Level). Either no lipid information was used (No Lipids), or lipidome information was aggregated into lipid categories (Categories), lipid classes (Classes), or lipid species (Species). Subspecies denotes the highest structural resolution possible on the platform with a mix of species and subspecies. Variables in addition to the lipidome input used are: No additional input [L], Routine clinical laboratory variables [C], Additional variables [A], or the combination of clinical and additional variables [C+A]. See caption of S7 Fig for more details. pR2 values are the cross-validation mean and standard deviation (n = 50) based on the FINRISK training data set; only subjects for whom all obesity measures were available were used (n = 796). BFP, body fat percentage; BMI, body mass index; WC, waist circumference; WHR, waist-hip ratio. (XLSX)
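
As a worked illustration of the coefficient of partial determination, the sketch below uses the standard textbook definition, which is assumed (not confirmed by the source) to match the calculation behind this table: the gain in R2 from adding lipid information, divided by the variation the "No Lipids" model left unexplained. The function name `partial_r2` and the example values are hypothetical.

```python
def partial_r2(r2_reduced, r2_full):
    """Coefficient of partial determination from two R2 values.

    r2_reduced: R2 of the model without the predictors of interest
                (here, a "No Lipids" model).
    r2_full:    R2 of the model that also includes those predictors.
    """
    if r2_reduced >= 1.0:
        raise ValueError("the reduced model already explains all variation")
    # Share of the previously unexplained variation that is now explained.
    return (r2_full - r2_reduced) / (1.0 - r2_reduced)


# Hypothetical values: the reduced model explains 40% of the variation,
# the full model explains 70%.
example = partial_r2(0.40, 0.70)
print(round(example, 3))
```

With these made-up inputs, half of the variation that the reduced model could not explain is accounted for by the added predictors, so the result is about 0.5.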

Correlation analyses of obesity measures, HDL, LDL, and Triglycerides.

Results of the correlation analyses as shown for BFP (S12 Fig), BMI (S13 Fig), WHR (S14 Fig), WC (S15 Fig), and HDL, LDL, and Triglycerides (S16 Fig). Values are shown for male and female separately, which are then directly compared. Estimate: Spearman correlation coefficients (ρ); n: number of features available; CI: 95% confidence interval; BH: Benjamini-Hochberg corrected p-value. BFP, body fat percentage; BMI, body mass index; HDL, high-density lipoprotein; LDL, low-density lipoprotein; WC, waist circumference; WHR, waist-hip ratio. (XLSX)

3 Jun 2019

Dear Dr Gerl, Thank you for submitting your manuscript entitled "Machine learning of human plasma lipidomes for obesity estimation in a large population cohort" for consideration as a Research Article by PLOS Biology. Your manuscript has now been evaluated by the PLOS Biology editorial staff as well as by an academic editor with relevant expertise and I am writing to let you know that we would like to send your submission out for external peer review. However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire. Please re-submit your manuscript within two working days, i.e., by Jun 05 2019 11:59PM. Login to Editorial Manager here: https://www.editorialmanager.com/pbiology During resubmission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF when you re-submit.
Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed all checks it will be sent out for review. Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission. Kind regards, Lauren A Richardson, Ph.D Senior Editor PLOS Biology 25 Jun 2019 Dear Dr Gerl, Thank you very much for submitting your manuscript "Machine learning of human plasma lipidomes for obesity estimation in a large population cohort" for consideration as a Research Article at PLOS Biology. Your manuscript has been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by several independent reviewers. As you will read, the reviewers find many aspects of your work well done. However, they also raise some key questions that will need to be rigorously addressed in a revision. Of particular note, two of the reviewers question how useful this model is and how it will benefit other clinicians and researchers. We will need to be convinced by your justification to pursue this manuscript further for publication. In light of the reviews (below), we will not be able to accept the current version of the manuscript, but we would welcome resubmission of a much-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent for further evaluation by the reviewers. Your revisions should address the specific points made by each reviewer. Please submit a file detailing your responses to the editorial requests and a point-by-point response to all of the reviewers' comments that indicates the changes you have made to the manuscript. In addition to a clean copy of the manuscript, please upload a 'track-changes' version of your manuscript that specifies the edits made. 
This should be uploaded as a "Related" file type. You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Before you revise your manuscript, please review the following PLOS policy and formatting requirements checklist PDF: http://journals.plos.org/plosbiology/s/file?id=9411/plos-biology-formatting-checklist.pdf. It is helpful if you format your revision according to our requirements - should your paper subsequently be accepted, this will save time at the acceptance stage. Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Upon resubmission, the editors will assess your revision and if the editors and Academic Editor feel that the revised manuscript remains appropriate for the journal, we will send the manuscript for re-review. We aim to consult the same Academic Editor and reviewers for revised manuscripts but may consult others if needed. We expect to receive your revised manuscript within two months. 
Please email us (plosbiology@plos.org) to discuss this if you have any questions or concerns, or would like to request an extension. At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not wish to submit a revision and instead wish to pursue publication elsewhere, so that we may end consideration of the manuscript at PLOS Biology. When you are ready to submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record. Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Lauren A Richardson, Ph.D Senior Editor PLOS Biology

*****************************************************

Reviews

Reviewer #1: Jens Nielsen, signed review This is a very interesting paper describing the use of shotgun lipidomics for profiling of obese subjects. Lipidomics data are used together with traditional measurements such as BMI, WC and WHR to identify a subset of lipid measurements that can be used to predict phenotypes. The authors use non-linear regression analysis (machine learning) for this analysis, and the derived model is shown to have good predictive strength. I think there are two particularly interesting findings from the study: 1) that a small set of biomarkers is not sufficient to predict and capture the complex phenotypes in obese subjects, and 2) through a validation cohort it is found that the exact timing of fasting is not influencing the set of biomarkers. I only have one major comment and one minor comment. Major comment: There is no discussion of the possible molecular mechanisms associated with the biomarkers.
The authors could look into whether these are mainly diet associated or are they associated with certain features, e.g. inflammation. I know it will be hard to have a detailed mechanistic discussion/explanation, but some insight along this line would significantly enrich the paper. Minor comment: It is stated that the 45 lipid species in the reduced model are essentially the same as the 58 lipid species in the better model. I suggest the authors quantify this statement instead of just saying essentially. Why not give the exact overlap in the lipid species in the two models. This could also be used in the discussion I am requesting on mechanisms, as overlapping lipid species are likely associated with some sort of mechanisms. --------------- Reviewer #2: The authors used two sets of lipidomics data ("1061 participants of the FINRISK 2012 population cohort" and "250 randomly chosen participants of the Malmö Diet and Cancer Cardiovascular Cohort") to build and test the regression model for predicting body mass index (BMI), waist circumference (WC), waist-hip ratio (WHR) and body fat percentage (BFP) with consideration of gender and age. In the conclusion, the authors conclude that “we can use machine learning to model and validate obesity estimates better than by using classical clinical parameters and find lipid specific differences between the individual estimates.” My major concern is that as the conventional standard of obesity such as the "body mass index (BMI), waist circumference (WC), waist-hip ratio (WHR) and body fat percentage (BFP)" are easier to obtain and measure, whether the authors can demonstrate additional benefits of this model besides finding small lipid molecules associated with the obesity indicators. Other comments: 1: Page 7 “Five different models predicting BFP were trained and their parameters learned on 796 random training samples in a cross-validation loop (Fig. 
1B, Results for WHR and BMI in Table S2).” There are 6 methods in Figure 1B, not 5. 2: Figure 1C. When comparing the two groups, please perform statistical test(s). 3: Figure 2B. It is very obvious that age and gender have strong weight in the model; you can build a model of age and gender separately to see how important it is. 4: Figure S3. The meaning of the X axis is not clear, which side represents Male/Female? 5: Figure S8B, S9B, S10B. How to choose the restricted area? What is the percentage of samples selected as a whole sample? And the restricted area is close to the middle, but a large number of outliers are not in this range. 6: Table S2. Please add the test set and validation set results.

---------------

Reviewer #3: Gerl et al employed a novel mass spectrometric shotgun approach and measured the levels of 183 plasma lipid species in participants of FINRISK 2012 population cohort comprising 1061 plasma samples. Based on this data, authors performed advanced machine learning using different predictive models and identified the association of lipid profile and information about body fat amount and distribution. The conclusion was further validated using an independent dataset of the Malmö Diet and Cancer Cardiovascular Cohort comprising randomly selected 250 plasma lipidomes. It is an interesting study but there are some concerns. 1. Authors need to better explain the merit of this study. Do lipid predictors predict the development of obesity at the moment or in the future? The study would be of great value if lipid predictors were measured before individuals developed obesity, but it would be of little value if lipid predictors simply separate obese and non-obese individuals at the moment, which wouldn’t need complex lipid measurement and modelling. 2. Line 109, “Each batch was accompanied by a set of 4 blank samples (150 mM ammonium bicarbonate (in water)”.
It is not clear why 150 mM ammonium bicarbonate is chosen as blank samples instead of reconstitution solvent “7.5 mM ammonium acetate in chloroform/methanol/propanol”. 3. In method section, line 118 the authors described “Internal standards were pre-mixed with the organic solvent mixture.” More details are needed to assess the robustness of this method. For instance, are internal standards pre-mixed with the organic solvent mixture right before each batch? Organic solvent used for lipid extraction is very volatile so it is challenging to have internal standards in such solvent for long term with consistent concentrations. In addition, what is the volume of organic solvent mixture used in this study? It is generally challenging to consistently transfer small amount of organic solvent due to the less retention on pipette tips. 4. In MS data acquisition section, “both MS and MSMS data were combined to monitor CE, DAG and TAG ions as ammonium adducts”. The authors bypassed LC chromatography. It is unclear how authors dealt with in-source fragmentation issues. For instance, TAG would contribute to DAG signals through in-source fragmentation. PC would contribute to lyso PC signals due to in-source fragmentation. Without LC separation, it is hard to tell how authors distinguish these species. 5. Was relative intensity or absolute concentration of lipids used for modelling?

25 Jul 2019

Submitted filename: Response_to_Reviewers.pdf

30 Jul 2019

Dear Dr Gerl, Thank you for submitting your revised Research Article entitled "Machine learning of human plasma lipidomes for obesity estimation in a large population cohort" for publication in PLOS Biology. The Academic Editor and I have now assessed your revised manuscript and we're delighted to let you know that we're now editorially satisfied with your manuscript. We will publish your study, assuming you are willing to modify it to meet our production requirements. Congratulations!
Before we can formally accept your paper and consider it "in press", we need to ensure that your article conforms to our guidelines. A member of our team will be in touch shortly with a set of requests. As we can't proceed until these requirements are met, your swift response will help prevent delays to publication. Upon acceptance of your article, your final files will be copyedited and typeset into the final PDF. While you will have an opportunity to review these files as proofs, PLOS will only permit corrections to spelling or significant scientific errors. Therefore, please take this final revision time to assess and make any remaining major changes to your manuscript. Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please note that an uncorrected proof of your manuscript will be published online ahead of the final version, unless you opted out when submitting your manuscript. If, for any reason, you do not want an earlier version of your manuscript published online, uncheck the box. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us as soon as possible if you or your institution is planning to press release the article. To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include a cover letter, a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable), and a track-changes file indicating any changes that you have made to the manuscript. Please do not hesitate to contact me should you have any questions. 
Sincerely, Lauren A Richardson, Ph.D Senior Editor PLOS Biology ------------------------------------------------------------------------ ETHICS STATEMENT: The Ethics Statements in the submission form and Methods section of your manuscript should match verbatim. Please ensure that any changes are made to both versions. -- Please provide the information for the approval of both the FINRISK and MDC-CC studies, including information about the form of consent (written/oral) given for research involving human participants. All research involving human participants must have been approved by the authors' Institutional Review Board (IRB) or an equivalent committee, and all clinical investigation must have been conducted according to the principles expressed in the Declaration of Helsinki. ------------------------------------------------------------------------ DATA POLICY: Please ensure that your Data Statement in the submission system accurately describes where your data can be found. For the MDC-CC application information, please provide an institutional email address (rather than a single person) so that the data can still be accessed even if that person departs the organization. 4 Sep 2019 Dear Dr Gerl, On behalf of my colleagues and the Academic Editor, Jason W. Locasale, I am pleased to inform you that we will be delighted to publish your Research Article in PLOS Biology. The files will now enter our production system. You will receive a copyedited version of the manuscript, along with your figures for a final review. You will be given two business days to review and approve the copyedit. Then, within a week, you will receive a PDF proof of your typeset article. You will have two days to review the PDF and make any final corrections. If there is a chance that you'll be unavailable during the copy editing/proof review period, please provide us with contact details of one of the other authors whom you nominate to handle these stages on your behalf. 
This will ensure that any requested corrections reach the production department in time for publication. Early Version The version of your manuscript submitted at the copyedit stage will be posted online ahead of the final proof version, unless you have already opted out of the process. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers. PRESS We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have not yet opted out of the early version process, we ask that you notify us immediately of any press plans so that we may do so on your behalf. We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/. Thank you again for submitting your manuscript to PLOS Biology and for your support of Open Access publishing. Please do not hesitate to contact me if I can provide any assistance during the production process. Kind regards, Sofia Vickers Senior Publications Assistant PLOS Biology On behalf of, Lauren Richardson, Senior Editor PLOS Biology

58 in total

Review 1. Could Ceramides Become the New Cholesterol?

Authors: Scott A Summers
Journal: Cell Metab Date: 2018-01-04 Impact factor: 27.287

2. Excess deaths associated with underweight, overweight, and obesity.

Authors: Katherine M Flegal; Barry I Graubard; David F Williamson; Mitchell H Gail
Journal: JAMA Date: 2005-04-20 Impact factor: 56.272

3. Associations Among Fatty Acids, Desaturase and Elongase, and Insulin Resistance in Children.

Authors: Lori M Beccarelli; Rachel Erin Scherr; John W Newman; Alison G Borkowska; Ira J Gray; Jessica D Linnell; Carl L Keen; Heather M Young
Journal: J Am Coll Nutr Date: 2017-10-18 Impact factor: 3.169

4. Fatty acid composition and estimated desaturase activities are associated with obesity and lifestyle variables in men and women.

Authors: Eva Warensjö; Margareta Ohrvall; Bengt Vessby
Journal: Nutr Metab Cardiovasc Dis Date: 2005-10-17 Impact factor: 4.222

Review 5. Cardiovascular and Metabolic Heterogeneity of Obesity: Clinical Challenges and Implications for Management.

Authors: Ian J Neeland; Paul Poirier; Jean-Pierre Després
Journal: Circulation Date: 2018-03-27 Impact factor: 29.690

Review 6. Interplay between lipids and branched-chain amino acids in development of insulin resistance.

Authors: Christopher B Newgard
Journal: Cell Metab Date: 2012-05-02 Impact factor: 27.287

7. Harmonizing lipidomics: NIST interlaboratory comparison exercise for lipidomics using SRM 1950-Metabolites in Frozen Human Plasma.

Authors: John A Bowden; Alan Heckert; Candice Z Ulmer; Christina M Jones; Jeremy P Koelmel; Laila Abdullah; Linda Ahonen; Yazen Alnouti; Aaron M Armando; John M Asara; Takeshi Bamba; John R Barr; Jonas Bergquist; Christoph H Borchers; Joost Brandsma; Susanne B Breitkopf; Tomas Cajka; Amaury Cazenave-Gassiot; Antonio Checa; Michelle A Cinel; Romain A Colas; Serge Cremers; Edward A Dennis; James E Evans; Alexander Fauland; Oliver Fiehn; Michael S Gardner; Timothy J Garrett; Katherine H Gotlinger; Jun Han; Yingying Huang; Aveline Huipeng Neo; Tuulia Hyötyläinen; Yoshihiro Izumi; Hongfeng Jiang; Houli Jiang; Jiang Jiang; Maureen Kachman; Reiko Kiyonami; Kristaps Klavins; Christian Klose; Harald C Köfeler; Johan Kolmert; Therese Koal; Grielof Koster; Zsuzsanna Kuklenyik; Irwin J Kurland; Michael Leadley; Karen Lin; Krishna Rao Maddipati; Danielle McDougall; Peter J Meikle; Natalie A Mellett; Cian Monnin; M Arthur Moseley; Renu Nandakumar; Matej Oresic; Rainey Patterson; David Peake; Jason S Pierce; Martin Post; Anthony D Postle; Rebecca Pugh; Yunping Qiu; Oswald Quehenberger; Parsram Ramrup; Jon Rees; Barbara Rembiesa; Denis Reynaud; Mary R Roth; Susanne Sales; Kai Schuhmann; Michal Laniado Schwartzman; Charles N Serhan; Andrej Shevchenko; Stephen E Somerville; Lisa St John-Williams; Michal A Surma; Hiroaki Takeda; Rhishikesh Thakare; J Will Thompson; Federico Torta; Alexander Triebl; Martin Trötzmüller; S J Kumari Ubhayasekera; Dajana Vuckovic; Jacquelyn M Weir; Ruth Welti; Markus R Wenk; Craig E Wheelock; Libin Yao; Min Yuan; Xueqing Heather Zhao; Senlin Zhou
Journal: J Lipid Res Date: 2017-10-06 Impact factor: 5.922

8. Deletion of ELOVL6 blocks the synthesis of oleic acid but does not prevent the development of fatty liver or insulin resistance.

Authors: Young-Ah Moon; Courtney R Ochoa; Matthew A Mitsche; Robert E Hammer; Jay D Horton
Journal: J Lipid Res Date: 2014-10-03 Impact factor: 5.922

9. New loci for body fat percentage reveal link between adiposity and cardiometabolic disease risk.

Authors: Yingchang Lu; Felix R Day; Stefan Gustafsson; Martin L Buchkovich; Jianbo Na; Veronique Bataille; Diana L Cousminer; Zari Dastani; Alexander W Drong; Tõnu Esko; David M Evans; Mario Falchi; Mary F Feitosa; Teresa Ferreira; Åsa K Hedman; Robin Haring; Pirro G Hysi; Mark M Iles; Anne E Justice; Stavroula Kanoni; Vasiliki Lagou; Rui Li; Xin Li; Adam Locke; Chen Lu; Reedik Mägi; John R B Perry; Tune H Pers; Qibin Qi; Marianna Sanna; Ellen M Schmidt; William R Scott; Dmitry Shungin; Alexander Teumer; Anna A E Vinkhuyzen; Ryan W Walker; Harm-Jan Westra; Mingfeng Zhang; Weihua Zhang; Jing Hua Zhao; Zhihong Zhu; Uzma Afzal; Tarunveer Singh Ahluwalia; Stephan J L Bakker; Claire Bellis; Amélie Bonnefond; Katja Borodulin; Aron S Buchman; Tommy Cederholm; Audrey C Choh; Hyung Jin Choi; Joanne E Curran; Lisette C P G M de Groot; Philip L De Jager; Rosalie A M Dhonukshe-Rutten; Anke W Enneman; Elodie Eury; Daniel S Evans; Tom Forsen; Nele Friedrich; Frédéric Fumeron; Melissa E Garcia; Simone Gärtner; Bok-Ghee Han; Aki S Havulinna; Caroline Hayward; Dena Hernandez; Hans Hillege; Till Ittermann; Jack W Kent; Ivana Kolcic; Tiina Laatikainen; Jari Lahti; Irene Mateo Leach; Christine G Lee; Jong-Young Lee; Tian Liu; Youfang Liu; Stéphane Lobbens; Marie Loh; Leo-Pekka Lyytikäinen; Carolina Medina-Gomez; Karl Michaëlsson; Mike A Nalls; Carrie M Nielson; Laticia Oozageer; Laura Pascoe; Lavinia Paternoster; Ozren Polašek; Samuli Ripatti; Mark A Sarzynski; Chan Soo Shin; Nina Smolej Narančić; Dominik Spira; Priya Srikanth; Elisabeth Steinhagen-Thiessen; Yun Ju Sung; Karin M A Swart; Leena Taittonen; Toshiko Tanaka; Emmi Tikkanen; Nathalie van der Velde; Natasja M van Schoor; Niek Verweij; Alan F Wright; Lei Yu; Joseph M Zmuda; Niina Eklund; Terrence Forrester; Niels Grarup; Anne U Jackson; Kati Kristiansson; Teemu Kuulasmaa; Johanna Kuusisto; Peter Lichtner; Jian'an Luan; Anubha Mahajan; Satu Männistö; Cameron D Palmer; Janina S Ried; Robert A Scott; Alena Stancáková; Peter J Wagner; Ayse Demirkan; Angela Döring; Vilmundur Gudnason; Douglas P Kiel; Brigitte Kühnel; Massimo Mangino; Barbara Mcknight; Cristina Menni; Jeffrey R O'Connell; Ben A Oostra; Alan R Shuldiner; Kijoung Song; Liesbeth Vandenput; Cornelia M van Duijn; Peter Vollenweider; Charles C White; Michael Boehnke; Yvonne Boettcher; Richard S Cooper; Nita G Forouhi; Christian Gieger; Harald Grallert; Aroon Hingorani; Torben Jørgensen; Pekka Jousilahti; Mika Kivimaki; Meena Kumari; Markku Laakso; Claudia Langenberg; Allan Linneberg; Amy Luke; Colin A Mckenzie; Aarno Palotie; Oluf Pedersen; Annette Peters; Konstantin Strauch; Bamidele O Tayo; Nicholas J Wareham; David A Bennett; Lars Bertram; John Blangero; Matthias Blüher; Claude Bouchard; Harry Campbell; Nam H Cho; Steven R Cummings; Stefan A Czerwinski; Ilja Demuth; Rahel Eckardt; Johan G Eriksson; Luigi Ferrucci; Oscar H Franco; Philippe Froguel; Ron T Gansevoort; Torben Hansen; Tamara B Harris; Nicholas Hastie; Markku Heliövaara; Albert Hofman; Joanne M Jordan; Antti Jula; Mika Kähönen; Eero Kajantie; Paul B Knekt; Seppo Koskinen; Peter Kovacs; Terho Lehtimäki; Lars Lind; Yongmei Liu; Eric S Orwoll; Clive Osmond; Markus Perola; Louis Pérusse; Olli T Raitakari; Tuomo Rankinen; D C Rao; Treva K Rice; Fernando Rivadeneira; Igor Rudan; Veikko Salomaa; Thorkild I A Sørensen; Michael Stumvoll; Anke Tönjes; Bradford Towne; Gregory J Tranah; Angelo Tremblay; André G Uitterlinden; Pim van der Harst; Erkki Vartiainen; Jorma S Viikari; Veronique Vitart; Marie-Claude Vohl; Henry Völzke; Mark Walker; Henri Wallaschofski; Sarah Wild; James F Wilson; Loïc Yengo; D Timothy Bishop; Ingrid B Borecki; John C Chambers; L Adrienne Cupples; Abbas Dehghan; Panos Deloukas; Ghazaleh Fatemifar; Caroline Fox; Terrence S Furey; Lude Franke; Jiali Han; David J Hunter; Juha Karjalainen; Fredrik Karpe; Robert C Kaplan; Jaspal S Kooner; Mark I McCarthy; Joanne M Murabito; Andrew P Morris; Julia A N Bishop; Kari E North; Claes Ohlsson; Ken K Ong; Inga Prokopenko; J Brent Richards; Eric E Schadt; Tim D Spector; Elisabeth Widén; Cristen J Willer; Jian Yang; Erik Ingelsson; Karen L Mohlke; Joel N Hirschhorn; John Andrew Pospisilik; M Carola Zillikens; Cecilia Lindgren; Tuomas Oskari Kilpeläinen; Ruth J F Loos
Journal: Nat Commun Date: 2016-02-01 Impact factor: 14.919

10. Fatty acid biomarkers of dairy fat consumption and incidence of type 2 diabetes: A pooled analysis of prospective cohort studies.

Authors: Fumiaki Imamura; Amanda Fretts; Matti Marklund; Andres V Ardisson Korat; Wei-Sin Yang; Maria Lankinen; Waqas Qureshi; Catherine Helmer; Tzu-An Chen; Kerry Wong; Julie K Bassett; Rachel Murphy; Nathan Tintle; Chaoyu Ian Yu; Ingeborg A Brouwer; Kuo-Liong Chien; Alexis C Frazier-Wood; Liana C Del Gobbo; Luc Djoussé; Johanna M Geleijnse; Graham G Giles; Janette de Goede; Vilmundur Gudnason; William S Harris; Allison Hodge; Frank Hu; Albert Koulman; Markku Laakso; Lars Lind; Hung-Ju Lin; Barbara McKnight; Kalina Rajaobelina; Ulf Risérus; Jennifer G Robinson; Cécilia Samieri; David S Siscovick; Sabita S Soedamah-Muthu; Nona Sotoodehnia; Qi Sun; Michael Y Tsai; Matti Uusitupa; Lynne E Wagenknecht; Nick J Wareham; Jason Hy Wu; Renata Micha; Nita G Forouhi; Rozenn N Lemaitre; Dariush Mozaffarian
Journal: PLoS Med Date: 2018-10-10 Impact factor: 11.069

16 in total

1. Metabolic View on Human Healthspan: A Lipidome-Wide Association Study.

Authors: Justin Carrard; Hector Gallart-Ayala; Denis Infanger; Tony Teav; Jonathan Wagner; Raphael Knaier; Flora Colledge; Lukas Streese; Karsten Königstein; Timo Hinrichs; Henner Hanssen; Julijana Ivanisevic; Arno Schmidt-Trucksäss
Journal: Metabolites Date: 2021-04-30

Review 2. Identifying Key Determinants of Childhood Obesity: A Narrative Review of Machine Learning Studies.

Authors: Madison N LeCroy; Ryung S Kim; June Stevens; David B Hanna; Carmen R Isasi
Journal: Child Obes Date: 2021-03-04 Impact factor: 2.867

Review 3. Toward a Standardized Strategy of Clinical Metabolomics for the Advancement of Precision Medicine.

Authors: Nguyen Phuoc Long; Tran Diem Nghi; Yun Pyo Kang; Nguyen Hoang Anh; Hyung Min Kim; Sang Ki Park; Sung Won Kwon
Journal: Metabolites Date: 2020-01-29

4. MethylDetectR: a software for methylation-based health profiling.

Authors: Robert F Hillary; Riccardo E Marioni
Journal: Wellcome Open Res Date: 2021-04-13

5. A Pilot Study for Metabolic Profiling of Obesity-Associated Microbial Gut Dysbiosis in Male Wistar Rats.

Authors: Julia Hernandez-Baixauli; Pere Puigbò; Helena Torrell; Hector Palacios-Jordan; Vicent J Ribas Ripoll; Antoni Caimari; Josep M Del Bas; Laura Baselga-Escudero; Miquel Mulero
Journal: Biomolecules Date: 2021-02-18

6. Untargeted Metabolomics Analysis of the Serum Metabolic Signature of Childhood Obesity.

Authors: Lukasz Szczerbinski; Gladys Wojciechowska; Adam Olichwier; Mark A Taylor; Urszula Puchta; Paulina Konopka; Adam Paszko; Anna Citko; Joanna Goscik; Oliver Fiehn; Sili Fan; Anna Wasilewska; Katarzyna Taranta-Janusz; Adam Kretowski
Journal: Nutrients Date: 2022-01-04 Impact factor: 5.717

7. LipidSig: a web-based tool for lipidomic data analysis.

Authors: Wen-Jen Lin; Pei-Chun Shen; Hsiu-Cheng Liu; Yi-Chun Cho; Min-Kung Hsu; I-Chen Lin; Fang-Hsin Chen; Juan-Cheng Yang; Wen-Lung Ma; Wei-Chung Cheng
Journal: Nucleic Acids Res Date: 2021-07-02 Impact factor: 16.971

8. Machine learning applied to serum and cerebrospinal fluid metabolomes revealed altered arginine metabolism in neonatal sepsis with meningoencephalitis.

Authors: Peng Zhang; Zhangxing Wang; Huixian Qiu; Wenhao Zhou; Mingbang Wang; Guoqiang Cheng
Journal: Comput Struct Biotechnol J Date: 2021-05-18 Impact factor: 7.271

9. Mouse lipidomics reveals inherent flexibility of a mammalian lipidome.

Authors: Michał A Surma; Mathias J Gerl; Ronny Herzog; Jussi Helppi; Kai Simons; Christian Klose
Journal: Sci Rep Date: 2021-09-29 Impact factor: 4.379

10. Plasma lipidomics of monozygotic twins discordant for multiple sclerosis.

Authors: Horst Penkert; Chris Lauber; Mathias J Gerl; Christian Klose; Markus Damm; Dirk Fitzner; Andrea Flierl-Hecht; Tania Kümpfel; Martin Kerschensteiner; Reinhard Hohlfeld; Lisa A Gerdes; Mikael Simons
Journal: Ann Clin Transl Neurol Date: 2020-11-07 Impact factor: 5.430