Rikke Linnemann Nielsen, Marianne Helenius, Sara L Garcia, Henrik M Roager, Derya Aytan-Aktug, Lea Benedicte Skov Hansen, Mads Vendelbo Lind, Josef K Vogt, Marlene Danner Dalgaard, Martin I Bahl, Cecilia Bang Jensen, Rasa Muktupavela, Christina Warinner, Vincent Aaskov, Rikke Gøbel, Mette Kristensen, Hanne Frøkiær, Morten H Sparholt, Anders F Christensen, Henrik Vestergaard, Torben Hansen, Karsten Kristiansen, Susanne Brix, Thomas Nordahl Petersen, Lotte Lauritzen, Tine Rask Licht, Oluf Pedersen, Ramneek Gupta.
Abstract
Diet is an important component of weight management strategies, but heterogeneous responses to the same diet make it difficult to foresee individual weight-loss outcomes. Omics-based technologies now allow multiple factors to be analysed for weight loss prediction at the individual level. Here, we classify weight loss responders (N = 106) and non-responders (N = 97) among overweight, non-diabetic, middle-aged Danes who took part in two previously reported 8-week dietary trials. Random forest models integrated gut microbiome, host genetics, urine metabolome, and measures of physiology and anthropometrics, all measured prior to any dietary intervention, to identify individual predisposing features of weight loss in combination with diet. The most predictive models for weight loss included features of diet, gut bacterial species and urine metabolites (ROC-AUC: 0.84-0.88), compared to a diet-only model (ROC-AUC: 0.62). A model ensemble integrating multi-omics identified 64% of the non-responders with 80% confidence. Such models will be useful to assist in selecting appropriate weight management strategies, as individual predisposition to diet response varies.
Year: 2020 PMID: 33208769 PMCID: PMC7674420 DOI: 10.1038/s41598-020-76097-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Study design, including data availability, feature development and selection, the best features selected for modelling, and clinical prognosis of weight loss. Participants achieving any weight loss during the 8-week dietary intervention were considered weight loss responders. Different combinations of features were selected for modelling; for example, faecal samples were represented by butyrate-producing species from the MGmapper Bacteria draft catalogue and by forward-selected 16S taxonomies drawn from a pool of the top 250 most varying. These were combined with forward-selected urine metabolites identified by LC–MS. Only measurements from the beginning of the intervention periods were used as features for developing the predictive weight loss models.
Figure 2. Weight changes in the two dietary intervention arms during 8 weeks. (a) Distribution of percentage changes in body weight for the whole grain study. (b) Distribution of percentage changes in body weight for the gluten study. The coloured lines denote means and standard deviations for the diet groups (green = whole grain-rich diet, orange = refined grain diet, blue = low-gluten diet).
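The responder definition in Figure 1 and the percentage changes plotted in Figure 2 can be illustrated with a minimal sketch. The function names and the example weights are hypothetical, and treating an unchanged weight as non-response is an assumption based on the phrase "any weight loss". The class coding (1 = responder, 0 = non-responder) follows the Figure 3 caption.

```python
def percent_weight_change(baseline_kg, week8_kg):
    """Percentage change in body weight relative to baseline."""
    return 100.0 * (week8_kg - baseline_kg) / baseline_kg


def is_responder(baseline_kg, week8_kg):
    """Responder = any weight loss, i.e. a strictly lower weight at week 8.

    Assumption: an unchanged weight counts as non-response.
    """
    return week8_kg < baseline_kg


# Hypothetical participants: (baseline weight in kg, weight after 8 weeks in kg)
participants = [(90.0, 88.2), (82.5, 82.5), (77.0, 78.1)]

changes = [percent_weight_change(b, w) for b, w in participants]
labels = [int(is_responder(b, w)) for b, w in participants]  # 1 = responder
```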
Table 1. Overview of datasets, number of features and feature selection for random forest models.
| Data type | Data label | Number of features before filtering | Number of features after prior knowledge filtering | Data-driven feature selection applied (Y/N) |
|---|---|---|---|---|
| Diet: binary features that represent the type of diet as whole grain-rich, low-gluten or refined grain | Diet | 3 | – | N |
| Anthropometrics and physiological | ClinicalA | 28 | 8 (age, sex, BMI and blood CRP, IL-6, HbA1c, HOMA-IR and zonulin) | N |
| Anthropometrics and physiological | ClinicalB | 28 | – | Y |
| Whole grain and gluten intake | ContinuousIntake | 2 | – | N |
| Gastrointestinal transit time | TransitTime | 1 | – | N |
| Self-reported | VAS | 16 | – | N |
| Postprandial response | PostPran | 5 | – | N |
| SNPs | | 272,588 (after QC) | | |
| Literature pathways | LitPath | | 703 SNPs | Y |
| LD pruned literature pathways | LitPathLD | | 56 SNPs | Y |
| Genetic risk scores | GRS | | 5 GRSs of 32 SNPs | N |
| 16S (taxonomies) | | 10,093 | | |
| Top 10 most varying | 16S_A | | 10 | N |
| Top 250 most varying | 16S_B | | 250 | Y |
| Prevalence | 16S_C | | 3321 | Y |
| Bacteria catalogue | MGm_A | 464 | 9 | N |
| Bacteria draft catalogue | MGm_B | 1318 | – | Y |
| Bacteria draft catalogue | MGm_B1 | 1318 | 11 | N |
| Human microbiome catalogue | MGm_C | 444 | 10 | N |
| Butyrate-producing species from MGmapper catalogues | MGm | – | 30 | N |
| Metagenomic species | MGS | 1264 | – | Y |
| Top 14 from whole grain and gluten studies | topMGS | | 28 | N |
| GC–MS | GC–MS | 85 | – | Y |
| LC–MS | LC–MS | 1285 | – | Y |
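Several datasets in the table are built by keeping the "top N most varying" features (for example 16S_A and 16S_B). The sketch below shows one way such a filter could work. The source does not state which measure of variability was used, so ranking by population variance is an assumption, and the feature names and abundances are hypothetical.

```python
from statistics import pvariance


def top_k_most_varying(table, k):
    """Return the k feature names with the largest variance across samples.

    table maps a feature name to a list of values, one per sample.
    Assumption: variability is measured as population variance.
    """
    ranked = sorted(table, key=lambda name: pvariance(table[name]), reverse=True)
    return ranked[:k]


# Hypothetical relative abundances for three taxa across four samples
abundances = {
    "otu_a": [0.10, 0.11, 0.10, 0.12],
    "otu_b": [0.00, 0.40, 0.05, 0.30],
    "otu_c": [0.20, 0.25, 0.15, 0.20],
}

selected = top_k_most_varying(abundances, k=2)
```

A filter like this looks only at the features, not at the outcome labels, which is why the table lists 16S_A with no data-driven selection applied.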
Table 2. Model performances for models run on a set of 130 individuals with complete data in all data combinations listed below.
Performance is reported as the mean of five cross-validations repeated 50 times with random shuffles of the cross-validation splits. The blue-red colorbar is for area under the receiver operating characteristic curve (ROC-AUC), sensitivity and specificity, while the blue-yellow-red colorbar is for Matthews correlation coefficient (MCC). Diet represents the dataset consisting of the three features indicating which diet was consumed. EnergyIntake is the energy intake at baseline, while ContinuousIntake is the total intake of whole grain (g/day) and gluten (mg/day) at baseline. ClinicalA and ClinicalB are feature subsets selected by prior knowledge and forward selection, respectively, from the set of 28 anthropometric and physiological features. LitPathLD and GRS are subsets of genetic variants selected by prior knowledge, where LitPathLD was also subject to forward selection. 16S_B is the set of forward-selected 16S-based OTUs selected from a pool of the top 250 most varying features. MGm_B and MGm_B1 are subsets of species mapped by MGmapper to the Bacteria draft catalogue, selected by forward selection and by prior knowledge as butyrate-producing species, respectively. LC–MS[45-lcPos_142-lcPos] holds a pair of urine metabolites identified by LC–MS. PostPranFluc3_50 holds the postprandial response features (free fatty acids, GLP-2, glucose and insulin), represented by the third image analysis method with a grid size of 50 × 50 (see “Methods”). Abbreviations for model combinations are explained in Table 1 and in the main text. Performances of all models run on the 130 individuals are in Supplementary Material 3.
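The evaluation scheme in the legend above (cross-validation repeated 50 times with reshuffled splits) can be sketched with the standard library alone. Reading "five cross-validations" as five folds is an assumption, as are the function name, the fixed seed and the absence of class stratification.

```python
import random


def repeated_kfold_indices(n_samples, n_folds=5, n_repeats=50, seed=0):
    """Yield (repeat, fold, train_idx, test_idx), reshuffling before each repeat.

    Assumption: plain (unstratified) k-fold; the source does not say
    whether the splits were stratified by class.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # new random split for every repeat
        folds = [order[i::n_folds] for i in range(n_folds)]
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield repeat, fold, train_idx, test_idx


# 130 individuals with complete data, as in the table above
splits = list(repeated_kfold_indices(130))
```

A reported score would then be the mean of the per-split test scores over all 250 train/test pairs.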
Table 3. Model performances for models run on all individuals available for a given data combination.
Performance is reported as the mean of five cross-validations repeated 50 times with random shuffles of the cross-validation splits. Models in bold were included in an ensemble (ROC-AUC > 0.62). The blue-red colorbar is for area under the receiver operating characteristic curve (ROC-AUC), sensitivity and specificity, while the blue-yellow-red colorbar is for Matthews correlation coefficient (MCC). Diet represents the dataset consisting of the three features indicating which diet was consumed. EnergyIntake is the energy intake at baseline, while ContinuousIntake is the total intake of whole grain (g/day) and gluten (mg/day) at baseline. VAS represents the self-reported features measured by Visual Analogue Scale. ClinicalA and ClinicalB are feature subsets selected by prior knowledge and forward selection, respectively, from the set of 28 anthropometric and physiological features. TransitTime is the baseline transit time. LitPathLD and GRS are subsets of genetic variants selected by prior knowledge, where LitPathLD was also subject to forward selection. 16S_B is the set of forward-selected 16S-based OTUs selected from a pool of the top 250 most varying features. MGm_B and MGm_B1 are subsets of species mapped by MGmapper to the Bacteria draft catalogue, selected by forward selection and by prior knowledge as butyrate-producing species, respectively. topMGS is the top 28 selected MGSs from the whole grain and gluten studies. LC–MS[45-lcPos_142-lcPos] holds a pair of urine metabolites identified by LC–MS. PostPranFluc3_50 holds the postprandial response features (free fatty acids, GLP-2, glucose and insulin), represented by the third image analysis method with a grid size of 50 × 50 (see “Methods”). Abbreviations for model combinations are explained in Table 1 and in the main text. Performances of all models run are in Supplementary Material 5.
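For readers less familiar with the scores named in these legends, the sketch below computes sensitivity, specificity and MCC from binary labels using their standard definitions. The helper names and the example labels are hypothetical; responders are coded as class 1 and non-responders as class 0, as in the Figure 3 caption.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) with class 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """Fraction of true responders predicted as responders."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of true non-responders predicted as non-responders."""
    return tn / (tn + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical labels: 1 = responder, 0 = non-responder
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
```

Unlike sensitivity and specificity, MCC ranges from -1 to 1, which may be why the legends give it a separate colour scale.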
Figure 3. Feature importance for the models. (a) Models combine the type of diet, forward-selected 16S-based OTUs from a pool of the top 250 most varying (left, Diet.16S_B.LC–MS) or butyrate-producing species (right, Diet.MGm_B1.LC–MS), and forward-selected urine metabolites identified by LC–MS. Features selected in at least 15% of all trained models are shown. The columns represent the two data combinations, and the rows represent the runs on 130 common individuals with complete data across 18 out of 22 datasets (Only individuals with complete data) as well as runs on all individuals available for the data combination (All available individuals). The red line marks the features of highest importance given the relative Gini coefficient. (b,c) Illustrations of the random forest models trained and tested with all available individuals on diet, forward-selected 16S-based OTUs from a pool of the top 250 most varying (b, Diet.16S_B.LC–MS) or butyrate-producing species (c, Diet.MGm_B1.LC–MS), and forward-selected urine metabolites identified by LC–MS. The labels on the tree leaves show, per leaf, the Gini impurity, the number of unique individuals, the distribution of classes for the bootstrap sample and the majority class. The colours denote the majority class, with greater saturation indicating a larger majority; orange marks non-responders (class 0) and blue marks responders (class 1).
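The Gini impurity and majority class shown on each tree leaf can be computed from the leaf's class counts. The sketch below uses the standard definition of Gini impurity; the function names and counts are hypothetical, and the class order (index 0 = non-responders, index 1 = responders) follows the caption above.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for the class counts in one leaf."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def majority_class(class_counts):
    """Index of the class with the most samples (ties go to the lower index)."""
    return max(range(len(class_counts)), key=lambda k: class_counts[k])


# Hypothetical leaf: 3 non-responders (class 0) and 1 responder (class 1)
leaf_counts = [3, 1]
leaf_gini = gini_impurity(leaf_counts)
leaf_majority = majority_class(leaf_counts)
```

A pure leaf has impurity 0, and an evenly split two-class leaf has the maximum value of 0.5.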
Figure 4. Ensemble of weight loss models. (a) Performances based on four scoring schemes and different classification thresholds for predictive models included in different personalised ensemble models. The confidence column shows the applied prediction score thresholds, where s is the prediction score. The first four rows are the presented ensemble, consisting of the models in bold in Table 3; an extension of this to other confidence thresholds is found in Supplementary Material 7, Table S.6. The ensemble in bold is represented in (b,c). "Without microbiome" is an ensemble of the same models as in bold in Table 3 but excluding all models that contain microbiome data. "Prior knowledge only" is an ensemble of models that only include features selected by prior knowledge feature selection approaches. (b) The prediction scores across responders or non-responders, with colours representing the type of dietary intervention. The scores shown are from the ensemble scoring method "mean of confident scores" (s ≤ 0.25 or s ≥ 0.75). (c) The sensitivity, positive predictive value (PPV), specificity and negative predictive value (NPV) are calculated at different score thresholds to separate the classes for the ensemble model shown in (b). MCC: Matthews correlation coefficient.
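The "mean of confident scores" scheme named in panel (b) can be sketched as follows, using the thresholds from the caption. The caption does not say how an individual with no confident model scores is handled, so returning `None` in that case is an assumption. The function names and example scores are hypothetical, and PPV and NPV use their standard definitions.

```python
def mean_of_confident_scores(scores, low=0.25, high=0.75):
    """Average only the model scores s with s <= low or s >= high.

    Assumption: returns None when no model gives a confident score.
    """
    confident = [s for s in scores if s <= low or s >= high]
    if not confident:
        return None
    return sum(confident) / len(confident)


def ppv(tp, fp):
    """Positive predictive value: fraction of predicted responders that are correct."""
    return tp / (tp + fp)


def npv(tn, fn):
    """Negative predictive value: fraction of predicted non-responders that are correct."""
    return tn / (tn + fn)


# Hypothetical per-model prediction scores for three individuals
likely_responder = mean_of_confident_scores([0.9, 0.8, 0.5, 0.6])
likely_non_responder = mean_of_confident_scores([0.1, 0.2, 0.55])
undecided = mean_of_confident_scores([0.4, 0.6])
```

Panel (c) then corresponds to recomputing sensitivity, PPV, specificity and NPV while sweeping the cut-off applied to the ensemble score.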