Richard D Khusial, Catherine E Cioffi, Shelley A Caltharp, Alyssa M Krasinskas, Adina Alazraki, Jack Knight-Scott, Rebecca Cleeton, Eduardo Castillo-Leon, Dean P Jones, Bridget Pierpont, Sonia Caprio, Nicola Santoro, Ayman Akil, Miriam B Vos.
Abstract
Nonalcoholic fatty liver disease (NAFLD) is the most common chronic liver disease in children, but diagnosis is challenging due to limited availability of noninvasive biomarkers. Machine learning applied to high-resolution metabolomics and clinical phenotype data offers a novel framework for developing a NAFLD screening panel in youth. Here, untargeted metabolomics by liquid chromatography-mass spectrometry was performed on plasma samples from a combined cross-sectional sample of children and adolescents aged 2-25 years with NAFLD (n = 222) and without NAFLD (n = 337), confirmed by liver biopsy or magnetic resonance imaging. Anthropometrics, blood lipids, liver enzymes, and glucose and insulin metabolism were also assessed. A machine learning approach was applied to the metabolomics and clinical phenotype data sets, which were split into training and test sets, and included dimension reduction, feature selection, and classification model development. The selected metabolite features were the amino acids serine, leucine/isoleucine, and tryptophan; three putatively annotated compounds (dihydrothymine and two phospholipids); and two unknowns. The selected clinical phenotype variables were waist circumference, whole-body insulin sensitivity index (WBISI) based on the oral glucose tolerance test, and blood triglycerides. The highest performing classification model was random forest, which had an area under the receiver operating characteristic curve (AUROC) of 0.94, sensitivity of 73%, and specificity of 97% for detecting NAFLD cases. A second classification model was developed using the homeostasis model assessment of insulin resistance (HOMA-IR) substituted for the WBISI. Similarly, the highest performing classification model was random forest, which had an AUROC of 0.92, sensitivity of 73%, and specificity of 94%.
Year: 2019 PMID: 31592078 PMCID: PMC6771165 DOI: 10.1002/hep4.1417
Source DB: PubMed Journal: Hepatol Commun ISSN: 2471-254X
Sociodemographic and Health Characteristics of the Sample Stratified by NAFLD Status
| Characteristic | Non‐NAFLD (n = 337): N Obs | Non‐NAFLD (n = 337): Estimate | NAFLD (n = 222): N Obs | NAFLD (n = 222): Estimate | P Value |
|---|---|---|---|---|---|
| Age, mean (SD) | 337 | 13.7 (4.0) | 222 | 13.6 (2.9) | 0.687 |
| Sex, count (%) | | | | | |
| Male | | | | | |
| Race/ethnicity, count (%) | | | | | |
| Non‐Hispanic black | | | | | |
| Non‐Hispanic white | | 128 (37.8) | | 70 (31.5) | 0.16 |
| Hispanic | | | | | |
| Asian/other | | 10 (2.9) | | 4 (1.8) | 0.75 |
| Weight status, count (%) | | | | | |
| Normal/underweight | | | | | |
| Overweight | | | | | |
| Obese | | | | | |
| Waist circumference, mean (SD) | | | | | |
| Hip circumference, mean (SD) | 236 | 104.9 (16.46) | 140 | 107.0 (16.93) | 0.24 |
| Fasting glucose (mg/dL), mean (SD) | 328 | 87.9 (12.7) | 220 | 88.3 (12.4) | 0.69 |
| 2‐Hour glucose (mg/dL), mean (SD) | | | | | |
| Fasting insulin (mg/dL), mean (SD) | | | | | |
| WBISI, mean (SD) | | | | | |
| Insulinogenic index, mean (SD) | 218 | 4.7 (5.47) | 112 | 5.6 (4.58) | 0.15 |
| Disposition index, mean (SD) | 217 | 9.3 (16.64) | 112 | 7.1 (6.10) | 0.19 |
| HOMA‐IR, mean (SD) | | | | | |
| HBA1C (%), mean (SD) | 204 | 5.5 (0.28) | 97 | 5.5 (0.34) | 0.07 |
| Diabetes diagnoses, count (%) | | | | | |
| Type 2 diabetes | 328 | 8 (2.7) | 220 | 4 (2.3) | 0.99 |
| Impaired glucose tolerance | 233 | 40 (17.2) | 115 | 26 (22.6) | 0.28 |
| Impaired fasting glucose | 328 | 26 (10.2) | 220 | 24 (14.0) | 0.69 |
| ALT (U/L), median (IQR) | | | | | |
| AST (U/L), median (IQR) | | | | | |
| Total‐C (mg/dL), mean (SD) | 304 | 158.4 (41.2) | 203 | 163.8 (37.9) | 0.13 |
| Triglycerides (mg/dL), mean (SD) | | | | | |
| HDL‐C (mg/dL), mean (SD) | | | | | |
| LDL‐C (mg/dL), mean (SD) | 304 | 93.8 (38.0) | 203 | 96.7 (32.3) | 0.34 |
P values calculated using Student t tests or chi‐square tests. For ALT and AST, nonparametric Mann‐Whitney U test was performed due to non‐normality of the variables. Bold values indicate clinically important differences between groups.
Abbreviations: C, cholesterol; HBA1C, hemoglobin A1C; IQR, interquartile range.
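As a rough illustration of how a two-sample P value of this kind relates to the reported summary statistics, the sketch below computes a pooled-variance (Student) t statistic from means, SDs, and group sizes, using the hip circumference row as input. It is not the authors' code: the function name is hypothetical, and the P value uses a normal approximation to the t distribution, an assumption that is reasonable here only because the degrees of freedom are large.

```python
from math import sqrt
from statistics import NormalDist


def pooled_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Student (pooled-variance) t statistic from summary statistics.

    Returns (t, degrees of freedom, approximate two-sided P value).
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / standard_error
    # Assumption: normal approximation to the t distribution (large df only).
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p


# Hip circumference: NAFLD 107.0 (16.93), n = 140; non-NAFLD 104.9 (16.46), n = 236
t_stat, df, p_value = pooled_t_from_summary(107.0, 16.93, 140, 104.9, 16.46, 236)
print(f"t = {t_stat:.2f}, df = {df}, P = {p_value:.2f}")
```

With these inputs the approximate P value rounds to 0.24, in line with the value reported for hip circumference. Rows based on rounded means may not reproduce their reported P values as closely.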
Figure 1. Workflow of the machine learning‐based approach used for developing and testing the NAFLD screening panels. The proposed workflow consists of a dimension reduction technique, followed by a two‐step machine learning approach. After a potential subset of features is obtained, algorithm mapping with three classifiers is explored to identify an optimal model.
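The workflow starts from data split into training and test sets. A minimal sketch of one common way to do this, a stratified split that keeps the NAFLD to non-NAFLD ratio similar in both sets, is shown below. The 70/30 ratio, the seed, and the function name are assumptions for illustration only; the text here does not state how the authors split their data.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Split sample indices into train/test sets, class by class.

    Each class contributes roughly `test_fraction` of its samples to the
    test set, so class proportions are preserved in both sets.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test_idx.extend(shuffled[:n_test])
        train_idx.extend(shuffled[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Group sizes from the abstract: 222 NAFLD (label 1), 337 non-NAFLD (label 0)
labels = [1] * 222 + [0] * 337
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Dimension reduction, feature selection, and classifier fitting would then use only the training indices, with the test indices held back for the final evaluation.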
Annotated Identities for the 11 Metabolic Features From HRM Retained After Feature Selection
| m/z | Time (Seconds) | Compound Name | HMDB ID | Formula | MSI Level | Adduct | Difference in Expression (NAFLD vs. Control) |
|---|---|---|---|---|---|---|---|
| 106.0499 | 81.1 | Serine | HMDB00187 | C3H7NO3 | 1 | M+H | ↓ |
| 129.0658 | 61.3 | Dihydrothymine | HMDB00079 | C5H8N2O2 | 2 | M+H | ↓ |
| 132.1019 | 44.9 | Leucine/isoleucine | HMDB00172 | C6H13NO2 | 1 | M+H | ↑ |
| 133.1052 | 45.9 | Leucine/isoleucine | HMDB00172 | C6H13NO2 | 1 | (M+1)+H | ↑ |
| 196.8651 | 57 | ‐ | ‐ | ‐ | 4 | ‐ | ↓ |
| 205.0972 | 45.8 | Tryptophan | HMDB00929 | C11H12N2O2 | 1 | M+H | ↑ |
| 206.1005 | 45.7 | Tryptophan | HMDB00929 | C11H12N2O2 | 1 | (M+1)+H | ↑ |
| 434.6894 | 55.3 | ‐ | ‐ | ‐ | 4 | ‐ | ↓ |
| 510.3552 | 37.6 | LysoPE(20:0) | HMDB11481 | C25H52NO7P | 2 | M+H | ↓ |
| 544.3377 | 39 | LysoPC(18:1) | HMDB02815 | C26H52NO7P | 2 | M+Na | ↓ |
| 545.3415 | 37.9 | LysoPC(18:1) | HMDB02815 | C26H52NO7P | 2 | (M+1)+Na | ↓ |
Confidence levels assigned according to MSI criteria, whereby level 1 is identified compounds, level 2 is putatively annotated compounds, level 3 is putatively characterized compound class, and level 4 is unknown compounds.
Abbreviations: HMDB, Human Metabolome Database; HRM, high‐resolution metabolomics; MSI, Metabolomics Standards Initiative.
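The first column of the table can be checked against the listed formula and adduct. The sketch below computes the theoretical m/z of a singly charged M+H or M+Na ion from monoisotopic atomic masses and compares it with the tabulated value. The helper names are hypothetical, only the elements needed for this table are included, and the (M+1) isotope features are left out for simplicity.

```python
import re

# Monoisotopic atomic masses (Da) for the elements used in the table
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "P": 30.9737620,
}

# Mass added by each singly charged adduct (cation mass minus one electron)
ADDUCT_SHIFT = {"M+H": 1.007276, "M+Na": 22.989218}


def neutral_mass(formula):
    """Monoisotopic mass of a simple formula such as 'C3H7NO3'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def adduct_mz(formula, adduct):
    """Theoretical m/z of a singly charged adduct ion."""
    return neutral_mass(formula) + ADDUCT_SHIFT[adduct]


# (compound, formula, adduct, m/z listed in the table)
rows = [
    ("Serine", "C3H7NO3", "M+H", 106.0499),
    ("Dihydrothymine", "C5H8N2O2", "M+H", 129.0658),
    ("Leucine/isoleucine", "C6H13NO2", "M+H", 132.1019),
    ("Tryptophan", "C11H12N2O2", "M+H", 205.0972),
    ("LysoPE(20:0)", "C25H52NO7P", "M+H", 510.3552),
    ("LysoPC(18:1)", "C26H52NO7P", "M+Na", 544.3377),
]
for name, formula, adduct, listed in rows:
    theoretical = adduct_mz(formula, adduct)
    ppm = (listed - theoretical) / theoretical * 1e6
    print(f"{name}: theoretical {theoretical:.4f}, listed {listed:.4f}, {ppm:+.1f} ppm")
```

All six monoisotopic features agree with their formulas to within 0.001 m/z units, which is the kind of mass accuracy expected from high-resolution instruments.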
Evaluation Metrics for the Metabolomics‐Only and the Combined Metabolomics–Clinical Models With and Without OGTT Variables Applied to the Training and Test Sets
| Classifier | Metabolomics‐Only Model: AUROC | Metabolomics‐Only Model: Sensitivity | Metabolomics‐Only Model: Specificity | Combined Model #1 (With OGTT Variables): AUROC | Combined Model #1 (With OGTT Variables): Sensitivity | Combined Model #1 (With OGTT Variables): Specificity | Combined Model #2 (Without OGTT Variables): AUROC | Combined Model #2 (Without OGTT Variables): Sensitivity | Combined Model #2 (Without OGTT Variables): Specificity |
|---|---|---|---|---|---|---|---|---|---|
| Training Set | | | | | | | | | |
| Naive Bayes | 0.847 | 77% | 78% | 0.880 | 82% | 77% | 0.866 | 81% | 81% |
| Random forest | 0.846 | 60% | 84% | | | | | | |
| Logistic | | | | 0.875 | 75% | 87% | 0.876 | 73% | 88% |
| Test Set | | | | | | | | | |
| Naive Bayes | 0.825 | 73% | 78% | 0.848 | 73% | 75% | 0.842 | 73% | 76% |
| Random forest | 0.818 | 60% | 87% | | | | | | |
| Logistic | | | | 0.863 | 72% | 83% | 0.858 | 70% | 81% |
Bold values indicate the classifier with the highest AUROC for each model in the training or test set.
Abbreviations: AUROC, area under the receiver operating characteristic curve; OGTT, oral glucose tolerance test.
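For readers less familiar with the three metrics in the table, the sketch below shows how they are defined for a binary classifier that outputs a score per sample. The labels and scores are made-up toy values, not study data, and the 0.5 decision threshold is an assumption; the function names are hypothetical.

```python
def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auroc(labels, scores):
    """AUROC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Toy example: 1 = NAFLD, 0 = non-NAFLD; scores are hypothetical model outputs
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]

sens, spec = sensitivity_specificity(toy_labels, toy_scores)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
print(f"AUROC = {auroc(toy_labels, toy_scores):.3f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas AUROC summarizes ranking quality across all thresholds, which is why a model can pair a high AUROC with a modest sensitivity at one particular cutoff.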
Figure 2. AUROC curves produced from random forest, logistic regression, and Naive Bayes classifiers. (A) Metabolomics‐only model, (B) combined metabolomics–clinical model with OGTT variables, (C) combined metabolomics–clinical model without OGTT variables.