Machine Learning Approach to Classify Cardiovascular Disease in Patients With Nonalcoholic Fatty Liver Disease in the UK Biobank Cohort.

Divya Sharma¹, Neta Gotlieb², Michael E Farkouh³, Keyur Patel⁴, Wei Xu¹,⁵, Mamatha Bhat⁶.

Abstract

Background: Nonalcoholic fatty liver disease (NAFLD) is the most prevalent liver disease worldwide. Cardiovascular disease (CVD) is the leading cause of mortality among patients with NAFLD. The aim of our study was to develop a machine learning algorithm integrating clinical, lifestyle, and genetic risk factors to identify CVD in patients with NAFLD.

Methods and Results: We created a cohort of patients with NAFLD from the UK Biobank, diagnosed according to proton density fat fraction from magnetic resonance imaging data sets. A total of 400 patients with NAFLD with subclinical atherosclerosis or clinical CVD, defined by disease codes, constituted cases, and 446 patients with NAFLD and no CVD constituted controls. We evaluated 7 different supervised machine learning approaches on clinical, lifestyle, and genetic variables for identifying CVD in patients with NAFLD. The most significant clinical and lifestyle variables observed by the predictive modeling were age (59 years [54.00-63.00 years]), hypertension (systolic blood pressure, 145 mm Hg [134.0-156.0 mm Hg]; diastolic blood pressure, 85 mm Hg [79.00-93.00 mm Hg]), waist circumference (98 cm [91.00-105.00 cm]), and sedentary lifestyle, defined as time spent watching TV >4 h/d. In the genetic data, single-nucleotide polymorphisms in the IL16 and ANKLE1 genes were most significant. Our proposed ensemble-based integrative machine learning model achieved an area under the curve of 0.849 using the random forest modeling for CVD prediction.

Conclusions: We propose a machine learning algorithm that identifies CVD in patients with NAFLD through integration of significant clinical, lifestyle, and genetic risk factors. Patients with NAFLD who are at higher risk of CVD should be flagged for screening and aggressive treatment of their cardiometabolic risk factors to prevent cardiovascular morbidity and mortality.

Keywords: cardiovascular disease; machine learning; nonalcoholic fatty liver disease

Year: 2021; PMID: 34927450; PMCID: PMC9075189; DOI: 10.1161/JAHA.121.022576

Source DB: PubMed; Journal: J Am Heart Assoc; ISSN: 2047-9980; Impact factor: 6.106

carotid intima‐media thickness; machine learning; magnetic resonance imaging–derived proton density fat fraction; nonalcoholic fatty liver disease; random forest

Clinical Perspective

What Is New?

An integrative machine learning model can identify patients with nonalcoholic fatty liver disease (NAFLD) at high risk for developing subclinical and clinical cardiovascular complications. Components of the “metabolic syndrome,” sedentary lifestyle, and specific genetic single‐nucleotide polymorphisms are among the most significant contributors to cardiovascular disease (CVD) complications in patients with NAFLD. Model performance is best when clinical, lifestyle, and genetic data are integrated, reflecting the complexity of NAFLD as a risk factor for CVD.

What Are the Clinical Implications?

A machine learning algorithm can be used in clinical practice to flag those patients with early NAFLD at high risk of CVD. CVD screening and treatment of metabolic risk factors as early as possible can potentially reduce the morbidity and mortality associated with CVD, the most common complication of NAFLD.

Nonalcoholic fatty liver disease (NAFLD) has become the most prevalent liver disease worldwide, affecting ≈25% of the population globally. It has become the main cause of liver cirrhosis and hepatocellular carcinoma, and is predicted to soon become the leading indication for liver transplantation, thereby representing a significant economic burden.

Cardiovascular disease (CVD) is the most important cause of morbidity and mortality among patients with NAFLD. CVD dictates outcomes in patients with NAFLD to a greater extent than does the progression of liver disease, accounting for ≈40% to 45% of the total deaths in this population. Furthermore, a meta‐analysis found that patients with NAFLD had 64% higher odds of CVD during a median follow‐up period of 7 years.

The strong association between NAFLD and CVD is the result of shared metabolic risk factors, such as hypertension, dyslipidemia, and insulin resistance. In addition, NAFLD is an independent risk factor for CVD, suggesting that it should be considered as the hepatic component of the metabolic syndrome. Through a bidirectional relationship between NAFLD and metabolic syndrome, NAFLD accelerates the progression of subclinical atherosclerosis and promotes premature CVD events and mortality. Furthermore, NAFLD may directly contribute to atherosclerosis and CVD via hepatic secretion of proinflammatory markers, atherogenic lipoproteins, and procoagulant factors, which results in arterial wall inflammation and secondary plaque vulnerability. Consequently, NAFLD is strongly associated with several markers of subclinical atherosclerosis, including carotid intima‐media thickening (CIMT), increased coronary artery calcification, impaired flow‐mediated vasodilation, and arterial stiffness. Indeed, several large cross‐sectional studies have shown that NAFLD is associated with clinical CVD independent of traditional risk factors and metabolic syndrome, making CVD prediction in patients with NAFLD an important research topic.

In recent times, researchers have also explored the capability of machine learning (ML) algorithms to improve the accuracy of cardiovascular risk prediction. The aim of our study was to develop a novel integrative ML algorithm that could classify CVD in patients with NAFLD using the richly annotated clinical, demographic, and laboratory data from the UK Biobank. Identifying those patients with NAFLD at higher risk of CVD could guide appropriate preventive and therapeutic interventions, thereby reducing the most important cause of morbidity and mortality in patients with NAFLD.

Methods

Data Availability

Publicly available data from the UK Biobank study were analyzed in this study. The data sets are available to researchers through an open application via https://www.ukbiobank.ac.uk/register‐apply/. Code for our integrative ML modeling is available at the link: https://github.com/divya031090/ML_NAFLD_CVD.

Setting

The UK Biobank is a large, prospective study of >500 000 individuals aged 40 to 69 years, recruited between 2006 and 2010. For the UK Biobank, ethical procedures are controlled by a dedicated Ethics and Guidance Council (http://www.ukbiobank.ac.uk/ethics), with institutional review board approval obtained from the North‐West Multi‐Center Research Ethics Committee. All participants provided written informed consent before enrollment in the UK Biobank. Access to the data was granted for this work under UK Biobank application number 53976. The study collected extensive phenotypic and genotypic details about its participants, including data from questionnaires, physical measures, accelerometry, multimodal imaging, genome‐wide genotyping, and longitudinal follow‐up for a wide range of health‐related outcomes. The detailed cohort protocol, scientific rationale, and study design are available online.

Definition and Diagnosis of NAFLD

The definition of NAFLD requires evidence of hepatic steatosis by either histology or imaging, with exclusion of other causes of liver disease. Magnetic resonance imaging (MRI) and magnetic resonance spectroscopy are now considered “gold‐standard” methods for quantitative hepatic fat measurement. MRI‐derived proton density fat fraction (MRI‐PDFF) is a method that quantifies hepatic steatosis with a high degree of accuracy and is considered a well‐validated diagnostic tool that is not significantly impacted by demographics, histologic activity, or coexisting hepatic conditions. To create a cohort of subjects with NAFLD, we selected subjects with a PDFF ≥5% (MRI‐PDFF ≥5), which is the threshold for hepatic steatosis, with high sensitivity and specificity according to previous validated studies. From the cohort of patients with MRI‐PDFF ≥5, we excluded all subjects who were diagnosed with alcoholic liver disease (defined as daily alcohol consumption >30 g for men and >20 g for women), alcoholic cirrhosis, obstruction/ascending cholangitis/sclerosing cholangitis, α‐1 antitrypsin deficiency, Wilson disease, hemochromatosis, primary biliary cholangitis, and viral hepatitis.

CVD Diagnosis

For the diagnosis of CVD, we used parameters indicating both subclinical atherosclerosis and clinical CVD. CIMT is a noninvasive, ultrasound‐based measurement of the arterial wall thickness secondary to atherosclerotic plaques. CIMT indicates subclinical atherosclerosis and is a validated and well‐described predictive marker of major cardiovascular events. Previous studies showed that a maximum CIMT >900 µm is associated with increased risk for coronary artery disease. We calculated the mean of the maximum CIMT at 4 angles: 120, 150, 210, and 240 degrees. An individual whose mean maximal CIMT was >900 µm was considered to have subclinical CVD. We further included individuals with clinical CVD characterized by ischemic heart disease (history of myocardial infarction and angina pectoris) and/or heart failure. Subjects with prior CVD, defined as self‐reported prior myocardial infarction, stroke, and transient ischemic attack, as well as family history of CVD and prior diagnoses identified using International Classification of Diseases (ICD‐10) codes, were excluded from the study. Clinical CVD was defined as the subject’s first year of hospital admission attributable to CVD after recruitment or death from CVD, based on ICD‐10 codes I20 to I25 identified from linkages to the national death index and Hospital Episode Statistics. CIMT measurements were recorded for the subjects at an imaging visit in the year 2014.
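
The subclinical CVD rule described above can be sketched as follows. This is a minimal illustration, assuming the per‐angle maximum CIMT values are available in micrometers (a CIMT of 900 µm equals 0.9 mm); the function and variable names are hypothetical and are not taken from the study's code.

```python
from statistics import mean

# The 4 measurement angles named in the text, in degrees.
CIMT_ANGLES_DEG = (120, 150, 210, 240)
# Threshold on the mean maximal CIMT, assumed here to be in micrometers.
CIMT_THRESHOLD_UM = 900.0


def mean_max_cimt(max_cimt_by_angle):
    """Average the maximum CIMT (micrometers) over the 4 angles.

    Raises KeyError if any of the 4 angles is missing.
    """
    return mean(max_cimt_by_angle[angle] for angle in CIMT_ANGLES_DEG)


def has_subclinical_cvd(max_cimt_by_angle):
    """True when the mean maximal CIMT is strictly above the threshold."""
    return mean_max_cimt(max_cimt_by_angle) > CIMT_THRESHOLD_UM


# Hypothetical subject, values in micrometers.
example_subject = {120: 850.0, 150: 920.0, 210: 980.0, 240: 910.0}
```

Averaging over the 4 angles before thresholding means a single high reading does not classify a subject as a case on its own.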

Clinical Data

We analyzed clinical and demographic variables that are known as risk factors for NAFLD and CVD. To expand the data available on metabolic risk factors, we also looked at the medication intake of the subjects (presented in Table S1). In addition, we analyzed laboratory parameters, including aspartate aminotransferase, alanine aminotransferase, γ‐glutamyl transferase, alkaline phosphatase, total bilirubin, creatinine, hemoglobin, platelet count, urate, and levels of total cholesterol, low‐density lipoproteins, triglycerides, glucose, and hemoglobin A1c. The summary of the clinically important variables for cardiovascular outcome used in the analysis is presented in Table 1. Novel risk factors for CVD, such as markers for inflammation (eg, hs‐CRP [high‐sensitivity C‐reactive protein], high‐density lipoproteins, and albumin), were also included in the analysis.

Table 1

Baseline Characteristics of Variables That Significantly Contribute to CVD

Variables	Cases (N = 400)	Controls (N = 446)	P value
Sex
Women	174 (43.5)	213 (47.7)	0.20
Men	226 (56.5)	233 (52.3)
Diabetes
Yes	132 (33)	80 (17.9)	1.83E‐06
No	268 (67)	366 (82.1)
Race
White	392 (98)	433 (97)	0.978
East Asian	2 (0.5)	4 (0.9)
Southeast Asian	3 (0.75)	6 (1.3)
Black	1 (0.25)	0 (0)
Other*	3 (0.75)	3 (0.7)
Age, y	59.00 (54.00–63.00)	55.00 (48.00–60.00)	1.74E‐15
Weight, kg	86.55 (75.90–95.78)	83.20 (73.30–94.00)	7.91E‐3
BMI, kg/m²	29.23 (26.80–31.94)	28.46 (25.99–31.08)	1.55E‐3
Diastolic blood pressure, mm Hg	85.00 (79.00–93.00)	82.00 (76.00–89.00)	6.92E‐7
Systolic blood pressure, mm Hg	145.0 (134.0–156.0)	135.00 (125.00–146.5)	4.22E‐16
Waist circumference, cm	98.00 (91.00–105.00)	95.00 (88.00–101.00)	1.37E‐5
Low‐density lipoprotein, mmol/L	3.53 (2.95–4.133)	3.74 (3.18–4.34)	1.63E‐4
Glucose, mmol/L	5.04 (4.63–5.61)	4.97 (4.62–5.36)	0.04
Alanine aminotransferase, U/L	27.24 (20.18–36.80)	25.60 (19.09–34.06)	0.04
Aspartate aminotransferase, U/L	26.60 (22.60–31.73)	25.60 (22.12–30.70)	0.05
Alkaline phosphatase, U/L	82.10 (68.42–97.47)	78.90 (67.70–94.67)	0.04
γ‐Glutamyl transferase, U/L	34.10 (23.82–50.08)	33.20 (22.60–51.95)	0.2
Bilirubin, µmol/L	8.29 (6.65–10.37)	8.31 (6.67–10.52)	0.90
Albumin, g/L	45.52 (43.84–47.24)	45.41 (43.70–47.15)	0.40
White blood cell count, 10⁹ cells/L	6.89 (6.00–7.98)	6.70 (5.70–7.77)	0.02
High‐density lipoprotein, mmol/L	1.25 (1.08–1.41)	1.24 (1.06–1.47)	0.8
Triglycerides, mmol/L	1.97 (1.39–2.80)	1.94 (1.40–2.71)	0.99
Creatinine, µmol/L	73.95 (64.20–83.55)	72.70 (62.98–82.30)	0.1
Moderate physical activity
Yes	137 (34.2)	147 (32.95)	0.81
No	261 (65.2)	294 (65.91)
Alcohol consumption status
Yes	373 (93.2)	405 (90.6)	0.48
No	26 (6.5)	40 (9.1)
Smoking status
Yes	190 (47.5)	174 (39)	0.02
No	207 (51.75)	265 (59.41)

Categorical variables are summarized as frequency (percentage) for each category, and continuous variables are summarized as median (interquartile range). BMI indicates body mass index; and CVD, cardiovascular disease.

*For race, subjects categorized as "Other" are those whose race was not categorized as White, Mixed, Asian, or Black as per the UK Biobank documentation.

Lifestyle Data

Lifestyle variables related to CVD that were analyzed include alcohol consumption, salt intake, and cigarette smoking status. Individuals with daily alcohol consumption >30 g for men and >20 g for women, as well as all patients with any kind of alcoholic disorder or alcoholic liver disease, defined by ICD codes, were excluded. We included in the analysis those patients who reported “yes” for alcohol consumption status. Because of extensive missing data for other diet variables in the subjects with NAFLD, their inclusion in the analysis was not feasible. We also analyzed variables for physical activity, including time spent watching TV and time spent using a computer, which are markers of sedentary lifestyle and are considered risk factors for CVD. Moderate physical activity was binarized into 2 categories: subjects who take part in at least 150 minutes weekly of moderate exercise and those who do <150 minutes weekly. Time spent watching TV and time spent using a computer were each grouped into 3 categories: <1, 1 to 4, and >4 h/d.
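
The two lifestyle encodings described above can be sketched as follows. This is an illustrative sketch only: the function names and category labels are hypothetical, and the handling of values that fall exactly on 1 or 4 h/d is an assumption, because the text does not state it.

```python
def moderate_activity_flag(minutes_per_week):
    """Return 1 for at least 150 minutes of moderate exercise weekly, else 0."""
    return 1 if minutes_per_week >= 150 else 0


def screen_time_category(hours_per_day):
    """Map daily TV or computer hours to one of 3 categories.

    Assumption: exactly 1 h/d and exactly 4 h/d fall in the middle category.
    """
    if hours_per_day < 1:
        return "<1 h/d"
    if hours_per_day <= 4:
        return "1-4 h/d"
    return ">4 h/d"
```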

Genetic Data

From the cohort of 846 subjects with NAFLD, we procured genetic information for chromosomes 1 to 22 from the UK Biobank, yielding 831 samples and 363 381 single‐nucleotide polymorphisms (SNPs) after genetic quality control. On division of the cohort into 70% training and 30% testing, we obtained 585 samples (322 men, 263 women) in the training set and 246 samples (131 men, 115 women) in the test set. We carried out a genome‐wide association study in the training data using 3 principal components (1, 2, and 3), age, sex, body mass index, and systolic blood pressure as covariates and CVD status as outcome. The top 100 SNPs with the smallest P values from the genome‐wide association study results were selected, and these genetic data were used for our genetic domain‐based ML models. SNPs were further identified on the basis of their importance to the CVD prediction.
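
The SNP selection step can be illustrated with a short sketch. It assumes the association results are already available as (SNP identifier, P value) pairs computed on the training set only; the function name and the toy identifiers are hypothetical, and the association testing itself is not shown.

```python
import heapq


def top_snps_by_p_value(gwas_results, k=100):
    """Return the k SNP identifiers with the smallest P values, most significant first.

    gwas_results: iterable of (snp_id, p_value) pairs from the training-set analysis.
    """
    smallest = heapq.nsmallest(k, gwas_results, key=lambda pair: pair[1])
    return [snp_id for snp_id, _ in smallest]


# Toy input with made-up identifiers and P values.
toy_results = [("snp_a", 0.5), ("snp_b", 0.001), ("snp_c", 0.02), ("snp_d", 0.3)]
```

Ranking SNPs on the training set alone, as the text describes, keeps the test set from influencing feature selection.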

Proposed Framework

The novel integrative framework is described in Figure 1, and each step in the learning process is detailed in the flowchart provided in Figure 2. The proposed framework consists of 2 levels of assessment: (1) ML model assessment for each individual domain and (2) integration/ensemble of the best model from each domain into a naive Bayes classifier for final prediction of CVD outcome.
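
The second level of the framework can be sketched as below. The text does not specify what each domain model passes to the naive Bayes classifier, so this sketch assumes each subject is represented by 3 predicted CVD probabilities (one per domain) and that a Gaussian likelihood is used; both are assumptions, and all names and toy values are hypothetical.

```python
import math
from statistics import mean, pvariance


def fit_gaussian_nb(rows, labels, var_floor=1e-6):
    """Fit class priors and per-feature Gaussian parameters.

    rows: list of [p_clinical, p_genetic, p_lifestyle]; labels: list of 0/1.
    """
    model = {}
    for cls in sorted(set(labels)):
        cls_rows = [r for r, y in zip(rows, labels) if y == cls]
        columns = list(zip(*cls_rows))
        model[cls] = {
            "log_prior": math.log(len(cls_rows) / len(rows)),
            "means": [mean(col) for col in columns],
            # Floor the variance so a constant feature cannot cause division by zero.
            "vars": [max(pvariance(col), var_floor) for col in columns],
        }
    return model


def log_gaussian(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def predict_proba_cvd(model, row):
    """Posterior probability of class 1 (CVD) for one subject."""
    scores = {}
    for cls, params in model.items():
        likelihood = sum(
            log_gaussian(x, m, v)
            for x, m, v in zip(row, params["means"], params["vars"])
        )
        scores[cls] = params["log_prior"] + likelihood
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {cls: math.exp(s - top) for cls, s in scores.items()}
    return exp_scores[1] / sum(exp_scores.values())


# Toy per-domain probabilities (made up) for 3 cases and 3 controls.
toy_rows = [[0.9, 0.6, 0.7], [0.8, 0.7, 0.6], [0.7, 0.5, 0.8],
            [0.2, 0.4, 0.3], [0.3, 0.5, 0.2], [0.1, 0.3, 0.4]]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_model = fit_gaussian_nb(toy_rows, toy_labels)
```

In a real pipeline the first‐level probabilities fed to the second level would typically come from held‐out predictions, so that the combiner is not fitted on outputs the domain models have already memorized.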

Figure 1

Multimodal integrative framework for cardiovascular disease prediction from clinical, genetic, and lifestyle data domains among subjects with nonalcoholic fatty liver disease.

Figure 2

Flowchart describing the details of each step of the integrative machine learning (ML) modeling.

GWAS indicates genome‐wide association study.

Statistical Analysis

We considered 7 algorithms covering different classes of ML modeling approaches for the first level of assessment: support vector machines, random forest (RF), neural networks, logistic regression, Lasso regression, ridge regression, and naive Bayes classification. To tune the ML models and select the models with highest accuracy, hyperparameters were determined via grid search. We trained our models on an NVIDIA Tesla P100 GPU with 16 GB of RAM in R version 3.5.3. In RF training, a maximum of 500 trees and 3 node‐wise predictors sampled for splitting were set. The support vector machine was trained with a linear kernel and a regularization term of 10. The Lasso and ridge regression models were trained by iteratively fitting the penalty term (L1 for Lasso, L2 for ridge) and its weight λ. In the neural network model, the learning rate was tuned to achieve the lowest loss and highest accuracy. Missing data were imputed through chained equations, and variables were then standardized and normalized. A total of 70% of the subjects were part of the training set, and the remaining 30% were part of the test set. The 10‐times, 10‐fold cross‐validation was performed on the training set to tune parameters. Performance was evaluated through the mean area under the curve (AUC), calculated from the receiver operating characteristic (ROC) curve. Bootstrapping was performed on the test set to calculate 95% CIs of the AUC values.
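
The evaluation described above (AUC with a bootstrapped 95% CI on the test set) can be sketched as follows. The text does not say which bootstrap interval was used; this sketch assumes a simple percentile bootstrap, and the function names and default number of resamples are illustrative.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties count as 0.5. Requires at least one case (1) and one control (0).
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (an assumed choice of interval)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_labels = [labels[i] for i in idx]
        # Skip resamples that contain only one class; AUC is undefined there.
        if 0 < sum(resampled_labels) < n:
            stats.append(auc(resampled_labels, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

The pairwise definition used here is equivalent to the area under the ROC curve and avoids building the curve explicitly, at the cost of quadratic time in the number of subjects.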

Results

Characteristics of the Study Population

The study flowchart is shown in Figure 3. PDFF and patient meta‐data were obtained through UK Biobank access application number 53976. PDFF was successfully calculated from 4617 MRI samples, among whom 1011 individuals had an MRI‐PDFF ≥5. After the exclusion of other liver diseases and alcohol consumption above the threshold, as mentioned before, a total of 846 were considered to have NAFLD. Subjects were further classified according to the presence of CVD. Cases were composed of patients with NAFLD and CVD, and controls were composed of patients with NAFLD and without CVD. A total of 400 cases were diagnosed with CVD compared with 446 controls, defined as patients with NAFLD with no CVD. A total of 194 cases had subclinical CVD detected through CIMT, and 285 cases had CVD detected through disease codes specified in the UK Biobank, with 79 subjects common between both. Patient characteristics and significant variables are presented in Table 1. The full distribution of CVD among NAFLD cases is presented in Table S2. The complete summary of variables is presented in Table S3.
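
The case counts above are consistent under inclusion‐exclusion. Writing A for the cases detected through CIMT and B for the cases detected through disease codes (symbols introduced here only for illustration):

```latex
|A \cup B| = |A| + |B| - |A \cap B| = 194 + 285 - 79 = 400,
\qquad 400 \text{ cases} + 446 \text{ controls} = 846 \text{ subjects with NAFLD}
```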

Figure 3

Flowchart illustrating stepwise study design to categorize subjects with nonalcoholic fatty liver disease (NAFLD) who develop cardiovascular disease.

PDFF indicates proton density fat fraction.

Comparison of Predictive Performance for CVD

To test the robustness and generalizability of the ML models, 10‐times, 10‐fold cross‐validation was performed on the clinical variables: in each repeat, the training set was partitioned into 10 folds, with 1 fold held out for evaluation and the remaining 9 folds used for training, as illustrated in Figure S1. For the test set of the cohort, the ROC curves obtained are presented in Figure 4, wherein the orange plot line depicts the ROC curve for RF with an AUC of 0.799, followed in performance by Lasso regression (AUC = 0.753). The DeLong test (pROC package in R), used to compare the AUC of the best‐performing RF model against the remaining 6 approaches, gave significant P values of 0.05 in comparison with Lasso, 0.02 in comparison with ridge, 0.02 in comparison with support vector machine, 0.02 in comparison with naive Bayes, 0.01 in comparison with logistic regression, and 0.005 in comparison with the neural network model.
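
The repeated cross‐validation scheme can be sketched as an index generator. This is a minimal, unstratified illustration with hypothetical names; whether the folds in the study were stratified by CVD status is not stated, so stratification is omitted here as a simplification.

```python
import random


def repeated_kfold_indices(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, validation_idx) for repeated k-fold CV.

    Each repeat reshuffles the sample order and splits it into n_folds folds;
    every fold serves once as the validation set.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::n_folds] for i in range(n_folds)]
        for fold, validation_idx in enumerate(folds):
            held_out = set(validation_idx)
            train_idx = [i for i in order if i not in held_out]
            yield repeat, fold, train_idx, validation_idx
```

With 10 repeats of 10 folds, each candidate hyperparameter setting is scored 100 times, and the mean score is what a grid search would compare.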

Figure 4

Receiver operating characteristics curve obtained on the test set of the cardiovascular disease cohort using the clinical variables.

The test set was composed of 153 controls and 145 cases. The gray dotted line corresponds to area under the curve (AUC) equal to 0.5, indicating a random classification model. NB indicates naïve Bayes; NN, neural network; RF, random forest; and SVM, support vector machine.

In the genetic data, we conducted ML modeling using the 7 ML models and observed AUC values for the RF model to be higher than those of the other comparative methods (see Table 2, column 5). A total of 26 SNPs in the IL16 gene on chromosome 15 and 6 SNPs in the ANKLE1 gene on chromosome 19 were identified as important to the CVD outcome. Furthermore, SNP rs4531696 in the IL16 gene (P value of 0.012) and SNP rs891017 in the ANKLE1 gene (P value of 0.009) were selected and adjusted for in the clinical data. Integrating these covariates in the clinical data improved the performance of prediction, increasing AUC from 0.799 to 0.820 on the test set, as shown in the orange plot line in Figure 5. The DeLong test, comparing the AUC difference between the model with only clinical data and the model with both clinical and genetic data, gave a P value of 0.03.

As tabulated in Table 2, in the lifestyle variables, the RF method performed the best with an AUC of 0.652, followed by ridge (AUC = 0.633), Lasso (AUC = 0.632), support vector machine (AUC = 0.612), logistic regression (AUC = 0.610), naive Bayes (AUC = 0.591), and neural network (AUC = 0.585). However, AUC values were lower in this domain, which we attribute to the smaller number of lifestyle variables that were feasible to include. Table 2 illustrates mean AUC and 95% CIs across all individual domains for the 7 ML approaches.

Table 2

Performance of the 7 Models in Each Individual Domain for Both the Training and Test Data Sets

Methods	Clinical, training	Clinical, test	Genetic, training	Genetic, test	Lifestyle, training	Lifestyle, test
Random forest	0.810 (0.788–0.828)	0.799 (0.779–0.817)	0.624 (0.605–0.643)	0.617 (0.599–0.637)	0.673 (0.655–0.687)	0.652 (0.639–0.665)
Lasso	0.760 (0.742–0.779)	0.747 (0.727–0.765)	0.610 (0.591–0.625)	0.602 (0.585–0.620)	0.649 (0.631–0.665)	0.632 (0.619–0.645)
Ridge	0.762 (0.749–0.781)	0.753 (0.731–0.777)	0.611 (0.593–0.629)	0.605 (0.581–0.628)	0.645 (0.630–0.657)	0.633 (0.617–0.643)
Naïve Bayes	0.764 (0.748–0.788)	0.744 (0.729–0.761)	0.573 (0.552–0.591)	0.564 (0.549–0.581)	0.603 (0.586–0.618)	0.591 (0.578–0.604)
SVM	0.759 (0.747–0.776)	0.743 (0.731–0.759)	0.611 (0.593–0.634)	0.603 (0.585–0.620)	0.620 (0.608–0.633)	0.612 (0.599–0.625)
Logistic regression	0.743 (0.724–0.759)	0.740 (0.722–0.757)	0.579 (0.550–0.595)	0.571 (0.550–0.588)	0.619 (0.602–0.635)	0.610 (0.597–0.623)
Neural network	0.734 (0.718–0.751)	0.728 (0.708–0.745)	0.600 (0.582–0.619)	0.592 (0.575–0.610)	0.624 (0.600–0.639)	0.612 (0.595–0.628)

Data are given as area under the curve (95% CI). The top row shows that random forest method performed the best in each of the domains for predicting cardiovascular outcome in subjects with nonalcoholic fatty liver disease. SVM indicates support vector machine.

Figure 5

Receiver operating characteristics curves comparing the performance enhancement observed by integrating domains relevant to the cardiovascular disease outcome.

AUC indicates area under the curve.

In the final integrative modeling, we first selected the best‐performing models in the individual domains (clinical, lifestyle, and genetic). We then experimented with a few common ensemble methods for integrating the domains, such as bagging (bootstrap aggregating), RF, AdaBoost, and the naïve Bayes classifier; as illustrated in Table S4, the performance of naïve Bayes was better than that of the other ensemble methods. Therefore, we determined the performance of the final predictive modeling that combines the 3 domains through integration using the naive Bayes ensemble. The ROC plot comparing the performance of the integrative modeling is illustrated in Figure 5. The black plot line with an AUC of 0.849 shows the performance edge that the integrative modeling has compared with the prediction through individual domains. We observed that the classification improved considerably from AUCs of 0.799 (95% CI, 0.779–0.817), 0.652 (95% CI, 0.639–0.665), and 0.617 (95% CI, 0.599–0.637) using the clinical, lifestyle, and genetic data domains individually, respectively, to 0.849 (95% CI, 0.840–0.855) using the integrated model. The sensitivity and the specificity of the integrated model on the test data set were 71.4% and 84.2%, respectively, using the Youden index, with positive and negative predictive values of 80.3% and 76.7%, respectively, showing our model’s efficiency during classification. The DeLong test for comparison of AUC differences, with Bonferroni correction, gave a P value of 0.03 for the ML model on the clinical and genetic domains versus the ML model on the clinical domain alone, and a significant P value of 0.009 for the integrative modeling on the clinical, genetic, and lifestyle data domains versus the clinical domain alone.

We also did a subgroup analysis for both clinical and subclinical CVD (determined by CIMT) and observed higher AUCs (≈11% increase) in the clinical CVD group compared with the subgroup determined by CIMT threshold, as tabulated in Table S5.
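
Selecting an operating point with the Youden index, as used for the sensitivity and specificity reported above, can be sketched as follows. The function names and the toy scores are hypothetical; the sketch assumes a subject is called positive when the predicted score is at or above the threshold.

```python
def confusion_at(labels, scores, threshold):
    """Return (tp, fp, tn, fn), calling a subject positive when score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp, fp, tn, fn


def youden_optimal(labels, scores):
    """Return (threshold, metrics) maximizing J = sensitivity + specificity - 1.

    Requires at least one case and one control. On ties, the lowest threshold wins.
    """
    best = None
    for threshold in sorted(set(scores)):
        tp, fp, tn, fn = confusion_at(labels, scores, threshold)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        j = sensitivity + specificity - 1
        if best is None or j > best[0]:
            metrics = {
                "sensitivity": sensitivity,
                "specificity": specificity,
                "ppv": tp / (tp + fp) if tp + fp else float("nan"),
                "npv": tn / (tn + fn) if tn + fn else float("nan"),
            }
            best = (j, threshold, metrics)
    return best[1], best[2]


# Toy labels and scores (made up) for 4 controls and 4 cases.
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.8, 0.9]
```

Unlike sensitivity and specificity, the predictive values depend on the case fraction of the evaluated set, so they would shift in a population with a different CVD prevalence.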

Variable Importance

Figure 6 illustrates an importance plot for the clinical data variables ranked according to their contribution to the predictions through the RF model. The most important variables were age, followed by systolic blood pressure, diastolic blood pressure, and waist circumference. Red blood cell size distribution and diabetes were also significant to CVD prediction, although to a lesser extent. We also took into account the influence of medication on the CVD prediction by categorizing subjects into 3 groups according to their medication use: cholesterol‐lowering medication versus blood pressure medication versus others. As tabulated in Table S6, the prediction performance in terms of AUC was comparable with and without inclusion of medication information in our analysis.

Figure 6

Variable importance plot, demonstrating the importance of clinical variables obtained through the machine learning modeling on the clinical data.

ALT indicates alanine aminotransferase; AST, aspartate aminotransferase; BMI, body mass index; GGT, γ‐glutamyl transferase; HbA1c, hemoglobin A1c; HDL, high‐density lipoprotein; hs‐CRP, high‐sensitivity C‐reactive protein; LDL, low‐density lipoprotein; and PDFF, proton density fat fraction.

The important variables, as observed through the RF modeling, were in concordance with the univariate analysis, as tabulated in Table S7, showing that the RF model can capture the essential clinical data variables in CVD prediction. A closer look at the RF tree gave a set of tree‐based rules and thresholds used to classify subjects into risk of CVD versus no risk. An example tree illustrating such rules is shown in Figure 7. The full tree obtained can be traversed by a computational tool to evaluate the risk of CVD in the subjects based on various clinical/lifestyle parameters and to aid the clinician in screening subjects with high risk of CVD.
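
Traversing a tree of threshold rules can be sketched as below. The tree structure, feature names, and split values in this example are hypothetical placeholders chosen only to make the code runnable; they are not the rules shown in Figure 7, and a random forest would combine the votes of many such trees rather than use a single one.

```python
# Internal node: (feature, threshold, subtree_if_at_most, subtree_if_greater).
# Leaf: 0 (no risk of CVD) or 1 (risk of CVD).
# HYPOTHETICAL structure and thresholds, for illustration only.
EXAMPLE_TREE = (
    "age", 59,
    ("systolic_bp", 145, 0, 1),
    ("waist_cm", 98, 0, 1),
)


def classify(tree, subject):
    """Walk from the root to a leaf, going left when the value is <= the threshold."""
    node = tree
    while not isinstance(node, int):
        feature, threshold, left, right = node
        node = left if subject[feature] <= threshold else right
    return node
```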

Figure 7

Example tree illustrating some set of rules and thresholds from the dense random forest tree used for classification in the analysis.

The “0” and “1” on the leaf node represent no risk of cardiovascular disease (CVD) and risk of CVD in the subject, respectively. BMI indicates body mass index; N, no; and Y, yes.

To assess the important variables in the lifestyle data, we plotted a similar variable importance plot, presented in Figure S2. Time spent watching TV was the most significant variable, followed by salt intake and smoking status.

Discussion

We have established an integrated ML model that accurately identifies individuals with CVD in the setting of NAFLD using the UK Biobank database. Our model integrated clinical, lifestyle, and genetic parameters of patients with NAFLD to identify those with CVD, the most common and fatal complication of NAFLD, with a high AUC of 0.849 (71.4% sensitivity and 84.2% specificity). This reflects the fact that NAFLD is a complex entity integrating environmental and genetic factors that influence each other in a reciprocal manner. This algorithm could be used in practice to flag those patients with early NAFLD at high risk of CVD. As such, our model delineates those patients with early NAFLD along with age >59 years, hypertension, and high waist circumference with a sedentary lifestyle and specific SNPs as being high risk for CVD. Therefore, our model goes beyond the current literature, by identifying patients with early NAFLD at risk for CVD. Metabolic‐associated fatty liver disease was recently suggested to better define fatty liver disease and metabolic dysfunction, rather than using the term “nonalcoholic.” Metabolic‐associated fatty liver disease reflects a heterogeneous phenotype that is influenced by multiple factors, including age, sex, hormonal status, ethnicity, diet, alcohol intake, smoking, genetic predisposition, the microbiota, and metabolic status, which all interact with each other in a reciprocal manner and reflect the fact that modifying these factors may ultimate influence the disease course and future complications. Similarly, in our study, we showed that when integrating clinical and genetic data, outcome prediction (in this case, CVD) is better than analyzing each risk factor separately. A median age >59 years (95% CI, 54.00–63.00 years) was the strongest predictor for CVD among the clinical data parameters in our model, indicating its importance as a strong risk factor for CVD in the population with NAFLD. 
Age is a major and well‐established risk factor for CVD, exposing an individual to metabolic and environmental risk factors for a longer duration.
Hypertension, with a systolic blood pressure ≥145 mm Hg (95% CI, 134.0–156.0 mm Hg) and a diastolic blood pressure ≥85 mm Hg (95% CI, 79.00–93.00 mm Hg), and waist circumference >98 cm (95% CI, 91.00–105.00 cm) were the next strongest variables in our model. Hypertension has been established as the strongest risk factor for CVD. Furthermore, there is a gradual increase in coronary artery calcium, and the risk for CVD progression increases alongside increases in systolic blood pressure values. Waist circumference, which represents abdominal adiposity, is strongly associated with cardiovascular mortality to a much larger extent than body mass index alone.
Diabetes and triglycerides were also significant contributors to the model, although to a lesser extent. In our total cohort of individuals diagnosed with NAFLD, 24% were diagnosed with type 2 diabetes and 30% of those with CVD had diabetes. Markers of inflammation, such as hs‐CRP, high‐density lipoprotein, albumin, arterial stiffness, and visceral adipose tissue volume, had only a modest contribution to the model. Most of the patients in our cohort had normal levels of both alanine aminotransferase (median, 26.22 U/L; interquartile range, 19.70–35.41 U/L) and aspartate aminotransferase (median, 26.00 U/L; interquartile range, 22.32–31.20 U/L), suggesting, although not proving, that our cohort mostly had simple steatosis rather than nonalcoholic steatohepatitis.
From the lifestyle parameters, time spent in front of the TV was associated with the highest risk for developing CVD among patients with NAFLD, with an AUC of 0.65 in the ML model, followed by salt intake, smoking, and physical activity.
We considered time spent in front of the TV as a marker of sedentary lifestyle, recently recognized as an independent cardiovascular risk factor, consistent with previous studies showing a clear association between sedentary lifestyle and NAFLD mainly secondary to obesity.
Several studies have aimed to link certain genes to CVD in the population with NAFLD. For instance, variants in PNPLA3 and TM6SF2 might decrease the risk of, and possibly protect from, CVD, whereas variants in GCKR may be associated with increased CVD risk. However, despite robust investigations, studies have failed to identify specific genetic components that link NAFLD to CVD in terms of causality. We observed 2 SNPs in the IL16 and ANKLE1 genes as highly associated with CVD. Studies have shown a significant correlation between increased levels of IL16, body mass index, and waist circumference. Furthermore, IL16 mRNA is reflective of the inflammatory process in individuals with overweight/obesity, which is strongly associated with NAFLD and CVD. ANKLE1 is expressed in human hematopoietic tissues, and its SNPs have previously been associated with genomic instability in colorectal and breast cancer, but not with CVD to date.
Some of the limitations of our study include the inability to stratify our risk prediction into a 5‐ or 10‐year risk timeline because of variability in the duration from baseline to the time of clinical or subclinical CVD in the cohort. However, with ML modeling, we could still identify those patients with NAFLD at risk of CVD, with a high AUC of 0.849. In the future, we will explore studies with data focused on longer follow‐up, to provide a better insight into predicting CVD in specific time frames. Also, the data in our study were taken at a single point of time, and longitudinal assessment for patterns could not be performed; however, the data were representative enough to offer sound observations.
In the future, we would explore cohorts with follow‐up data to establish a longitudinal prediction model for CVD.
Another limitation of our data was the small number of cases diagnosed with NAFLD compared with the large number of participants in the UK Biobank, as the diagnosis was based on MRI‐PDFF rather than on abnormal biochemistry or ultrasound findings. MRI‐PDFF is the gold standard for quantifying hepatic fat; however, the number of patients whose MRI‐PDFF data were available for analysis was rather small compared with the total number of participants in the UK Biobank. Nevertheless, we observed that the number of subjects in the study was sufficient for our ML analysis to yield good performance (AUC = 0.849) in classifying subjects at risk of CVD and to justify the use of integrative ML modeling for CVD prediction. Also, because most of the patients had normal liver enzymes, Fibrosis‐4 (FIB‐4) and NAFLD fibrosis scores, which have been validated to assess the degree of fibrosis, were not calculated; we therefore assume that the degree of fibrosis was low in most of the cohort.
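For reference, the FIB‐4 index mentioned above is conventionally computed as age × AST / (platelet count × √ALT). The following Python sketch is illustrative only: FIB‐4 was not calculated in this study, the platelet count of 250 × 10^9/L is a hypothetical value that is not reported in this article, and the age, AST, and ALT inputs are simply the cohort medians quoted in this Discussion rather than measurements from any single patient.

```python
import math


def fib4(age_years, ast_u_l, alt_u_l, platelets_1e9_l):
    """Fibrosis-4 index: age * AST / (platelets * sqrt(ALT)).

    age in years, AST and ALT in U/L, platelet count in 10^9/L.
    """
    if alt_u_l <= 0 or platelets_1e9_l <= 0:
        raise ValueError("ALT and platelet count must be positive")
    return (age_years * ast_u_l) / (platelets_1e9_l * math.sqrt(alt_u_l))


# Age, AST, and ALT are the cohort medians quoted in the text.
# The platelet count is a HYPOTHETICAL placeholder, not a study value.
score = fib4(age_years=59, ast_u_l=26.00, alt_u_l=26.22, platelets_1e9_l=250)
print(f"Illustrative FIB-4: {score:.2f}")
```

Combining cohort medians in this way does not describe any real patient; the sketch only shows how the index depends on its four inputs.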

NAFLD‐Related CVD Risk Stratification

There are currently no guidelines for CVD screening in patients with NAFLD. Global cardiovascular risk assessment scores available for the general population, including the Framingham risk score, the atherosclerotic CVD risk score, and others, use multiple traditional cardiovascular risk factors for risk assessment in all asymptomatic adults without a clinical history of CVD. NAFLD, however, is not included in these scores. Other cardiovascular risk scores have been suggested in NAFLD, such as the coronary artery calcium scores, Leaman scores, and one based on age, mean platelet volume, and diabetes. None of these scores is yet validated, and it is uncertain to which patients with NAFLD they should be applied. As CVD risk increases in concordance with NAFLD severity, it is reasonable to screen high‐risk groups with obesity and diabetes.
As shown in our study, it may be worth screening those individuals with early NAFLD who present with certain clinical and genetic risk factors and could potentially benefit from early interventions that would prevent cardiovascular complications. The clinical and lifestyle variables that were included in the model can be easily and routinely collected during clinic visits. Moreover, the most significant variables identified in the model are those that can be retrieved by the general practitioner in the community setting and hence flag patients at risk as early as possible. On the other hand, genetic testing is not performed in patients with NAFLD as part of routine clinical care. It is still relatively new, and its utility would need to be clearly demonstrated for implementation, particularly in a public health care framework. Our study demonstrates that genetic data are additive to clinical and lifestyle data in predicting CVD among individuals with NAFLD.
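To make the screening implication concrete, the reported sensitivity (71.4%) and specificity (84.2%) can be converted into positive and negative predictive values with Bayes' rule. The sketch below is an illustration and not part of the study's analysis: the 10% and 25% prevalences are hypothetical, and the third value (400/846) is the case fraction of this study cohort, which need not match the prevalence of CVD among patients with NAFLD seen in a community clinic.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test, using Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


SENSITIVITY = 0.714  # reported in the Discussion
SPECIFICITY = 0.842  # reported in the Discussion

# 0.10 and 0.25 are HYPOTHETICAL prevalences; 400/846 is the case
# fraction of the study cohort (400 cases, 446 controls).
for prevalence in (0.10, 0.25, 400 / 846):
    ppv, npv = predictive_values(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence={prevalence:.3f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

The lower the prevalence in the screened population, the lower the positive predictive value at fixed sensitivity and specificity, which is one reason external validation in the intended setting would matter before such a flag is used in practice.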
In conclusion, our ML model integrates important clinical, lifestyle, and genetic risk factors to efficiently identify CVD in the population with early NAFLD, thereby flagging those patients who will derive the greatest benefit from CVD screening and treatment of metabolic risk factors. This has the potential to help reduce the morbidity and mortality associated with CVD, the most common complication of NAFLD.

Sources of Funding

This work was funded by Toronto General and Western Hospital Foundation. None of the authors of this article has a financial or personal relationship with other people or organizations that could inappropriately influence the content of the article.

Disclosures

Dr Farkouh received research grant support from Amgen, Novartis, and Novo Nordisk. The remaining authors have no disclosures to report.
Supplemental material: Tables S1–S7 and Figures S1–S2.
