Juan Zhao, QiPing Feng, Patrick Wu, Roxana A Lupu, Russell A Wilke, Quinn S Wells, Joshua C Denny, Wei-Qi Wei.
Abstract
Current approaches to predicting a cardiovascular disease (CVD) event rely on conventional risk factors and cross-sectional data. In this study, we applied machine learning and deep learning models to 10-year CVD event prediction by using longitudinal electronic health record (EHR) and genetic data. Our study cohort included 109,490 individuals. In the first experiment, we extracted aggregated and longitudinal features from the EHR. We applied logistic regression, random forests, gradient boosting trees, convolutional neural networks (CNN) and recurrent neural networks with long short-term memory (LSTM) units. In the second experiment, we applied a late-fusion approach to incorporate genetic features. We compared the performance with the approach currently used in routine clinical practice, the American College of Cardiology and American Heart Association (ACC/AHA) Pooled Cohort Risk Equation. Our results indicated that incorporating longitudinal features led to better event prediction. Combining genetic features through a late-fusion approach can further improve CVD prediction, underscoring the importance of integrating relevant genetic data whenever available.
Year: 2019 PMID: 30679510 PMCID: PMC6345960 DOI: 10.1038/s41598-018-36745-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. AUROC and AUPRC of the gold standard and the machine learning/deep learning models for predicting 10-year CVD risk under 10-fold cross-validation in Experiment I. The mean values of the AUROC and AUPRC and the standard error are provided in Supplementary Table 1.
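As a rough illustration of the two metrics reported in Figure 1, the sketch below computes AUROC and AUPRC (here approximated as average precision) from predicted scores in plain Python, then summarizes per-fold values as a mean and standard error. The function names and toy numbers are hypothetical and are not taken from the study; the authors' evaluation code is not shown in this record.

```python
from math import sqrt
from statistics import mean, stdev


def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive case."""
    ranked = sorted(zip(scores, y_true), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def mean_and_se(fold_values):
    """Mean and standard error of a metric across cross-validation folds."""
    return mean(fold_values), stdev(fold_values) / sqrt(len(fold_values))


# Toy labels and scores (assumed for illustration only).
y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
print(auroc(y, s), average_precision(y, s))
print(mean_and_se([0.7, 0.8, 0.9]))
```

AUPRC is usually the more informative of the two when events are rare, because it is computed only from the ranking of positive cases against everything scored above them.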
Top 10 features for machine learning prediction in descending order of coefficients or feature importance returned by RF and GBT.
| LR with aggregate features | RF with aggregate features | GBT with aggregate features | LR with longitudinal features | RF with longitudinal features | GBT with longitudinal features |
|---|---|---|---|---|---|
| EHR length | EHR length | Age | EHR length | EHR length | Age |
| Max LDL-C | Age | EHR length | Age | Age | EHR length |
| Min Creatinine | Max BMI | SD Creatinine | SD Glucose in 2000 | Aspirin in 2006 | Smoking |
| Age | Min BMI | Smoking | SD Creatinine in 2000 | Max SBP in 2006 | Heart valve disorders in 2006 |
| Max HDL-C | Median BMI | Min BMI | Max HDL-C in 2005 | Min BMI in 2006 | Hypertension in 2006 |
| Max BMI | Max SBP | Heart valve disorders (Phecode 395) | SD Glucose in 2006 | Median BMI in 2005 | Aspirin in 2006 |
| Max Total Cholesterol | Median SBP | Min Glucose | Median LDL-C in 2006 | Median SBP in 2006 | Disorders of lipoid metabolism in 2006 |
| Max DBP | SD BMI | Max SBP | Median BMI in 2006 | Max BMI in 2006 | Clopidogrel in 2006 |
| Median Triglycerides | Min SBP | Max Triglycerides | Median Total Cholesterol in 2006 | Min BMI in 2001 | Max SBP in 2006 |
| Min Cholesterol | Max DBP | Aspirin | Heart valve disorders in 2006 | Min BMI in 2002 | SD Glucose in 2006 |
LDL-C: LDL cholesterol; HDL-C: HDL cholesterol; SBP: systolic blood pressure; DBP: diastolic blood pressure; BMI: body mass index; SD: standard deviation.
Figure 2. AUROC and AUPRC of the gold standard, the GBT model on EHR features, and late fusion of EHR and genetic features for predicting 10-year CVD risk over 50 iterations in Experiment II. The mean values of the AUROC and AUPRC and the standard error are provided in Supplementary Table 2.
Top 10 features in pre-trained model 2 with genetic data. Features were ranked according to descending order of absolute value of coefficient effect size.
| Features | Reference gene | Coefficient |
|---|---|---|
| Age | — | 0.747 |
| rs17465637 | | −0.334 |
| rs67180937 | | 0.301 |
| Gender | — | −0.270 |
| EHR length | — | 0.180 |
| rs7568458 | | 0.103 |
| rs4977574 | | 0.095 |
| rs10455872 | | 0.093 |
| rs1412444 | | 0.092 |
| rs501120 | | −0.079 |
We chose the result from the iteration whose AUROC and AUPRC were closest to the averages across iterations.
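The ranking rule stated in the table caption (descending absolute value of the coefficient) can be reproduced in a few lines. The sketch below uses the coefficient values from the table above; the variable and function names are illustrative only.

```python
# Coefficients copied from the table above (feature -> coefficient).
COEFFICIENTS = {
    "rs501120": -0.079,
    "Age": 0.747,
    "rs4977574": 0.095,
    "Gender": -0.270,
    "rs17465637": -0.334,
    "EHR length": 0.180,
    "rs1412444": 0.092,
    "rs67180937": 0.301,
    "rs10455872": 0.093,
    "rs7568458": 0.103,
}


def rank_by_effect_size(coefficients):
    """Sort features by |coefficient|, largest first; the sign is kept for display."""
    return sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)


for name, coef in rank_by_effect_size(COEFFICIENTS):
    print(f"{name:12s} {coef:+.3f}")
```

Ranking on the absolute value keeps protective features (negative coefficients) alongside risk-increasing ones, which is why rs17465637 and Gender appear near the top despite their negative signs.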
Features included in the machine-learning models.
| Feature type | Features | Values |
|---|---|---|
| Demographic | Age* | Continuous |
| | Gender* | Binary |
| | Race | Categorical |
| Life styles | Body mass index (BMI) | Summarized data† |
| | Smoking* | Binary |
| Physical or lab measurements | Systolic blood pressure (SBP)* | Summarized data† |
| | Diastolic blood pressure (DBP)* | Summarized data† |
| | Total Cholesterol (Cholesterol)* | Summarized data† |
| | HDL Cholesterol (HDL-C)* | Summarized data† |
| | LDL Cholesterol (LDL-C) | Summarized data† |
| | Creatinine | Summarized data† |
| | Glucose | Summarized data† |
| | Triglyceride | Summarized data† |
| Diagnosis | Other tests (phecode[ | Binary |
| | Benign neoplasm of skin (216) | Binary |
| | Diabetes mellitus* (250) | Binary |
| | Disorders of lipoid metabolism (272) | Binary |
| | Other mental disorder, random mental disorder (306) | Binary |
| | Heart valve disorders (395) | Binary |
| | Hypertension (401) | Binary |
| | Cardiomyopathy (425) | Binary |
| | Congestive heart failure; nonhypertensive (428) | Binary |
| | Atherosclerosis (440) | Binary |
| | Acute upper respiratory infections of multiple or unspecified sites (465) | Binary |
| | Chronic airway obstruction (496) | Binary |
| | Disorders of menstruation and other abnormal bleeding from female genital tract (626) | Binary |
| Medication | Warfarin (RXCUI 11289) | Binary |
| | Aspirin (1191) | Binary |
| | Atenolol (1202) | Binary |
| | Amlodipine (17767) | Binary |
| | Carvedilol (20352) | Binary |
| | Lisinopril (29046) | Binary |
| | Adenosine (296) | Binary |
| | Clopidogrel (32968) | Binary |
| | Digoxin (3407) | Binary |
| | Diltiazem (3443) | Binary |
| | Ramipril (35296) | Binary |
| | Diuretics (3567) | Binary |
| | Dobutamine (3616) | Binary |
| | Simvastatin (36567) | Binary |
| | Enalapril (3827) | Binary |
| | Sestamibi (408081) | Binary |
| | Ethinyl Estradiol (4124) | Binary |
| | Furosemide (4603) | Binary |
| | Nitroglycerin (4917) | Binary |
| | Hydrochlorothiazide (5487) | Binary |
| | Ibuprofen (5640) | Binary |
| | Metoprolol (6918) | Binary |
| | Acellular pertussis vaccine (798302) | Binary |
| | Atorvastatin (83367) | Binary |
| | ACE inhibitors (836) | Binary |
| | Thallium (1311633) | Binary |
| | Clonidine (2599) | Binary |
| Genetic | 204 SNPs# | Categorical |
| Others | EHR length | Continuous |
*Features in ACC/AHA Equations.
†Summarized data includes minimum, maximum, median and SD within a time window.
#204 SNPs are listed in the Supplementary Data.
Figure 3. Study design of Experiment I. The figure illustrates how we defined the observation and prediction windows. It also shows how we modeled longitudinal EHR features: i) we aggregated each feature across the 7-year observation window (e.g. median, max, min and SD of HDL); ii) we extracted the yearly values of each feature and concatenated the temporal values from all patients into a two-dimensional matrix for a classifier (e.g. LR, RF, GBT); iii) we constructed a tensor representation of the temporal values from all patients for the CNN and LSTM.
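The three representations described in the Figure 3 caption can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record layout, the helper names, the use of the yearly median as the single per-year value, and the `0.0` placeholder for years without measurements are all hypothetical choices made for brevity, and a 3-year window stands in for the 7-year one.

```python
from statistics import median, stdev

# Hypothetical toy records: (patient_id, year, feature_name, value).
RECORDS = [
    ("p1", 2000, "HDL", 40.0), ("p1", 2000, "HDL", 44.0),
    ("p1", 2001, "HDL", 50.0), ("p1", 2002, "HDL", 46.0),
    ("p2", 2000, "HDL", 60.0), ("p2", 2002, "HDL", 58.0),
]
YEARS = [2000, 2001, 2002]  # shortened stand-in for the observation window


def aggregate_features(records, patient, feature):
    """(i) Summarize one feature over the whole observation window."""
    vals = [v for p, _, f, v in records if p == patient and f == feature]
    sd = stdev(vals) if len(vals) > 1 else 0.0
    return {"min": min(vals), "max": max(vals), "median": median(vals), "sd": sd}


def yearly_features(records, patient, feature, years, missing=0.0):
    """One value per year (assumed here: the yearly median); `missing` fills empty years."""
    series = []
    for year in years:
        vals = [v for p, y, f, v in records if p == patient and y == year and f == feature]
        series.append(median(vals) if vals else missing)
    return series


def flat_row(records, patient, features, years):
    """(ii) Concatenate all yearly series into one row of a 2-D matrix (for LR, RF, GBT)."""
    return [v for f in features for v in yearly_features(records, patient, f, years)]


def build_tensor(records, patients, features, years):
    """(iii) Nested list indexed [patient][year][feature] (for a CNN or LSTM)."""
    tensor = []
    for p in patients:
        per_feature = [yearly_features(records, p, f, years) for f in features]
        tensor.append([[series[t] for series in per_feature] for t in range(len(years))])
    return tensor


print(aggregate_features(RECORDS, "p1", "HDL"))
print(flat_row(RECORDS, "p2", ["HDL"], YEARS))
print(build_tensor(RECORDS, ["p1", "p2"], ["HDL"], YEARS))
```

The flat row and the tensor carry the same numbers; the difference is that the tensor keeps the time axis explicit, which is what lets a convolutional or recurrent model exploit the ordering of the years.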
Figure 4. Flowchart of cohort selection for the late-fusion approach.
Figure 5. Framework for the proposed late-fusion approach to combine the genetic features with longitudinal EHR features.
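One common form of late fusion is to train a model on one modality first and then feed its output score, together with the second modality's features, into a second model. The sketch below illustrates that idea with a small hand-written logistic regression in both stages. It is an assumption-laden illustration rather than the framework in Figure 5: this record does not detail the fusion architecture, the study's EHR model is a GBT rather than logistic regression, and the toy data, feature choices, and function names are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Logistic regression fitted by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical toy cohort: two scaled EHR features and one SNP risk-allele count.
EHR_X = [[0.2, 0.1], [0.3, 0.2], [0.8, 0.9], [0.9, 0.7], [0.4, 0.3], [0.7, 0.8]]
SNP_X = [[0], [1], [2], [1], [0], [2]]
Y = [0, 0, 1, 1, 0, 1]

# Stage 1: model trained on EHR features only (stand-in for the EHR model).
ehr_model = fit_logistic(EHR_X, Y)
ehr_scores = [predict_proba(ehr_model, x) for x in EHR_X]

# Stage 2: fuse the stage-1 score with the genetic features in a second model.
fused_X = [[score] + snps for score, snps in zip(ehr_scores, SNP_X)]
fusion_model = fit_logistic(fused_X, Y)
fused_scores = [predict_proba(fusion_model, x) for x in fused_X]
print([round(p, 3) for p in fused_scores])
```

In practice the first-stage model would be trained on data separate from the patients used to fit the fusion stage, so that the fused model does not learn from overfitted first-stage scores; the toy example reuses one cohort only to stay short.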