
Comparative Effectiveness of Machine Learning Approaches for Predicting Gastrointestinal Bleeds in Patients Receiving Antithrombotic Treatment.

Jeph Herrin1, Neena S Abraham2,3,4, Xiaoxi Yao3,4,5, Peter A Noseworthy3,4,5, Jonathan Inselman3,4, Nilay D Shah3,4,6, Che Ngufor3,4,7.   

Abstract

Importance: Anticipating the risk of gastrointestinal bleeding (GIB) when initiating antithrombotic treatment (oral antiplatelets or anticoagulants) is limited by existing risk prediction models. Machine learning algorithms may result in superior predictive models to aid in clinical decision-making.

Objective: To compare the performance of 3 machine learning approaches with the commonly used HAS-BLED (hypertension, abnormal kidney and liver function, stroke, bleeding, labile international normalized ratio, older age, and drug or alcohol use) risk score in predicting antithrombotic-related GIB.

Design, Setting, and Participants: This retrospective cross-sectional study used data from the OptumLabs Data Warehouse, which contains medical and pharmacy claims on privately insured patients and Medicare Advantage enrollees in the US. The study cohort included patients 18 years or older with a history of atrial fibrillation, ischemic heart disease, or venous thromboembolism who were prescribed oral anticoagulant and/or thienopyridine antiplatelet agents between January 1, 2016, and December 31, 2019.

Exposures: A cohort of patients prescribed oral anticoagulant and thienopyridine antiplatelet agents was divided into development and validation cohorts based on date of index prescription. The development cohort was used to train 3 machine learning models to predict GIB at 6 and 12 months: regularized Cox proportional hazards regression (RegCox), random survival forests (RSF), and extreme gradient boosting (XGBoost).

Main Outcomes and Measures: The performance of the models for predicting GIB in the validation cohort, evaluated using the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value, and prediction density plots. Relative importance scores were used to identify the variables that were most influential in the top-performing machine learning model.

Results: In the entire study cohort of 306 463 patients, 166 177 (54.2%) were male, 193 648 (63.2%) were White, the mean (SD) age was 69.0 (12.6) years, and 12 322 (4.0%) had experienced a GIB. In the validation data set, the HAS-BLED model had an AUC of 0.60 for predicting GIB at 6 months and 0.59 at 12 months. The RegCox model performed the best in the validation set, with an AUC of 0.67 at 6 months and 0.66 at 12 months. XGBoost was similar, with AUCs of 0.67 at 6 months and 0.66 at 12 months, whereas for RSF, AUCs were 0.62 at 6 months and 0.60 at 12 months. The variables with the highest importance scores in the RegCox model were prior GI bleed (importance score, 0.72); atrial fibrillation, ischemic heart disease, and venous thromboembolism combined (importance score, 0.38); and use of gastroprotective agents (importance score, 0.32).

Conclusions and Relevance: In this cross-sectional study, the machine learning models examined showed similar performance in identifying patients at high risk for GIB after being prescribed antithrombotic agents. Two models (RegCox and XGBoost) performed modestly better than the HAS-BLED score. A prospective evaluation of the RegCox model compared with HAS-BLED may provide a better understanding of the clinical impact of improved performance.


Year:  2021        PMID: 34019087      PMCID: PMC8140376          DOI: 10.1001/jamanetworkopen.2021.10703

Source DB:  PubMed          Journal:  JAMA Netw Open        ISSN: 2574-3805


Introduction

Physicians have long used prediction models to stratify patients according to their risk of adverse outcomes. Such risk stratification can promote better treatment decisions, more efficient use of monitoring, and better implementation of approaches to mitigate risk.[1] One outcome for which risk stratification is routinely used is related to the decision to prescribe antithrombotic medications (vitamin K antagonists, direct oral anticoagulants [DOACs], and/or thienopyridine antiplatelet agents) to patients with cardiovascular diseases. One of the key goals of this risk stratification is to incorporate the risk of gastrointestinal bleeding (GIB) in the context of the treatment decision. Given the severity of GIB[2,3] and the widespread use of antithrombotic agents in this population, several risk models have been developed to predict bleeding, including HAS-BLED (hypertension, abnormal kidney and liver function, stroke, bleeding, labile international normalized ratio, older age, and drug or alcohol use), ATRIA (anticoagulation and risk factors in atrial fibrillation), ORBIT (Outcomes Registry for Better Informed Treatment of Atrial Fibrillation), and HEMORR(2)HAGES (hepatic or kidney disease, ethanol abuse, malignancy, older age, reduced platelet count or function, rebleeding, hypertension, anemia, genetic factors, excessive fall risk, and stroke).[4,5,6,7] Of these models, HAS-BLED has demonstrated the best performance, with an area under the receiver operating characteristic (ROC) curve (AUC) of 0.68 in a real-world population.[4] All of the models are subject to important limitations, including model development using data sets with only 50 to 150 annual GIB events, lack of inclusion of more contemporary medications (namely DOACs and second-generation antiplatelet agents), and lack of revision to reflect medical advances, which may lead to a temporal lag in the models and outcomes as new technologies and clinical practices evolve.
For example, the HAS-BLED score was derived to predict major bleeding in patients treated with warfarin, yet clinical extrapolation to estimate the risk of GIB is common. The HAS-BLED score may not accurately reflect the GIB risk in contemporary practice that has expanded to include DOACs and second-generation thienopyridine antiplatelet agents—drugs that are often used in combination. Furthermore, the HAS-BLED model may underestimate GIB in some patients, including older patients with multiple comorbidities.[8,9,10] Given that the performance and utility of prediction models depend on the data source and methods used, we hypothesized that machine learning approaches applied to a larger, more recent data set with a broader range of parameters might produce a risk model with performance characteristics superior to those of HAS-BLED. Many machine learning approaches have previously been found to perform markedly better than risk models based on conventional approaches. This is especially true for rare outcomes, such as GIB (which occurs in 5% of patients aged <75 years who receive anticoagulant or antithrombotic monotherapy[11]), for which typical data sets may contain few events and small absolute differences in detecting true-positive cases can result in low sensitivity. Therefore, we aimed to test the ability of 3 machine learning algorithms to predict GIB in patients with atrial fibrillation, ischemic heart disease, or venous thromboembolism who recently started antithrombotic treatment.

Methods

This cross-sectional study was exempted from the need for approval and informed consent by the Mayo Clinic Institutional Review Board because it used preexisting, deidentified patient data. The study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline.[12]

Data Source

Medical and pharmacy claims data from the OptumLabs Data Warehouse (OLDW) were used to define the cohort. This data source includes national claims of more than 100 million privately insured individuals and Medicare Advantage enrollees in the US.[13,14] Data include enrollment information, such as age, sex, and race/ethnicity, as well as International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) and International Statistical Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) diagnosis codes; International Classification of Diseases, Ninth Revision (ICD-9) and International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10) procedure codes; Current Procedural Terminology, Fourth Edition (CPT-4) procedure codes; Healthcare Common Procedure Coding System procedure codes; site of service codes; and provider specialty codes (eTable 1 in the Supplement).

Study Population

We identified a cohort of patients 18 years or older who were prescribed antithrombotic drugs (vitamin K antagonists, DOACs, and/or thienopyridine antiplatelet agents) between January 1, 2016, and December 31, 2019. We restricted the cohort to patients with no prescription in the prior 12 months and with a history of atrial fibrillation, ischemic heart disease, or venous thromboembolism. The first date of prescription fill was defined as the index prescription date. All patients had been enrolled in a health plan for 12 months before the index date. We excluded patients at risk of malignancy-associated GIB by using the presence of any cancer diagnosis code during the prior year. We also excluded patients with missing sex data and without a minimum of 12 months of follow-up after the index prescription date.

Variables

The main outcome was the time in days to the first GIB, defined based on administrative codes[12,14] (eTable 1 in the Supplement). The GIB events were identified by inpatient claims with 1 of these GIB administrative codes in the first or second discharge diagnosis fields, censoring at the end of enrollment (including death). All study participants were followed from the index prescription date until the end of the study period, disenrollment from insurance coverage, or the first GIB occurrence. We considered 32 baseline (ascertained at the index date) clinical and demographic risk factors, including age, sex, race/ethnicity, condition group (atrial fibrillation, ischemic heart disease, venous thromboembolism, or combinations of these), and baseline comorbidities and medications; the complete list is provided in Table 1. For training and validation, all categorical risk factors were converted into a binary format using one-hot encoding (ie, a variable with k categories was transformed into k binary indicator variables).
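The one-hot transformation described above can be sketched in plain Python; this is a generic helper, not the study's code, and the category values shown are illustrative:

```python
def one_hot(values, categories=None):
    """Expand a categorical variable with k categories into k binary indicators."""
    if categories is None:
        categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical condition-group column: k = 3 categories -> 3 indicator columns.
rows = one_hot(["AFIB", "IHD", "VTE", "IHD"])
# Each row contains exactly one 1, marking that patient's category.
```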
Table 1.

Characteristics of the Patients Included in Study

Characteristic | No GI bleed (n = 294 141) | GI bleed (n = 12 322) | All (N = 306 463)
Condition group
  Atrial fibrillation | 46 716 (15.9) | 1689 (13.7) | 48 405 (15.8)
  Atrial fibrillation and ischemic heart disease | 68 212 (23.2) | 3399 (27.6) | 71 611 (23.4)
  Atrial fibrillation, ischemic heart disease, and venous thromboembolism | 9015 (3.1) | 597 (4.8) | 9612 (3.1)
  Ischemic heart disease | 113 136 (38.5) | 4548 (36.9) | 117 684 (38.4)
  Ischemic heart disease and venous thromboembolism | 18 423 (6.3) | 938 (7.6) | 19 361 (6.3)
  Venous thromboembolism | 34 616 (11.8) | 972 (7.9) | 35 588 (11.6)
  Venous thromboembolism and atrial fibrillation | 4023 (1.4) | 179 (1.5) | 4202 (1.4)
Age, mean (SD), y | 68.9 (12.7) | 71.2 (11.2) | 69.0 (12.6)
Age group, y
  18-64 | 94 600 (32.2) | 2871 (23.3) | 97 471 (31.8)
  65-74 | 92 352 (31.4) | 4239 (34.4) | 96 591 (31.5)
  75-86 | 107 189 (36.4) | 5212 (42.3) | 112 401 (36.7)
Race/ethnicity
  White | 185 989 (63.2) | 7659 (62.2) | 193 648 (63.2)
  Black | 31 757 (10.8) | 1552 (12.6) | 33 309 (10.9)
  Other[a] | 76 395 (26.0) | 3111 (25.2) | 79 506 (25.9)
Sex
  Female | 133 703 (45.5) | 6583 (53.4) | 140 286 (45.8)
  Male | 160 438 (54.5) | 5739 (46.6) | 166 177 (54.2)
Treatment group
  Anticoagulants | 166 209 (56.5) | 6618 (53.7) | 172 827 (56.4)
  Antiplatelets | 123 180 (41.9) | 5519 (44.8) | 128 699 (42.0)
  Anticoagulants and antiplatelets | 4752 (1.6) | 185 (1.5) | 4937 (1.6)
Baseline comorbidities
  Diabetes | 122 796 (41.7) | 5934 (48.2) | 128 730 (42.0)
  Hypertension | 258 738 (88.0) | 11 546 (93.7) | 270 284 (88.2)
  Peripheral arterial disease | 45 194 (15.4) | 2377 (19.3) | 47 571 (15.5)
  Alcoholism | 18 938 (6.4) | 975 (7.9) | 19 913 (6.5)
  Chronic kidney failure | 13 615 (4.6) | 922 (7.5) | 14 537 (4.7)
  Chronic liver disease | 30 373 (10.3) | 1708 (13.9) | 32 081 (10.5)
  Rheumatologic disease | 21 033 (7.2) | 1199 (9.7) | 22 232 (7.3)
  Carotid revascularization | 13 255 (4.5) | 640 (5.2) | 13 895 (4.5)
  Helicobacter pylori infection | 5467 (1.9) | 456 (3.7) | 5923 (1.9)
  History of GI bleeding | 66 361 (22.6) | 5217 (42.3) | 71 578 (23.4)
  Smoking | 134 235 (45.6) | 6261 (50.8) | 140 496 (45.8)
  Sleep apnea | 37 322 (12.7) | 1830 (14.9) | 39 152 (12.8)
  Thyroid disease | 84 826 (28.8) | 4049 (32.9) | 88 875 (29.0)
  Valvular heart disease | 128 578 (43.7) | 6101 (49.5) | 134 679 (43.9)
  Viral hepatitis | 6867 (2.3) | 386 (3.1) | 7253 (2.4)
  Percutaneous coronary intervention | 21 485 (7.3) | 1567 (12.7) | 23 052 (7.5)
Charlson comorbidities[b]
  Cerebrovascular disease | 66 909 (22.7) | 3296 (26.7) | 70 205 (22.9)
  Dementia | 17 676 (6.0) | 755 (6.1) | 18 431 (6.0)
  Hemiplegia | 12 135 (4.1) | 571 (4.6) | 12 706 (4.1)
  AIDS | 805 (0.3) | 35 (0.3) | 840 (0.3)
Medications
  Antiarrhythmic drugs | 23 640 (8.0) | 1041 (8.4) | 24 681 (8.1)
  Antihyperlipidemic drugs | 170 733 (58.0) | 7592 (61.6) | 178 325 (58.2)
  Gastroprotective agents (proton pump inhibitors and/or histamine 2 blockers) | 76 972 (26.2) | 5006 (40.6) | 81 978 (26.7)
  Selective serotonin reuptake inhibitors | 38 368 (13.0) | 1962 (15.9) | 40 330 (13.2)
  Nonsteroidal anti-inflammatory drugs | 47 964 (16.3) | 1926 (15.6) | 49 890 (16.3)
  Antihypertensive drugs | 233 458 (79.4) | 10 492 (85.1) | 243 950 (79.6)

Abbreviation: GI, gastrointestinal.

[a] Other includes Asian, Hispanic, and unknown.

[b] Defined by the Charlson Comorbidity Index.

Because we used claims data, there were no missing values among the outcomes or risk factors. Although the potential for misclassification remained because of unreported data (eg, omitted codes for patients with more complex conditions), we did not perform sensitivity analyses to assess its potential impact, owing to the large number of factors.

Machine Learning Algorithms

To predict GIB using the patient characteristics described, we trained 3 machine learning algorithms for right-censored time-to-event data: regularized Cox regression (RegCox), random survival forests (RSF), and extreme gradient boosting (XGBoost). We selected these algorithms because they were commonly used for clinical models and had implementations available for time-to-event outcomes. The RegCox model is an extension of the least absolute shrinkage and selection operator (LASSO) and the Ridge regularization methods for the Cox proportional hazards regression model; RegCox penalizes models with a large number of nonzero coefficients. The RSF algorithm is an extension of the random forests algorithm for right-censored survival data.[15,16] In RSF, multiple decision trees are trained on a random sample of observations from the study data, and the predictions are combined by using a mean value or majority vote. Unlike the traditional random forests method, in which the decision tree nodes are split by maximizing the difference in class predictions, nodes in RSF are split by maximizing differences in survival times. The XGBoost[17] algorithm is a generalized implementation of the gradient boosting machine[18,19,20] technique with several algorithmic enhancements to improve scalability, speed, and prediction accuracy. An important enhancement is the implementation of the LASSO and Ridge regularization, which penalizes more complex models to prevent overfitting. We implemented the Cox proportional hazards regression model partial log-likelihood as a loss function in XGBoost to analyze right-censored survival data.[21,22] As a reference for comparing the machine learning models, we calculated the HAS-BLED risk score for each patient. 
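The Cox partial log-likelihood used as the XGBoost loss can be sketched in plain Python. This is a minimal sketch of the loss itself (Breslow handling of ties), not XGBoost's implementation, and the inputs are illustrative:

```python
import math

def neg_cox_partial_loglik(times, events, scores):
    """Negative Cox partial log-likelihood for right-censored data.

    times  -- follow-up time per patient
    events -- 1 if the event (eg, GIB) was observed, 0 if censored
    scores -- model-predicted log hazard (linear predictor) per patient
    """
    nll = 0.0
    for i in range(len(times)):
        if events[i]:
            # Risk set: every patient still under observation at this event time.
            risk = sum(math.exp(scores[j])
                       for j in range(len(times)) if times[j] >= times[i])
            nll += math.log(risk) - scores[i]
    return nll
```

Minimizing this quantity by gradient boosting is the idea behind a Cox-type objective; note that a censored patient contributes only through the risk sets of events that occur before censoring.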
Because we were using claims data, we did not have access to clinical values used in the HAS-BLED score and instead used the presence of related inpatient or outpatient claims for hypertension, abnormal kidney or liver function, stroke, history of or presence of factors associated with bleeding, labile international normalized ratio, older age (>65 years), and use of drugs or alcohol. This modified HAS-BLED score has been used previously in claims data.[23]
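The modified score just described reduces to a sum of binary indicators, one point per factor present. A minimal sketch (the dictionary keys are hypothetical names, not the study's variables):

```python
def modified_has_bled(p):
    """Claims-based HAS-BLED: one point per risk factor present (range 0-9)."""
    return sum(int(bool(x)) for x in [
        p["hypertension"],
        p["abnormal_kidney_function"],
        p["abnormal_liver_function"],
        p["stroke_history"],
        p["bleeding_history"],
        p["labile_inr"],
        p["age"] > 65,                   # "elderly" criterion
        p["antiplatelet_or_nsaid_use"],  # drugs predisposing to bleeding
        p["alcohol_use"],
    ])

patient = {"hypertension": True, "abnormal_kidney_function": False,
           "abnormal_liver_function": False, "stroke_history": False,
           "bleeding_history": True, "labile_inr": False, "age": 72,
           "antiplatelet_or_nsaid_use": False, "alcohol_use": False}
score = modified_has_bled(patient)  # 3: hypertension, bleeding history, age > 65
```

A score of 3 is the cutoff used later in this study to stratify HAS-BLED in the Kaplan-Meier comparison.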

Training and Validation

We divided the study sample into a development cohort (105 837 patients) and a validation cohort (200 626 patients) according to whether the index prescription date was before (development) or after (validation) January 1, 2019. The development cohort was then randomly partitioned into 10 equal subsets (folds), with random partitioning stratified by GIB status and calendar quarter. We then used the 10-fold development sample to identify the best tuning parameters (eg, the LASSO and Ridge in RegCox and XGBoost) for the RegCox, RSF, and XGBoost algorithms. Specifically, we set up a grid for each combination of parameters (chosen at random),[24] applied the algorithm with those parameters to 9 folds of the development set, and assessed the performance of the solution on the tenth (hold-out) fold. This procedure was repeated 10 times for each parameter set in the grid (leaving out each fold in turn), and the optimal parameters were selected according to a maximum AUC over the 10 applications to the 10 hold-out folds. For each machine learning algorithm, a final model was obtained by retraining the algorithm on the entire development set using the optimal parameters selected, and the performance of this final model was assessed on the validation cohort. The workflow of the training and validation procedure and the selection of the final model are shown in eFigure 1 in the Supplement.
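The stratified fold assignment can be sketched as follows; for brevity this stratifies on event status only, whereas the study additionally stratified by calendar quarter:

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Assign each observation to one of k folds with similar label proportions."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for label in set(labels):
        idx = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % k
    return fold_of

# Tuning loop: train on 9 folds, score the held-out tenth, repeat for each fold.
# for held_out in range(10):
#     train = [i for i, f in enumerate(fold_of) if f != held_out]
#     test  = [i for i, f in enumerate(fold_of) if f == held_out]
```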

Performance Measures

Using the predicted probability of GIB at 6 and 12 months from each model, we computed the sensitivity and specificity across all possible classification thresholds and constructed ROC curves. The classification threshold was a value of the predicted probability: instances with predicted probabilities above that value were classified as GIB, and those below it as no GIB. We identified the optimal classification threshold (based on the development cohort) as the threshold value at which the distance between the point (0, 1) and the ROC curve was the minimum.[25] We computed the accuracy (balanced), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) at the optimal classification threshold, and we computed the ROC curve, AUC, cumulative gain curve, and Gini score at each time point by applying methods appropriate for censored data.[26,27] We also developed a graphical representation of the associations among the sensitivity, PPV, and prediction density for the classification thresholds.
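The threshold rule described, choosing the point on the ROC curve closest to (0, 1), can be sketched with toy scores and labels (not study output):

```python
import math

def optimal_threshold(scores, labels):
    """Return the cutoff minimizing the distance from the ROC curve to (0, 1)."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_dist, best_t = float("inf"), None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        sens, spec = tp / pos, 1 - fp / neg
        dist = math.hypot(1 - sens, 1 - spec)  # distance to the top-left corner
        if dist < best_dist:
            best_dist, best_t = dist, t
    return best_t
```

Balanced accuracy at the chosen cutoff is then simply (sensitivity + specificity) / 2.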

Statistical Analysis

We report descriptive statistics of patient characteristics as the mean (SD) for continuous variables and as the count (percentage) for categorical variables. For each model, including the HAS-BLED model, we report the mean and 95% CI of the scalar performance and graphical performance metrics on the development cohort; these were based on the mean and SE of the 10 values. In secondary analyses, we replicated each machine learning model including HAS-BLED as a risk factor to assess whether it contributed to the model. For the best-performing machine learning model, we also report the contribution of all risk factors using a metric appropriate for the model. For this model and the HAS-BLED score, we constructed 12-month Kaplan-Meier curves using the optimal threshold to stratify the machine learning model and a score of 3 to stratify HAS-BLED. We performed 2 sensitivity analyses to confirm the robustness of the findings. First, to assess the consequences of using time-to-event scoring in the machine learning models, we replicated all 3 approaches using classification rather than survival-time outcomes, training each model to predict 6-month and 12-month events directly. Second, to evaluate the consequences of including a heterogeneous set of conditions, we also replicated the main survival analyses using only patients with a diagnosis of atrial fibrillation. Data analysis was performed using Stata, version 16 (StataCorp LLC), SAS, version 9.3 (SAS Institute), and R, version 3.5 (R Project for Statistical Computing). Specific R packages used were survival, xgboost, randomForestSRC, ranger, glmnet, and pec.
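The Kaplan-Meier curves mentioned above follow the standard product-limit estimator, sketched here with toy data (not study results):

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate; returns (time, S(t)) at each event time."""
    surv, curve = 1.0, []
    for t in sorted(set(times)):
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)  # events at t
        n = sum(1 for ti in times if ti >= t)                        # still at risk
        if d:
            surv *= 1 - d / n
            curve.append((t, surv))
    return curve

# Three patients: events at day 30 and day 90, one censored at day 60.
km = kaplan_meier([30, 60, 90], [1, 0, 1])
```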

Results

Baseline Characteristics

The study sample included 306 463 patients; of these, 12 322 (4.0%) experienced a GIB event during a median follow-up of 133 days (interquartile range, 49-320 days). The mean (SD) age of the study population was 69.0 (12.6) years; 166 177 (54.2%) of the patients were men, and 193 648 (63.2%) were White. Black patients were more likely to experience GI bleeds than White patients and those of other races (1552 [12.6%] vs 7659 [62.2%] vs 3111 [25.2%] of patients with GIB), and female patients were more likely to experience GI bleeds than were male patients (6583 [53.4%] vs 5739 [46.6%]) (Table 1).

Risk of GIB at 6 and 12 Months

Table 2 presents the accuracy, AUC, sensitivity, specificity, PPV, and NPV of the machine learning and HAS-BLED risk models for predicting 6- and 12-month GIB risk based on 10-fold cross-validation of the survival-based machine learning models. Consistent with the results of prior studies,[5] the HAS-BLED score achieved an AUC of 0.61 (95% CI, 0.59-0.62) for predicting 6-month GIB risk, with similar performance for 12-month GIB risk (AUC, 0.60; 95% CI, 0.59-0.61). All 3 machine learning models had similar AUCs, although the RegCox model exhibited superior performance and significantly outperformed the HAS-BLED score in predicting 6- and 12-month GIB risk. Specifically, RegCox had an AUC of 0.68 (95% CI, 0.66-0.70) and 0.67 (95% CI, 0.65-0.69) for predicting GIB at 6 and 12 months, respectively. The RegCox and XGBoost models also showed consistently good performance in the validation cohort (RegCox: 6-month AUC, 0.67; 12-month AUC, 0.66; XGBoost: 6-month AUC, 0.67; 12-month AUC, 0.66), outperforming the HAS-BLED score (6-month AUC, 0.60; 12-month AUC, 0.59). Similar performance was observed for accuracy, sensitivity, specificity, and PPV.
Table 2.

Model Evaluation and Validation

Model | Month | AUC (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | PPV (95% CI) | NPV (95% CI) | Balanced accuracy (95% CI)[a] | Cutoff[b]
XGBoost
  Development cohort | 6 | 0.68 (0.66-0.69) | 0.45 (0.03-0.97) | 0.71 (0.11-0.99) | 0.06 (0.03-0.11) | 0.98 (0.97-0.99) | 0.58 (0.51-0.64) | 0.02 (0.02, 0.02)
  Development cohort | 12 | 0.67 (0.65-0.68) | 0.65 (0.21-0.97) | 0.52 (0.10-0.90) | 0.06 (0.04-0.08) | 0.98 (0.97-0.99) | 0.59 (0.53-0.61) | 0.03 (0.03, 0.03)
  Validation cohort[c] | 6 | 0.67 | 0.26 | 0.90 | 0.05 | 0.98 | 0.58 | 0.02
  Validation cohort[c] | 12 | 0.66 | 0.59 | 0.66 | 0.04 | 0.99 | 0.62 | 0.03
RSF
  Development cohort | 6 | 0.66 (0.64-0.69) | 0.58 (0.42-0.72) | 0.65 (0.51-0.80) | 0.05 (0.04-0.06) | 0.98 (0.98-0.98) | 0.62 (0.61-0.64) | 0.04 (0.03, 0.04)
  Development cohort | 12 | 0.65 (0.62-0.67) | 0.62 (0.55-0.67) | 0.60 (0.54-0.64) | 0.06 (0.05-0.06) | 0.98 (0.97-0.98) | 0.61 (0.59-0.62) | 0.06 (0.05, 0.07)
  Validation cohort[c] | 6 | 0.62 | 0.53 | 0.65 | 0.03 | 0.99 | 0.59 | 0.04
  Validation cohort[c] | 12 | 0.60 | 0.58 | 0.58 | 0.03 | 0.98 | 0.58 | 0.06
RegCox
  Development cohort | 6 | 0.68 (0.66-0.70) | 0.67 (0.64-0.70) | 0.59 (0.58-0.59) | 0.04 (0.04-0.05) | 0.98 (0.98-0.99) | 0.63 (0.62-0.64) | 0.03 (0.03, 0.03)
  Development cohort | 12 | 0.67 (0.65-0.69) | 0.67 (0.65-0.70) | 0.56 (0.55-0.57) | 0.06 (0.05-0.06) | 0.98 (0.97-0.98) | 0.62 (0.61-0.63) | 0.05 (0.05, 0.05)
  Validation cohort[c] | 6 | 0.67 | 0.69 | 0.57 | 0.03 | 0.99 | 0.63 | 0.03
  Validation cohort[c] | 12 | 0.66 | 0.70 | 0.54 | 0.03 | 0.99 | 0.62 | 0.05
HAS-BLED
  Development cohort | 6 | 0.61 (0.59-0.62) | 0.54 (0.51-0.56) | 0.62 (0.62-0.63) | 0.03 (0.03-0.04) | 0.98 (0.98-0.99) | 0.58 (0.56-0.59) | 3
  Development cohort | 12 | 0.60 (0.59-0.61) | 0.53 (0.51-0.54) | 0.62 (0.62-0.63) | 0.05 (0.04-0.05) | 0.98 (0.97-0.98) | 0.57 (0.56-0.58) | 3
  Validation cohort[c] | 6 | 0.60 | 0.57 | 0.58 | 0.02 | 0.99 | 0.56 | 3
  Validation cohort[c] | 12 | 0.59 | 0.56 | 0.58 | 0.03 | 0.99 | 0.56 | 3

Abbreviations: AUC, area under the receiver operating characteristic curve; HAS-BLED, hypertension, abnormal kidney or liver function, stroke, history of and factors associated with presence of bleeding, labile international normalized ratio, older age (>65 years), use of drugs or alcohol concomitantly; NPV, negative predictive value; PPV, positive predictive value; RegCox, regularized Cox proportional hazards regression; RSF, random survival forests; XGBoost, extreme gradient boosting.

[a] The balanced accuracy is the arithmetic mean of the sensitivity and specificity, designed to better judge the performance of a classifier (compared with simple accuracy, the percentage of instances correctly classified), especially in a setting in which the classes are highly imbalanced.

[b] The cutoff is the classification threshold that minimizes the distance between the receiver operating characteristic curve and the upper left corner of the graph, the point (0, 1).

[c] Because there was only 1 validation data set, results for the validation cohort are expressed as point estimates without 95% CIs.

Figure 1 shows the ROC, cumulative gain, sensitivity, and PPV curves for the machine learning and HAS-BLED models in predicting 12-month GIB risk in the development data set (with mean values determined over the 10-fold cross-validations). The ROC curves reflect higher AUCs for all 3 machine learning models compared with HAS-BLED. The gain curve shows that if the top 10% of the entire population were selected as high risk for GIB at 12 months based on the RegCox predictions, the sample would contain approximately 25% of actual patients at high risk for GIB; in comparison, the HAS-BLED predictions would contain only 15% of actual high-risk cases. The sensitivity and PPV curves of the machine learning models further confirmed this, being substantially higher than the HAS-BLED curves across most of the classification thresholds.
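The gain-curve reading above, and the Gini score derived from it, can be sketched as follows (toy data; the Gini divides the area between the gain curve and the random-classifier diagonal by the corresponding area for a perfect classifier):

```python
def gain_curve(scores, labels):
    """Fraction of all events captured among the top x fraction of patients."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total = sum(labels)
    pts, captured = [(0.0, 0.0)], 0
    for rank, i in enumerate(order, start=1):
        captured += labels[i]
        pts.append((rank / len(scores), captured / total))
    return pts

def gini_score(pts, prevalence):
    """Area between gain curve and diagonal over the same area for a perfect model."""
    def area(curve):  # trapezoidal area under a piecewise-linear curve
        return sum((x1 - x0) * (y0 + y1) / 2
                   for (x0, y0), (x1, y1) in zip(curve, curve[1:]))
    perfect = [(0.0, 0.0), (prevalence, 1.0), (1.0, 1.0)]
    return (area(pts) - 0.5) / (area(perfect) - 0.5)

pts = gain_curve([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
g = gini_score(pts, prevalence=0.5)  # ranking is perfect here, so Gini = 1.0
```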
The secondary analyses including HAS-BLED as a factor in the machine learning models gave very similar results, with identical development AUCs at 6 months for the 3 models and slightly lower 12-month AUCs.
Figure 1.

Performance Plots for Predicting Gastrointestinal Bleeding at 12 Months in the Development Cohort

The Gini score was computed by dividing the area between the gain curve and the random classifier (dotted diagonal line) by the area between the perfect classifier and the random classifier. AUC indicates area under the ROC curve; HAS-BLED, hypertension, abnormal kidney or liver function, stroke, history of and factors associated with presence of bleeding, labile international normalized ratio, older age (>65 years), use of drugs or alcohol concomitantly; RegCox, regularized Cox proportional hazards regression; RSF, random survival forests; XGBoost, extreme gradient boosting.

The results of the sensitivity analyses based on the classification models were similar, with validation AUCs for the machine learning methods ranging from 0.66 to 0.69, whereas the HAS-BLED model had 6- and 12-month validation AUCs of 0.60 and 0.59, respectively (eTable 2 in the Supplement). The results using only the patients with atrial fibrillation were also similar, with the machine learning AUCs ranging from 0.60 to 0.67 and the HAS-BLED model having 6- and 12-month validation AUCs of 0.60 and 0.59, respectively (eTable 3 in the Supplement). Figure 2 shows the prediction density of the HAS-BLED and RegCox models to better compare the best- and worst-performing models. The predicted class probability density plot for the HAS-BLED score showed substantial overlap between the classes, indicating an inability to discriminate between them. In contrast, the plots for RegCox showed less overlap between the class distributions.
Figure 2.

Survival Prediction Density by Gastrointestinal (GI) Bleed Status at 12 Months for HAS-BLED and Regularized Cox Proportional Hazards Regression (RegCox) Models

HAS-BLED indicates hypertension, abnormal kidney or liver function, stroke, history of and factors associated with presence of bleeding, labile international normalized ratio, older age (>65 years), use of drugs or alcohol concomitantly.


Importance Scores

Given the overall better performance of RegCox, we graphed the importance scores (β coefficients) from the 12-month model for all factors to assess face validity (Figure 3). The variables with the highest importance scores in the RegCox model were prior GI bleed (0.72); atrial fibrillation, ischemic heart disease, and venous thromboembolism combined (0.38); and use of gastroprotective agents (0.32). Results were similar for the 6-month model. The Kaplan-Meier curves for the RegCox and HAS-BLED models are shown in eFigures 2 and 3 in the Supplement.
Figure 3.

Importance Scores for Factors Included in the Regularized Cox Proportional Hazards Regression (RegCox) Machine Learning Model

The importance scores (β coefficients) from the 12-month RegCox model for all factors are shown. AC indicates anticoagulants; AFIB, atrial fibrillation; AP, antiplatelets; CRF, chronic renal failure; GIB, gastrointestinal bleeding; H2, histamine 2 blocker; IHD, ischemic heart disease; PAD, peripheral arterial disease; PCI, percutaneous coronary intervention; PPI, proton pump inhibitor; SSRI, selective serotonin reuptake inhibitor; VTE, venous thromboembolism.


Discussion

In this study of more than 300 000 patients receiving treatment with antithrombotic agents, we found that machine learning techniques modestly outperformed the HAS-BLED score in discriminating patients at risk of GIB after prescription of the antithrombotic agent at both 6 and 12 months. Of the 3 machine learning approaches, the RegCox model had marginally better discrimination than the others. Of more importance, this RegCox model for predicting GIB risk at 6 and 12 months performed consistently better in this population than the current standard risk model, HAS-BLED. Using claims data from the OLDW, we were able to construct a larger cohort with longer follow-up than, to our knowledge, had been used before in model development, and the cohort included patients who received treatment with a range of antithrombotic agents, thus improving the generalizability and clinical relevance of the results.

We constructed prediction models using 3 machine learning approaches. By comparing these models with each other as well as with the best model from the literature, we were able to evaluate the incremental advantage, if any, of applying machine learning techniques for risk stratification. By comparing machine learning models with the HAS-BLED model, we were also able to identify whether any of these approaches improved on the established risk model.

There has been increasing interest in better understanding the impact of machine learning approaches in the context of clinical prediction models. Christodoulou et al[28] undertook a systematic review of comparisons of machine learning methods with logistic regression and found no evidence that machine learning models performed better than logistic regression; the mean difference in AUCs between machine learning and logistic regression was 0.00 (95% CI, −0.18 to 0.18).
A systematic review by Shung et al[29] of machine learning approaches to predicting outcomes in patients with existing GIB covered a separate set of studies from those reviewed by Christodoulou et al[28] and found that artificial neural networks generally outperformed all other methods, although this conclusion was based on a single study[30] of artificial neural networks that did not use a comparator approach. A later, more limited comparison[31] of machine learning approaches for coronary heart disease found that neural networks and support vector machines performed better than random forests, but the performance of both depended on the software implementation.

The current study’s findings add to these systematic comparisons of machine learning methods using a large data set with a time-to-event outcome. The optimal model in this study, RegCox, performed marginally better not only than the other machine learning approaches but also than a version of HAS-BLED, the most commonly used existing model. Given that we modified the HAS-BLED model to use claims-based risk factors, this finding should not be interpreted as a direct comparison with the established clinical model, but it does suggest that machine learning approaches can improve on standard approaches when the same data are used. Furthermore, other machine learning approaches might produce better-performing prediction models. For example, both extreme-learning machine Cox proportional hazards regression models[32] and super-learning survival models[33] have been found to perform better than any of the individual methods tested in this study. We chose the 3 machine learning approaches for this study because of their appropriateness for the data and their relatively widespread use in constructing risk prediction models. However, the finding that the 3 approaches performed better than the standard model (HAS-BLED) for predicting GIB might be further supported by the use of alternative super- or extreme-learning approaches.
Of more importance, we found little difference in the performance profiles of the 3 machine learning approaches, suggesting that there may be little gain from comparing multiple models in future risk profiling. All of the approaches used in this study, including HAS-BLED, had low PPVs and high NPVs, indicating that all of the models are better at identifying patients who will not experience a GIB than at identifying those who will. This suggests that using any of these models for clinical decision-making will be most appropriate for identifying patients at low risk. The moderate AUCs of these models indicate that they should be considered supplementary to other input for clinical decision-making because they all had a limited ability to discriminate. This study’s findings should therefore be viewed primarily as informing the development of better risk models for GIB.

Of greater importance is the relative performance of the RegCox model, which this study found to be the best risk model for GIB. The current study included more than 300 000 patients, making it, to our knowledge, one of the largest analyses of GIB risk. Each fold of the 10-fold data set included nearly 20 000 patients, more than were included overall in most prior studies of GIB. With this cohort, the RegCox model performed only marginally better than the modified HAS-BLED risk prediction score at predicting the 6- and 12-month risk of GIB, with AUCs at 6 and 12 months of 0.67 and 0.66, respectively, compared with AUCs of 0.60 and 0.59 for the HAS-BLED model. All of the AUCs were less than 0.70, a conventionally used threshold for acceptable performance[34]; however, these values are consistent with AUCs reported for the same outcome using other cohorts, data, and methods, including HAS-BLED.[35,36] A key unknown is whether additional data could further improve the performance of either statistical or machine learning models.
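The low-PPV/high-NPV pattern described above is largely a consequence of low event prevalence: even a model with reasonable sensitivity and specificity yields a low PPV when GIB is rare. A short sketch, using illustrative operating characteristics that are assumed for the example rather than taken from the study:

```python
def ppv_npv(sensitivity, specificity, prevalence):
    """Convert a model's operating characteristics into predictive
    values via Bayes' rule (population fractions, not counts)."""
    tp = sensitivity * prevalence              # true positives
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    tn = specificity * (1 - prevalence)        # true negatives
    fn = (1 - sensitivity) * prevalence        # false negatives
    return tp / (tp + fp), tn / (tn + fn)

# hypothetical values: 60% sensitivity, 65% specificity, 3% 12-month GIB rate
ppv, npv = ppv_npv(0.60, 0.65, 0.03)
print(f"PPV={ppv:.3f}, NPV={npv:.3f}")   # PPV is low, NPV is high
```

At a 3% event rate the sketch gives a PPV of about 0.05 and an NPV above 0.98, mirroring why all of the models compared here are better at ruling out GIB than at ruling it in.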
For example, imaging data from endoscopies or self-reported data about patient behaviors might improve the ability to predict the risk of GIB in patients being considered for antithrombotic therapies. Existing data sets cannot address this question, but it will be important in future studies of combinations of novel data and novel analytics to improve the delivery of health care. Although we were able to construct a GIB risk prediction model with improved sensitivity and specificity using machine learning methods, the choice of method was not critical to the model’s performance. The machine learning models showed improvement over the existing HAS-BLED model and could serve as the basis for a clinical tool for assessing GIB risk.
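Discrimination throughout this discussion is summarized by the AUC, which equals the probability that a randomly chosen patient who experiences a GIB receives a higher predicted risk than a randomly chosen patient who does not. A self-contained sketch of that rank-based interpretation, using synthetic scores rather than study data:

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """AUC via the Mann-Whitney U statistic: the fraction of
    (positive, negative) score pairs ranked correctly, ties half."""
    greater = (scores_pos[:, None] > scores_neg[None, :]).sum()
    ties = (scores_pos[:, None] == scores_neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(scores_pos) * len(scores_neg))

# toy check: bleeders scored {0.35, 0.8}, non-bleeders {0.1, 0.4}
print(auc(np.array([0.35, 0.8]), np.array([0.1, 0.4])))  # 0.75

# synthetic cohort: bleeders' risk scores shifted upward by 0.8 SD
rng = np.random.default_rng(1)
pos = rng.normal(0.8, 1.0, size=150)     # patients with GIB
neg = rng.normal(0.0, 1.0, size=1850)    # patients without GIB
print(auc(pos, neg))                      # roughly 0.71 in expectation
```

An AUC near 0.67, as reported for RegCox here, means the model ranks a bleeder above a non-bleeder about two times in three, which is why the text treats these models as supplementary input rather than stand-alone decision rules.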

Limitations

This study has limitations. Although the OLDW is one of the largest and most representative claims databases available, it does not include patients who are uninsured or, more important, patients covered by traditional fee-for-service Medicare. By excluding this large cohort of older patients, the study’s findings may be less generalizable to older patients. However, because the OLDW includes a large number of patients who are privately insured or covered by Medicare Advantage, it is unlikely that the exclusion of publicly insured patients biased the comparisons among machine learning approaches. In addition, the inclusion of other data, such as laboratory results, not only might improve the performance of all machine learning approaches but might do so differentially. This does not undermine the current findings but rather points toward areas for future research.

Conclusions

In this cross-sectional study, the machine learning models examined showed similar performance in identifying patients at high risk for GIB after being prescribed antithrombotic agents. Two models (RegCox and XGBoost) performed modestly better than the HAS-BLED score. A prospective evaluation of the RegCox model compared with HAS-BLED may provide a better understanding of the clinical impact of this improved performance. The findings suggest that developers of risk prediction tools should consider machine learning algorithms, although no one machine learning technique appears clearly superior to the others. Future prospective studies are needed to determine the extent to which improved predictive performance translates into improved clinical outcomes.
References (10 of 22 shown)

1. Jan P Vandenbroucke, Erik von Elm, Douglas G Altman, Peter C Gøtzsche, Cynthia D Mulrow, Stuart J Pocock, Charles Poole, James J Schlesselman, Matthias Egger. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): explanation and elaboration. Ann Intern Med, 2007.

2. Neena S Abraham, Peter A Noseworthy, Xiaoxi Yao, Lindsey R Sangaralingham, Nilay D Shah. Gastrointestinal Safety of Direct Oral Anticoagulants: A Large Population-Based Study. Gastroenterology, 2016.

3. Jacques Donzé, Nicolas Rodondi, Gérard Waeber, Pierre Monney, Jacques Cornuz, Drahomir Aujesky. Scores to predict major bleeding risk during oral anticoagulation therapy: a prospective validation study. Am J Med, 2012.

4. Juan Qiu, Kristen Grine. Assessing Bleeding Risk in Patients Taking Anticoagulants. Am Fam Physician, 2017.

5. Marzieh K Golmakani, Eric C Polley. Super Learner for Survival Data Prediction. Int J Biostat, 2020.

6. Dennis Shung, Michael Simonov, Mark Gentry, Benjamin Au, Loren Laine. Machine Learning to Predict Outcomes in Patients with Acute Gastrointestinal Bleeding: A Systematic Review. Dig Dis Sci, 2019.

7. Rikke Sørensen, Morten L Hansen, Steen Z Abildstrom, Anders Hvelplund, Charlotte Andersson, Casper Jørgensen, Jan K Madsen, Peter R Hansen, Lars Køber, Christian Torp-Pedersen, Gunnar H Gislason. Risk of bleeding in patients with acute myocardial infarction treated with different combinations of aspirin, clopidogrel, and vitamin K antagonists in Denmark: a retrospective analysis of nationwide registry data. Lancet, 2009.

8. Hong Wang, Gang Li. Extreme learning machine Cox model for high-dimensional survival analysis. Stat Med, 2019.

9. Adina Najwa Kamarudin, Trevor Cox, Ruwanthi Kolamunnage-Dona. Time-dependent ROC curve analysis in medical research: current methods and applications. BMC Med Res Methodol, 2017.

10. Neena S Abraham, Sonal Singh, G Caleb Alexander, Herbert Heien, Lindsey R Haas, William Crown, Nilay D Shah. Comparative risk of gastrointestinal bleeding with dabigatran, rivaroxaban, and warfarin: population based cohort study. BMJ, 2015.
