Supplementing Claims Data with Electronic Medical Records to Improve Estimation and Classification of Rheumatoid Arthritis Disease Activity: A Machine Learning Approach.

Candace H Feldman1, Kazuki Yoshida1, Chang Xu1, Michelle L Frits1, Nancy A Shadick1, Michael E Weinblatt1, Sean E Connolly2, Evo Alemao2, Daniel H Solomon1.   

Abstract

OBJECTIVE: Previous attempts to estimate rheumatoid arthritis (RA) disease activity using claims data only did not yield high performance. We aimed to assess whether supplementing claims data with readily available electronic medical record (EMR) data might result in improvement.
METHODS: We used a subset of the Brigham and Women's Hospital Rheumatoid Arthritis Sequential Study (BRASS) that had linked Medicare claims. The disease activity score in 28 joints with C-reactive protein (DAS28-CRP) was considered the gold standard measure. Variables in the linked Medicare claims, as well as EMR data recorded in the preceding one-year period, were used as potential explanatory variables. We constructed three models: "Claims-Only," "Claims + Medications," and "Claims + Medications + Labs" (laboratory data from EMR). We selected variables via adaptive LASSO. Model performance was measured with adjusted R² for continuous DAS28-CRP and C-statistics for binary category classification (high/moderate vs low disease activity/remission).
RESULTS: We identified 300 patients with laboratory data and linked Medicare claims. The mean age was 68 years and 80% were female. The mean (SD) DAS28-CRP was 3.6 (1.6) and 51% had high or moderate DAS28-CRP. For the continuous estimation, the adjusted R² was 0.02 for Claims-Only, 0.09 for Claims + Medications, and 0.18 for Claims + Medications + Labs. The C-statistics for discriminating the binary categories were 0.61 for Claims-Only, 0.68 for Claims + Medications, and 0.76 for Claims + Medications + Labs.
CONCLUSION: Adding EMR-derived variables to claims-derived variables resulted in modest improvement. Even with EMR variables, we were unable to estimate continuous DAS28-CRP satisfactorily. However, in claims-EMR models, we were able to discriminate between binary categories of disease activity with reasonable accuracy.
© 2019 The Authors. ACR Open Rheumatology published by Wiley Periodicals, Inc. on behalf of American College of Rheumatology.

Year:  2019        PMID: 31777839      PMCID: PMC6857973          DOI: 10.1002/acr2.11068

Source DB:  PubMed          Journal:  ACR Open Rheumatol        ISSN: 2578-5745


Previous attempts to estimate rheumatoid arthritis (RA) disease activity using claims data only have been unsuccessful. We demonstrated that the use of simple electronic medical record (EMR) variables linked to claims data can moderately improve binary classification (high + moderate vs low disease activity + remission). Accurate estimation of continuous disease activity score proved to be difficult even with added EMR variables. Model‐based classification, for example, can be used to examine treatment effect modification by disease activity categories.

Introduction

The ability to estimate rheumatoid arthritis (RA) disease activity would be a powerful tool for epidemiologic studies that lack direct disease activity measures such as the Disease Activity Score 28‐joint counts (DAS28) 1. Currently, despite being recognized as an important factor when examining RA‐related outcomes, patterns of medication use, or medication‐related toxicities, disease activity is infrequently accounted for in either electronic medical record (EMR)‐based studies or population‐based administrative claims‐based studies. These studies often include populations significantly larger than those available in RA‐dedicated cohorts and therefore are sufficiently powered to detect relevant but infrequent outcomes, such as medication‐related adverse events or cardiovascular events. However, without the ability to account for RA disease activity, it is often challenging to know the degree to which these adverse outcomes are associated with the exposure or whether they are a result of increased RA disease activity. Prior researchers have demonstrated challenges to developing and validating administrative claims–based algorithms that can accurately estimate rheumatoid arthritis (RA) disease activity 2, 3, 4. Data‐driven, machine learning tools are increasingly being used to accurately identify RA patients, to phenotype distinct populations, and to develop algorithms to understand comorbidities and adverse outcomes 5, 6. To date, only one study has applied machine learning methods to attempt to estimate RA disease activity using administrative claims–based data 2. Similar to prior studies, however, the final models tested showed weak accuracy. We aimed to use data‐driven, machine learning methods to explore alternative strategies to develop algorithms to estimate DAS28 with C‐reactive protein (CRP). We combined claims data with readily available electronic medical record (EMR) variables and laboratory values to construct models to estimate DAS28‐CRP.

Participants

We utilized the Brigham and Women's Hospital Rheumatoid Arthritis Sequential Study (BRASS), a single‐center observational cohort of adults (older than 18 years) with prevalent RA cared for at Brigham and Women's Hospital, which is an urban tertiary care teaching hospital 7. Over 1500 patients with RA confirmed by the 1987 ACR criteria 8 have been followed for more than 15 years with annual measurement of disease activity with the DAS28‐CRP. Among these patients, we selected individuals with at least 1 year of linked Medicare administrative claims data preceding a disease activity measurement between 2006 and 2010, the years for which we had existing linked claims. Medicare is the U.S. public insurance program for individuals aged 65 years or older and for a subset of younger individuals with disabilities 9. A subset of BRASS patients with linked Medicare data also had medication benefits through Medicare, known as Medicare Part D, and for these individuals, pharmacy dispensing data were available.

Dependent variables

BRASS recorded DAS28‐CRP scores, a version of DAS28 with CRP as an inflammatory marker but without patient global health assessment, on an annual basis 10. Each patient potentially had multiple DAS28‐CRP measurements, but we focused on the first measurement during the follow‐up to avoid correlated dependent variables within each individual, which could complicate the cross‐validation process 11. We modeled disease activity in two ways: original continuous form and dichotomized form. We dichotomized the variable as “moderate or high disease activity” (DAS28‐CRP at 3.2 or greater) and “low disease activity or possible clinical remission” (DAS28‐CRP less than 3.2). We chose this cutoff based on the treat‐to‐target strategy for established RA patients 12, 13. We will refer to the dichotomized cutoff of 3.2 or greater vs. less than 3.2 as high/moderate vs. low disease activity, respectively, for the purposes of this study, recognizing differing perspectives regarding the definition of a DAS28‐CRP cutoff for clinical remission.
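
The dichotomization described above can be sketched in a few lines. This is an illustrative helper, not the authors' code: the 3.2 cutoff comes from this section, while the 2.6 and 5.1 cutoffs are the ones quoted with the category percentages in the Results. The function names are hypothetical, and the handling of scores exactly equal to 2.6 or 5.1 is an assumption read off the ranges as reported.

```python
def das28_crp_category(score: float) -> str:
    """Four-level category using the cutoffs quoted in this paper."""
    if score < 2.6:
        return "remission"
    if score < 3.2:
        return "low"
    if score <= 5.1:
        return "moderate"
    return "high"


def is_high_or_moderate(score: float) -> bool:
    """Binary outcome used for classification: DAS28-CRP of 3.2 or greater."""
    return score >= 3.2


for s in (2.1, 3.0, 3.2, 5.1, 6.4):
    print(s, das28_crp_category(s), is_high_or_moderate(s))
```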

Potential explanatory variables

We derived explanatory variables from three sources: Medicare claims data, Medicare Part D pharmacy dispensing data, and EMR data. From Medicare claims, we used ICD‐9 codes to identify 26 variables, including demographics, comorbidities, joint replacement surgery, rehabilitation visits, number of RA‐related codes, laboratory and imaging use, and health care utilization (See eTable 1 for codes). For a subset of patients, Medicare Part D claims provided medication information regarding biological and conventional disease‐modifying antirheumatic drugs (DMARDs), glucocorticoids, and opioids. We used simple EMR‐derived variables collected during routine clinical practice to supplement the claims‐derived variables. We did not use variables that may not be available outside of our data sources, such as survey data collected for research purposes from the BRASS cohort. We extracted data via the Research Patient Data Registry (RPDR) 14, 15, 16, a centralized clinical data registry consisting of routinely collected data from the EMR. We obtained smoking status, body mass index (BMI), systolic blood pressure, medication use (when Medicare Part D was not available), laboratory abnormalities for RA seropositivity (rheumatoid factor or anticyclic citrullinated peptide), hematocrit, erythrocyte sedimentation rate (ESR), and CRP. When repeated measurements were available, laboratory abnormality was recorded if any one of the measurements was abnormal. Missing values were handled via the missing category method. For example, a laboratory variable was coded as either normal, abnormal, or missing. We did not pursue natural language processing of EMR free text because we aimed to develop a simple and potentially portable estimation and classification model of disease activity. Tender and swollen joint counts were not available as structured data in our system. 
For both claims and EMR data sources, the variable assessment period was the 1‐year period preceding the index date on which the first ever DAS28‐CRP was recorded in BRASS. This rule was applied to all variables, including relatively stable variables such as seropositivity. For medications, both ongoing therapy and new therapy were considered similarly as “ever use” within this 12‐month window.
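
The missing category method and the "any abnormal measurement" rule described above might be coded along the following lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the reference range used in the example (0 to 8 for CRP) is a placeholder, since the paper does not report the laboratory reference ranges it used.

```python
from typing import Dict, Optional, Sequence


def code_lab(values: Sequence[Optional[float]], low: float, high: float) -> str:
    """Collapse repeated measurements from the 1-year window into one category."""
    observed = [v for v in values if v is not None]
    if not observed:
        return "missing"          # no measurement recorded in the window
    if any(v < low or v > high for v in observed):
        return "abnormal"         # any single abnormal result counts
    return "normal"


def indicators(category: str, name: str) -> Dict[str, int]:
    """Dummy variables with 'normal' as the reference level."""
    return {
        f"{name}_abnormal": int(category == "abnormal"),
        f"{name}_missing": int(category == "missing"),
    }


# Hypothetical CRP results for three patients (reference range is a placeholder).
for results in ([3.0, 12.5], [2.0, None], []):
    category = code_lab(results, low=0.0, high=8.0)
    print(results, category, indicators(category, "crp"))
```

Coding missingness as its own level keeps every patient in the model, at the cost of letting "was the test ordered at all" act as a predictor in its own right.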

Modeling strategy

We utilized a form of supervised machine learning, adaptive least absolute shrinkage and selection operator (LASSO). LASSO is a penalized regression that prevents model overfitting by restricting the magnitude of coefficient estimates (regularization) and performs variable selection by setting some coefficient estimates to zero 17, 18. Adaptive LASSO 19 extends the original LASSO by allowing a different penalty weight for each coefficient. Our modeling approach involved several steps: 1) initial coefficient estimation with ridge regression, 2) adaptive LASSO for variable selection, and 3) final modeling. First, we obtained the absolute values of the ridge regression 20 estimates of coefficients. We constructed differential penalties based on the inverse of absolute ridge coefficient estimates 19. These differential penalties ensured that more promising potential explanatory variables were penalized to a lesser extent in the subsequent steps. Second, we ran an adaptive LASSO for variable selection. The optimal value of the overall penalty term was chosen by minimizing 10‐fold cross‐validation errors. Importantly, cross‐validation results can be dependent on the specific random split of the data when the data set is small. Therefore, we repeated 10‐fold cross‐validation 10 000 times to stabilize this process and to minimize randomness 21, 22. We combined these 10 000 models by examining the number of times each variable was chosen, and we used variables selected in at least 60% of the adaptive LASSO model fits as the final set of variables, in keeping with prior literature 23. Third, the final model was fit with multiple regression for DAS28‐CRP as a continuous outcome, or logistic regression for the dichotomized DAS28 classification (DAS28‐CRP 3.2 or greater vs DAS28‐CRP less than 3.2). We used adjusted R² to compare continuous DAS28‐CRP model fits.
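
The three steps for the continuous outcome can be illustrated with a small, dependency-free sketch. It is not the authors' implementation (they used glmnet in R): the coordinate-descent solver, the function names, the penalty grid, the ridge penalty of 0.1, and the synthetic data are all assumptions made for illustration, and the sketch uses 5 repeats of 5-fold cross-validation rather than 10 000 repeats of 10-fold so that it runs quickly.

```python
import random


def coord_descent(X, y, l1=0.0, l2=0.0, w=None, iters=50):
    """Minimise (1/2n)*||y - Xb||^2 + l1*sum(w_j*|b_j|) + (l2/2)*sum(b_j^2)."""
    n, p = len(X), len(X[0])
    w = w or [1.0] * p
    b, resid = [0.0] * p, list(y)
    z = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(iters):
        for j in range(p):
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + z[j] * b[j]
            shrunk = max(abs(rho) - l1 * w[j], 0.0) * (1.0 if rho > 0 else -1.0)
            new = shrunk / (z[j] + l2)
            if new != b[j]:
                for i in range(n):
                    resid[i] -= X[i][j] * (new - b[j])
                b[j] = new
    return b


def selection_frequency(X, y, lambdas, repeats=5, k=5, seed=0):
    """Fraction of repeated k-fold CV runs in which each variable is kept."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    ridge = coord_descent(X, y, l2=0.1)                  # step 1: ridge estimates
    w = [1.0 / max(abs(c), 1e-8) for c in ridge]         # inverse-|ridge| penalties
    counts = [0] * p
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)                                 # new random split each repeat
        folds = [idx[f::k] for f in range(k)]

        def cv_error(lam):
            err = 0.0
            for fold in folds:
                held = set(fold)
                train = [i for i in idx if i not in held]
                b = coord_descent([X[i] for i in train], [y[i] for i in train], lam, w=w)
                err += sum((y[i] - sum(X[i][j] * b[j] for j in range(p))) ** 2 for i in fold)
            return err

        best = min(lambdas, key=cv_error)                # step 2: tune overall penalty
        fit = coord_descent(X, y, best, w=w)
        for j in range(p):
            counts[j] += int(fit[j] != 0.0)
    return [c / repeats for c in counts]


# Synthetic, centred data: only the first two of four columns carry signal.
gen = random.Random(1)
X = [[gen.gauss(0, 1) for _ in range(4)] for _ in range(80)]
y = [2.0 * r[0] - 1.5 * r[1] + gen.gauss(0, 1) for r in X]
means = [sum(r[j] for r in X) / len(X) for j in range(4)]
X = [[r[j] - means[j] for j in range(4)] for r in X]
y_mean = sum(y) / len(y)
y = [v - y_mean for v in y]

freq = selection_frequency(X, y, lambdas=[0.01, 0.03, 0.1, 0.3])
selected = [j for j, f in enumerate(freq) if f >= 0.6]   # 60% inclusion rule
X_sel = [[r[j] for j in selected] for r in X]
final = coord_descent(X_sel, y)                          # step 3: unpenalised refit
print(freq, selected, final)
```

Because the data are centred once up front, the sketch omits the intercept. A binary outcome would replace the squared-error loss with a logistic one, which is left out here for brevity.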
C‐statistics were used to compare the ability of the binary DAS28‐CRP classification models to distinguish between high and low disease activity. We additionally calculated sensitivity, specificity, and correct classification rate at the threshold chosen by the Youden index 24 that aims to simultaneously maximize sensitivity and specificity. We used SAS v. 9.4 and R 3.4 [glmnet 25] for computation. For the candidate explanatory variables, we considered three increasingly larger potential variable pools to examine how simple EMR variables can improve estimation and classification based on claims variables only: 1) claims only (“Claims‐Only”), 2) claims and EMR medications (“Claims + Medications”), and 3) claims, EMR medications, and laboratory values (“Claims + Medications + Laboratory Tests” Model). For EMR variables such as the laboratory test variables, we categorized values into the normal range and the abnormal range and incorporated missing as a category.
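
For readers less familiar with the two classification metrics named above, the following sketch computes a C-statistic as the proportion of concordant positive/negative pairs and picks the cutoff that maximizes the Youden index (sensitivity + specificity − 1). The function names and the ten example probabilities are invented for illustration and are not taken from the study data.

```python
def c_statistic(y_true, prob):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = [p for t, p in zip(y_true, prob) if t == 1]
    neg = [p for t, p in zip(y_true, prob) if t == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def youden_threshold(y_true, prob):
    """Return (cutoff, sensitivity, specificity) maximising sens + spec - 1."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = None
    for cut in sorted(set(prob)):
        tp = sum(1 for t, p in zip(y_true, prob) if t == 1 and p >= cut)
        tn = sum(1 for t, p in zip(y_true, prob) if t == 0 and p < cut)
        sens, spec = tp / n_pos, tn / n_neg
        if best is None or sens + spec > best[1] + best[2]:
            best = (cut, sens, spec)
    return best


# Hypothetical predicted probabilities of high/moderate disease activity.
y_true = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]
prob = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

print(c_statistic(y_true, prob))
print(youden_threshold(y_true, prob))
```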

Participants and characteristics

We identified 300 adults with RA enrolled in BRASS with 1 year or more of linked Medicare claims preceding their initial DAS28‐CRP measurement between 2006 and 2010. Thirty rheumatologists cared for these 300 patients. The distribution of patient cluster sizes was median 3.5 (interquartile range 1‐11). A subset of 95 patients had Medicare Part D medication coverage. Table 1 shows the patient characteristics at the initial DAS28‐CRP measurement. The mean age was 68 years, 80% were female, and 92% were white. The mean duration of RA was 21 years, reflecting the nature of the prevalent RA cohort that we were able to link to the Medicare claims. The extent of missingness in EMR data was as follows: BMI 13%, blood pressure (BP) 28%, smoking 12%, rheumatoid factor (RF) 36%, ESR 49%, CRP 19%, and hematocrit 5%. The mean (SD) of DAS28‐CRP was 3.6 (1.6). The disease activity categories were as follows: 20% in high disease activity (more than 5.1), 31% in moderate disease activity (3.2‐5.1), 14% in low disease activity (2.6‐3.2), and 35% in clinical remission, here defined as less than 2.6.
Table 1

Patient characteristics at the first measurement of DAS28‐CRP

| Variable | Result |
|---|---|
| n | 300 |
| Age (Mean, SD) | 67.94 (9.72) |
| Gender ‐ female (N, %) | 241 (80.33) |
| Race (N, %) | |
| white | 276 (92.0) |
| black | 14 (4.8) |
| other | 10 (3.3) |
| DAS28‐CRP (Mean, SD) | 3.58 (1.62) |
| DAS28‐CRP Category (N, %) | |
| Clinical Remission | 105 (35.0) |
| Low Disease Activity | 41 (13.7) |
| Moderate Disease Activity | 93 (31.0) |
| High Disease Activity | 61 (20.3) |
| DMARDs (N, %) a | |
| 0 | 92 (30.7%) |
| 1 | 151 (50.3%) |
| 2 | 49 (16.3%) |
| 3 | 8 (2.7%) |
| RA Disease Duration (Mean, SD) | 20.56 (13.28) |

Abbreviation: DAS28‐CRP, Disease Activity Score in 28 joints with C‐reactive protein; DMARD, disease‐modifying antirheumatic drug.
a Number of unique DMARDs prescribed during the 1‐year variable ascertainment period.

Continuous DAS28‐CRP estimation

Claims‐only data resulted in a highly parsimonious final model with just two binary variables (Table 2). As a result, the proportion of continuous DAS28‐CRP explained (R²) was very poor at 0.03 (adjusted R² = 0.02). Models derived by an automated variable selection process may not be clinically interpretable. However, in this specific instance, the presence of laboratory tests (ever/never) for viral hepatitis and for CRP remained in the final model. CRP testing may be a surrogate for the need to assess inflammation formally, which is likely due to high disease activity. Viral hepatitis testing may herald the need to switch medications, particularly to biological DMARDs in the setting of higher disease activity or inadequate disease control.
Table 2

Variable selection and final fit results for continuous disease activity estimation

Claims OnlyClaims + MedsClaims + Meds + Labs
Data sourceVariableSelectionCoefficientSelectionCoefficientSelectionCoefficient
(Intercept)10 0000.65110 0001.02910 0000.855
ClaimsNumber of outpatient visit 000
Number of ED visits 002
Length of hospitalization 17402356881
Number of hospitalizations 000
Number of chest X‐ray 5415891
Arthrocentesis, yes/no 5617260
ANA testing, yes/no 005
BMD testing, yes/no 000
CBC testing, yes/no 5417305
Anti‐CCP testing, yes/no 204206
Metabolic panel, yes/no 5417385
HBV/HCV screening, yes/no 1808420499391.655
Chest CT/MRI, yes/no 000
Liver enzymes, yes/no 5621045872
Tuberculosis tests, yes/no 5717432693
Age at DAS28‐CRP 000
CRP, yes/no9999−0.94810 000−0.94410 000−1.298
RF, yes/no 139821764877
ESR, yes/no 172322705031
Total Number of ESR/CRP 000
Race, Black4652184104
Race, Non‐white/black618361899
Sex000
Charlson Comorbidity Index 000
Joint surgeries, yes/no 99991.68399991.56710 0001.893
Occupational therapy, yes/no 000
Physical therapy, yes/no 622042787
Total Number of RA Codes000
Part D/ EMRTotal number of DMARD use54429677−0.444
ever use of DMARD9994−0.7559959−0.117
ever use of glucocorticoids30
ever use of opioids99370.63668600.340
EMRBMI ≥300
25≤ BMI <300
BMI <18.510 00013.121
BMI Missing99750.971
Systolic BP ≥1600
Systolic BP 120‐1590
Systolic BP missing10 000−1.296
Smoking, current1
Smoking, past302
Smoking, missing5439
RF abnormal5547
RF missing1380
ESR abnormal82710.341
ESR Missing13
CRP abnormal77560.588
CRP missing5
Hematocrit abnormal70360.340
Hematocrit missing10 000−0.782

Abbreviation: ANA, antinuclear antibody; BMD, bone mineral density; BMI, body mass index; BP, blood pressure; CBC, complete blood count; CCP, cyclic citrullinated peptide; CRP, C‐reactive protein; CT, computed tomography; DMARD, disease‐modifying antirheumatic drugs; ED, emergency department; EMR, electronic medical record; ESR, erythrocyte sedimentation rate; HBV, hepatitis B virus; HCV, hepatitis C virus; Labs, laboratory test results; Meds, medications; MRI, magnetic resonance imaging; Part D, Medicare Part D prescription claims; RA, rheumatoid arthritis; RF, rheumatoid factor.

All variables, including medications and laboratory results, were as recorded within the 12‐month period preceding the DAS28‐CRP measurement.

Adding four medication‐related variables in the initial candidate variable pool (Claims + Medications) resulted in a much larger final model with 12 variables. The estimation performance improved, although the model still explained a relatively small fraction of continuous DAS28‐CRP variability (R² = 0.12, adjusted R² = 0.09). Viral hepatitis and CRP testing remained in the final model again. Tuberculosis and liver enzyme testing, which may also precede biological DMARDs, were in the final model. Glucocorticoid use and opioid use were retained in the final model, but DMARD use was not. DMARD use might have been of little value because most patients in this tertiary care center RA cohort were on DMARDs. Including further EMR variables (Claims + Medications + Laboratory Tests) resulted in a final model with 23 variables. The estimation performance improved further (R² = 0.25, adjusted R² = 0.18). Laboratory variables (CRP, ESR, RF, and hematocrit) exceeded the model inclusion threshold of 60%.
For RF, CRP, and hematocrit, both the abnormal value indicator and the missing indicator remained in the model, meaning whether a measurement was made at all was also informative of the underlying disease activity in addition to the presence of an abnormal measurement.
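
As a rough check on how the reported R² and adjusted R² relate, the usual adjustment formula can be applied with n = 300 patients and p predictors. Taking p = 2 for the Claims‐Only model (its two retained binary variables) is an assumption made here for illustration, and because the published values are rounded, the agreement is only approximate.

```latex
\[
R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1},
\qquad
R^2_{\mathrm{adj}} \approx 1 - (1 - 0.03)\,\frac{299}{297} \approx 0.02 .
\]
```

The same formula shows why the gap between R² and adjusted R² widens for the larger models: as p grows with n fixed, the factor (n − 1)/(n − p − 1) increases.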

Binary category classification

Similar to the continuous DAS28‐CRP estimation, the binary Claims‐only model resulted in a final model with just two variables: the presence of CRP testing and joint surgery (Table 3). The area under the curve (AUC) of the model was 0.61 (Figure 1). At the optimal threshold that maximizes sensitivity and specificity jointly, sensitivity was 47.4% and specificity was 74.7%. Presence of joint surgery may be understood as indicative of more active disease with severe damage.
Table 3

Variable selection and final fit results for binary disease activity classification

Claims‐OnlyClaims+ MedsClaims + Meds + Labs
Data sourceVariableSelectionCoefficientSelectionCoefficientSelectionCoefficient
(Intercept)10 0003.93710 0003.73710 0002.434
ClaimsNumber of outpatient visit 000
Number of ED visits 511900
Length of hospitalization 035200
Number of hospitalizations 000
Number of chest X‐ray 11020
Arthrocentesis, yes/no 515038580
ANA testing, yes/no 6367453
BMD testing, yes/no 000
CBC testing, yes/no 118200
Anti‐CCP testing, yes/no 489340959755−0.229
Metabolic panel, yes/no 51747205−0.5418709−0.460
HBV/HCV screening, yes/no 99980.55799990.21010 0000.571
Chest CT/MRI, yes/no 000
Liver enzymes, yes/no 513973051.30695881.108
Tuberculosis tests, yes/no 514271830.0019540−0.184
Age at DAS28‐CRP 000
CRP, yes/no9997−0.5609999−0.80110 000−0.720
RF, yes/no 600
ESR, yes/no 51927274−0.2289583−0.173
Total Number of ESR/CRP 000
Race, Black564085450.52194280.307
Race, nonwhite/black79210
Sex5034555289610.330
Charlson Comorbidity Index 000
Joint surgeries, yes/no 570372290.87997990.729
Occupational therapy, yes/no 000
Physical therapy, yes/no 51477220−0.2859438−0.217
Total Number of RA Codes000
PartD/ EMRTotal number of DMARD use7189−0.4309501−0.422
ever use of DMARD150
ever use of glucocorticoids99990.23210 0000.216
ever use of opioids99960.37497720.144
EMRBMI ≥3000
25≤ BMI <307189−0.4300
BMI <18.5150
BMI Missing99990.23295620.450
Systolic BP ≥16099960.3740
Systolic BP 120‐1591774
Systolic BP missing10 000−0.773
Smoking, current0
Smoking, past0
Smoking, missing22
RF abnormal10 0000.579
RF missing79480.393
ESR abnormal10 0000.442
ESR Missing0
CRP abnormal96240.839
CRP missing95860.691
Hematocrit abnormal74800.167
Hematocrit missing8793−0.216

Abbreviation: ANA, antinuclear antibody; BMD, bone mineral density; BMI, body mass index; BP, blood pressure; CBC, complete blood count; CCP, cyclic citrullinated peptide; CRP, C‐reactive protein; CT, computed tomography; DMARD, disease‐modifying antirheumatic drugs; ED, emergency department; EMR, electronic medical record; ESR, erythrocyte sedimentation rate; HBV, hepatitis B virus; HCV, hepatitis C virus; Labs, laboratory test results; Meds, medications; MRI, magnetic resonance imaging; Part D, Medicare Part D prescription claims; RA, rheumatoid arthritis; RF, rheumatoid factor.

All variables including medications and laboratory results were as recorded within the 12‐month period preceding the DAS28‐CRP measurement.

Figure 1

Performance of binary classification models. Cut‐off values indicate the thresholds used to dichotomize estimated probabilities of MDA/HDA into binary classifications (MDA/HDA for ≥ cut‐off and REM/LDA for < cut‐off). Abbreviation: CCR, correct classification rate; Class, model‐based classification; HDA, high disease activity; Labs, laboratory test results; LDA, low disease activity; MDA, medium disease activity; Meds, medications; NPV, negative predictive value; PPV, positive predictive value; REM, clinical remission; True, gold standard.

The inclusion of medication‐related variables (Claims + Medications) retained the initial two variables, and ever use of DMARDs and ever use of opioids remained in the model. In comparison to the corresponding continuous model, which retained 12 variables, the binary version retained only 4 variables. The AUC improved slightly to 0.68 with a sensitivity of 79.2% and specificity of 48.6% at the optimal threshold. Ever use of glucocorticoids was not retained in the final model.
The largest set of candidate variables (Claims + Medications + Laboratory Tests) resulted in a final model with 13 variables, again somewhat smaller than the continuous counterpart with 23 variables. All four variables in the previous model remained. Additionally, testing for viral hepatitis and the total number of DMARDs during the baseline period were included in the final model. From the extended pool of EMR variables, several variables were retained. For certain variables such as BP, only the missingness indicator remained, suggesting that the absence/presence status of having BP recorded was informative enough and actual recordings, when present, did not add much. For ESR and CRP values, respective abnormal value indicators were kept in the final model, grouping normal values and missing category together. The AUC was 0.76 with a sensitivity of 83.1% and specificity of 59.6%.
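
The quantities reported for each binary model (sensitivity, specificity, positive and negative predictive values, and correct classification rate) all derive from the 2×2 table obtained after applying a cutoff to the estimated probabilities. The sketch below shows that calculation on made-up values; the function name, the probabilities, and the 0.5 cutoff are hypothetical and do not reproduce the study's figures.

```python
def classification_metrics(truth, predicted):
    """Summaries of the 2x2 table of true status vs model-based class."""
    tp = sum(1 for t, c in zip(truth, predicted) if t and c)
    tn = sum(1 for t, c in zip(truth, predicted) if not t and not c)
    fp = sum(1 for t, c in zip(truth, predicted) if not t and c)
    fn = sum(1 for t, c in zip(truth, predicted) if t and not c)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "ccr": (tp + tn) / len(truth),   # correct classification rate
    }


# Hypothetical estimated probabilities of moderate/high disease activity.
probabilities = [0.20, 0.35, 0.48, 0.52, 0.61, 0.77, 0.90, 0.15]
truth = [0, 0, 1, 0, 1, 1, 1, 1]
cutoff = 0.5
predicted = [int(p >= cutoff) for p in probabilities]

metrics = classification_metrics(truth, predicted)
print(metrics)
```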

Discussion

A model that uses readily available data to estimate RA disease activity would be a valuable addition to epidemiologic population‐based studies that lack direct disease activity measures. Prior studies have demonstrated significant challenges in developing and validating these algorithms 2, 4. We attempted to build on the prior literature by adding EMR variables that should be available from routine practice and are readily extractable from medical records to claims‐based data. We also leveraged novel machine learning strategies to allow the data to drive the choice of variables. In our models, we found that adding EMR‐based information modestly improved model performance metrics. However, we were still unable to estimate disease activity in a meaningful way as a continuous measure. Our model that incorporated EMR data, medication and laboratory data, and claims to classify disease activity as a dichotomized measure did result in a reasonable C‐statistic (0.76), indicating the ability to distinguish between moderate/high vs. low disease activity with adequate certainty. These results indicate that addition of simple EMR‐derived variables to claims‐derived variables could be useful for improving classification of RA disease activity into high and moderate disease activity vs. low disease activity (correct classification rate = 71.3%) but not for accurately estimating its actual numerical values. Importantly, the continuous estimates or binary classification of RA disease activity measure will not add to confounding control if all the covariates are already included in the outcome analysis or propensity score model. However, it may serve as a summary risk score, which can be easier to handle than individual covariates in settings with a limited sample size. A potentially more useful use case of the binary classification is to use it as a stratification variable, for which a correct classification rate of 71.3% may still provide some value. 
When we are interested in how the effect of a given exposure on the outcome of interest differs by the baseline RA disease activity, stratifying the study cohort by the binary RA disease activity classification may add value beyond what individual covariates can achieve. We acknowledge several limitations in data and modeling. Ideally, the final model performance should be assessed in a data set completely independent from the entire model building and validation process. However, we did not have a test set because of the small sample size. Although adaptive LASSO is a flexible variable selection strategy, it does not attempt to explore more complex relationships between variables and DAS28‐CRP. More advanced supervised machine learning methods, such as deep neural networks 26 and random forest 27, can automatically identify interaction between variables at the cost of being less interpretable and more data‐hungry. Although certain variables included in our final combined model logically correlate with RA disease activity (such as joint surgeries, inflammatory marker elevations, anemia, number and ever use of DMARDs and opioids), our models also incorporated missing EMR data as explanatory variables. For example, presence or absence of EMR recording of systolic BP was deemed more informative than the recorded value itself. Although this is reasonable from a modeling perspective, model interpretability and portability may suffer. The next important step will be determining the degree to which this algorithm may perform in an external cohort with combined EMR‐claims data. In such external validation, variables that were important in the BRASS cohort may carry different importance. For example, the tuberculosis screening– and hepatitis screening–related variables contributed to our models. These may have heralded an impending treatment switch or intensification in our local practice and were informative of disease activity. 
However, this may not generalize to other practice settings. The C‐statistic for our Claims + Medications model was only slightly inferior compared to the Claims + Medications + Laboratory Tests model and includes variables that are logically associated with disease activity and are readily accessible in claims (inflammatory markers drawn in the prior year, joint replacement surgery, DMARD use, and opioid use). This may be a reasonable option when the goal is to distinguish moderate/high from low. In summary, we attempted to improve DAS28‐CRP estimation and classification based on claims data by utilizing easily accessible information in the EMR, resulting in a modest improvement for the binary category classification. Numerical estimation of DAS28‐CRP was unsatisfactory. External validation of our binary models in a different EMR system is an important future direction. Nonetheless, we believe the present study serves as proof of concept that we can improve our ability to classify RA disease activity in its binary form by supplementing claims data with simple EMR‐derived variables.

Author contributions

Drs. Feldman and Yoshida had full access to all of the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis.

Study conception and design

Feldman, Yoshida, Shadick, Weinblatt, Connolly, Alemao, Solomon.

Acquisition of data

Frits, Shadick, Weinblatt.

Analysis and interpretation of data

Feldman, Yoshida, Xu, Alemao, Solomon.
  17 in total

1.  Optimizing healthcare research data warehouse design through past COSTAR query analysis.

Authors:  S N Murphy; M M Morgan; G O Barnett; H C Chueh
Journal:  Proc AMIA Symp       Date:  1999

2.  Index for rating diagnostic tests.

Authors:  W J YOUDEN
Journal:  Cancer       Date:  1950-01       Impact factor: 6.860

3.  The American Rheumatism Association 1987 revised criteria for the classification of rheumatoid arthritis.

Authors:  F C Arnett; S M Edworthy; D A Bloch; D J McShane; J F Fries; N S Cooper; L A Healey; S R Kaplan; M H Liang; H S Luthra
Journal:  Arthritis Rheum       Date:  1988-03

4.  Regularization Paths for Generalized Linear Models via Coordinate Descent.

Authors:  Jerome Friedman; Trevor Hastie; Rob Tibshirani
Journal:  J Stat Softw       Date:  2010       Impact factor: 6.440

5.  Cross-validation for nonlinear mixed effects models.

Authors:  Emily Colby; Eric Bair
Journal:  J Pharmacokinet Pharmacodyn       Date:  2013-03-27       Impact factor: 2.745

6.  Overview of the Medicare and Medicaid Programs.

Authors:  Earl Dirk Hoffman; Barbara S Klees; Catherine A Curtis
Journal:  Health Care Financ Rev       Date:  2000

7.  An external validation study reporting poor correlation between the claims-based index for rheumatoid arthritis severity and the disease activity score.

Authors:  Rishi J Desai; Daniel H Solomon; Michael E Weinblatt; Nancy Shadick; Seoyoung C Kim
Journal:  Arthritis Res Ther       Date:  2015-04-13       Impact factor: 5.156

Review 8.  Treating rheumatoid arthritis to target: 2014 update of the recommendations of an international task force.

Authors:  Josef S Smolen; Ferdinand C Breedveld; Gerd R Burmester; Vivian Bykerk; Maxime Dougados; Paul Emery; Tore K Kvien; M Victoria Navarro-Compán; Susan Oliver; Monika Schoels; Marieke Scholte-Voshaar; Tanja Stamm; Michaela Stoffer; Tsutomu Takeuchi; Daniel Aletaha; Jose Louis Andreu; Martin Aringer; Martin Bergman; Neil Betteridge; Hans Bijlsma; Harald Burkhardt; Mario Cardiel; Bernard Combe; Patrick Durez; Joao Eurico Fonseca; Alan Gibofsky; Juan J Gomez-Reino; Winfried Graninger; Pekka Hannonen; Boulos Haraoui; Marios Kouloumas; Robert Landewe; Emilio Martin-Mola; Peter Nash; Mikkel Ostergaard; Andrew Östör; Pam Richards; Tuulikki Sokka-Isler; Carter Thorne; Athanasios G Tzioufas; Ronald van Vollenhoven; Martinus de Wit; Desirée van der Heijde
Journal:  Ann Rheum Dis       Date:  2015-05-12       Impact factor: 19.103

9.  Development of a health care utilisation data-based index for rheumatoid arthritis severity: a preliminary study.

Authors:  Gladys Ting; Sebastian Schneeweiss; Richard Scranton; Jeffrey N Katz; Michael E Weinblatt; Melissa Young; Jerry Avorn; Daniel H Solomon
Journal:  Arthritis Res Ther       Date:  2008-08-21       Impact factor: 5.156

10.  Methods to Develop an Electronic Medical Record Phenotype Algorithm to Compare the Risk of Coronary Artery Disease across 3 Chronic Disease Cohorts.

Authors:  Katherine P Liao; Ashwin N Ananthakrishnan; Vishesh Kumar; Zongqi Xia; Andrew Cagan; Vivian S Gainer; Sergey Goryachev; Pei Chen; Guergana K Savova; Denis Agniel; Susanne Churchill; Jaeyoung Lee; Shawn N Murphy; Robert M Plenge; Peter Szolovits; Isaac Kohane; Stanley Y Shaw; Elizabeth W Karlson; Tianxi Cai
Journal:  PLoS One       Date:  2015-08-24       Impact factor: 3.240
