
Development of a prognostic prediction model to estimate the risk of multiple chronic diseases: constructing a copula-based model using Canadian primary care electronic medical record data.

Jason E Black¹, Jacqueline K Kueper², Amanda L Terry³, Daniel J Lizotte².

Abstract

INTRODUCTION: The ability to estimate risk of multimorbidity will provide valuable information to patients and primary care practitioners in their preventative efforts. Current methods for prognostic prediction modelling are insufficient for the estimation of risk for multiple outcomes, as they do not properly capture the dependence that exists between outcomes.
OBJECTIVES: We developed a multivariate prognostic prediction model for the 5-year risk of diabetes, hypertension, and osteoarthritis that quantifies and accounts for the dependence between each disease using a copula-based model.
METHODS: We used data from 2009 onwards from the Canadian Primary Care Sentinel Surveillance Network (CPCSSN), a collection of electronic medical records submitted by participating primary care practitioners across Canada. We identified patients 18 years and older who did not already have all three outcome diseases and observed any incident diabetes, osteoarthritis, or hypertension within 5 years, resulting in a large retrospective cohort for model development and internal validation (n=425,228). First, we quantified the dependence between outcomes using unadjusted and adjusted ϕ coefficients. We then estimated a copula-based model to quantify the non-linear dependence between outcomes; this model can be used to derive risk estimates for each outcome that account for the observed dependence. Copula-based models are defined by univariate models for each outcome and a dependence function, specified by the parameter θ. Logistic regression was used for the univariate models and the Frank copula was selected as the dependence function.
RESULTS: All outcome pairs demonstrated statistically significant dependence that was reduced after adjusting for covariates. The copula-based model yielded statistically significant θ parameters in agreement with the adjusted and unadjusted ϕ coefficients. Our copula-based model can effectively be used to estimate trivariate probabilities.
DISCUSSION: Quantitative estimates of multimorbidity risk inform discussions between patients and their primary care practitioners around prevention in an effort to reduce the incidence of multimorbidity.

Keywords: CPCSSN; copula; diabetes; electronic medical records; hypertension; multimorbidity; multivariate; osteoarthritis; primary care; prognostic prediction model; risk estimation

Year: 2021 PMID: 34007897 PMCID: PMC8112224 DOI: 10.23889/ijpds.v5i1.1395

Source DB: PubMed Journal: Int J Popul Data Sci ISSN: 2399-4908

Introduction

Harnessing observational health data to improve patient care, such as through decision support tools embedded into electronic medical records (EMRs), is a topic of great interest [1, 2]. Prognostic prediction models can provide decision support through quantitative estimates of disease risk based on a patient’s individual predictors (e.g., age, sex, physical activity level) [3-5]. Understanding a patient’s risk of disease empowers prevention efforts, a hallmark of population health, by guiding decision-making processes and identifying patients at increased risk [6]. Research related to decision support at the point of care requires both methodological and clinical considerations. Methodological considerations span from data source selection and pre-processing to model development and evaluation; clinical considerations include identifying what disease(s) or aspects of clinical care could benefit from decision support and the types of information or tools that will accomplish this. There is a gap between one of the most prominent clinical challenges faced by primary care practitioners and their patients and the development of prognostic prediction models thus far. Multimorbidity, where a patient has two or more chronic diseases, is increasing in prevalence and presents several challenges in terms of identification and treatment [7, 8]. The ability to estimate a patient’s risk of multimorbidity is needed [7, 9]. Research into multimorbidity has predominantly focused on establishing patterns or clusters of multimorbidity or establishing risk factors by investigating the associations between multimorbidity and potential risk factors [10, 11]. While related to multimorbidity risk, the latter does not allow for risk estimation. Recently, there has been a focus on developing strategies to prevent multimorbidity as health policy makers and health care practitioners recognize its importance [12, 13]. 
There are many existing prognostic prediction models for individual diseases but few for multimorbidity [14, 15]. Using a series of single-disease models in a clinical setting to estimate risk of multiple diseases is not only burdensome but also may give inaccurate perceptions of risk. Methodological complexity may be a barrier to developing tools for multimorbidity risk prediction; standard off-the-shelf packages for developing prediction models are not expected to perform correctly. Prognostic prediction models are commonly developed to estimate the risk of a single disease. To estimate the risk of multimorbidity, one might combine the risks of multiple single disease models. For example, if one were interested in estimating a patient’s risk of diabetes and hypertension co-occurring, they might multiply the patient’s risk of diabetes by their risk of hypertension, giving the risk of both diseases occurring. However, this method assumes independence between the incidence of diseases, which rarely occurs. Instead, this dependence must be accounted for when estimating the risk of multiple diseases. We hypothesize that a lack of clear methodology for how to account for dependence between disease incidence is a barrier to the development of prognostic prediction models for multimorbidity and targeting this methodological gap is a necessary first step towards proper estimation of multimorbidity risk. The objective of this study is to present a methodology for prognostic prediction models that accounts for dependence between disease incidence. This is achieved in the context of Canadian primary health care, whereby we developed a prognostic prediction model that estimates the 5-year risk of diabetes, hypertension, and osteoarthritis. These diseases were selected as a case study based on their prevalence, availability of validated case-detecting algorithms [16], and clinical importance. 
To accomplish this, we first developed univariate multivariable models for each disease. We then explored the dependence between disease incidence, which led to the development of a model capable of predicting each disease and their co-occurrence while accounting for the dependence between disease incidence.

Methods

Data source

Primary care is typically the first contact for patients within the Canadian healthcare system. A patient is managed in primary care by their primary care practitioner or referred to secondary or tertiary care, depending on the level of care required [17]. Primary care is an ideal setting for the deployment of interventions aimed at reducing multimorbidity risk given the broad population it serves who typically are in earlier stages of disease compared to patients of secondary or tertiary care. All data used were derived from the Canadian Primary Care Sentinel Surveillance Network (CPCSSN) database [18]: a database containing patient information from the EMRs of primary care practices across Canada starting in 2008 [19]. Nearly 1,200 primary care practitioners voluntarily contribute deidentified records of more than 1.5 million patients. Patients provide consent via an opt-out system, where patients who do not wish to contribute their data may choose to opt-out, except in Quebec, where an opt-in process is mandated by provincial law. In 2013, CPCSSN patients were older and more likely to be female compared to the overall Canadian population as reported in census data [20], which is typical of primary care [21-23]. All structured data from the EMR are available in CPCSSN, including patient demographics, diagnoses, laboratory results, prescriptions, referrals, risk factors, medical procedures, vaccinations, and allergies. For privacy reasons, the free-text narrative where primary care practitioners record their notes is not available in CPCSSN.

Measures

Outcome

CPCSSN researchers developed and validated case-detecting algorithms for several chronic diseases to identify cases of disease within the database [16]. These case detecting algorithms were developed using published evidence and input from primary care and specialist physicians and validated by a comprehensive chart review. Validation demonstrated high sensitivity and specificity. We used CPCSSN case-detecting algorithms to identify cases of diabetes, osteoarthritis, and hypertension. Sensitivity and specificity for these case-detecting algorithms were high; see Appendix Table 1. Our use of validated disease case-detecting algorithms helps ensure that the identification of outcomes is accurate. Inaccurate outcome identification (a form of measurement error) will decrease the accuracy of risk estimates due to biased relationships between the predictors and true disease development. This poor performance would not be revealed by internal validation as the data used for validation would be subject to the same issue of inaccuracy in outcome identification as the data used to construct the model. Often only internal validation is feasible, reinforcing the importance of using a validated case-detecting algorithm for the identification of outcomes. Blinding of predictor information during outcome assessment was not possible. Predictor assessment and subsequent outcome assessment were both conducted by the primary care practitioner; thus, it is likely that primary care practitioners had some knowledge of the patient’s predictors while assessing the outcomes, which may have introduced measurement bias. However, each outcome has clearly defined diagnostic criteria; thus, the impact of this is likely minimal.

Predictors

We identified predictors for each outcome through review of relevant literature. These predictors are presented in Appendix Table 2. We then attempted to identify predictors in the CPCSSN database. We identified 5 predictors of osteoarthritis, 8 of diabetes, and 6 of hypertension; see Table 1. Where possible, we used CPCSSN validated case-detecting algorithms. Otherwise, we developed an algorithm to identify each predictor using CPCSSN data: some combination of diagnostic terms and codes; medications used for specific indications; and laboratory results. These algorithms were reviewed by a primary care practitioner to ensure accuracy. See Supplemental Appendix 1 for predictor case-detecting algorithms.

Table 1: Predictors available in CPCSSN database

Osteoarthritis	Diabetes	Hypertension
Osteoporosis	Hypertension	Older age
Previous leg injury	Older age	Diabetes
Older age	Lipid disorders	Obesity
Obesity	Obesity	Kidney disease
Female sex	Male sex	Tricyclic antidepressant (TCA) use
	Schizophrenia
	Depression
	Low socioeconomic status

We estimated each patient’s income by linking their Forward Sortation Area (FSA) to area-level income data collected by the National Household Survey conducted in 2011 [24]. Rurality was assessed based on the second digit of the FSA. As suggested in TRIPOD [25], we included all continuous risk factors in their original form. We did not transform or categorize continuous variables.
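The rurality rule can be sketched in a few lines. This is a minimal illustration, assuming the Canada Post convention in which an FSA has the form letter-digit-letter and a second character of 0 marks a rural area; the function name `is_rural_fsa` and the example codes are hypothetical and are not taken from the study's code.

```python
def is_rural_fsa(fsa: str) -> bool:
    """Return True when a Forward Sortation Area (FSA) denotes a rural area.

    Assumption: Canada Post convention, where an FSA is letter-digit-letter
    and a second character of '0' marks a rural delivery area.
    """
    code = fsa.strip().upper()
    well_formed = (
        len(code) == 3
        and code[0].isalpha()
        and code[1].isdigit()
        and code[2].isalpha()
    )
    if not well_formed:
        raise ValueError(f"not a valid FSA: {fsa!r}")
    return code[1] == "0"


# Illustrative codes only.
print(is_rural_fsa("N0M"), is_rural_fsa("N6A"))
```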

Participants

We included all patients aged 18 or older who did not have diabetes, osteoarthritis, and hypertension at baseline (i.e., we excluded patients with all 3 outcomes) and had some interaction with their primary care practitioner in 2009 or 2010 (i.e., an interaction that resulted in a billing occurrence, encounter recording or diagnosis, exam, or health condition diagnosis in the EMR). For each patient, we considered the first interaction with their primary care practitioner between 1 January 2009 and 31 December 2010 the patient’s unique start-date. We assessed the patient’s predictors at this point (including diabetes, hypertension, and osteoarthritis as one may predict another). We then noted any diagnosis of diabetes, osteoarthritis, or hypertension over the following 5 years. We included all eligible patients to maximize predictive performance.

Missing data

EMR data are collected for clinical purposes, not specifically for research use. Data are often missing from the EMR because they are not relevant for patient care, despite being highly relevant for research. We used multiple imputation to address missing data, producing 5 completed datasets. While a single point estimate is presented for each statistic, each was in fact computed once per imputed dataset; these results were then combined using Rubin’s rules [26] to create a single statistic whose variance accounts for the uncertainty of deriving an estimate from multiple imputed datasets.
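The pooling step can be illustrated with a short sketch of Rubin's rules for a scalar statistic: the pooled estimate is the mean across imputed datasets, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The function name `pool_rubin` and the numbers are illustrative assumptions, not the study's code or results.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-dataset estimates and their squared standard errors."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching estimates and variances from >= 2 datasets")
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Made-up estimates from 5 imputed datasets.
est, var = pool_rubin([0.30, 0.31, 0.29, 0.30, 0.32], [4e-4] * 5)
print(est, var ** 0.5)  # pooled estimate and its standard error
```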

Statistical analysis

To construct a prognostic prediction model for diabetes, hypertension, and osteoarthritis, we analyzed the dependence between these diseases. We selected copulas [27, 28] to model the dependence between outcomes because they account for more than two diseases, adjust for both continuous and discrete variables, and can be used to construct a prognostic prediction model. First, we constructed univariate models for each outcome then we used a copula to describe the dependence between outcomes.

Univariate multivariable logistic regression

We constructed univariate multivariable logistic regression models for each outcome. We included patients without the outcome at baseline when estimating the univariate model. For example, we used a subgroup of patients who did not have diabetes at baseline to construct the diabetes univariate model. We internally validated each univariate model by measuring its discrimination and calibration. We assessed discrimination (the ability to assign higher risk to patients who develop the outcome than to those who do not) by determining the area under the receiver operator characteristic curve (AUC). We assessed calibration (the agreement between predicted risks and observed outcome frequencies) by examining calibration plots. To investigate the potential impact of censoring (e.g., a patient changing providers), we conducted a sensitivity analysis where we required that each patient have at least one interaction with their primary care practitioner after the end of their follow-up period. We compared parameter estimates from this restricted cohort to those of the overall cohort.
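As an illustration of the discrimination measure, the AUC equals the probability that a randomly chosen patient who develops the outcome receives a higher predicted risk than a randomly chosen patient who does not, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name `auc` and the toy data are assumptions for illustration, and a rank-based implementation would be preferable for a cohort of this size because the pairwise loop is quadratic.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    # Count case/non-case pairs where the case is ranked higher.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy predicted risks and observed outcomes.
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```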

Analysis of dependence

We explored the dependence between outcomes in a pairwise fashion. For each pairwise analysis, we included patients who did not have either outcome at baseline. For example, in the analysis of diabetes and hypertension, we included patients who did not have diabetes or hypertension at baseline. We estimated the unadjusted pairwise correlation between outcomes using the ϕ coefficient (also known as the mean square contingency coefficient). The ϕ coefficient is a measure of association between two binary variables, analogous to the Pearson correlation coefficient for continuous variables [29]. In fact, estimating a Pearson correlation coefficient for two binary variables gives the ϕ coefficient [29]. We then estimated the adjusted pairwise correlation (also known as partial correlation) between outcomes using the ϕ coefficient adjusted for the predictors of both outcomes. To enable predictions that account for the dependence between outcomes, we estimated a copula-based model that captures the dependence between each outcome pair. Copula models are able to capture dependence among variables without imposing any requirements on marginal distributions of the variables; for example, the marginal distributions do not need to be Gaussian. Many parametric copula forms exist that are characterized by the structure of the dependence they can best describe. We selected the Frank copula [30] because it describes weak dependence well, consistent with the weak correlations we observed between outcome pairs. When modelling the dependence between binary variables, the copula is defined by both the parameter θ and the marginal distributions [27]. As such, we used the two-stage estimation procedure based on the composite likelihood suggested by Zhao and Joe [31] for the estimation of θ. First, we determined the marginal models using the maximum likelihood estimation procedure, yielding β estimates that we used in the second step.
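To make concrete how θ and the marginal distributions together determine the joint distribution of two binary outcomes, the following sketch evaluates the standard Frank copula and derives the four joint probabilities. It assumes the usual construction for discrete margins, in which the copula couples the Bernoulli distribution functions so that P(0, 0) = C(1 − p_k, 1 − p_l; θ); the function names and the input values are illustrative and are not taken from the study's code.

```python
import math


def frank_copula(u, v, theta):
    """Frank copula C(u, v; theta); theta -> 0 gives independence (u * v)."""
    if abs(theta) < 1e-10:
        return u * v
    num = math.expm1(-theta * u) * math.expm1(-theta * v)
    return -math.log1p(num / math.expm1(-theta)) / theta


def bivariate_pmf(p_k, p_l, theta):
    """Joint pmf of two binary outcomes with P(Y=1) equal to p_k and p_l.

    Assumption: the copula couples the Bernoulli CDFs, so
    P(Y_k=0, Y_l=0) = C(1 - p_k, 1 - p_l; theta).
    Keys are (y_k, y_l).
    """
    p00 = frank_copula(1 - p_k, 1 - p_l, theta)
    p01 = (1 - p_k) - p00   # Y_k = 0, Y_l = 1
    p10 = (1 - p_l) - p00   # Y_k = 1, Y_l = 0
    p11 = 1 - p00 - p01 - p10
    return {(0, 0): p00, (0, 1): p01, (1, 0): p10, (1, 1): p11}


# Illustrative marginal risks and theta; positive theta raises P(1, 1)
# above the product of the margins.
print(bivariate_pmf(0.2, 0.3, 0.7))
```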
From these univariate models, we estimated the probabilities for the independent occurrence of each outcome ($\hat{p}_{ij}$), by:

$$\hat{p}_{ij} = \frac{\exp(x_i^\top \hat{\beta}_j)}{1 + \exp(x_i^\top \hat{\beta}_j)}$$

where $\hat{\beta}_j$ is a vector containing the β estimates for outcome j and $x_i$ is the row for patient i of the matrix of covariate data x. Second, we obtained estimates of θ, again using the maximum likelihood estimation procedure. This process made use of the bivariate conditional distributions of each outcome pair. From these, the likelihood function was constructed. By setting the derivative of the log likelihood function (known as the score function, s) equal to zero, we estimated θ. In general form:

$$s(\theta) = \sum_i \frac{\partial}{\partial \theta} \log P(Y_{ik}, Y_{il}; \theta) = 0$$

where each bivariate probability $P(Y_{ik}, Y_{il}; \theta)$ is built from C, the copula function, evaluated at the marginal probabilities; $\partial C / \partial \theta$ is the derivative of the copula function, which enters through the chain rule; $\hat{p}_{ik}$ and $\hat{p}_{il}$ are estimated probabilities of disease k and l for patient i based on their univariate models, respectively; and $Y_{ik}$ and $Y_{il}$ are the observed disease outcomes for patient i. A dependence structure using copulas is completely specified by its univariate models and copula, which is specified by its θ estimate. For each disease pair, we estimated the parameter θ and bootstrapped confidence intervals using the percentile method [32] and 1,000 replicates. Additionally, we tested the null hypothesis that the observed outcome frequencies are no different than what would be expected under independence [27] using a hypothesis test based on the score test, with test statistic z. We rejected the null hypothesis if z is larger in absolute value than a critical value derived from the standard Normal distribution, denoted N(0, 1).

Based on these copula models, trivariate probabilities that account for the dependence between outcomes can be estimated; that is, the probabilities of each combination of diseases will be estimated. Each trivariate probability can be described as a probability mass function, $p_{klm}(y_k, y_l, y_m)$. Bivariate probability mass functions can be used to describe the marginal distributions of the trivariate probability mass functions:

$$p_{kl}(y_k, y_l) = \sum_{y_m \in \{0, 1\}} p_{klm}(y_k, y_l, y_m)$$

Similar expressions are true for $p_{km}$ and $p_{lm}$.

Based on $p_{kl}$, $p_{km}$, and $p_{lm}$ as estimated by the copula model, trivariate probability mass functions ($p_{klm}$) can be found such that their bivariate distributions match the specified bivariate marginal distributions. In fact, there may be many trivariate probability mass functions whose bivariate marginals match the specified bivariate distributions. We chose the trivariate distribution with the highest entropy (highest uncertainty), as this gives the most conservative estimate in terms of the model’s predictions. To find the trivariate distribution with maximum entropy, we first note that it is possible to define the space of all trivariate probability mass functions that satisfy the bivariate constraints using a single free parameter (see Supplementary Appendix 2). Therefore, to find the distribution with maximum entropy, we first determined the permitted bounds of this parameter, such that all estimated probabilities fall in the range 0 to 1, and we then defined the entropy of a potential solution distribution as a function of the parameter [33]. By searching over its possible values, we found the distribution that maximizes entropy. Given the resulting trivariate distribution, all joint probabilities of disease incidence can be estimated; i.e., the risk of developing any combination of diabetes, hypertension, and osteoarthritis within a 5-year window can be estimated. We have included code to estimate the copula-based model, perform hypothesis testing, and estimate trivariate probabilities using the copula-based model; see Supplementary Appendix 3.
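The maximum-entropy step can be sketched as follows. This is an illustration under stated assumptions rather than the code in Supplementary Appendix 3: it parameterizes the one-dimensional family of admissible trivariate distributions by t = P(1, 1, 1), which is one possible choice of free parameter and not necessarily the one used in Supplementary Appendix 2, and it locates the maximum by a simple grid search. The function name, argument names, and input values are hypothetical.

```python
import math


def max_entropy_trivariate(p1, p2, p3, p12, p13, p23, steps=10_000):
    """Maximum-entropy joint pmf of three binary outcomes.

    p1..p3 are P(Y_j = 1); p12, p13, p23 are pairwise P(Y_j = 1, Y_k = 1).
    Assumption: the free parameter is t = P(1, 1, 1); the other seven
    cells follow from the univariate and bivariate margins.
    """
    def pmf(t):
        return {
            (1, 1, 1): t,
            (1, 1, 0): p12 - t,
            (1, 0, 1): p13 - t,
            (0, 1, 1): p23 - t,
            (1, 0, 0): p1 - p12 - p13 + t,
            (0, 1, 0): p2 - p12 - p23 + t,
            (0, 0, 1): p3 - p13 - p23 + t,
            (0, 0, 0): 1 - p1 - p2 - p3 + p12 + p13 + p23 - t,
        }

    def entropy(t):
        return -sum(p * math.log(p) for p in pmf(t).values() if p > 0)

    # Bounds on t that keep every cell probability non-negative.
    lo = max(0.0, p12 + p13 - p1, p12 + p23 - p2, p13 + p23 - p3)
    hi = min(p12, p13, p23, 1 - p1 - p2 - p3 + p12 + p13 + p23)
    if lo > hi:
        raise ValueError("bivariate margins admit no trivariate pmf")
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return pmf(max(grid, key=entropy))


# Illustrative margins only (not study estimates).
joint = max_entropy_trivariate(0.05, 0.32, 0.11, 0.03, 0.008, 0.05)
print(joint[(1, 1, 1)])
```

Because entropy is concave in t, a grid search is adequate for illustration; a bracketing one-dimensional optimizer would give the same answer with fewer evaluations.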

Results

Descriptive statistics

We followed, for 5 years, a cohort of 425,228 adult patients who received care between 1 January 2009 and 31 December 2010 and who did not have multimorbid diabetes, hypertension, and osteoarthritis at baseline (i.e., they had at most two of these three conditions). Figure 1 details the flow of patients into the cohort.

Figure 1: Cohort based on CPCSSN database

At baseline, the majority of patients were female (58%) and had a body mass index (BMI) greater than 25 kg/m² (64%); the median age was 49 years (interquartile range: 34 to 59). For BMI, the most recent value before baseline was used. For a detailed description of all patient characteristics, see Appendix Table 3. After 5 years, hypertension was the most commonly acquired outcome (n=39,882; incidence proportion of 9.4%), followed by diabetes (n=18,769; 4.4%), then osteoarthritis (n=12,803; 3.0%). For each predictor found in the CPCSSN database, we assessed its face validity by comparing its prevalence in CPCSSN during 2009 and 2010 with national averages from 2010 (data not shown). The prevalences of polycystic ovarian syndrome and alcohol use disorder were much lower than national averages; we did not include these predictors in our analysis. Additionally, family history data were not collected in several networks; thus, we did not include family history in our analysis. The following predictors were missing to some degree: smoking information, sex, BMI, age, and income (Table 2). We used multiple imputation by chained equations to account for missing data in sex, BMI, age, and income. We did not impute smoking information due to its high degree of missingness.

Table 2: Predictors with missing data

	Development set (n = 265,228)		Validation set (n = 160,000)

	n missing	%	n missing	%
Smoking	247,918	93%	149,401	93%
Sex	44	0.02%	25	0.02%
BMI	175,632	66%	105,768	66%
Age	167	0.06%	92	0.06%
Income	13,824	5.2%	8,579	5.4%

BMI: body mass index

Many patients were missing BMI values. We could not determine the reason why patients were missing BMI values; however, the distribution among patients with BMI values was similar to that of the Canadian population (Appendix Table 4). We examined the kernel density distribution of imputed BMI values compared to known BMI values: all imputed BMI values were within a reasonable range of values (Appendix Figure 1).

Univariate results

Univariate results are displayed in Table 3. We found that all predictors were associated with the corresponding outcome. Each univariate model displayed strong discrimination and moderate calibration (see Appendix Figures 2a-c for calibration plots). Sensitivity analyses revealed that censoring was not a concern: model estimates based on a cohort restricted to patients with at least one interaction with their primary care practitioner after the follow-up period (n=315,859) were similar to those of the overall cohort (Appendix Table 5).

Table 3: Univariate logistic regression models

	Reference category/units	β estimate	95% CI	Odds ratio	95% CI
Diabetes univariate model (AUC = 0.85)
Hypertension	No	Reference		Reference
	Yes	0.3	0.26 to 0.35	1.35	1.30 to 1.42
Age	(Years)	0.04	0.03 to 0.04	1.04	1.03 to 1.04
Lipid disorders	No	Reference		Reference
	Yes	1.69	1.64 to 1.73	5.42	5.16 to 5.87
BMI	(kg/m²)	0.07	0.07 to 0.08	1.07	1.07 to 1.08
Sex	Male	Reference		Reference
	Female	-0.3	-0.34 to -0.26	0.74	0.71 to 0.77
Schizophrenia	No	Reference		Reference
	Yes	0.63	0.51 to 0.75	1.88	1.67 to 2.12
Depression	No	Reference		Reference
	Yes	0.14	0.08 to 0.20	1.15	1.08 to 1.22
Income	($10,000)	-0.89	-1.15 to -0.64	0.41	0.32 to 0.53
Hypertension univariate model (AUC = 0.84)
Diabetes	No	Reference		Reference
	Yes	0.18	0.12 to 0.23	1.19	1.13 to 1.26
Age	(Years)	0.07	0.06 to 0.07	1.07	1.06 to 1.07
BMI	(kg/m²)	0.06	0.06 to 0.07	1.06	1.06 to 1.07
Chronic Kidney Disease	No	Reference		Reference
	Yes	0.8	0.74 to 0.85	2.22	2.09 to 2.35
Tricyclic Antidepressant Use	No	Reference		Reference
	Yes	0.55	0.49 to 0.62	1.74	1.63 to 1.86
Osteoarthritis univariate model (AUC = 0.83)
Age	(Years)	0.06	0.05 to 0.06	1.06	1.05 to 1.06
Sex	Male	Reference		Reference
	Female	0.22	0.17 to 0.27	1.25	1.19 to 1.31
BMI	(kg/m²)	0.04	0.03 to 0.04	1.04	1.04 to 1.05
Previous Leg Injury	No	Reference		Reference
	Yes	1.6	1.52 to 1.68	4.94	4.57 to 5.35
Osteoporosis	No	Reference		Reference
	Yes	0.9	0.83 to 0.98	2.47	2.29 to 2.66

AUC: area under the receiver operator characteristic curve; BMI: body mass index;

CI: confidence interval.
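The odds ratios in Table 3 are the exponentiated β estimates, and the same transformation applies to the confidence limits. A minimal sketch follows; the function name `odds_ratio` is hypothetical. Because the published β estimates are rounded to two decimals, recomputed odds ratios may differ slightly from the tabulated values for some rows.

```python
import math


def odds_ratio(beta, ci_low, ci_high):
    """Convert a log-odds coefficient and its CI to the odds-ratio scale."""
    return tuple(round(math.exp(x), 2) for x in (beta, ci_low, ci_high))


# Schizophrenia row of the diabetes model in Table 3.
print(odds_ratio(0.63, 0.51, 0.75))  # (1.88, 1.67, 2.12)
```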


Dependence analysis

We estimated the unadjusted and adjusted correlation between each outcome pair. All pairs were positively correlated (Table 4a). Diabetes and hypertension displayed the highest correlation, followed by hypertension and osteoarthritis, then diabetes and osteoarthritis. This ordering persisted after adjusting for predictors, though the correlations were smaller in magnitude (Table 4b).

Table 4a: Unadjusted correlation (ϕ coefficients)

	Diabetes	Hypertension	Osteoarthritis
Diabetes	1
Hypertension	0.240 (0.238 to 0.246, p < 0.0001)	1
Osteoarthritis	0.098 (0.093 to 0.102, p < 0.0001)	0.209 (0.205 to 0.213, p < 0.0001)	1
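For reference, a ϕ coefficient such as those in Table 4a can be computed directly from the four cell counts of a 2×2 table of two binary outcomes. The sketch below uses made-up counts, not study data, and the function name `phi_coefficient` is hypothetical.

```python
import math


def phi_coefficient(n11, n10, n01, n00):
    """Phi coefficient from the four cell counts of a 2x2 table.

    n11: both outcomes present; n10 / n01: only the first / second
    outcome present; n00: neither present.
    """
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    denom = math.sqrt(row1 * row0 * col1 * col0)
    if denom == 0:
        raise ValueError("phi is undefined when a margin is empty")
    return (n11 * n00 - n10 * n01) / denom


# Made-up counts for illustration.
print(round(phi_coefficient(30, 70, 120, 780), 3))
```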

Table 4b: Adjusted correlation (partial correlation)

	Diabetes	Hypertension	Osteoarthritis
Diabetes	1
Hypertension	0.132 (0.128 to 0.137, p < 0.0001)	1
Osteoarthritis	0.038 (0.034 to 0.042, p < 0.0001)	0.123 (0.118 to 0.127, p < 0.0001)	1

We estimated copulas for each outcome pair (Table 4c). Hypothesis testing demonstrated a significant positive dependence between all outcome pairs after adjusting for risk factors.

Table 4c: Adjusted dependence (θ estimates)

	Diabetes	Hypertension
Diabetes
Hypertension	0.677 (0.566 to 0.788, p < 0.0001)
Osteoarthritis	0.683 (0.526 to 0.841, p < 0.0001)	1.949 (1.822 to 2.076, p < 0.0001)

To demonstrate the use of our model, we estimated the trivariate probabilities for a simulated patient in two ways: accounting for the dependence between outcomes using the copula model, and ignoring the dependence by multiplying the probabilities from each univariate model (Table 5). Risk estimates differed between these approaches, demonstrating the need to account for the dependence between outcomes.

Table 5: Trivariate probabilities for simulated patient

P(Diabetes, Hypertension, Osteoarthritis)	Based on copula model	Based on independence assumption	Ratio (copula / independence)
P(0,0,0)	0.6088	0.5798	1.05
P(0,0,1)	0.0481	0.0665	0.72
P(0,1,0)	0.2362	0.2633	0.90
P(1,0,0)	0.0466	0.0302	1.54
P(0,1,1)	0.0282	0.0371	0.76
P(1,0,1)	0.0026	0.0043	0.61
P(1,1,0)	0.0239	0.0169	1.42
P(1,1,1)	0.0055	0.0019	2.84

Simulated patient: 79-year-old woman whose BMI is 34 kg/m² with an income of roughly $35,000 and free of any other risk factors.
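The independence-based calculation amounts to multiplying, for each combination of outcomes, the univariate probability of each disease being present or absent. A minimal sketch, using hypothetical marginal risks rather than the simulated patient's actual univariate estimates (which are not reported here):

```python
from itertools import product


def independent_joint(p_d, p_h, p_o):
    """Joint pmf of three binary outcomes assuming they are independent.

    Keys are (diabetes, hypertension, osteoarthritis) indicators.
    """
    margins = (p_d, p_h, p_o)
    joint = {}
    for combo in product((0, 1), repeat=3):
        prob = 1.0
        for y, p in zip(combo, margins):
            prob *= p if y == 1 else 1 - p
        joint[combo] = prob
    return joint


# Hypothetical 5-year risks for illustration.
joint = independent_joint(0.05, 0.30, 0.10)
print(joint[(1, 1, 1)], joint[(0, 0, 0)])
```

Dividing a copula-based probability by the corresponding entry of this table gives a ratio of the kind shown in the last column of Table 5.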

Discussion

We developed and internally validated univariate models for diabetes, hypertension, and osteoarthritis based on EMR records. All models were highly discriminative and moderately calibrated. We then explored the dependence between each outcome by estimating the unadjusted and adjusted correlation in a pairwise fashion. All outcome pairs were positively correlated. After adjusting for predictors, outcome pairs remained positively correlated with reduced magnitudes. Finally, we estimated a copula-based model that describes the dependence between outcomes while enabling risk predictions. Existing research for multimorbidity risk prediction includes four areas. First, establishing risk factors for multimorbidity that can be used to identify high-risk patients. Many studies have found that older age, female gender, and lower socioeconomic status are associated with multimorbidity [34]. While our model was constructed for prediction purposes, rather than to derive causal inferences, these factors were all included in our model and found to be predictive of diabetes, hypertension, or osteoarthritis. Second, two prognostic prediction models have been developed for the onset of the first of several possible chronic disease outcomes. Ng et al. (2020) developed a model from national survey data linked with provincial health administrative data in Canada that is primarily intended for population-level predictions to aid health policy makers [35]; May et al. (2019) developed a model with primary care clinical data in the United States that is intended for implementation with EMRs for individual patient-level predictions [36]. These models require the absence of all possible outcome diseases at baseline (6 and 10, respectively) and predict the first instance of any of the diseases, which may signal the beginning of progression towards multimorbidity. 
In contrast, our methods account for dependence between diseases and allow for the presence of some of the outcome conditions at baseline such that predictions may be made for individuals who are further along in the natural history of diseases. A third line of research relevant to multimorbidity prediction includes Bayesian networks rather than regression-based models [37-39]. Lappenschaar et al. developed multilevel Bayesian network methodology and applied it to explore cardiovascular multimorbidity from primary care data in the Netherlands [38]. Their methodology allows for predicting multiple outcomes, explicit modelling of interactions and dependence between variables, formally incorporating domain knowledge, and accounting for practice-level variation which is commonly present in large health databases. In contrast to our regression-based methods that estimate conditional probability distributions with predictions based on all variables in a parametric model, multilevel Bayesian networks model a joint probability distribution and make predictions based on variables in the Markov blanket of the outcome(s) of interest. Lappenschaar et al. extended their models to include changes over time through multilevel temporal Bayesian networks [37]. While this methodology has the potential for individual risk prediction, it has not been evaluated in that setting; the main focus of the work thus far was to understand interactions between diseases and the progression of multimorbidity over time and to predict future rates of multimorbidity at a group level. Finally, Wang et al. (2014) developed a multitask machine learning framework for EMR-based multiple disease prediction [40]. Their framework includes learning groupings of common risk factors across the outcome diseases, which serve as high-level latent predictors to use instead of raw EMR features, and learning regression coefficients to weight these groupings.
The resulting model can be used both for risk prediction and to explore the groupings to identify potential shared or unique risk factors across outcome diseases. A case study with chronic obstructive pulmonary disease and congestive heart failure found the multitask learning framework had better AUCs than a single-outcome dimensionality reduction approach (Principal Component Analysis) and similar performance to a logistic regression-based approach.

Strengths and limitations

Our analysis was limited by the availability of risk factor information within the EMRs. Data such as behavioral or environmental factors are not typically collected during a clinical encounter and thus are not stored in the EMR. As such, the univariate models likely underestimated risk among patients who possess an unavailable risk factor. For the dependence analysis, the observed dependence might have been influenced by an unavailable risk factor that could not be adjusted for. Such a factor could act in either direction; a risk factor could increase or decrease the observed dependence between the outcomes, thus the true dependence could be less than or greater than what we observed. Our analysis may be subject to bias introduced by patterns in physicians’ diagnosis of diabetes, hypertension, and osteoarthritis. For example, when a physician diagnoses a patient with diabetes, they will likely assess for related conditions that may have otherwise gone undetected, such as hypertension. This may explain some of the dependence that we observed between disease pairs. The CPCSSN case definition we used to identify patients with diabetes identifies both type 1 and type 2 diabetes but does not distinguish between the two. However, type 1 cases typically constitute the minority of diabetes cases (10%, [41]) and are more commonly diagnosed in children [42]; thus, incident cases of diabetes that we observed in adults were more likely type 2. Risk factor information for type 1 diabetes (i.e., genetic factors [43]) was not available; however, risk factors for type 2 diabetes (e.g., age, sex, obesity, income [44]) were available and included in the model. Indeed, our model estimates risk of diabetes (type 1 or type 2) based on risk factors for type 2 diabetes. The same issue discerning type 1 from type 2 diabetes exists when treating diabetes as a predictor for hypertension.
Because the majority of patients with diabetes at baseline likely have type 2 diabetes, the association between diabetes and incident hypertension will largely be determined by these patients. Any patients with type 1 diabetes at baseline will essentially be assigned the risk of a patient with type 2 diabetes at baseline. If the true risk of hypertension differs between patients with type 1 and type 2 diabetes, there may be some misspecification of risk. The CPCSSN validated case-detecting algorithm for osteoarthritis has lower sensitivity than the algorithms for the other conditions. Assuming misclassified osteoarthritis cases are similar to correctly classified cases among truly positive cases, this may reduce the strength of associations that are observed between the predictors and osteoarthritis and result in underestimated risk. However, the difference in sensitivity is small, thus any underestimation in risk would be expected to be small. Ideally, a prognostic prediction model should be deployed in the same setting that it was developed [14]. Our use of CPCSSN data strongly positions our model for deployment in the Canadian primary care setting, especially among physicians who submit data to CPCSSN. Use in new settings requires model ‘updating’ using data from the new setting [45]. This is also ideal operationally, as no additional measures beyond those already collected in the physician’s EMR were used in development; thus, no additional measures are required when applying our model to a patient in practice. A future direction could be to pilot test implementation of our model in CPCSSN-contributing settings to passively operate in the background of a physician’s EMR, flagging patients whose estimated risk is above some specified risk threshold.
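The background flagging idea described above can be illustrated with a minimal sketch. It assumes a bivariate simplification of the trivariate model: two marginal 5-year risks (as would come from the logistic regression models) are combined through a Frank copula with dependence parameter θ, and a patient is flagged when the joint risk of both diseases meets a threshold. The function names, the example θ, and the threshold are hypothetical and are not taken from the paper.

```python
import math
from typing import Iterable, List, Tuple


def frank_copula(u: float, v: float, theta: float) -> float:
    """Bivariate Frank copula C(u, v; theta). theta = 0 means independence."""
    if abs(theta) < 1e-12:
        return u * v
    num = math.expm1(-theta * u) * math.expm1(-theta * v)
    return -math.log1p(num / math.expm1(-theta)) / theta


def joint_risk(p1: float, p2: float, theta: float) -> float:
    """P(both diseases occur) for two binary outcomes with marginal risks p1, p2.

    For a binary outcome, F(0) = 1 - p, so P(neither) = C(1 - p1, 1 - p2)
    and P(both) = p1 + p2 - 1 + P(neither).
    """
    return p1 + p2 - 1.0 + frank_copula(1.0 - p1, 1.0 - p2, theta)


def flag_patients(
    patients: Iterable[Tuple[str, float, float]],
    theta: float,
    threshold: float,
) -> List[str]:
    """Return ids of patients whose joint risk is at or above the threshold.

    Each patient is (id, marginal risk of disease 1, marginal risk of disease 2).
    The threshold is an illustrative, user-chosen value.
    """
    return [pid for pid, p1, p2 in patients if joint_risk(p1, p2, theta) >= threshold]
```

With θ = 0 the joint risk reduces to the product of the marginal risks; a positive θ raises it above that product, which is the sense in which accounting for dependence changes which patients would be flagged.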

Conclusion

Prevention efforts are needed to mitigate the increasing population health burden of multimorbidity. Quantitative estimates of risk can play a valuable role by providing a means to better understand potential future health trajectories and to foster discussions between patients and their primary care practitioners about appropriate preventative measures. Our research presents a model that can be used to provide such risk estimates while understanding and accounting for the dependence that exists between outcomes. The methods described above should be considered whenever predicting multiple outcomes where there may be some dependence between diseases. Further research will determine how best to incorporate this model into primary care practitioners’ clinical workflow and assess its real-world performance.

53 in total

Review 1. Ambient air pollution: an emerging risk factor for diabetes mellitus.

Authors: Xiaoquan Rao; Jessica Montresor-Lopez; Robin Puett; Sanjay Rajagopalan; Robert D Brook
Journal: Curr Diab Rep Date: 2015-06 Impact factor: 4.810

2. Association of leg-length inequality with knee osteoarthritis: a cohort study.

Authors: William F Harvey; Mei Yang; Theodore D V Cooke; Neil A Segal; Nancy Lane; Cora E Lewis; David T Felson
Journal: Ann Intern Med Date: 2010-03-02 Impact factor: 25.391

3. The diabetes risk score: a practical tool to predict type 2 diabetes risk.

Authors: Jaana Lindström; Jaakko Tuomilehto
Journal: Diabetes Care Date: 2003-03 Impact factor: 19.112

4. Exploring joint disease risk prediction.

Authors: Xiang Wang; Fei Wang; Jianying Hu; Robert Sorrentino
Journal: AMIA Annu Symp Proc Date: 2014-11-14

5. Depression is associated with decreased blood pressure, but antidepressant use increases the risk for hypertension.

Authors: Carmilla M M Licht; Eco J C de Geus; Adrie Seldenrijk; Hein P J van Hout; Frans G Zitman; Richard van Dyck; Brenda W J H Penninx
Journal: Hypertension Date: 2009-02-23 Impact factor: 10.190

6. Sex differences in the use of health care services.

Authors: C A Mustard; P Kaufert; A Kozyrskyj; T Mayer
Journal: N Engl J Med Date: 1998-06-04 Impact factor: 91.245

7. Representativeness of patients and providers in the Canadian Primary Care Sentinel Surveillance Network: a cross-sectional study.

Authors: John A Queenan; Tyler Williamson; Shahriar Khan; Neil Drummond; Stephanie Garies; Rachael Morkem; Richard Birtwhistle
Journal: CMAJ Open Date: 2016-01-25

8. Multilevel temporal Bayesian networks can model longitudinal change in multimorbidity.

Authors: Martijn Lappenschaar; Arjen Hommersom; Peter J F Lucas; Joep Lagro; Stefan Visscher; Joke C Korevaar; François G Schellevis
Journal: J Clin Epidemiol Date: 2013-09-12 Impact factor: 6.437

9. Framingham risk score and prediction of lifetime risk for coronary heart disease.

Authors: Donald M Lloyd-Jones; Peter W F Wilson; Martin G Larson; Alexa Beiser; Eric P Leip; Ralph B D'Agostino; Daniel Levy
Journal: Am J Cardiol Date: 2004-07-01 Impact factor: 2.778

10. Risk factors for osteoarthritis and contributing factors to current arthritic pain in South Korean older adults.

Authors: Kyoung Min Lee; Chin Youb Chung; Ki Hyuk Sung; Seung Yeol Lee; Sung Hun Won; Tae Gyun Kim; Young Choi; Soon Sun Kwon; Yeon Ho Kim; Moon Seok Park
Journal: Yonsei Med J Date: 2015-01 Impact factor: 2.759
