Chongliang Luo1,2, Md Nazmul Islam3, Natalie E Sheils3, John Buresh3, Martijn J Schuemie4, Jalpa A Doshi5,6, Rachel M Werner5,6,7, David A Asch5,6, Yong Chen1,6. 1. Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA. 2. Division of Public Health Sciences, Washington University School of Medicine in St. Louis, St Louis, Missouri, USA. 3. OptumLabs, Minnetonka, Minnesota, USA. 4. Janssen Research and Development LLC, Titusville, New Jersey, USA. 5. Division of General Internal Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA. 6. Leonard Davis Institute of Health Economics, Philadelphia, Pennsylvania, USA. 7. Cpl Michael J Crescenz VA Medical Center, Philadelphia, Pennsylvania, USA.
Abstract
OBJECTIVE: To develop a lossless distributed algorithm for the generalized linear mixed model (GLMM) with application to privacy-preserving hospital profiling. MATERIALS AND METHODS: The GLMM is often fitted to implement hospital profiling, using clinical or administrative claims data. Because of privacy regulations on individual patient data (IPD) and the computational complexity of GLMM, a distributed algorithm for hospital profiling is needed. We develop a novel distributed penalized quasi-likelihood (dPQL) algorithm to fit GLMM when only aggregated data, rather than IPD, can be shared across hospitals. We also show that the standardized mortality rates, which are often reported as the results of hospital profiling, can be calculated distributively without sharing IPD. We demonstrate the applicability of the proposed dPQL algorithm by ranking 929 previously studied hospitals on coronavirus disease 2019 (COVID-19) mortality or referral to hospice. RESULTS: The proposed dPQL algorithm is mathematically proven to be lossless, that is, it obtains results identical to those obtained if IPD were pooled from all hospitals. In the example of hospital profiling for COVID-19 mortality, the dPQL algorithm converged in only 5 iterations, and the estimates of fixed effects, random effects, and mortality rates were identical to those of the PQL fitted to pooled data. CONCLUSION: The dPQL algorithm is lossless, privacy-preserving, and fast-converging for fitting GLMM. It provides a suitable and convenient distributed approach for hospital profiling.
Decades of health services research have revealed that the outcomes hospitalized patients achieve depend considerably on where they are admitted. Hospital profiling allows a quantitative assessment of the quality of hospital care that may help patients decide which hospital to use, or may guide how those hospitals are accredited or paid. Studying cross-hospital variation in care helps identify reasons for that variation with the aim of improving care for all. Such profiling across hospitals is usually conducted by analyzing clinical or administrative insurance claims data, always considering what factors to adjust for statistically: for example, patient characteristics like sociodemographic or medical conditions, hospital characteristics like volume or academic status, or social or community characteristics like area-level poverty or uninsurance levels. In a recent article on hospital profiling for coronavirus disease 2019 (COVID-19) mortality, Asch et al ranked the performance of 929 hospitals after adjusting for patient characteristics, including age, sex, Elixhauser comorbidities, and insurance type, and hospital characteristics, including number of beds, number of intensive care unit beds, urban/nonurban setting, geographic region, profit status, and academic affiliation. Research of this kind helps untangle what are often separate contributors to good patient outcomes and is essential for identifying ways to improve those outcomes.

Recent years have seen the development of statistical methodologies for hospital profiling. A commonly used model is the generalized linear mixed model (GLMM), which assumes common fixed effects of covariates (for example, patient- and hospital-level factors) and hospital-specific random effects (that is, random intercepts) on the clinical outcome of interest. Based on the estimated fixed and random effects, the risk-standardized event rates (RSERs) can be calculated for each site.
GLMM estimation, though complicated, can be obtained by methods such as Gauss-Hermite approximation of the integrated likelihood, Monte Carlo-based approaches, and the penalized quasi-likelihood (PQL) approach. For example, Drye et al studied the in-hospital and 30-day mortality rates of acute myocardial infarction (AMI), heart failure (HF), and pneumonia for more than 3000 hospitals using Medicare claims data from the Centers for Medicare and Medicaid Services (CMS). Asch et al studied COVID-19 mortality or discharge to hospice in 929 hospitals using the UnitedHealth Group Clinical Discovery Portal. Both investigations were based on a large integrated database, where patient-level data from multiple hospitals were available in a single dataset. But often such integrated datasets are not available. Indeed, an important limitation of most other investigations is that they rely on datasets from single institutions and so are smaller, more homogeneous, and less representative of broader populations.

Ideally, if individual patient-level datasets from multiple payers and institutes could be shared, the profiling methods could be applied to a larger and more general study population. However, individual patient-level data are typically protected by privacy regulations, and sharing of individual patient data (IPD) is difficult. To extend hospital profiling to cover a wider spectrum of patient populations, privacy-preserving distributed algorithms can be used. Specifically, when fitting GLMM, the distributed algorithm is expected to require only aggregated data (AD) from each hospital (often iteratively) while still obtaining accurate estimates of the model parameters, and therefore accurate estimates of RSERs. Recently, Zhu et al proposed a distributed algorithm based on Expectation–Maximization (EM) that involves the Metropolis-Hastings algorithm.
However, it is well known that the EM algorithm usually takes many iterations to converge: the distributed EM algorithm of Zhu et al requires 500 to 1000 iterations to converge. As a result, it also requires many rounds of data communication between institutes.

This article aims to fill this methodological gap by proposing a novel distributed algorithm to fit GLMM that is lossless (ie, it obtains results identical to those obtained if the IPD were pooled from all hospitals), computationally stable, and, importantly, requires only a few rounds of communication of AD across institutes. The algorithm is based on the PQL approach and a recently developed distributed algorithm for the linear mixed model (LMM). We demonstrate the applicability of the proposed distributed PQL (dPQL) algorithm by hospital profiling for COVID-19 mortality or referral to hospice using data from 929 hospitals that have been previously studied by Asch et al.
METHODS
Fitting GLMM via penalized quasi-likelihood
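GLMM is an extension of GLM with random effects. We introduce the notation of GLMM in the context of hospital profiling. Assume there are $K$ hospitals with numbers of patients $n_1, \dots, n_K$, and the total number of patients is $N = \sum_{i=1}^{K} n_i$. For subject $j$ at hospital $i$, we denote by $y_{ij}$ the outcome, by $x_{ij}$ the $p$-dimensional covariates with fixed effects $\beta$, and by $u_i$ the random effect (ie, random intercept), $i = 1, \dots, K$, $j = 1, \dots, n_i$. Conditional on the covariates $x_{ij}$ and random effects $u_i$, the $y_{ij}$ are assumed to be independent observations with means and variances specified by a GLM. Specifically,

$$E(y_{ij} \mid x_{ij}, u_i) = \mu_{ij}, \qquad g(\mu_{ij}) = \eta_{ij} = x_{ij}^\top \beta + u_i, \tag{1}$$

$$\mathrm{Var}(y_{ij} \mid x_{ij}, u_i) = \phi\, v(\mu_{ij}), \tag{2}$$

where $g(\cdot)$ is the link function that connects the conditional means $\mu_{ij}$ to the linear predictor $\eta_{ij}$, $v(\cdot)$ is the variance function, and $\phi$ is a dispersion parameter ($\phi = 1$ for binary and Poisson outcomes). The random effects $u_i$ are assumed to follow a normal distribution with mean 0 and variance θ. We note that the above model dictated by Equations (1) and (2) could be extended to hierarchical models as in George et al for more flexibility; for example, the covariates $x_{ij}$ can include hospital-level characteristics (eg, the (log) volume of a hospital) and the variance of the random effects could also depend on the hospital-level characteristics.

Standard estimation of the GLMM parameters is based on maximizing the integrated quasi-likelihood

$$\exp\{ql(\beta, \theta)\} \propto \theta^{-K/2} \int \exp\left\{ -\frac{1}{2\phi} \sum_{i=1}^{K} \sum_{j=1}^{n_i} d_{ij}(y_{ij}; \mu_{ij}) - \frac{1}{2\theta} \sum_{i=1}^{K} u_i^2 \right\} du_1 \cdots du_K,$$

where

$$d_{ij}(y, \mu) = -2 \int_{y}^{\mu} \frac{y - t}{v(t)}\, dt.$$

Maximization of this objective function is generally complicated, as the integrations must be performed numerically except in the case of Gaussian outcome and identity link.

One approach to the integration is to make a Laplace approximation, which eventually leads to the PQL algorithm. The PQL algorithm iteratively fits the linear mixed model

$$y^*_{ij} = x_{ij}^\top \beta + u_i + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, \phi / w_{ij}), \tag{3}$$

with the working outcome

$$y^*_{ij} = \hat\eta_{ij} + (y_{ij} - \hat\mu_{ij})\, g'(\hat\mu_{ij}), \tag{4}$$

and the weight

$$w_{ij} = \frac{1}{v(\hat\mu_{ij})\, \{g'(\hat\mu_{ij})\}^2}, \tag{5}$$

where $\hat\eta_{ij}$ and $\hat\mu_{ij}$ are evaluated at the current estimates. The obtained estimates are denoted as $\hat\beta$, $\hat u_i$, and $\hat\theta$. See Breslow et al for more details about the derivation.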
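As an illustration of one PQL linearization step, the following minimal sketch computes the working outcome and weights for a binary outcome with the logit link. It assumes NumPy, a single hospital, and a scalar random intercept; the function name `pql_working_quantities` and the toy data are hypothetical and are not taken from the paper or from the "pda" package.

```python
import numpy as np

def pql_working_quantities(X, y, beta, u):
    """Working outcome and weights for a logistic GLMM with a random intercept.

    X: (n, p) covariates of one hospital; y: (n,) binary outcomes;
    beta: (p,) current fixed effects; u: current random intercept (scalar).
    """
    eta = X @ beta + u                   # linear predictor
    mu = 1.0 / (1.0 + np.exp(-eta))      # inverse logit link
    v = mu * (1.0 - mu)                  # binomial variance function v(mu)
    g_prime = 1.0 / v                    # derivative of the logit link at mu
    y_star = eta + (y - mu) * g_prime    # working outcome
    w = 1.0 / (v * g_prime ** 2)         # weight; equals mu * (1 - mu) for the logit link
    return y_star, w

# Toy data (hypothetical): 6 patients, intercept plus one covariate.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(6), rng.normal(size=6)])
y = np.array([0, 1, 0, 0, 1, 1])
y_star, w = pql_working_quantities(X, y, beta=np.zeros(2), u=0.0)
```

With all effects initialized at 0, every fitted probability is 0.5, so each weight is 0.25 and each working outcome is +2 or -2 depending on the observed outcome.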
The proposed dPQL algorithm
We develop a dPQL algorithm for GLMM estimation in the case that the IPD are distributed across multiple centers and direct transfer of the IPD is not allowed. The dPQL algorithm is based on the distributed linear mixed model (DLMM) algorithm, which fits LMM exactly by requiring each site to contribute some AD only once. Specifically, in each iteration of the PQL algorithm, the weighted LMM (3) is fitted by the DLMM algorithm, requiring each site $i$ to contribute the AD $X_i^\top W_i X_i$ and $X_i^\top W_i y_i^*$ and the scalars $y_i^{*\top} W_i y_i^*$ and $n_i$, where $X_i$ is the $n_i \times p$ covariate matrix, $y_i^*$ is the vector of working outcomes, and $W_i = \mathrm{diag}(w_{i1}, \dots, w_{in_i})$. See the Supplementary Materials for details of the DLMM algorithm. The dPQL algorithm thus reconstructs the PQL iterations and obtains results identical to those obtained if the IPD were pooled together.

The proposed dPQL algorithm

1. Initialize: the lead site sends initial values of the fixed effects $\beta^{(0)}$ and the random effects $u_i^{(0)}$ to the collaborative sites i = 1, …, K.
2. For iteration s = 0, 1, …:
2.1 Site i calculates the working outcome $y_i^{*(s)}$ and the weights $W_i^{(s)}$.
2.2 Site i calculates the aggregated data: the $p \times p$ matrix $X_i^\top W_i^{(s)} X_i$, the $p$-dimensional vector $X_i^\top W_i^{(s)} y_i^{*(s)}$, the scalar $y_i^{*(s)\top} W_i^{(s)} y_i^{*(s)}$, and the sample size $n_i$, and transfers them to the lead site.
2.3 The lead site fits the weighted LMM by the DLMM algorithm based on the aggregated data from 2.2 to obtain updated $\beta^{(s+1)}$ and $u_i^{(s+1)}$, and sends them back to the collaborative sites.
3. Stop iterating when converged, for example, when $\max_k |\beta_k^{(s+1)} - \beta_k^{(s)}|$ < 1e-6. The final estimates are $\hat\beta$ and $\hat u_i$.
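The lossless property of steps 2.2 and 2.3 can be illustrated with a small sketch. It is a simplification under stated assumptions, not the DLMM implementation: the variance component θ is treated as known (the actual algorithm also estimates it), the dispersion is fixed at 1 by default, the covariate matrix carries an intercept in its first column, and the function names `site_aggregates` and `beta_from_aggregates` are hypothetical. The check compares the fixed effects recovered from the aggregated data alone with those from a pooled generalized least squares fit.

```python
import numpy as np

def site_aggregates(X, y_star, w):
    """Aggregated data one site shares; X must have an intercept in column 0."""
    XtWX = X.T @ (w[:, None] * X)          # p-by-p matrix
    XtWy = X.T @ (w * y_star)              # p-dimensional vector
    ytWy = float(y_star @ (w * y_star))    # scalar (used when estimating variances)
    return XtWX, XtWy, ytWy, len(y_star)

def beta_from_aggregates(aggs, theta, phi=1.0):
    """GLS estimate of beta for the weighted random-intercept LMM from AD only.

    Uses the Woodbury identity V_i^{-1} = W_i/phi - c_i (W_i 1)(W_i 1)^T,
    with c_i = 1 / (phi^2 * (1/theta + 1^T W_i 1 / phi)).
    """
    p = aggs[0][0].shape[0]
    A, b = np.zeros((p, p)), np.zeros(p)
    for XtWX, XtWy, _, _ in aggs:
        xw1 = XtWX[:, 0]                   # X^T W 1, read off the intercept column
        sw = XtWX[0, 0]                    # 1^T W 1, the sum of the weights
        yw1 = XtWy[0]                      # 1^T W y*
        c = 1.0 / (phi ** 2 * (1.0 / theta + sw / phi))
        A += XtWX / phi - c * np.outer(xw1, xw1)
        b += XtWy / phi - c * xw1 * yw1
    return np.linalg.solve(A, b)

# Simulated working outcomes and weights for 3 hypothetical sites.
rng = np.random.default_rng(1)
sites = []
for n in (30, 40, 50):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    w = rng.uniform(0.1, 0.25, size=n)
    y_star = X @ np.array([-1.0, 0.5, 0.2]) + rng.normal(size=n)
    sites.append((X, y_star, w))

theta = 0.3
beta_dist = beta_from_aggregates([site_aggregates(*s) for s in sites], theta)

# Pooled reference using patient-level data: V_i = W_i^{-1} + theta * 11^T.
A, b = np.zeros((3, 3)), np.zeros(3)
for X, y_star, w in sites:
    V_inv = np.linalg.inv(np.diag(1.0 / w) + theta * np.ones((len(w), len(w))))
    A += X.T @ V_inv @ X
    b += X.T @ V_inv @ y_star
beta_pooled = np.linalg.solve(A, b)

print(np.allclose(beta_dist, beta_pooled))
```

The agreement follows because every term of the pooled estimating equation can be written in terms of the per-site matrix, vector, and scalar summaries, so no patient-level rows need to leave a site.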
Distributed calculation for standardized mortality rates based on dPQL
Hospital profiling results are often reported with the standardized mortality rates (SMRs) of hospitals. We demonstrate that the SMRs of hospitals can also be calculated in a privacy-preserving fashion. We provide 2 approaches for risk standardization, the Indirectly Standardized Mortality Rate (denoted as ISMR) and the Directly Standardized Mortality Rate (denoted as DSMR).

The ISMR of hospital $i$ is estimated as

$$\mathrm{ISMR}_i = \frac{p_{ii}}{p_{i0}}\, \bar{y}, \tag{6}$$

where

$$p_{ii} = \frac{1}{n_i} \sum_{j=1}^{n_i} g^{-1}(x_{ij}^\top \hat\beta + \hat u_i) \tag{7}$$

is the average expected mortality rate for patients at hospital $i$,

$$p_{i0} = \frac{1}{n_i} \sum_{j=1}^{n_i} g^{-1}(x_{ij}^\top \hat\beta) \tag{8}$$

is the average expected mortality rate for hospital $i$ patients had they been treated at the “population level,” and $\bar{y} = N^{-1} \sum_{i=1}^{K} \sum_{j=1}^{n_i} y_{ij}$ is the overall observed mortality rate. This SMR measure has been used to compare the performance of nonfederal acute care hospitals in the United States for AMI (n = 3135 hospitals), HF (n = 4209 hospitals), and pneumonia (n = 4498 hospitals) from 2004 to 2006.

The DSMR of hospital $h$ is defined as the average mortality rate assuming patients from all the hospitals being treated at this hospital, that is,

$$\mathrm{DSMR}_h = \frac{1}{N} \sum_{i=1}^{K} n_i\, p_{ih}, \tag{9}$$

where

$$p_{ih} = \frac{1}{n_i} \sum_{j=1}^{n_i} g^{-1}(x_{ij}^\top \hat\beta + \hat u_h) \tag{10}$$

is the average expected mortality rate of patients at hospital $i$ had they been treated at hospital $h$. When $h = i$, $p_{ih}$ reduces to $p_{ii}$ in Equation (7), and if $h \neq i$, $p_{ih}$ is a counterfactual probability. This SMR measure has been applied to profiling 4289 hospitals in the United States for AMI using Medicare records from 2009 to 2011, and to evaluating COVID-19 mortality in 929 hospitals. While both approaches measure adjusted mortality rates effectively, DSMR, in contrast to ISMR, has an interpretation on an amenable probability scale.

We note that both types of SMR measures (ISMR and DSMR) can be calculated distributively without sharing IPD. Specifically, for the ISMR, each individual hospital calculates and shares its average expected mortality rates (ie, 2 probabilities $p_{ii}$ and $p_{i0}$ as in Figure 1, and Equations 7 and 8) using its own patient-level data and the public estimates from dPQL (ie, $\hat\beta$ and $\hat u_i$ as in Figure 1). For the DSMR, each individual hospital calculates and shares the average expected mortality rates had its patients been treated at other hospitals (ie, $K$ probabilities $p_{ih}$, $h = 1, \dots, K$, as in Figure 1, and Equation 10) using its own patient-level data and the public estimates from dPQL.
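The two standardizations can be sketched as follows for a logistic GLMM. This is a hedged illustration rather than the authors' code: it assumes NumPy, an intercept column in the covariate matrix, a "population level" that corresponds to a random effect of 0, and that each hospital also shares its observed event count so that the overall observed rate can be formed. The function names are hypothetical.

```python
import numpy as np

def expit(eta):
    """Inverse logit link."""
    return 1.0 / (1.0 + np.exp(-eta))

def site_probabilities(X, beta, u_all):
    """Run locally at hospital i: returns p_i0 and the vector (p_i1, ..., p_iK)."""
    xb = X @ beta
    p_i0 = expit(xb).mean()                                   # population level, u = 0
    p_ih = expit(xb[:, None] + u_all[None, :]).mean(axis=0)   # treated at hospital h
    return p_i0, p_ih

def standardized_rates(n, y_sum, p0, P):
    """Run at the lead site using shared summaries only.

    n: (K,) sample sizes; y_sum: (K,) observed event counts;
    p0: (K,) values p_i0; P: (K, K) matrix with P[i, h] = p_ih.
    """
    y_bar = y_sum.sum() / n.sum()                   # overall observed rate
    ismr = np.diag(P) / p0 * y_bar                  # (p_ii / p_i0) * y_bar
    dsmr = (n[:, None] * P).sum(axis=0) / n.sum()   # sum_i n_i p_ih / N
    return ismr, dsmr

# Hypothetical example: 3 hospitals, intercept plus one covariate.
rng = np.random.default_rng(2)
beta_hat = np.array([-2.0, 0.8])
u_hat = np.array([-0.3, 0.0, 0.4])
n = np.array([20, 30, 50])
Xs = [np.column_stack([np.ones(m), rng.normal(size=m)]) for m in n]
y_sum = np.array([2, 4, 11])

local = [site_probabilities(X, beta_hat, u_hat) for X in Xs]
p0 = np.array([r[0] for r in local])
P = np.vstack([r[1] for r in local])
ismr, dsmr = standardized_rates(n, y_sum, p0, P)
```

Each hospital shares only K + 1 averaged probabilities plus its counts, so the lead site can rank hospitals by either measure without seeing patient-level rows. Because the same patient mix is used for every hospital in the DSMR, hospitals with larger estimated random effects receive larger DSMR values.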
Figure 1.
A distributed procedure for hospital profiling. The dPQL algorithm fits the GLMM in a distributive fashion by requiring some aggregated data (AD) from each hospital in a few iterations, and obtains the estimated fixed effects ($\hat\beta$) and random effects ($\hat u_i$). Next, standardized mortality rates (SMRs) of the hospitals can be calculated distributively. Based on the results of the dPQL algorithm, each hospital calculates its average expected mortality rates using its own individual patient data (ie, for hospital $i$, $p_{i0}$ is the average expected mortality rate had its patients been treated at the “population level,” and $p_{ih}$ is the average expected mortality rate had its patients been treated at hospital $h$). The indirectly and directly standardized mortality rates can then be calculated ($\mathrm{ISMR}_i$ and $\mathrm{DSMR}_i$ for hospital $i$).
US hospital ranking based on the mortality rates for patients admitted with COVID-19
Asch et al conducted a cohort study assessing 38 517 adults who were admitted with COVID-19 to 929 US hospitals from January 1, 2020 to June 30, 2020 using data from the UnitedHealth Group Clinical Discovery Portal. Each hospital’s standardized rate of 30-day in-hospital mortality or referral to hospice was calculated after adjusting for patient-level characteristics, including demographic data, Elixhauser comorbidities, community or nursing facility admission source, and time since January 1, 2020; hospital-level characteristics, including size, the number of intensive care unit beds, academic and profit status, and hospital setting; and regional characteristics, including COVID-19 case burden. See Supplementary Figure S1 for a description of the data.

We demonstrate the applicability of the proposed dPQL algorithm by using it to rank hospitals while transferring only AD from each hospital. Specifically, we compare the predicted mortality rates (via ISMR or DSMR) of the 929 hospitals obtained by either pooled analysis (PQL) of the patient-level data or distributed analysis (dPQL) of the AD across hospitals. We also check the number of iterations needed to reach convergence, and compare the estimates of fixed effects, best linear unbiased predictors (BLUPs), and mortality rates from the pooled and distributed analyses.
RESULTS
The predicted mortality rates (via ISMR or DSMR) of the 929 hospitals by either pooled analysis (PQL) or distributed analysis (dPQL) are compared in Figure 2. The dPQL algorithm converged in only 5 iterations, and the estimates of fixed effects, BLUPs, and mortality rates were identical to those of the PQL fitted to pooled data. The agreement of the estimated fixed and random effects from the dPQL algorithm and from the PQL is shown in Supplementary Figure S2.
Figure 2.
The estimated mortality rate (indirectly standardized mortality rate (A) and (B) and directly standardized mortality rates (C) and (D)) of 30-day in-hospital mortality or referral to hospice of the 929 hospitals by either pooled analysis (PQL) or the distributed analysis (dPQL).
DISCUSSION
We propose a novel dPQL algorithm, a privacy-preserving distributed learning algorithm to fit GLMM. The dPQL algorithm does not require sharing of individual patient-level data. The algorithm requires sharing only minimal AD from each site over a few rounds of communication and obtains results identical to those from fitting GLMM to the pooled data using PQL. The calculation of AD at each individual site is implemented in the R package “pda.” We also developed an “over-the-air” online portal called PDA-OTA (http://pda-ota.pdamethods.org/) to facilitate secure and convenient collaboration on the basis of the “pda” package. See the Supplementary Materials for detailed instructions for using the PDA-OTA.

The results of the PQL estimation are comparable to those of other approaches used to fit the GLMM. For example, in the hospital ranking for COVID-19 mortality rates, the PQL estimates are almost identical to those of the Gauss-Hermite approximation approach used in the original paper. Although fitting GLMM by PQL is sometimes criticized for its biased estimation when the outcome is binary and clusters are small, it is still an appropriate estimation approach for hospital profiling purposes, as the sample sizes in hospitals are usually large enough.

The communication efficiency of the dPQL algorithm is attributable to the fast convergence of the PQL algorithm. See Supplementary Figure S3, which shows convergence in just a few iterations in the hospital profiling example for COVID-19 mortality.
The communication efficiency can be further improved by a one-shot (or few-shot) version of the dPQL algorithm, that is, by running only one (or a few) iterations of the dPQL algorithm proposed in Section “The proposed dPQL algorithm.” Such a one-shot approach has been pursued by many distributed algorithms and is considered communication-efficient. The one-shot version of the dPQL algorithm sacrifices some estimation accuracy but has a very appealing communication cost, as each hospital needs to share the AD only once. Meanwhile, the number of iterations required by the PQL algorithm depends on the choice of initial values. While default initial values (ie, all fixed effects being 0) provide satisfactory results, performance can be improved with well-chosen initial values. We recommend setting a maximum number of iterations (eg, 5 iterations) when using dPQL in practice. However, we do not recommend applying dPQL in the high-dimensional setting (ie, large p), as it involves communication of massive aggregate data (ie, the p-by-p matrices in step 2.2 of the algorithm).

We provide indirect (ISMR) and direct (DSMR) standardization for interpreting the hospital ranking for the purpose of public reporting. Examples of hospital profiling using either approach exist in the literature. The directly standardized approach is considered to behave better for models that consider the interaction between the hospital and the patients. On the other hand, using GLMM for ranking hospitals assumes overlap of patient characteristics at different hospitals. Other statistical models, for example, models without random effects, could also be considered when there is poor overlap of patient characteristics between hospitals. The choice of standardization approaches and statistical models is beyond the scope of this paper. Hospital profiling can be conducted for other tasks, as long as the outcome can be modeled by GLMM.
This includes binary outcomes such as COVID-related mortality, ventilator usage, or hospital readmission, and count outcomes such as hospitalization length of stay.

Our proposed dPQL algorithm is similar in spirit to federated learning methods, which have found wide application in many clinical settings in recent years. However, our AD release mechanism has not been investigated in a rigorous privacy framework such as k-anonymity or differential privacy, and thus is not guaranteed to be protected from the risk of re-identification or membership inference attacks (MIAs). Specifically, the risk of re-identification arises from linking potential quasi-identifiers (eg, combinations of a patient’s characteristics) to external sources, and the risk of MIAs refers to inferring whether a data point (eg, a specific patient’s record) was used to train the model. To avoid potential risk of re-identification, we suggest excluding or suppressing values representing 10 or fewer patients when sharing the aggregate data and using random initial values if possible when initiating the iteration. These measures will prevent the aggregate data from containing sparse elements that could re-identify sensitive patient information. We also suggest avoiding high-dimensional GLMM models and using a representative sample for training. These measures will prevent overfitting and improve the generalizability of the model, which in turn mitigates the risk of MIAs. In the future, we plan to extend our dPQL algorithm via techniques such as differential privacy and multiparty homomorphic encryption.
FUNDING
This work was partially supported through a Patient-Centered Outcomes Research Institute (PCORI) Project Program Award (ME-2019C3-18315). All statements in this report, including its findings and conclusions, are solely those of the authors and do not necessarily represent the views of the PCORI, its Board of Governors or Methodology Committee.
AUTHOR CONTRIBUTIONS
CL and YC designed methods and analyses; MNI, NES, and JB provided the dataset from UnitedHealth Group; CL and MNI conducted numerical analyses; all authors interpreted the results and provided instructive comments; CL and YC drafted the main article. All authors have approved the article.
SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.
CONFLICT OF INTEREST STATEMENT
NES, MNI, and JB are or were full-time employees of OptumLabs and own stock in its parent company, UnitedHealth Group, Inc. The other authors have no competing interests to declare.
DATA AVAILABILITY STATEMENT
All data were accessed in compliance with the HIPAA rules; IRB approval or waiver of authorization was not required.
Authors: Elizabeth E Drye; Sharon-Lise T Normand; Yun Wang; Joseph S Ross; Geoffrey C Schreiner; Lein Han; Michael Rapp; Harlan M Krumholz Journal: Ann Intern Med Date: 2012-01-03 Impact factor: 25.391
Authors: Rui Duan; Mary Regina Boland; Zixuan Liu; Yue Liu; Howard H Chang; Hua Xu; Haitao Chu; Christopher H Schmid; Christopher B Forrest; John H Holmes; Martijn J Schuemie; Jesse A Berlin; Jason H Moore; Yong Chen Journal: J Am Med Inform Assoc Date: 2020-03-01 Impact factor: 4.497
Authors: Nicolas R Thompson; Youran Fan; Jarrod E Dalton; Lara Jehi; Benjamin P Rosenbaum; Sumeet Vadera; Sandra D Griffith Journal: Med Care Date: 2015-04 Impact factor: 2.983
Authors: David A Asch; Natalie E Sheils; Md Nazmul Islam; Yong Chen; Rachel M Werner; John Buresh; Jalpa A Doshi Journal: JAMA Intern Med Date: 2021-04-01 Impact factor: 21.873
Authors: Rui Duan; Chongliang Luo; Martijn J Schuemie; Jiayi Tong; C Jason Liang; Howard H Chang; Mary Regina Boland; Jiang Bian; Hua Xu; John H Holmes; Christopher B Forrest; Sally C Morton; Jesse A Berlin; Jason H Moore; Kevin B Mahoney; Yong Chen Journal: J Am Med Inform Assoc Date: 2020-07-01 Impact factor: 4.497
Authors: Nicola Rieke; Jonny Hancox; Wenqi Li; Fausto Milletarì; Holger R Roth; Shadi Albarqouni; Spyridon Bakas; Mathieu N Galtier; Bennett A Landman; Klaus Maier-Hein; Sébastien Ourselin; Micah Sheller; Ronald M Summers; Andrew Trask; Daguang Xu; Maximilian Baust; M Jorge Cardoso Journal: NPJ Digit Med Date: 2020-09-14