
A secure distributed logistic regression protocol for the detection of rare adverse drug events.

Khaled El Emam, Saeed Samet, Luk Arbuckle, Robyn Tamblyn, Craig Earle, Murat Kantarcioglu.

Abstract

BACKGROUND: There is limited capacity to assess the comparative risks of medications after they enter the market. For rare adverse events, the pooling of data from multiple sources is necessary to have the power and sufficient population heterogeneity to detect differences in safety and effectiveness in genetic, ethnic and clinically defined subpopulations. However, combining datasets from different data custodians or jurisdictions to perform an analysis on the pooled data creates significant privacy concerns that would need to be addressed. Existing protocols for addressing these concerns can result in reduced analysis accuracy and can allow sensitive information to leak.
OBJECTIVE: To develop a secure distributed multi-party computation protocol for logistic regression that provides strong privacy guarantees.
METHODS: We developed a secure distributed logistic regression protocol using a single analysis center with multiple sites providing data. A theoretical security analysis demonstrates that the protocol is robust to plausible collusion attacks and does not allow the parties to gain new information from the data that are exchanged among them. The computational performance and accuracy of the protocol were evaluated on simulated datasets.
RESULTS: The computational performance scales linearly as the dataset sizes increase. The addition of sites results in an exponential growth in computation time. However, for up to five sites, the time is still short and would not affect practical applications. The model parameters are the same as the results on pooled raw data analyzed in SAS, demonstrating high model accuracy.
CONCLUSION: The proposed protocol and prototype system would allow the development of logistic regression models in a secure manner without requiring the sharing of personal health information. This can alleviate one of the key barriers to the establishment of large-scale post-marketing surveillance programs. We extended the secure protocol to account for correlations among patients within sites through generalized estimating equations, and to accommodate other link functions by extending it to generalized linear models.

Year:  2012        PMID: 22871397      PMCID: PMC3628043          DOI: 10.1136/amiajnl-2011-000735

Source DB:  PubMed          Journal:  J Am Med Inform Assoc        ISSN: 1067-5027            Impact factor:   4.497


Introduction

Although US$500 billion is spent worldwide on drugs each year,1 there is limited capacity to assess the comparative risks and effectiveness of medications after they enter the market.2–6 In Canada, 60% of people 18 years of age or older have taken at least one prescription drug in the previous 6 months, and over one-third report experiencing an adverse drug event (ADE).7 Even when safety problems are identified, there is no timely or effective method of communicating this information to physicians to inform prescribing decisions.8–10 Most countries have established a formal regulatory process for drug approval that defines the information required from drug manufacturers to demonstrate a drug's safety and efficacy. However, drugs are typically tested in randomized controlled trials with a limited number of patients selected carefully to optimize compliance and limit comorbidity.11–13 This population of patients rarely represents the typical patient treated with the drug after approval. While pre-market studies uncover commonly occurring ADEs, they are not designed to detect rare but serious ADEs,12 nor to assess safety and effectiveness in the broader population of eventual users.14 15 The limitations of relying on safety assessments from pre-market drug approval studies were highlighted in the 1950s by the thalidomide disaster, in which a drug prescribed for nausea in pregnancy produced severe congenital anomalies.
In response to this problem, a voluntary system of adverse drug reaction reporting was instituted, which continues to be the cornerstone of post-market surveillance.11 However, 60 years later, there is worldwide consensus that voluntary reporting is insufficient.12 16 Only 2–10% of ADEs are reported, there are substantial delays in ADE detection, and ADE case reports lack accurate numerators and denominators to estimate incidence.11 17–19 Moreover, voluntary reporting does not allow identification of ADEs, such as myocardial infarction, which also commonly occur in the general population. For example, more than 9 million people took the now infamous weight-loss drug fen-phen before it was identified that the drug could result in cardiac valve damage, a problem that also occurs in the general population for non-drug-related causes.12 16 Traditional adverse event reporting has also been widely criticized because it substantially underestimates important patient-reported adverse effects such as nausea, fatigue, appetite loss, and diarrhea.20 21 This underestimation can have profound clinical implications because early detection and response to suboptimal patient-reported treatment outcomes can improve adherence to treatment as well as reduce the risk of adverse events.22 However, regular monitoring and follow-up is resource intensive, and difficult to incorporate into regular practice in a cash-strapped healthcare system. A number of approaches have been used for post-marketing surveillance to address these problems. 
Prescription event monitoring is an active post-market surveillance method that requires physicians to respond to a follow-up questionnaire about a patient's response to new drugs.23 In one study, 94% of events detected by prescription event monitoring were not detected by spontaneous ADE reporting.19 However, response rates of physicians to follow-up questionnaires are poor, ranging from 35% to 65% and decreasing to 27.6% when information is sought for more than 30 patients from a single physician.23 24 Moreover, physicians who prescribe new drugs to more patients are generally poor responders.25 The labor-intensive nature of this method makes it unsustainable for a nation-wide undertaking.23 Even if mandatory reporting of ADEs were to be instituted, as is the case for infectious disease reporting in public health, response rates are notoriously poor.26 27 In the public health context, authorities have addressed this limitation by increasing their reliance on computerized information sources such as data from laboratory and medical service claims systems, as these data are more timely and reliable, and the effort to document information in a parallel reporting system is reduced.28–31 In Canada, population-level health administrative data (prescriptions, medical services, hospitalizations, mortality) can be linked to create longitudinal health histories for individual patients, which has enabled a new generation of methods to assess drug safety and effectiveness after approval.32–41 Unfortunately, administrative data cannot be used alone for prospective surveillance because they lack important clinical variables that are needed for assessing indications for treatment, risk factors (eg, smoking), clinical measures (eg, blood pressure, HbA1c), and health status outcomes (eg, functional status).
The increasing use of electronic health records in community- and hospital-based care may, however, provide a means of addressing both of these issues: systematic collection of important clinical variables to assess effectiveness and identification of ADEs in a timely manner.42–45 Europe, Scandinavia, Australia, and England have led the introduction of electronic health records in primary care.42–45 One byproduct of these early investments is the creation of new information sources that can be used to conduct drug safety and effectiveness evaluation. The General Practice Research Database, the first of this new genre, collects information from the electronic health records of 450 general practices in England and approximately 3.6 million active patients. It has been used to conduct over 800 studies, including a sentinel study on the safety of childhood vaccines in relation to the suspected link to the development of autism.46 Similar to paper medical records, these electronic files include information on prescribed therapy, consultations, morbidity events (diagnosis and symptoms), and lifestyle (smoking, alcohol, height, and weight).47 In the last 5 years, there has been a call to develop the potential to use these new information-rich resources for assessment of drug safety and effectiveness.2–4 6 48 Indeed, a new generation of drug safety and effectiveness studies is beginning to emerge from the electronic clinical data of large enterprise health-delivery networks.49–52 For rare adverse events, the pooling of data from multiple sources is necessary to have the statistical power and sufficient population heterogeneity to detect differences in safety and effectiveness in genetic, ethnic and clinically defined subpopulations.6 This is important because the effects of treatment may vary by sex and ethnicity,47 51–55 probably because of subpopulation differences in the prevalence of genetic polymorphisms that influence the metabolism of medication and its efficacy and toxicity.56 57 Combining data from different data custodians or jurisdictions to perform an analysis on the pooled data creates significant privacy concerns that would need to be addressed.58 It has been argued that providers would be permitted to disclose identifiable health information to certain organizations performing pharmacovigilance, such as the Food and Drug Administration in the USA.59 However, not all organizations in the USA and elsewhere that will be collecting data for the evaluation of drug, medical device, and vaccine safety will have such public health exemptions. For example, pharmaceutical companies that need to perform post-marketing surveillance on conditionally approved drugs or devices would still have to address privacy issues, as they will not have the authority to collect potentially identifiable patient information. In addition, in order to maintain public trust, even if the organization performing surveillance is permitted to collect personal health information (PHI), it may be prudent not to collect PHI on large numbers of individuals who do not experience adverse events (eg, controls). Datasets that are distributed among multiple sites, having the same fields but different records in each site, are called ‘horizontally partitioned’ data. To address the privacy concerns noted above, a number of data analysis protocols for secure computation on such horizontally partitioned data have been proposed, but they all have important disadvantages. For example, the sharing of deidentified data to create a pooled dataset60–62 will result in a loss of precision of the data, meta-analytic methods will result in a loss of precision and power,63 and the accuracy of recently proposed propensity score methods has not been compared with an ideal analysis on the pooled data,64 65 so any losses in precision and accuracy from that approach are not known.
Methods for multi-site regression would retain the precision and power.66 67 However, current multi-site regression approaches are prone to inappropriate disclosure of personal information from the information matrix,68 from indicator variables, from the covariance matrix,68–71 from the iterations themselves,72 and from the information matrix across multiple models.68 Secure multi-party computation methods have been proposed for the construction of regression models on horizontally partitioned data.73–77 However, as we demonstrate in the online appendix, these methods can still leak personal information. Distributed aggregation architectures that send queries to sites and combine their responses have been proposed and deployed.78–81 These are prone to tracker queries at various levels of sophistication that can reveal personal information.82–88 A detailed review and critique of all these methods and protocols, illustrating how they can potentially still leak personal information, is provided in the online appendix. Our objective was therefore to develop a multi-site logistic regression protocol using secure multi-party computation methods, which does not disclose PHI by (1) not revealing the individual site information matrix and score vectors, (2) avoiding inference channels through multiple overlapping queries, and (3) retaining the same precision as a raw data pooled analysis. We chose logistic regression because (1) it is a commonly used analytical method for investigations of ADEs,89–93 and (2) the link function for the logistic model is more complex than for other generalized linear models (GLMs), which makes it a good one to illustrate in detail. We then show how the logistic regression protocol can be extended to generalized estimating equations (GEEs) to account for correlations among patients within a site, and to other GLMs such as Poisson regression and survival models.

Methods

Logistic regression

Let Y = (Y1, …, Yn)′ be independent Bernoulli variables with mean E(Y) = μ = (μ1, …, μn)′. Given an intercept and a set of covariates, let X = [1, x·1, …, x·v] be the design matrix, where x·j = (x1j, …, xnj)′ contains the values for covariate j. We define a logistic model with parameters β through logit(μ) = Xβ (we say that the logit function links the random component Y to the systematic component Xβ). The log-likelihood of the full model, which can be used to assess model fit (usually given as −2 log-likelihood), equals

    l(β) = Σi [ yi xi′β − log(1 + exp(xi′β)) ],

where xi′ is row i from the design matrix X. For a set of observations y = (y1, …, yn)′, we can determine parameter estimates at which the log-likelihood l of the model is maximized using the Newton–Raphson method (or, equivalently, the Fisher scoring method, since the estimated and observed information matrices are the same for a logistic model94). That is, we iteratively compute the estimates at iteration t using

    β(t+1) = β(t) + [I(β(t))]−1 U(β(t)),

where U(β(t)) = X′(y − μ(t)) is the estimated score vector with probability of success μ(t) = logit−1(Xβ(t)), and I(β(t)) = X′W(t)X is the estimated information matrix with weight matrix W(t) = diag[μi(t)(1 − μi(t))]. This fitting method can be used for any GLM (with new derivations for the score vector and information matrix),95 and has been shown to converge to a solution in fewer iterations than other optimization algorithms applied to logistic models.96
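As a concrete reference for the fitting procedure above, the following sketch implements the Newton–Raphson (Fisher scoring) iteration for a pooled, non-distributed logistic model. The function and variable names mirror the notation in the text; this is an illustration, not part of the SPARK implementation.

```python
import numpy as np

def fit_logistic(X, y, tol=1e-4, max_iter=25):
    """Iterate beta(t+1) = beta(t) + I(beta)^-1 U(beta) until the
    relative change in the estimates falls below tol."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))  # mu(t) = logit^-1(X beta(t))
        U = X.T @ (y - mu)                      # score vector X'(y - mu)
        W = mu * (1.0 - mu)                     # diagonal of weight matrix W
        I = X.T @ (X * W[:, None])              # information matrix X' W X
        step = np.linalg.solve(I, U)
        beta = beta + step
        if np.max(np.abs(step) / np.maximum(np.abs(beta), 1e-8)) < tol:
            break
    return beta
```

In SPARK, the same quantities U and I are computed jointly across sites using the secure building blocks described below, so that no site reveals its own score vector or information matrix.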

SPARK protocol

Our protocol for the secure computation of logistic regression models across horizontally partitioned data (SPARK: Secure Pooled Analysis acRoss K-sites) assumes that there are k sites providing data on the same variables for different patients, and there is a single analysis center (AC) as illustrated in figure 1 (for the case of three sites). The AC would define the model that needs to be constructed and initiate the distributed secure computation. In some instances, the sites need to communicate directly with each other. This direct communication capability that bypasses the AC is important for maintaining the security of the protocol.
Figure 1

Overview of set-up for implementing the SPARK protocol when there are only three sites. This figure is only reproduced in colour in the online version.

Secure building blocks

We use the additive homomorphic encryption system proposed by Paillier.97 With the Paillier cryptosystem, it is possible to perform mathematical operations on the encrypted values themselves, such as addition and limited forms of multiplication. Formally, for any two data elements, m1 and m2, and their encrypted values, E(m1) and E(m2), the following equation is satisfied:

    D(E(m1) × E(m2) mod n²) = m1 + m2 mod n,

where n is a product of two large prime numbers, and D is the decryption function. In this type of cryptosystem, addition of the plaintexts is mapped to multiplication of the corresponding ciphertexts. The Paillier cryptosystem also allows a limited form of multiplication of an encrypted value:

    D(E(m1)^m2 mod n²) = m1 × m2 mod n,

which allows an encrypted value to be multiplied with a plaintext value to obtain their product. Another property of Paillier encryption is that it is probabilistic. This means that it uses randomness in its encryption algorithm, so that encrypting the same message several times will, in general, yield different ciphertexts. This property is important to ensure that an adversary would not be able to compare an encrypted message with all possible counts from zero onwards and determine what the encrypted value is. The SPARK protocol uses a number of secure building blocks that are needed for basic mathematical operations, such as addition, multiplication, secure dot product, matrix multiplication, and matrix inverse, which are combined to implement logistic regression. Secure dot product,98 secure multiparty multiplication,99 secure multiparty addition,99 secure matrix sum inverse for two parties,100 and secure matrix multiplication100 are existing protocols that we use in SPARK. In each of these building blocks, the final result is privately shared among the parties involved. We extended the secure matrix sum inverse sub-protocol, which only exists for the two-party case, to the more general multi-party case.
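The two homomorphic identities above can be demonstrated with a toy implementation of the Paillier cryptosystem. This sketch uses small primes and the common simplification g = n + 1; it is for illustration only and is in no way a secure, production-grade implementation (real deployments, like the 1024-bit keys used in the evaluation below, require large primes and a vetted library).

```python
import math
import secrets

class Paillier:
    """Toy Paillier cryptosystem with generator g = n + 1 (illustrative)."""

    def __init__(self, p, q):
        self.n = p * q
        self.n2 = self.n * self.n
        self.lam = math.lcm(p - 1, q - 1)
        self.mu = pow(self.lam, -1, self.n)   # decryption constant

    def encrypt(self, m):
        # Probabilistic: a fresh random r (coprime to n) makes repeated
        # encryptions of the same message yield different ciphertexts.
        r = secrets.randbelow(self.n - 1) + 1
        while math.gcd(r, self.n) != 1:
            r = secrets.randbelow(self.n - 1) + 1
        return (pow(1 + self.n, m, self.n2) * pow(r, self.n, self.n2)) % self.n2

    def decrypt(self, c):
        u = pow(c, self.lam, self.n2)
        return (((u - 1) // self.n) * self.mu) % self.n

# The two homomorphic identities from the text:
ph = Paillier(2741, 3301)                                     # toy primes
assert ph.decrypt((ph.encrypt(12) * ph.encrypt(30)) % ph.n2) == 42
assert ph.decrypt(pow(ph.encrypt(12), 5, ph.n2)) == 60
```

Multiplying ciphertexts adds the plaintexts, and raising a ciphertext to a plaintext power multiplies them, which is exactly what the secure sum and dot-product building blocks exploit.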
The secure computation of 2-norm distance and comparison are the other two building blocks that we use in SPARK, and these are presented in the online appendix. Based on these secure building blocks, we describe the complete SPARK protocol in the online appendix. We also include a detailed security analysis of the protocol to illustrate that it is inherently secure and resilient to plausible collusion attacks.

Empirical evaluation

A theoretical complexity analysis of the SPARK protocol is provided in the online appendix. Our empirical evaluation of SPARK, presented here, considered the computational performance of the protocol and its accuracy compared with the results of using raw pooled datasets with SAS. For this empirical evaluation, we created some simulated datasets.

Simulation datasets

A classic simulated dataset is one formed by a set of independent normally distributed variables (eg, see Hosmer–Lemeshow101). The variables in this case could be thought of as mean-centered and scaled.95 102 Datasets of this type have been used repeatedly in the evaluation of statistical methods in medical and health research.103 104 Similarly, we used a binomial distribution to create binary variables, which could be seen as simulating binary risk factors (as in Heinze and Schemper105). Correlated variables are common, however, in health research and often used in models (see the reviews by Bagley et al106 and Mallett et al107). We therefore also created correlated data from the normally distributed variables. A common recommendation in biostatistics is to constrain the number of covariates proportionally to the number of observations. Following Harrell,108 we therefore set the number of observations to 40 times the number of covariates to simulate more realistic models. Harrell recommends a maximum of 10–20 ‘equivalent’ observations per covariate to avoid overfitting, where, for a logistic regression, the number of equivalent observations is the minimum number of binary outcomes at the same level (eg, the minimum number of zeros or ones). We assumed that the outcomes would be split evenly between their binary values, which meant creating twice the number of observations as the equivalent observations described by Harrell. The sizes of the resulting datasets are summarized in table 1.
Table 1

Size of simulated datasets (excluding intercept)

Number of covariates       5     10     15     20
Number of observations   200    400    600    800
Moreover, we wished to test different variable types and therefore created datasets with independent identically distributed (iid) continuous covariates, correlated covariates, and binary indicators (thus resulting in 12 datasets when combined with table 1). The iid variables were drawn randomly from a standard normal distribution; the correlated variables were created using a Cholesky decomposition of a correlation matrix with off-diagonal entries of 0.75, applied to the iid matrix of variables (preserving their marginal distributions)109; and the binary indicators were drawn randomly from the binomial distribution, with the probability of success for each variable drawn randomly from the uniform distribution (scaled so that the probability of success was restricted to values from 0.3 to 0.7 in an effort to avoid convergence problems in the estimated models). In order to compare estimates between models with different covariate types, we needed to use the same parameter values for the 12 different logistic models. We therefore randomly drew 21 fixed values for the parameters (for models with an intercept and up to 20 covariates) from a normal distribution with mean 0 and variance 10. The resulting draw for the first six parameters (common to each model) was the fixed column vector β′ = (0.899, −5.944, 1.534, −0.156, 2.259, −1.868). We included an intercept, β0, hence we also included a column of ones in the design matrix (which was otherwise exclusively populated with one of iid, correlated, or indicator variables). The outcome variable was drawn for each of the 12 models from a binomial distribution with probability of success equal to the mean response of the logistic model, μi = logit−1(xi′β).110 When the outcome is rare, as would be expected with some ADEs, the dataset would be quite unbalanced.
There are two common approaches for dealing with an unbalanced dataset: (1) a down-sampling or prior correction approach, which reduces the number of observations so that the two classes in the logistic regression model are equal111–113; and (2) the use of weights. It has been noted that the weighting approach suffers a loss in efficiency compared with an unweighted approach when the model is exact.114 Therefore, after down-sampling, the dataset would be rebalanced, which is consistent with our simulated datasets. Having created our 12 datasets, with outcomes, we then used a simple bootstrap to generate 5000 replicates for each dataset (with the same number of observations in each replicate). In all of our evaluations, we randomly split the dataset into equal-sized subsets across the different sites for each iteration of the simulation.
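The data-generation steps above can be sketched as follows. The parameter values are the first six reported in the text; the helper name simulate_dataset is our own, for illustration.

```python
import numpy as np

def simulate_dataset(n_obs, n_cov, beta, rho=0.75, seed=1):
    """Generate correlated covariates and a binary outcome from the
    logistic model, following the construction described in the text."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n_obs, n_cov))          # iid standard-normal draws
    R = np.full((n_cov, n_cov), rho)             # correlation matrix with
    np.fill_diagonal(R, 1.0)                     # off-diagonal entries 0.75
    Xc = Z @ np.linalg.cholesky(R).T             # correlated covariates
    X = np.column_stack([np.ones(n_obs), Xc])    # prepend intercept column
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))       # mean response of the model
    y = (rng.uniform(size=n_obs) < mu).astype(int)
    return X, y

# First six reported parameter values; 200 observations for 5 covariates,
# ie, 40 observations per covariate, as in table 1.
beta = np.array([0.899, -5.944, 1.534, -0.156, 2.259, -1.868])
X, y = simulate_dataset(200, 5, beta)
```

Multiplying the iid matrix by the transposed Cholesky factor gives rows with covariance equal to the target correlation matrix while leaving each column standard normal.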

Computational performance evaluation

Two types of performance evaluation were performed. In the first, we assumed two sites, and the focus was to evaluate the computation time. This was calculated as the average across all replicates for each dataset. In the second evaluation, we measured the computation time as the number of sites and the number of records in the dataset were systematically increased. We varied the number of sites from two to five, and the number of records from 100 000 to 1 million in 100 000 record increments. We did not take advantage of parallelism in these evaluations, therefore the performance should be considered a lower bound. The machine used was a commodity Windows XP platform with a dual-core Intel 2.4 GHz processor and 3 GB of RAM. The key bit-length for this evaluation was 1024 bits. For fast and accurate computation on big integers and floating point numbers, the GNU Multiple Precision Arithmetic Library was used in the implementation of the protocol, and the system was developed in the C# programming language.

Accuracy evaluation

It is necessary to perform accuracy evaluations because all secure multi-party computation protocols operate only on integers. We therefore had to scale all of our real numbers into integers to perform the computations, and then scale them back when presenting the results. This scaling causes a loss of precision. The accuracy evaluation was intended to determine the extent to which the results differ from constructing models on the original pooled datasets in SAS. We fitted logistic models to all replicates individually with SPARK and SAS using the Newton–Raphson method, without any form of ridging, and with relative parameter convergence of 1e-4. We computed the maximum difference between SPARK and SAS estimates, including estimates for the intercept and five covariates (we excluded the additional covariates for ease of presentation).
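The integer-scaling step can be illustrated with a generic fixed-point sketch. The scale factor of 10^9 below is an assumption chosen to match the recording precision mentioned in the results, not necessarily the factor used in the SPARK prototype.

```python
SCALE = 10**9  # assumed fixed-point scale factor (illustrative only)

def to_int(x):
    """Encode a real value as an integer for the secure computation."""
    return round(x * SCALE)

def from_int(k, power=1):
    """Decode back to a real value; multiplying two encoded values yields
    a result carrying SCALE**2, hence power=2 after one multiplication."""
    return k / (SCALE ** power)

a, b = 0.1234567894, 2.7182818285
prod = to_int(a) * to_int(b)        # exact arithmetic on the integer codes
approx = from_int(prod, power=2)    # rescale the product
assert abs(approx - a * b) < 1e-8   # loss of precision bounded by the scale
```

The rounding at encode time is the only source of error; all subsequent integer arithmetic inside the protocol is exact, which is why the observed differences from SAS are on the order of the scale factor.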

Results

The evaluation results for performance and accuracy are shown in this section.

Computational performance

The computational performance results for the two-site setting are shown in table 2. These show the actual time to perform the computations at each site, not the communication time among sites. As expected, the computation time increases with more covariates in the dataset. The variation in performance among datasets with the same number of covariates was not dramatic. The results with large datasets and more sites are shown in figure 2. The computation time scales linearly with more records. As more sites are added, the computation time grows exponentially. However, with five sites and 1 million records, the computation takes approximately 5 min, which makes the implementation practical in realistic situations.
Table 2

Computation time for different datasets assuming two parties

No of covariates   Type         Time (min)
5                  iid          0.0286
                   Correlated   0.0244
                   Binary       0.0238
10                 iid          0.1836
                   Correlated   0.1395
                   Binary       0.1249
15                 iid          0.6669
                   Correlated   0.4336
                   Binary       0.3935
20                 iid          0.9804
                   Correlated   1.0159
                   Binary       1.0026

Time is the average across 5000 replicates.

iid, independent identically distributed.

Figure 2

Performance in seconds as the number of records increases from 100 000 to 1 million for two to five sites. This figure is only reproduced in colour in the online version.


Accuracy

The accuracy results are shown in table 3. Note that differences are given at a precision of 10^−6 (ie, all values in the table need to be multiplied by 10^−6), and that estimates were originally recorded at a precision of 10^−9. Mean absolute differences (not shown) were so small, with narrow CIs, that we decided it would be more meaningful to report maximum differences only.
Table 3

Absolute difference between SPARK and SAS estimates for the intercept and five covariates, based on a simple bootstrap of 5000 replicates*, with a recorded precision of 10^−9 for estimates

Maximum absolute difference between estimates (×10^−6)

Parameters
No of covariates   Type         b0      b1      b2      b3      b4      b5
5                  iid          0.073   0.257   0.082   0.094   0.117   0.017
                   Correlated   0.060   0.229   0.133   0.061   0.084   0.158
                   Binary       0.071   0.447   0.110   0.079   0.233   0.126
10                 iid          0.555   2.050   0.589   0.162   0.716   0.740
                   Correlated   0.025   0.089   0.032   0.017   0.036   0.059
                   Binary       0.023   0.072   0.025   0.024   0.027   0.027
15                 iid*         0.930   4.340   0.980   0.807   1.850   1.510
                   Correlated   0.016   0.075   0.041   0.028   0.030   0.027
                   Binary       0.049   0.120   0.034   0.034   0.042   0.028
20                 iid          0.021   0.093   0.026   0.040   0.033   0.028
                   Correlated   0.041   0.200   0.087   0.017   0.094   0.058
                   Binary       0.114   1.330   0.334   0.220   0.530   0.360

Std errors
No of covariates   Type         se0     se1     se2     se3     se4     se5
5                  iid          0.017   0.069   0.020   0.023   0.031   0.024
                   Correlated   0.015   0.057   0.035   0.018   0.020   0.037
                   Binary       0.032   0.206   0.029   0.019   0.107   0.032
10                 iid          0.376   1.504   0.429   0.142   0.524   0.533
                   Correlated   0.006   0.019   0.007   0.005   0.007   0.015
                   Binary       0.004   0.017   0.004   0.004   0.005   0.005
15                 iid*         1.548   8.500   1.960   1.577   2.860   1.790
                   Correlated   0.002   0.018   0.005   0.003   0.005   0.004
                   Binary       0.005   0.009   0.003   0.002   0.004   0.003
20                 iid          0.005   0.025   0.006   0.002   0.009   0.008
                   Correlated   0.009   0.052   0.023   0.005   0.026   0.019
                   Binary       0.073   1.239   0.301   0.205   0.514   0.340

Replicates in which complete or quasi-complete separation was detected in SAS were excluded. This occurred in less than 2.5% of replicates for all but the dataset with 15 iid covariates, in which separation was detected in 16.6% of replicates.

iid, independent identically distributed.

Cases where complete or quasi-complete separation was detected were excluded from the results in table 3, because of potential differences in stopping criteria. Complete separation occurs when a linear combination of the data produces perfect predictions, with some observations always having a probability of one and others always having a probability of zero (ie, there exists a vector b such that xi′b < 0 when yi = 0 and xi′b > 0 when yi = 1, for all observations i); quasi-complete separation occurs when a linear combination of the data produces perfect predictions for some observations and uncertainty otherwise (ie, there exists a vector b such that xi′b ≤ 0 when yi = 0 and xi′b ≥ 0 when yi = 1, with at least one case of equality for each outcome). Parameter estimates are infinite if the design matrix is completely or quasi-completely separable, which leads to convergence failures. Formal details are given by Albert and Anderson,115 and a more applied presentation is given by Allison.116 This type of convergence failure is common in logistic regression, and occurred in less than 2.5% of replicates for all but the dataset with 15 iid covariates, which suffered complete or quasi-complete separation in 16.6% of replicates (according to the detection criteria in SAS, as described by Allison116). We did not modify the latter replicates because variables were created and sampled using random draws, making such convergence failures difficult to eliminate in advance. Also, the original dataset was only one of our 12 test cases.
Absolute differences between SPARK and SAS estimates in table 3 that exceeded 10^−6 were most likely due to undetected quasi-complete separation. On inspection, we found that the replicates where this occurred had parameter estimates that were multiple times their simulated values. Therefore, relative to the size of the estimates, the agreement between SPARK and SAS was even closer than the absolute differences reported suggest.
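Complete separation and its effect on the Newton–Raphson iterations can be seen in a toy example: when a covariate perfectly splits the outcomes, the slope estimate grows at every iteration instead of converging to a finite maximum-likelihood estimate. The dataset below is a minimal illustration of our own, not one of the simulated datasets from the evaluation.

```python
import numpy as np

# Six observations: the sign of the covariate perfectly predicts the outcome.
X = np.column_stack([np.ones(6), [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

beta = np.zeros(2)
slopes = []
for _ in range(30):
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    W = mu * (1.0 - mu)
    # One Newton-Raphson update: beta += I^-1 U
    beta = beta + np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - mu))
    slopes.append(beta[1])

# The slope estimate keeps growing instead of converging:
assert slopes[-1] > slopes[10] > slopes[0] > 0
```

Detection criteria such as those used in SAS flag this behavior, typically by monitoring fitted probabilities approaching zero or one, rather than waiting for the estimates to diverge.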

Discussion

To ensure sufficient statistical power and population heterogeneity in the detection of ADEs, data from multiple sites need to be combined. A simple pooling of such horizontally partitioned data presents serious privacy concerns. Our review of the literature found that existing architectures and methods for analyzing horizontally partitioned data would allow the disclosure of PHI under a variety of conditions. For the specific problem of detecting ADEs, we have developed a secure distributed logistic regression protocol which addresses known weaknesses of previous protocols and ensures that PHI cannot be disclosed. The detailed security analysis in the online appendix demonstrates that sites that follow the protocol cannot access raw data from other sites, and that plausible collusion scenarios present low risk. Our empirical evaluation of the protocol has demonstrated that its computational performance would be acceptable for large datasets and for multiple sites, and that its accuracy (in terms of model parameters and diagnostics) is equivalent to the values that one would obtain from an analysis using SAS on the pooled raw data. This protocol should allow sites to contribute their patient data to multi-site analyses of ADEs with assurances that their patients' personal information will not be disclosed or inferred, but still allow the appropriate multi-site analytical models to be constructed. Because one of the key privacy concerns would be addressed, the SPARK protocol should allow analyses to commence faster and with less need for negotiating complex data-sharing agreements on PHI with each site (which can be a time-consuming process, especially if it involves data crossing jurisdictional boundaries). Compared with other protocols that do not implement secure computation (and hence do not provide the same level of assurances), SPARK will have more communication overhead.
Therefore its overall performance will also be a function of this communication overhead, which will be dependent on network latency among the sites. Details on the number of messages passed in the SPARK protocol are provided in the complexity analysis in the online appendix. In general, communications can be optimized through pipelining the data flow rather than communicating in bursts and by sending multiple messages together. A multi-site analysis requires that all of the datasets are standardized, for example, by ensuring that coding schemes for nominal or categorical variables are consistent. This standardization effort would be required whether data are pooled for analysis or a distributed analysis is used, however. While our primary use case has been the detection of ADEs from data distributed across multiple sites, the SPARK protocol can be used for other types of situations where the datasets are distributed, such as genetic association studies. The main drivers for using SPARK would be the need to expand the dataset that a model is built upon to increase statistical power and enhance population heterogeneity, and to deal with privacy concerns in an expeditious manner that would still ensure accurate model results.

Extensions to GEEs

In practice, one would expect stronger correlations among patients within a particular site than across sites. For example, there may be treatment, lifestyle, or environmental factors at one site that do not exist at other sites, leading to site-specific effects on the probability of an ADE. This kind of correlation can be accounted for by constructing GEEs. In the online appendix, we provide a description of GEEs and extend the SPARK protocol to implement GEEs for logistic regression.
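To make the GEE formulation concrete, the sketch below fits a logistic GEE with an exchangeable working correlation by Fisher scoring, using a simple moment estimate of the common within-cluster correlation. This is a plain, non-secure illustration in numpy, not the SPARK protocol itself, and the function name is our own; the secure version would compute the same per-cluster sums without exchanging raw data.

```python
import numpy as np

def gee_logistic_exchangeable(X, y, groups, n_iter=50, tol=1e-8):
    """Logistic GEE with an exchangeable working correlation,
    fitted by Fisher scoring.  Pure-numpy illustrative sketch."""
    n, k = X.shape
    beta = np.zeros(k)
    ids = np.unique(groups)
    alpha = 0.0
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        # Moment estimate of the common within-cluster correlation
        # from Pearson residuals (scale taken as 1 for simplicity).
        r = (y - mu) / np.sqrt(mu * (1 - mu))
        num, den = 0.0, 0
        for g in ids:
            rg = r[groups == g]
            m = len(rg)
            num += rg.sum() ** 2 - (rg ** 2).sum()   # sum over j != l of r_j r_l
            den += m * (m - 1)
        alpha = num / den if den > 0 else 0.0
        # Scoring update: beta += (sum_i D'V^-1 D)^-1 sum_i D'V^-1 (y - mu).
        H = np.zeros((k, k))
        s = np.zeros(k)
        for g in ids:
            idx = groups == g
            Xg, yg, mug = X[idx], y[idx], mu[idx]
            Ag = mug * (1 - mug)                     # Bernoulli variances
            m = len(yg)
            R = np.full((m, m), alpha)
            np.fill_diagonal(R, 1.0)                 # exchangeable working correlation
            V = np.sqrt(np.outer(Ag, Ag)) * R        # working covariance A^1/2 R A^1/2
            D = Xg * Ag[:, None]                     # dmu/dbeta for the logit link
            Vi = np.linalg.inv(V)
            H += D.T @ Vi @ D
            s += D.T @ Vi @ (yg - mug)
        step = np.linalg.solve(H, s)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta, alpha
```

When the sites contribute genuinely independent patients, the estimated working correlation is near zero and the estimates reduce to ordinary logistic regression.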

Extensions to other GLMs

The basic protocol we have presented here can be extended to other GLMs.94 The link functions for many other GLMs are simpler than the logit function, as illustrated in table 4. Secure computation of these link functions can be performed using the building blocks in this paper. For the Poisson log link, for example, we only need to compute the exponential of the product of the regression vector and the design matrix, which our protocol already provides.
Table 4 Examples of link functions

Name                   Function
Identity               μ
Reciprocal             1/μ
Reciprocal squared     1/μ²
Square root            √μ
Log                    ln(μ)
Complementary log-log  ln(−ln(μ))
Logit                  ln(μ/(1−μ))
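To illustrate why such extensions are straightforward, the sketch below runs IRLS for a Poisson GLM with the log link in plain numpy (again, not the secure protocol). The per-iteration quantities XᵀWX and XᵀWz are sums over observations, so each site could contribute its share securely, and the only nonlinear step is exp(Xβ), which the logistic protocol already computes.

```python
import numpy as np

def irls_poisson(X, y, n_iter=25, tol=1e-10):
    """IRLS for a Poisson GLM with the log link.  Illustrative sketch:
    each iteration needs only exp(X @ beta) plus matrix sums that
    could, in principle, be accumulated across sites."""
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)                  # inverse log link
        W = mu                            # working weights: (dmu/deta)^2 / var(mu) = mu
        z = eta + (y - mu) / mu           # working response
        XtWX = X.T @ (W[:, None] * X)     # in a distributed setting, summed over sites
        XtWz = X.T @ (W * z)
        new = np.linalg.solve(XtWX, XtWz)
        if np.max(np.abs(new - beta)) < tol:
            beta = new
            break
        beta = new
    return beta
```

At convergence the score equations Xᵀ(y − exp(Xβ)) = 0 are satisfied, matching what a standard GLM fitter would produce on the pooled data.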

Survival models

Another common technique for modeling adverse events is a time-to-event, or Cox, model. Time-to-event survival models can be used to investigate hospitalization, infection, or death. Survival analysis methods provide hazard rates and account for various types of censoring, such as withdrawal from the study, death from other causes, or loss to follow-up. Proportional hazards models, in particular, are among the most commonly used methods in health research, and can be expressed as an ordinal model using the complementary log-log link function on Bernoulli data.94 Our extension of the secure protocol to GLMs can therefore include this form of survival modeling.
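As a concrete, non-secure illustration of this GLM formulation, the sketch below expands time-to-event records into person-period Bernoulli records and fits them by IRLS with the complementary log-log link. The function names and the simple person-period layout (intercept, period index, covariates) are our own illustrative choices, not part of the SPARK protocol.

```python
import numpy as np

def irls_cloglog(X, y, n_iter=100, tol=1e-10):
    """IRLS for Bernoulli data with the complementary log-log link,
    eta = ln(-ln(1 - mu)): the GLM form of a discrete-time
    proportional hazards model."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 - np.exp(-np.exp(eta))           # inverse cloglog link
        mu = np.clip(mu, 1e-10, 1 - 1e-10)        # numerical safeguard
        dmu = np.exp(eta - np.exp(eta))           # dmu/deta
        W = dmu ** 2 / (mu * (1.0 - mu))          # working weights
        z = eta + (y - mu) / dmu                  # working response
        new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
        if np.max(np.abs(new - beta)) < tol:
            return new
        beta = new
    return beta

def person_period(time, event, covs):
    """Expand one record per subject into one Bernoulli record per
    person-period: y = 1 only in the period the event occurs."""
    rows, ys = [], []
    for t, e, x in zip(time, event, covs):
        for s in range(1, t + 1):
            rows.append(np.concatenate(([1.0, float(s)], x)))
            ys.append(1.0 if (e and s == t) else 0.0)
    return np.array(rows), np.array(ys)
```

With a constant per-period hazard, the fitted coefficient on the period index is near zero and the covariate effect on the cloglog scale is recovered, as expected for a correctly specified discrete-time hazard model.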

General limitations on remote analysis systems

A full implementation of the SPARK protocol would need to address some of the concerns that exist with remote analysis systems in general. In particular, if one considers the normal equations, (XᵀWX)β = XᵀWz, the left-hand side has k(k+1)/2 unknowns (the entries of the symmetric information matrix) and the right-hand side has k unknowns. One could therefore fit k(k+1)/2 + k sub-models to determine these unknowns. This does not require exposure of the information matrix; it can be done with the model results alone. To address concerns about the use of sub-models, it is necessary to monitor the number of sub-models that are created and limit their use accordingly. An adversary may attempt to circumvent limits on the number of sub-models by running multiple sub-models on highly correlated outcomes. However, the uncertainty introduced by using a different outcome may be enough to ensure that the data are not recoverable. Alternatively, the protocol may use a different sub-sample of observations when building each sub-model. This is the preferred method discussed in Sparks et al,68 although further investigation may be required to determine appropriate bounds on the desired level of uncertainty. Other disclosure risks associated with allowing an analyst to manipulate models through a remote analysis system can be mitigated through a variety of means,68 for example: limiting variable transformations to the most common ones (eg, log, square root), disallowing transformations of factors, disallowing sparse factors or interactions, rounding estimates, and using samples.
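The sub-model budget implied by this counting argument is easy to compute. The helper below (named here only for illustration) gives the number of sub-models that would suffice to pin down all the unknowns, and hence a natural ceiling for a sub-model monitoring policy.

```python
def submodel_budget(k):
    """Number of unknowns in the normal equations (X'WX) b = X'Wz
    for k predictors: k(k+1)/2 entries of the symmetric left-hand
    matrix plus k entries of the right-hand vector."""
    return k * (k + 1) // 2 + k
```

For example, a model with 10 predictors has 55 + 10 = 65 unknowns, so a monitor that caps the analyst well below 65 sub-models on the same outcome blocks this reconstruction route.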