Literature DB >> 32657380

Privacy-preserving construction of generalized linear mixed model for biomedical computation.

Rui Zhu¹, Chao Jiang², Xiaofeng Wang¹, Shuang Wang¹, Hao Zheng³, Haixu Tang¹.

Abstract

MOTIVATION: The generalized linear mixed model (GLMM) is an extension of the generalized linear model (GLM) in which the linear predictor takes random effects into account. Given its power of precisely modeling the mixed effects from multiple sources of random variations, the method has been widely used in biomedical computation, for instance in the genome-wide association studies (GWASs) that aim to detect genetic variance significantly associated with phenotypes such as human diseases. Collaborative GWAS on large cohorts of patients across multiple institutions is often impeded by the privacy concerns of sharing personal genomic and other health data. To address such concerns, we present in this paper a privacy-preserving Expectation-Maximization (EM) algorithm to build GLMM collaboratively when input data are distributed to multiple participating parties and cannot be transferred to a central server. We assume that the data are horizontally partitioned among participating parties: i.e. each party holds a subset of records (including observational values of fixed effect variables and their corresponding outcome), and for all records, the outcome is regulated by the same set of known fixed effects and random effects.
RESULTS: Our collaborative EM algorithm is mathematically equivalent to the original EM algorithm commonly used in GLMM construction. The algorithm also runs efficiently when tested on simulated and real human genomic data, and thus can be practically used for privacy-preserving GLMM construction. We implemented the algorithm for collaborative GLMM (cGLMM) construction in R. The data communication was implemented using the rsocket package.
AVAILABILITY AND IMPLEMENTATION: The software is released in open source at https://github.com/huthvincent/cGLMM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical Disease Species

Mesh：

Year: 2020 PMID： 32657380 PMCID： PMC7355231 DOI： 10.1093/bioinformatics/btaa478

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Owing to the rapid advances in DNA-sequencing technologies, human genomic studies, such as genome-wide association studies (GWASs) have been increasingly used to identify the genetic variants susceptible to human diseases (Hirschhorn and Daly, 2005; McCarthy ). Meanwhile, many computational methods have been developed to enhance the sensitivity and statistical power of GWAS (McCulloch, 2003a). Among them, the linear mixed model (LMM) aims to explain the variation of a target phenotype in a population by a mixture of fixed effects (variables of interests) and random effects (unknown variables), which are shown to improve the identification rate of potentially causal variants of human disease (Golan and Rosset, 2018). However, LMM is designed for quantitative traits (e.g. blood pressure), and cannot be directly applied to categorical phenotypes. The generalized linear mixed model (GLMM) extends the LMM to support both categorical and quantitative phenotypes, and thus were frequently used in GWAS (e.g. for binary case–control studies or ordered disease stages; McCulloch, 2003b). In a generic setting, a GLMM can be constructed using human genomic data (i.e. genotypes inferred from genome sequences) from a cohort of phenotyped human individuals, which requires data analysts to have direct access to the individual-level genomic data for every member in the cohort. In practice, it may be of great biomedical interest to assemble a large cohort of human genomes from multiple studies of the same disease for GWAS (often referred to as the meta-analysis; Begum ; Jeck ; Pharoah ). For this purpose, the genomic data need to be collected and stored across several institutions, since it is difficult to move the data to a central site due to the challenges in data transmission (large data size), privacy protection (personal human genomic data are identifiable and thus sensitive) and the restriction of institutional data disclosure policy. As a result, privacy-preserving algorithms should be in place to enable computation on distributed genomic data without sharing individual-level genomic variants. In the past decade, privacy-preserving algorithms have been developed for protecting various statistical methods, including survival analyses [e.g. using the Cox model (Bradburn ; Lu ; Yu )], missing data imputation (Jagannathan and Wright, 2008) and logistic regression (Wu ). These algorithms follow the same collaborative computation approach, where the computation for targeted statistical methods is partitioned depending on the required input data: some computation is done on the locally retained genomic data to obtain often less-sensitive intermediate results, whereas the other computation is performed on a central server using the intermediate results as the input. As a step forward on this direction, in this article, we present a privacy-preserving method for constructing a GLMM using a collaborative Expectation–Maximization (EM) algorithm that combines the Metropolis–Hasting (MH) algorithm in the E-step and the Newton–Raphson (NR) algorithm in the M-step to estimate parameters of the GLMM. We partition the computation for both the MH and the NR algorithm into the private and joint components, which are carried out by each participating party, whose intermediate results are then combined by a central server. The resulting collaborative EM algorithm is mathematically equivalent to the original EM algorithm (i.e. commonly used in GLMM model construction Booth and Hobert (1999). We implemented the collaborative GLMM (cGLMM) algorithm in R, and evaluated its performance using both simulated and real-world human genome data. The results show that the collaborative EM algorithm is efficient, and also accurate. We note that the cGLMM method can be applied to other biomedical computation tasks where privacy-preserving approaches are needed, e.g. for building predictive models of human diseases using diagnostic information retrieved from patients’ Electronic Health Records (EHRs) that are held by multiple medical institutions but cannot be shared outside each respective institution.

2 Materials and methods

2.1 Background

2.1.1 Generalized linear mixed model

In an LMM, the outcome is a continuous variable. In GWAS, however, the outcome is often categorical, e.g. the binary outcome representing disease or healthy in a case-control study. The GLMM extends LMM by incorporating a link function to convert a continuous outcome into a categorical outcome (McCulloch, 2003b). Here, we use the logit function: as the link function to deal with the binary outcome often used in the case-control GWAS: where X represents fixed effects and Z represents the random effect, whereas β and u are the coefficients for the fixed and the random effects, respectively. Below, we summarize the EM algorithm for GLMM parameter estimation from a total of N input records, each with the given fixed and random effects as well as the desirable outcome (0 or 1). Let T be the number of random effect categories and M be the number of fixed effect variables. We consider as one of the random effect categories, and as one of the fixed effect variables. For the convenience of presenting the privacy-preserving EM algorithm for parameter estimation in the next section, here we assume each random effect category contains the equal number of P input records (thus, ), and indexes each record. Our implementation does not have this constraint and allows for each category to contain different number of records. Hence, the outcomes for all records can be represented as a response matrix, The input random effects are represented as, Z is a vector with length T. For ith element Z () is sampled from a normal distribution . Note that different with most notation of GLMM, We remove ‘Membership matrix’ u by reshape input matrix X as Equation (4). For example, in (4) X1 is the first fixed effect variable. It is a matrix which has T rows, where T is number of levels of random effect. So for the ith row of X1, all the record has same random effect level, so that they are all correspond to Z which is the level of random effect of ith record. It different from most notation in GLMM, but this is the only notation way that we think can elaborate cGLMM. Throughout the rest of this article, we will use to represent the mth input fixed effect values in tth level of random effect, pth record, and use to represent the desirable outcomes, where is a M-dimensional vector (as shown below), and is the binary outcome [as shown on (3)]. mth fixed effect is a T × P matrix, denote as X. So the corresponding fixed effect coefficients β are represented in an M-dimensional vector, We use an EM algorithm to estimate the parameters (Z and β) of an optimal GLMM: in the M-step, given the current latent variables (σ), we compute the fixed effect coefficients β by using the NR algorithm, and in the E-step, we compute the latent variables (σ) based on the current model parameters β by using the MH algorithm. The EM algorithm iterates between these two steps until convergence. Specifically, in the MH algorithm, we started from randomly selected σ for each category i, and then sample random effects , where each z is randomly sampled from . Then in each MH sampling step, we first sample new random effects in the neighborhood of based on the current , and then compute a proposal probability A between and z using the current model parameters β to decide if z should be replaced by in Z for the next step. The proposal function is defined as where is the likelihood of random effect according to the current GLMM model, and represents the likelihood of observing a record according to the current . Specifically, can be written as (McCulloch, 2003a): Let MMH be the total number of sampling steps in the MH algorithm, which can be set to between 500 and 1000. After we complete MMH iterations, the variance of the final samples are used to compute the updated σ. The details of the MH algorithm will be further illustrated in Section 2.3, when we present the privacy-preserving version of the GLMM construction algorithm. In the M-step of the EM algorithm, we use the NR algorithm to compute the fixed effect coefficients β based on the current random effects Z and the parameters σ (updated in the E-step). NR is a second-order optimization algorithm, which uses the first-order derivative to choose the optimization direction, and the second-order derivative to choose the step length. Comparing with the first-order optimization algorithms, the NR algorithm takes fewer steps to converge. Specifically, in each NR iteration i, β is updated by where the first-order derivative and the Hessian matrix H can be computed by We use a threshold ϵ to determine if the optimization converges, i.e. the NR algorithm terminates when the distance between the βs from two consecutive steps is smaller than ϵ. The details of the NR algorithm will be illustrated in Section 2.4.

2.2 Privacy-preserving GLMM construction on horizontally partitioned data

In this article, we consider a scenario of collaborative computation for privacy-preserving GLMM construction, where each of the collaborative parties (e.g. medication institutions) holds a sensitive dataset (e.g. the genotypes from a cohort of human individuals including disease patients and healthy controls) and attempts to build a GLMM model collaboratively without sharing its raw dataset to the other parties. This scenario is often referred to as the collaborative computation on horizontally partitioned data (Jiang ; Wang ), where each participating party holds the complete records from a subset of subjects, whereas, in a different scenario called vertically partitioned data (Li ), each participating party holds a different portion of records from the same set of subjects that can be matched among all parties. The general idea of collaborative computation is to partition a target algorithm (e.g. to construct the GLMM) into two parts: the first part of private computation can be performed on the partial data held by each participating party that generate intermediate results required for the second part of computation, and the second part joint computation has to be performed on a central server using the intermediate results from the private computation of all parties. In this scenario, the central server is considered to be semi-trusted (honest-but-curious), and thus the sensitive raw data cannot be directly sent to the central server by each participating party. In practice, collaborative algorithms for many tasks (e.g. uncertainty quantification; Sciacchitano ) have been developed. In many cases, they involved multi-round communications where in each round, not only the intermediate results were transferred from each party to the central server for the joint computation, but the intermediate results from joint computation needs to be sent back to each party for the next round of private computation. Notably, the collaborative computation for building a machine learning model is also referred to as the federated learning approach (Konečnỳ ), which aims to save the cost of transferring a large amount of training data among participating parties while protecting the privacy of sensitive training data. In some cases, the intermediate results may carry sensitive information and thus can be used to infer the presence of a record (e.g. by using a re-identification attack (El Emam ). In these cases, the joint computation should be performed by using encryption protocols (e.g. homomorphic encryption (HE; Gentry, 2009; Wang ), or in a trusted execution environment (TEE; Sabt ). Consider the input data matrix X containing the fix effects (e.g. the genotypes) on a total of N records (disease patients) in each of the T random effect categories (Fig. 1a). In the horizontal data partition scenario, the entire data matrix was not held by a single user, but instead by K different parties: the ith party holds a subset of T × P records (Fig. 1c). Again, for the convenience of presentation, here we assume each party holds the same number (P) of records in each of the T random effect categories; as a result, . However, in our implementation, we can handle the situation where different parties hold a different number of records, and also the records may be distributed unevenly among random effect categories. Using the separately held input data matrix, our goal is to develop the collaborative computation algorithm for building a GLMM model among the K participating parties, through the partition of private and joint computation for the EM algorithm as laid out above.

Fig. 1.

The input fixed effect matrix is jointly held by K parties, each holding a subset of records (a). Hence, the input matrix can be viewed as horizontally partitioned (b), and each party holds a submatrix containing a subset of rows (c) The cGLMM presented here takes as the input the data matrix jointly held by multiple parties and uses the EM algorithm to estimate the parameters of the GLMM (i.e. β for fixed effect coefficients and σ for the random effect). In each iteration of the E-step (MH algorithm) and M-step (NR algorithm), each party first computes the intermediate results from their own data, and then transfer the results to a central server to compute the updated parameters, which will then be sent back to each party for the next iteration. Figure 2 illustrates the workflow of the collaborative EM algorithm for cGLMM, in which each iteration consists of the E-step (MH algorithm; Fig. 2, left) and M-step (NR algorithm; Fig. 2, right), and each of them is performed in a collaborative manner. Both the MH and NR algorithms can be partitioned into the private computation (executed by each party separately) and the joint computation (executed on a central server) such that the intermediate results generated by the private computation of each party are sent to the central server for joint computation (①), and the intermediate results generated by the joint computation are then sent back to each party for the private computation in the next iteration ②. As a result, the collaborative EM algorithm involves multiple rounds of data communications, where the number of rounds is equal to the number of iterations in the EM algorithm.

Fig. 2.

The workflow of the collaborative EM algorithm for building cGLMM jointly by multiple participating parties

The workflow of the collaborative EM algorithm for building cGLMM jointly by multiple participating parties In the next two sections, we will present the details of the collaborative EM algorithm, specifically the partition of the private and joint computation for the MH and NR algorithms, respectively.

2.3 Collaborative MH algorithm

As described in Section 2.1.1, in the E-step of the EM algorithm for the parameter estimation of GLMM, we attempt to estimate the fixed effect coefficients (β) based on the current estimation of the random effect parameters (σ), using the MH algorithm. Given the horizontally partitioned data, we need to modify the MH algorithm into a collaborative version that consists of the private computation, in which each participating party computes some intermediate results from its partial data, and the joint computation, in which the intermediate results from all parties are transferred to the central server to compute the new fixed effect coefficients for the next step. Figure 3a illustrates the procedure of the collaborative MH algorithm. As an initial step, based on the current estimated σ, the central server samples a set of random effects as the initial pool of samples, and sends them to all participating parties. In each subsequent iteration i, the central server first samples a new set of random effects Z from , and sends them to all parties. Then each party k () computes the intermediate results A on its private server by using the random effect from the previous step (), the new random effect (), as well as the private data records and held by the party. Then by using A we can update the result to get . where P is the same number of records in each category of random effect, and is the outcome (e.g. in the case-control GWAS study) of the pth record, and is the input fixed effect variables (e.g. the genotypes in GWAS) in all T random effect categories, which are held privately by the kth participating party. Again, for the convenience of presentation, here we assume each party k holds the same number (P) of records in each random effects category. In our implementation, however, different numbers of records are allowed to be held by different parties and in different random effects categories.

Fig. 3.

The data communication between the central server and participating parties in the collaborative MH algorithm (E-step; a) and the collaborative NR algorithm (M-step; b)

The data communication between the central server and participating parties in the collaborative MH algorithm (E-step; a) and the collaborative NR algorithm (M-step; b) The intermediate result A computed in the private computation by the kth party is then sent to the central server, where the joint computation is performed for computing the proposal probability density A, to decide if the previous samples of the random effect should be replaced by the new estimates z. The proposal probability A acts like a filter to replace the less likely (i.e. fitting improperly to the outcome according to the current estimates of fixed effect parameters β) random effects parameters sampled in the previous step: if A = 1, the previous parameters are definitively replaced; otherwise, it is replaced with the probability A. After the update of iteration i, the random effects Z are stored in the central server for the subsequent iterations in the MH algorithm. The iteration of the algorithm continues until the parameter estimation converges. In our experiments, the MH algorithm usually converges after 1000 iterations. After convergence, the central server will retain a pool of random effects obtained in each iteration. In order to better estimate the parameters , we adopted the burn-in strategy commonly used in MH algorithms (Chib and Greenberg, 1995), which eliminates some sampled parameters in the initial steps of MH sampling. Based on the final sample pool of random effects , the central server updates the parameters σ, which will be used in the M-step (see the collaborative NR algorithm in Section 2.4) for estimating the fixed effect coefficients β. The key idea of the collaborative MH algorithm presented here is the partition of the computation of proposal probability into two components, the private and joint computation, respectively. This partitioning approach is equivalent to the non-collaborative MH algorithm presented in Section 2.1.1. As a result, the collaborative MH algorithm offers mathematically equivalent solutions as the original MH algorithm in the E-step of the EM algorithm for GLMM parameter estimation, even though in practice, the solutions may not be identical due to the stochastic sampling in the MH algorithm. Apparently, the collaborative MH algorithm requires the central server and the server at each participating party to remain online to facilitate the in-time communication between servers for the coordinated private/joint computation. The entire computation involves MMH rounds of communications, where MMH represents the number of MH iterations before it converges. In each iteration, the central server sends a P dimensional vector (i.e. z) to each participating party, and receives another P dimensional vector i.e. A, from each party. In addition, the server at each party computes A privately, which takes O(P) running time and the central server computes the aggregate A (Equation 11), which takes O(P) time for each iteration. The private computation completed by the participating parties spends a majority of the running time during the entire process.

2.4 Collaborative NR algorithm

In the M-step, we use a NR algorithm to estimate the fixed effect coefficients (β) based on the current estimates of the random effect parameters σ. Similar to the E-step, we propose a collaborative NR algorithm that partitions the computation of the first-order derivative and Hessian matrix of the likelihood function, which are required to update the fixed effect coefficients, into the private and joint computation. Figure 3b illustrates the procedure of the collaborative NR algorithm in the M-step. Before the iteration starts, the central server will pick up the random β0 and Z as input. In each subsequent iteration i, we attempt to update β by using Equation 8, and thus we need to compute the Hessian matrix H and the first-order derivative . Notably, the computation of H and in the non-collaborative NR algorithm requires the entire input dataset, i.e. Y and X of each record t (Equation 9). To compute them without sharing the data among participating parties, we re-write the equations for computing the Hessian matrix and the first-order derivative. As a result, each participating party k can compute the intermediate results on its private server, including the partial Hessian matrix H and the partial first-order derivatives . We denote the fixed effects of the pth record in the tth category of random effect held by the kth party as . The k*th party then computes and sends the intermediate results to the central server for the joint computation of the full Hessian matrix and first-order derivatives, which are sent back to each participating party to update the fixed effect coefficients of its records (Equation 8) for the next iteration. Finally, the iterative procedure terminates after the estimates of the fixed effect coefficients converge or it reaches the preset maximum number of iterations. Notably, the collaborative computation for the Hessian matrix and the first-order derivatives is equivalent to the non-collaborative NR algorithm presented in Section 2.1.1. Therefore, the collaborative NR algorithm generates identical results (i.e. the fixed effect coefficients β) as the original NR algorithm in the M-step of the EM algorithm for GLMM parameter estimation. Similar to the collaborative MH algorithm, the collaborative NR algorithm also requires the central server and all participating parties’ servers to remain online for in-time communication. The entire computation involves MNR rounds of communications, where MNR represents the number of NR iterations before it converges. In each iteration, the central server receives the Hessian matrix (in M × M dimension) and first-order derivative (i.e. a M-dimensional vector) computed by each party and then sends back the full Hessian matrix and first-order derivative vector of the same size to each party. In the private computation, the server at each party first computes the partial Hessian matrix and partial first-order derivatives (Equations 12 and 13), which takes and running time, respectively, and also updates the fixed effect coefficients β, which takes time, where represents the number of fixed effect variables. On the other hand, in the joint computation, the central server combines these intermediate results into the full Hessian matrix and first-order derivatives (Equation 14), which only takes time, where N is the total number of records held by all parties. Overall, the private and the joint computation take about the same amount of running time in the collaborative NR algorithm. Above we presented the collaborative algorithms for the E- and M-step, respectively. It is straightforward to combine these them into a collaborative EM algorithm, which iterates between these two steps until the estimated parameters (including the fixed effect coefficients β and the random effect parameter σ) converge. The entire process requires rounds of communications, where C represents the number of iterations of the EM algorithm.

2.5 Implementation

We implemented the algorithm for cGLMM construction in R. The data communication was implemented using the rsocket package. The software is released in open source at https://github.com/huthvincent/cGLMM.

3 Results

3.1 Simulation experiments

We first conducted simulated experiments to test the performance of the cGLMM construction. We simulated three different datasets with 50(3.5 KB), 100 (11.8 KB) and 150 (25.6 KB) fixed effects over 1000, 2000 and 3000 records, respectively, which were held by two participating parties. For each dataset, there is one random effect with five different levels. We note that the numbers of fixed effects simulated here resemble the real cases in a collaborative study, in which a participating party have already identified a small number of putative effects from its own data, and hope to use the data from the other parties to validate these candidates. We compared the results of cGLMM using the collaborative EM algorithm with results of the regular GLMM using the non-collaborative EM algorithm. We considered only two parties; the running time may not be much longer when more parties are involved because the computation carried out by each party remains the same as long as each party has about the same number of records. For comparison purpose, we used the same convergence conditions and the same initial selection of parameters in the cGLMM and regular GLMM. Nevertheless, the final results were similar but not identical because the MH algorithm in the E-step may introduce some randomness. We set the maximum MH iteration as 1000, and only retain the last two sets of samples in the pool (i.e. burn-in = 998). The convergence condition of the EM algorithm is that the Euclidean distances between all parameters in three consecutive iterations are all below a threshold of 0.08. The same threshold was also used to determine the convergence of the MH algorithm. We simulated the genotypes as the fixed effects. The variable of each fixed effect can take , where 00, 01 and 11 represents the genotypes of homozygous major, heterozygous and homozygous minor, respectively. The frequencies of the three genotypes were simulated following the Hardy–Weinberg equilibrium with a specific minor allele frequency; e.g. if the minor allele frequency is 0.3, the simulated frequencies of three genotypes follow 0.49, 0.42 and 0.09, respectively. Finally, in the simulation of the LMM, we assume the minor allele has an additive effect on the outcome; as a result, the fixed effect variables were simulated as integer values of {0, 1, 2}, representing the three genotypes, respectively. GLMM can be applied to other models of genetic effect (e.g. the dominant effect model), which were not tested here. Table 1 compares the results from the collaborative (cGLMM) and non-collaborative (GLMM) EM algorithm on the three simulated datasets. Both algorithms converge after a small number of iterations, while cGLMM runs slower than GLMM, which is mostly due to the communication overhead. We note that our evaluation was conducted using three separate jobs (simulating two servers of the participating parties and the central server, respectively) on the same computer; thus, in reality, the running time of cGLMM can be significantly longer, depending on network bandwidth among the participating servers. However, even though the number of rounds of communication between servers is overall high, the total amount of data transferred is still moderate. Interestingly, we observe that with the increasing size of the GLMM model (i.e. with more fixed effect variables), the overhead of cGLMM comparing with GLMM actually decreases, probably because for complex GLMM problems, significantly more time is spent on the actual computation comparing to data communication. Finally, in all cases, cGLMM reported fixed effect coefficients that are nearly identical to the actual values used in the simulation [Pearson correlation coefficient (PCC) = 0.99] and the results from GLMM, suggesting cGLMM achieved the same accuracy as GLMM.

Table 1.

We run cGLMM and GLMM 10 times, below is the average comparison of GLMM and cGLMM on simulated datasets with

No. of SNPs	Iterations		Running time (min)		Communication round (cGLMM)	Communication size (cGLMM; KB)	PCCs^a
No. of SNPs	GLMM	cGLMM	GLMM	cGLMM	Communication round (cGLMM)	Communication size (cGLMM; KB)	PCCs^a
50	7 ± 1	7 ± 1	0.062 ± 0.01	2.9 ± 0.4	6107 ± 877	369.62 ± 53	0.99
100	8 ± 1	7 ± 1	0.096 ± 0.01	4.2 ± 0.6	6505 ± 930	807.40.1 ± 116	0.99
150	9 ± 2	8 ± 2	2.6 ± 0.6	26.6 ± 6.4	7825 ± 1842	1403.22 ± 342	0.99

PCCs, Pearson correlation coefficients between the actual fixed effect coefficients used in simulation and those reported by GLMM and cGLMM, respectively.

We run cGLMM and GLMM 10 times, below is the average comparison of GLMM and cGLMM on simulated datasets with PCCs, Pearson correlation coefficients between the actual fixed effect coefficients used in simulation and those reported by GLMM and cGLMM, respectively. Figure 4 compares the results of GLMM and cGLMM on the simulated dataset containing 150 fixed effect variables (genotypes) on 3000 records. As shown in Figure 4a and b, respectively, the distances between the fixed effect coefficients (β) and the random effect parameters (σ) in consecutive iterations becomes close to zero after only a few iterations of the collaborative EM algorithm, indicating it converges as fast as the non-collaborative EM algorithm. Furthermore, the fixed effect coefficients reported in cGLMM are very to those in GLMM and the actual coefficients used in the simulation (Fig. 4c; PCC > 0.99 with the actual values), indicating cGLMM reported comparable results as GLMM. The results from the other two datasets (containing 50 and 100 fixed effect variables) are generally similar, and are shown in Supplementary Figures S1 and S2. These results suggested the collaborative EM algorithm is accurate and efficient for constructing GLMM collaboratively among multiple parties.

Fig. 4.

The comparison between the results from cGLMM and GLMM on the simulated dataset containing 150 fixed effects in a typical run, for the distances between the fixed effect coefficients (β) (a), and the distances between the random effect parameters (σ) in consecutive iterations of the collaborative (in red) and non-collaborative (in green) EM algorithm, (b) as well as the actual fixed effect coefficients (β, sorted in decreasing order; in blue) and those reported by GLMM (in green) and cGLMM (in red), (c) respectively

3.2 Real human genomic data

Next, we used the real human genomic data from the 1000 Genomes Project (Wang ) to test our algorithm. The 1000 Genomes Project has total 2000 records, contain 1000 records in control group and 1000 records in case group. We selected all 2000 records from the whole dataset. We then selected the 10 most significant SNPs (by using a test; including five positively and five negatively correlated with Group I versus Group II) between these two groups (Group I and II, simulating the case and control groups in GWAS), and another randomly selected 40, 90 and 140 SNPs to form the three testing datasets containing 50, 100 and 150 SNPs (used as the fixed effect variables in GLMM), with the total data size of 28.6, 60.5 and 84.0 KB, respectively. We considered the gender of the individuals as the random effect: Group I contains 505 males and 495 females while Group II contains 479 males and 521 females. Similar to the simulation experiment, we consider the additive genetic model, in which the fixed effect variables take 0, 1 or 2, representing the genotypes of homozygous major, heterozygous and homozygous minor, respectively. We randomly split the genomes into two subsets containing 907 and 1093 records, respectively, assuming each participating party holds one of them. Table 2 compares the results from the collaborative (cGLMM) and non-collaborative (GLMM) EM algorithm on the three real human genomic datasets. Similar to the results from the simulated data, both algorithms converge after a few EM iterations, while cGLMM runs about 30 times slower than GLMM, which is mostly due to the communication overhead. In all cases, cGLMM reported fixed effect coefficients that nearly identical to the results from GLMM (PCC > 0.99).

Table 2.

Comparison between the results from cGLMM and GLMM on real human genomic datasets

No. of SNPs	Iterations		Running time (min)		Communication round (cGLMM)	Communication size (cGLMM; KB)	PCCs^a
No. of SNPs	GLMM	cGLMM	GLMM	cGLMM	Communication round (cGLMM)	Communication size (cGLMM; KB)	PCCs^a
50	4 ± 0.2	4 ± 0.2	0.05 ± 0.003	1.6 ± 0.08	3006 ± 110	55.8 ± 5.9	0.99
100	5 ± 0.3	5 ± 0.3	0.06 ± 0.003	3.1 ± 0.2	4004 ± 235	145.6 ± 11.5	0.99
150	5 ± 0.3	5 ± 0.3	0.2 ± 0.01	5.0 ± 0.3	4006 ± 241	615.6 ± 48.7	0.99

PCCs, Pearson correlation coefficients between the fixed effect coefficients reported by GLMM and those reported by cGLMM. All resource list in here are averages of 10 runs with .

Comparison between the results from cGLMM and GLMM on real human genomic datasets PCCs, Pearson correlation coefficients between the fixed effect coefficients reported by GLMM and those reported by cGLMM. All resource list in here are averages of 10 runs with . Figure 5 compares the results of GLMM and cGLMM on the real human genomic dataset containing 150 SNPs (including 10 significant SNPs between the two groups) over the 2000 genomes in two groups. Similar to the results from the simulated datasets, the cGLMM algorithm converges as fast as the GLMM algorithm (Fig. 5a and b), and achieved nearly identical results as the regular GLMM algorithm (Fig. 5c), where the 10 SNPs (leftmost and rightmost SNPs in Fig. 5c) which are highly correlated with the two groups receive high fixed effect coefficients. The testing on the other two datasets (containing 50 and 100 SNPs) showed similar results (see Supplementary Figs S3 and S4).These results confirmed the satisfactory performance of our cGLMM algorithm on real-world human genomic data.

Fig. 5.

The comparison between the results from cGLMM and GLMM on the human genomic dataset in a typical run, containing 150 SNPs (including 10 SNPs significantly associated with the two groups), for the distances between the fixed effect coefficients (β) (a), and the distances between the random effect parameters (σ) in consecutive iterations of the collaborative (in red) and non-collaborative (in blue) EM algorithm, as well as the fixed effect coefficients (β, sorted in decreasing order) reported by GLMM (in blue) and cGLMM (in red), respectively We note that the total size of transferred from the participating party to the central server is comparable or greater than the input data size. For example, the input data size is 28.6 KB for 50 SNP dataset, while on average 27.9 KB data are transferred from participating parties to the central server (the data of equal size are transferred from the central server to the parties and thus in total 55.8 KB are communicated) in the computation. Even though the collaborative algorithm does not reduce the size of data transferred from participating party to the central server, the transferred data are aggregate intermediate results instead of the raw genetic data. Furthermore, the communication data size is only dependent on the number of fixed effects (SNPs) and thus does not increase with more genomes (i.e. 1000, 2000 and 3000 in our experiments). Finally, we are implementing a HE version of cGLMM using homomorpheR (https://github.com/bnaras/homomorpheR), in which the intermediate results from each party will be encrypted and the collaborative computation will be implemented using the Paillier homomorphic addition. The updated implementation will be made available at https://github.com/huthvincent/cGLMM.

4 Discussions

Our current collaborative EM algorithm follows a distributed computing approach, in which the input data are partitioned and the computation is split among participating institutions. As a result, each partition of the input data can remain on the server of the respective participant that holds the partial data, and only the intermediate results (e.g. the partial proposal probabilities and the partial Hessian matrix) need to be sent out. Still, some computation needs to be performed on a central server, and the intermediate results need to be communicated between the central server and the server at each participating party. Our experimental results showed that even though several thousands rounds of communication are required to complete a reasonable-size GLMM task, the total amount of data transferred between servers is moderate (<100 MB). Therefore, we think our privacy-preserving algorithm can be practically used for collaboratively building GLMMs. However, we note that the many rounds of communication due to the required data exchange in each iteration of the MH algorithm may reduce the applicability of the method. We plan to explore the approximation approaches to the MH algorithm that may reduce the rounds of communication while without sacrificing its accuracy. We note that our current approach can only allow one random effect exist in the model and does not protect the intermediate results sent by each participating institution to the central server: they are sent in plain text. The assumption here is that the intermediate results do not carry sufficient sensitive information about the individual records, which is commonly adopted by existing privacy-preserving algorithms for other biomedical computation tasks, e.g. for logistic regression Wu . However, our method can be combined with encryption methods to provide additional protection on the intermediate results. For example, we may use HE (Kim ) to compute the proposal probability (Equation 11), the Hessian matrix and the first-order derivatives (Equation 14), respectively, on the central server. Because only additions are needed for these computations, one can use the light-weight Paillier cryptosystem with additive homomorphic properties (Parmar ), which introduces moderate overhead on running time and communication. A more efficient solution may also be implemented using the recently available TEEs (Chen , 2017a,b), such as Intel’s Software Guard Extensions (McKeen ). We are currently implementing the HE version of the cGLMM model using homomorpheR, and will release the updated version of cGLMM on its website. Our current approach addresses the cGLMM construction on horizontally partitioned data, when each participating institution holds the complete records of a subset of subjects. In comparison, the data may be vertically partitioned in some scenarios of collaborative computation. For examples, two medical institutions may have complementary clinical data (i.e. a subset of fixed effects) of the same patients, and attempt to combine their partial data for some joint analyses. Privacy-preserving methods have been developed for some data mining and machine learning algorithms on vertically partitioned data, e.g. for association rule mining (Vaidya and Clifton, 2002), k-means clustering (Vaidya and Clifton, 2003), naive Bayes classifier (Vaidya and Clifton, 2004) and support vector machine (Yu ). We plan to develop privacy-preserving algorithms for cGLMM construction on vertically partitioned data in the future. Click here for additional data file.

18 in total

Review 1. Genome-wide association studies for common diseases and complex traits.

Authors: Joel N Hirschhorn; Mark J Daly
Journal: Nat Rev Genet Date: 2005-02 Impact factor: 53.242

2. WebDISCO: a web service for distributed cox model learning without patient-level data sharing.

Authors: Chia-Lun Lu; Shuang Wang; Zhanglong Ji; Yuan Wu; Li Xiong; Xiaoqian Jiang; Lucila Ohno-Machado
Journal: J Am Med Inform Assoc Date: 2015-07-09 Impact factor: 4.497

Review 3. Genome-wide association studies for complex traits: consensus, uncertainty and challenges.

Authors: Mark I McCarthy; Gonçalo R Abecasis; Lon R Cardon; David B Goldstein; Julian Little; John P A Ioannidis; Joel N Hirschhorn
Journal: Nat Rev Genet Date: 2008-05 Impact factor: 53.242

4. PREMIX: PRivacy-preserving EstiMation of Individual admiXture.

Authors: Feng Chen; Michelle Dow; Sijie Ding; Yao Lu; Xiaoqian Jiang; Hua Tang; Shuang Wang
Journal: AMIA Annu Symp Proc Date: 2017-02-10

5. PRINCESS: Privacy-protecting Rare disease International Network Collaboration via Encryption through Software guard extensionS.

Authors: Feng Chen; Shuang Wang; Xiaoqian Jiang; Sijie Ding; Yao Lu; Jihoon Kim; S Cenk Sahinalp; Chisato Shimizu; Jane C Burns; Victoria J Wright; Eileen Png; Martin L Hibberd; David D Lloyd; Hai Yang; Amalio Telenti; Cinnamon S Bloss; Dov Fox; Kristin Lauter; Lucila Ohno-Machado
Journal: Bioinformatics Date: 2017-03-15 Impact factor: 6.937

6. EXpectation Propagation LOgistic REgRession (EXPLORER): distributed privacy-preserving online model learning.

Authors: Shuang Wang; Xiaoqian Jiang; Yuan Wu; Lijuan Cui; Samuel Cheng; Lucila Ohno-Machado
Journal: J Biomed Inform Date: 2013-04-04 Impact factor: 6.317

7. HEALER: homomorphic computation of ExAct Logistic rEgRession for secure rare disease variants analysis in GWAS.

Authors: Shuang Wang; Yuchen Zhang; Wenrui Dai; Kristin Lauter; Miran Kim; Yuzhe Tang; Hongkai Xiong; Xiaoqian Jiang
Journal: Bioinformatics Date: 2015-10-06 Impact factor: 6.937

8. VERTIcal Grid lOgistic regression (VERTIGO).

Authors: Yong Li; Xiaoqian Jiang; Shuang Wang; Hongkai Xiong; Lucila Ohno-Machado
Journal: J Am Med Inform Assoc Date: 2015-11-09 Impact factor: 4.497

9. A systematic review of re-identification attacks on health data.

Authors: Khaled El Emam; Elizabeth Jonker; Luk Arbuckle; Bradley Malin
Journal: PLoS One Date: 2011-12-02 Impact factor: 3.240

Review 10. Survival analysis part II: multivariate data analysis--an introduction to concepts and methods.

Authors: M J Bradburn; T G Clark; S B Love; D G Altman
Journal: Br J Cancer Date: 2003-08-04 Impact factor: 7.640

2 in total

1. dPQL: a lossless distributed algorithm for generalized linear mixed model with application to privacy-preserving hospital profiling.

Authors: Chongliang Luo; Md Nazmul Islam; Natalie E Sheils; John Buresh; Martijn J Schuemie; Jalpa A Doshi; Rachel M Werner; David A Asch; Yong Chen
Journal: J Am Med Inform Assoc Date: 2022-07-12 Impact factor: 7.942

2. Federated learning algorithms for generalized mixed-effects model (GLMM) on horizontally partitioned data from distributed sources.

Authors: Wentao Li; Jiayi Tong; Md Monowar Anjum; Noman Mohammed; Yong Chen; Xiaoqian Jiang
Journal: BMC Med Inform Decis Mak Date: 2022-10-16 Impact factor: 3.298

2 in total