P Saha-Chaudhuri, C R Weinberg.
Abstract
BACKGROUND: Data confidentiality and shared use of research data are two desirable but sometimes conflicting goals in multi-center studies with distributed data. A single dataset that includes the covariate information of all participants would be ideal for straightforward analysis, but confidentiality restrictions forbid creating one. Current approaches such as aggregate data sharing, distributed regression, meta-analysis and score-based methods can have important limitations.
Keywords: Conditional logistic regression; Data privacy; Distributed data network; Matched case-control design; Specimen pooling
Year: 2017 PMID: 28882105 PMCID: PMC5590217 DOI: 10.1186/s12874-017-0419-0
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
Fig. 1 Models of horizontally and vertically partitioned data, adapted from Karr, et al. (2007). Each row is for one subject, and columns are for covariates. (a) Horizontally partitioned data. Data subjects are partitioned among database owners or nodes. (b) Vertically partitioned data. Covariates are partitioned among nodes.
Fig. 2 Schematic of distributed data. The analytical center is indicated by a dashed red square and the nodes are indicated by solid blue circles. The arrows indicate the flow of information: the dashed red arrows represent the flow of instructions from the center to the nodes, and the solid blue arrows represent the flow of aggregate data from the nodes to the center. Since the aggregate data do not reveal individual-level covariate combinations, and the center does not own microdata, the information flow preserves data confidentiality.
Fig. 3 Schematic of the pooling strategy for a 1:3 matched case-control study with three nodes, numbered 1–3, and 10 matched sets distributed over these nodes. For each matched set, the case is indicated by a filled red circle and the three matched controls by shaded green circles. A line joins the case and controls in each matched stratum. Boxes represent the nodes. In the table on the right-hand side, the first column gives the node number and the second column gives the matched set number. The 3rd column, titled “Current”, gives the pool id if the current strategy of Saha-Chaudhuri and Weinberg (2017) is applied directly. The 4th and 5th columns give the pool ids for two variants of the proposed within-node pooling strategy. In “New - 1” (4th column), pool size g = 2 is used. The first two pools are formed within node 1, the third pool within node 2 and the fourth pool within node 3. Since g = 2 is used and nodes 2 and 3 each have an odd number of matched sets, within-node pooling excludes one matched set from each of nodes 2 and 3. In Scheme 2 (5th column, titled “New - 2”), pool sizes g = 2 and 3 are used. The first two pools are formed within node 1 (g = 2), the third pool within node 2 (g = 3) and the fourth pool within node 3 (g = 3). Consequently, all participant data are used for analysis. Secure summation is not required for within-node stratified pooling, as only the aggregate covariate levels are released from the nodes.
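The two within-node variants described in the Fig. 3 caption can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names, the `sizes_for_count` callback, and the allocation of matched sets 1–4 to node 1, 5–7 to node 2 and 8–10 to node 3 are assumptions chosen to be consistent with the caption (two pools in node 1, odd counts in nodes 2 and 3).

```python
from collections import defaultdict


def pool_within_nodes(node_of_set, sizes_for_count):
    """Assign matched sets to pools without ever mixing nodes.

    node_of_set: dict {matched_set_id: node_id}
    sizes_for_count: hypothetical callback mapping a node's number of
        matched sets to the list of pool sizes to form in that node.
    Returns dict {matched_set_id: pool_id}; None marks an excluded set.
    """
    by_node = defaultdict(list)
    for set_id in sorted(node_of_set):
        by_node[node_of_set[set_id]].append(set_id)

    pool_of_set = {set_id: None for set_id in node_of_set}
    next_pool = 1
    for node in sorted(by_node):
        sets = by_node[node]
        start = 0
        for g in sizes_for_count(len(sets)):
            for set_id in sets[start:start + g]:
                pool_of_set[set_id] = next_pool
            start += g
            next_pool += 1
    return pool_of_set


def fixed_size(g):
    """Variant "New - 1": pools of size g only; leftover sets are excluded."""
    return lambda n: [g] * (n // g)


def sizes_two_or_three(n):
    """Variant "New - 2": pools of size 2, with one pool of size 3 if n is odd."""
    if n < 2:
        return []  # a single matched set cannot be pooled
    return [2] * (n // 2 - 1) + [2 + n % 2]


# Assumed layout: 10 matched sets over 3 nodes (4, 3 and 3 sets).
node_of_set = {s: 1 for s in range(1, 5)}
node_of_set.update({s: 2 for s in range(5, 8)})
node_of_set.update({s: 3 for s in range(8, 11)})

new_1 = pool_within_nodes(node_of_set, fixed_size(2))
new_2 = pool_within_nodes(node_of_set, sizes_two_or_three)
excluded_new_1 = sorted(s for s, p in new_1.items() if p is None)
excluded_new_2 = sorted(s for s, p in new_2.items() if p is None)
```

Under this assumed layout, the fixed pool size drops one matched set in each of nodes 2 and 3, while allowing pool sizes 2 and 3 keeps every matched set in the analysis.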
Pooled data passed from a node to the Analytical Center. We assume a 1:1 matched design. Here V1 i(1) (V2 i(1)) denotes the pooled level of variable 1 (variable 2) for the ith case pool and V1 i(0) (V2 i(0)) denotes the pooled level of variable 1 (variable 2) for the ith control pool within the node, i = 1, 2, …, k, where each pool consists of either g cases or g matched controls.
| Pool id | Pooled Covariate V1: Case | Pooled Covariate V1: Control | Pooled Covariate V2: Case | Pooled Covariate V2: Control | Pooled Covariate …: Case | Pooled Covariate …: Control |
|---|---|---|---|---|---|---|
| 1 | V1 1(1) | V1 1(0) | V2 1(1) | V2 1(0) | ⋮ | ⋮ |
| ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ |
| i | V1 i(1) | V1 i(0) | V2 i(1) | V2 i(0) | ⋮ | ⋮ |
| ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ |
| k | V1 k(1) | V1 k(0) | V2 k(1) | V2 k(0) | ⋮ | ⋮ |
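As a rough illustration of how a node could produce the rows of this table for a 1:1 matched design, the sketch below groups g consecutive matched pairs into one pool and aggregates each covariate separately for cases and for controls. It assumes that the "pooled level" is the within-pool sum of a covariate; the function name `pooled_levels` and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np


def pooled_levels(case_x, control_x, g):
    """Aggregate covariates over pools of g matched pairs within one node.

    case_x, control_x: arrays of shape (n_pairs, n_covariates); row j holds
        the case (or the matched control) of pair j.
    Returns (case_pools, control_pools), each of shape (k, n_covariates),
    where k = n_pairs // g. Leftover pairs are dropped. The pooled level is
    assumed to be the within-pool sum.
    """
    case_x = np.asarray(case_x, dtype=float)
    control_x = np.asarray(control_x, dtype=float)
    k = case_x.shape[0] // g
    p = case_x.shape[1]
    case_pools = case_x[:k * g].reshape(k, g, p).sum(axis=1)
    control_pools = control_x[:k * g].reshape(k, g, p).sum(axis=1)
    return case_pools, control_pools


# Toy node with 4 matched pairs and 2 covariates, pooled with g = 2.
case_x = [[1, 0], [2, 1], [3, 1], [4, 0]]
control_x = [[0, 0], [1, 0], [2, 1], [2, 1]]
case_pools, control_pools = pooled_levels(case_x, control_x, g=2)
```

Only `case_pools` and `control_pools` would leave the node; the individual rows of `case_x` and `control_x` stay behind.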
Comparison of parameter estimates between standard analysis and pooled analysis with different pool sizes for a binary outcome with a matched design. See the Results section for the detailed simulation setting. The model-based standard error (ModelSE), the Monte Carlo standard error (EmpSE) and coverage (nominal: 0.95) are also shown.
| Parameters | Unpooled | Pooled | Pooled | Pooled |
|---|---|---|---|---|
| | | | | |
| Estimate | 0.301 | 0.303 | 0.307 | 0.336 |
| EmpSE | 0.014 | 0.022 | 0.028 | 0.077 |
| ModelSE | 0.014 | 0.022 | 0.029 | 0.060 |
| Coverage | 0.958 | 0.956 | 0.964 | 0.964 |
| | | | | |
| Estimate | 0.202 | 0.204 | 0.207 | 0.234 |
| EmpSE | 0.077 | 0.101 | 0.128 | 0.219 |
| ModelSE | 0.076 | 0.100 | 0.122 | 0.199 |
| Coverage | 0.954 | 0.956 | 0.944 | 0.952 |
| | | | | |
| Estimate | 0.149 | 0.150 | 0.150 | 0.165 |
| EmpSE | 0.037 | 0.049 | 0.062 | 0.104 |
| ModelSE | 0.037 | 0.049 | 0.060 | 0.098 |
| Coverage | 0.952 | 0.958 | 0.952 | 0.960 |
| | | | | |
| Estimate | 0.088 | 0.088 | 0.091 | 0.104 |
| EmpSE | 0.050 | 0.067 | 0.080 | 0.134 |
| ModelSE | 0.049 | 0.063 | 0.076 | 0.123 |
| Coverage | 0.964 | 0.936 | 0.942 | 0.954 |
| | | | | |
| Estimate | 0.050 | 0.051 | 0.051 | 0.054 |
| EmpSE | 0.013 | 0.018 | 0.023 | 0.039 |
| ModelSE | 0.013 | 0.018 | 0.022 | 0.037 |
| Coverage | 0.954 | 0.944 | 0.952 | 0.944 |
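The four summaries reported for each parameter can be computed from simulation replicates roughly as follows. This is a sketch under the usual definitions, which are assumed here rather than quoted from the paper: Estimate is the mean of the replicate estimates, EmpSE their sample standard deviation, ModelSE the mean of the model-based standard errors, and Coverage the share of Wald 95% intervals that contain the true value. The function name and the three toy replicates are hypothetical.

```python
import numpy as np


def summarize_replicates(estimates, model_ses, true_value, z=1.96):
    """Summarize simulation replicates for one parameter.

    estimates: per-replicate coefficient estimates
    model_ses: per-replicate model-based standard errors
    true_value: the value used to generate the data (hypothetical here)
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(model_ses, dtype=float)
    lower = est - z * se
    upper = est + z * se
    covered = (lower <= true_value) & (true_value <= upper)
    return {
        "Estimate": float(est.mean()),
        "EmpSE": float(est.std(ddof=1)),  # Monte Carlo SD of the estimates
        "ModelSE": float(se.mean()),      # average model-based SE
        "Coverage": float(covered.mean()),
    }


# Three toy replicates, for illustration only.
summary = summarize_replicates(
    estimates=[0.28, 0.30, 0.32],
    model_ses=[0.02, 0.02, 0.02],
    true_value=0.30,
)
```

When ModelSE tracks EmpSE closely, as in most columns of the table, the model-based standard errors are a reasonable guide to the actual sampling variability.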
Comparison between pooled analysis and unpooled analysis of a matched case-control study on obesity. The column marked “Standard CLogit” displays results of a standard conditional logistic regression analysis for the matched case-control study design. The column marked “Pooled CLogit” displays results of a pooled conditional logistic regression analysis with within-stratum pooling and pool size g = 4. The OR and 95% confidence interval (CI) for the covariates are included.
| Variable | Standard CLogit: OR (95% CI) | Pooled CLogit: OR (95% CI) |
|---|---|---|
| DBP | 1.007 (0.999, 1.016) | 1.005 (0.996, 1.014) |
| Can walk or bike to work | 1.349 (1.068, 1.702) | 1.357 (1.055, 1.745) |
| Vigorous Recreational Activity | 1.710 (1.263, 2.315) | 1.955 (1.352, 2.827) |
| Moderate Recreational Activity | 1.289 (1.043, 1.592) | 1.270 (1.016, 1.586) |
| Moderate work activity | 0.853 (0.693, 1.049) | 0.813 (0.653, 1.013) |
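To compare the precision of the two analyses, an odds ratio and its interval can be mapped back to the log-odds scale. The sketch below assumes the reported intervals are Wald intervals, exp(β ± 1.96·SE), which the table does not state explicitly; the function name is hypothetical. It is applied to the “Can walk or bike to work” row.

```python
import math


def log_scale_summary(or_hat, lower, upper, z=1.96):
    """Recover the log-odds coefficient and its SE from an OR and a Wald CI.

    Assumes the interval was built as exp(beta +/- z * se).
    """
    beta = math.log(or_hat)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se


# "Can walk or bike to work", values copied from the table above.
beta_std, se_std = log_scale_summary(1.349, 1.068, 1.702)
beta_pool, se_pool = log_scale_summary(1.357, 1.055, 1.745)

# A ratio above 1 means the pooled analysis has a wider interval on the log scale.
se_ratio = se_pool / se_std
```

For this row the ratio is slightly above 1, so the pooled analysis loses a little precision relative to the standard analysis while giving a similar point estimate.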