Amadou Gaye, Yannick Marcon, Julia Isaeva, Philippe LaFlamme, Andrew Turner, Elinor M Jones, Joel Minion, Andrew W Boyd, Christopher J Newby, Marja-Liisa Nuotio, Rebecca Wilson, Oliver Butters, Barnaby Murtagh, Ipek Demir, Dany Doiron, Lisette Giepmans, Susan E Wallace, Isabelle Budin-Ljøsne, Carsten Oliver Schmidt, Paolo Boffetta, Mathieu Boniol, Maria Bota, Kim W Carter, Nick deKlerk, Chris Dibben, Richard W Francis, Tero Hiekkalinna, Kristian Hveem, Kirsti Kvaløy, Sean Millar, Ivan J Perry, Annette Peters, Catherine M Phillips, Frank Popham, Gillian Raab, Eva Reischl, Nuala Sheehan, Melanie Waldenberger, Markus Perola, Edwin van den Heuvel, John Macleod, Bartha M Knoppers, Ronald P Stolk, Isabel Fortier, Jennifer R Harris, Bruce H R Woffenbuttel, Madeleine J Murtagh, Vincent Ferretti, Paul R Burton.
Abstract
BACKGROUND: Research in modern biomedicine and social science requires sample sizes so large that they can often only be achieved through a pooled co-analysis of data from several studies. But the pooling of information from individuals in a central database that may be queried by researchers raises important ethico-legal questions and can be controversial. In the UK this has been highlighted by recent debate and controversy relating to the UK's proposed 'care.data' initiative, and these issues reflect important societal and professional concerns about privacy, confidentiality and intellectual property. DataSHIELD provides a novel technological solution that can circumvent some of the most basic challenges in facilitating the access of researchers and other healthcare professionals to individual-level data.
Keywords: DataSHIELD; ELSI; bioinformatics; confidentiality; disclosure; distributed computing; intellectual property; pooled analysis; privacy
Year: 2014 PMID: 25261970 PMCID: PMC4276062 DOI: 10.1093/ije/dyu188
Source DB: PubMed Journal: Int J Epidemiol ISSN: 0300-5771 Impact factor: 7.196
Figure 1. Typical DataSHIELD setting for a pooled individual-level analysis.
Figure 2. Overview of the IT infrastructure required for a DataSHIELD process. The settings are the same in all DCs, so only one is highlighted in this figure.
Figure 3. Graphical view of pooled data (a), horizontally partitioned data (b) and vertically partitioned data (c).
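
The two partitioning schemes named in Figure 3 can be illustrated with a small sketch. It is only an illustration under stated assumptions: the records, the column names (`age`, `bmi`, `hdl`) and the helper functions `horizontal_partition` and `vertical_partition` are hypothetical and are not part of DataSHIELD. Horizontal partitioning gives each site the same variables on different individuals; vertical partitioning gives each archive different variables on the same individuals, which is why a shared key is needed for record linkage.

```python
# Hypothetical pooled data set: one dict per individual (made-up values).
records = [
    {"id": 1, "age": 34, "bmi": 22.1, "hdl": 1.4},
    {"id": 2, "age": 51, "bmi": 31.0, "hdl": 1.1},
    {"id": 3, "age": 47, "bmi": 27.5, "hdl": 1.6},
    {"id": 4, "age": 62, "bmi": 24.9, "hdl": 1.2},
]


def horizontal_partition(rows, n_sites):
    """Split by individual: every site holds all variables for some people."""
    return [rows[i::n_sites] for i in range(n_sites)]


def vertical_partition(rows, column_groups, key="id"):
    """Split by variable: every archive holds some variables for all people.

    The key column is kept in each piece so that records can be linked.
    """
    return [
        [{key: r[key], **{c: r[c] for c in cols}} for r in rows]
        for cols in column_groups
    ]


if __name__ == "__main__":
    for site in horizontal_partition(records, 2):
        print("site:", site)
    for archive in vertical_partition(records, [["age"], ["bmi", "hdl"]]):
        print("archive:", archive)
```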
Table 1. Detailed explanations of the steps in the DataSHIELD process
| Step | Explanation | Input data | Output data | Output location | Visibility |
|---|---|---|---|---|---|
| (0) Preliminary and prerequisite step | Strictly speaking, this step is not part of a DataSHIELD analysis process; it is, however, a prerequisite for any valid analysis that pools multiple data sets. Each contributing study (e.g. | All variables held in | Variables required for combined analysis. No data that may potentially be directly identifying [e.g. a full UK postcode, a full date of birth, or an ID that is equivalent to an ID available elsewhere (e.g. a national health system number)] unless such a variable is essential to the required analysis | A new analysis SQL database linked to an Opal server: | Invisible outside |
| (1) Login to collaborating servers | The user logs into the collaborating servers through secured web services, using the credentials provided to them. This authentication ensures that only users authorized by the access body (e.g. an analysis access panel put in place by the consortium) can actually carry out an analysis | A command with a specific public/private key pair | No data are returned, the connection is established | Not applicable | No individual-level server-side data are ever visible to the user after login |
| (2 and 3) Request and transfer of the shared data to the analysis zone | (2) The user sends a command to request the specific data to analyse. This could be all the variables or specific variables stored in | All or some of the variables in | A data frame (an R data structure) with all variables or part of the variables in | Individual-level data invisible outside | |
| (4) Starting the analysis (i.e. sending command to fit a GLM model) | The researcher sitting at the analysis computer (AC) sends an R command to every study telling it to fit one iteration of a generalized linear modelling fitting procedure (the iterative reweighted least-squares algorithm), including first-‘guessed’ estimates at what the ultimate set of regression coefficients will be | A short set of instructions completely unrelated to any data in any study which contains the model to fit and an arbitrary string of numbers representing the first-guessed coefficient estimates | Instructions about the model to fit and the coefficient estimates are received by | The model to fit and the coefficient estimates are visible outside | |
| (5) Carrying out the analysis locally (i.e. enact one iteration of a GLM fit) | Each data computer responds to the instructions sent from the AC in step 4 by running a single iteration of a GLM fit. This fitting is carried out in | Instructions as in step 4 | A score vector (e.g. | ||
| (6) Summary statistics returned to the analysis computer | DC1 transmits | Analysis computer | |||
| (7) Combining the summary statistics returned by the DCs | The analysis computer adds up the score vectors and information matrices from all DCs, divides the first sum by the latter (technically, a matrix multiplication) and uses the result to update the coefficient estimates using the conventional updating algorithm called the Iterative Reweighted Least Squares (IRLS) algorithm | Score vectors and information matrices from all DCs | New coefficient estimates | Analysis computer | All visible to outside world, but carry no sensitive information |
| (8) Repeat step 4 with updated coefficients | The same process as in step 4 re-starts; the analysis computer commands the DCs to fit the same model with the updated coefficient estimates | Same as per step (4) | Same as per step (4) | Same as per step (4) | Same as per step (4) |
| Keep repeating steps 5-8 until the model is almost unchanged (judged by appropriate convergence criterion) between iterations—the model is then said to have converged | |||||
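
Steps 4 to 8 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not DataSHIELD's implementation: it is written in Python rather than R, it assumes a logistic regression with an intercept and a single covariate, the two study data sets are made up, and the function names (`local_summaries`, `combine_and_update`, `fit`) are hypothetical. In a real deployment `local_summaries` would run at each data computer, so only the score vector and information matrix would leave the study.

```python
import math


def local_summaries(rows, beta):
    """Data computer (step 5): score vector and information matrix for one
    IRLS iteration, computed from individual-level (x, y) rows."""
    score = [0.0, 0.0]
    info = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in rows:
        design = (1.0, x)                       # intercept and covariate
        eta = beta[0] + beta[1] * x             # linear predictor
        mu = 1.0 / (1.0 + math.exp(-eta))       # fitted probability
        weight = mu * (1.0 - mu)                # IRLS weight
        for j in range(2):
            score[j] += design[j] * (y - mu)
            for k in range(2):
                info[j][k] += design[j] * weight * design[k]
    return score, info


def combine_and_update(summaries, beta):
    """Analysis computer (step 7): add up the summaries from all studies and
    update the coefficients with inverse(information) times score."""
    s = [sum(sc[j] for sc, _ in summaries) for j in range(2)]
    m = [[sum(inf[j][k] for _, inf in summaries) for k in range(2)]
         for j in range(2)]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    step = [(m[1][1] * s[0] - m[0][1] * s[1]) / det,
            (m[0][0] * s[1] - m[1][0] * s[0]) / det]
    return [beta[0] + step[0], beta[1] + step[1]]


def fit(studies, tol=1e-10, max_iter=50):
    """Steps 4 to 8: iterate until the coefficients stop changing."""
    beta = [0.0, 0.0]                           # first-guessed coefficients
    for _ in range(max_iter):
        summaries = [local_summaries(rows, beta) for rows in studies]
        new_beta = combine_and_update(summaries, beta)
        if max(abs(a - b) for a, b in zip(new_beta, beta)) < tol:
            return new_beta
        beta = new_beta
    return beta


# Made-up (x, y) data for two hypothetical studies.
study_a = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 0)]
study_b = [(1.0, 1), (2.0, 0), (3.0, 1), (4.0, 1)]

if __name__ == "__main__":
    print("distributed fit:", fit([study_a, study_b]))
    print("pooled fit:     ", fit([study_a + study_b]))
```

Because the score vectors and information matrices are sums over individuals, adding the per-study summaries gives the same update as fitting the physically pooled data, so the two fits agree up to floating-point rounding.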
Figure 4. Overview of a DataSHIELD process. Each of the 8 steps and the terms used to refer to the key components and data exchanged between AC and DCs are detailed in Table 1.
Healthy Obese Project collaborating studies and shared number of participants at the time of this work
| Study name | Host institution | Location | Participants |
|---|---|---|---|
| Cooperative Health Research in South Tyrol Study (CHRIS) | European Academy of Bolzano | Bolzano, Italy | 1583 |
| Cooperative Health Research in the Region of Augsburg (KORA) | Helmholtz Center Munich | Augsburg, Germany | 3080 |
| LifeLines Cohort Study (LifeLines) | University Medical Center Groningen | Groningen, The Netherlands | 94516 |
| Mitchelstown Study Population (Mitchelstown) | Living Health Clinic in Mitchelstown | Cork, Ireland | 2047 |
| Microisolates in South Tyrol Study (MICROS) | European Academy of Bolzano | Bolzano, Italy | 1060 |
| National Child Development Study (NCDS) | University of Leicester | Leicester, UK | 7210 |
| FINRISK 2007 Study (FINRISK 2007) | National Institute for Health and Welfare | Helsinki, Finland | 5024 |
| Nord-Trøndelag Health Study (HUNT) | Norwegian University of Science and Technology | Levanger, Norway | 78968 |
| Prevention of REnal and Vascular ENd-stage Disease study (PREVEND) | University Medical Center Groningen | Groningen, The Netherlands | 8592 |
| Study of Health in Pomerania (SHIP)* | University Medicine of Greifswald | Greifswald, Germany | 4308 |

*SHIP joined HOP after the analysis reported in this paper, so the text and figures refer to 9, not 10, studies.
Figure 5. For the Healthy Obese Project, communications between AC and DCs were channelled through a trusted portal.
Figure 6. Histogram plots of the variable ‘LAB_HDL’ for each study (A) and for the pooled data (B).
Comparison of the critical outputs of the same GLM model fitted using DataSHIELD (in light shading) and using standard R with the physically pooled data (in dark shading)
DataSHIELD-derived estimates are rounded to the same number of decimal places as the standard R estimates. To avoid confusion, it should be noted that at a very early stage of the HOP analysis, the name of the categorical BMI variable was misspelt as ‘… CATEGORIAL …’. As that misspelling is now entrenched in all of the harmonized data sets, we chose not to correct it for this paper.
SE, standard error.
Figure 7. Illustration of DataSHIELD set-up for the analyses of: (a) horizontally partitioned data (similar data, different individuals) held in GP databases and/or data centres. (**Single-site DataSHIELD); and (b) vertically partitioned data requiring record linkage between different types of data on the same individuals held in a variety of data archives.