
dsSurvival: Privacy preserving survival models for federated individual patient meta-analysis in DataSHIELD.

Soumya Banerjee, Ghislain N Sofack, Daniela Zöller, Tom R P Bishop, Thodoris Papakonstantinou, Demetris Avraam, Paul Burton.

Abstract

OBJECTIVE: Achieving sufficient statistical power in a survival analysis usually requires large amounts of data from different sites. The sensitivity of individual-level data, together with ethical and practical considerations regarding data sharing across institutions, can make it difficult to achieve this added power. Meta-analysis techniques that combine analysis results from each site are one solution, but an analytic workflow in which each study undertakes its own local analysis hinders exploration. We therefore implemented a federated meta-analysis approach for survival models in DataSHIELD, in which only anonymous aggregated data are shared across institutions while exploratory, interactive modelling remains possible. The aim is to provide a framework for performing meta-analysis of Cox regression models across institutions without manual analysis steps for the data providers.
RESULTS: We introduce a package (dsSurvival) which allows privacy preserving meta-analysis of survival models, including the calculation of hazard ratios. Our tool can be of great use in biomedical research where there is a need for building survival models and there are privacy concerns about sharing data.
© 2022. The Author(s).

Keywords:  Federated analysis; Meta-analysis; Survival analysis

Year:  2022        PMID: 35659747      PMCID: PMC9166323          DOI: 10.1186/s13104-022-06085-1

Source DB:  PubMed          Journal:  BMC Res Notes        ISSN: 1756-0500


Introduction

Survival models are widely used in biomedical research for analyzing survival data [1]. These models help researchers compare the effect of exposures on mortality or other outcomes of interest [2]. The Cox proportional hazards model [3] is one of the most popular survival analysis models and was primarily developed to determine the importance of predictors in survival, by using covariate information to make individual predictions [4]. Achieving sufficient power in survival analysis usually requires large amounts of data from several sites or institutions, because the number of cases at a single site is often rather small, making statistical analysis challenging. Multi-site analysis across studies with different population characteristics also helps us understand how diseases affect different populations and what it is about these populations that causes these differences. However, owing to the sensitivity of individual-level biomedical data, ethical and practical considerations related to data transmission, and institutional policies, it may be difficult to share individual-level data [5]. In consortia, this issue is often addressed by manual analysis at each site, followed by a manual meta-analysis of the results from the individual sites. This process is very time-consuming and error-prone, making exploratory analysis (e.g., for understanding different effect patterns in each site) impractical. As an alternative, the DataSHIELD framework can be used. DataSHIELD enables the remote and privacy preserving analysis of sensitive research data [6] and is based on the programming language R [7, 8]. Specific aggregated, anonymous analysis results are requested from each site and then combined in a central analysis server. The requirement is that the analysis be privacy preserving and be conducted across globally distributed cohorts.
We have implemented a meta-analysis approach based on the Cox model in DataSHIELD using individual patient data that are distributed across several sites, without moving those data to a central site, i.e., the individual-level data remain within each site and only non-disclosive aggregated data are shared. Our software package for DataSHIELD allows building of survival models and analyzing results in a federated privacy preserving fashion. Remote federated meta-analysis allows the analysis to come to the data and enables multiple research groups to collate their data [7, 8]. This is an alternative to literature based meta-analysis since study variables and outcomes can be harmonised [9]. Our package offers considerable advantages over: (1) literature based meta-analysis, which suffers from publication bias and restricts the analytic endpoints that can be used; (2) central pooling of data, which poses important governance challenges and can engender privacy risks; and (3) asking researchers in each location to do local analyses based on a shared analysis plan, which all too often demands numerous emails, with repeated reminders, to disseminate analytic protocols and return results for meta-analysis, and which is typically time-consuming and error-prone. Our tool can be of great use in domains where there is a need for building survival models and there are privacy concerns about sharing data.

Main text

Basics of survival analysis

Survival analysis can be used to analyze clinical data if there are records of patient mortality and time-to-event data. The key quantity is the survival function:

$$S(t) = \Pr(T > t)$$

where $S(t)$ is the survival function, $t$ is the current time, and $T$ is a random variable denoting the time of death. $\Pr(T > t)$ is the probability that the time of death is greater than $t$, i.e., the probability of surviving until time $t$. The instantaneous hazard $h(t)$ is the probability of death occurring within the time period $[t, t + \delta t]$ given survival until time $t$, expressed per unit time in the limit $\delta t \to 0$. It is related to the survival function as follows:

$$h(t) = \lim_{\delta t \to 0} \frac{\Pr(t \le T < t + \delta t \mid T \ge t)}{\delta t} = -\frac{S'(t)}{S(t)}$$

The proportional hazards model assumes that covariates act multiplicatively on the hazard. This is modelled as follows using the hazard function $h_i(t)$ of the $i$th subject:

$$h_i(t) = h_0(t) \exp\left(\sum_{j} \beta_j x_{ij}\right)$$

where $h_0(t)$ is called the baseline hazard and $h_i(t)$ is the hazard at time $t$. $\beta_j$ denotes the parameters and $x_{ij}$ is the $j$th covariate for the $i$th subject. Each $\beta_j$ is a log hazard ratio, and we aim to meta-analyze these log hazard ratios.
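As a short worked illustration of why the model parameters are log hazard ratios, consider a single covariate $x$ with coefficient $\beta$, a baseline hazard $h_0(t)$, and hazard $h(t \mid x) = h_0(t)\exp(\beta x)$. The value $\beta = 0.04$ below is a made-up number chosen only for the arithmetic, not an estimate from this work.

```latex
% Ratio of hazards for two subjects whose covariate differs by one unit:
\frac{h(t \mid x + 1)}{h(t \mid x)}
  = \frac{h_0(t)\exp\{\beta (x + 1)\}}{h_0(t)\exp\{\beta x\}}
  = \exp(\beta)

% The baseline hazard cancels, so the ratio does not depend on t.
% With the hypothetical value \beta = 0.04 (e.g. per year of age):
\exp(0.04) \approx 1.041
\qquad\text{and, for a 10-unit difference,}\qquad
\exp(10 \times 0.04) = \exp(0.4) \approx 1.492
```

Because the baseline hazard cancels, hazard ratios can be estimated and compared across sites without any site revealing its baseline hazard.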

Implementation

DataSHIELD operates on a distributed client-server architecture that only allows restricted computation (Fig. 1). There are multiple servers located on separate sites and there is a single analysis client. Assign functions in DataSHIELD perform computation and ultimately create objects that persist on the servers and are not shared with the analysis client. These server-side objects can then be used for subsequent computations. Aggregate functions in DataSHIELD perform computation on each site, check for disclosure risks, and send aggregated results back to the client. The results do not persist on the servers, but can be saved on the client.
Fig. 1

Client-server architecture of DataSHIELD. The diagram shows four study sites/servers (DC), each having data stored in the Original.DB. The analyst (client) sends commands from the analysis computer (AC) to each study site to request the specific data (Assigned.Data) to be analyzed. This could be all the variables or specific variables stored in Analysis.DB. R commands are also sent from the analysis computer to every study telling it to create survival objects and fit the Cox proportional hazards model. Each site responds to the instructions by creating the survival object and fitting the model. This fitting is carried out in the R environment of each study. The coefficient matrices, standard errors, and hazard ratios from each site are then pooled and meta-analyzed using fixed optimization methods, and only non-disclosive statistics are returned to the analyst.

The communication between the client and server for the survival models is shown in Fig. 2 for an assign function [ds.Surv() to create survival objects on the servers] and an aggregate function [ds.coxphSLMA() to perform a meta-analysis of Cox regression models]. This shows an asynchronous mode of operation in DataSHIELD where multiple parties (sites) perform secure computation. The client-side package is called dsSurvivalClient and the server-side package is called dsSurvival.
Fig. 2

Architecture of client and server side functions for building survival models in dsSurvival. Left panel: an assign function for creating a server-side survival object using ds.Surv(). Right panel: an aggregate function for a Cox proportional hazards model using ds.coxphSLMA()

The server-side package dsSurvival 1.0.0 contains the functions SurvDS() and coxphSLMADS(). These functions are configured to reside in modified R environments located behind a firewall at each institution and process the individual-level data at each distinct repository. dsSurvivalClient contains the functions ds.Surv() and ds.coxphSLMA(). These functions reside in the conventional R environment of the analyst. The ds.Surv() (assign) function calls the server-side function SurvDS() to assign survival objects in each site. These objects can then be used as the response variable in the ds.coxphSLMA() (aggregate) function. The ds.coxphSLMA() function calls and controls the corresponding server-side function coxphSLMADS() and performs the regression analysis at the different sites. These functions implement study-level meta-analysis (SLMA). The estimates from each site are combined and then pooled using fixed effects or random effects meta-analysis.
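The pooling step of a study-level meta-analysis can be sketched in base R. This is a minimal illustration and not the dsSurvival implementation: the function names pool_fixed() and pool_random() are hypothetical, the per-site numbers are made up, and the random effects variant assumes the DerSimonian-Laird estimator, which the text above does not specify.

```r
# Hypothetical per-site estimates for one covariate (made-up numbers).
log_hr <- c(0.04, 0.05, 0.03)      # log hazard ratio reported by each site
se     <- c(0.010, 0.020, 0.015)   # corresponding standard errors

# Fixed effects pooling: inverse-variance weighted average.
pool_fixed <- function(est, se) {
  w <- 1 / se^2
  pooled <- sum(w * est) / sum(w)
  pooled_se <- sqrt(1 / sum(w))
  list(estimate = pooled, se = pooled_se,
       ci = pooled + c(-1, 1) * qnorm(0.975) * pooled_se)
}

# Random effects pooling, assuming the DerSimonian-Laird estimator
# for the between-site variance tau2.
pool_random <- function(est, se) {
  w <- 1 / se^2
  k <- length(est)
  fixed <- sum(w * est) / sum(w)
  q <- sum(w * (est - fixed)^2)              # Cochran's Q
  scale <- sum(w) - sum(w^2) / sum(w)
  tau2 <- max(0, (q - (k - 1)) / scale)      # truncated at zero
  w_star <- 1 / (se^2 + tau2)
  pooled <- sum(w_star * est) / sum(w_star)
  pooled_se <- sqrt(1 / sum(w_star))
  list(estimate = pooled, se = pooled_se, tau2 = tau2,
       ci = pooled + c(-1, 1) * qnorm(0.975) * pooled_se)
}

fixed  <- pool_fixed(log_hr, se)
random <- pool_random(log_hr, se)

# Report the pooled hazard ratio on the natural scale.
print(exp(fixed$estimate))
print(exp(random$estimate))
```

Only the per-site coefficients and standard errors enter this calculation, which is why the pooling can run on the client without access to individual-level data.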

Computational pipeline and use case

We outline the development and code for implementing survival models (Cox regression) and meta-analysis of hazard ratios in our package (dsSurvival). A tutorial in bookdown format is available here: https://neelsoumya.github.io/dsSurvivalbookdown/. In the following, we demonstrate the computational steps using synthetic data. The first step is using DataSHIELD to connect to the server and load the survival data. We assume that the reader is familiar with these details. There are 3 data sets that are held on the same server but can be considered to be on separate servers/sites. The variable EVENT holds the event information and the variables STARTTIME and ENDTIME hold the time information. There is also age and gender information in variables named age and female, respectively. We will look at how age and gender affect survival time and then meta-analyze the hazard ratios. For details on how to set up the variables, please see the bookdown above. The log hazard ratios and their standard errors from each study can be found after running ds.coxphSLMA(). The hazard ratios can then be meta-analyzed using the metafor package [10]. Fig. 3 shows an example forest plot with meta-analysed hazard ratios. The plot shows the log hazard ratios corresponding to age in the survival model.
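The client-side meta-analysis step might look roughly like the sketch below. It assumes the per-study log hazard ratios and standard errors have already been copied into a plain data frame; the numbers and the column names are hypothetical, and how to extract them from the object returned by ds.coxphSLMA() is not shown because its structure is not described here. The metafor part only runs if that package is installed.

```r
# Hypothetical per-study results for the age coefficient (made-up numbers).
site_results <- data.frame(
  study  = c("study1", "study2", "study3"),
  log_hr = c(0.041, 0.038, 0.045),
  se     = c(0.010, 0.012, 0.011)
)

# Hazard ratios and 95% confidence intervals on the natural scale.
z <- qnorm(0.975)
site_results$hr       <- exp(site_results$log_hr)
site_results$hr_lower <- exp(site_results$log_hr - z * site_results$se)
site_results$hr_upper <- exp(site_results$log_hr + z * site_results$se)
print(site_results)

# Pool on the log scale with metafor, if it is available.
if (requireNamespace("metafor", quietly = TRUE)) {
  fit <- metafor::rma(yi = log_hr, sei = se, data = site_results,
                      method = "REML")
  print(summary(fit))
  # Forest plot with the axis labels back-transformed to hazard ratios.
  metafor::forest(fit, slab = site_results$study, atransf = exp)
}
```

Pooling is done on the log scale because the log hazard ratio, not the hazard ratio itself, is approximately normally distributed.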
Fig. 3

A plot showing the meta-analyzed hazard ratios generated from dsSurvival. A Cox proportional hazards model was fit to synthetic data. The hazard ratios correspond to age in a survival model

There are two options to generate the survival object. The analyst can generate it separately or inline [for example, by the following command: dsSurvivalClient::ds.coxph.SLMA(formula = 'survival::Surv(time=SURVTIME,event=EVENT) ~ D$age+D$female')]. If a survival object is generated separately, it is stored on the server and can be used later in an assign function [ds.coxphSLMAassign()]. This function stores the fitted survival model on the server, where it can be used later for diagnostics.

Preserving privacy and disclosure checks

Disclosure checks are an integral part of DataSHIELD and dsSurvival. dsSurvival leverages the DataSHIELD framework to ensure that multiple parties perform secure computation and only the relevant aggregated statistical details are shared. We disallow any Cox model in which the number of covariate terms is greater than a fraction (default set to 20%) of the number of data points. The number of data points is the number of entries (for all patients) in the survival data. This fraction can also be changed by the data custodian or administrator in DataSHIELD. We also deny any access to the baseline hazard function.
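The covariate-count rule can be illustrated with a few lines of base R. This is only a sketch of the rule as stated above, not the dsSurvival source: the helper names count_terms() and model_allowed() are hypothetical, and counting terms via the formula's term labels is an assumption about how "covariate terms" are defined.

```r
# Count the covariate terms on the right-hand side of a model formula.
# terms() only parses the formula, so Surv() does not need to be available.
count_terms <- function(formula) {
  length(attr(stats::terms(formula), "term.labels"))
}

# A model is allowed only if the number of terms does not exceed
# max_fraction (default 20%) of the number of data points.
model_allowed <- function(formula, n_points, max_fraction = 0.2) {
  count_terms(formula) <= max_fraction * n_points
}

# 2 terms against a limit of 0.2 * 500 = 100: allowed.
print(model_allowed(Surv(time, event) ~ age + female, n_points = 500))

# 3 terms against a limit of 0.2 * 10 = 2: blocked.
print(model_allowed(Surv(time, event) ~ age + female + bmi, n_points = 10))
```

A check of this kind guards against saturated models, where having nearly as many parameters as observations could let fitted coefficients reveal individual records.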

Diagnostics for Cox proportional hazards models

We generate diagnostics for Cox models using the function dsSurvivalClient::ds.cox.zphSLMA(). These diagnostics allow an analyst to assess whether the proportional hazards assumption of the Cox model is satisfied. If the p-value returned by dsSurvivalClient::ds.cox.zphSLMA() for a covariate is greater than 0.05, the test provides no evidence against the proportional hazards assumption for that covariate. If the proportional hazards assumption is violated, the analyst may wish to modify the model. Modifications may include introducing strata or using time-dependent covariates.
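The decision rule can be sketched locally as follows. The helper flag_ph_violations() and the example p-values are hypothetical. The second part uses survival::cox.zph() on the lung data set shipped with the survival package as a non-federated stand-in, on the assumption that the federated function reports per-covariate p-values of the same kind; it is skipped if survival is not installed.

```r
# Return the names of covariates whose test p-value is at or below alpha,
# i.e. those for which proportional hazards looks doubtful.
flag_ph_violations <- function(p_values, alpha = 0.05) {
  names(p_values)[p_values <= alpha]
}

# Hypothetical p-values, as might be reported for one study.
p_example <- c(age = 0.62, female = 0.01)
print(flag_ph_violations(p_example))   # "female"

# Local, non-federated analogue using the survival package.
if (requireNamespace("survival", quietly = TRUE)) {
  fit <- survival::coxph(survival::Surv(time, status) ~ age + sex,
                         data = survival::lung)
  zph <- survival::cox.zph(fit)
  p_local <- zph$table[, "p"]          # one p-value per covariate plus GLOBAL
  print(p_local)
  print(flag_ph_violations(p_local))
}
```

In a federated setting the rule would be applied to each study separately, since the assumption may hold at one site and fail at another.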

Discussion and conclusion

dsSurvival is a DataSHIELD package for privacy preserving meta-analysis of survival data distributed across different sites. dsSurvival also performs federated calculation of hazard ratios. Its implementation relies exclusively on the distributed algorithm of the DataSHIELD environment. DataSHIELD facilitates important research, particularly amongst institutions that are not allowed to transmit patient-level data to an outside server. Previously, building survival models in DataSHIELD involved using approximations such as piecewise exponential models. This requires defining time buckets, which is an additional burden on the researcher, and a lack of familiarity with the approach makes people less trusting of the results. Previous work has also looked at reducing the dimensions of a survival model and sharing the reduced feature space model amongst multiple parties [11]. Survival analysis is also possible in DataSHIELD using dsSwissKnife [12]. However, our package offers advantages such as storing the model on the server side, diagnostics, and integration with client-side meta-analysis, and we plan to add more functionality such as survival curves. We have released an R package for privacy preserving survival analysis in DataSHIELD. Our tool can be of great use in domains where there is a need for building survival models and there are privacy concerns about sharing data. We hope this suite of tools and tutorials will serve as a guideline on how to use survival analysis in a federated environment.

Limitations

Our approach implements study-level meta-analysis. This is a computationally faster approach but is also a limitation, especially if the units of meta-analysis are centres within a study. This may reduce the number of events per centre, and the normality approximations implicit in two-stage meta-analysis may be violated. In the future we will implement functionality for iteratively fitting a single model across all studies. We will also develop plotting of privacy preserving survival curves and the ability to have time-dependent covariates in survival models. Our package also does not return Schoenfeld or martingale residuals (due to privacy concerns), which are used as diagnostics for survival models. Finally, in the future we will apply our package to real-world data and solve any practical issues that arise.

Review 1.  Data Sharing For Precision Medicine: Policy Lessons And Future Directions.

Authors:  Alessandro Blasimme; Marta Fadda; Manuel Schneider; Effy Vayena
Journal:  Health Aff (Millwood)       Date:  2018-05       Impact factor: 6.301

2.  Time-dependent Cox regression: serial measurement of the cardiovascular biomarker proadrenomedullin improves survival prediction in patients with lower respiratory tract infection.

Authors:  Oliver Hartmann; Philipp Schuetz; Werner C Albrich; Stefan D Anker; Beat Mueller; Thorsten Schmidt
Journal:  Int J Cardiol       Date:  2012-09-24       Impact factor: 4.164

3.  DataSHIELD: taking the analysis to the data, not the data to the analysis.

Authors:  Amadou Gaye; Yannick Marcon; Julia Isaeva; Philippe LaFlamme; Andrew Turner; Elinor M Jones; Joel Minion; Andrew W Boyd; Christopher J Newby; Marja-Liisa Nuotio; Rebecca Wilson; Oliver Butters; Barnaby Murtagh; Ipek Demir; Dany Doiron; Lisette Giepmans; Susan E Wallace; Isabelle Budin-Ljøsne; Carsten Oliver Schmidt; Paolo Boffetta; Mathieu Boniol; Maria Bota; Kim W Carter; Nick deKlerk; Chris Dibben; Richard W Francis; Tero Hiekkalinna; Kristian Hveem; Kirsti Kvaløy; Sean Millar; Ivan J Perry; Annette Peters; Catherine M Phillips; Frank Popham; Gillian Raab; Eva Reischl; Nuala Sheehan; Melanie Waldenberger; Markus Perola; Edwin van den Heuvel; John Macleod; Bartha M Knoppers; Ronald P Stolk; Isabel Fortier; Jennifer R Harris; Bruce H R Woffenbuttel; Madeleine J Murtagh; Vincent Ferretti; Paul R Burton
Journal:  Int J Epidemiol       Date:  2014-09-26       Impact factor: 7.196

Review 4.  Review of survival analyses published in cancer journals.

Authors:  D G Altman; B L De Stavola; S B Love; K A Stepniewska
Journal:  Br J Cancer       Date:  1995-08       Impact factor: 7.640

5.  Associations between maternal physical activity in early and late pregnancy and offspring birth size: remote federated individual level meta-analysis from eight cohort studies.

Authors:  S Pastorino; T Bishop; S R Crozier; C Granström; K Kordas; L K Küpers; E C O'Brien; K Polanska; K A Sauder; M H Zafarmand; R C Wilson; C Agyemang; P R Burton; C Cooper; E Corpeleijn; D Dabelea; W Hanke; H M Inskip; F M McAuliffe; S F Olsen; T G Vrijkotte; S Brage; A Kennedy; D O'Gorman; P Scherer; K Wijndaele; N J Wareham; G Desoye; K K Ong
Journal:  BJOG       Date:  2018-10-22       Impact factor: 6.531
