| Literature DB >> 33784300 |
Yannick Marcon1, Tom Bishop2, Demetris Avraam3, Xavier Escriba-Montagut4,5, Patricia Ryser-Welch3, Stuart Wheater6, Paul Burton3, Juan R González4,5,7,8.
Abstract
Combined analysis of multiple, large datasets is a common objective in the health- and biosciences. Existing methods tend to require researchers to physically bring data together in one place or follow an analysis plan and share results. Developed over the last 10 years, the DataSHIELD platform is a collection of R packages that reduce the challenges of these methods. These include ethico-legal constraints which limit researchers' ability to physically bring data together and the analytical inflexibility associated with conventional approaches to sharing results. The key feature of DataSHIELD is that data from research studies stay on a server at each of the institutions that are responsible for the data. Each institution has control over who can access their data. The platform allows an analyst to pass commands to each server and the analyst receives results that do not disclose the individual-level data of any study participants. DataSHIELD uses Opal which is a data integration system used by epidemiological studies and developed by the OBiBa open source project in the domain of bioinformatics. However, until now the analysis of big data with DataSHIELD has been limited by the storage formats available in Opal and the analysis capabilities available in the DataSHIELD R packages. We present a new architecture ("resources") for DataSHIELD and Opal to allow large, complex datasets to be used at their original location, in their original format and with external computing facilities. We provide some real big data analysis examples in genomics and geospatial projects. For genomic data analyses, we also illustrate how to extend the resources concept to address specific big data infrastructures such as GA4GH or EGA, and make use of shell commands. Our new infrastructure will help researchers to perform data analyses in a privacy-protected way from existing data sharing initiatives or projects. To help researchers use this framework, we describe selected packages and present an online book (https://isglobal-brge.github.io/resource_bookdown).Entities:
Mesh:
Year: 2021 PMID: 33784300 PMCID: PMC8034722 DOI: 10.1371/journal.pcbi.1008880
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Main characteristics of the proposed infrastructure for privacy-protected federated data analyses with big data.
| Feature | Capabilities / Advantages | Limitations / Disadvantages |
|---|---|---|
| Resources | Any data source or computation resource that can be accessed from R is made available in DataSHIELD. These include databases of any kind (SQL, NoSQL, distributed Big Data systems), most of the file formats, domain specific applications accessible through web services or remote commands etc. Applies to any scientific domain. | Resource URL design can be complex when resource options need to be specified. R is the required entry point, which complexifies the use of analysis algorithms in Python for instance. |
| Scalability | Scales with number of studies. The | Some advanced statistical techniques may not scale well with number of records. The processing power can sometimes suffer from latency. There is a need for investigation share computation over multiple processors. |
| R programming language | R is an open-source and heterogenous programming language. Interpreters for available for many operating systems. A wide community support its users. | Skills for other functional and statistical programming languages can be transferred to learn R. However not all of the analysts are familiar with other programming languages, which can raise some barriers of adoption. |
| Disclosure prevention | Several complementary features that provide privacy preservation are implemented in the DataSHIELD architecture. Just one of these is that data custodians/owners (not | Data protection is shared between existing computer systems and DataSHIELD. An existing computer system that has some poorly implemented secured network access and authentication may threaten the quality of disclosure prevention. In other words, as well as the active privacy preservation built into DataSHIELD software, it is also crucial that the hardware on which everything runs must satisfy good practice for conventional privacy protection. |
| Graphical User Interface (GUI) | The use of R studio and the specialised data warehouse software have a GUI. Analysis can be integrated with GUI packages, such as R Shiny. | An appropriate use of R Scripts, R notebooks and vignettes is helpful. DataSHIELD users then need to learn how to use these approaches before using DataSHIELD. The use of R Studio can make the learning curve less steep. |
| Applications to other biomedical areas than genomics | Any other specific data infrastructures available in public repositories such as images (OpenNeuro), transcriptomic or epigenomic data (GEO) can be accessed and analyzed in a distributed and privacy preserving way. Applications outside biomedical, health and social science are also entirely possible. | There are no obvious scientific or academic domains to which DataSHIELD could not in principle be applied. Particularly, if the aim is to facilitate the quantitative analysis of individual level data which are sensitive either because of ethico-legal or information-governance restrictions, or because of their intrinsic intellectual or commercial value. |
| Cost | As everything is based on open-source freeware, the costs are minimized for any organisations who wishes to adopt DataSHIELD. | While DataSHIELD development is substantively funded through grants, there is nevertheless a need to seek support from the user community (particularly large-scale users) to contribute to the core DataSHIELD provision: user training and support; on-line support materials; bug snagging and error correction; continuous testing of evolving code; preparing and smoothly undertaking new releases. |
| Time | From the user perspective, DataSHIELD is a time-effective tool as does not require iterative in-person communication between the analyst and the data holders. This is most evident in comparing the time commitment required for a standard consortium-based meta-analysis and that required by a centrally controlled study-level meta-analysis via DataSHIELD. The former requires the analysis centre to ask each study to undertake a series of specified analysis and return results (typically several rounds of analysis and return). In contrast the latter is controlled in real time as if one was working directly with the raw data. This can speed things up by orders of magnitude. | Like any specialised open-source software, DataSHIELD analysts, developers and installer need some time to adapt to the concepts of federated systems and disclosure limitations. |
| Server sharing | DataSHIELD supports multi-tenancy, to permit multiple users to share a DataSHIELD server. It also permits multiple servers to be deployed if needed. | DataSHIELD requires some specialised data warehouse software such as Opal. At the moment there is no white-paper that formally defines standards for further development in this area. |
| Documentation | DataSHIELD has a wiki that provides beginners training to any new DataSHIELD analysts and developers. Some more advanced tutorials are available on multimedia contents through YouTube. Documentation for the Opal specialised data warehouse software is available online. | DataSHIELD documentation needs to be more supportive to the DataSHIELD developers. An online forum is available for community engagement and support. |
| Deployment | Software solution packages are available for different hosting systems (includes a container-based option) and the resource concept can adapt itself to the existing hosting infrastructure. | It is important that there is at least a minimum baseline level of system administration knowledge on the part of the data owner and proper dimensioning of the hardware (especially when targeting multi-user, computation intensive usage). Nobody should be making sensitive data available for analysis via any mechanism–including DataSHIELD–if they do not have a proper understanding of their data systems or of the governance framework under which analysis is to be enacted. |
Available resources at the resourcer R package and extensions for genomic data.
| Type | Resource | Reference | R package | Use |
|---|---|---|---|---|
| File reader | R data | resourcer | Any | |
| File reader | Tidy data (.csv,.tsv, txt, …) | resourcer | Any | |
| File location | S3 compatible file store | s3.resourcer | Any | |
| Database | SQL | resourcer | Any | |
| Database and Big Data analytics | SQL | resourcer | Any | |
| Database | NoSQL | resourcer | Any | |
| Computation service | SSH | resourcer | Any | |
| Domain specific | HL7 FHIR | fhir.resourcer | Patient’s data | |
| Database | SQL | odbc.resourcer | Any | |
| File reader | VCF | dsOmics | Genomic | |
| File reader | GDS | dsOmics | Genomic | |
| File reader | BAM | dsOmics | Genomic | |
| File reader | Bioconductor infrastructures (ExpressionSet, RangedSummarizedExperiment, MultiAssayExperiment, …) | dsOmics | Genomic | |
| Computation service | PLINK | dsOmics | Genomic | |
| Domain specific | GA4GH | dsOmics | Genomic and clinical | |
| Domain specific | EGA | dsOmics | Genomic and clinical |