Youri Hoogstrate, Chao Zhang, Alexander Senf, Jochem Bijlard, Saskia Hiltemann, David van Enckevort, Susanna Repo, Jaap Heringa, Guido Jenster, Remond J A Fijneman, Jan-Willem Boiten, Gerrit A Meijer, Andrew Stubbs, Jordi Rambla, Dylan Spalding, Sanne Abeln.
Abstract
High-throughput molecular profiling techniques routinely generate vast amounts of data for translational medicine studies. Because these data are personally identifiable, secure, access-controlled systems are needed to manage, store, transfer and distribute them. The European Genome-phenome Archive (EGA) was created to facilitate access to, and management of, the long-term archival of bio-molecular data. Each data provider is responsible for ensuring that a Data Access Committee is in place to grant access to data stored in the EGA. Moreover, data are encrypted in transit during both upload and download. ELIXIR, a European research infrastructure for life-science data, initiated a project (2016 Human Data Implementation Study) to understand and document the ELIXIR requirements for secure management of controlled-access data. As part of this project, a full ecosystem was designed to connect archived raw experimental molecular profiling data with interpreted data and the computational workflows, using the CTMM Translational Research IT (CTMM-TraIT) infrastructure (http://www.ctmm-trait.nl) as an example. Here we present the first outcome of this project: a framework that enables the secure download of EGA data to a Galaxy server. Galaxy provides an intuitive user interface for molecular biologists and bioinformaticians to design and run data analysis workflows. More specifically, we developed a tool, ega_download_streamer, that securely downloads data from EGA into a Galaxy server, where the data can subsequently be processed further. This tool allows a user to run, from within the browser, an entire analysis containing sensitive data from EGA, and to make this analysis available to other researchers in a reproducible manner, as shown with a proof-of-concept study. The tool ega_download_streamer is available in the Galaxy tool shed: https://toolshed.g2.bx.psu.edu/view/yhoogstrate/ega_download_streamer.
Keywords: EGA; Galaxy; bioinformatics; data management; translational research; workflows
Year: 2016 PMID: 28232859 PMCID: PMC5302147 DOI: 10.12688/f1000research.10221.1
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Figure 1. A flowchart of the designed ecosystem for the secure management and storage of clinical research data.
The clinical data of an experiment comprise the clinical-pathological data, including tissue and patient information. Descriptors of the samples, combined with these variables, are stored in tranSMART. Molecular profiling data are derived from patient samples: the samples are processed in the laboratory to obtain tissue derivatives (isolated DNA, RNA and proteins), which are then analysed by high-throughput experimental techniques to obtain the raw molecular profiling data. The descriptions of the performed experiments are also stored in tranSMART. The raw data produced by the high-throughput analysis are physically stored in repositories such as EGA, while the interpreted data, produced by extensive computational workflows, and references to the raw data are stored in tranSMART. Galaxy provides the ability to reanalyse the raw data. The work described here, indicated by red arrows, implements a data connection that allows a user to retrieve raw data from EGA in Galaxy and run subsequent workflows constructed from tools in the Galaxy tool shed.
Figure 2. STAR-Fusion workflow in Galaxy.
The workflow first obtains the raw data from EGA and then reanalyses them in a sequence of components to derive interpreted data. The raw forward and reverse FASTQ sequencing reads are imported from EGA by ega_download_streamer. The tool FASTQ Groomer then checks the consistency of the data formats. Next, Sickle trims low-quality bases (Q<30) and discards reads that are shorter than 25 bases after trimming, so that only high-quality sequencing reads are output. These reads are aligned to the hg19 reference genome (GenBank Assembly ID GCA_000001405.1) with RNA STAR. STAR-Fusion is then used to predict fusion genes, which also requires two reference files as auxiliary inputs. Finally, the output passes through two filters that keep only predictions supported by more than two split reads and more than two spanning reads.
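The final filtering step of the workflow can be illustrated with a minimal Python sketch. It is not the Galaxy filter tool used in the study, and the field names `split_reads` and `spanning_reads` are hypothetical placeholders rather than the actual column names of the STAR-Fusion output; only the two thresholds ("more than two") are taken from the caption.

```python
from typing import Dict, List

# Thresholds from the figure caption: a prediction is kept only when it has
# MORE than two split reads and MORE than two spanning reads.
MIN_SPLIT_READS = 2
MIN_SPANNING_READS = 2


def filter_fusions(
    predictions: List[Dict],
    min_split: int = MIN_SPLIT_READS,
    min_spanning: int = MIN_SPANNING_READS,
) -> List[Dict]:
    """Keep predictions whose read support strictly exceeds both thresholds.

    The dictionary keys 'split_reads' and 'spanning_reads' are assumed
    names chosen for this illustration.
    """
    return [
        p
        for p in predictions
        if p["split_reads"] > min_split and p["spanning_reads"] > min_spanning
    ]


# Made-up example records, for illustration only (not results from the study).
example = [
    {"fusion": "GENE_A--GENE_B", "split_reads": 5, "spanning_reads": 3},
    {"fusion": "GENE_C--GENE_D", "split_reads": 2, "spanning_reads": 10},
    {"fusion": "GENE_E--GENE_F", "split_reads": 7, "spanning_reads": 1},
]

kept = filter_fusions(example)
for record in kept:
    print(record["fusion"])
```

Both comparisons are strict, so a prediction with exactly two split reads or exactly two spanning reads is dropped, which matches the "more than two" wording of the caption.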