Samir Das1,2, Xavier Lecours Boucher1,2, Christine Rogers1,2, Carolina Makowski1,2,3, François Chouinard-Decorte1,2, Kathleen Oros Klein4,5, Natacha Beck1,2, Pierre Rioux1,2, Shawn T Brown1,2, Zia Mohaddes1,2, Cole Zweber1,2, Victoria Foing1,2, Marie Forest4,5, Kieran J O'Donnell3,4, Joanne Clark4, Michael J Meaney3,4, Celia M T Greenwood4,5, Alan C Evans1,2.
Abstract
Analysis of "omics" data is often a long and segmented process, encompassing multiple stages from initial data collection to processing, quality control and visualization. The cross-modal nature of recent genomic analyses renders this process challenging to both automate and standardize; consequently, users often resort to manual interventions that compromise data reliability and reproducibility. This in turn can produce multiple versions of datasets across storage systems. As a result, scientists can lose significant time and resources trying to execute and monitor their analytical workflows and encounter difficulties sharing versioned data. In 2015, the Ludmer Centre for Neuroinformatics and Mental Health at McGill University brought together expertise from the Douglas Mental Health University Institute, the Lady Davis Institute and the Montreal Neurological Institute (MNI) to form a genetics/epigenetics working group. The objectives of this working group are to: (i) design an automated and seamless process for (epi)genetic data that consolidates heterogeneous datasets into the LORIS open-source data platform; (ii) streamline data analysis; (iii) integrate results with provenance information; and (iv) facilitate structured and versioned sharing of pipelines for optimized reproducibility using high-performance computing (HPC) environments via the CBRAIN processing portal. 
This article outlines the resulting generalizable "omics" framework and its benefits, specifically, the ability to: (i) integrate multiple types of biological and multi-modal datasets (imaging, clinical, demographic and behavioral); (ii) automate the process of launching analysis pipelines on HPC platforms; (iii) remove the bioinformatic barriers that are inherent to this process; (iv) ensure standardization and transparent sharing of processing pipelines to improve computational consistency; (v) store results in a queryable web interface; (vi) offer visualization tools to better view the data; and (vii) provide the mechanisms to ensure usability and reproducibility. This framework for workflows facilitates brain research discovery by reducing human error through automation of analysis pipelines and seamless linking of multimodal data, allowing investigators to focus on research instead of data handling.
Keywords: HPC; biostatistics; database; genomics; integrative neuroscience; omics analysis; reproducibility; workflow
Year: 2018 PMID: 30631270 PMCID: PMC6315165 DOI: 10.3389/fninf.2018.00091
Source DB: PubMed Journal: Front Neuroinform ISSN: 1662-5196 Impact factor: 4.081
Figure 1. Generalized workflow cycle between the LORIS data-management platform and the CBRAIN processing platform. Data from LORIS (Storage) can be queried and filtered (Genomic Browser and other tools) to select a set of variables and/or files. The newly created dataset is then transferred to the CBRAIN DataProvider for processing (Task Launching) and analysis (high-performance computing, HPC). The output is synced back to LORIS with the provenance data. Results can be examined and a new iteration can begin with the added derived variables. For stepwise details of this model, see Figure 2 in the “Results” section.
Figure 2. Genomic processing cycle between LORIS and CBRAIN through the DataProvider. Methylation450K pipeline, brown path (1): IDAT files are transferred to the DataProvider, then the methylation normalization pipeline is launched. The Beta-values output file is returned to the DataProvider, and then loaded into LORIS using the Genomic Uploader. The inserted results can be browsed or visualized in the Genomic Browser module. ImputePrepSanger pipeline, green path (2): PLINK files are added to LORIS via the Genomic Uploader, selected in the DatasetBuilder, and sent to CBRAIN for the imputePrepSanger tool to be run. The resulting Variant Call Format (VCF) output file is stored in LORIS via the pink path (4). Statistical analysis, blue path (3): using the DatasetBuilder module in LORIS, data from any source (orange path (5), red path (6)) can be packaged in a new dataset and sent to CBRAIN via the DataProvider for statistical analysis using, for example, the principal component of explained variance (PCEV) pipeline.
Figure 3. Relationship between the three files required for loading methylation data in the LORIS Genomic Browser. The Beta-values file contains a value for each biosample tested on each probe. Each biosample in the Beta-values file is linked to a study subject in the Sample mapping file, using a subject identifier (Participant_id). Each probe from the Beta-values file is linked to a set of properties in the Annotations file provided by the chip manufacturer (Illumina).
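The linkage described in the Figure 3 caption can be sketched as a pair of lookups. This is a minimal illustration in Python, not the LORIS loader: the in-memory dictionaries stand in for the three files, and every field name other than Participant_id (probe and sample identifiers, `chromosome`, `position`, `beta`) is an assumption made for the example.

```python
# Hypothetical stand-ins for the three files. Only Participant_id is named in
# the caption; all other keys and values are illustrative assumptions.
beta_values = {
    # probe_id -> {biosample_id: beta value}
    "cg00000001": {"S1": 0.82, "S2": 0.15},
    "cg00000002": {"S1": 0.40, "S2": 0.47},
}
sample_mapping = {"S1": "P001", "S2": "P002"}  # biosample_id -> Participant_id
annotations = {
    # probe_id -> properties supplied by the chip manufacturer
    "cg00000001": {"chromosome": "chrY", "position": 15010500},
    "cg00000002": {"chromosome": "chr1", "position": 15900},
}


def link_records(beta_values, sample_mapping, annotations):
    """Return one flat record per (probe, biosample) pair.

    Each Beta-value is linked to a subject through the sample mapping and to
    probe properties through the annotations.
    """
    records = []
    for probe_id, per_sample in beta_values.items():
        probe_info = annotations[probe_id]
        for sample_id, beta in per_sample.items():
            records.append({
                "Participant_id": sample_mapping[sample_id],
                "sample_id": sample_id,
                "probe_id": probe_id,
                "chromosome": probe_info["chromosome"],
                "position": probe_info["position"],
                "beta": beta,
            })
    return records


records = link_records(beta_values, sample_mapping, annotations)
```

A flat record per probe and biosample is only one possible layout; it is used here because it makes subject-level and probe-level filters equally easy to express.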
Figure 4. LORIS Genomic Browser: Profiles tab. A filter is applied to search for subjects based on Site, Gender, Subproject, External ID and the availability of genomic data. In the table, detailed subject data can be accessed by clicking on the link that appears on each item.
Figure 5. Filters and methylation Beta-values in the Genomic Browser. Filters are applied on subject information, genomic range and the probe’s annotations. The filtered data view can be downloaded as a CSV file. The hyperlink in each “CpG Name” column cell takes the user to the online UCSC genome browser, which provides detailed information about a given CpG from the most recent human genome build version.
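As a rough illustration of the operations named in the Figure 5 caption (range filter, CSV download, link to the UCSC genome browser), the following Python sketch uses only the standard library. It is not the Genomic Browser implementation. The row fields are assumptions, and the link uses the public UCSC `hgTracks` address with `db` and `position` query parameters; which genome build LORIS passes is not stated in the source, so `genome_build` is left as a required argument.

```python
import csv
import io
from urllib.parse import urlencode


def filter_by_range(rows, chromosome, start, end):
    """Keep rows whose probe lies on `chromosome` within [start, end]."""
    return [
        row for row in rows
        if row["chromosome"] == chromosome and start <= row["position"] <= end
    ]


def to_csv(rows, fieldnames):
    """Serialize the filtered view as CSV text, ignoring any extra fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def ucsc_link(chromosome, start, end, genome_build):
    """Build a UCSC genome browser link for a region (build is an assumption)."""
    query = urlencode(
        {"db": genome_build, "position": f"{chromosome}:{start}-{end}"}
    )
    return "https://genome.ucsc.edu/cgi-bin/hgTracks?" + query


# Illustrative rows; identifiers and values are made up for the example.
rows = [
    {"probe_id": "cg00000001", "chromosome": "chrY", "position": 15010500, "beta": 0.82},
    {"probe_id": "cg00000002", "chromosome": "chr1", "position": 15900, "beta": 0.40},
]
selected = filter_by_range(rows, "chrY", 15010000, 15039953)
csv_text = to_csv(selected, ["probe_id", "chromosome", "position", "beta"])
link = ucsc_link("chrY", 15010000, 15039953, genome_build="hg19")
```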
Figure 6. Example Genomic Viewer showing the context for single-nucleotide polymorphisms (SNPs) and CpGs in a small region, here chromosome Y from position 15010000 to 15039953, including features from external sources. The upper section of the plot presents the transcripts of gene DDX3Y with 5′UTR, as well as exons and transcription direction, dynamically queried from the UCSC Genome browser. In the middle track, box plot distributions show Beta-values for each CpG. In the lowest track, users can view SNP and CpG positions stored in LORIS.
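The middle track of Figure 6 summarizes Beta-values per CpG as box plots. A minimal Python sketch of the underlying five-number summary is given below; it assumes Beta-values are already grouped by CpG, uses the default quartile method of the standard `statistics` module (the method used by the Genomic Viewer is not stated in the source), and the CpG identifier and values are made up.

```python
import statistics


def box_plot_summary(values):
    """Return the five-number summary behind a box plot.

    Requires at least two values, as statistics.quantiles does.
    """
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Illustrative Beta-values for one hypothetical CpG across five biosamples.
betas_by_cpg = {"cg00000001": [0.1, 0.2, 0.3, 0.4, 0.5]}
summaries = {cpg: box_plot_summary(v) for cpg, v in betas_by_cpg.items()}
```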
Figure 7. Prototype DatasetBuilder module. The preview panel displays all records returned from jointly querying the database, using the “BMI underweight” pre-built query stored in the data querying tool (DQT) module. This is joined with all subject-samples for which CpGs were found on chromosome 1 between positions 15865 and 1266504 in the Methylation450k dataset Beta-values.
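The joint query in the Figure 7 caption amounts to an inner join between the subjects returned by a saved query and the methylation records that fall in a genomic range. The Python sketch below illustrates that idea under stated assumptions: it is not the DatasetBuilder code, the saved query is represented simply by its result rows (its selection criteria are not given in the source), and all field names other than Participant_id are hypothetical. Only the chromosome 1 range comes from the caption.

```python
def join_query_with_methylation(query_rows, methylation_rows,
                                chromosome, start, end):
    """Inner-join saved-query results with in-range methylation records.

    Rows are matched on Participant_id; a methylation record is kept only if
    its probe lies on `chromosome` within [start, end].
    """
    by_participant = {row["Participant_id"]: row for row in query_rows}
    joined = []
    for record in methylation_rows:
        in_range = (record["chromosome"] == chromosome
                    and start <= record["position"] <= end)
        if in_range and record["Participant_id"] in by_participant:
            joined.append({**by_participant[record["Participant_id"]], **record})
    return joined


# Illustrative inputs; identifiers and values are made up for the example.
query_rows = [{"Participant_id": "P001", "bmi_category": "underweight"}]
methylation_rows = [
    {"Participant_id": "P001", "probe_id": "cgA", "chromosome": "chr1",
     "position": 20000, "beta": 0.31},
    {"Participant_id": "P002", "probe_id": "cgA", "chromosome": "chr1",
     "position": 20000, "beta": 0.58},
    {"Participant_id": "P001", "probe_id": "cgB", "chromosome": "chr1",
     "position": 2000000, "beta": 0.77},
]
preview = join_query_with_methylation(
    query_rows, methylation_rows, "chr1", 15865, 1266504
)
```

In this example only the first methylation record survives: the second belongs to a subject outside the saved query and the third lies outside the range.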
Figure 8. Prototype of the LORIS Imaging Browser with single-button launch capability for PhantomPipeline processing. A user can click on the “Launch” button under the “PhantomPipeline” column to initiate transfer of the scan dataset to CBRAIN and begin execution of the task.
Figure 9. View of a task (PhantomPipeline) running on the CBRAIN web portal, launched from the LORIS Imaging Browser module shown in Figure 8. The task was launched automatically through CBRAIN’s application programming interface (API), but can also be viewed and monitored interactively through the web portal.