Anthony Mammoliti1,2, Petr Smirnov1,2, Minoru Nakano1, Zhaleh Safikhani1,2, Christopher Eeles1, Heewon Seo1,2, Sisira Kadambat Nair1, Arvind S Mer1,2, Ian Smith1,2, Chantal Ho1, Gangesh Beri1, Rebecca Kusko3, Eva Lin4, Yihong Yu4, Scott Martin4, Marc Hafner4,5, Benjamin Haibe-Kains6,7,8,9.
Abstract
Reproducibility is essential to open science, as findings that cannot be reproduced by independent research groups have limited relevance, regardless of their validity. It is therefore crucial for scientists to describe their experiments in sufficient detail so they can be reproduced, scrutinized, challenged, and built upon. However, the intrinsic complexity and continuous growth of biomedical data make it increasingly difficult to process, analyze, and share these data with the community in a FAIR (findable, accessible, interoperable, and reusable) manner. To overcome these issues, we created a cloud-based platform called ORCESTRA (orcestra.ca), which provides a flexible framework for the reproducible processing of multimodal biomedical data. It enables processing of clinical, genomic and perturbation profiles of cancer samples through automated processing pipelines that are user-customizable. ORCESTRA creates integrated and fully documented data objects with persistent identifiers (DOI) and manages multiple dataset versions, which can be shared for future studies.
Year: 2021 PMID: 34608132 PMCID: PMC8490371 DOI: 10.1038/s41467-021-25974-w
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Data-processing platforms and their respective features for handling multimodal data.
| Features | ORCESTRA (Pachyderm) | DNAnexus | Databricks | Lifebit |
|---|---|---|---|---|
| Create language-agnostic pipelines in the cloud | ✓ | ✓ | ✓ | ✓ |
| Large dataset support (TB in size) | ✓ | ✓ | ✓ | ✓ |
| Automatic pipeline triggering with updated data (out-of-the-box) | ✓ | X | X | X |
| Prevents recomputation of entire dataset with each new pipeline trigger | ✓ | ✓ | X | ✓ |
| Docker utilization | ✓ | ✓ | ✓ | ✓ |
| Every pipeline run and data sources are versioned with a unique identifier | ✓ | ✓ | a | a |
| Parallelism support | ✓ | ✓ | ✓ | ✓ |
| Versioning system (e.g., GitHub) for pipelines and input data | ✓ | ✓ | ✓ | ✓ |
| Open access (free) | ✓ | X | X | X |
| Direct mounting of data (no copying into file system) | X | X | ✓ | ✓ |
| Automatic cost-efficiency implementation for instances (low-priority) | X | X | X | ✓ |
| No permanent resource allocation for a pipeline (memory/CPU) | X | ✓ | ✓ | ✓ |
a Indicates partial support of the feature.
Each feature was tested against each platform using biomedical data as an input data source.
Fig. 1 Summary of samples, treatments, and molecular profiles utilized for data-object generation in ORCESTRA.
Molecular data, sample, and treatment information are combined to yield 17 unique data objects from a variety of biomedical data types.
Fig. 2 ORCESTRA web-application connectivity with data-processing layer through commit identifier (ID) scanning for user-selected pipeline requests, and subsequent data-object DOI tracking with MongoDB queries.
The web-application layer receives pipeline requests in the form of a JavaScript Object Notation (JSON) file and updates the ORCESTRA database with each data-object digital object identifier (DOI) and commit ID. The orchestration functionality scans for new pipeline requests and executes them to generate a versioned data object, which is uploaded to Zenodo to retrieve a DOI in the data-sharing layer.
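The request flow described in this caption can be illustrated with a minimal Python sketch. It is not the ORCESTRA implementation: the JSON field names, the function names, and the stand-in callables for the processing and sharing layers are all hypothetical, and a plain dictionary takes the place of the database.

```python
import json

# Hypothetical pipeline request, as a web layer might serialize it to JSON.
request_json = json.dumps({
    "request_id": "req-001",
    "dataset": "example-dataset",
    "options": {"rnaseq_tool": "example-tool"},
})

def scan_and_execute(pending, records, run_pipeline, deposit):
    """Process every pending JSON request and record its commit ID and DOI."""
    while pending:
        request = json.loads(pending.pop(0))
        commit_id = run_pipeline(request)  # stands in for the data-processing layer
        doi = deposit(commit_id)           # stands in for the data-sharing layer
        records[request["request_id"]] = {"commit_id": commit_id, "doi": doi}
    return records

# Stand-in callables (assumptions) so the sketch runs without external services.
def fake_run_pipeline(request):
    return "commit-" + request["request_id"]

def fake_deposit(commit_id):
    return "doi-placeholder-for-" + commit_id

pending_requests = [request_json]
records = scan_and_execute(pending_requests, {}, fake_run_pipeline, fake_deposit)
```

In a real deployment the two callables would be replaced by calls into the processing cluster and the archive, and the dictionary by database writes. The sketch only shows the ordering: scan, execute, deposit, then record the commit ID and DOI together.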
Fig. 3 The ORCESTRA framework layers for pipeline selection, data-object generation, and digital object identifier (DOI) sharing with a custom metadata web page.
The web-application layer allows users to request custom data objects, which are generated through Pachyderm in a Kubernetes cluster within the data-processing layer. Each versioned data object is automatically pushed to the data-sharing layer and uploaded to Zenodo to obtain a DOI. Data objects that have already been processed result in the immediate sharing of custom metadata web pages with users via email.
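The reuse of already-processed data objects can be sketched as a lookup keyed on the request parameters. The snippet below is an illustration under stated assumptions, not the platform's code: `request_key`, `handle_request`, and the placeholder DOI strings are hypothetical, and a dictionary stands in for the store of processed objects.

```python
import hashlib
import json

def request_key(params):
    """Derive a stable key from request parameters (key order does not matter)."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def handle_request(params, processed, build):
    """Return (doi, reused); build a new object only if none matches the request."""
    key = request_key(params)
    if key in processed:
        return processed[key], True   # existing object: share it immediately
    doi = build(params)               # new object: run the pipeline and deposit it
    processed[key] = doi
    return doi, False

# Stand-in builder (assumption) that records how often a pipeline would run.
build_calls = []
def fake_build(params):
    build_calls.append(params)
    return "doi-placeholder-%d" % len(build_calls)

processed = {}
first = handle_request({"dataset": "A", "tool": "x"}, processed, fake_build)
second = handle_request({"tool": "x", "dataset": "A"}, processed, fake_build)
```

The second call lists the same parameters in a different order, so it resolves to the same key and returns the stored identifier without triggering another build.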
Fig. 4 The cloud-based deployment of the ORCESTRA data-processing layer automatically versions data using Pachyderm and shares generated data objects through Zenodo via a persistent identifier (DOI).
Each file and pipeline in the Pachyderm environment is assigned a unique identifier, allowing each data object to be versioned.
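One way to see why per-file and per-pipeline identifiers make a data object versionable is a content-hash sketch. This is a generic illustration and should not be read as Pachyderm's actual identifier scheme, which the source does not describe. The function names and sample inputs are hypothetical.

```python
import hashlib

def file_id(content: bytes) -> str:
    """Identifier for one file, derived from its bytes."""
    return hashlib.sha256(content).hexdigest()

def object_version(pipeline_spec: str, input_files: dict) -> str:
    """Identifier for a data object, derived from its pipeline and input files."""
    h = hashlib.sha256()
    h.update(pipeline_spec.encode("utf-8"))
    for name in sorted(input_files):  # sorted so file order does not matter
        h.update(name.encode("utf-8"))
        h.update(file_id(input_files[name]).encode("ascii"))
    return h.hexdigest()

v1 = object_version("pipeline-v1", {"a.csv": b"1,2,3"})
v1_again = object_version("pipeline-v1", {"a.csv": b"1,2,3"})
v2 = object_version("pipeline-v1", {"a.csv": b"1,2,4"})  # input data changed
v3 = object_version("pipeline-v2", {"a.csv": b"1,2,3"})  # pipeline changed
```

Identical inputs and pipeline yield the same identifier, while a change to either produces a new one, which is the property that lets each generated data object be traced to an exact combination of data and code.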