Ravi Madduri, Kyle Chard, Mike D'Arcy, Segun C Jung, Alexis Rodriguez, Dinanath Sulakhe, Eric Deutsch, Cory Funk, Ben Heavner, Matthew Richards, Paul Shannon, Gustavo Glusman, Nathan Price, Carl Kesselman, Ian Foster.
Abstract
Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility, thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study, and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes.
Year: 2019 PMID: 30973881 PMCID: PMC6459504 DOI: 10.1371/journal.pone.0213013
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. A high-level view of the TFBS identification workflow, showing the six principal datasets, labeled D1–D6, and the five computational phases.
Fig 2. Network topology of the distributed environment used to generate the six principal datasets, labeled D1–D6, and the locations of the five computational phases.
Fig 3. An example BDBag, with contents in the data folder, description in the metadata folder, and other elements providing data required to fetch remote elements (fetch.txt) and validate its components.
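The validation mentioned in Fig 3 can be illustrated with a minimal sketch. It assumes the BagIt convention of a `manifest-<alg>.txt` file that lists one checksum per payload file under `data/`. It uses only the Python standard library, not the bdbag tool itself; the helper names `write_manifest` and `validate_manifest` and the sample file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path, alg: str = "sha256") -> str:
    """Return the hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.new(alg)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(bag_dir: Path, alg: str = "sha256") -> Path:
    """Write manifest-<alg>.txt with one '<digest>  <relative path>' line per payload file."""
    lines = []
    for p in sorted((bag_dir / "data").rglob("*")):
        if p.is_file():
            rel = p.relative_to(bag_dir).as_posix()
            lines.append(f"{file_digest(p, alg)}  {rel}")
    manifest = bag_dir / f"manifest-{alg}.txt"
    manifest.write_text("\n".join(lines) + "\n")
    return manifest


def validate_manifest(bag_dir: Path, alg: str = "sha256") -> list[str]:
    """Return the payload paths that are missing or whose checksum differs."""
    problems = []
    manifest = bag_dir / f"manifest-{alg}.txt"
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        digest, rel = line.split(maxsplit=1)
        target = bag_dir / rel
        if not target.is_file() or file_digest(target, alg) != digest:
            problems.append(rel)
    return problems


def demo() -> tuple[list[str], list[str]]:
    """Build a toy bag, validate it, corrupt the payload, and validate again."""
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp)
        (bag / "data").mkdir()
        sample = bag / "data" / "sample.txt"  # hypothetical payload file
        sample.write_text("ACGT\n")
        write_manifest(bag)
        before = validate_manifest(bag)
        sample.write_text("ACGA\n")  # simulate corruption after packaging
        after = validate_manifest(bag)
        return before, after


if __name__ == "__main__":
    print(demo())
```

The first validation reports no problems; after the payload is altered, the second reports the changed file. This is the kind of check a manifest makes possible.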
Fig 4. A minid landing page for a BDBag generated by the encode2bag tool, showing the associated metadata, including locations (in this case, just one).
Details of the per-tissue computations performed in the ensemble footprinting phase.
Data sizes are in GB. Times are in hours on a 32-core AWS node; they sum to 2,149.1 node hours or 68,771 core hours. DNase: DNase Hypersensitivity (DNase-seq) data from ENCODE. Align: Aligned sequence data. Foot: Footprint data and footprint inference computation. Numbers may not sum perfectly due to rounding.
| Tissue | Biosamples | Replicates | DNase size (GB) | Align size (GB) | Foot size (GB) | Align time (h) | Foot time (h) |
|---|---|---|---|---|---|---|---|
| adrenal gland | 3 | 8 | 31 | 68 | 0.5 | 7.4 | 18.7 |
| blood vessel | 10 | 129 | 117 | 234 | 2.1 | 32.7 | 68.9 |
| bone element | 1 | 7 | 4 | 6 | 0.2 | 1.4 | 3.1 |
| brain | 29 | 185 | 402 | 840 | 6.2 | 120.0 | 160.5 |
| bronchus | 2 | 9 | 18 | 36 | 0.4 | 3.1 | 9.7 |
| esophagus | 2 | 41 | 35 | 64 | 0.3 | 12.6 | 7.7 |
| extraembryonic | 11 | 66 | 193 | 412 | 3.0 | 46.5 | 89.9 |
| eye | 8 | 53 | 129 | 252 | 2.3 | 26.9 | 109.5 |
| gonad | 2 | 7 | 27 | 56 | 0.4 | 4.9 | 12.2 |
| heart | 8 | 69 | 169 | 342 | 2.0 | 38.8 | 76.0 |
| kidney | 8 | 29 | 84 | 174 | 2.0 | 19.5 | 96.5 |
| large intestine | 5 | 18 | 96 | 184 | 0.9 | 23.2 | 50.0 |
| liver | 3 | 8 | 26 | 62 | 0.5 | 4.9 | 19.8 |
| lung | 7 | 94 | 142 | 300 | 1.8 | 19.3 | 27.2 |
| lymphatic vessel | 2 | 30 | 21 | 286 | 0.4 | 3.6 | 11.9 |
| lymphoblast | 21 | 71 | 150 | 320 | 2.9 | 70.0 | 153.5 |
| mammary gland | 2 | 5 | 18 | 36 | 0.4 | 2.8 | 10.7 |
| mouth | 4 | 18 | 81 | 164 | 1.1 | 19.5 | 57.3 |
| muscle organ | 4 | 13 | 54 | 110 | 0.8 | 9.6 | 49.6 |
| pancreas | 2 | 13 | 38 | 84 | 0.5 | 8.1 | 30.9 |
| prostate gland | 2 | 8 | 23 | 50 | 0.3 | 4.2 | 78.1 |
| skin | 48 | 401 | 377 | 780 | 8.8 | 220.0 | 181.2 |
| spinal cord | 2 | 34 | 66 | 128 | 0.7 | 9.2 | 28.0 |
| stomach | 1 | 5 | 24 | 52 | 0.4 | 3.5 | 3.6 |
| thyroid gland | 3 | 24 | 63 | 136 | 0.7 | 13.6 | 27.0 |
| tongue | 2 | 8 | 51 | 106 | 0.7 | 13.7 | 26.1 |
| urinary bladder | 1 | 2 | 4 | 8 | 0.2 | 0.8 | 1.8 |
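The totals quoted in the caption can be cross-checked against the table. The sketch below copies the last two columns (alignment and footprinting times, in hours) from the 27 rows, sums them, and converts the reported node-hour total to core hours using the 32 cores per node stated in the caption. The function names are illustrative.

```python
# Per-tissue compute times in hours, copied row by row from the table
# (adrenal gland first, urinary bladder last).
ALIGN_HOURS = [
    7.4, 32.7, 1.4, 120.0, 3.1, 12.6, 46.5, 26.9, 4.9,
    38.8, 19.5, 23.2, 4.9, 19.3, 3.6, 70.0, 2.8, 19.5,
    9.6, 8.1, 4.2, 220.0, 9.2, 3.5, 13.6, 13.7, 0.8,
]
FOOT_HOURS = [
    18.7, 68.9, 3.1, 160.5, 9.7, 7.7, 89.9, 109.5, 12.2,
    76.0, 96.5, 50.0, 19.8, 27.2, 11.9, 153.5, 10.7, 57.3,
    49.6, 30.9, 78.1, 181.2, 28.0, 3.6, 27.0, 26.1, 1.8,
]
CORES_PER_NODE = 32        # from the caption: 32-core AWS node
REPORTED_NODE_HOURS = 2149.1  # total stated in the caption


def node_hours(align: list[float], foot: list[float]) -> float:
    """Total node hours across both computations."""
    return sum(align) + sum(foot)


def core_hours(node_h: float, cores: int = CORES_PER_NODE) -> float:
    """Convert node hours to core hours."""
    return node_h * cores


if __name__ == "__main__":
    print(f"summed node hours: {node_hours(ALIGN_HOURS, FOOT_HOURS):.1f}")
    print(f"core hours from reported total: {core_hours(REPORTED_NODE_HOURS):.0f}")
```

The rounded per-tissue values sum to about 2,149.2 node hours, within rounding of the reported 2,149.1, and 2,149.1 × 32 gives the reported 68,771 core hours.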
Fig 5. The encode2bag portal.
The user has entered an ENCODE query for urinary bladder DNase-seq data and clicked “Create BDBag.” The portal generates a Minid for the BDBag and a Globus link for reliable, high-speed access.
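A bag built from a query like this one can reference large files remotely rather than embedding them, using the fetch.txt element shown in Fig 3. As a rough sketch, assuming the BagIt fetch.txt layout of one `URL LENGTH PATH` line per remote file (with `-` for an unknown length), such a file could be assembled as follows. The `RemoteFile` type, the function names, and the example.org URLs are hypothetical and are not taken from encode2bag.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RemoteFile:
    url: str                # where the file can be retrieved
    length: Optional[int]   # size in bytes, or None if unknown
    path: str               # destination inside the bag, under data/


def fetch_line(entry: RemoteFile) -> str:
    """Format one fetch.txt line: URL, length (or '-'), and payload path."""
    length = "-" if entry.length is None else str(entry.length)
    return f"{entry.url} {length} {entry.path}"


def build_fetch_txt(entries: Iterable[RemoteFile]) -> str:
    """Return the text of a fetch.txt file for the given remote files."""
    lines = []
    for entry in entries:
        if not entry.path.startswith("data/"):
            raise ValueError(f"payload path must be under data/: {entry.path}")
        lines.append(fetch_line(entry) + "\n")
    return "".join(lines)


if __name__ == "__main__":
    # Hypothetical entries standing in for the files returned by a query.
    entries = [
        RemoteFile("https://example.org/files/a.fastq.gz", 1024, "data/a.fastq.gz"),
        RemoteFile("https://example.org/files/b.fastq.gz", None, "data/b.fastq.gz"),
    ]
    print(build_fetch_txt(entries), end="")
```

Keeping only references in the bag means the bag itself stays small, while the manifest still lets a recipient verify each file once it has been fetched.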
Fig 6. Our DNase-seq ensemble footprinting workflow, used to implement the alignment and footprinting phases of Fig 1.
The master workflow A takes a BDBag from the extraction phase as input. It executes from top to bottom, using subworkflows B and C to implement alignment and then subworkflow D to implement footprinting. It produces as output BDBags containing aligned DNase-seq data and footprints, with the latter serving as input to the subsequent TFBS identification phase.
The six datasets shown in Fig 1, D1–D6.
For each we indicate whether it is an input or output.
| # | Name | Identifier | Role | Description | Size |
|---|---|---|---|---|---|
| D1 | DNase-seq | | In | BDBag of 27 BDBags extracted from ENCODE by the Extract phase | 2.40 TB |
| D2 | Alignment | | Out | BDBag of 54 BDBags produced by the Alignment phase | 5.30 TB |
| D3 | Footprints | | Out | BDBag of 54 BDBags containing footprints computed by the Footprints phase | 0.04 TB |
| D4 | Motifs | | In | Database dump file containing the non-redundant motifs provided by Funk et al. | 31.5 GB |
| D5 | Hits | | Out | Database dump file containing the hits produced by the Hits phase | 0.04 TB |
| D6 | TFBSs | | Out | BDBag of 54 BDBags containing candidate TFBSs produced by the TFBSs phase | 0.35 TB |
The software used to implement the five steps shown in Fig 1.
As the software for the Extract step is used only to produce the input data at minid:b9dt2t, we do not provide identifiers for specific versions of that software.
| # | Name | Identifiers for software |
|---|---|---|
| | Extract | |
| | Alignment, Footprints | Galaxy pipeline (T3) |
| | Hits | R script (T6) |
| | TFBSs | R scripts (T7) |
FAIRness assessment derived using FAIRShake.
| FAIR Assessment | D1 | D2 | D3 | D4 | D5 | D6 | T1 | T2 | T3 | T4 | T5 | T6 | T7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
| ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
| ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
| ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
| ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
| ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
| ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
| ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
| ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
| ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
| ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
| ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
| ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
| ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
| ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | |
| ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
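The assessment table can be summarized as a pass count per resource. The sketch below assumes that the 13 marks in each row correspond, in order, to the columns D1–D6 and T1–T7, and that every row carries equal weight. The names of the assessed criteria are not available in this record, so rows are identified only by position.

```python
RESOURCES = ["D1", "D2", "D3", "D4", "D5", "D6",
             "T1", "T2", "T3", "T4", "T5", "T6", "T7"]
N_ROWS = 16  # number of assessment rows in the table


def build_rows() -> list[list[bool]]:
    """Encode the table: True for a check mark, False for a cross."""
    rows = [[True] * len(RESOURCES) for _ in range(N_ROWS)]
    # Row 6: only D2-D6 fail.
    failing = {"D2", "D3", "D4", "D5", "D6"}
    rows[5] = [name not in failing for name in RESOURCES]
    # Row 16: every resource fails.
    rows[15] = [False] * len(RESOURCES)
    return rows


def pass_counts(rows: list[list[bool]]) -> dict[str, int]:
    """Count how many rows each resource passes."""
    return {
        name: sum(row[i] for row in rows)
        for i, name in enumerate(RESOURCES)
    }


if __name__ == "__main__":
    for name, passed in pass_counts(build_rows()).items():
        print(f"{name}: {passed}/{N_ROWS}")
```

Under these assumptions, D1 and the seven tools pass 15 of 16 rows, and D2–D6 pass 14 of 16.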