Anthony Westbrook, Elizabeth Varki, W Kelley Thomas.
Abstract
MOTIVATION: Reproducibility is of central importance to the scientific process. The difficulty of consistently replicating and verifying experimental results is magnified in the era of big data, in which bioinformatics analysis often involves complex multi-application pipelines operating on terabytes of data. These processes result in thousands of possible permutations of data preparation steps, software versions and command-line arguments. Existing reproducibility frameworks are cumbersome and involve redesigning computational methods. To address these issues, we developed RepeatFS, a file system that records, replicates and verifies informatics workflows with no alteration to the original methods. RepeatFS also provides several other features to help promote analytical transparency and reproducibility, including provenance visualization and task automation.
Year: 2021 PMID: 33230554 PMCID: PMC8189677 DOI: 10.1093/bioinformatics/btaa950
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Provenance complexity
| Executable | IO Ops | Parameters | Releases |
|---|---|---|---|
| Fastq-dump | 268 552 | 43 | 48 |
| Trimmomatic | 261 775 | 17 | 39 |
| SPAdes | 609 644 | 53 | 32 |
| Prokka | 39 194 | 39 | 25 |
| Wget | 96 449 | 148 | 35 |
| GZip | 202 196 | 18 | 12 |
| Bioawk | 6547 | Many* | 1 |
| MAFFT | 1496 | 8 | 177 |
| RAxML | 2356 | 69 | 68 |
Fig. 1. A provenance graph was generated by RepeatFS for the target annotation file (green) for pipeline 1. Relationships between processes (red) and files (blue) are shown for every causal read or write operation (black arrows) that affected the creation or modification of the target file. Each pipeline shell script was expanded to display spawned child processes (red arrows). Files with identical read and write processes are automatically grouped and counted, greatly reducing the visual complexity of the graph. Though top-level graphs are shown here, we were also able to further expand and verify sub-process activity under parent programs, such as SPAdes and Prokka.
Fig. 2. A provenance graph was generated by RepeatFS for the target tree file (green) for pipeline 2. Relationships between processes (red) and files (blue) are shown for every causal read or write operation (black arrows) that affected the creation or modification of the target file. Each pipeline shell script was expanded to display spawned child processes (red arrows). Files with identical read and write processes are automatically grouped and counted, greatly reducing the visual complexity of the graph.
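The grouping step mentioned in the captions for Figs. 1 and 2 can be illustrated with a minimal sketch. This is not RepeatFS code: the function name `group_files`, the `(process, path, op)` edge format, and the example process and file names are all assumptions made for illustration. The idea is only that files sharing exactly the same set of reading processes and the same set of writing processes can be collapsed into one counted node.

```python
from collections import defaultdict


def group_files(edges):
    """Group files that share identical reader and writer process sets.

    `edges` is an iterable of (process, path, op) tuples, where op is
    "read" (the process read the file) or "write" (the process wrote it).
    This edge format is a hypothetical stand-in for recorded provenance.
    """
    readers = defaultdict(set)
    writers = defaultdict(set)
    for process, path, op in edges:
        if op == "read":
            readers[path].add(process)
        elif op == "write":
            writers[path].add(process)
        else:
            raise ValueError(f"unknown operation: {op!r}")

    # Files with the same (readers, writers) signature land in one group.
    groups = defaultdict(list)
    for path in sorted(set(readers) | set(writers)):
        signature = (frozenset(readers[path]), frozenset(writers[path]))
        groups[signature].append(path)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical edges: two intermediate files written by one process and
    # read by another collapse into a single group of size 2.
    example = [
        ("assembler", "tmp/part1", "write"),
        ("assembler", "tmp/part2", "write"),
        ("annotator", "tmp/part1", "read"),
        ("annotator", "tmp/part2", "read"),
        ("annotator", "result.gff", "write"),
    ]
    for (read_by, written_by), paths in group_files(example).items():
        print(sorted(read_by), sorted(written_by), len(paths), paths)
```

A graph renderer could then draw one node per group, labelled with the file count, instead of one node per file.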
Fig. 3. RepeatFS structure, outlining the flow of data originating from a system call issued by a process. The system call is first directed into RepeatFS by FUSE. Once the operation is received, information necessary to later reconstruct provenance is stored within a database and then sent for routing. Operations performed on real files are relayed to the underlying file system, and those performed on VDFs are handled by the block cache system. Since RepeatFS is a multithreaded file system, multiple system calls and VDF task processes may be serviced concurrently.
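The record-then-route flow described in the Fig. 3 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the RepeatFS implementation: the class `OperationRouter`, its method names, the `ops` table schema, and the way VDF paths are identified (a fixed set passed in by the caller) are all hypothetical. The sketch also omits FUSE and runs single-threaded, whereas the caption describes a multithreaded file system.

```python
import sqlite3
import time


class OperationRouter:
    """Toy model of the flow: log each operation, then dispatch it."""

    def __init__(self, vdf_paths):
        # Hypothetical: VDFs are identified by membership in a known set.
        self.vdf_paths = set(vdf_paths)
        # In-memory database standing in for the provenance store.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE ops (ts REAL, pid INTEGER, op TEXT, path TEXT)"
        )

    def handle(self, pid, op, path):
        # Step 1: store what is needed to reconstruct provenance later.
        self.db.execute(
            "INSERT INTO ops VALUES (?, ?, ?, ?)",
            (time.time(), pid, op, path),
        )
        self.db.commit()
        # Step 2: route the operation by target type.
        if path in self.vdf_paths:
            return self._block_cache(op, path)
        return self._underlying_fs(op, path)

    def _block_cache(self, op, path):
        # Placeholder for serving a VDF from cached blocks.
        return ("block_cache", op, path)

    def _underlying_fs(self, op, path):
        # Placeholder for relaying to the real file system.
        return ("underlying_fs", op, path)


if __name__ == "__main__":
    router = OperationRouter(vdf_paths={"/mnt/example/derived_file"})
    print(router.handle(100, "read", "/mnt/example/input.txt"))
    print(router.handle(100, "read", "/mnt/example/derived_file"))
```

Logging before routing means every operation is captured regardless of which backend serves it, which matches the ordering given in the caption.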