Neha Kulkarni, Luca Alessandrì, Riccardo Panero, Maddalena Arigoni, Martina Olivero, Giulio Ferrero, Francesca Cordero, Marco Beccuti, Raffaele A Calogero.
Abstract
BACKGROUND: Reproducibility of research is a key element of modern science and is mandatory for any industrial application. It represents the ability to replicate an experiment independently of location and operator. Therefore, a study can be considered reproducible only if all the data used are available and the computational analysis workflow is clearly described. However, to reproduce a complex bioinformatics analysis today, the raw data and the list of tools used in the workflow may not be enough to guarantee the reproducibility of the results. Indeed, different releases of the same tools and/or of the system libraries those tools rely on can lead to subtle reproducibility issues.
Keywords: Chromatin immunoprecipitation sequencing; Community; Docker; Reproducible research; Single nucleotide variants; Whole transcriptome sequencing; microRNA sequencing
Year: 2018 PMID: 30367595 PMCID: PMC6191970 DOI: 10.1186/s12859-018-2296-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Good practice bioinformatics rules, derived from Sandve et al. [5]
| 1 | For Every Result, Keep Track of How It Was Produced |
| 2 | Avoid Manual Data Manipulation Steps |
| 3 | Archive the Exact Versions of All External Programs Used |
| 4 | Version Control All Custom Scripts |
| 5 | Record All Intermediate Results, When Possible in Standardized Formats |
| 6 | For Analyses That Include Randomness, Note Underlying Random Seeds |
| 7 | Always Store Raw Data behind Plots |
| 8 | Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected |
| 9 | Connect Textual Statements to Underlying Results |
| 10 | Provide Public Access to Scripts, Runs, and Results |
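Rules 1, 3 and 6 in particular lend themselves to automation. The following is a minimal sketch, not part of docker4seq or of the Reproducible Bioinformatics Project, of how an analysis script might record its own provenance. The function names, the record layout, and the tool name `example_aligner` are hypothetical and chosen only for illustration.

```python
import hashlib
import json
import platform
import random
import sys


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_provenance(seed, tool_versions, input_paths):
    """Build a JSON-serialisable record of how a result was produced."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "command": list(sys.argv),                            # rule 1: how it was run
        "seed": seed,                                         # rule 6: random seed
        "tool_versions": dict(tool_versions),                 # rule 3: exact versions
        "inputs": {p: file_sha256(p) for p in input_paths},   # rule 1: which data
    }


def provenance_json(record):
    """Serialise the record deterministically so it can be diffed and archived."""
    return json.dumps(record, indent=2, sort_keys=True)


def run_analysis(seed, n=5):
    """Toy stand-in for an analysis step that involves randomness."""
    rng = random.Random(seed)  # local generator, so the seed fully determines the output
    return [rng.random() for _ in range(n)]


if __name__ == "__main__":
    seed = 42
    # "example_aligner" and its version string are placeholders, not a real tool.
    record = make_provenance(seed, {"example_aligner": "0.0.0"}, input_paths=[])
    print(provenance_json(record))
    print(run_analysis(seed))
```

Storing this record next to each output file makes it possible to tell later which tool releases and which seed produced a given result. It does not capture system libraries, which is the gap that container images are meant to close.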
Fig. 1 Reproducible Bioinformatics Project structure
Fig. 2 Workflows available in the stable branch of docker4seq. a Whole transcriptome sequencing workflow, b ChIP sequencing workflow, and c miRNA sequencing workflow. The names followed by parentheses are the docker4seq functions used to execute the analysis steps. Black indicates elements shared by more than one workflow
Fig. 3 Variant calling workflows under refinement in the development branch of docker4seq. a SNV calling workflow for DNA. The function snvPreprocessing requires users to provide their own copy of the GATK software because of Broad Institute license restrictions. This function returns a sorted bam file with duplicates marked, after GATK indel realignment and quality recalibration. b Data preprocessing for samples derived from Patient Derived Xenografts (PDX). The xenome function discriminates between mouse host reads and human tumor reads; the DNA or RNA SNV calling workflows can then be applied. c SNV calling workflow for RNA. The function star2steps generates a sorted bam in which duplicates are marked and which is processed by opossum to remove intronic regions and merge overlapping reads. The names followed by parentheses are the docker4seq functions used to execute the analysis steps. Black indicates elements shared by more than one workflow
Fig. 4 Variant calling workflows under development in the development branch of docker4seq. a Somatic SNV detection using GATK MUTECT 1 or 2. b Platypus-based joint mutation caller. Dashed blocks are not yet implemented
Fig. 5 sncRNA workflow. The sncRNA pipeline starts from a reference composed of all sncRNAs shorter than 80 bp. Two types of scripts are then used: one dedicated to the detection of known and novel microRNAs, the other focused on sncRNAs
Fig. 6 HashClone pipeline. The HashClone strategy is organized in three steps. The first step (red box) detects k-mers in all patients’ samples. The second step (green box) focuses on generating sequence signatures, leading to the identification of the set of putative clones present in each patient’s sample. The third step (blue box) characterizes and evaluates the cancer clones
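To make the first HashClone step more concrete, the sketch below counts k-mers per sample with a plain dictionary-based counter. It only illustrates the general idea of k-mer detection across samples; it is not the HashClone implementation, and the function names and the sample layout (a mapping from sample name to a list of read strings) are assumptions made for this example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across a collection of reads."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    counts = Counter()
    for read in reads:
        # Reads shorter than k contribute no k-mers (the range is empty).
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def kmers_per_sample(samples, k):
    """Apply count_kmers to each sample; `samples` maps name -> list of reads."""
    return {name: count_kmers(reads, k) for name, reads in samples.items()}


def shared_kmers(per_sample):
    """Return the k-mers observed in every sample."""
    key_sets = [set(counts) for counts in per_sample.values()]
    return set.intersection(*key_sets) if key_sets else set()


if __name__ == "__main__":
    # Toy reads, invented for the example.
    samples = {
        "sample_A": ["ACGTAC", "GTACGT"],
        "sample_B": ["TACGTT"],
    }
    per_sample = kmers_per_sample(samples, k=3)
    print(sorted(shared_kmers(per_sample)))
```

A real tool would stream reads from FASTQ files and use a memory-efficient hash structure rather than holding all counts in a Python dictionary.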