Masaomi Hatakeyama, Lennart Opitz, Giancarlo Russo, Weihong Qi, Ralph Schlapbach, Hubert Rehrauer.
Abstract
BACKGROUND: Next generation sequencing (NGS) produces massive datasets consisting of billions of reads and up to thousands of samples. Subsequent bioinformatic analysis is typically done with the help of open source tools, where each application performs a single step towards the final result. This situation leaves bioinformaticians with the tasks of combining the tools, managing the data files and meta-information, documenting the analysis, and ensuring reproducibility.
Keywords: Data analysis framework; Meta-level system design; Reproducible research
Year: 2016 PMID: 27255077 PMCID: PMC4890512 DOI: 10.1186/s12859-016-1104-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The use case of DataSet generation. By running a SUSHI application with an input DataSet and parameters, a new DataSet is generated. Initially (Step 1) only the meta-information, the parameter file, and the job scripts are generated. The actual data files and the log files are generated by executing the static job scripts (Step 2).
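The two-step pattern in Fig. 1 can be illustrated with a minimal Ruby sketch. It is not SUSHI's actual implementation: the function names `prepare_run` and `execute`, the file names `parameters.tsv` and `dataset.tsv`, and the `wc -c` job are all assumptions chosen for illustration. Step 1 writes only the parameter file, the meta-information of the new DataSet, and static job scripts; Step 2 runs the scripts, which produce the data and log files.

```ruby
require "fileutils"
require "tmpdir"

# Step 1 (hypothetical helper): write the parameter file, the new DataSet's
# meta-information, and one static job script per sample. No data file is
# produced here.
def prepare_run(input_rows, params, out_dir)
  FileUtils.mkdir_p(out_dir)
  File.write(File.join(out_dir, "parameters.tsv"),
             params.map { |key, value| "#{key}\t#{value}" }.join("\n") + "\n")

  scripts = []
  next_rows = input_rows.map do |row|
    result = File.join(out_dir, "#{row['Name']}.txt")
    script = File.join(out_dir, "#{row['Name']}.sh")
    # Placeholder job: count the bytes of the input file.
    File.write(script, "#!/bin/sh\nwc -c < '#{row['Read [File]']}' > '#{result}'\n")
    scripts << script
    { "Name" => row["Name"], "Count [File]" => result }
  end

  headers = next_rows.first.keys
  table = [headers.join("\t")] + next_rows.map { |r| r.values_at(*headers).join("\t") }
  File.write(File.join(out_dir, "dataset.tsv"), table.join("\n") + "\n")
  [next_rows, scripts]
end

# Step 2 (hypothetical helper): execute the static job scripts. Each run
# writes its data file and a log file next to the script.
def execute(scripts)
  scripts.each do |script|
    system("sh", script, [:out, :err] => ["#{script}.log", "w"])
  end
end
```

Because the job scripts are plain files written before anything runs, they can be inspected or re-executed later without the generating application, which is the property the caption attributes to the static job scripts.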
A sample DataSet
| Name | Read [File] | Species | Genotype [Factor] |
|---|---|---|---|
| Mut1 | P1001/ventricles/mut1_R1.fastq.gz | Mus musculus | Mutant |
| Mut2 | P1001/ventricles/mut2_R1.fastq.gz | Mus musculus | Mutant |
| Wt1 | P1001/ventricles/wt1_R1.fastq.gz | Mus musculus | Wildtype |
| Wt2 | P1001/ventricles/wt2_R1.fastq.gz | Mus musculus | Wildtype |
Example of a sequencing read DataSet where a subset of the meta-information is shown as annotation columns. The DataSet includes four samples with four categories of meta-information: 1. Name, 2. Read, 3. Species, and 4. Genotype. Each column header can have a tag. For example, [File] means the column holds file locations, and [Factor] means the values represent an experimental factor. The DataSet object is implemented as an Array of Hash objects in the SUSHI system, and it can be imported from or exported to a tab-separated values file.
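As a rough illustration of the Array-of-Hash representation, the sample DataSet can be written in Ruby as follows. The helper names `dataset_to_tsv`, `dataset_from_tsv`, and `columns_tagged` are hypothetical and are not taken from the SUSHI code base; the sketch only assumes that the tag is kept as part of the column header, as in the table.

```ruby
# The sample DataSet as an Array of Hash objects (one Hash per sample).
DATASET = [
  { "Name" => "Mut1", "Read [File]" => "P1001/ventricles/mut1_R1.fastq.gz",
    "Species" => "Mus musculus", "Genotype [Factor]" => "Mutant" },
  { "Name" => "Mut2", "Read [File]" => "P1001/ventricles/mut2_R1.fastq.gz",
    "Species" => "Mus musculus", "Genotype [Factor]" => "Mutant" },
  { "Name" => "Wt1", "Read [File]" => "P1001/ventricles/wt1_R1.fastq.gz",
    "Species" => "Mus musculus", "Genotype [Factor]" => "Wildtype" },
  { "Name" => "Wt2", "Read [File]" => "P1001/ventricles/wt2_R1.fastq.gz",
    "Species" => "Mus musculus", "Genotype [Factor]" => "Wildtype" }
].freeze

# Export: header line followed by one tab-separated line per sample.
def dataset_to_tsv(rows)
  headers = rows.first.keys
  lines = [headers.join("\t")]
  rows.each { |row| lines << row.values_at(*headers).join("\t") }
  lines.join("\n") + "\n"
end

# Import: the first line gives the Hash keys, each later line one sample.
def dataset_from_tsv(text)
  lines = text.lines.map(&:chomp).reject(&:empty?)
  headers = lines.first.split("\t")
  lines.drop(1).map { |line| headers.zip(line.split("\t", -1)).to_h }
end

# Select the column headers that carry a given tag, e.g. "File" or "Factor".
def columns_tagged(rows, tag)
  rows.first.keys.select { |header| header.end_with?("[#{tag}]") }
end
```

With these definitions, `dataset_from_tsv(dataset_to_tsv(DATASET))` reproduces the original Array, and `columns_tagged(DATASET, "File")` picks out the `Read [File]` column.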
Fig. 2 Screenshots of the DataSet view and the parameter setting view. (a) The DataSet view shows basic information about the DataSet, the sample information, and the compatible SUSHI applications at the bottom. Each SUSHI application is shown as a button and categorized based on the @analysis_category defined in the SUSHI application Ruby code. (b) After a SUSHI application is selected, the parameter setting view lets users modify the analysis parameters. According to the SUSHI application definition, GUI components are auto-generated and placed in the view.
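One way to picture how a view like this could be derived from an application definition is sketched below. Only the `@analysis_category` instance variable is named in the caption; the `SketchApp` class, the parameter names, and the mapping from default values to widget types are assumptions made for illustration and do not describe SUSHI's real classes.

```ruby
# Hypothetical application definition: a name, a category, and default
# parameter values.
class SketchApp
  attr_reader :name, :analysis_category, :params

  def initialize(name, analysis_category, params)
    @name = name
    @analysis_category = analysis_category
    @params = params
  end
end

# Assumed rule: the type of the default value decides the GUI component.
def widget_for(value)
  case value
  when true, false then :checkbox
  when Array       then :select
  when Numeric     then :number_field
  else                  :text_field
  end
end

# Group application buttons by category, as in the DataSet view.
def buttons_by_category(apps)
  apps.group_by(&:analysis_category).transform_values { |list| list.map(&:name) }
end

# Derive one form component per parameter, as in the parameter setting view.
def form_for(app)
  app.params.map { |key, value| [key, widget_for(value)] }.to_h
end
```

In this sketch, adding a parameter to the definition is enough for a matching component to appear in the form, so no view code has to be written per application.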
Fig. 3 Screenshots of the DataSet list and part of a result generated by the edgeR SUSHI application. (a) The DataSets are listed in a tree view (top) and a table view (bottom). In the tree view, each node indicates a DataSet, and the parent node indicates the input DataSet for the child node. (b) Visualizations from the differential expression result of the edgeR SUSHI application. We show a scatter plot with significantly differentially expressed genes colored red (left) and a clustered heatmap (right). All calculated data is downloadable from this view.
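The parent-child relation behind the tree view can be sketched in a few lines of Ruby. The record layout (`id`, `name`, `parent_id`) and the DataSet names are hypothetical; the sketch only assumes that each DataSet stores a reference to the input DataSet it was generated from.

```ruby
# Hypothetical DataSet records: parent_id points to the input DataSet.
DATASETS = [
  { id: 1, name: "ventricles_reads", parent_id: nil },
  { id: 2, name: "alignment",        parent_id: 1 },
  { id: 3, name: "counts",           parent_id: 2 },
  { id: 4, name: "read_qc",          parent_id: 1 }
].freeze

# Return the tree as indented lines, depth-first, children under parents.
def render_tree(datasets, parent_id = nil, depth = 0)
  datasets.select { |d| d[:parent_id] == parent_id }.flat_map do |d|
    ["#{'  ' * depth}#{d[:name]}"] + render_tree(datasets, d[:id], depth + 1)
  end
end

puts render_tree(DATASETS)
```

Walking from any node up to the root lists every DataSet that the result depends on.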
Various types of workflow management systems are compared
| System | UI | Language | Application | Meta-info. | Reproducibility | Documentation |
|---|---|---|---|---|---|---|
| Galaxy | GUI | Python | Workflow editor | Generating | Workflow | Galaxy file (.ga) |
| Chipster | GUI | Java | Workflow view | None | Workflow | Chipster file (.bsh) |
| GeneProf | GUI | Java | Workflow designer | None | Workflow | Image file |
| GenePattern | GUI,CLI | Java | Additional module | None | Pipeline | GenePattern library |
| Taverna | GUI,CLI | Java,Scufl | Plugin | Three types | Workflow | Workflow file |
| TOGGLE | CLI | Perl | Text file | None | Perl script | Text file |
| bpipe | CLI | Groovy,Java | bpipe script | None | bpipe script | bpipe script |
| NGSANE | CLI | Bash | Text file | None | trigger.sh | Text file |
| nestly | CLI | Python | Python script | None | nestrun | Python script |
| Snakemake | CLI | Python | Build file | None | snakemake | Build file |
| Ruffus | CLI | Python | Python script | None | Python script | Python script |
| Makeflow | CLI | C | Makeflow Language | None | Makeflow script | Workflow script |
| SUSHI | GUI,CLI | Ruby | Ruby script | tsv format | Shell script | Shell script |
The systems are described by several features and categorized by user interface type, either GUI or CLI. Most systems have a proprietary format to save a workflow definition. More details are available in the Results section and in Additional file 7.
Fig. 4 The number of jobs submitted using SUSHI at the Functional Genomics Center Zurich. The number has been increasing since 2013, and more than 5000 jobs have now been submitted.