snakePipes: facilitating flexible, scalable and integrative epigenomic analysis.

Vivek Bhardwaj¹,², Steffen Heyne¹, Katarzyna Sikora¹, Leily Rabbani¹, Michael Rauer¹, Fabian Kilpert³, Andreas S Richter⁴, Devon P Ryan¹, Thomas Manke¹.

Abstract

SUMMARY: Due to the rapidly increasing scale and diversity of epigenomic data, modular and scalable analysis workflows are of wide interest. Here we present snakePipes, a workflow package for processing and downstream analysis of data from common epigenomic assays: ChIP-seq, RNA-seq, Bisulfite-seq, ATAC-seq, Hi-C and single-cell RNA-seq. snakePipes enables users to assemble variants of each workflow and to easily install and upgrade the underlying tools, via its simple command-line wrappers and YAML files.
AVAILABILITY AND IMPLEMENTATION: snakePipes can be installed via conda: 'conda install -c mpi-ie -c bioconda -c conda-forge snakePipes'. Source code (https://github.com/maxplanck-ie/snakepipes) and documentation (https://snakepipes.readthedocs.io/en/latest/) are available online.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 31134269 PMCID: PMC6853707 DOI: 10.1093/bioinformatics/btz436

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The decreasing price of sequencing and increasing multiplexing ability have allowed researchers to easily produce large datasets. To understand genetic and epigenetic regulation, researchers routinely perform multiple assays, such as RNA-seq and Bisulfite-seq, in the same project, necessitating scalable data processing workflows. Since exploratory studies demand more flexibility in data processing, and standards evolve rapidly, conventional rigid pipelines quickly become outdated. Computational frameworks such as Galaxy (Goecks et al., 2010), Nextflow (Di Tommaso et al., 2017) and snakemake (Köster and Rahmann, 2012) address this issue to some extent by allowing users to create their own workflows or adopt workflows from public repositories. However, these frameworks are still challenging for novice users, as they require training in a specific programming language or syntax and leave users to assemble workflows themselves. This leads to a conundrum: how can we offer novice users the flexibility of assembling and upgrading analysis workflows, while still keeping the workflows scalable and reproducible?

We developed snakePipes to address this issue. snakePipes provides a set of best-practice workflows for processing, quality control and downstream analysis of data from the most common assays used in epigenomic studies: ChIP-seq, RNA-seq, whole-genome bisulfite-seq (WGBS), ATAC-seq, Hi-C and single-cell RNA-seq (Supplementary Fig. S1a; Supplementary Methods). However, unlike conventional pipelines, workflows in snakePipes are based on a repository of modular rules, such that multiple variations of each workflow can be assembled on-the-fly by changing the parameters of their command-line wrappers. This approach allows novice users to perform exploratory analysis in a reproducible way without manually assembling workflows.
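The idea of assembling workflow variants from a shared repository of modular rules can be illustrated with a minimal Python sketch. This is not snakePipes code: the rule names, the `RULE_REPOSITORY` dictionary and the `assemble_workflow` function are hypothetical and serve only to show how a few switches can select different combinations of steps from one shared set.

```python
# Hypothetical rule repository: each entry is a named processing step.
# The names are illustrative only and are not snakePipes' actual rules.
RULE_REPOSITORY = {
    "trim": "adapter and quality trimming",
    "align": "read mapping to the reference genome",
    "dedup": "duplicate marking",
    "peaks": "peak calling",
    "qc_report": "quality-control report",
}


def assemble_workflow(trim=False, dedup=True, call_peaks=True):
    """Return the ordered list of rule names for one workflow variant."""
    steps = []
    if trim:
        steps.append("trim")
    steps.append("align")  # mapping is always part of this toy workflow
    if dedup:
        steps.append("dedup")
    if call_peaks:
        steps.append("peaks")
    steps.append("qc_report")  # every variant ends with a QC report
    # Fail early if a requested step is missing from the repository.
    missing = [step for step in steps if step not in RULE_REPOSITORY]
    if missing:
        raise KeyError(f"unknown rules: {missing}")
    return steps


if __name__ == "__main__":
    print(assemble_workflow())
    print(assemble_workflow(trim=True, call_peaks=False))
```

In this sketch the keyword arguments stand in for command-line parameters; every variant draws on the same rule definitions, so a change to one rule applies to all workflows that use it.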

2 Implementation

snakePipes employs snakemake (Köster and Rahmann, 2012) as its core workflow language, which benefits from easily readable code, widespread adoption and scalability to most clusters and cloud platforms. snakePipes also makes use of conda environments and the bioconda platform (Grüning et al., 2018), which allow hassle-free installation and upgrade of known-compatible and known-functional tools (Fig. 1a; Supplementary Methods). Conda environments remove the need to manage tools manually or to have administrator permissions.
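The configuration scheme referred to in Fig. 1a, in which defaults set during installation can be overridden by a user-supplied YAML file at run time, can be sketched as a recursive dictionary merge. This is a minimal sketch under stated assumptions: the key names and values are hypothetical, plain dictionaries stand in for parsed YAML, and the recursive merge is only one plausible override semantics, not necessarily the one snakePipes implements.

```python
import copy


def merge_config(defaults, overrides):
    """Overlay `overrides` on `defaults` recursively, mutating neither."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested sections are merged key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and new keys simply replace or extend the defaults.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults, as they might be parsed from a setup-time YAML file.
defaults = {"genome": "mm10", "aligner": {"name": "bowtie2", "threads": 8}}
# Hypothetical run-time overrides from a second YAML file.
user = {"aligner": {"threads": 16}}

if __name__ == "__main__":
    print(merge_config(defaults, user))
```

Only the keys present in the run-time file change; all other defaults are carried over, which keeps per-run configuration files short.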

Fig. 1.

Setup, execution and results from snakePipes. (a) All configurable parameters for snakePipes are defined as YAML files during setup. However, most parameters can be overwritten during execution by providing another YAML file, adding flexibility to the analysis. (b) Output of HiC (track 1), WGBS (track 2), ATAC-seq (track 3), allele-specific ChIP-seq (tracks 3–7) and RNA-seq (tracks 8–9) workflows, plotted using pyGenomeTracks (Ramírez et al., 2018).

snakePipes’ modular architecture allows various tools and resources to be shared between workflows, simplifying data integration since data from multiple assays are processed using identical tool versions. Genome annotations and indices are shared by all workflows, and can also be generated directly via snakePipes, facilitating easy setup as well as integrative analysis. Finally, all workflows in snakePipes calculate extensive quality control metrics and produce reports using MultiQC (Ewels et al., 2016) and R that inform the user of processing and analysis results.

Apart from conventional processing steps such as mapping and peak calling, workflows in snakePipes also include various downstream analyses. All workflows (except the scRNA-seq workflow) optionally accept a tab-separated sample information file that can be used to define groups of samples. This allows comparative analysis, such as differential expression (RNA-seq), differential peak calling (ChIP-seq), differential accessibility (ATAC-seq) and differential methylation (WGBS). Complex design formulas are supported using additional columns of the sample sheet. The HiC workflow uses sample information to merge groups and can perform TAD calling with parameters adapted to the resolution of the produced matrix [using HiCExplorer (Ramírez et al., 2018)]. Most workflows also allow allele-specific processing of data via SNPsplit (Krueger and Andrews, 2016), where a single or dual-hybrid genome can be created on-the-fly using the ‘allelic-mapping’ mode and a Variant Call Format file (Danecek et al., 2011). Further downstream analysis, such as allele-specific differential expression, can be performed automatically. This preliminary analysis, combined with visualization-ready BED and bigWig files, allows users to quickly interpret their data (Fig. 1b).

Our comparison with other recently released workflows and pipelines suggests that snakePipes offers the most extensive processing and analysis options under a single package. Further, it compares well with the available alternatives in terms of installation, ease of use and scalability (Supplementary Table S1).
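The role of the tab-separated sample information file can be illustrated with a short Python sketch that reads such a file and collects samples into groups for a comparative analysis. The column names `name` and `condition`, the sample names and the `read_groups` function are assumptions made for this illustration; the snakePipes documentation should be consulted for the exact format expected by each workflow.

```python
import csv
import io
from collections import defaultdict


def read_groups(handle, name_col="name", group_col="condition"):
    """Map each group label to the list of sample names assigned to it."""
    reader = csv.DictReader(handle, delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row[group_col]].append(row[name_col])
    return dict(groups)


# Toy sample sheet with assumed column names; a real file would be opened
# from disk instead of being wrapped in io.StringIO.
SHEET = (
    "name\tcondition\n"
    "sample1\tcontrol\n"
    "sample2\tcontrol\n"
    "sample3\ttreatment\n"
    "sample4\ttreatment\n"
)

if __name__ == "__main__":
    print(read_groups(io.StringIO(SHEET)))
```

Additional columns, such as a batch label, could be read in the same way and passed on to a design formula.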

3 Application

To demonstrate how snakePipes can simplify analysis of data from multiple epigenomic assays, we processed data from a study of the mammalian X-chromosome (Wang et al., 2018). The knock-out of Smchd1 in mouse neural progenitor cells affects X-chromosome organization and leads to a loss of H3K27me3 domains and a gain of H3K4me3, along with de-repression of genes on the inactive X-chromosome. These changes are apparent directly from the snakePipes output (Fig. 1b; Supplementary Fig. S1b and c). We further combined these results with those obtained from publicly available ATAC-seq (Giorgetti et al.) and WGBS data (GSE101090) processed via snakePipes, and found that these de-repressed genes have a higher open chromatin signal than the downregulated or unchanged genes (Supplementary Fig. S1d). These genes also show a methylation status similar to that of the downregulated genes but lower than that of the unchanged genes (Supplementary Fig. S1e), corroborating previous (Schübeler, 2015) and recent (Lea et al., 2018) links between promoter CpG methylation and gene repression.

4 Conclusion

In summary, snakePipes simplifies the analysis of large-scale epigenomic studies by allowing fast and reproducible processing of data from several assays. While further downstream analysis would still be required to integrate the results depending upon biological questions, snakePipes’ outputs allow biologists to quickly interpret and understand their results, facilitating integrative analysis.

References (12 in total)

1. Genome-wide quantification of the effects of DNA methylation on human gene regulation.

Authors: Amanda J Lea; Christopher M Vockley; Rachel A Johnston; Christina A Del Carpio; Luis B Barreiro; Timothy E Reddy; Jenny Tung
Journal: Elife Date: 2018-12-21 Impact factor: 8.140

2. Nextflow enables reproducible computational workflows.

Authors: Paolo Di Tommaso; Maria Chatzou; Evan W Floden; Pablo Prieto Barja; Emilio Palumbo; Cedric Notredame
Journal: Nat Biotechnol Date: 2017-04-11 Impact factor: 54.908

3. SMCHD1 Merges Chromosome Compartments and Assists Formation of Super-Structures on the Inactive X.

Authors: Chen-Yu Wang; Teddy Jégu; Hsueh-Ping Chu; Hyun Jung Oh; Jeannie T Lee
Journal: Cell Date: 2018-06-07 Impact factor: 41.582

4. Bioconda: sustainable and comprehensive software distribution for the life sciences.

Authors: Björn Grüning; Ryan Dale; Andreas Sjödin; Brad A Chapman; Jillian Rowe; Christopher H Tomkins-Tinch; Renan Valieris; Johannes Köster
Journal: Nat Methods Date: 2018-07 Impact factor: 28.547

5. Snakemake--a scalable bioinformatics workflow engine.

Authors: Johannes Köster; Sven Rahmann
Journal: Bioinformatics Date: 2012-08-20 Impact factor: 6.937

6. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.

Authors: Jeremy Goecks; Anton Nekrutenko; James Taylor
Journal: Genome Biol Date: 2010-08-25 Impact factor: 13.583

7. The variant call format and VCFtools.

Authors: Petr Danecek; Adam Auton; Goncalo Abecasis; Cornelis A Albers; Eric Banks; Mark A DePristo; Robert E Handsaker; Gerton Lunter; Gabor T Marth; Stephen T Sherry; Gilean McVean; Richard Durbin
Journal: Bioinformatics Date: 2011-06-07 Impact factor: 6.937

8. MultiQC: summarize analysis results for multiple tools and samples in a single report.

Authors: Philip Ewels; Måns Magnusson; Sverker Lundin; Max Käller
Journal: Bioinformatics Date: 2016-06-16 Impact factor: 6.937

9. SNPsplit: Allele-specific splitting of alignments between genomes with known SNP genotypes.

Authors: Felix Krueger; Simon R Andrews
Journal: F1000Res Date: 2016-06-23

10. High-resolution TADs reveal DNA sequences underlying genome organization in flies.

Authors: Fidel Ramírez; Vivek Bhardwaj; Laura Arrigoni; Kin Chung Lam; Björn A Grüning; José Villaveces; Bianca Habermann; Asifa Akhtar; Thomas Manke
Journal: Nat Commun Date: 2018-01-15 Impact factor: 14.919
