Literature DB >> 27547532

fluff: exploratory analysis and visualization of high-throughput sequencing data.

Georgios Georgiou¹, Simon J van Heeringen¹.

Abstract

In this article we describe fluff, a software package that allows for simple exploration, clustering and visualization of high-throughput sequencing data mapped to a reference genome. The package contains three command-line tools to generate publication-quality figures in an uncomplicated manner using sensible defaults. Genome-wide data can be aggregated, clustered and visualized in a heatmap, according to different clustering methods. This includes a predefined setting to identify dynamic clusters between different conditions or developmental stages. Alternatively, clustered data can be visualized in a bandplot. Finally, fluff includes a tool to generate genomic profiles. As command-line tools, the fluff programs can easily be integrated into standard analysis pipelines. The installation is straightforward and documentation is available at http://fluff.readthedocs.org. Availability. fluff is implemented in Python and runs on Linux. The source code is freely available for download at https://github.com/simonvh/fluff.

Entities: CellLine Chemical Disease Gene Species

Keywords: ChIP-seq; Clustering; High-throughput sequencing; Next-generation sequencing; Python; Visualization

Year: 2016 PMID： 27547532 PMCID： PMC4957989 DOI： 10.7717/peerj.2209

Source DB: PubMed Journal: PeerJ ISSN： 2167-8359 Impact factor: 2.984

Introduction

The advances in sequencing technology and the reduction of costs have led to a rapid increase of High-Throughput Sequencing (HTS) data. Applications include chromatin immunoprecipitation followed by high-throughput deep sequencing (ChIP-seq; Robertson et al., 2007) to determine the genomic location of DNA-associated proteins, chromatin accessibility assays (Buenrostro et al., 2013; Hesselberth et al., 2009) and bisulfite sequencing to assay DNA methylation (Lister et al., 2009). The integration of these diverse data allow identification of the epigenomic state, for instance in different tissues (Martens & Stunnenberg, 2013; Roadmap Epigenomics Consortium et al., 2015) or during development (Hontelez et al., 2015). However, the scale and complexity of these datasets call for the use of computational methods that facilitate data exploration and visualization. Various options exist to explore and visualize HTS data mapped to a reference genome, for instance in aggregated form such as heatmaps and average profiles. These include general purpose modules for specific programming languages (Huber et al., 2015), dedicated HTS modules (Dale, Matzat & Lei, 2014; Statham et al., 2010; Akalin et al., 2015), command-line tools (Shen et al., 2014; Giannopoulou & Elemento, 2011), web tools (Ramírez et al., 2014), stand-alone applications (Ramírez et al., 2014; Ye et al., 2011) and tools that depend on other software for visualization (Heinz et al., 2010). Here, we present fluff, a Python package for visual, reference-based HTS data exploration. It includes command-line applications to both cluster and visualize aggregated signals in genomic regions, as well as to create genome browser-like profiles. The scripts can be included in analysis pipelines and accept commonly used file formats. The fluff applications are pitched at the beginner to intermediate user. They have sensible defaults, yet allow for customizable creation of high-quality, publication-ready figures.

Methods

General

Detailed documentation, including tutorials, is available at http://fluff.readthedocs.org. Fluff is implemented in Python, and uses several previously published modules (Brewer, 2016; Anders, Pyl & Huber, 2015; Dale, Pedersen & Quinlan, 2011; Quinlan & Hall, 2010; Li et al., 2009; De Hoon et al., 2004, see Supplemental Information). All fluff tools support indexed BAM, bigWig or (tabix-indexed) BED, WIG or bedGraph files as input. A large selection of major image formats are supported as output. The fluff tools were developed to explore ChIP-seq data, however, they will work with any type of data where (spliced) reads can be mapped to a genomic reference. For instance DNA methylation profiles from bisulfite-sequencing or RNA-seq data (Fig. S1) can also be visualized.

Normalization

Normalization of sequencing data is critical for downstream analysis and various methods have been proposed (see for instance Angelini et al., 2015 and Bailey et al., 2013 for an overview of ChIP-seq normalization methods). For visualization, the most important factor is the sequencing read depth. Therefore fluff has the option to normalize to the total number of mapped reads. Alternatively, averaged signal files such as bigWig tracks that are processed or normalized by a different method can be used as input.

Program descriptions

Heatmaps

Visualization of HTS data as heatmaps, where rows represent different genomic regions, can highlight important aspects of the data, like differential enrichment or positional patterns for specific groups of features. In addition, it allows for comparison between multiple regions within the same or between different experiments. The fluff heatmap tool visualizes HTS data on basis of a list of genomic coordinates. The data can optionally be clustered using either k-means or hierarchical clustering. For clustering, the read counts in the bins are normalized to the 75 percentile. The distance can be calculated using either the Euclidean distance or Pearson correlation similarity. If the regions in the input file are not strand-specific, different clusters might represent the same strand-specific profile in two different orientations. Clusters that are mirrored relative to the center can optionally be merged. Here, the similarity is based on the chi-squared p-value of the mean profile per cluster. One important use case for clustering is the ability to identify dynamic patterns, for instance during different time points or conditions. For this purpose, clustering on the binned signal is not ideal. Therefore, fluff heatmap provides the option to cluster genomic regions based on a single value derived from the number of reads in the feature centers (+∕ − 1 kb). In combination with the Pearson correlation metric, this allows for efficient retrieval of dynamic clusters. The difference is illustrated in Fig. 2.

Figure 2

Example of the output of fluff heatmap using standard clustering compared to using the dynamics option.

Bandplots

In heatmaps, more subtle patterns can be difficult to detect, as the dynamic range of signal intensities is not well-reflected in the color scale. Therefore, as an alternative to a heatmap, fluff bandplot plots the average profiles in small multiples (Shoresh & Wong, 2012). Here, the spatial encoding of the signal allows for more accurate comparison of values (Gehlenborg, Nils & Bang, 2012). The median enrichment is visualized as a black line with the 50th and 90th percentile as a dark and light colour respectively.

Profiles

Genome browsers are unrivaled for data exploration and visualization in a genomic context. However, it can be useful to create profiles of HTS data in genomic intervals using a consistent command-line tool, that can optionally be automated. The fluff profile tool can plot summarized profiles from one or more profiles, together with (gene) annotation from a BED12-formatted file.

Analysis

In short, FASTQ files were download from NCBI GEO (Edgar, Domrachev & Lash, 2002) and mapped to the human genome (hg19) using bwa (Li & Durbin, 2009). Duplicate reads were marked using bamUtil (http://genome.sph.umich.edu/wiki/BamUtil). All BAM files from replicate experiments were merged. Peaks were called using MACS2 (Zhang et al., 2008) with default settings. See the Supplemental Information for specific details and accession numbers.

Results

Demonstrating fluff: dynamic enhancers during macrophage differentiation

To illustrate the functionality of fluff we visualized previously published ChIP-seq data (Saeed et al., 2014). Here, the epigenomes of human monocytes and in vitro-differentiated naïve, tolerized, and trained macrophages were analyzed, with the aim to understand the epigenetic basis of innate immunity. Circulating monocytes (Mo) were differentiated into three macrophages states: to macrophages (Mf), to long-term tolerant cells (LPS-Mf) by exposition to lipopolysaccharide and to trained immune cells (BG-Mf) by priming with β-glucan. We used fluff heatmap to cluster and visualize the signal of histone 3 lysine 27 acetylation (H3K27ac), which is located at active enhancers and promoters (Fig. 1A). The input consisted of a BED file with 7,611 differentially regulated enhancers (Table S1) and four BAM files, for each of the monocytes and three types of macrophages. Using k-means clustering (k = 5) with the Pearson correlation metric, the heatmap recapitulates the H3K27ac dynamics as described (Saeed et al., 2014).

Figure 1

An example of the fluff output.

All panels were generated by the fluff command-line tools and were not post-processed or edited. (A) Heatmap showing the results of k-means clustering (k = 5, metric = Pearson) of dynamic H3K27ac regions in monocytes (Mo), naïve macrophages (Mf), tolerized (LPS-Mf) and trained cells (BG-Mf) (Saeed et al., 2014). ChIP-seq read counts are visualized in 100-bp bins in 24-kb regions. (B) Bandplot showing the average profile (median: black, 50%: dark color, 90%: light color) of the clusters as identified in Fig. 1A. (C) The H3K27ac ChIP-seq profiles at the CNRIP1 gene locus, which shows a gain of H3K27ac in Mf, LPS-Mf and BG-Mf relative to Mo.

An example of the fluff output.

Example of the output of fluff heatmap using standard clustering compared to using the dynamics option.

Shown are the H3K27ac ChIP-seq read counts in 100 bp bins in 20 kb around the DNaseI peak summit in human H1 ES cell-derived cells. (A) Heatmap showing the results of k-means clustering of all bins (k = 7, metric = Euclidean) (B) Heatmap showing the results of k-means clustering in 2 kb regions centered at the peak summit (k = 7, metric = Pearson). While heatmaps are often used for visualization of signals over genomic features, either clustered or ordered by signal intensity, it can be difficult to distinguish relative levels of individual clusters. Figure 1B shows an alternative visualization of average enrichment profiles in small multiples. The same clusters as in Fig. 1A are plotted using fluff bandplot. Shown are the median (black line), along with the 50th (darker color) and 90th percentile (lighter color) of the data. This allows for more detailed comparisons. Finally, we illustrate fluff profile, which can visualize one or more genomic regions (Fig. 1C). This figure highlights the CNRIP1 gene from cluster 2, which shows a consistent increase of H3K27ac from Mo to Mf, LPS-Mf and BG-Mf. The signal profiles are directly generated from the BAM files.

Identification and visualization of dynamic patterns

Most applications that cluster HTS data for heatmap visualization use a binning approach, followed by clustering using the Euclidean distance. The implicit effect is that the bins are clustered on basis of the spatial patterns relative to the region of interest. Often, this is the desired result, for instance when clustering the ChIP-seq enrichment patterns of different histone modifications at the transcription start sites of genes. However, for other analyses this clustering approach does not suffice. An example could be the ChIP-seq profiles of specific histone modifications correlated to the activity of a regulatory element, such as H3K4me3 at promoters or H3K27ac at enhancers. In this case, a relevant objective is to identify the clusters associated with differential activation dynamics. As illustration, we visualized the H3K27ac enrichment profile at DNaseI hypersensitive sites in human embryonic stem (ES) cells differentiated into different lineages (Xie et al., 2013). Here, H1 ES cells were differentiated into mesendoderm, neural progenitor cells, trophoblast-like cells, and mesenchymal stem cells. We first clustered the H3K27ac profiles at regulatory elements on chromosome 1 using the standard approach, based on comparing all the bins using the Euclidean distance metric (Fig. 2A). Here, we identify two clusters with high enrichment (cluster 3 and cluster 5), a cluster with relatively low, narrow enrichment (cluster 1), and two clusters with broad enhancer domains (cluster 4 and 6). However, only two strong dynamic clusters are identified, cluster 2, which shows enhancers specifically activated in mesenchymal stem cells and cluster 7 which shows enhancers specifically activated in trophoblast-like stem cells. Figure 2B shows an alternative clustering approach implemented in fluff heatmap. Here the regions were clustered on basis of the Pearson correlation of read counts in the center of the region (extended to 2 kb). This shows a completely different picture and we now can identify enhancers specific to H1 ES cells (cluster 5), mesenchymal (cluster 4), mesendoderm (cluster 7), neuronal progenitor (cluster 3) and trophoblast cells (cluster 6). These lineage-specific enhancer dynamics were not visible in the clustering in Fig. 2A.

Conclusion

The analysis of multi-dimensional genomic data requires methods for data exploration and visualization. We provide fluff, a Python package that contains several command-line tools to generate figures for use in high-throughput sequencing analysis workflows. We aim to fill the gap between powerful, flexible libraries that require programming skills on the one hand, and intuitive, graphical programs with limited customization possibilities on the other hand. These tools were developed based on a need for straightforward analysis and visualization of ChIP-seq data and have been successfully applied in a variety of projects (Menafra et al., 2014; Van den Boom et al., 2016; Kouwenhoven et al., 2015). In conclusion, fluff helps to interpret genome-wide experiments by efficient visualization of sequencing data. Click here for additional data file. Click here for additional data file.

33 in total

1. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository.

Authors: Ron Edgar; Michael Domrachev; Alex E Lash
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

2. Open source clustering software.

Authors: M J L de Hoon; S Imoto; J Nolan; S Miyano
Journal: Bioinformatics Date: 2004-02-10 Impact factor: 6.937

3. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

4. Repitools: an R package for the analysis of enrichment-based epigenomic data.

Authors: Aaron L Statham; Dario Strbenac; Marcel W Coolen; Clare Stirzaker; Susan J Clark; Mark D Robinson
Journal: Bioinformatics Date: 2010-05-10 Impact factor: 6.937

5. BEDTools: a flexible suite of utilities for comparing genomic features.

Authors: Aaron R Quinlan; Ira M Hall
Journal: Bioinformatics Date: 2010-01-28 Impact factor: 6.937

6. Human DNA methylomes at base resolution show widespread epigenomic differences.

Authors: Ryan Lister; Mattia Pelizzola; Robert H Dowen; R David Hawkins; Gary Hon; Julian Tonti-Filippini; Joseph R Nery; Leonard Lee; Zhen Ye; Que-Minh Ngo; Lee Edsall; Jessica Antosiewicz-Bourget; Ron Stewart; Victor Ruotti; A Harvey Millar; James A Thomson; Bing Ren; Joseph R Ecker
Journal: Nature Date: 2009-10-14 Impact factor: 49.962

7. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing.

Authors: Gordon Robertson; Martin Hirst; Matthew Bainbridge; Misha Bilenky; Yongjun Zhao; Thomas Zeng; Ghia Euskirchen; Bridget Bernier; Richard Varhol; Allen Delaney; Nina Thiessen; Obi L Griffith; Ann He; Marco Marra; Michael Snyder; Steven Jones
Journal: Nat Methods Date: 2007-06-11 Impact factor: 28.547

8. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937

9. Global mapping of protein-DNA interactions in vivo by digital genomic footprinting.

Authors: Jay R Hesselberth; Xiaoyu Chen; Zhihong Zhang; Peter J Sabo; Richard Sandstrom; Alex P Reynolds; Robert E Thurman; Shane Neph; Michael S Kuehn; William S Noble; Stanley Fields; John A Stamatoyannopoulos
Journal: Nat Methods Date: 2009-03-22 Impact factor: 28.547

10. Model-based analysis of ChIP-Seq (MACS).

Authors: Yong Zhang; Tao Liu; Clifford A Meyer; Jérôme Eeckhoute; David S Johnson; Bradley E Bernstein; Chad Nusbaum; Richard M Myers; Myles Brown; Wei Li; X Shirley Liu
Journal: Genome Biol Date: 2008-09-17 Impact factor: 13.583

20 in total

1. ZBTB2 reads unmethylated CpG island promoters and regulates embryonic stem cell differentiation.

Authors: Ino D Karemaker; Michiel Vermeulen
Journal: EMBO Rep Date: 2018-02-01 Impact factor: 8.807

2. In Vivo Genome-Wide PGR Binding in Pregnant Human Myometrium Identifies Potential Regulators of Labor.

Authors: Ariel J Dotts; Derek Reiman; Ping Yin; Stacy Kujawa; William A Grobman; Yang Dai; Serdar E Bulun
Journal: Reprod Sci Date: 2022-06-22 Impact factor: 3.060

3. Screening thousands of transcribed coding and non-coding regions reveals sequence determinants of RNA polymerase II elongation potential.

Authors: Hanneke Vlaming; Claudia A Mimoso; Andrew R Field; Benjamin J E Martin; Karen Adelman
Journal: Nat Struct Mol Biol Date: 2022-06-09 Impact factor: 18.361

4. Direct targeted therapy for MLL-fusion-driven high-risk acute leukaemias.

Authors: Sandra Cantilena; Luca Gasparoli; Deepali Pal; Olaf Heidenreich; Jan-Henning Klusmann; Joost H A Martens; Alexandre Faille; Alan J Warren; Mawar Karsa; Ruby Pandher; Klaartje Somers; Owen Williams; Jasper de Boer
Journal: Clin Transl Med Date: 2022-06

5. p63 Directs Subtype-Specific Gene Expression in HPV+ Head and Neck Squamous Cell Carcinoma.

Authors: Alexandra Ruth Glathar; Akinsola Oyelakin; Christian Gluck; Jonathan Bard; Satrajit Sinha
Journal: Front Oncol Date: 2022-05-31 Impact factor: 5.738

6. Selective Requirement of MYB for Oncogenic Hyperactivation of a Translocated Enhancer in Leukemia.

Authors: Johannes Zuber; Ruud Delwel; Leonie Smeenk; Sophie Ottema; Roger Mulet-Lazaro; Anja Ebert; Marije Havermans; Andrea Arricibita Varea; Michaela Fellner; Dorien Pastoors; Stanley van Herk; Claudia Erpelinck-Verschueren; Tim Grob; Remco M Hoogenboezem; François G Kavelaars; Daniel R Matson; Emery H Bresnick; Eric M Bindels; Alex Kentsis
Journal: Cancer Discov Date: 2021-05-12 Impact factor: 39.397

7. Interpretation of allele-specific chromatin accessibility using cell state-aware deep learning.

Authors: Zeynep Kalender Atak; Ibrahim Ihsan Taskiran; Jonas Demeulemeester; Christopher Flerin; David Mauduit; Liesbeth Minnoye; Gert Hulselmans; Valerie Christiaens; Ghanem-Elias Ghanem; Jasper Wouters; Stein Aerts
Journal: Genome Res Date: 2021-04-08 Impact factor: 9.043

8. Molecular dissection of the oncogenic role of ETS1 in the mesenchymal subtypes of head and neck squamous cell carcinoma.

Authors: Christian Gluck; Alexandra Glathar; Maria Tsompana; Norma Nowak; Lee Ann Garrett-Sinha; Michael J Buck; Satrajit Sinha
Journal: PLoS Genet Date: 2019-07-15 Impact factor: 5.917

9. Chromatin Accessibility-Based Characterization of the Gene Regulatory Network Underlying Plasmodium falciparum Blood-Stage Development.

Authors: Christa Geeke Toenhake; Sabine Anne-Kristin Fraschka; Mahalingam Shanmugiah Vijayabaskar; David Robert Westhead; Simon Jan van Heeringen; Richárd Bártfai
Journal: Cell Host Microbe Date: 2018-04-11 Impact factor: 21.023

10. Regulatory remodeling in the allo-tetraploid frog Xenopus laevis.

Authors: Dei M Elurbe; Sarita S Paranjpe; Georgios Georgiou; Ila van Kruijsbergen; Ozren Bogdanovic; Romain Gibeaux; Rebecca Heald; Ryan Lister; Martijn A Huynen; Simon J van Heeringen; Gert Jan C Veenstra
Journal: Genome Biol Date: 2017-10-24 Impact factor: 13.583