Heat*seq: an interactive web tool for high-throughput sequencing experiment comparison with public data.

Guillaume Devailly¹, Anna Mantsoki¹, Anagha Joshi¹.

Abstract

Better protocols and decreasing costs have made high-throughput sequencing experiments accessible even to small experimental laboratories. However, comparing one or a few experiments generated by an individual lab to the vast amount of relevant data freely available in the public domain can be limited by a lack of bioinformatics expertise. Though several tools, including genome browsers, allow such comparison at the single-gene level, they do not provide a genome-wide view. We developed Heat*seq, a web tool that allows genome-scale comparison of high-throughput experiments (chromatin immuno-precipitation followed by sequencing, RNA-sequencing and Cap Analysis of Gene Expression) provided by a user to the data in the public domain. Heat*seq currently contains over 12 000 experiments across diverse tissues and cell types in human, mouse and drosophila. Heat*seq displays interactive correlation heatmaps, with the ability to dynamically subset datasets to contextualize user experiments. High-quality figures and tables are produced and can be downloaded in multiple formats.
AVAILABILITY AND IMPLEMENTATION: Web application: http://www.heatstarseq.roslin.ed.ac.uk/ Source code: https://github.com/gdevailly
CONTACT: Guillaume.Devailly@roslin.ed.ac.uk or Anagha.Joshi@roslin.ed.ac.uk
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2016 PMID: 27378302 PMCID: PMC5079476 DOI: 10.1093/bioinformatics/btw407

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

High-throughput sequencing is becoming routine for many biological assays, including transcriptome analysis through RNA-sequencing (RNA-seq) and transcription factor (TF) binding site identification through chromatin immuno-precipitation followed by sequencing (ChIP-seq). Additionally, collaborative projects such as Bgee (Bastian et al.), ENCODE (Bernstein ) and Roadmap Epigenomics (Kundaje ) have generated genome-wide datasets across hundreds of cell types or tissues. Although these large datasets are freely available in the public domain, the lack of computational tools accessible to experimental scientists with little or no computational expertise prevents the data from being used to their full potential for discovery. Genome browsers, including the summary tracks provided by many consortia, are extremely useful for studying a few genes, promoters or single nucleotide polymorphisms, but they lack a genome-wide overview. Only a few public resources, such as the CODEX database (Sánchez-Castillo ) and the BLUEPRINT GenomeStats tool (Zerbino ), allow a genome-wide comparison with user data. We therefore developed Heat*seq, a free, open source web application providing fast and interactive comparison against high-throughput sequencing experiments in the public domain. Users can upload a processed text file containing either gene expression values (Fragments Per Kilobase of transcript per Million (FPKM) or Tags Per Million (TPM)), peak coordinates, or peak coordinates with corresponding expression values for Cap Analysis of Gene Expression (CAGE). The application provides clustered correlation heatmaps summarising global similarities between all samples in the dataset and the user sample. Heat*seq provides over 12 000 publicly available genome-wide experiments in human, mouse and drosophila for fast and interactive comparison.
In summary, Heat*seq is an interactive web tool that allows users to contextualize their sequencing data with respect to vast amounts of public data in a few minutes without requiring any programming skills.

2 Methods

2.1 Data collection

We collected gene expression data (RNA-seq), TF ChIP-seq data and CAGE data (over 4000 individual experiments) from Bgee (Bastian ), Blueprint epigenome (Pradel ), CODEX (Sánchez-Castillo et al., 2015b), ENCODE (Bernstein ), FANTOM5 (Forrest ), FlyBase (Attrill ), GTEx (Lonsdale ), modENCODE (Celniker ) and Roadmap Epigenomics (Bernstein ), in human, mouse and drosophila. Data formatting was done using R (R scripts are available on GitHub). The source for each dataset is listed in Supplementary Table S1. Heatmaps represent Pearson's correlation values between experiments, calculated from one of three matrices: a Genes × Experiments numeric matrix of gene expression values (log scaled) for expression data, a Genomic regions × Experiments binary matrix indicating presence or absence of a peak for TF ChIP-seq data, and a Genomic regions × Experiments numeric matrix of expression values (log scaled) for CAGE data. Importantly, we constructed a metadata table that provides a web link to the original data and allows users to sub-select experiments within each dataset.
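
As an illustration of the correlation step described above, the following Python sketch computes Pearson's correlation between two experiments, once for log-scaled expression values and once for binary peak-presence vectors. It is not the authors' R code: the function names, the `log10(x + 1)` pseudocount, the half-open interval convention and the toy coordinates are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_x * var_y)

def log_scale(values, pseudocount=1.0):
    """Log-scale expression values (the pseudocount is an assumption to handle zeros)."""
    return [math.log10(v + pseudocount) for v in values]

def peak_presence(regions, peaks):
    """Binary vector with 1 where any peak overlaps a reference region, else 0.

    Regions and peaks are (chrom, start, end) tuples, treated as half-open intervals.
    """
    return [
        1 if any(c == chrom and s < end and e > start for c, s, e in peaks) else 0
        for chrom, start, end in regions
    ]

# Expression data: one column of a Genes x Experiments matrix per experiment.
r_expr = pearson(log_scale([0, 9, 99]), log_scale([0, 99, 9999]))

# ChIP-seq data: one column of a Genomic regions x Experiments binary matrix.
regions = [("chr1", 0, 100), ("chr1", 100, 200), ("chr2", 0, 100), ("chr2", 100, 200)]
col_a = peak_presence(regions, [("chr1", 50, 120), ("chr2", 150, 180)])
col_b = peak_presence(regions, [("chr1", 10, 20), ("chr2", 110, 120)])
r_peaks = pearson(col_a, col_b)
```

On binary vectors, Pearson's coefficient is equivalent to the phi coefficient, so the same function serves both expression and peak data.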

2.2 Web-application development

Heat*seq is an open source interactive tool built with R Shiny that computes correlation values between the user file and each experiment in a dataset. Detailed user instructions are available on the application website.
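
The comparison of a user file against each experiment in a dataset can be sketched as follows. This is a minimal Python illustration rather than the application's R Shiny code. The dictionary layout, the restriction to shared genes, the gene and experiment names and the assumption that values are already log scaled are all hypothetical choices.

```python
import math

def pearson(x, y):
    """Pearson's correlation; returns NaN if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else float("nan")

def correlate_user_sample(user, dataset):
    """Correlate a user profile with every experiment in a dataset.

    user:    {gene: value}
    dataset: {experiment: {gene: value}}
    Returns (experiment, r) pairs sorted by decreasing correlation,
    computed on the genes shared by both profiles.
    """
    results = []
    for name, profile in dataset.items():
        shared = sorted(set(user) & set(profile))
        if len(shared) < 2:
            continue  # not enough shared genes to compute a correlation
        r = pearson([user[g] for g in shared], [profile[g] for g in shared])
        if not math.isnan(r):
            results.append((name, r))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results

# Toy, already log-scaled values; all names are made up for illustration.
user = {"GeneA": 5.0, "GeneB": 1.0, "GeneC": 0.2, "GeneD": 3.0}
dataset = {
    "brain_1": {"GeneA": 4.8, "GeneB": 1.2, "GeneC": 0.1, "GeneD": 2.9},
    "liver_1": {"GeneA": 0.3, "GeneB": 4.0, "GeneC": 2.5, "GeneD": 0.5},
}
ranked = correlate_user_sample(user, dataset)
```

Sorting by decreasing correlation puts the most similar public experiments first, which is the ordering a quality-control check on a new sample would inspect.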

3 Results

3.1 Application description

The Heat*seq tool supports three data types, each with its own application: HeatRNAseq, HeatChIPseq and HeatCAGEseq. Data upload, correlation calculation and heatmap generation take about a minute. Importantly, users can interactively sub-select relevant experiments using the metadata information (e.g. cell type, TF name). The interactive heatmap also allows users to select different clustering methods and to zoom in and out. High-resolution figures and tables can be downloaded in multiple formats. Thus, Heat*seq provides a global overview of relationships between public experiments and the user data. Four user scenarios are discussed below.
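
Metadata-based sub-selection of this kind amounts to filtering the experiment list before the heatmap is drawn. The Python sketch below shows one way to express it. The metadata field names (`tf`, `cell_type`), the experiment identifiers and the correlation values are hypothetical and do not reflect the application's actual metadata table.

```python
def subselect(metadata, **criteria):
    """Return the experiments whose metadata match every given criterion."""
    return [
        name
        for name, fields in metadata.items()
        if all(fields.get(key) == value for key, value in criteria.items())
    ]

def submatrix(corr, names):
    """Restrict a nested-dict correlation matrix to the selected experiments."""
    return {a: {b: corr[a][b] for b in names} for a in names}

# Hypothetical metadata table: one entry per public experiment.
metadata = {
    "exp1": {"tf": "ERalpha", "cell_type": "T-47D"},
    "exp2": {"tf": "ERalpha", "cell_type": "ECC-1"},
    "exp3": {"tf": "c-MYC", "cell_type": "H1-hESC"},
}

# Hypothetical precomputed correlations between the three experiments.
corr = {
    "exp1": {"exp1": 1.0, "exp2": 0.4, "exp3": 0.1},
    "exp2": {"exp1": 0.4, "exp2": 1.0, "exp3": 0.0},
    "exp3": {"exp1": 0.1, "exp2": 0.0, "exp3": 1.0},
}

er_experiments = subselect(metadata, tf="ERalpha")
er_corr = submatrix(corr, er_experiments)
```

Filtering on the metadata first keeps the heatmap restricted to the experiments relevant to the question, for example all experiments for one TF across cell types.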

3.2 User scenarios

3.2.1 User data quality control

We compared an RNA-seq sample from neocortex at 10 days post-partum (Ray ) with Bgee mouse RNA-seq data using HeatRNAseq. The top five correlation values (Pearson correlation coefficient > 0.9) correspond to Bgee brain samples (Supplementary Table S2). Thus, Heat*seq can be used as a fast quality check for next-generation sequencing data.

3.2.2 Cell context identification

We compared an oestrogen receptor alpha (ERα) ChIP-seq experiment in MCF7 cells (Zhuang ) with the ENCODE TFBS dataset, sub-selecting the ENCODE ER ChIP-seq experiments. The binding pattern of ERα in MCF7 cells was more similar to its binding pattern in T-47D cells than in ECC-1 cells (Fig. 1A). MCF7 and T-47D were derived from mammary tumours, whereas ECC-1 is an endometrial cell line.

Fig. 1.

Correlation heatmaps from Heat*seq. (A) ERα ChIP-seq in MCF7 cells from Zhuang et al. is closer to ENCODE ERα ChIP-seq in T-47D than in ECC-1 cells. (B) BRF1 and RNA PolIII bind tRNA genes, but not BRF2. (C) c-MYC ChIP-seq in H1-hESC from UT-A and Stanford show low correlation. The colour key next to B is for A, B and C. (D) Two erythroblast RNA-seq samples from BLUEPRINT are closely related to endothelial cells. (Color version of this figure is available at Bioinformatics online.)

3.2.3 New hypotheses by data integration

Comparing CpG islands (CGIs) from UCSC (Karolchik ) with the HeatChIPseq data showed that RNA polymerase II and TAF1 (Supplementary Table S4) were enriched at CGIs, consistent with the fact that ∼50% of human gene promoters contain a CGI (Illingworth and Bird, 2009). Interestingly, we identified factors avoiding CGIs, including MAFK, GATA3 and ZNF274. Similarly, using HeatChIPseq, tRNA promoters were highly correlated with RNA polymerase III and its co-factors BDP1, RPC155 and BRF1 (Supplementary Table S4). Interestingly, comparison with BRF family data revealed that BRF1, but not BRF2, was bound at tRNA genes (Supplementary Fig. S1B).

3.2.4 Public data assessment

Heat*seq can also be used to assess data in the public domain, as two examples illustrate. First, a MYC ChIP-seq experiment in H1-hESC cells does not cluster with other ENCODE MYC ChIP-seq experiments (Fig. 1C), including an H1-hESC sample from a different experimental group (Devailly ). Second, two out of seven erythroblast RNA-seq samples from the Blueprint Epigenome consortium are more correlated with endothelial cells than with the rest of the erythroblast samples (Fig. 1D).

4 Conclusion

With Heat*seq, comparing RNA-seq, ChIP-seq or CAGE experiments to hundreds of publicly available datasets becomes a trivial task. Researchers can now investigate the relationships between various high-throughput sequencing experiments quickly and interactively, without requiring any programming skills. Such analysis can be used to assess data quality and cell variability, and to generate novel regulatory hypotheses.

1. CpG islands--'a rough guide'.

Authors: Robert S Illingworth; Adrian P Bird
Journal: FEBS Lett Date: 2009-04-18 Impact factor: 4.124

2. The NIH Roadmap Epigenomics Mapping Consortium.

Authors: Bradley E Bernstein; John A Stamatoyannopoulos; Joseph F Costello; Bing Ren; Aleksandar Milosavljevic; Alexander Meissner; Manolis Kellis; Marco A Marra; Arthur L Beaudet; Joseph R Ecker; Peggy J Farnham; Martin Hirst; Eric S Lander; Tarjei S Mikkelsen; James A Thomson
Journal: Nat Biotechnol Date: 2010-10 Impact factor: 54.908

3. The Genotype-Tissue Expression (GTEx) project.

Authors:
Journal: Nat Genet Date: 2013-06 Impact factor: 38.330

4. A promoter-level mammalian expression atlas.

Authors: Alistair R R Forrest; Hideya Kawaji; Michael Rehli; J Kenneth Baillie; Michiel J L de Hoon; Vanja Haberle; Timo Lassmann; Ivan V Kulakovskiy; Marina Lizio; Masayoshi Itoh; Robin Andersson; Christopher J Mungall; Terrence F Meehan; Sebastian Schmeier; Nicolas Bertin; Mette Jørgensen; Emmanuel Dimont; Erik Arner; Christian Schmidl; Ulf Schaefer; Yulia A Medvedeva; Charles Plessy; Morana Vitezic; Jessica Severin; Colin A Semple; Yuri Ishizu; Robert S Young; Margherita Francescatto; Intikhab Alam; Davide Albanese; Gabriel M Altschuler; Takahiro Arakawa; John A C Archer; Peter Arner; Magda Babina; Sarah Rennie; Piotr J Balwierz; Anthony G Beckhouse; Swati Pradhan-Bhatt; Judith A Blake; Antje Blumenthal; Beatrice Bodega; Alessandro Bonetti; James Briggs; Frank Brombacher; A Maxwell Burroughs; Andrea Califano; Carlo V Cannistraci; Daniel Carbajo; Yun Chen; Marco Chierici; Yari Ciani; Hans C Clevers; Emiliano Dalla; Carrie A Davis; Michael Detmar; Alexander D Diehl; Taeko Dohi; Finn Drabløs; Albert S B Edge; Matthias Edinger; Karl Ekwall; Mitsuhiro Endoh; Hideki Enomoto; Michela Fagiolini; Lynsey Fairbairn; Hai Fang; Mary C Farach-Carson; Geoffrey J Faulkner; Alexander V Favorov; Malcolm E Fisher; Martin C Frith; Rie Fujita; Shiro Fukuda; Cesare Furlanello; Masaaki Furino; Jun-ichi Furusawa; Teunis B Geijtenbeek; Andrew P Gibson; Thomas Gingeras; Daniel Goldowitz; Julian Gough; Sven Guhl; Reto Guler; Stefano Gustincich; Thomas J Ha; Masahide Hamaguchi; Mitsuko Hara; Matthias Harbers; Jayson Harshbarger; Akira Hasegawa; Yuki Hasegawa; Takehiro Hashimoto; Meenhard Herlyn; Kelly J Hitchens; Shannan J Ho Sui; Oliver M Hofmann; Ilka Hoof; Furni Hori; Lukasz Huminiecki; Kei Iida; Tomokatsu Ikawa; Boris R Jankovic; Hui Jia; Anagha Joshi; Giuseppe Jurman; Bogumil Kaczkowski; Chieko Kai; Kaoru Kaida; Ai Kaiho; Kazuhiro Kajiyama; Mutsumi Kanamori-Katayama; Artem S Kasianov; Takeya Kasukawa; Shintaro Katayama; Sachi Kato; Shuji Kawaguchi; Hiroshi Kawamoto; Yuki I Kawamura; 
Tsugumi Kawashima; Judith S Kempfle; Tony J Kenna; Juha Kere; Levon M Khachigian; Toshio Kitamura; S Peter Klinken; Alan J Knox; Miki Kojima; Soichi Kojima; Naoto Kondo; Haruhiko Koseki; Shigeo Koyasu; Sarah Krampitz; Atsutaka Kubosaki; Andrew T Kwon; Jeroen F J Laros; Weonju Lee; Andreas Lennartsson; Kang Li; Berit Lilje; Leonard Lipovich; Alan Mackay-Sim; Ri-ichiroh Manabe; Jessica C Mar; Benoit Marchand; Anthony Mathelier; Niklas Mejhert; Alison Meynert; Yosuke Mizuno; David A de Lima Morais; Hiromasa Morikawa; Mitsuru Morimoto; Kazuyo Moro; Efthymios Motakis; Hozumi Motohashi; Christine L Mummery; Mitsuyoshi Murata; Sayaka Nagao-Sato; Yutaka Nakachi; Fumio Nakahara; Toshiyuki Nakamura; Yukio Nakamura; Kenichi Nakazato; Erik van Nimwegen; Noriko Ninomiya; Hiromi Nishiyori; Shohei Noma; Shohei Noma; Tadasuke Noazaki; Soichi Ogishima; Naganari Ohkura; Hiroko Ohimiya; Hiroshi Ohno; Mitsuhiro Ohshima; Mariko Okada-Hatakeyama; Yasushi Okazaki; Valerio Orlando; Dmitry A Ovchinnikov; Arnab Pain; Robert Passier; Margaret Patrikakis; Helena Persson; Silvano Piazza; James G D Prendergast; Owen J L Rackham; Jordan A Ramilowski; Mamoon Rashid; Timothy Ravasi; Patrizia Rizzu; Marco Roncador; Sugata Roy; Morten B Rye; Eri Saijyo; Antti Sajantila; Akiko Saka; Shimon Sakaguchi; Mizuho Sakai; Hiroki Sato; Suzana Savvi; Alka Saxena; Claudio Schneider; Erik A Schultes; Gundula G Schulze-Tanzil; Anita Schwegmann; Thierry Sengstag; Guojun Sheng; Hisashi Shimoji; Yishai Shimoni; Jay W Shin; Christophe Simon; Daisuke Sugiyama; Takaai Sugiyama; Masanori Suzuki; Naoko Suzuki; Rolf K Swoboda; Peter A C 't Hoen; Michihira Tagami; Naoko Takahashi; Jun Takai; Hiroshi Tanaka; Hideki Tatsukawa; Zuotian Tatum; Mark Thompson; Hiroo Toyodo; Tetsuro Toyoda; Elvind Valen; Marc van de Wetering; Linda M van den Berg; Roberto Verado; Dipti Vijayan; Ilya E Vorontsov; Wyeth W Wasserman; Shoko Watanabe; Christine A Wells; Louise N Winteringham; Ernst Wolvetang; Emily J Wood; Yoko Yamaguchi; Masayuki 
Yamamoto; Misako Yoneda; Yohei Yonekura; Shigehiro Yoshida; Susan E Zabierowski; Peter G Zhang; Xiaobei Zhao; Silvia Zucchelli; Kim M Summers; Harukazu Suzuki; Carsten O Daub; Jun Kawai; Peter Heutink; Winston Hide; Tom C Freeman; Boris Lenhard; Vladimir B Bajic; Martin S Taylor; Vsevolod J Makeev; Albin Sandelin; David A Hume; Piero Carninci; Yoshihide Hayashizaki
Journal: Nature Date: 2014-03-27 Impact factor: 49.962

5. CODEX: a next-generation sequencing experiment database for the haematopoietic and embryonic stem cell communities.

Authors: Manuel Sánchez-Castillo; David Ruau; Adam C Wilkinson; Felicia S L Ng; Rebecca Hannah; Evangelia Diamanti; Patrick Lombard; Nicola K Wilson; Berthold Gottgens
Journal: Nucleic Acids Res Date: 2014-09-30 Impact factor: 19.160

6. WiggleTools: parallel processing of large collections of genome-wide datasets for visualization and statistical analysis.

Authors: Daniel R Zerbino; Nathan Johnson; Thomas Juettemann; Steven P Wilder; Paul Flicek
Journal: Bioinformatics Date: 2013-12-19 Impact factor: 6.937

7. An integrated encyclopedia of DNA elements in the human genome.

Authors:
Journal: Nature Date: 2012-09-06 Impact factor: 49.962

8. Integrative analysis of 111 reference human epigenomes.

Authors: Anshul Kundaje; Wouter Meuleman; Jason Ernst; Misha Bilenky; Angela Yen; Alireza Heravi-Moussavi; Pouya Kheradpour; Zhizhuo Zhang; Jianrong Wang; Michael J Ziller; Viren Amin; John W Whitaker; Matthew D Schultz; Lucas D Ward; Abhishek Sarkar; Gerald Quon; Richard S Sandstrom; Matthew L Eaton; Yi-Chieh Wu; Andreas R Pfenning; Xinchen Wang; Melina Claussnitzer; Yaping Liu; Cristian Coarfa; R Alan Harris; Noam Shoresh; Charles B Epstein; Elizabeta Gjoneska; Danny Leung; Wei Xie; R David Hawkins; Ryan Lister; Chibo Hong; Philippe Gascard; Andrew J Mungall; Richard Moore; Eric Chuah; Angela Tam; Theresa K Canfield; R Scott Hansen; Rajinder Kaul; Peter J Sabo; Mukul S Bansal; Annaick Carles; Jesse R Dixon; Kai-How Farh; Soheil Feizi; Rosa Karlic; Ah-Ram Kim; Ashwinikumar Kulkarni; Daofeng Li; Rebecca Lowdon; GiNell Elliott; Tim R Mercer; Shane J Neph; Vitor Onuchic; Paz Polak; Nisha Rajagopal; Pradipta Ray; Richard C Sallari; Kyle T Siebenthall; Nicholas A Sinnott-Armstrong; Michael Stevens; Robert E Thurman; Jie Wu; Bo Zhang; Xin Zhou; Arthur E Beaudet; Laurie A Boyer; Philip L De Jager; Peggy J Farnham; Susan J Fisher; David Haussler; Steven J M Jones; Wei Li; Marco A Marra; Michael T McManus; Shamil Sunyaev; James A Thomson; Thea D Tlsty; Li-Huei Tsai; Wei Wang; Robert A Waterland; Michael Q Zhang; Lisa H Chadwick; Bradley E Bernstein; Joseph F Costello; Joseph R Ecker; Martin Hirst; Alexander Meissner; Aleksandar Milosavljevic; Bing Ren; John A Stamatoyannopoulos; Ting Wang; Manolis Kellis
Journal: Nature Date: 2015-02-19 Impact factor: 69.504

9. Variable reproducibility in genome-scale public data: A case study using ENCODE ChIP sequencing resource.

Authors: Guillaume Devailly; Anna Mantsoki; Tom Michoel; Anagha Joshi
Journal: FEBS Lett Date: 2015-11-24 Impact factor: 4.124

10. p21-activated kinase group II small compound inhibitor GNE-2861 perturbs estrogen receptor alpha signaling and restores tamoxifen-sensitivity in breast cancer cells.

Authors: Ting Zhuang; Jian Zhu; Zhilun Li; Julie Lorent; Chunyan Zhao; Karin Dahlman-Wright; Staffan Strömblad
Journal: Oncotarget Date: 2015-12-22
