TEPIC 2-an extended framework for transcription factor binding prediction and integrative epigenomic analysis.

Florian Schmidt^1,2,3, Fabian Kern^1,2,4, Peter Ebert^2,3, Nina Baumgarten^1,2,5,6, Marcel H Schulz^1,2,5,6.

Abstract

SUMMARY: Prediction of transcription factor (TF) binding from epigenetics data and integrative analysis thereof are challenging. Here, we present TEPIC 2, a framework for fast, accurate and versatile prediction and analysis of TF binding from epigenetics data: it supports 30 species with binding motifs, computes TF gene scores up to two orders of magnitude faster than before owing to an improved implementation, and offers easy-to-use machine learning pipelines for the integrated analysis of TF binding predictions with gene expression data, allowing the identification of important TFs.
AVAILABILITY AND IMPLEMENTATION: TEPIC is implemented in C++, R, and Python. It is freely available at https://github.com/SchulzLab/TEPIC and can be used on Linux-based systems.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 30304373 PMCID: PMC6499243 DOI: 10.1093/bioinformatics/bty856

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Transcription factors (TFs) are key players in transcriptional regulation, and predicting TF binding is essential to gain a deeper understanding of their function. While experimental identification of TF binding is possible through laborious and expensive ChIP-seq assays, several computational approaches have been proposed to identify TF binding sites (TFBSs) (Jayaram et al.). These predictions have been successfully augmented using epigenetics data (Cuellar-Partida et al.; Pique-Regi et al.; Sherwood et al.). As delineated in Supplementary Section 1, TEPIC 2 builds upon and extends the functionality of existing TFBS prediction tools. Among other features, TEPIC 2 allows the direct aggregation of TFBS predictions on the gene level and uses these scores to gain novel insights into cell-type-specific functions of TFs via several machine learning analyses. This is a unique feature not supported by competing TFBS prediction approaches (Supplementary Tables S1 and S2). Compared to its predecessor, TEPIC 2 has a substantially lower runtime, contains an extended set of TF motifs, offers various means for downstream machine learning analyses as easy-to-use pipelines, and adds new functionalities to compute TF gene scores.

2 Features

The core functionalities of TEPIC 2 are to predict TFBSs in user-provided regions and to aggregate them into TF gene scores. The TF gene score computation has been modified to compute statistical features such as region length, region count, and the signal of an epigenetic assay within the considered regions. TEPIC 2 can also compute a binary binding assessment, i.e. whether a TF binds or does not bind, based on p-values obtained using a set of background regions with characteristics similar to those of the input set. This feature complements the continuous TF affinity values of TRAP, which are not suitable for all downstream applications (Supplementary Section 4). Additionally, the aforementioned TF gene scores can be used in several integrative analysis workflows (Supplementary Sections 7, 8 and 9). INVOKE is a sparse linear regression model that reveals key TFs potentially regulating transcription. It highlighted several known tissue-specific regulators in liver hepatocytes and CD4+ T cells (Schmidt et al.) and is also available as a web server (Kehl et al.). In addition, TEPIC 2 includes a sparse logistic regression classifier, DYNAMITE, to infer TFs related to gene expression changes between samples. DYNAMITE has been successfully applied to discover regulators of CD4+ T cell differentiation (Durek et al.). Recently, we combined TEPIC with DREM (Schulz et al.) to uncover master regulatory TFs from paired time-series expression and epigenomics data (EPIC-DREM), which was used to analyze mesenchymal stem-cell differentiation into osteoblasts and adipocytes (Gerard et al.). Furthermore, we considerably extended the set of TF motifs readily available in TEPIC 2. This resource now contains 30 species-specific and six taxonomy-specific sets from JASPAR (Mathelier et al.), as well as aggregated sets for humans, mice and vertebrates containing 561, 380 and 690 TF motifs, respectively (Supplementary Section 3). To streamline the training and interpretation of statistical models (Supplementary Fig. S1), we provide clustered versions of the merged TF motif files, representing families of binding motifs with high similarity (Pape et al.).
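The aggregation of region-level affinities into TF gene scores and the binary binding call described in this section can be illustrated with a minimal Python sketch. This is not TEPIC 2's code: it assumes that per-region TF affinities are already available, uses a plain unweighted sum within a fixed window around the transcription start site (TSS), and derives the p-value empirically from the affinities of background regions. The function names, the window size, the significance threshold and the toy data are all hypothetical.

```python
from bisect import bisect_left

def gene_score(regions, tss, window=3000):
    """Aggregate one TF's region affinities into a gene-level score.

    regions: list of (start, end, affinity) tuples on the gene's chromosome.
    A region contributes if its midpoint lies within `window` bp of the TSS
    (an illustrative assumption, not necessarily TEPIC 2's scheme).
    Returns (summed affinity, region count, total region length).
    """
    score, count, length = 0.0, 0, 0
    for start, end, affinity in regions:
        midpoint = (start + end) // 2
        if abs(midpoint - tss) <= window:
            score += affinity
            count += 1
            length += end - start
    return score, count, length

def empirical_p_value(affinity, background):
    """Share of background affinities >= the observed one (with pseudocount)."""
    ranked = sorted(background)
    n_greater_equal = len(ranked) - bisect_left(ranked, affinity)
    return (n_greater_equal + 1) / (len(ranked) + 1)

def binds(affinity, background, alpha=0.05):
    """Binary binding call: True if the empirical p-value is at most alpha."""
    return empirical_p_value(affinity, background) <= alpha

# Toy data: three candidate regions for one TF, and a gene with TSS at 1500.
regions = [(1000, 1200, 0.5), (2500, 2600, 0.25), (9000, 9400, 1.5)]
print(gene_score(regions, tss=1500))  # (0.75, 2, 300)

# Toy background: affinities observed in 99 hypothetical background regions.
background = [i / 100 for i in range(1, 100)]
print(binds(2.0, background))  # True
print(binds(0.5, background))  # False
```

The region count and total length returned alongside the score correspond to the additional statistical features mentioned above; a per-region assay signal could be carried through in the same way.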

3 Implementation

TEPIC 2 uses a parallelized C++ implementation of TRAP (Roider et al.) that is considerably faster than the previous R implementation. Runtime was further reduced by using more efficient search algorithms and by enabling pre-filtered analyses of samples in minutes (Fig. 1a, Supplementary Table S5, Supplementary Fig. S2 and Supplementary Section 5). We evaluated the accuracy of TFBS predictions from TEPIC 2 using TF footprints called with HINT-BC (Gusmao et al.) on ENCODE data (The ENCODE Project Consortium, 2012). In comparison to established tools for TFBS prediction using epigenomics data (Cuellar-Partida et al.; Sherwood et al.), TEPIC performs favorably in terms of area under the precision-recall curve (AUPR) (Fig. 1b, Supplementary Fig. S3 and Supplementary Section 6). Details on the samples used are provided in Supplementary Section 2. The machine learning pipelines included in TEPIC 2 are implemented in R. Both workflows deliver results that are easy to interpret, even for non-expert users, thanks to automated figure generation and extensive documentation. As input, the pipelines require standard file formats, e.g. BED files for candidate TFBSs and tab-delimited text files containing gene expression data. TEPIC's full functionality is brought to the user via start-to-finish pipelines, which are automatically installed with TEPIC 2.
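The AUPR metric used in the comparison above can be made concrete with a short Python sketch. It computes average precision, one common estimator of the area under the precision-recall curve. Whether the evaluation in the paper used this estimator or an interpolated variant is not stated in this section, so the choice, the function name and the toy data are assumptions for illustration only.

```python
def average_precision(labels, scores):
    """Estimate the area under the precision-recall curve.

    labels: 1 for a site with experimental support (positive), 0 otherwise.
    scores: predicted TF affinities; higher means more likely bound.
    Candidates are ranked by score, and precision is averaged over the
    ranks at which a positive is found. Tied scores keep their input order.
    """
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / n_positive

# Toy example: four candidate sites, two of which are true binding sites.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(average_precision(labels, scores))
```

In this toy ranking the positives sit at ranks 1 and 3, so the value is the mean of 1/1 and 2/3. Unlike the area under the ROC curve, this quantity depends on the fraction of positives, which makes it informative when true binding sites are rare among the candidates.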

Fig. 1.

(a) Runtime comparison of TEPIC and TEPIC 2 using a subset of 458 human TFs. While the original implementation took up to 1300 minutes to compute TFBSs, TEPIC 2 computes TF affinities for peaks in the vicinity of genes in at most 15 minutes. We used four cell line samples and three primary human hepatocyte samples (LiHe1–3) to conduct the runtime experiments. (b) We compared TEPIC TF affinities computed in footprints called with HINT-BC (Gusmao et al.) in four different cell lines in terms of AUPR against PIQ (Sherwood et al.) and an extension of the widely used method Fimo, called Fimo-Prior (Cuellar-Partida et al.). Notably, TF affinities computed with TEPIC outperform both PIQ and Fimo-Prior.

4 Conclusion

TEPIC 2 is a fast and easy-to-use tool for TFBS prediction combined with integrative analysis capabilities for gene expression and epigenomic data. TFBS prediction and downstream machine learning pipelines for various analysis settings allow a deep, seamless exploration of epigenomic datasets, supporting data-driven hypothesis generation about the role of individual TFs in complex regulatory landscapes.

References

1. Epigenetic priors for identifying active transcription factor binding sites.

Authors: Gabriel Cuellar-Partida; Fabian A Buske; Robert C McLeay; Tom Whitington; William Stafford Noble; Timothy L Bailey
Journal: Bioinformatics Date: 2011-11-08 Impact factor: 6.937

2. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data.

Authors: Roger Pique-Regi; Jacob F Degner; Athma A Pai; Daniel J Gaffney; Yoav Gilad; Jonathan K Pritchard
Journal: Genome Res Date: 2010-11-24 Impact factor: 9.043

3. Predicting transcription factor affinities to DNA from a biophysical model.

Authors: Helge G Roider; Aditi Kanhere; Thomas Manke; Martin Vingron
Journal: Bioinformatics Date: 2006-11-10 Impact factor: 6.937

4. Natural similarity measures between position frequency matrices with an application to clustering.

Authors: Utz J Pape; Sven Rahmann; Martin Vingron
Journal: Bioinformatics Date: 2008-01-02 Impact factor: 6.937

5. Analysis of computational footprinting methods for DNase sequencing experiments.

Authors: Eduardo G Gusmao; Manuel Allhoff; Martin Zenke; Ivan G Costa
Journal: Nat Methods Date: 2016-02-22 Impact factor: 28.547

6. DREM 2.0: Improved reconstruction of dynamic regulatory networks from time-series expression data.

Authors: Marcel H Schulz; William E Devanny; Anthony Gitter; Shan Zhong; Jason Ernst; Ziv Bar-Joseph
Journal: BMC Syst Biol Date: 2012-08-16

7. Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape.

Authors: Richard I Sherwood; Tatsunori Hashimoto; Charles W O'Donnell; Sophia Lewis; Amira A Barkal; John Peter van Hoff; Vivek Karun; Tommi Jaakkola; David K Gifford
Journal: Nat Biotechnol Date: 2014-01-19 Impact factor: 54.908

8. Evaluating tools for transcription factor binding site prediction.

Authors: Narayan Jayaram; Daniel Usvyat; Andrew C R Martin
Journal: BMC Bioinformatics Date: 2016-11-02 Impact factor: 3.169

9. An integrated encyclopedia of DNA elements in the human genome.

Authors: The ENCODE Project Consortium
Journal: Nature Date: 2012-09-06 Impact factor: 49.962

10. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles.

Authors: Anthony Mathelier; Oriol Fornes; David J Arenillas; Chih-Yu Chen; Grégoire Denay; Jessica Lee; Wenqiang Shi; Casper Shyr; Ge Tan; Rebecca Worsley-Hunt; Allen W Zhang; François Parcy; Boris Lenhard; Albin Sandelin; Wyeth W Wasserman
Journal: Nucleic Acids Res Date: 2015-11-03 Impact factor: 16.971

Cited by

1. Temporal enhancer profiling of parallel lineages identifies AHR and GLIS1 as regulators of mesenchymal multipotency.

Authors: Deborah Gérard; Florian Schmidt; Aurélien Ginolhac; Martine Schmitz; Rashi Halder; Peter Ebert; Marcel H Schulz; Thomas Sauter; Lasse Sinkkonen
Journal: Nucleic Acids Res Date: 2019-02-20 Impact factor: 16.971

2. Prediction of single-cell gene expression for transcription factor analysis.

Authors: Fatemeh Behjati Ardakani; Kathrin Kattler; Tobias Heinen; Florian Schmidt; David Feuerborn; Gilles Gasparoni; Konstantin Lepikhov; Patrick Nell; Jan Hengstler; Jörn Walter; Marcel H Schulz
Journal: Gigascience Date: 2020-10-30 Impact factor: 6.524

3. Unique and assay specific features of NOMe-, ATAC- and DNase I-seq data.

Authors: Karl J V Nordström; Florian Schmidt; Nina Gasparoni; Abdulrahman Salhab; Gilles Gasparoni; Kathrin Kattler; Fabian Müller; Peter Ebert; Ivan G Costa; Nico Pfeifer; Thomas Lengauer; Marcel H Schulz; Jörn Walter
Journal: Nucleic Acids Res Date: 2019-11-18 Impact factor: 16.971

4. A hierarchical regulatory network analysis of the vitamin D induced transcriptome reveals novel regulators and complete VDR dependency in monocytes.

Authors: Timothy Warwick; Marcel H Schulz; Stefan Günther; Ralf Gilsbach; Antonio Neme; Carsten Carlberg; Ralf P Brandes; Sabine Seuter
Journal: Sci Rep Date: 2021-03-22 Impact factor: 4.379

5. Integrative analysis of epigenetics data identifies gene-specific regulatory elements.

Authors: Florian Schmidt; Alexander Marx; Nina Baumgarten; Marie Hebel; Martin Wegner; Manuel Kaulich; Matthias S Leisegang; Ralf P Brandes; Jonathan Göke; Jilles Vreeken; Marcel H Schulz
Journal: Nucleic Acids Res Date: 2021-10-11 Impact factor: 16.971

6. Integrative prediction of gene expression with chromatin accessibility and conformation data.

Authors: Florian Schmidt; Fabian Kern; Marcel H Schulz
Journal: Epigenetics Chromatin Date: 2020-02-06 Impact factor: 4.954

7. Constructing gene regulatory networks using epigenetic data.

Authors: Abhijeet Rajendra Sonawane; Dawn L DeMeo; John Quackenbush; Kimberly Glass
Journal: NPJ Syst Biol Appl Date: 2021-12-09

8. Nuclear receptor activation shapes spatial genome organization essential for gene expression control: lessons learned from the vitamin D receptor.

Authors: Timothy Warwick; Marcel H Schulz; Ralf Gilsbach; Ralf P Brandes; Sabine Seuter
Journal: Nucleic Acids Res Date: 2022-04-22 Impact factor: 19.160
