DNAshapeR: an R/Bioconductor package for DNA shape prediction and feature encoding.

Tsu-Pei Chiu, Federico Comoglio, Tianyin Zhou, Lin Yang, Renato Paro, Remo Rohs.

Abstract

DNAshapeR predicts DNA shape features in an ultra-fast, high-throughput manner from genomic sequencing data. The package takes either nucleotide sequences or genomic coordinates as input and generates various graphical representations for visualization and further analysis. DNAshapeR further encodes DNA sequence and shape features as user-defined combinations of k-mer and DNA shape features. The resulting feature matrices can be readily used as input to various machine learning software packages for further modeling studies.
AVAILABILITY AND IMPLEMENTATION: The DNAshapeR software package was implemented in the statistical programming language R and is freely available through the Bioconductor project at https://www.bioconductor.org/packages/devel/bioc/html/DNAshapeR.html and at the GitHub developer site, http://tsupeichiu.github.io/DNAshapeR/
CONTACT: rohs@usc.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2015. Published by Oxford University Press.

Year:  2015        PMID: 26668005      PMCID: PMC4824130          DOI: 10.1093/bioinformatics/btv735

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Two distinct readout modes have emerged as crucial components of protein–DNA recognition (Abe et al., 2015). These modes are sequence-based readout of direct contacts with the functional groups of the bases (base readout) and structure-based readout of intrinsic deviations from a canonical double helix (shape readout). DNA shape readout was originally described based on the analysis of co-crystal structures of protein–DNA complexes. Studies of DNA shape readout were then extended to massive datasets of protein-interacting DNA sequences via the use of DNAshape, a method for the high-throughput prediction of DNA structural features (Zhou et al., 2013). Using DNAshape as the underlying tool, a motif database for transcription factor (TF) binding sites, TFBSshape (Yang et al., 2014), and a genome browser database for DNA shape annotations, GBshape (Chiu et al., 2015), were developed.

Rules that determine the binding affinity between TFs and their binding sites can be statistically learned from data derived from in vitro high-throughput binding assays. Although sequence-based methods have long been used to model TF binding specificities, high-throughput prediction of DNA shape enabled us to develop methods that leverage both DNA sequence and shape information. Trained with either linear regression or support vector regression algorithms, shape-augmented models were consistently shown to outperform sequence-based methods in modeling the in vitro binding of TFs quantitatively (Zhou et al., 2015).

DNAshape is currently released as a stand-alone web service (Zhou et al., 2013). Its pre-defined functionality and bandwidth-limited performance make it difficult to use in genome-wide studies. To address these issues, we developed DNAshapeR, an R/Bioconductor package that generates DNA shape predictions in an easy-to-use, easy-to-integrate and easy-to-extend manner. The output can be readily integrated into other high-throughput genomic analysis platforms.

2 High-throughput DNA shape prediction

The core of DNAshapeR is the DNAshape prediction method (Zhou et al., 2013), which uses a sliding pentamer window to derive the structural features minor groove width (MGW), helix twist (HelT), propeller twist (ProT) and Roll (Fig. 1) from all-atom Monte Carlo simulations. These DNA shape features have been observed in various co-crystal structures to play an important role in achieving protein–DNA binding specificity. High-throughput predictions of DNA shape have shed light on the DNA binding specificity of TFs (He et al., 2015; Murphy et al., 2015) and were shown to be predictive of replication origins (Comoglio et al., 2015).
Fig. 1. Flowchart of DNAshapeR analysis. The input data can be either nucleotide sequence(s) in FASTA file format or genomic intervals, provided by the user in BED format or derived from public databases. The core of DNAshapeR includes a high-throughput approach for the prediction of DNA shape features. MGW, HelT, ProT and Roll can then be visualized in the form of plots, heat maps or genome browser tracks, or used for the assembly of feature vectors of user-defined combinations of k-mer and shape features.

The DNAshapeR package enables ultra-fast, high-throughput predictions of shape features for thousands of genomic sequences and generates various graphical outputs of the data (Fig. 1; Supplementary Data). The modular design of DNAshapeR allows additional features, such as conformational flexibility, biophysical properties and methylation status, to be added in future releases of the package.
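
The sliding pentamer window described in this section can be illustrated with a minimal Python sketch. It is not the DNAshapeR or DNAshape implementation: the function name `sliding_pentamer_feature` and the table `toy_table` are hypothetical, and the table values are placeholders rather than real shape predictions. The sketch assumes a nucleotide-level feature (such as MGW or ProT) whose value is assigned to the central base of each pentamer; step-level features such as Roll and HelT are defined between adjacent base pairs and would need a variant of this lookup.

```python
from typing import Dict, List, Optional


def sliding_pentamer_feature(
    seq: str, table: Dict[str, float]
) -> List[Optional[float]]:
    """Assign each pentamer's table value to its central nucleotide.

    Positions without a full pentamer around them (the two bases at each
    end) and pentamers missing from the table are reported as None.
    """
    seq = seq.upper()
    values: List[Optional[float]] = [None] * len(seq)
    for i in range(len(seq) - 4):
        pentamer = seq[i:i + 5]
        # The window covers positions i..i+4, so its centre is i+2.
        values[i + 2] = table.get(pentamer)
    return values


# Placeholder lookup table (hypothetical values, NOT real DNAshape output).
toy_table = {"ACGTA": 1.0, "CGTAC": 2.0, "GTACG": 3.0}

profile = sliding_pentamer_feature("ACGTACG", toy_table)
print(profile)  # [None, None, 1.0, 2.0, 3.0, None, None]
```

Because each position depends only on its own five-base window, the lookup is linear in sequence length, which is consistent with the high-throughput use described above.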

3 DNA shape and k-mer feature encoding

Besides DNA shape predictions and data visualization, DNAshapeR can also be used to generate feature vectors for user-defined models. These models consist of sequence features (1mer, 2mer, 3mer), shape features (MGW, Roll, ProT, HelT) or any combination of those features (Fig. 1; Supplementary Data). DNAshapeR encodes sequence as binary features. DNA shape features are normalized by default and can include second-order shape features. The detailed definitions of sequence and shape features were provided in an earlier study (Zhou et al., 2015). The feature encoding function of DNAshapeR enables the generation of any user-defined subset of these features. The result of the feature encoding for each sequence is a chimera feature vector. Feature encoding of multiple sequences thus results in a feature matrix, which can be used as input for a variety of statistical machine learning methods.
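
The encoding described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the DNAshapeR encoding function: the names `one_hot_1mer`, `min_max`, `encode_sequence` and `feature_matrix` are hypothetical, only 1mer sequence features and a single shape feature are shown, min-max scaling to [0, 1] is assumed as the normalization, and the shape values and bounds in the example are placeholders.

```python
from typing import List, Sequence

ALPHABET = "ACGT"


def one_hot_1mer(seq: str) -> List[int]:
    """Encode each nucleotide as four binary features in A, C, G, T order.

    Bases outside the alphabet (for example N) become four zeros.
    """
    features: List[int] = []
    for base in seq.upper():
        features.extend(1 if base == letter else 0 for letter in ALPHABET)
    return features


def min_max(values: Sequence[float], lo: float, hi: float) -> List[float]:
    """Scale values to [0, 1] given assumed lower and upper bounds."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return [(v - lo) / (hi - lo) for v in values]


def encode_sequence(
    seq: str, shape: Sequence[float], lo: float, hi: float
) -> List[float]:
    """Concatenate binary 1mer features with normalized shape features."""
    return [float(b) for b in one_hot_1mer(seq)] + min_max(shape, lo, hi)


def feature_matrix(
    seqs: Sequence[str], shapes: Sequence[Sequence[float]], lo: float, hi: float
) -> List[List[float]]:
    """Stack one feature vector per sequence into a matrix (list of rows)."""
    return [encode_sequence(s, v, lo, hi) for s, v in zip(seqs, shapes)]


# Placeholder shape values and bounds, chosen only to make the arithmetic clear.
row = encode_sequence("ACG", [4.0, 5.0, 6.0], lo=4.0, hi=6.0)
print(row)
# [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.5, 1.0]
```

Each row has the same length for sequences of equal length, so the stacked rows form a matrix that regression or classification code can consume directly.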

Funding

This work was supported by the NIH (R01GM106056, R01HG003008 in part, and U01GM103804 to R.R.). Open-source software release and open-access publication were supported by the NSF (MCB-1413539 to R.R.). R.R. is an Alfred P. Sloan Research Fellow.

Conflict of Interest: none declared.

References

1.  An ancient protein-DNA interaction underlying metazoan sex determination.

Authors:  Mark W Murphy; John K Lee; Sandra Rojo; Micah D Gearhart; Kayo Kurahashi; Surajit Banerjee; Guy-André Loeuille; Anu Bashamboo; Kenneth McElreavey; David Zarkower; Hideki Aihara; Vivian J Bardwell
Journal:  Nat Struct Mol Biol       Date:  2015-05-25       Impact factor: 15.369

2.  Deconvolving the recognition of DNA shape from sequence.

Authors:  Namiko Abe; Iris Dror; Lin Yang; Matthew Slattery; Tianyin Zhou; Harmen J Bussemaker; Remo Rohs; Richard S Mann
Journal:  Cell       Date:  2015-04-02       Impact factor: 41.582

3.  Quantitative modeling of transcription factor binding specificities using DNA shape.

Authors:  Tianyin Zhou; Ning Shen; Lin Yang; Namiko Abe; John Horton; Richard S Mann; Harmen J Bussemaker; Raluca Gordân; Remo Rohs
Journal:  Proc Natl Acad Sci U S A       Date:  2015-03-09       Impact factor: 11.205

4.  High-resolution profiling of Drosophila replication start sites reveals a DNA shape and chromatin signature of metazoan origins.

Authors:  Federico Comoglio; Tommy Schlumpf; Virginia Schmid; Remo Rohs; Christian Beisel; Renato Paro
Journal:  Cell Rep       Date:  2015-04-23       Impact factor: 9.423

5.  GBshape: a genome browser database for DNA shape annotations.

Authors:  Tsu-Pei Chiu; Lin Yang; Tianyin Zhou; Bradley J Main; Stephen C J Parker; Sergey V Nuzhdin; Thomas D Tullius; Remo Rohs
Journal:  Nucleic Acids Res       Date:  2014-10-17       Impact factor: 16.971

6.  ChIP-nexus enables improved detection of in vivo transcription factor binding footprints.

Authors:  Qiye He; Jeff Johnston; Julia Zeitlinger
Journal:  Nat Biotechnol       Date:  2015-03-09       Impact factor: 54.908

7.  DNAshape: a method for the high-throughput prediction of DNA structural features on a genomic scale.

Authors:  Tianyin Zhou; Lin Yang; Yan Lu; Iris Dror; Ana Carolina Dantas Machado; Tahereh Ghane; Rosa Di Felice; Remo Rohs
Journal:  Nucleic Acids Res       Date:  2013-05-22       Impact factor: 16.971

8.  TFBSshape: a motif database for DNA shape features of transcription factor binding sites.

Authors:  Lin Yang; Tianyin Zhou; Iris Dror; Anthony Mathelier; Wyeth W Wasserman; Raluca Gordân; Remo Rohs
Journal:  Nucleic Acids Res       Date:  2013-11-07       Impact factor: 16.971
