The jigsaw puzzle of sequence phenotype inference: Piecing together Shannon entropy, importance sampling, and Empirical Bayes.

Zeina Shreif, Deborah A Striegel, Vipul Periwal.

Abstract

A nucleotide sequence 35 base pairs long can take 4^35 = 1,180,591,620,717,411,303,424 possible values. Protein binding microarrays, one example of a systems biology dataset, contain activity data for about 40,000 such sequences. The discrepancy between the number of possible configurations and the number of measured activities is enormous. Thus, although systems biology datasets are large in absolute terms, they often require methods developed for rare events, because the number of possible configurations of biological systems grows combinatorially. Many techniques have been developed for handling large datasets, such as Empirical Bayes, or rare events, such as importance sampling, but they cannot always be used together. Here we introduce a principled approach to Empirical Bayes based on importance sampling, information theory, and theoretical physics in the general context of sequence phenotype model induction. We present the analytical calculations that underlie our approach. We demonstrate its computational efficiency on concrete examples, and its efficacy by applying the theory to publicly available protein binding microarray transcription factor datasets and to data on synthetic cAMP-regulated enhancer sequences. As further demonstrations, we find transcription factor binding motifs, predict the activity of new sequences, and extract the locations of transcription factor binding sites. In summary, we present a novel method that is efficient (requiring minimal computational time and reasonable amounts of memory), has predictive power comparable to that of models with hundreds of parameters, and has a limited number of optimized parameters, proportional to the sequence length. Published by Elsevier Ltd.
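
The size of the gap described in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the authors' method: it assumes four possible bases per position and takes the abstract's "about 40,000" as exactly 40,000. The variable names are hypothetical.

```python
# Illustrative arithmetic only; names and the exact count of 40,000 are assumptions.
SEQ_LENGTH = 35        # sequence length in base pairs, from the abstract
N_BASES = 4            # A, C, G, T
N_MEASURED = 40_000    # "about 40,000" sequences per the abstract, taken as exact here

# Number of distinct sequences of this length.
n_possible = N_BASES ** SEQ_LENGTH

# Fraction of the sequence space that a microarray of this size samples.
coverage = N_MEASURED / n_possible

print(f"possible sequences: {n_possible:,}")  # 1,180,591,620,717,411,303,424
print(f"fraction measured:  {coverage:.2e}")
```

The measured fraction comes out at a few parts in 10^17, which is why the abstract treats each measured sequence as a rare event despite the dataset being large in absolute terms.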

Keywords:  Binding motifs; Model induction; Protein binding microarrays; Quantitative sequence activity models; Systems biology; Transcription factor binding activity

Year:  2015        PMID: 26092377      PMCID: PMC4522360          DOI: 10.1016/j.jtbi.2015.06.010

Source DB:  PubMed          Journal:  J Theor Biol        ISSN: 0022-5193            Impact factor:   2.691


  3 in total

1.  Correlated rigid modes in protein families.

Authors:  D A Striegel; D Wojtowicz; T M Przytycka; V Periwal
Journal:  Phys Biol       Date:  2016-04-11       Impact factor: 2.583

2.  Block network mapping approach to quantitative trait locus analysis.

Authors:  Zeina Z Shreif; Daniel M Gatti; Vipul Periwal
Journal:  BMC Bioinformatics       Date:  2016-12-22       Impact factor: 3.169

3.  A model of k-mer surprisal to quantify local sequence information content surrounding splice regions.

Authors:  Sam Humphrey; Alastair Kerr; Magnus Rattray; Caroline Dive; Crispin J Miller
Journal:  PeerJ       Date:  2020-11-04       Impact factor: 2.984

