Field-theoretic density estimation for biological sequence space with applications to 5' splice site diversity and aneuploidy in cancer.

Wei-Chia Chen, Juannan Zhou, Jason M Sheltzer, Justin B Kinney, David M McCandlish

Abstract

Density estimation in sequence space is a fundamental problem in machine learning that is also of great importance in computational biology. Due to the discrete nature and large dimensionality of sequence space, how best to estimate such probability distributions from a sample of observed sequences remains unclear. One common strategy for addressing this problem is to estimate the probability distribution using maximum entropy (i.e., calculating point estimates for some set of correlations based on the observed sequences and predicting the probability distribution that is as uniform as possible while still matching these point estimates). Building on recent advances in Bayesian field-theoretic density estimation, we present a generalization of this maximum entropy approach that provides greater expressivity in regions of sequence space where data are plentiful while still maintaining a conservative maximum entropy character in regions of sequence space where data are sparse or absent. In particular, we define a family of priors for probability distributions over sequence space with a single hyperparameter that controls the expected magnitude of higher-order correlations. This family of priors then results in a corresponding one-dimensional family of maximum a posteriori estimates that interpolate smoothly between the maximum entropy estimate and the observed sample frequencies. To demonstrate the power of this method, we use it to explore the high-dimensional geometry of the distribution of 5' splice sites found in the human genome and to understand patterns of chromosomal abnormalities across human cancers.
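The interpolation described in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation. It assumes a binary alphabet, a first-order smoothness penalty built from the graph Laplacian of the Hamming graph, and a hyperparameter named `a`; the function names `hamming_laplacian` and `map_density` are hypothetical. The estimate is Q = exp(-phi) / sum(exp(-phi)), where phi minimizes (a/2) phi^T L phi + N sum_i R_i phi_i + N sum_i exp(-phi_i), with R the observed frequencies and N the sample size. Under these assumptions, small `a` recovers the observed frequencies and large `a` gives the uniform distribution, which is the maximum entropy estimate when no correlations are constrained.

```python
import itertools
import numpy as np


def hamming_laplacian(length, alphabet_size=2):
    """Return all sequences and the graph Laplacian of the Hamming graph."""
    seqs = list(itertools.product(range(alphabet_size), repeat=length))
    n = len(seqs)
    lap = np.zeros((n, n))
    for i, s in enumerate(seqs):
        for j, t in enumerate(seqs):
            # Sequences are neighbors if they differ at exactly one position.
            if sum(x != y for x, y in zip(s, t)) == 1:
                lap[i, j] = -1.0
    # Row sums are minus the degree, so this puts the degree on the diagonal.
    lap -= np.diag(lap.sum(axis=1))
    return seqs, lap


def map_density(counts, lap, a, n_iter=200, tol=1e-8):
    """Damped-Newton MAP estimate under a Laplacian smoothness penalty.

    `a` (an assumed name) controls how strongly differences between
    neighboring sequences are penalized. All counts should be positive
    if `a` is close to zero, otherwise the minimizer is unbounded.
    """
    counts = np.asarray(counts, dtype=float)
    n_obs = counts.sum()
    freq = counts / n_obs
    # Start from the uniform distribution: exp(-phi) = 1 / (number of states).
    phi = np.full(len(counts), np.log(len(counts)))

    def action(p):
        return 0.5 * a * p @ lap @ p + n_obs * (freq @ p) + n_obs * np.exp(-p).sum()

    for _ in range(n_iter):
        q = np.exp(-phi)
        grad = a * lap @ phi + n_obs * (freq - q)
        if np.abs(grad).max() < tol * n_obs:
            break
        # The Hessian is positive definite because exp(-phi) > 0.
        hess = a * lap + n_obs * np.diag(q)
        step = np.linalg.solve(hess, grad)
        # Backtracking: halve the step until the action does not increase.
        t = 1.0
        while action(phi - t * step) > action(phi) and t > 1e-8:
            t *= 0.5
        phi = phi - t * step

    q = np.exp(-phi)
    return q / q.sum()


if __name__ == "__main__":
    seqs, lap = hamming_laplacian(3)
    toy_counts = [40, 10, 10, 5, 10, 5, 5, 15]  # made-up counts, illustration only
    for a in (1e-6, 10.0, 1e6):
        print(a, np.round(map_density(toy_counts, lap, a), 3))
```

Replacing the first-order Laplacian with a higher-order penalty would leave low-order correlations unpenalized, so the large-`a` limit would then match those correlations instead of collapsing to the uniform distribution; that extension is not shown here.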

Keywords:  bioinformatics; field theory; maximum entropy; molecular evolution; spectral graph theory

Year: 2021
PMID: 34599093
PMCID: PMC8501885
DOI: 10.1073/pnas.2025782118
Source DB: PubMed
Journal: Proc Natl Acad Sci U S A
ISSN: 0027-8424
Impact factor: 11.205

