A Bayesian sampler for optimization of protein domain hierarchies.

Andrew F Neuwald

Abstract

The process of identifying and modeling functionally divergent subgroups for a specific protein domain class and arranging these subgroups hierarchically has, thus far, largely been done via manual curation. How to accomplish this automatically and optimally is an unsolved statistical and algorithmic problem that is addressed here via Markov chain Monte Carlo sampling. Taking as input a (typically very large) multiple-sequence alignment, the sampler creates and optimizes a hierarchy by adding and deleting leaf nodes, by moving nodes and subtrees up and down the hierarchy, by inserting or deleting internal nodes, and by redefining the sequences and conserved patterns associated with each node. All such operations are based on a probability distribution that models the conserved and divergent patterns defining each subgroup. When we view these patterns as sequence determinants of protein function, each node or subtree in such a hierarchy corresponds to a subgroup of sequences with similar biological properties. The sampler can be applied either de novo or to an existing hierarchy. When applied to 60 protein domains from multiple starting points in this way, it converged on similar solutions with nearly identical log-likelihood ratio scores, suggesting that it typically finds the optimal peak in the posterior probability distribution. Similarities and differences between independently generated, nearly optimal hierarchies for a given domain help distinguish robust from statistically uncertain features. Thus, a future application of the sampler is to provide confidence measures for various features of a domain hierarchy.

Year: 2014; PMID: 24494927; PMCID: PMC3948484; DOI: 10.1089/cmb.2013.0099

Source DB: PubMed; Journal: J Comput Biol; ISSN: 1066-5277; Impact factor: 1.479


