Calibrating E-values for hidden Markov models using reverse-sequence null models.

Kevin Karplus, Rachel Karchin, George Shackelford, Richard Hughey.

Abstract

MOTIVATION: Hidden Markov models (HMMs) calculate the probability that a sequence was generated by a given model. Log-odds scoring provides a context for evaluating this probability, by considering it in relation to a null hypothesis. We have found that using a reverse-sequence null model effectively removes biases owing to sequence length and composition and reduces the number of false positives in a database search. Any scoring system is an arbitrary measure of the quality of database matches. Significance estimates of scores are essential, because they eliminate model- and method-dependent scaling factors, and because they quantify the importance of each match. Accurate computation of the significance of reverse-sequence null model scores presents a problem, because the scores do not fit the extreme-value (Gumbel) distribution commonly used to estimate HMM scores' significance.
RESULTS: To get a better estimate of the significance of reverse-sequence null model scores, we derive a theoretical distribution based on the assumption of a Gumbel distribution for raw HMM scores and compare estimates based on this and other distribution families. We derive estimation methods for the parameters of the distributions based on maximum likelihood and on moment matching (least-squares fit for Student's t-distribution). We evaluate the modeled distributions of scores, based on how well they fit the tail of the observed distribution for data not used in the fitting and on the effects of the improved E-values on our HMM-based fold-recognition methods. The theoretical distribution provides some improvement in fitting the tail and in providing fewer false positives in the fold-recognition test. An ad hoc distribution based on assuming a stretched exponential tail does an even better job. The use of Student's t to model the distribution fits well in the middle of the distribution, but provides too heavy a tail. The moment-matching methods fit the tails better than maximum-likelihood methods.
AVAILABILITY: Information on obtaining the SAM program suite (free for academic use), as well as a server interface, is available at http://www.soe.ucsc.edu/research/compbio/sam.html and the open-source random sequence generator with varying compositional biases is available at http://www.soe.ucsc.edu/research/compbio/gen_sequence
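
ILLUSTRATION: The following Python sketch is not taken from the paper or from SAM; it only illustrates the kind of calculation the abstract describes. It assumes that the score of a sequence and the score of its reversal are independent Gumbel variates with the same scale, in which case a standard result is that their difference follows a logistic distribution with tail P(S >= x) = 1 / (1 + exp(lambda * x)). It also assumes that larger scores are better, which may not match the sign convention of any particular tool. The function names (gumbel_sample, fit_lambda_by_moments, logistic_tail, e_value) are hypothetical, and the scores are simulated rather than real HMM output.

```python
import math
import random
import statistics


def gumbel_sample(rng, mu=0.0, beta=1.0):
    """Draw one Gumbel(mu, beta) variate by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() can return exactly 0.0
        u = rng.random()
    return mu - beta * math.log(-math.log(u))


def fit_lambda_by_moments(scores):
    """Moment matching for a zero-centered logistic distribution.

    A logistic with tail 1 / (1 + exp(lam * x)) has variance
    pi^2 / (3 * lam^2), so lam = pi / (sqrt(3) * sigma).
    """
    sigma = statistics.pstdev(scores, mu=0.0)
    return math.pi / (math.sqrt(3.0) * sigma)


def logistic_tail(x, lam):
    """P(S >= x) = 1 / (1 + exp(lam * x)), evaluated without overflow."""
    z = lam * x
    if z > 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def e_value(x, lam, database_size):
    """Expected number of sequences scoring at least x by chance."""
    return database_size * logistic_tail(x, lam)


# Simulated stand-in for reverse-sequence null scores (assumption:
# forward and reversed raw scores are independent Gumbel(0, 1) draws).
rng = random.Random(0)
scores = [gumbel_sample(rng) - gumbel_sample(rng) for _ in range(20000)]
lam = fit_lambda_by_moments(scores)
```

Under these assumptions the fitted lambda should come out close to 1, and e_value(x, lam, database_size) converts a score into an expected count of chance hits. The abstract reports that real scores deviate from this idealized form in the tail, which is why the authors also evaluate a stretched-exponential alternative.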

Year: 2005   PMID: 16123115   DOI: 10.1093/bioinformatics/bti629

Source DB: PubMed   Journal: Bioinformatics   ISSN: 1367-4803   Impact factor: 6.937

