
Profile Comparer: a program for scoring and aligning profile hidden Markov models.

Martin Madera

Abstract

Profile Comparer (PRC) is a stand-alone program for scoring and aligning profile hidden Markov models (HMMs) of protein families. PRC can read models produced by SAM and HMMER, two popular profile HMM packages, as well as PSI-BLAST checkpoint files. This application note provides a brief description of the profile-profile algorithm used by PRC.

AVAILABILITY: The C source code, licensed under the GNU General Public Licence, and Linux and Mac OS X binaries can be downloaded from http://supfam.org/PRC.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year:  2008        PMID: 18845584      PMCID: PMC2579712          DOI: 10.1093/bioinformatics/btn504

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Profile Comparer (PRC) is a program for scoring and aligning a profile hidden Markov model (HMM) of a protein family against other profile HMMs. Profiles are tables that give a score for finding a particular amino acid at a particular position in an alignment of a protein family. The best known profile method is probably PSI-BLAST (Altschul et al., 1997). Profile HMMs are similar to profiles, but replace scores with probabilities, and introduce additional probabilities for insertions and deletions at each position in the profile (Durbin et al., 1998; Eddy, 1998). All probabilities are placed within a single statistical framework, an HMM. In this note, we shall count profile HMMs among profile methods.

It is now well established that profile–profile methods detect more distant homologies than profile–sequence methods, which in turn are more powerful than sequence–sequence methods (see e.g. Sadreyev and Grishin, 2008; Söding, 2005). Profile–profile methods also generate the most accurate alignments; in fact, profile–profile methods were first used in progressive multiple sequence alignment and only later for homology recognition.

Among profile–sequence methods, the SAM and HMMER profile HMM programs (Eddy, 1998; Hughey and Krogh, 1996) are believed to be the best (Fig. 1). In addition to insertion and deletion probabilities that vary along the profile, their improvement over, e.g., PSI-BLAST comes from a number of other innovations, including use of the forward algorithm instead of Viterbi (Durbin et al., 1998) and a better algorithm for estimating a profile from a given alignment.
Fig. 1. A SCOP domain benchmark (Madera and Gough, 2002) of PRC, illustrating the improvement over standard methods. The SCOP seed sequences were filtered to <25% sequence identity. PRC and SAM (Hughey and Krogh, 1996) used SUPERFAMILY profile HMMs (Gough et al., 2001). PSI-BLAST (Altschul et al., 1997) checkpoint files used in the benchmark were derived from SUPERFAMILY profile HMMs and use identical probabilities for the profile part. For a comparison of PRC to competing profile–profile methods, the reader is referred to Söding (2006) and Sadreyev and Grishin (2008).

The goal of PRC is to apply lessons learned from development of SAM and HMMER to the profile–profile case. PRC was first publicly released in 2002 and has been used by Pfam since 2005. Recently PRC has performed well in benchmarks (Sadreyev and Grishin, 2008; Söding, 2006) carried out by the authors of the two main alternative profile–profile methods, COMPASS (Sadreyev and Grishin, 2008) and HHsearch (Söding, 2005). Here, we provide an overview of the PRC algorithm (version 1.5.5) and explain how to use the program.

2 THE PRC ALGORITHM

When scoring a profile HMM against a library of profile HMMs, PRC reports E-values, which give an estimate of how significant the matches are. In order to calculate E-values, PRC first calculates three other scores: co-emission, simple and reverse. Each score builds upon the previous one, until finally reverse scores are converted into E-values.

The co-emission score is a generalization, to the HMM–HMM case, of the log-odds score calculated by SAM and HMMER. It is defined through a sum over all possible amino acid sequences σ, in which the probability P(σ|HMM) that a profile HMM emits a sequence σ is calculated using the forward algorithm (Durbin et al., 1998). When one of the HMMs is extremely ‘narrow’, e.g. it only emits a single sequence τ with a non-zero probability (P(σ|HMM) = 1 if σ = τ, 0 otherwise), the co-emission score tends to the profile HMM log-odds score for τ. The null model emits random sequences with background amino acid frequencies and a geometric distribution of lengths.

The simple score is the same as the co-emission score, but both profile HMMs are restricted to regions of significant similarity. The regions are found by an iterative procedure that picks a new end point as the maximum of the forward score in the dynamic programming matrix, and a start point as the maximum of the backward score.

The reverse score for two profile HMMs 1 and 2 is defined in terms of a reverse HMM, which is constructed with a reverse operator rev that maps residue or model segment i (1 ≤ i ≤ L) onto residue L−i+1. This is a generalization of the reverse sequence null model used by SAM (Karplus et al., 2005).

Finally, for library runs the reverse score is turned into an E-value by fitting a function, Equation (5), to the observed distribution of reverse scores. The E-value E is the expected number of random matches with a reverse score better than x, and n is the number of profile HMMs in the library that are unrelated to the query. The function is a slight generalization of the one used by SAM (Karplus et al., 2005) and has two parameters, λ and κ, whose optimal values for each run are found using a censored maximum likelihood fitting procedure.

HMM–HMM alignments are computed by finding the Viterbi path that maximizes the sum of forward–backward odds scores (Durbin et al., 1998).
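As a minimal sketch of the idea behind the co-emission score, consider the special case of two gapless profiles of equal length, in which each model emits exactly one residue per column. This is a simplifying assumption and not PRC's algorithm: PRC works with full profile HMMs, including insertions and deletions, and with a null model that has a geometric length distribution. One form consistent with the description above is S = log Σ_σ P(σ|HMM1) P(σ|HMM2) / P(σ|null); in the gapless case the sum over sequences factorizes into a product over columns. The function names, the toy three-letter alphabet and the choice of log base 2 are illustrative assumptions.

```python
import itertools
import math


def coemission_score(profile1, profile2, background):
    """Co-emission score (log2) for two gapless profiles of equal length.

    Each profile is a list of columns; each column maps residue -> probability.
    ASSUMPTION: gapless, fixed-length models, so the sum over all sequences
    factorizes column by column. This is not the full HMM-HMM calculation.
    """
    if len(profile1) != len(profile2):
        raise ValueError("this sketch only handles profiles of equal length")
    score = 0.0
    for col1, col2 in zip(profile1, profile2):
        column_sum = sum(col1[a] * col2[a] / background[a] for a in background)
        score += math.log2(column_sum)
    return score


def coemission_brute_force(profile1, profile2, background):
    """Same quantity, computed by explicitly summing over every sequence."""
    alphabet = sorted(background)
    odds_sum = 0.0
    for seq in itertools.product(alphabet, repeat=len(profile1)):
        p1 = math.prod(col[a] for col, a in zip(profile1, seq))
        p2 = math.prod(col[a] for col, a in zip(profile2, seq))
        p_null = math.prod(background[a] for a in seq)
        odds_sum += p1 * p2 / p_null
    return math.log2(odds_sum)


def log_odds_score(profile, sequence, background):
    """Ordinary profile-sequence log-odds score (log2) for a gapless profile."""
    return sum(math.log2(col[a] / background[a]) for col, a in zip(profile, sequence))


# Toy data (hypothetical, three-letter alphabet for brevity).
background = {"A": 0.5, "C": 0.3, "G": 0.2}
profile_a = [{"A": 0.7, "C": 0.2, "G": 0.1}, {"A": 0.1, "C": 0.1, "G": 0.8}]
profile_b = [{"A": 0.6, "C": 0.3, "G": 0.1}, {"A": 0.2, "C": 0.2, "G": 0.6}]
# A 'narrow' profile that emits only the sequence "AG".
narrow = [{"A": 1.0, "C": 0.0, "G": 0.0}, {"A": 0.0, "C": 0.0, "G": 1.0}]

if __name__ == "__main__":
    print(coemission_score(profile_a, profile_b, background))
    print(coemission_score(profile_a, narrow, background))
    print(log_odds_score(profile_a, "AG", background))
```

Comparing `coemission_score` with `coemission_brute_force` on a small example checks the factorization, and scoring against a profile that emits a single sequence reproduces that sequence's log-odds score, mirroring the 'narrow' HMM limit described above.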

3 USING PRC

PRC can read SAM3 (ASCII and binary) and HMMER2 model files, and PSI-BLAST checkpoint files. The same internal profile HMM is used for scoring all three. For PSI-BLAST checkpoint files, the profile part is taken from the checkpoint file and the insertion and deletion probabilities are set to default values, constant throughout the model. For best performance, users should build a full profile HMM using the SAM w0.5 script.

For accurate E-values, the library should contain at least 1000 profile HMMs. For libraries of sufficient size, E < 0.003 can be taken as indicative of homology and E < 10^-5 as a strong match. When a large library is not available, Equation (5) with λ = 0.8, κ = 0 can be used as a conservative guide.

Starting with version 1.5.5, the PRC source code also includes a simple Perl script, merge_aligns.pl. Given two HMM–sequence alignments in the SAM a2m format, and a PRC alignment between the two HMMs, the script will output a pairwise alignment between the two sequences. Users who would like to visualize their HMM–HMM alignments are referred to the pairwise HMM logos server (Schuster-Böckler and Bateman, 2005).
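The thresholds above can be applied with a few lines of code. The sketch below is illustrative only: the exact form of Equation (5) is not shown in this text, so `assumed_evalue` uses a logistic form, E(x) = n / (1 + exp(λx + κ)), which is an assumption and not a confirmed statement of the published formula. The function names and the wording of the returned labels are likewise hypothetical; only the numerical thresholds, the library-size guideline and the default values λ = 0.8, κ = 0 come from the text.

```python
import math

HOMOLOGY_E = 0.003       # E below this: indicative of homology (from the text)
STRONG_E = 1e-5          # E below this: strong match (from the text)
MIN_LIBRARY_SIZE = 1000  # library size suggested for accurate E-values


def assumed_evalue(reverse_score, n_unrelated, lam=0.8, kappa=0.0):
    """E-value under an ASSUMED logistic form, n / (1 + exp(lam * x + kappa)).

    The functional form is a guess made for illustration; only the default
    parameter values lam=0.8 and kappa=0 are taken from the text.
    """
    z = lam * reverse_score + kappa
    if z > 700.0:
        # Avoid overflow in exp(z); 1 / (1 + exp(z)) is ~exp(-z) here.
        return n_unrelated * math.exp(-z)
    return n_unrelated / (1.0 + math.exp(z))


def classify(evalue):
    """Label an E-value using the two thresholds quoted in the text."""
    if evalue < STRONG_E:
        return "strong match"
    if evalue < HOMOLOGY_E:
        return "indicative of homology"
    return "no evidence at these thresholds"


def library_large_enough(n_models):
    """True if the library meets the suggested minimum size."""
    return n_models >= MIN_LIBRARY_SIZE


if __name__ == "__main__":
    for score in (5.0, 20.0, 40.0):
        e = assumed_evalue(score, n_unrelated=1000)
        print(score, e, classify(e))
```

Because the thresholds are stated for libraries of sufficient size, a caller would normally check `library_large_enough` first and treat labels from smaller libraries, or from the assumed fallback formula, as rough guidance only.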

Funding

M.M.'s Internal Graduate Studentship from Trinity College, Cambridge, UK; the UK Medical Research Council and the Laboratory of Molecular Biology, Cambridge, UK (Cyrus Chothia's group); the European Bioinformatics Institute (Nick Goldman's group); Kevin Karplus's National Institutes of Health grant R01 GM068570; Julian Gough's European Union Framework Programme 7 IMPACT grant.

Conflict of Interest: none declared.

References

1. Gough J, Karplus K, Hughey R, Chothia C. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol. 2001-11-02.

2. Madera M, Gough J. A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res. 2002-10-01.

3. Söding J. Protein homology detection by HMM-HMM comparison. Bioinformatics. 2004-11-05.

4. Schuster-Böckler B, Bateman A. Visualizing profile-profile alignment: pairwise HMM logos. Bioinformatics. 2005-04-12.

5. Karplus K, Karchin R, Shackelford G, Hughey R. Calibrating E-values for hidden Markov models using reverse-sequence null models. Bioinformatics. 2005-08-25.

6. Eddy SR. Profile hidden Markov models. Bioinformatics. 1998.

7. Hughey R, Krogh A. Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput Appl Biosci. 1996-04.

8. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997-09-01.

9. Sadreyev RI, Grishin NV. Accurate statistical model of comparison between multiple sequence alignments. Nucleic Acids Res. 2008-02-19.