Aleksandar Stojmirović, E Michael Gertz, Stephen F Altschul, Yi-Kuo Yu.
Abstract
MOTIVATION: The flexibility in gap cost enjoyed by hidden Markov models (HMMs) is expected to afford them better retrieval accuracy than position-specific scoring matrices (PSSMs). We attempt to quantify the effect of more general gap parameters by separately examining the influence of position- and composition-specific gap scores, as well as by comparing the retrieval accuracy of the PSSMs constructed using an iterative procedure to that of the HMMs provided by Pfam and SUPERFAMILY, curated ensembles of multiple alignments.
Year: 2008 PMID: 18586708 PMCID: PMC2718649 DOI: 10.1093/bioinformatics/btn171
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. An example of a protein profile HMM architecture used by HMMER. The model contains n positions plus a begin state (B) and end state (E). Each position contains a substitution (S) and a deletion state (D), with a possible insertion state (I) between two S-nodes. Allowed transitions are shown by arrows. To simulate local alignments, transitions B→S and S→E, for any S, are permitted.
Nomenclature of search strategies
| Name | Description |
|---|---|
| HO | Original HMM dataset |
| HB | HMMs, background insertion emission probabilities |
| HG | HMMs, constant state transitions and background insertion emissions |
| PO | PSSMs, converted from original HMMs |
| PC | PSSMs, from five PSI-BLAST iterations over |
| PS | PSSMs, from five PSI-BLAST iterations over |
As shown in this table, the first two letters of the abbreviations of various search strategies denote the type of profile (HMM or PSSM) and the method of construction. The third letter is optionally appended to show the database of origin (F for Pfam, U for SUPERFAMILY).
Fig. 2. ROC score statistics of 1 million samples. In each sample, 224 superfamilies are first randomly chosen from 299 superfamilies. A representative query profile is then randomly selected from each chosen superfamily. ROC score histograms from using Pfam HMMs (a) and SUPERFAMILY HMMs (b) show an appreciable difference in average ROC scores for each search method tested: SUPERFAMILY HMMs always perform better. Note that in panels (a) and (b), the curve for HO is completely covered by that for HB. Using HOF and HOU as baselines, the values of RRSD224 (measurement at 1 EPQ) between various methods and the baselines are computed for each sample. The resulting histograms are shown in panels (c) and (d).
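The two-stage sampling described in the Fig. 2 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `draw_sample`, the dictionary layout, and the toy superfamily and profile labels are all assumptions made for the example, and uniform sampling without replacement is assumed for both stages.

```python
import random

def draw_sample(profiles_by_superfamily, n_chosen=224, rng=None):
    """Draw one sample: choose n_chosen superfamilies without replacement,
    then pick one query profile uniformly at random from each of them.

    profiles_by_superfamily is a hypothetical mapping from a superfamily
    label to the list of its candidate query profiles.
    """
    rng = rng or random.Random()
    # Sort the keys so a seeded generator gives a reproducible result.
    chosen = rng.sample(sorted(profiles_by_superfamily), n_chosen)
    return {sf: rng.choice(profiles_by_superfamily[sf]) for sf in chosen}

# Toy stand-in data: 299 superfamilies, each with 1 to 3 placeholder profiles.
toy = {
    f"sf{i:03d}": [f"sf{i:03d}_p{j}" for j in range(1 + i % 3)]
    for i in range(299)
}
sample = draw_sample(toy, rng=random.Random(0))
```

Repeating `draw_sample` 1 million times and scoring each draw would yield one ROC value per sample for a histogram of the kind the caption describes; the scoring step itself is not shown here.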
Summary of statistics of RRSD224 between every pair of search strategies using the same source
In Figure 2c and d, HOF and HOU were used as the baselines for Pfam and SUPERFAMILY search strategies, respectively, and the histograms of RRSD224 relative to the baselines are shown. It is impractical to show such histograms for all possible baselines. However, for each pair of search strategies, we may sort (in ascending order) their 1 million values of RRSD224 and record the corresponding RRSD224 value at various designated percentiles. In the table, there are three numbers in a row for any given pair of search strategies. As an example, the numbers 2.9, 4.5 and 6.3, associated with M1=HBF and M2=HGF, are located in the row labeled by HBF and within the column headed by HGF. Those numbers, when divided by 100, have the following interpretation: the leftmost corresponds to the RRSD224 value at the 2.5th percentile, the middle to the median and the rightmost to the 97.5th percentile. Panel A records the numbers associated with Pfam search methods, while Panel B documents those associated with the SUPERFAMILY strategies tested.
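The percentile summary described above (sort the per-sample values, read off the 2.5th percentile, median, and 97.5th percentile, and report them multiplied by 100) can be sketched as follows. The helper names `percentile` and `table_entry` are hypothetical, and linear interpolation between neighboring ranks is an assumption: the text does not say which percentile convention was used. The sketch takes the RRSD224 values as given and does not compute them.

```python
def percentile(sorted_vals, q):
    """Percentile q (0 to 100) of an ascending list, using linear
    interpolation between neighboring ranks (an assumed convention)."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def table_entry(rrsd_values):
    """Return the three numbers for one pair of search strategies:
    the 2.5th percentile, median, and 97.5th percentile, each
    multiplied by 100 and rounded to one decimal place."""
    ordered = sorted(rrsd_values)  # ascending order, as in the text
    return tuple(round(100 * percentile(ordered, q), 1) for q in (2.5, 50.0, 97.5))
```

With 1 million per-sample values for a pair such as M1=HBF and M2=HGF, `table_entry` would produce a triple in the same form as the three numbers reported in each row of the table.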
Fig. 3. Example CVE curves for various search strategies based on Pfam (a) and SUPERFAMILY (b) profiles. Each curve shown is a representative that corresponds to a sample with ROC224 score equal to the median of 1 000 000 samples.