Fast probabilistic analysis of sequence function using scoring matrices.

T D Wu, C G Nevill-Manning, D L Brutlag.

Abstract

MOTIVATION: We present techniques for increasing the speed of sequence analysis using scoring matrices. Our techniques are based on calculating, for a given scoring matrix, the quantile function, which assigns a probability, or p, value to each segmental score. Our techniques also permit the user to specify a p threshold to indicate the desired trade-off between sensitivity and speed for a particular sequence analysis. The resulting increase in speed should allow scoring matrices to be used more widely in large-scale sequencing and annotation projects.
RESULTS: We develop three techniques for increasing the speed of sequence analysis: probability filtering, lookahead scoring, and permuted lookahead scoring. In probability filtering, we compute the score threshold that corresponds to the user-specified p threshold. We use the score threshold to limit the number of segments that are retained in the search process. In lookahead scoring, we test intermediate scores to determine whether they will possibly exceed the score threshold. In permuted lookahead scoring, we score each segment in a particular order designed to maximize the likelihood of early termination. Our two lookahead scoring techniques reduce substantially the number of residues that must be examined. The fraction of residues examined ranges from 62 to 6%, depending on the p threshold chosen by the user. These techniques permit sequence analysis with scoring matrices at speeds that are several times faster than existing programs. On a database of 12 177 alignment blocks, our techniques permit sequence analysis at a speed of 225 residues/s for a p threshold of 10^-6, and 541 residues/s for a p threshold of 10^-20. In order to compute the quantile function, we may use either an independence assumption or a Markov assumption. We measure the effect of first- and second-order Markov assumptions and find that they tend to raise the p value of segments, when compared with the independence assumption, by average ratios of 1.30 and 1.69, respectively. We also compare our technique with the empirical 99.5th percentile scores compiled in the BLOCKSPLUS database, and find that they correspond on average to a p value of 1.5 x 10^-5.
AVAILABILITY: The techniques described above are implemented in a software package called EMATRIX. This package is available from the authors for free academic use or for licensed commercial use. The EMATRIX set of programs is also available on the Internet at http://motif.stanford.edu/ematrix.
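The three techniques named under RESULTS can be illustrated with a minimal Python sketch. It is not the EMATRIX implementation. It assumes integer scores, the independence assumption, and a scoring matrix stored as a list of per-column dictionaries. All function names are hypothetical, and the column-ordering heuristic in `permuted_order` (largest gap between maximum and expected column score first) is an assumption made for illustration, since the abstract does not specify the order used.

```python
from collections import defaultdict


def score_distribution(matrix, background):
    """Exact distribution of segment scores under the independence assumption.

    matrix[i][r] is the integer score of residue r at column i;
    background[r] is the probability of residue r.
    """
    dist = {0: 1.0}
    for column in matrix:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for residue, q in background.items():
                nxt[score + column[residue]] += prob * q
        dist = dict(nxt)
    return dist


def score_threshold(dist, p_threshold):
    """Probability filtering: smallest score T with P(score >= T) <= p_threshold.

    Returns None if even the maximum score is more probable than p_threshold.
    """
    tail, threshold = 0.0, None
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        if tail > p_threshold:
            break
        threshold = score
    return threshold


def permuted_order(matrix, background):
    """Assumed heuristic: visit columns with the largest (max - expected) gap first."""
    def gap(i):
        column = matrix[i]
        expected = sum(background[r] * column[r] for r in background)
        return max(column.values()) - expected
    return sorted(range(len(matrix)), key=gap, reverse=True)


def lookahead_score(matrix, segment, threshold, order=None):
    """Lookahead scoring: stop as soon as the threshold can no longer be reached.

    Returns (score or None, number of residues examined).
    """
    order = list(range(len(matrix))) if order is None else order
    # max_rest[k] is the best score still obtainable from columns order[k:].
    max_rest = [0] * (len(order) + 1)
    for k in range(len(order) - 1, -1, -1):
        max_rest[k] = max_rest[k + 1] + max(matrix[order[k]].values())
    total = 0
    for k, i in enumerate(order):
        total += matrix[i][segment[i]]
        if total + max_rest[k + 1] < threshold:
            return None, k + 1
    return total, len(order)


# Toy two-letter alphabet and three-column matrix (illustrative values only).
background = {"A": 0.5, "B": 0.5}
matrix = [{"A": 2, "B": 0}, {"A": 0, "B": 1}, {"A": 3, "B": -1}]

threshold = score_threshold(score_distribution(matrix, background), 0.25)
print(threshold)                                    # 5
print(lookahead_score(matrix, "AAB", threshold))    # (None, 3)
order = permuted_order(matrix, background)
print(lookahead_score(matrix, "AAB", threshold, order))  # (None, 1)
```

In this toy case the segment "AAB" is rejected only after all three residues in left-to-right order, but after a single residue in the permuted order, which is the kind of saving the abstract attributes to permuted lookahead scoring.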

Year:  2000        PMID: 10869016     DOI: 10.1093/bioinformatics/16.3.233

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  19 in total

1.  The EMOTIF database.

Authors:  J Y Huang; D L Brutlag
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

2.  DIAN: a novel algorithm for genome ontological classification.

Authors:  Y Pouliot; J Gao; Q J Su; G G Liu; X B Ling
Journal:  Genome Res       Date:  2001-10       Impact factor: 9.043

3.  3MATRIX and 3MOTIF: a protein structure visualization system for conserved sequence motifs.

Authors:  Steven P Bennett; Lin Lu; Douglas L Brutlag
Journal:  Nucleic Acids Res       Date:  2003-07-01       Impact factor: 16.971

4.  SMOTIF: efficient structured pattern and profile motif search.

Authors:  Yongqiang Zhang; Mohammed J Zaki
Journal:  Algorithms Mol Biol       Date:  2006-11-21       Impact factor: 1.405

5.  LigProf: a simple tool for in silico prediction of ligand-binding sites.

Authors:  Grzegorz Koczyk; Lucjan S Wyrwicz; Leszek Rychlewski
Journal:  J Mol Model       Date:  2007-01-03       Impact factor: 1.810

6.  A probabilistic method for small RNA flowgram matching.

Authors:  Vladimir Vacic; Hailing Jin; Jian-Kang Zhu; Stefano Lonardi
Journal:  Pac Symp Biocomput       Date:  2008

7.  Identification and characterization of a pSLA2 plasmid locus required for linear DNA replication and circular plasmid stable inheritance in Streptomyces lividans.

Authors:  Zhongjun Qin; Meijuan Shen; Stanley N Cohen
Journal:  J Bacteriol       Date:  2003-11       Impact factor: 3.490

8.  MOODS: fast search for position weight matrix matches in DNA sequences.

Authors:  Janne Korhonen; Petri Martinmäki; Cinzia Pizzi; Pasi Rastas; Esko Ukkonen
Journal:  Bioinformatics       Date:  2009-09-22       Impact factor: 6.937

9.  The distribution of GYR- and YLP-like motifs in Drosophila suggests a general role in cuticle assembly and other protein-protein interactions.

Authors:  R Scott Cornman
Journal:  PLoS One       Date:  2010-09-02       Impact factor: 3.240

10.  Significant speedup of database searches with HMMs by search space reduction with PSSM family models.

Authors:  Michael Beckstette; Robert Homann; Robert Giegerich; Stefan Kurtz
Journal:  Bioinformatics       Date:  2009-10-14       Impact factor: 6.937
