
Score distributions for simultaneous matching to multiple motifs.

T L Bailey, M Gribskov.

Abstract

Several computer algorithms now exist for discovering multiple motifs (expressed as weight matrices) that characterize a family of protein sequences known to be homologous. This paper describes a method for performing similarity searches of protein sequence databases using such a group of motifs. By simultaneously using all the motifs that characterize a protein family, the sensitivity and specificity of the database search are increased. We define the p-value for a target sequence to be the probability of a random sequence of the same length scoring as well or better in comparison to all the motifs that characterize the family. (The p-value of a database search can be determined from this value and the size of the database.) We show that estimating the distribution of single motif scores by a Gaussian extreme value distribution is insufficiently accurate to provide a useful estimate of the p-value, but that this deficiency can be corrected by reestimating the parameters of the underlying Gaussian distribution from observed scores for comparison of a given motif and sequence database. These parameters are used to calculate a "reduced variate" which has a Gumbel limiting distribution. Multiple motif scores are combined to give a single p-value by using the sum of the reduced variates for the motif scores as the test statistic. We give a computationally efficient approximation to the distribution of the sum of independent Gumbel random variables and verify experimentally that it closely approximates the distribution of the test statistic. Experiments on pseudorandom sequences show that the approximated p-values are conservative, so the significance of high scores in database searches will not be overstated. Experiments with real protein sequences and motifs identified by the MEME algorithm show that determining an overall p-value based on the combination of multiple motifs gives significantly better database search results than using p-values of single motifs.
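The combination step described in the abstract can be illustrated with a minimal sketch. It assumes each motif score has already been standardized to a reduced variate that follows a standard Gumbel distribution, and it estimates the upper-tail probability of the sum by plain Monte Carlo sampling. This is a stand-in for illustration, not the efficient approximation the paper derives, which the abstract does not spell out. All function names, the `location` and `scale` parameters, and the independence assumption in `database_pvalue` are hypothetical.

```python
import math
import random


def reduced_variate(score, location, scale):
    """Standardize a raw motif score.

    `location` and `scale` are hypothetical parameters, assumed to have been
    re-estimated from observed scores of this motif against the database.
    """
    return (score - location) / scale


def gumbel_sf(y):
    """Upper-tail probability of a standard Gumbel variate: 1 - exp(-exp(-y))."""
    return -math.expm1(-math.exp(-y))


def sample_standard_gumbel(rng):
    """Draw one standard Gumbel variate by inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # random() can return 0.0, which log() cannot take
        u = rng.random()
    return -math.log(-math.log(u))


def sum_pvalue_mc(observed_sum, n_motifs, n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(sum of n_motifs Gumbel variates >= observed_sum).

    Adding one to numerator and denominator keeps the estimate above zero,
    which errs on the conservative side.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        total = sum(sample_standard_gumbel(rng) for _ in range(n_motifs))
        if total >= observed_sum:
            hits += 1
    return (hits + 1) / (n_samples + 1)


def database_pvalue(sequence_pvalue, n_sequences):
    """P(at least one of n_sequences random sequences scores this well).

    Assumes independent sequences: 1 - (1 - p)^n. The abstract only says the
    database-level value follows from the p-value and the database size.
    """
    if sequence_pvalue >= 1.0:
        return 1.0
    return -math.expm1(n_sequences * math.log1p(-sequence_pvalue))


if __name__ == "__main__":
    # Hypothetical reduced variates for three motifs matched to one sequence.
    variates = [reduced_variate(s, location=10.0, scale=2.0) for s in (18.0, 15.0, 16.0)]
    p = sum_pvalue_mc(sum(variates), n_motifs=len(variates))
    print("sequence p-value:", p)
    print("database p-value:", database_pvalue(p, n_sequences=50_000))
```

With a single motif the Monte Carlo estimate should agree with `gumbel_sf` up to sampling error, which gives a quick sanity check on the sampler.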

Year: 1997; PMID: 9109037; DOI: 10.1089/cmb.1997.4.45

Source DB: PubMed; Journal: J Comput Biol; ISSN: 1066-5277; Impact factor: 1.479

