Guojun Li, Bingqiang Liu, Ying Xu.
Abstract
We present a new computational method for a classical problem, the identification of cis-regulatory motifs in a given set of promoter sequences, based on one key new idea. Instead of scoring candidate motifs individually, as existing motif-finding programs do, our method scores groups of candidate motifs with similar sequences, called motif closures, using a P-value, which substantially improves prediction reliability over the existing methods. Our new P-value scoring scheme is independent of sequence length, and hence allows direct comparison of predicted motifs of different lengths on the same footing. We have implemented this method as the Motif Recognition Computer (MREC) program, and have extensively tested MREC on both simulated and biological data from prokaryotic genomes. Our test results indicate that MREC accurately picks out the actual motif, with the correct length, as the best-scoring candidate in the vast majority of the cases in our test set. We compared our prediction results with those of two motif-finding programs, Cosmo and MEME, and found that MREC outperforms both programs across all the test cases by a large margin. The MREC program is available at http://csbl.bmb.uga.edu/~bingqiang/MREC1/.
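The abstract does not define motif closures or the MREC P-value in detail, so the following is only a minimal sketch of the general idea of scoring a group of similar candidates instead of a single one. Everything in it is an assumption made for illustration: the Hamming-distance grouping, the uniform background model, the binomial tail used as a P-value, and the helper names `collect_group`, `neighbour_prob` and `binomial_tail`. It is not the authors' scoring scheme.

```python
from math import comb


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def collect_group(seed, sequences, max_mismatches):
    """Return (sequence index, offset, word) for every k-mer within
    max_mismatches of the seed. Hypothetical helper, not MREC's closure."""
    k = len(seed)
    hits = []
    for s_idx, seq in enumerate(sequences):
        for pos in range(len(seq) - k + 1):
            word = seq[pos:pos + k]
            if hamming(seed, word) <= max_mismatches:
                hits.append((s_idx, pos, word))
    return hits


def neighbour_prob(k, max_mismatches):
    """Chance that a uniform random DNA k-mer lies within max_mismatches
    of a fixed k-mer (assumed background: each base has probability 1/4)."""
    return sum(comb(k, j) * 0.75 ** j * 0.25 ** (k - j)
               for j in range(max_mismatches + 1))


def binomial_tail(n, m, p):
    """P(X >= m) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(m, n + 1))


# Toy promoter set, invented for this example.
sequences = ["ACGTACGT", "TTACGAAA"]
seed = "ACGT"

hits = collect_group(seed, sequences, max_mismatches=1)
positions = sum(len(s) - len(seed) + 1 for s in sequences)
p_value = binomial_tail(positions, len(hits), neighbour_prob(len(seed), 1))
print(hits, p_value)
```

The binomial tail treats every window as an independent trial, which overlapping k-mers are not, so the number it produces is a rough approximation. The sketch only illustrates why a group-level score can be compared across seeds, and says nothing about how MREC achieves length independence.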
Year: 2009 PMID: 19906734 PMCID: PMC2811016 DOI: 10.1093/nar/gkp907
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Performance of MREC, MEME and Cosmo on simulated data generated using nine point-mutation rates. (A) The dataset with different mutation rates. (B) The dataset with different motif lengths.
Motif prediction by MREC on the E. coli dataset
a. The numbers in parentheses (L, N) denote the length and the number of binding sites of a motif, respectively.
b. Underlined segments are the parts that overlap the real motifs.
Prediction by Cosmo and MEME on the E. coli dataset
| Transcription factor | Known motifs (L and N) | MEME result (L, N) | |
|---|---|---|---|
| ArgR | |||
| CpxR | |||
| CRP | |||
| DnaA | |||
| Fnr | |||
| FruR | |||
| Fur | |||
| GntR | |||
| LexA | |||
| MetJ | |||
| NarP | |||
| NtrC | |||
| PhoB | |||
| PurR | |||
| TrpR | |||
| TyrR | |||
| Total | Number of binding sites: 831 | Detected binding sites: 281 | Detected binding sites: 331 |
a. The numbers in parentheses (L, N) denote the length and the number of binding sites of a motif, respectively.
b. Underlined segments are the parts that overlap the real motifs.
c. The actual motif was completely missed.
d. Bayesian Information Criterion, used by Cosmo to identify the motif length.
Figure 2. Comparison of the P-values computed by MREC, csFFT and CONSENSUS, using the ArgR and DnaA datasets in E. coli as examples. The pink dashed lines mark the correct motif length.