Eivind Valen, Albin Sandelin, Ole Winther, Anders Krogh.
Abstract
A major goal in post-genome biology is the complete mapping of the gene regulatory networks for every organism. Identification of regulatory elements is a prerequisite for realizing this ambitious goal. A common problem is finding regulatory patterns in promoters of a group of co-expressed genes, but contemporary methods are challenged by the size and diversity of regulatory regions in higher metazoans. Two key issues are the small amount of information contained in a pattern compared to the large promoter regions and the repetitive characteristics of genomic DNA, both of which lead to "pattern drowning". We present a new computational method for identifying transcription factor binding sites in promoters using a discriminatory approach with a large negative set encompassing a significant sample of the promoters from the relevant genome. The sequences are described by a probabilistic model, and the most discriminatory motifs are identified by maximizing the probability of the sets given the motif model and prior probabilities of motif occurrences in both sets. Due to the large number of promoters in the negative set, an enhanced suffix array is used to improve speed and performance. Using our method, we demonstrate higher accuracy than the best of contemporary methods, high robustness when extending the length of the input sequences, and a strong correlation between our objective function and the correct solution. Using a large background set of real promoters instead of a simplified model leads to higher discriminatory power and markedly reduces the need for repeat masking, a common pre-processing step for other pattern finders.
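The discriminatory idea in the abstract, preferring patterns that occur in the co-expressed (positive) promoters but rarely in a large negative set, can be illustrated with a deliberately simplified toy. The sketch below is not the objective function described in the abstract: it swaps the probabilistic motif model for exact k-mers and scores each k-mer by a pseudocount-smoothed log-odds of occurring in a positive versus a negative sequence. The function names, the pseudocount, and the example sequences are all hypothetical.

```python
import math
from collections import Counter


def kmer_presence(seqs, k):
    """Count how many sequences contain each k-mer at least once."""
    counts = Counter()
    for s in seqs:
        # A set ensures each k-mer is counted once per sequence.
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts


def discriminative_scores(positive, negative, k, pseudo=1.0):
    """Log-odds that a k-mer occurs in a positive vs. a negative sequence.

    Toy stand-in for a discriminative objective (an assumption for
    illustration, not the published scoring function).
    """
    pos = kmer_presence(positive, k)
    neg = kmer_presence(negative, k)
    scores = {}
    for kmer, n_pos in pos.items():
        p = (n_pos + pseudo) / (len(positive) + 2 * pseudo)
        q = (neg.get(kmer, 0) + pseudo) / (len(negative) + 2 * pseudo)
        scores[kmer] = math.log(p / q)
    return scores


# Made-up sequences: "GTGAC" is planted in every positive sequence.
positive = ["ACGTGACC", "TTGTGACA", "GGTGACGT"]
negative = ["ACACACAC", "TTTTACGT", "GGGGCCCC", "ACGTACGT"]

scores = discriminative_scores(positive, negative, k=5)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

With a realistic negative set of thousands of promoters, the per-k-mer counting is the expensive step, which is consistent with the abstract's motivation for an indexed structure such as an enhanced suffix array.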
Year: 2009 PMID: 19911049 PMCID: PMC2770120 DOI: 10.1371/journal.pcbi.1000562
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. Synthetic set evaluation.
The average site performance (lines) and the nucleotide correlation coefficient (bars) of the methods.
Figure 2. Correlation of MoAn's objective function (Sc) and site sensitivity (sSn).
All 5 runs on the 84 synthetic sets are used.
Figure 3. Repeat assessment.
The average site performance (lines) and the nucleotide correlation coefficient (bars) of MoAn with repeats planted in the two sets.
Figure 4. PAZAR set evaluation.
The average site performance (lines) and the nucleotide correlation coefficient (bars) of the methods.
Figure 5. Performance of co-occurrence vs. serial runs.
The average site performance (lines) and nucleotide correlation coefficient (bars) of co-occurrence and serial runs on 5 different sets with co-occurring motifs.
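The figure captions refer to site sensitivity (sSn), average site performance, and the nucleotide correlation coefficient without defining them on this page. The sketch below assumes the definitions commonly used in motif-finder benchmarks: sSn = sTP / (sTP + sFN), site positive predictive value sPPV = sTP / (sTP + sFP), average site performance = (sSn + sPPV) / 2, and a nucleotide-level correlation coefficient computed from per-nucleotide true and false positives and negatives. The function names and the example counts are hypothetical.

```python
import math


def site_sensitivity(s_tp, s_fn):
    """Fraction of known sites that were predicted (sSn)."""
    total = s_tp + s_fn
    return s_tp / total if total else 0.0


def site_ppv(s_tp, s_fp):
    """Fraction of predicted sites that are known sites (sPPV)."""
    total = s_tp + s_fp
    return s_tp / total if total else 0.0


def average_site_performance(s_tp, s_fp, s_fn):
    """Mean of site sensitivity and site positive predictive value."""
    return (site_sensitivity(s_tp, s_fn) + site_ppv(s_tp, s_fp)) / 2


def nucleotide_cc(tp, fp, tn, fn):
    """Correlation between predicted and known nucleotide positions."""
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return (tp * tn - fn * fp) / denom if denom else 0.0


# Made-up counts for one hypothetical prediction run.
asp = average_site_performance(s_tp=8, s_fp=2, s_fn=2)
ncc = nucleotide_cc(tp=80, fp=20, tn=880, fn=20)
print(round(asp, 3), round(ncc, 3))
```

The site-level measures reward finding the right locations, while the nucleotide-level coefficient also penalizes predictions that overlap a site only partially, which is presumably why the figures report both.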