Fabrice Touzain, Sophie Schbath, Isabelle Debled-Rennesson, Bertrand Aigle, Gregory Kucherov, Pierre Leblond.
Abstract
BACKGROUND: Many programs have been developed to identify transcription factor binding sites. However, most of them cannot infer two-word motifs with variable spacer lengths. This case arises for RNA polymerase Sigma (σ) Factor Binding Sites (SFBSs), which are usually composed of two boxes, called -35 and -10 in reference to the transcription initiation point. Our goal is to design an algorithm that detects SFBSs by using combinational and statistical constraints deduced from biological observations.
Year: 2008 PMID: 18237374 PMCID: PMC2375139 DOI: 10.1186/1471-2105-9-73
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Merging of overlapping upstream sequences on the same strand. The final statistical test of motifs requires counting the number of occurrences in the upstream sequences. If genes overlap, their upstream sequences may overlap as well. To avoid counting the same motif occurrence twice, we merge overlapping upstream sequences that lie on the same strand.
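The merging step described in Figure 1 can be illustrated with a standard interval merge performed separately for each strand. This is only a minimal sketch: the function name `merge_upstream`, the `(strand, start, end)` tuple layout, and the half-open coordinate convention are assumptions made for the example, not details taken from the article.

```python
from collections import defaultdict


def merge_upstream(regions):
    """Merge overlapping upstream regions that lie on the same strand.

    `regions` is an iterable of (strand, start, end) tuples with a
    half-open [start, end) convention (an assumption of this sketch).
    Returns the merged regions, ordered by strand and then by start.
    """
    by_strand = defaultdict(list)
    for strand, start, end in regions:
        by_strand[strand].append((start, end))

    merged = []
    for strand in sorted(by_strand):
        current = None  # (start, end) of the region being extended
        for start, end in sorted(by_strand[strand]):
            if current is not None and start < current[1]:
                # Overlap on the same strand: extend the current region.
                current = (current[0], max(current[1], end))
            else:
                if current is not None:
                    merged.append((strand, current[0], current[1]))
                current = (start, end)
        if current is not None:
            merged.append((strand, current[0], current[1]))
    return merged


# Hypothetical coordinates: two overlapping regions on "+", one on "-".
example = [("+", 100, 450), ("+", 400, 750), ("-", 420, 770), ("+", 900, 1250)]
merged_example = merge_upstream(example)
```

Regions on opposite strands are never merged here, even when their coordinates overlap, which mirrors the "same strand" condition in the caption.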
Figure 2. Conservation of interesting words in promoter regions of orthologues. We search for pairs of conserved, significantly over-represented words separated by approximately the same spacer in the two promoter regions: sp2 - sp1 = δ, δ ∈ {-1, 0, 1}.
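The spacer condition in Figure 2 can be expressed as a simple filter over word pairs found in two orthologous promoter regions. The sketch below is illustrative only: the function name `conserved_pairs`, the `(word1, word2, spacer)` representation, and the example words are hypothetical, and the test for over-representation is assumed to have been applied beforehand.

```python
def conserved_pairs(pairs_1, pairs_2, max_delta=1):
    """Return word pairs present in both promoter regions with similar spacers.

    `pairs_1` and `pairs_2` are lists of (word1, word2, spacer) tuples, one
    list per promoter region (hypothetical input format). A pair is kept when
    both words match and sp2 - sp1 lies in {-max_delta, ..., max_delta}.
    """
    hits = []
    for w1, w2, sp1 in pairs_1:
        for v1, v2, sp2 in pairs_2:
            if (w1, w2) == (v1, v2) and abs(sp2 - sp1) <= max_delta:
                hits.append((w1, w2, sp1, sp2))
    return hits


# Hypothetical word pairs and spacer lengths for two orthologous promoters.
region_1 = [("TTG", "ACA", 17), ("GGA", "GTT", 16)]
region_2 = [("TTG", "ACA", 18), ("GGA", "GTT", 19)]
hits_example = conserved_pairs(region_1, region_2)
```

In this example the first pair is retained because its spacers differ by 1, while the second is rejected because its spacers differ by 3.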
Figure 3. Grouping of pairs of interesting words found in promoter regions according to pairs of hits. From the conservation of pairs of words in the two bacteria (on the left of the figure), we deduce the sets of sequences SS1 and SS2 (one for each bacterium) that share a given pair of patterns.
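The grouping in Figure 3 amounts to indexing promoter regions by the word pair they share, once per bacterium. A minimal sketch follows, assuming each conserved hit is recorded as `(word1, word2, seq_id_1, seq_id_2)`; that record layout, the function name `group_by_pattern`, and the sequence identifiers are hypothetical.

```python
from collections import defaultdict


def group_by_pattern(hits):
    """Build the sequence sets SS1 and SS2 for each conserved word pair.

    `hits` is an iterable of (word1, word2, seq_id_1, seq_id_2) tuples, where
    seq_id_1 and seq_id_2 identify the orthologous promoter regions in the
    first and second bacterium (hypothetical input format).
    Returns two dicts mapping (word1, word2) to a set of sequence ids.
    """
    ss1 = defaultdict(set)
    ss2 = defaultdict(set)
    for w1, w2, seq_1, seq_2 in hits:
        ss1[(w1, w2)].add(seq_1)
        ss2[(w1, w2)].add(seq_2)
    return dict(ss1), dict(ss2)


# Hypothetical hits across three pairs of orthologous promoter regions.
hits = [
    ("TTG", "ACA", "a1", "b1"),
    ("TTG", "ACA", "a2", "b2"),
    ("GGA", "GTT", "a3", "b3"),
]
ss1_example, ss2_example = group_by_pattern(hits)
```

Using sets means a promoter region that yields the same word pair more than once is still counted a single time in its group.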
Figure 4. Extension of shared trinucleotides and classification of related promoter regions. The set SS1 corresponds to n promoter regions of a given bacterium sharing a given pair of trinucleotides t1 and t2. For these n sequences, we compute the probability of obtaining the observed letters at the positions neighbouring t1 and t2. We retain the position associated with the letter that has the lowest probability of occurring as often as observed in this set of n sequences. We then group sequences according to the letters at this position that have a low probability of being obtained (keeping groups of at least eight related sequences). These groups constitute new sets of sequences to be evaluated with the LRT statistical test (see Section "Computing a consensus motif and its statistical evaluation"). "INTERESTING SETS" denotes sets of promoter regions whose shared motif is over-represented in merged upstream sequences.
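The position-selection step in Figure 4 could be sketched as follows. The caption does not state which probability model is used, so this example assumes a binomial upper tail with fixed background letter frequencies; that model, the function names, and the input layout (a mapping from neighbouring position to the letters observed there across the n sequences) are all assumptions of the sketch rather than the article's method.

```python
from collections import Counter
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): an assumed model, not the article's."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def most_surprising_position(columns, background):
    """Pick the (position, letter) whose observed count is least probable.

    `columns` maps a position neighbouring the trinucleotides to a string
    holding the letter seen at that position in each of the n sequences.
    `background` maps each letter to its assumed background frequency.
    Returns (position, letter, probability).
    """
    best = None
    for position, letters in columns.items():
        n = len(letters)
        for letter, count in Counter(letters).items():
            prob = binomial_tail(n, count, background[letter])
            if best is None or prob < best[2]:
                best = (position, letter, prob)
    return best


# Hypothetical columns for n = 10 sequences and a uniform background.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
columns = {-1: "AAAAAAAAAC", 1: "ACGTACGTAC"}
best_example = most_surprising_position(columns, uniform)
```

Here position -1 is selected because nine A's in ten sequences is far less likely under the assumed background than the near-uniform letters at position 1. Sequences would then be grouped by the letter they carry at the selected position.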
Summary of motifs found that are similar to known SigR SFBSs
| SIGffRid motif | % | |||||
| in | ||||||
| 0.49 | 54.69 | 79 | 0.49 | 32 | ||
| 0.48 | 42.97 | 58 | 0.48 | 12 | ||
| in | ||||||
| 0.51 | 30.98 | 38 | 0.51 | |||
| 0.60 | 30.59 | 31 | 0.60 | |||
| 0.44 | 25.36 | 40 | 0.45 | |||
(1) N is the number of occurrences found in merged sequences
(2) % is the proportion of occurrences found in merged sequences (the number of occurrences in merged sequences divided by the number of occurrences found in the whole genome on the direct and reverse strands)
(3) N is the number of occurrences in merged sequences related to a gene over-expressed in microarray experiments under oxidative stress conditions (Paget, personal communication)