Michael J Palumbo, Lee A Newberg.
Abstract
The transcription of a gene from its DNA template into an mRNA molecule is the first, and most heavily regulated, step in gene expression. Especially in bacteria, regulation is typically achieved via the binding of a transcription factor (protein) or small RNA molecule to the chromosomal region upstream of a regulated gene. The protein or RNA molecule recognizes a short, approximately conserved sequence within a gene's promoter region and, by binding to it, either enhances or represses expression of the nearby gene. Since the sought-for motif (pattern) is short and accommodating to variation, computational approaches that scan for binding sites have trouble distinguishing functional sites from look-alikes. Many computational approaches are unable to find the majority of experimentally verified binding sites without also finding many false positives. Phyloscan overcomes this difficulty by exploiting two key features of functional binding sites: (i) these sites are typically more conserved evolutionarily than are non-functional DNA sequences; and (ii) these sites often occur two or more times in the promoter region of a regulated gene. The website is free and open to all users, and there is no login requirement. Address: http://bayesweb.wadsworth.org/phyloscan/.
Year: 2010 PMID: 20435683 PMCID: PMC2896078 DOI: 10.1093/nar/gkq330
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 19.160
Figure 1. Shown are receiver operating characteristic (ROC) curves for Phyloscan as applied to promoter regions containing a pair of full-strength Escherichia coli Crp binding sites, a pair of 1/2-strength sites, and a pair of 1/3-strength sites. The simulated sequence data are for 14 prokaryotic species organized into four clades; the orthologous promoter regions are each 500-nt long and are multiply aligned within each clade, but not between clades. ROC curves are shown for fully enabled Phyloscan, as well as for Phyloscan without the advantage of its multi-clade functionality and for Phyloscan without the advantage of both its multi-clade and its multi-site functionality. (Phyloscan with its multi-clade functionality but without its multi-site functionality is not displayed, because it is nearly indistinguishable from the fully enabled Phyloscan.) A comparison of the ‘1 clade/2 sites’ curves to the ‘1 clade/1 site’ curves shows that there is value in combining evidence from multiple sites within a promoter region, using the Neuwald–Green calculation (23). A comparison of the ‘4 clades’ curves to the ‘1 clade/2 sites’ curves indicates that there is additional value in considering data from multiple clades, using the Bailey–Gribskov calculation (24). For instance, if -value cutoffs are chosen so that the false-positive rate (type I error) is % (i.e. the specificity is %), then Phyloscan correctly classifies % of the full-strength-Crp promoter regions, % of the 1/2-strength regions and % of the 1/3-strength regions. The corresponding numbers for ‘1 clade/2 sites’ are %, % and %. The corresponding numbers for ‘1 clade/1 site’ are %, % and %. See the Phyloscan algorithmics paper (21) for further details.
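The cutoff-based reading of a ROC curve described in the Figure 1 caption can be illustrated with a minimal sketch. It is not Phyloscan's code: the function name `sensitivity_at_fpr` and the example numbers are hypothetical, and the sketch only assumes that each promoter region has already been assigned a score (here a p-value) and a known true label.

```python
def sensitivity_at_fpr(pos_pvalues, neg_pvalues, max_fpr):
    """Pick the loosest p-value cutoff whose false-positive rate on the
    negative set does not exceed max_fpr, and report the fraction of the
    positive set that falls below that cutoff (the sensitivity).

    Hypothetical helper for illustration; not part of Phyloscan.
    """
    if not pos_pvalues or not neg_pvalues:
        raise ValueError("both sets must be non-empty")
    # Number of negatives we may (wrongly) call positive.
    allowed = int(max_fpr * len(neg_pvalues))
    ranked = sorted(neg_pvalues)
    # A region is called positive when its p-value is strictly below the
    # cutoff, so at most `allowed` negatives can pass.
    cutoff = ranked[allowed] if allowed < len(ranked) else float("inf")
    true_pos = sum(1 for p in pos_pvalues if p < cutoff)
    return cutoff, true_pos / len(pos_pvalues)


# Made-up scores: three regions with real sites, four without.
positives = [1e-6, 0.01, 0.3]
negatives = [0.9, 0.5, 0.2, 0.04]
cutoff, sensitivity = sensitivity_at_fpr(positives, negatives, max_fpr=0.25)
```

Sweeping `max_fpr` from 0 to 1 and plotting the resulting sensitivities would trace out one ROC curve of the kind shown in the figure.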
Figure 2. A run with the example data set provided by our web server, for identifying Escherichia coli binding sites for Crp, gives the ‘mtlA’ gene family as the best result. The combined q-value for this gene family, 3.544×10^−16, indicates that the user who takes all results of this quality or better (in this case, just the one result) will, ‘on average’, find that <10^−15 of the results are false discoveries. The combined p-value, 8.643×10^−18, indicates that if the user had looked at only the mtlA gene family, and believed the family to be non-functional for Crp binding, then the chance that it would accidentally look this functional for Crp binding is <10^−17. The combined p-value is computed from the promoter p-values via the technique of Bailey and Gribskov (24). The promoter p-values, 7.116×10^−13, 5.337×10^−7 and 2.331×10^−2, arise from the scans of the three user-supplied alignment blocks for mtlA: (i) E. coli aligned to Salmonella enterica serovar Typhi (S. typhi), (ii) Yersinia pestis and (iii) Vibrio cholerae, respectively. These promoter p-values are constructed from the best two, the best two and the best one sites found, respectively, using the technique of Neuwald and Green (23). The best two sites in the E. coli–S. typhi aligned sequence data have -values of and ; the user can display them in context in, e.g., the E. coli sequence, by clicking on the position numbers 170 and 53. The field names in yellow are links to help for these fields.
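As a rough illustration of how independent p-values can be merged into a single combined p-value, the sketch below uses the product-of-p-values formula commonly attributed to Bailey and Gribskov: for n independent p-values with product P, the combined p-value is P · Σ_{i=0}^{n−1} (−ln P)^i / i!. This is an assumption about the general technique, not a transcription of the web server's code; the function name `combined_p_value` is hypothetical, and the result for the three promoter p-values quoted in the caption need not match the reported combined p-value exactly, since the server's precise procedure is given in the algorithmics paper (21).

```python
import math


def combined_p_value(p_values):
    """Combine independent p-values through their product.

    Under the null hypothesis each p-value is uniform on (0, 1], and the
    probability that the product of n such values is at most P equals
    P * sum_{i=0}^{n-1} (-ln P)^i / i!.

    Hypothetical helper for illustration; not part of Phyloscan.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    # Work with logarithms so that very small products do not underflow
    # before the correction factor is applied.
    log_product = sum(math.log(p) for p in p_values)
    x = -log_product
    term = 1.0   # holds x^i / i!, starting at i = 0
    total = 1.0
    for i in range(1, n):
        term *= x / i
        total += term
    return math.exp(log_product) * total


# Promoter p-values quoted in the Figure 2 caption.
promoter_p = [7.116e-13, 5.337e-7, 2.331e-2]
combined = combined_p_value(promoter_p)
```

With a single input the function returns that p-value unchanged, which is a quick sanity check on the formula.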