C Steven Carmack, Lee Ann McCue, Lee A Newberg, Charles E Lawrence.
Abstract
BACKGROUND: When transcription factor binding sites are known for a particular transcription factor, it is possible to construct a motif model that can be used to scan sequences for additional sites. However, few statistically significant sites are revealed when a transcription factor binding site motif model is used to scan a genome-scale database.
Year: 2007 PMID: 17244358 PMCID: PMC1794230 DOI: 10.1186/1748-7188-2-1
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Figure 1. Crp Binding Site Motif and Generation of Weaker Versions. The logo in panel A indicates the Crp motif used to scan for Crp binding sites. It is also used to generate a pair of full-strength Crp sites in the synthetic sequence data. The binding site equilibria were calculated from sequence data aligned by the Gibbs Recursive Sampler [49], and were plotted using publicly available software [27]. The logo in panel B indicates the motif used to generate 1/2-strength Crp sites. It was generated by raising each nucleotide probability to the power 0.637 and then rescaling so that the probabilities of the four nucleotides in each motif column sum to 1.0. The exponent was chosen so that the average information content (i.e., "bits") would be half the value for the full-strength sites. The logo in panel C is the 1/3-strength Crp motif, generated with an exponent of 0.507 so that the average information content would be one-third of the full-strength value.
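The weakening procedure described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes information content is measured against a uniform background (2 bits maximum per column), and the function names and the toy motif are hypothetical. The bisection search is one plausible way to find an exponent such as 0.637; the caption does not say how the exponent was obtained.

```python
import math

def information_content(column):
    """Bits of information in one motif column (assumes a uniform background)."""
    return 2.0 + sum(p * math.log2(p) for p in column if p > 0)

def mean_information(motif):
    """Average information content over all motif columns."""
    return sum(information_content(col) for col in motif) / len(motif)

def weaken_motif(motif, exponent):
    """Raise each probability to `exponent`, then rescale each column to sum to 1."""
    weakened = []
    for column in motif:
        powered = [p ** exponent for p in column]
        total = sum(powered)
        weakened.append([p / total for p in powered])
    return weakened

def find_exponent(motif, fraction, tol=1e-9):
    """Bisection for the exponent in (0, 1) giving `fraction` of the original bits."""
    target = fraction * mean_information(motif)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean_information(weaken_motif(motif, mid)) < target:
            lo = mid  # still too weak: move toward the full-strength motif
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical three-column motif (order A, C, G, T); not the Crp motif.
toy_motif = [
    [0.85, 0.05, 0.05, 0.05],
    [0.10, 0.10, 0.70, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]

half_exponent = find_exponent(toy_motif, 0.5)
half_motif = weaken_motif(toy_motif, half_exponent)
```

The exponent found for the toy motif will differ from 0.637, because the value depends on the motif being weakened.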
Figure 2. Phylogenetic Tree of Fourteen Prokaryotes. This tree of fourteen prokaryotes specifies the phylogenetic relationship of the species in our simulated sequence data. The tree is realistic, but approximate. The branch lengths represent the number of substitutions (including subsequent substitutions at a given sequence position) expected per 10,000 nucleotides not subject to selection pressures.
Figure 3. ROC Curves for PhyloScan and MONKEY. Shown are Receiver Operating Characteristic (ROC) curves for algorithms applied to intergenic regions containing a pair of full-strength Crp sites, a pair of 1/2-strength sites, and a pair of 1/3-strength sites. The simulated sequence data are for fourteen prokaryotic species organized into four clades; the orthologous intergenic sequences are 500 bp and are multiply-aligned within each clade but not between clades. ROC curves are shown for fully enabled PhyloScan and MONKEY. Additionally, ROC curves for PhyloScan applied to only the Enterobacteriales clade are shown. The ROC curves for PhyloScan with its multiple-clades capability enabled but its multiple-sites capability disabled are not shown because they are nearly indistinguishable from those of the fully enabled PhyloScan. A comparison of the "PhyloScan (1 clade)" curves to the "MONKEY (1 clade)" curves shows that there is value in combining evidence from multiple sites within an intergenic region using the Neuwald-Green calculation. A comparison of the "PhyloScan (4 clades)" curves to the "PhyloScan (1 clade)" curves indicates that there is additional value in considering data from multiple clades. For instance, if p-value cutoffs are chosen so that the type I error is 0.1% (i.e., the specificity is 99.9%), then PhyloScan correctly classifies 99.85% of the full-strength-Crp intergenic regions, 72.68% of the 1/2-strength regions, and 32.64% of the 1/3-strength regions. The corresponding numbers for "PhyloScan (1 clade)" are 96.98%, 33.01%, and 10.11%. The corresponding numbers for MONKEY are 79.02%, 21.66%, and 6.33%. It is possible that sensitivities for the four-clades curves would have been even higher if we had not prohibited the non-Enterobacteriales clades from rescuing intergenic regions in the Enterobacteriales clade that had failed to pass our 0.05 p-value cutoff.
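The way a sensitivity is read off a ROC curve at a fixed type I error, as in the 0.1% example in the Figure 3 caption, can be illustrated with a small sketch. It assumes one p-value per intergenic region, with regions called positive when the p-value is at or below a cutoff. The function names and the toy p-values are hypothetical and are not taken from the simulation.

```python
def roc_points(null_p_values, site_p_values):
    """Empirical (false-positive rate, sensitivity) pairs, one per candidate cutoff."""
    cutoffs = sorted(set(null_p_values) | set(site_p_values))
    points = [(0.0, 0.0)]  # cutoff below every p-value: nothing is called positive
    for cutoff in cutoffs:
        fpr = sum(p <= cutoff for p in null_p_values) / len(null_p_values)
        tpr = sum(p <= cutoff for p in site_p_values) / len(site_p_values)
        points.append((fpr, tpr))
    return points

def sensitivity_at(null_p_values, site_p_values, type_one_error):
    """Best sensitivity among cutoffs whose false-positive rate is within the limit."""
    return max(
        tpr
        for fpr, tpr in roc_points(null_p_values, site_p_values)
        if fpr <= type_one_error
    )

# Hypothetical p-values: ten regions without sites, five regions with sites.
null_p = [0.90, 0.50, 0.30, 0.20, 0.04, 0.60, 0.70, 0.80, 0.95, 0.10]
site_p = [0.001, 0.01, 0.03, 0.05, 0.40]

strict = sensitivity_at(null_p, site_p, 0.0)   # no false positives allowed
relaxed = sensitivity_at(null_p, site_p, 0.1)  # one of ten nulls may pass
```

With these toy values, allowing no false positives recovers three of the five site-containing regions, and allowing a 10% type I error recovers four.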
Table 1. Summary of PhyloScan Predictions
| | C1 | C2 | C3 | C4 | C5 | C6 |
| Database | Full | Full | Reduced | Reduced | Reduced & Aligned | Reduced & Aligned |
| Indep. Species | No | Yes | No | Yes | No | Yes |
| Crp Known | 1(2) | 7(10) | 1(2) | 8(12) | 4(6) | 11(16) |
| Crp Novel | 0(0) | 16(20) | 0(0) | 16(18) | 6(7) | 18(21) |
| PurR Known | 1(1) | 9(9) | 1(1) | 11(11) | 9(9) | 12(12) |
| PurR Novel | 0(0) | 4(5) | 0(0) | 4(5) | 3(4) | 6(7) |
This table shows the number of E. coli intergenic regions predicted by PhyloScan to contain Crp or PurR binding sites, with the total number of sites predicted within parentheses. Column C1 is for a scan of the full set of E. coli intergenic sequence data (excluding the S. typhi sequence data and the sequence data from the other, independent clades). Column C3 is for a scan of only that E. coli sequence that is alignable with S. typhi; the S. typhi sequence data continue to be excluded. Column C5 is for a scan of the aligned E. coli-S. typhi sequence data. Columns C2, C4, and C6 are like Columns C1, C3, and C5, respectively, but the sequence data from the independent clades are also incorporated. Observing the lack of improvement of Column C3 over Column C1 (or the meager improvement of C4 over C2), we conclude that there is minimal gain in sensitivity from considering only E. coli sequence that is alignable with S. typhi, when not actually using the aligned S. typhi sequence data. Observing the modest improvement of C5 over C3 (or C6 over C4), we conclude that incorporating the aligned S. typhi sequence gives a moderate gain in sensitivity. Observing the large improvement of C2 over C1 (or C4 over C3, or C6 over C5), we conclude that incorporating the data from species that are not alignable with E. coli gives a significant gain in sensitivity.
Notes: Full: database of 2379 intergenic sequences from E. coli [see Additional file 2]. Reduced: database of E. coli sequences (reduced search space) extracted from the E. coli-S. typhi database (see Real Sequence Data in Results). Reduced & Aligned: database of E. coli-S. typhi aligned intergenic sequences (see Real Sequence Data in Results). Each entry gives the number of E. coli intergenic regions predicted by PhyloScan to contain Crp or PurR binding sites, with the total number of binding sites detected in parentheses; "Known" rows count sites that correspond to known, experimentally verified transcription factor binding sites, and "Novel" rows count sites that are not yet verified.
Figure 4. Crp-Significant Intergenic Regions Found. When counting Crp-significant intergenic regions, comparison of the bars labeled "+" (with the unalignable sequences) relative to those labeled "-" (without the unalignable sequences) indicates that the largest gain in sensitivity comes from the use of unalignable, evolutionarily distant sequences. The left part of this figure shows the sensitivity for the scan of E. coli data only. The center part of this figure shows the sensitivity from the scan of only those E. coli sequence data that are alignable with S. typhi. The right part of this figure shows the sensitivity from the scan of E. coli-S. typhi aligned sequence data.
Figure 5. PurR-Significant Intergenic Regions Found. The results for PurR are similar to those for Crp. See the caption of Figure 4.
Table 2. Top 20 Predictions by PhyloScan
| | C1 | | C2 | | C3 | | C4 | | C5 | | C6 | |
| Database | Full | | Full | | Reduced | | Reduced | | Reduced & Aligned | | Reduced & Aligned | |
| Indep. Species | No | | Yes | | No | | Yes | | No | | Yes | |
| Rank | Gene | log(q) | Gene | log(q) | Gene | log(q) | Gene | log(q) | Gene | log(q) | Gene | log(q) |
| 1 | yibI | -4.65 | cdd | -9.28 | mtlA | -5.14 | mtlA | -9.76 | mtlA | -7.66 | mtlA | -12.15 |
| 2 | yqcE | -2.86 | glpT | -7.21 | ygcW | -2.89 | cdd | -9.60 | yjcB | -4.55 | glpA | -9.19 |
| 3 | b1904 | -2.61 | mglB | -6.01 | yjcB | -2.62 | glpA | -8.31 | gcd | -3.99 | cdd | -9.16 |
| 4 | fucA | -2.51 | yibI | -5.26 | yjiY | -2.60 | mglB | -6.53 | b2146 | -3.97 | mglB | -7.60 |
| 5 | deaD | -2.51 | yjiY | -4.57 | b2146 | -2.53 | gapA | -5.21 | fucA | -3.93 | udp | -6.26 |
| 6 | yjiY | -2.42 | hemC | -4.38 | fucA | -2.51 | udp | -5.17 | ygcW | -3.42 | gapA | -6.02 |
| 7 | cdd | -2.29 | deaD | -4.35 | deaD | -2.47 | yjiY | -4.79 | flhD | -3.03 | yjcB | -5.09 |
| 8 | yeaA | -2.22 | ysgA | -4.33 | cdd | -2.31 | cyaA | -4.70 | gapA | -3.03 | cyaA | -5.04 |
| 9 | yhcR | -2.06 | yhcR | -3.99 | gapA | -2.22 | deaD | -4.37 | ycdZ | -3.01 | malE | -4.83 |
| 10 | ycdZ | -1.96 | yqcE | -3.56 | qseA | -2.03 | malE | -4.29 | udp | -2.78 | ycdZ | -4.69 |
| 11 | b2736 | -1.87 | adhE | -3.47 | ycdZ | -1.98 | ygcW | -3.63 | b2248 | -2.76 | adhE | -4.56 |
| 12 | uxaC | -1.81 | ycdZ | -3.45 | mglB | -1.90 | adhE | -3.58 | glpA | -2.76 | b2146 | -4.53 |
| 13 | ysgA | -1.77 | yeaA | -3.44 | udp | -1.86 | ycdZ | -3.52 | mglB | -2.73 | fucA | -4.46 |
| 14 | glpT | -1.75 | mlc | -3.37 | uxaC | -1.85 | mlc | -3.48 | qseA | -2.68 | pckA | -4.09 |
| 15 | mglB | -1.63 | b1904 | -3.31 | glpA | -1.84 | fucA | -3.32 | pckA | -2.36 | aer | -3.97 |
| 16 | pckA | -1.39 | fucA | -3.23 | pckA | -1.45 | yjcB | -3.32 | adhE | -2.14 | ygcW | -3.78 |
| 17 | serA | -1.23 | b2736 | -3.18 | malE | -1.36 | pckA | -3.23 | aer | -2.13 | gcd | -3.67 |
| 18 | aer | -1.23 | pckA | -3.17 | aer | -1.32 | aer | -3.17 | cdd | -2.10 | deaD | -3.65 |
| 19 | adhE | -1.22 | aer | -3.08 | serA | -1.32 | qseA | -3.07 | deaD | -2.04 | serA | -3.62 |
| 20 | mlc | -1.01 | yjeG | -3.05 | adhE | -1.28 | uxaC | -3.07 | uxaC | -2.02 | mlc | -3.62 |
| # Diffs from C6 | 10 | | 11 | | 3 | | 3 | | 4 | | 0 | |
Because it is sometimes instructive to examine a fixed number of top hits regardless of the reported q-values, in this table we compare the six approaches' best 20 intergenic regions for Crp. By comparing each column to Column C6, which is the best approach we employed, we see that the C1-C5 approaches give significantly different q-values for, and orderings of, the predicted regulated genes. As indicated in the bottom row, the C1-C5 approaches miss several of the top-20 genes reported in C6, replacing them with genes that did not make the C6 top-20 list. In particular, although it uses all of the sequence data except S. typhi, C2 is significantly different from C6. Furthermore, although C3 has few differences from C6 in the set of genes indicated, the q-values of C3 are considerably worse and the gene order is substantially rearranged. These data suggest that the ability to simultaneously handle both aligned and unaligned data is important in obtaining accurate predictions. Notes: See the caption notes for Table 1. Also see the Table 1 caption for descriptions of Columns C1-C6.
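The bottom row of the table can be checked with a short set comparison. The sketch below assumes that "# Diffs from C6" counts the C6 top-20 genes that are absent from a column's own top-20 list; that reading is an interpretation of the caption, and the function name is hypothetical. The gene lists are copied from the C5 and C6 columns above.

```python
def diffs_from_reference(top_genes, reference_genes):
    """Count reference genes that do not appear in the other top-N list."""
    return len(set(reference_genes) - set(top_genes))

# Top-20 gene lists for Crp, copied from columns C5 and C6 of the table.
c5_top20 = [
    "mtlA", "yjcB", "gcd", "b2146", "fucA", "ygcW", "flhD", "gapA", "ycdZ", "udp",
    "b2248", "glpA", "mglB", "qseA", "pckA", "adhE", "aer", "cdd", "deaD", "uxaC",
]
c6_top20 = [
    "mtlA", "glpA", "cdd", "mglB", "udp", "gapA", "yjcB", "cyaA", "malE", "ycdZ",
    "adhE", "b2146", "fucA", "pckA", "aer", "ygcW", "gcd", "deaD", "serA", "mlc",
]

c5_diffs = diffs_from_reference(c5_top20, c6_top20)
missing_from_c5 = sorted(set(c6_top20) - set(c5_top20))
print(c5_diffs, missing_from_c5)
```

Under this reading, the C5 count comes out to 4 (cyaA, malE, mlc, and serA are missing), which agrees with the value in the bottom row.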
Figure 6. Data Processing Flow Chart for PhyloScan. An overview of the steps taken to locate Crp and PurR transcription factor binding sites in E. coli intergenic regions. The species examined were Escherichia coli (EC), Salmonella enterica serovar Typhi (S. typhi) (ST), Yersinia pestis (YP), Haemophilus influenzae (HI), Vibrio cholerae (VC), Shewanella oneidensis (SO), and Pseudomonas aeruginosa (PA).
Figure 7. PurR Binding Site Motif. Shown is the PurR motif used to scan for PurR binding sites. The binding site equilibria were calculated from sequence data aligned by the Gibbs Recursive Sampler [49], and were plotted using publicly available software [27].