Mikael Bodén, Timothy L Bailey.
Abstract
The roles and target genes of many transcription factors (TFs) are still unknown. To predict the roles of TFs, we present a computational method for associating Gene Ontology (GO) terms with TF-binding motifs. The method works by ranking all genes as potential targets of the TF and reporting GO terms that are significantly associated with highly ranked genes. We also present an approach whereby these predicted GO terms can be used to improve predictions of TF target genes. This approach uses a novel gene-scoring function that reflects the insight that genes annotated with GO terms predicted to be associated with the TF are more likely to be its targets. We construct validation sets of GO terms highly associated with known targets of various yeast and human TFs. On the yeast reference sets, our prediction method identifies at least one correct GO term for 73% of the TFs, predicts 49% of the correct GO terms, and almost one-third of its predicted GO terms are correct. Results on human reference sets are similarly encouraging. Validation of our target gene prediction method shows that its accuracy exceeds that of simple motif scanning.
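The idea of reporting GO terms that are over-represented among highly ranked genes can be illustrated with a minimal sketch. This is not the statistic used in the paper: it swaps in a simple hypergeometric test on the top `top_n` genes, and it assumes an E-value equal to the p-value multiplied by the number of terms tested. The function names, the `top_n` parameter and the toy gene and term identifiers are all hypothetical.

```python
from math import comb


def hypergeom_tail(k, n_top, n_annotated, n_total):
    """P(X >= k) when n_top genes are drawn from n_total genes,
    n_annotated of which carry the GO term."""
    upper = min(n_top, n_annotated)
    hits = sum(
        comb(n_annotated, i) * comb(n_total - n_annotated, n_top - i)
        for i in range(k, upper + 1)
    )
    return hits / comb(n_total, n_top)


def enriched_terms(ranked_genes, annotations, top_n, e_cutoff):
    """Return (term, E-value) pairs for terms over-represented among
    the top_n genes of a ranking (best candidate target first)."""
    universe = set(ranked_genes)
    top = set(ranked_genes[:top_n])
    results = []
    for term, genes in annotations.items():
        genes = genes & universe
        k = len(genes & top)
        p = hypergeom_tail(k, top_n, len(genes), len(ranked_genes))
        # Assumption: E-value = p-value x number of GO terms tested.
        e_value = p * len(annotations)
        if e_value <= e_cutoff:
            results.append((term, e_value))
    return sorted(results, key=lambda pair: pair[1])


# Toy data (hypothetical): ten genes ranked by motif score, two GO terms.
ranked = [f"g{i}" for i in range(1, 11)]
toy_annotations = {
    "GO:A": {"g1", "g2", "g3"},   # all three sit at the top of the ranking
    "GO:B": {"g8", "g9", "g10"},  # all three sit at the bottom
}
hits = enriched_terms(ranked, toy_annotations, top_n=3, e_cutoff=0.05)
```

In this toy ranking only `GO:A` passes the cutoff, because all of its genes fall within the top three positions.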
Year: 2008 PMID: 18544606 PMCID: PMC2475605 DOI: 10.1093/nar/gkn374
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Gold standards of TF–GO term associations used in this study
Panel (a): Yeast gold standard

| Target level | E-value | TFs | GO terms per TF |
|---|---|---|---|
| 1 | ≤0.05 | 70 | 13.1 |
| 2 | ≤0.01 | 57 | 12.4 |
| 3 | ≤0.001 | 43 | 12.0 |
Panel (a) shows, for each target level in the yeast transcription network, the E-value range defining the level, the number of TFs with one or more significant GO terms and the average number of GO terms per TF. Panel (b) shows the human TF gene name, the number of known target genes and the number of significantly over-represented GO terms (E ≤ 0.05) associated with that TF.
Accuracy of predicted TF–GO term associations in yeast
| Scoring method | Pred. | 1 TP | Rec. | Prec. | AUC50 |
|---|---|---|---|---|---|
| Hit-Count | | | | | |
| 10⁻³ Local | 38.4 | 0.64 | 0.33 | 0.12 | 0.31 |
| 10⁻³ Global | 34.2 | 0.76 | 0.44 | 0.16 | 0.39 |
| 10⁻⁴ Local | 16.3 | 0.73 | 0.49 | *0.37 | |
| 10⁻⁴ Global | 13.0 | 0.64 | 0.40 | *0.37 | 0.54 |
| 10⁻⁵ Local | 3.1 | 0.29 | 0.11 | *0.67 | 0.24 |
| 10⁻⁵ Global | 2.3 | 0.22 | 0.09 | *0.63 | 0.21 |
| Avg-Odds | | | | | |
| Local | 47.8 | | 0.63 | 0.17 | 0.56 |
| Global | 54.0 | 0.87 | 0.66 | 0.15 | 0.56 |
| ZS Local | 39.1 | 0.84 | 0.63 | 0.20 | 0.56 |
| ZS Global | 39.9 | 0.84 | 0.62 | 0.20 | |
| ZM Local | 36.7 | | 0.60 | | 0.56 |
| ZM Global | 37.1 | 0.84 | 0.59 | 0.20 | 0.55 |
| Max-Odds | | | | | |
| Local | 39.9 | 0.82 | 0.60 | 0.19 | 0.54 |
| Global | 44.2 | 0.84 | 0.63 | 0.17 | 0.54 |
The average results for yeast TFs using different methods of scoring target genes. Each row shows results at a significance level of E = 10, as well as overall results measured using the area under the ROC50 curve. The columns indicate the number of predictions returned (Pred.), the probability of predicting at least one true positive (1 TP), the recall (Rec.), the precision (Prec.) and AUC50. The best value for each metric is shown in bold-face. When one or more TFs render no predictions they are excluded from the precision average (marked with ‘*’; based on a different data source, such values are not included when determining the best for each metric).
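The four per-cutoff metrics described in the caption can be sketched as follows. This is an illustrative reading of the caption rather than the authors' evaluation code: it assumes each metric is computed per TF and then averaged, with TFs that return no predictions left out of the precision average only. The function name `evaluate` and the toy TF and term labels are hypothetical.

```python
def evaluate(predicted, gold):
    """predicted, gold: dicts mapping a TF name to a set of GO terms.
    Every TF in gold is assumed to have at least one GO term."""
    n_pred, one_tp, recalls, precisions = [], [], [], []
    for tf, truth in gold.items():
        pred = predicted.get(tf, set())
        tp = len(pred & truth)
        n_pred.append(len(pred))
        one_tp.append(1.0 if tp > 0 else 0.0)
        recalls.append(tp / len(truth))
        if pred:
            # TFs with no predictions are excluded from the precision average.
            precisions.append(tp / len(pred))

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return {
        "Pred.": mean(n_pred),
        "1 TP": mean(one_tp),
        "Rec.": mean(recalls),
        "Prec.": mean(precisions),
    }


# Toy example (hypothetical labels): TF2 returns no predictions.
toy_gold = {"TF1": {"a", "b"}, "TF2": {"c"}}
toy_pred = {"TF1": {"a", "x"}, "TF2": set()}
scores = evaluate(toy_pred, toy_gold)
```

Here `TF2` contributes to the prediction count, 1 TP and recall averages, but not to the precision average, which mirrors the starred entries in the table.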
Accuracy of predicted TF–GO term associations in human using the avg-odds global method
| Gene name | Pred. (E = 10) | 1 TP (E = 10) | Rec. (E = 10) | Prec. (E = 10) | Pred. (E = 0.05) | 1 TP (E = 0.05) | Rec. (E = 0.05) | Prec. (E = 0.05) | AUC50 |
|---|---|---|---|---|---|---|---|---|---|
| NFKB1 | 120 | 1 | 0.19 | 0.32 | 18 | 1 | 0.07 | 0.78 | 0.10 |
| SRY | 147 | 1 | 0.28 | 0.15 | 28 | 1 | 0.08 | 0.21 | 0.07 |
| CREB1 | 212 | 1 | 0.69 | 0.52 | 78 | 1 | 0.37 | 0.76 | 0.40 |
| TP53 | 50 | 1 | 0.03 | 0.02 | 2 | 0 | 0.00 | 0.00 | 0.02 |
| Mean | 132.2 | 1 | 0.30 | 0.25 | 31.5 | 0.75 | 0.13 | 0.44 | 0.15 |
Each row shows results at significance level of E = 10 and E = 0.05, as well as overall results measured using the area under the ROC50 curve. The columns indicate the number of predictions returned (Pred.), the probability of predicting at least one true positive (1 TP), the recall (Rec.), the precision (Prec.) and AUC50, for four human TFs.
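As a quick arithmetic check, the Mean row can be recomputed from the four per-TF rows of the table. The sketch below simply averages each column; the variable names are illustrative, and the values are copied from the table.

```python
# Columns: Pred., 1 TP, Rec., Prec. at E = 10, then the same four
# at E = 0.05, then AUC50 (values copied from the table above).
rows = {
    "NFKB1": (120, 1, 0.19, 0.32, 18, 1, 0.07, 0.78, 0.10),
    "SRY":   (147, 1, 0.28, 0.15, 28, 1, 0.08, 0.21, 0.07),
    "CREB1": (212, 1, 0.69, 0.52, 78, 1, 0.37, 0.76, 0.40),
    "TP53":  (50, 1, 0.03, 0.02, 2, 0, 0.00, 0.00, 0.02),
}

# Transpose the rows into columns and average each column.
columns = list(zip(*rows.values()))
means = [sum(col) / len(col) for col in columns]

for value in means:
    print(f"{value:.4f}")
```

The column averages agree with the Mean row to within rounding. The mean number of predictions at E = 10 is 132.25, and TP53 still counts toward the precision average at E = 0.05 because it returns two predictions there.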
Accuracy of predicted TF–GO term associations at different E-value cutoffs using the Avg-Odds global method
| E-value cutoff | Yeast Pred. | Yeast 1 TP | Yeast Rec. | Yeast Prec. | Human Pred. | Human 1 TP | Human Rec. | Human Prec. |
|---|---|---|---|---|---|---|---|---|
| 0.01 | 5.4 | 0.56 | 0.27 | *0.61 | 22.8 | 0.75 | 0.10 | *0.60 |
| 0.05 | 8.2 | 0.58 | 0.32 | *0.49 | 31.5 | 0.75 | 0.13 | 0.44 |
| 0.1 | 10.0 | 0.60 | 0.36 | *0.44 | 36.8 | 0.75 | 0.14 | 0.40 |
| 1 | 20.2 | 0.73 | 0.49 | 0.27 | 72.3 | 1 | 0.22 | 0.33 |
| 10 | 54.0 | 0.87 | 0.66 | 0.15 | 132.3 | 1 | 0.30 | 0.25 |
| 50 | 144.6 | 0.93 | 0.78 | 0.07 | 250.3 | 1 | 0.37 | 0.17 |
The columns indicate the E-value cutoff, number of TF–GO term association predictions returned by the method at that cutoff, the probability of predicting at least one true positive (1 TP; higher is better), the recall (higher is better), and the precision (higher is better). The values for yeast are averages calculated from the 43 TFs for which target GO terms exist. The values for human are averages calculated from NFKB1, SRY, CREB1 and TP53. When one or more TFs render no predictions they are excluded from the precision average (marked with ‘*’).
Effect on accuracy of using the Z-score variant of Avg-Odds global scores in human
| Gene name | AUC50 (Avg-Odds) | AUC50 (Avg-Odds ZS) | Motif content | Target CpG islands (%) |
|---|---|---|---|---|
| NFKB1 | 0.10 | 0.16 | 76% GC | 27 |
| SRY | 0.07 | 0.34 | 76% AT | 40 |
| CREB1 | 0.40 | 0.05 | 59% GC | 44 |
| TP53 | 0.02 | 0.00 | 60% GC | 48 |
The table compares the AUC50 accuracy score of predicted TF–GO term associations in human using the Avg-Odds global method and its shuffled-sequence Z-score variant (ZS) for four TFs. For each TF, the average GC- (or AT-) content of the motif (the average total probability of the two bases in the PWM) and the percentage of the known targets of the TF that are CpG islands [according to the definition of Takai and Jones (24)] are also shown.
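The motif content figure, described above as the average total probability of the two bases in the PWM, can be computed along these lines. This is a minimal sketch that assumes the PWM is stored as one probability dictionary per motif position; the function name and the two-position toy PWM are hypothetical.

```python
def base_pair_content(pwm, bases=("G", "C")):
    """Average, over motif positions, of the summed probability of the
    given two bases. Use bases=("A", "T") for AT-content."""
    return sum(sum(column[b] for b in bases) for column in pwm) / len(pwm)


# Toy PWM (hypothetical): each column's probabilities sum to 1.
toy_pwm = [
    {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
gc = base_pair_content(toy_pwm)
at = base_pair_content(toy_pwm, bases=("A", "T"))
```

For this toy motif the GC-content is 65% and the AT-content is 35%, so the two values sum to 1 as expected.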
Figure 1. Accuracy (ROC50) of predicting yeast TF target genes. Curves show the ROC50 plots for three methods of predicting target genes of 63 yeast TFs. Error bars show the standard error.
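For readers unfamiliar with the ROC50 score, the sketch below computes the area under the ROC curve truncated at the first 50 false positives and normalised to the range 0 to 1. It follows the commonly used ROC-n definition, not necessarily the exact procedure used for the figure. The function name is hypothetical, and the handling of rankings that contain fewer than `n` negatives (the unused false-positive slots receive full credit) is an assumption.

```python
def roc_n(ranked_labels, n=50):
    """ranked_labels: booleans ordered from best to worst score,
    True marking a known positive (e.g. a known target gene).
    Returns the ROC-n score, between 0 and 1."""
    total_pos = sum(ranked_labels)
    if total_pos == 0:
        return 0.0
    tp = fp = area = 0
    for is_positive in ranked_labels:
        if is_positive:
            tp += 1
        else:
            fp += 1
            area += tp  # positives ranked above this false positive
            if fp == n:
                break
    # Assumption: if fewer than n negatives exist, the remaining
    # false-positive slots are credited with all positives found.
    area += (n - fp) * tp
    return area / (n * total_pos)


perfect = roc_n([True, True, False, False], n=2)
mixed = roc_n([True, False, True, False], n=2)
worst = roc_n([False, False, True, True], n=2)
```

A ranking that places every positive ahead of the first false positive scores 1, and one that places `n` false positives ahead of every positive scores 0.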