Victor G Levitsky, Elena V Ignatieva, Elena A Ananko, Igor I Turnaev, Tatyana I Merkulova, Nikolay A Kolchanov, T C Hodgman.
Abstract
BACKGROUND: Reliable transcription factor binding site (TFBS) prediction methods are essential for computer annotation of large amounts of genome sequence data. However, current methods to predict TFBSs are hampered by the high false-positive rates that occur when only sequence conservation at the core binding sites is considered.
Year: 2007 PMID: 18093302 PMCID: PMC2265442 DOI: 10.1186/1471-2105-8-481
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Recognition performance for different PWMs. Each vertical block shows the performance of five different PWM algorithms: BVH (Berg and von Hippel) [2], LOD (log-odds) [63], MCH (MATCH) [62], NLG (natural logarithm) and OPT (natural logarithm, optimized matrix length) for each TF. The vertical axis shows the false-positive rate (logarithmic scale) for that algorithm at true-positive rates defined in the caption at the top of the figure. The upper and lower plots compare the algorithms using 15 nt mononucleotide and 20 nt dinucleotide PWMs respectively.
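To make the idea of a mononucleotide PWM concrete, the following is a minimal sketch of a generic log-odds matrix built from aligned sites and used to scan a sequence. It is not claimed to reproduce any of the five algorithms compared in the figure. The pseudocount, the uniform background and all function names are assumptions introduced for illustration only.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_log_odds_pwm(sites, pseudocount=1.0, background=None):
    """Build a mononucleotide log-odds PWM from equal-length aligned sites."""
    if background is None:
        # Assumption: uniform background frequencies.
        background = {base: 0.25 for base in ALPHABET}
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        column = {}
        for base in ALPHABET:
            # Assumption: additive pseudocount to avoid log(0).
            freq = (counts[base] + pseudocount) / total
            column[base] = math.log(freq / background[base])
        pwm.append(column)
    return pwm


def score_window(pwm, window):
    """Sum the per-position weights for one window of the matrix length."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, sequence):
    """Return (start, score) for every window position in the sequence."""
    width = len(pwm)
    return [
        (i, score_window(pwm, sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    ]
```

With a toy training set such as `["ACGT", "ACGT", "ACGA"]`, scanning `"GGACGTGG"` gives the highest score at the window that starts at index 2, where the sequence matches the training sites.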
Average ranking of 5 monoPWM models (BVH, LOD, MCH, NLG and optimized NLG (OPT)) calculated for corresponding FP rates (matrix length 15 nt).
| TP rate, % | BVH | LOD | MCH | NLG | OPT |
| 50 | 3.78 | 2.00 | 4.78 | 2.78 | 1.67 |
| 60 | 3.11 | 2.44 | 4.67 | 2.44 | 2.33 |
| 70 | 4.11 | 2.89 | 4.22 | 2.22 | 1.56 |
| 80 | 3.11 | 2.78 | 4.56 | 3.00 | 1.56 |
| 90 | 4.11 | 3.00 | 2.89 | 3.00 | 2.00 |
| 100 | 3.44 | 3.22 | 3.67 | 2.44 | 2.22 |
Average ranks were calculated across nine TFBS types (see Methods, Sequence data).
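The averaging behind such a table can be sketched as follows: for each TF, the models are ranked by FP rate at a fixed TP rate (lowest FP rate gets rank 1), and the ranks are then averaged over the TFs. The function names and the tie rule (tied models share the mean rank) are assumptions, since the source does not state how ties were handled. As a sanity check, each row of the table sums to about 15, as expected for ranks 1 to 5.

```python
def rank_with_ties(values):
    """Return 1-based ranks (lowest value = rank 1); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the current one.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def average_ranks(fp_rates_by_tf, models):
    """Average each model's rank over all TFs.

    fp_rates_by_tf maps a TF name to a dict {model name: FP rate}.
    """
    totals = {model: 0.0 for model in models}
    for fp_rates in fp_rates_by_tf.values():
        ranks = rank_with_ties([fp_rates[model] for model in models])
        for model, rank in zip(models, ranks):
            totals[model] += rank
    n_tfs = len(fp_rates_by_tf)
    return {model: totals[model] / n_tfs for model in models}
```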
Average ranking of 5 diPWM models (BVH, LOD, MCH, NLG and optimized NLG (OPT)) calculated for corresponding FP rates (matrix length 20 nt).
| TP rate, % | BVH | LOD | MCH | NLG | OPT |
| 50 | 3.44 | 3.00 | 4.22 | 3.00 | 1.33 |
| 60 | 3.11 | 3.78 | 3.89 | 2.44 | 1.67 |
| 70 | 3.78 | 3.44 | 4.11 | 2.22 | 1.44 |
| 80 | 3.56 | 4.22 | 3.89 | 2.00 | 1.33 |
| 90 | 3.89 | 4.33 | 3.00 | 2.00 | 1.78 |
| 100 | 3.11 | 3.89 | 3.67 | 2.44 | 1.89 |
Average ranks were calculated across nine TFBS types (see Methods, Sequence data).
Details for PWM and SiteGA models of TFBS recognition
| TF type | No. of training sequences | Window length, nt | PWM: type of matrix (1) | SiteGA: number of LPDs |
| E2F | 40 | 38 | DI | 60 |
| HNF4 | 29 | 17 | DI | 140 |
| IRF1 | 28 | 33 | DI | 38 |
| ISGF3 | 27 | 34 | DI | 36 |
| NF-κB | 43 | 30 | DI | 150 |
| PPAR | 37 | 25 | DI | 90 |
| SF-1 | 53 | 30 | DI | 90 |
| SREBP | 37 | 18 | MONO | 110 |
| STAT1 | 32 | 20 | DI | 150 |
1 – DI – dinucleotide PWM, MONO – mononucleotide PWM.
Figure 2. FP rates for the optimized matrices. The FP rates for each optimized PWM are plotted for 50% and 70% TP rates. The PWMs have also been arranged from left to right in order of sequence length, with the length axis provided on the right-hand side of the plot.
Figure 3. ROC plots of SiteGA and PWM models. The plots compare performance of the two approaches for the 50–80% TP region. a) HNF4, b) SREBP, c) STAT1, d) PPAR, e) NF-κB, f) SF-1, g) IRF1, h) ISGF3, i) E2F. In this instance, better performance is marked by higher values and plots positioned further to the left. FP rates (X axis) are in logarithmic scale.
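The fixed-TP-rate comparison used in these plots can be illustrated with a short sketch: a score threshold is chosen so that the required fraction of known sites is recognized, and the FP rate is then the fraction of background windows scoring at or above that threshold. The function names are hypothetical, and the use of `>=` at the threshold is an assumption, not a detail taken from the source.

```python
import math


def threshold_for_tp_rate(site_scores, tp_rate):
    """Lowest score threshold that still recognizes `tp_rate` of known sites."""
    if not 0 < tp_rate <= 1:
        raise ValueError("tp_rate must be in (0, 1]")
    ranked = sorted(site_scores, reverse=True)
    # Small epsilon guards against floating-point overshoot in the product.
    n_needed = max(1, math.ceil(tp_rate * len(ranked) - 1e-9))
    return ranked[n_needed - 1]


def fp_rate(background_scores, threshold):
    """Fraction of background windows scoring at or above the threshold."""
    hits = sum(1 for score in background_scores if score >= threshold)
    return hits / len(background_scores)
```

Repeating the two calls for TP rates from 0.5 to 0.8 yields the (FP rate, TP rate) pairs that make up one curve of a ROC plot of this kind.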
Figure 4. Significant correlations between frequencies of locally positioned dinucleotides (LPDs) for SF-1 BSs. Each horizontal strip depicts one correlation between two LPDs. a) Positive correlations, 0.001 < p < 0.01; b) negative correlations, 0.001 < p < 0.01; c) positive correlations, p < 0.001; d) negative correlations, p < 0.001. The whole analyzed 30 bp region corresponds to the [1;30] window of 29 dinucleotide positions. Note also that the SF-1 consensus sequence GTCAAGGTCA [39] is located within region [10;19].
Figure 5. Analysis of significant correlations (p < 0.05) between frequencies of locally positioned dinucleotides calculated for SiteGA models. a) HNF4, b) SREBP, c) STAT1, d) PPAR, e) NF-κB, f) SF-1, g) IRF1, h) ISGF3, i) E2F.
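As a rough illustration of the quantities being correlated, the sketch below assumes that an LPD frequency is the fraction of dinucleotide positions within a region [start;end] of a site that hold a given dinucleotide, and that two LPDs are compared across the training sites with a Pearson coefficient. Both the definition and the choice of coefficient are assumptions made for this example; the source captions do not specify them, and no significance test is shown.

```python
import math


def lpd_frequency(site, dinucleotide, start, end):
    """Fraction of dinucleotide positions start..end (1-based, inclusive)
    of `site` that hold `dinucleotide`."""
    if start < 1 or end > len(site) - 1 or start > end:
        raise ValueError("region must lie within the dinucleotide positions")
    positions = range(start - 1, end)  # 0-based dinucleotide start indices
    hits = sum(1 for i in positions if site[i:i + 2] == dinucleotide)
    return hits / len(positions)


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

For a set of aligned training sites, computing `lpd_frequency` for two LPDs on every site gives two lists that can be passed to `pearson`. A site of 30 bp has dinucleotide positions 1 to 29, which matches the window described in the caption.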
Figure 6. Analysis of TFBS predictions in EPD promoters by SiteGA, PWMs and the combined PWM & SiteGA approach. X-axis – TF types and stringencies in terms of fixed TP rates (50, 60 and 70% for SiteGA and PWMs); Y-axis – no. of predicted sites (data labels are shown for each point). In the combined approach, a potential site must be recognized by both the PWM and SiteGA models.
Figure 7. Ratios between true-positive (TP) and false-positive (FP) rates, calculated on the training and background data respectively, for SiteGA, PWMs and the combined PWM & SiteGA approach. X-axis – TF types and stringencies in terms of fixed TP rate (50, 60 and 70% for SiteGA and PWMs); Y-axis (logarithmic scale) – ratio of the frequency of predictions in the training set (TP) to that in the background set (FP). In the combined approach, a potential site must be recognized by both the PWM and SiteGA models.
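The combined approach and the TP/FP ratio can be sketched as below. The sketch assumes that each model reports hits as start positions in a shared coordinate system, so that agreement between the models reduces to a set intersection. How positions from models with different window lengths were actually matched is not stated in the captions, and the function names are hypothetical.

```python
def combined_hits(pwm_hits, sitega_hits):
    """Positions recognized by both models (assumes shared coordinates)."""
    return sorted(set(pwm_hits) & set(sitega_hits))


def tp_fp_ratio(n_train_hits, n_train, n_background_hits, n_background):
    """Ratio of the hit frequency on training data to that on background data."""
    tp = n_train_hits / n_train
    fp = n_background_hits / n_background
    if fp == 0:
        # No background hits: the ratio is unbounded.
        return float("inf")
    return tp / fp
```

Because the intersection can only remove predictions, the combined approach never reports more sites than either model alone. Whether the ratio improves depends on whether background hits are removed faster than training hits.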
Ranking of EPD human promoters containing predicted TFBSs by target functionality
| TF type, Stringency | Known targets | Very possible targets | Possible targets | BSs of homologous TFs | Possibly unrelated | Unknown |
| E2F, 60% | MCM7, MCM5, MCM2, Cyclin D1, RAD51, TYMS2 | MCM3 | SLC1A5, ALDOA E3P2, GORASP2, BTG1, RAE1, ZNF9, CDC37 | MYL6, COX7B, COX8, YME1L1, MFGE8, TPMT | CBARA1, AUP1, SFRS10, PGD, TARS, PRPS1, RNASE4, PAPOLA2, MRPS7 | |
| HNF4, 50% | hepatic lipase | apolipop. B, apolipop. A1, glucagon, HPD, GLUL | HGD, HBXIP, CSTB, Cyclin D1, CCNB1, CCNB2, UBL5, TAPBP, IFI27, NDUFA6 | histone H4-A1, PRM2, STIP1, RPS8, CASQ2, TOMM22, AUP1, SNX6, TRAP1, RPL23 | CEA, FLJ13154, FLJ10276, LOC57862, COPS4, BLCAP, SLC22A8 | |
| IRF1, 80% | complement f. B, b'-interferon, HLA C, HLA B, IFI27, SP100, IFIT1, IFI 54K, IFI 6–162, ISG 15K | BST2, B2M | Haptoglob Hp1F, haptoglob HpR, IL-4 (BSF-1), TAPBP, NDUFC2, PHGDH, GLRX2 | † CSF-1, PCNA | APOL3, SRP54, GTF2H4, DNASE2, NDUFB4 | |
| ISGF3, 70% | IFI272, IFI 54K, IFI 6–162, ISG 15K | Complement f. B, IFIT1 | SSBP1, ATP5J, SCARB2, ESRRBL1, CTNNBL1, STX10, PMAIP1 | TNP1, B4GALT4, KNS2, PPP1CA | MGC2714 | |
| PPAR, 70% | ATIII | apolipop. CIII, PCK2 | SDHD, CIR, LOC51064 | histone H2B, BAP29, CHEK1, HNMT, HLA DPB1, MT-IB, MRPL11, SERPINA3, SGT, TNP1, TAPBP, TCL1A, UBL5, VPS29, ARAF1 | CDW52, CSNK1A1, KIAA0971, PCBP2, STIP1, TAF9, TM4SF2, TMP21 | |
| SF-1, 60% | CG/LH/FSH/TSH-a, CYP11A, HSD3B2 | ACPP, PSMD8 | gastrin, FXYD3, CKMT2, MYL2, TGFB1, FXR1, VAPA, TPI1, CDC42EP2, MRPL11, MRPS21, S100A2 | # CTRB1, PPGB, ATF4 | SAT, CNTF, PPA12, RPLP1, FARSL, SERPIND1, CASQ2, TXN, CCT7, SNRP70 | ATP6V0D1, IGF II E3P3, ERH, OS, HU, JM5 |
| STAT1, 60% | complement f. B, AGT3, CD14, FLJ20244, CDKN1A | TNF-b', C4BP b', ARHGDIB, CTNNA1, C1S, SERPINB6 | MCP, BLCAP, DEK, DUT | ‡ DDH hepatic, PGK1, PTTG1 | SAP18, PRSS1 | CG/LH/FSH/TSH a', ALDH1A1, LOC51231, PWP1, EIF3S6, TARS, DC6, GGPS1 |
Predictions of potential targets of E2F, IRF1, ISGF3, HNF4, PPAR, SF-1 and STAT1 were made with the combined PWM & SiteGA approach. For each TF, the stringencies for SiteGA and PWM are given in terms of the same fixed TP rate (50–80%). Known targets – experimentally confirmed targets; very possible or possible targets – substantial or some indirect evidence of target functionality, respectively; BSs of homologous TFs – confirmed or presumed targets of homologous TFs; possibly unrelated – presumably false positives; unknown – no information on target functionality. 2 – gene has two predicted sites; 3 – gene has three predicted sites; † – potential targets of members of the IRF family; # – potential targets of LRH-1, a homolog of SF-1; ‡ – potential targets of members of the STAT family.
Figure 8. Comparative analysis of predicted SF-1 BS densities for human chromosomes. X-axis – human chromosome (1–22, X, Y); Y-axis (logarithmic scale) – ratio of the no. of predicted sites to the total no. of analyzed window positions.
Figure 9. Web interface of the SiteGA tool.
Figure 10. Example of the SiteGA tool output data.
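The density plotted on the Y-axis is a simple ratio, sketched below. The count of window positions assumes a sliding window moved one nucleotide at a time, and the option to count both strands is an assumption, since the caption does not say whether both strands were scanned. The function names are hypothetical.

```python
def window_positions(seq_length, window_length, both_strands=True):
    """Number of window positions in a sequence scanned one nucleotide at a time."""
    per_strand = max(seq_length - window_length + 1, 0)
    # Assumption: scanning both strands doubles the number of positions.
    return 2 * per_strand if both_strands else per_strand


def site_density(n_predicted, seq_length, window_length, both_strands=True):
    """Ratio of predicted sites to the total number of analyzed window positions."""
    total = window_positions(seq_length, window_length, both_strands)
    if total == 0:
        raise ValueError("sequence is shorter than the window")
    return n_predicted / total
```

For example, with the 30 nt window listed for SF-1, a 100 nt sequence scanned on one strand has 71 window positions.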