Stein Aerts, Jacques van Helden, Olivier Sand, Bassem A Hassan.
Abstract
Networks of regulatory relations between transcription factors (TFs) and their target genes (TGs), implemented through TF binding sites (TFBSs), are key features of biology. An idealized approach to solving such networks consists of starting from a consensus TFBS or a position weight matrix (PWM) to generate a high-accuracy list of candidate TGs for biological validation. Developing and evaluating such approaches remains a formidable challenge in regulatory bioinformatics. We perform a benchmark study on 34 Drosophila TFs to assess existing TFBS and cis-regulatory module (CRM) detection methods, with a strong focus on the use of multiple genomes. In particular, for CRM modelling we investigate the addition of orthologous sites to a known PWM to construct phyloPWMs, and we assess the added value of phylogenetic footprinting to predict contextual motifs around known TFBSs. For CRM prediction, we compare motif conservation with network-level conservation approaches across multiple genomes. Choosing the optimal training and scoring strategies strongly enhances the performance of TG prediction for more than half of the tested TFs. Finally, we analyse a 35th TF, namely Eyeless, and find a significant overlap between predicted TGs and candidate TGs identified by microarray expression studies. In summary, we identify several ways to optimize TF-specific TG predictions, some of which can be applied to all TFs, and others that can be applied only to particular TFs. The ability to model known TF-TG relations, together with the use of multiple genomes, results in a significant step forward in solving the architecture of gene regulatory networks.
Year: 2007 PMID: 17973026 PMCID: PMC2047340 DOI: 10.1371/journal.pone.0001115
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. LOOCV assessment scheme.
An enhancer model, consisting of one or more PWMs, is trained on known target regions of a TF, excluding the target region of the test TG (here TG4 is excluded). This left-out region, together with a set of negative sequences, is scored with the enhancer model. All 166 rank ratios are plotted cumulatively.
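The leave-one-out cross-validation (LOOCV) scheme in this caption can be illustrated with a minimal sketch. It assumes that the rank ratio is the rank of the left-out positive among the negative sequences divided by the size of the scored list, and that the cumulative plot is summarized by its area under the curve (AUC) using a trapezoid rule. The function names, the tie handling, and the example scores are hypothetical and are not taken from the paper.

```python
def rank_ratio(positive_score, negative_scores):
    """Rank of the left-out positive among the negatives, divided by list size.

    Assumption: higher scores are better, and ties do not push the positive down.
    """
    rank = 1 + sum(1 for s in negative_scores if s > positive_score)
    return rank / (len(negative_scores) + 1)


def cumulative_curve(rank_ratios, steps=100):
    """Fraction of test cases with a rank ratio at or below each threshold."""
    n = len(rank_ratios)
    return [
        (t / steps, sum(1 for r in rank_ratios if r <= t / steps) / n)
        for t in range(steps + 1)
    ]


def area_under_curve(curve):
    """Trapezoid-rule area under the cumulative curve (1.0 would be perfect)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(curve, curve[1:])
    )


# Hypothetical scores: one left-out target region against three negatives.
ratios = [
    rank_ratio(1.0, [0.1, 0.2, 0.3]),   # positive ranks first of four
    rank_ratio(0.9, [0.1, 0.5, 0.95]),  # positive ranks second of four
]
curve = cumulative_curve(ratios)
print(ratios, area_under_curve(curve))
```

A curve that rises steeply near a rank ratio of zero means that most left-out targets are recovered near the top of the ranked list.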
Figure 2. LOOCV performance for homotypic cluster detection on Dmel.
For each of the 34 TFs, target sequences are scored with Cluster-Buster, using a PWM built from all known binding sites (grey curve) or from all known binding sites except those located in the region being scored (blue curve). Scrambled PWMs are used as negative controls (black curve).
Figure 3. Integrating HMM-based enhancer scoring across multiple genomes.
(A) LOOCV performances of Cluster-Buster (red dashed curve) and STUBBMS (green curve) on two genomes (Dmel and Dpse), Cluster-Buster on all genomes using network-level conservation (NLC) (red curve), and Cluster-Buster combined with motif conservation (MC) (purple curve). The red dotted curve is the negative control. (B) Implementation of network-level conservation by integrating the Cluster-Buster scores on multiple genomes through order statistics. Rank ratios for orthologous sequences (both positive and negative) are obtained for each species separately and are integrated by the order statistics formula. Dmel sequences are finally ranked according to the integrated score. (C) LOOCV performances (AUC values on the y axis) for each TF (x axis), using the Dmel PWM or the phyloPWM. The first two bars represent the scoring on Dmel alone, the next two on all genomes with network-level conservation (NLC), and the last two on all genomes with motif conservation (MC). The TFs are sorted according to decreasing baseline performance (black).
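The integration step described in panel (B) can be sketched as follows. The caption does not spell out the order statistics formula, so this example assumes one common formulation: the joint cumulative distribution of the order statistics of N uniform rank ratios, computed with a recursive scheme. Whether this matches the authors' exact implementation is an assumption, and the function name and example values are hypothetical.

```python
from math import factorial


def order_statistic_q(rank_ratios):
    """Integrate per-species rank ratios into one score (lower is better).

    Computes Q = N! * V_N with V_0 = 1 and
    V_k = sum_{i=1..k} (-1)^(i-1) * V_{k-i} / i! * r_{N-k+1}^i,
    where r_1 <= ... <= r_N are the sorted rank ratios.
    """
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0]  # V_0
    for k in range(1, n + 1):
        x = r[n - k]  # r_{N-k+1} in 1-based notation
        v_k = sum(
            (-1) ** (i - 1) * v[k - i] / factorial(i) * x ** i
            for i in range(1, k + 1)
        )
        v.append(v_k)
    return factorial(n) * v[n]


# Hypothetical rank ratios of one sequence and its orthologs in three species.
consistently_high = order_statistic_q([0.01, 0.02, 0.05])
high_in_one_only = order_statistic_q([0.01, 0.60, 0.90])
print(consistently_high, high_in_one_only)
```

Under this formulation, a sequence that ranks near the top in every species receives a much smaller integrated score than one that ranks well in a single species only, which is the behaviour that network-level conservation relies on.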
Figure 4. Homotypic cluster detection with phyloPWMs.
(A) LOOCV performance for Dmel PWMs (blue curve), and phyloPWMs from all species tested (Dmel, Dsim, Dsec, Dere, Dyak, Dana, Dpse, Dper, Dvir, Dmoj, and Dgri). Orthologous sites were chosen based on a minimum of either 40% (green curve), 70% (red), 80% (brown), or 90% (orange) identity with the Dmel true binding site. (B) Differences between the AUC values obtained from a Dmel PWM and a phyloPWM, for each TF. TFs with differences above 0.05 are colored orange. PWMs with few sites (y-axis) have greater AUC differences than PWMs with many sites.
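The identity threshold used to select orthologous sites for a phyloPWM can be illustrated with a small sketch. It assumes gap-free orthologous sites of the same length as the Dmel site and builds a plain count matrix; the sequences, function names, and the choice of a count matrix without pseudocounts are hypothetical and not the authors' pipeline.

```python
def percent_identity(site_a, site_b):
    """Percentage of matching positions between two equal-length sites."""
    if len(site_a) != len(site_b):
        raise ValueError("sites must have the same length")
    matches = sum(a == b for a, b in zip(site_a, site_b))
    return 100.0 * matches / len(site_a)


def select_phylo_sites(dmel_site, orthologous_sites, min_identity=80.0):
    """Keep the Dmel site plus orthologs that meet the identity threshold."""
    kept = [s for s in orthologous_sites
            if percent_identity(dmel_site, s) >= min_identity]
    return [dmel_site] + kept


def count_matrix(sites):
    """Per-position base counts over a list of equal-length sites."""
    length = len(sites[0])
    counts = {base: [0] * length for base in "ACGT"}
    for site in sites:
        for pos, base in enumerate(site.upper()):
            counts[base][pos] += 1
    return counts


# Hypothetical Dmel binding site and three candidate orthologous sites.
sites = select_phylo_sites("ACGT", ["ACGT", "ACGA", "TTTT"], min_identity=70.0)
print(sites)
print(count_matrix(sites))
```

A lower threshold admits more divergent orthologous sites, which adds more sequences to the matrix but may dilute the motif.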
Figure 5. LOOCV performances for heterotypic CRM-models.
(A) Heterotypic models consist of PWMs obtained by motif discovery on Dmel sequences using MotifSampler (brown curve) or oligo-analysis (orange curve) and on all species using PhyloGibbs (green curve). Models that combine de novo PWMs with the true experimental PWM are shown for oligo-analysis (dashed orange) and PhyloGibbs (dashed green). Scoring is done either on Dmel alone (thin full lines) or on all species by network-level conservation (NLC) (thick lines and dashed lines). (B) In green and orange are LOOCV performances for scoring with heterotypic models (either species-specific for oligo-analysis or cross-species for PhyloGibbs) on all species including the Dmel PWM (cf. thick full lines in A). Scoring on all species is done by network-level conservation (NLC). In blue is the control, namely NLC using the Dmel PWM alone, without newly discovered motifs.
Selection of GO-filtered genome-wide target gene predictions from our website.
| TF | Known TGs | Model | AUC | GO ID | GO Term | P-value | Candidate TGs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bcd | tll, eve, ems, Kr, kni, salm, h, hb | Homotypic P-NLC | 0.91 | GO:0008595 | determination of anterior/posterior axis, embryo | 2.00E-07 | gt, kni, pum, hb, slp1, oc, tll, eve, Kr |
| dl | rho, zen, twi, sna, dpp | Homotypic F-NLC | 1 | GO:0007498 | mesoderm development | 6.08E-04 | dpp, mbl, zfh1, sna, S, pnt, twi, tmod, jeb, vnd |
| tin | Mef2, eve, betaTub60D, tin | Homotypic F-NLC | 0.99 | GO:0007507 | heart development | 4.63E-11 | mid, G-oalpha47A, fz, apt, lbl, svp, hh, fas, tin, Mef2, pnr |
| brk | zen, lab, bi | Homotypic P-MC | 0.95 | GO:0007179 | transforming growth factor beta receptor signaling pathway | 0.0308 | Dad, sog, pnr, bun |
| cad | ftz, kni, salm | Homotypic P-MC | 0.92 | GO:0007379 | segment specification | 7.58E-06 | kn, osa, Antp, cad, Abd-B, kis, abd-A |
| pan | slp1, eve, Ser | Homotypic P-MC | 0.88 | GO:0016055 | Wnt receptor signaling pathway | 0.0168 | osa, Notum, Wnt4, Axn, par-1 |
| Mad | zen, vg, tin | Homotypic P-MC | 0.88 | GO:0007267 | cell-cell signaling | 0.00563 | DopR, para, NetA, pum, mib1, D2R, shot, wb, scrib, Or98a, bab1, dlg1, fru |
| sd | ct, bs, vg, kni, salm | Homotypic P-MC | 0.91 | GO:0007476 | wing morphogenesis | 0.03205 | vg, dpp, bs, fz, shot, sgg, px |
| E(spl) | ac, sc, l(1)sc, Espl | Heterotypic F-O-NLC | 0.94 | GO:0045165 | cell fate commitment | 0.000383 | fz, l(1)sc, Ubx, pum, vn, bun, spen, mam, fas, sc |
| gt | abd-A, eve, Kr, kni | Heterotypic F-O-NLC | 0.96 | GO:0007354 | zygotic determination of anterior/posterior axis, embryo | 0.00104 | tll, gt, kni, sog, Kr |
| HLHm5 | ac, l(1)sc, Espl | Heterotypic F-O-NLC | 0.91 | GO:0045165 | cell fate commitment | 4.40E-05 | hth, srp, pnt, l(1)sc, pum, Ubx, vn, bun, spen, ac, kay |
| kni | Ubx, eve, Kr, h | Heterotypic F-O-NLC | 0.99 | GO:0035290 | trunk segmentation | 0.00150 | kni, Ubx, eve, Kr |
| ovo | orb, otu, Sxl, ovo | Heterotypic F-O-NLC | 0.92 | GO:0009993 | oogenesis (sensu Insecta) | 0.00158 | dpp, Ptp61F, Eip74EF, Sxl, pum, bun, dnc, ovo, Fas3, ttk, sty, Eip75B, Mef2, ct |
| tll | Ubx, ems, Kr, kni, h, hb | Heterotypic F-O-NLC | 0.9 | GO:0045165 | cell fate commitment | 1.76E-09 | hth, eya, fz, kni, vvl, Ubx, pum, hdc, run, h, tll, fas, sc, Kr, ct |
| twi | rho, Ubx, sna, sim, tin | Heterotypic F-O-NLC | 0.92 | GO:0007507 | heart development | 0.00244 | fz, fas, Antp, apt, Ubx, Mef2 |
| ey | so, shf, Optix, eya, ato | Homotypic F-NLC | 0.94 | GO:0007456 | eye development (sensu Endopterygota) | 9.98E-05 | hth, Optix, Fas2, eya, fz, pnt, so, bun, toy, lilli, S, klar, fred |
| | | | | | Overlap with 188 upregulated genes after ey over-expression from | 2.33e-05 | mspo, SK, so, toy, ey, CG17816, Optix, CG30492, CG32521, osp, Fas2, CG5888, Tie, eya |
Candidate targets are presented for those TFs with AUC above 0.88 and with a TF-associated functional enrichment in the list of top 100 candidates. Results for other factors, for other functional classes, and for genomic locations of predicted motif clusters can be found at http://med.kuleuven.be/cme-mg/lng/cisTarget/. F = FlyReg PWM, P = phyloPWM, NLC = Network-level conservation, MC = Motif Conservation, F-O = FlyReg PWM+oligo-analysis motifs.