Jun Ding, Haiyan Hu, Xiaoman Li.
Abstract
The identification of transcription factor binding motifs is important for the study of gene transcriptional regulation. Chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq) provides an unprecedented opportunity to discover binding motifs. Computational methods have been developed to identify motifs from ChIP-seq data, but they encounter several problems. For example, existing methods are often not scalable to the large number of sequences obtained from ChIP-seq peak regions. Some methods rely heavily on well-annotated motifs even though the number of known motifs is limited. To simplify the problem, de novo motif discovery methods often neglect underrepresented motifs in ChIP-seq peak regions. To address these issues, we developed a novel approach called SIOMICS for de novo motif discovery from ChIP-seq data. Tested on 13 ChIP-seq data sets, SIOMICS identified motifs of many known and new cofactors. Tested on 13 simulated random data sets, SIOMICS discovered no motif in any data set. Compared with two recently developed methods for motif discovery, SIOMICS shows advantages in terms of speed, the number of known cofactor motifs predicted in experimental data sets and the number of false motifs predicted in random data sets. The SIOMICS software is freely available at http://eecs.ucf.edu/~xiaoman/SIOMICS/SIOMICS.html.
Year: 2013 PMID: 24322294 PMCID: PMC3950686 DOI: 10.1093/nar/gkt1288
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The procedure in SIOMICS.
Predicted motifs by SIOMICS in 13 ChIP-seq data sets and 13 random data sets
| Data set | Number of peaks | Number of predicted motifs | Number of predicted motif modules | Percentage of motifs similar to known motifs (E-value < 1E-5) | Percentage of motifs similar to known motifs (E-value < 1E-4) | Percentage of motifs not in original 100 | Number of motifs predicted in random data sets |
|---|---|---|---|---|---|---|---|
| Sox2 | 7761 | 99 | 889 | 78/99 = 78.8% | 96/99 = 97.0% | 51/99 = 51.5% | 0 |
| E2f1 | 20 670 | 99 | 2510 | 79/99 = 79.8% | 94/99 = 94.9% | 55/99 = 55.6% | 0 |
| Stat3 | 5347 | 91 | 1256 | 72/91 = 79.1% | 85/91 = 93.4% | 39/91 = 42.9% | 0 |
| Nanog | 17 834 | 99 | 1131 | 76/99 = 76.8% | 96/99 = 97.0% | 58/99 = 58.6% | 0 |
| Oct4 | 6915 | 73 | 719 | 64/73 = 87.7% | 69/73 = 94.5% | 42/73 = 45.2% | 0 |
| c-Myc | 6462 | 96 | 1901 | 74/96 = 77.1% | 94/96 = 97.9% | 77/96 = 80.2% | 0 |
| Klf4 | 18 144 | 99 | 2052 | 83/99 = 83.8% | 96/99 = 97.0% | 52/99 = 52.5% | 0 |
| Ctcf | 49 114 | 99 | 784 | 78/99 = 78.8% | 94/99 = 94.9% | 38/99 = 38.4% | 0 |
| Zfx | 17 201 | 98 | 1945 | 75/98 = 76.5% | 93/98 = 94.9% | 76/98 = 77.6% | 0 |
| Tcfcp2l1 | 45 885 | 71 | 782 | 55/71 = 77.5% | 68/71 = 95.8% | 41/71 = 57.8% | 0 |
| Esrrb | 49 127 | 43 | 308 | 35/43 = 81.4% | 41/43 = 95.3% | 30/43 = 69.8% | 0 |
| n-Myc | 10 987 | 94 | 1766 | 72/94 = 76.6% | 91/94 = 96.8% | 80/94 = 85.1% | 0 |
| Smad1 | 2185 | 21 | 33 | 21/21 = 100% | 21/21 = 100% | 16/21 = 76.2% | 0 |
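The percentage cells in the table above are written as `numerator/denominator = percent`. As a minimal sketch of how such a cell can be recomputed from its fraction, the Python snippet below parses one cell and compares the reported percentage with the recomputed one. The function name `recompute` and the tolerance `tol` are illustrative assumptions and are not part of SIOMICS or of the original article.

```python
import re

# Matches table cells of the form "78/99 = 78.8%".
CELL = re.compile(r"(\d+)\s*/\s*(\d+)\s*=\s*(\d+(?:\.\d+)?)%")


def recompute(cell, tol=0.1):
    """Return (reported, recomputed, consistent) for one 'a/b = p%' cell.

    `tol` is an illustrative tolerance in percentage points (an assumption,
    not something the source defines).
    """
    m = CELL.fullmatch(cell.strip())
    if m is None:
        raise ValueError(f"not a fraction cell: {cell!r}")
    num, den, reported = int(m.group(1)), int(m.group(2)), float(m.group(3))
    recomputed = 100.0 * num / den
    # Round only for display; the consistency check uses the unrounded value.
    return reported, round(recomputed, 1), abs(recomputed - reported) <= tol


if __name__ == "__main__":
    # A few cells copied from the Sox2 and Smad1 rows above.
    for cell in ["78/99 = 78.8%", "96/99 = 97.0%", "21/21 = 100%"]:
        print(cell, "->", recompute(cell))
```

The same helper could be applied to the fraction cells of the other tables in this record, since they use the same cell format.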
Predicted motif modules are supported
| Data set | Motif modules containing at least one pair of interacting TFs from BioGRID | P-value | Shared motif modules across data sets | Motif modules with preferred motif order (corrected | Motif modules supported by at least one type of evidence |
|---|---|---|---|---|---|
| Sox2 | 343/889 = 38.6% | 0 | 261/889 = 29.4% | 208/889 = 23.4% | 582/889 = 65.6% |
| E2f1 | 1373/2510 = 54.7% | 0 | 408/2510 = 16.3% | 1452/2510 = 57.8% | 2039/2510 = 81.2% |
| Stat3 | 469/1256 = 37.3% | 0 | 289/1256 = 23.0% | 244/1256 = 19.4% | 755/1256 = 60% |
| Nanog | 348/1131 = 30.8% | 0 | 273/1131 = 24.13% | 428/1131 = 37.8% | 712/1131 = 62.3% |
| Oct4 | 254/719 = 35.3% | 2.2E-271 | 110/719 = 15.3% | 179/719 = 24.9% | 406/719 = 56.6% |
| c-Myc | 715/1901 = 37.6% | 0 | 331/1901 = 17.4% | 506/1901 = 26.6% | 1166/1901 = 61.3% |
| Klf4 | 955/2052 = 46.5% | 0 | 357/2052 = 17.4% | 1044/2052 = 50.8% | 1517/2052 = 73.4% |
| Ctcf | 299/784 = 38.2% | 0 | 181/784 = 23.1% | 402/784 = 51.3% | 584/784 = 74.5% |
| Zfx | 535/1945 = 27.5% | 0 | 321/1945 = 16.5% | 762/1945 = 39.2% | 1207/1945 = 62.1% |
| Tcfcp2l1 | 169/782 = 21.6% | 8.8E-136 | 154/782 = 19.7% | 345/782 = 44.1% | 495/782 = 63.3% |
| Esrrb | 105/308 = 34.1% | 3.2E-106 | 51/308 = 16.6% | 125/308 = 40.6% | 204/308 = 66.2% |
| n-Myc | 807/1766 = 45.7% | 0 | 311/1766 = 17.6% | 723/1766 = 40.1% | 1249/1766 = 70.1% |
| Smad1 | 11/33 = 33.3% | 4.8E-12 | 9/33 = 27.3% | 3/33 = 9.1% | 17/33 = 51.5% |
Comparison of three methods on prediction of known cofactor motifs. Entries list the known motifs found (primary and cofactors) at an E-value cutoff of 1E-4.
| TF | SIOMICS | DREME | Peak-motifs |
|---|---|---|---|
| Sox2 | 8/9 (Sox2, Klf4, Stat3, Zic3, Hoxa5, Tcf3, Tead1, Oct4) | 8/9 (Sox2, Oct4, Klf4, Stat3, Esrrb, Zic3, Tcf3, Tead1) | 4/9 (Sox2, Oct4, Klf4, Esrrb) |
| E2f1 | 7/10 (E2f1, Stat3, Klf4, Fox, Sp1, Nfkb1, Tbp) | 6/10 (E2f1, Stat3, Myc, Klf4, Creb, Sp1) | 3/10 (Klf4, Creb, Sp1) |
| Stat3 | 6/8 (Stat3, Klf4, Sox2, Myc, Sp1, Irf) | 6/8 (Stat3, Klf4, Esrrb, Sox2, Myc, Sp1) | 6/8 (Stat3, Klf4, Sox2, Esrrb, Myc, Sp1) |
| Nanog | 7/8 (Nanog, Sox2, Oct4, Zic3, Klf4, Elf5, Tead1) | 4/8 (Nanog, Sox2, Klf4, Esrrb) | 4/8 (Sox2, Oct4, Klf4, Esrrb) |
| Oct4 | 8/10 (Oct4, Sox2, Klf4, Sox10, Ewsr1, Nanog, Zic2, Esrrb) | 7/10 (Oct4, Sox2, Klf4, Esrrb, Sox10, Ewsr1, Nanog) | 5/10 (Oct4, Klf4, Creb, Esrrb, Sox10) |
| c-Myc | 3/4 (Stat3, Egr1, Sp1) | 3/4 (c-Myc, Stat3, Sp1) | 3/4 (c-Myc, Egr1, Sp1) |
| Klf4 | 4/10 (Klf4, Stat3, Sox2, Sp1) | 6/10 (Klf4, Stat3, Esrrb, Sox2, Sp1, Myc) | 3/10 (Klf4, Stat3, Sp1) |
| Ctcf | 5/6 (Ctcf, Stat3, Gabpa, Yy1, Smad3) | 4/6 (Ctcf, Stat3, Gabpa, Smad3) | 2/6 (Ctcf, Myc) |
| Zfx | 2/4 (Zfx, Stat3) | 2/4 (Zfx, Stat3) | 2/4 (Zfx, Stat3) |
| Tcfcp2l1 | 7/12 (Tcfcp2l1, Stat3, Klf4, Sox2, Esrrb, Fox, Sp1) | 6/12 (Tcfcp2l1, Stat3, Klf4, Esrrb, Fox, Sp1) | 5/12 (Klf4, Esrrb, Egr1, Fox, Sp1) |
| Esrrb | 4/10 (Esrrb, Klf4, Rxra, Sp1) | 8/10 (Esrrb, Klf4, Sox2, Stat3, Myc, Rxra, Ewsr1, Sp1) | 5/10 (Esrrb, Klf4, Stat3, Rxra, Sp1) |
| n-Myc | 2/5 (Stat3, Creb) | 2/5 (n-Myc, Stat3) | 1/5 (n-Myc) |
| Smad1 | 5/9 (Sox2, Oct4, Esrrb, Klf4, Stat3) | 4/9 (Sox2, Esrrb, Klf4, Stat3) | 4/9 (Sox2, Esrrb, Zic3, Klf4) |
Figure 2. The time cost comparison of SIOMICS with DREME and Peak-motifs.