Yiyu Zheng, Xiaoman Li, Haiyan Hu.
Abstract
Comprehensive motif discovery under experimental conditions is critical for a global understanding of gene regulation. To generate a nearly complete list of human DNA motifs under given conditions, we used a novel approach to discover, de novo, significant co-occurring DNA motifs in 349 human DNase I hypersensitive site (DHS) datasets. We predicted 845 to 1325 motifs in each dataset, for a total of 2684 non-redundant motifs. These 2684 motifs included 54.02% to 75.95% of the known motifs in seven large collections, including TRANSFAC. In each dataset, we also discovered 43 663 to 2 013 288 motif modules, which are groups of motifs whose binding sites co-occur in a significant number of short DNA regions. When compared with known interacting transcription factors in eight resources, the predicted motif modules included, on average, 84.23% of known interacting motifs. We further identified new features of the predicted motifs: for example, motifs enriched in proximal regions rarely overlapped with motifs enriched in distal regions, and motifs enriched in 5' distal regions were often also enriched in 3' distal regions. Finally, we observed that the 2684 predicted motifs classified the cell or tissue types of the datasets with an accuracy of 81.29%. The resources generated in this study are available at http://server.cs.ucf.edu/predrem/.
Year: 2014 PMID: 25505144 PMCID: PMC4288161 DOI: 10.1093/nar/gku1261
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The pipeline to discover motifs in a DHS dataset by SIOMICS.
The majority of vertebrate motifs in seven collections were included in our prediction
| Collections | #Motifs in the collection | #Motifs in the collection predicted by 2684 nrMotifs | % Motifs in the collection predicted by 2684 nrMotifs | Average #motifs in the collection predicted by 2684 randomly generated motifs | Average % motifs in the collection predicted by 2684 randomly generated motifs |
|---|---|---|---|---|---|
| TRANSFAC | 522 | 282 | 54.02 | 27.2 | 5.21 |
| JASPAR | 593 | 328 | 55.31 | 36.8 | 6.21 |
| HOCOMOCO | 1896 | 1210 | 63.82 | 101.6 | 5.36 |
| FactorBook | 79 | 60 | 75.95 | 4.3 | 5.44 |
| DBS | 843 | 458 | 54.33 | 52.4 | 6.22 |
| Neph | 683 | 497 | 72.77 | 41.5 | 6.08 |
| Kheradpour | 2065 | 1127 | 54.58 | 119.8 | 5.80 |
Known motifs in GM12878 and K562 were included in our prediction
| Cell line | #Motifs predicted | Collection | #Motifs in the collection | #Motifs in the collection predicted | % Motifs in the collection predicted |
|---|---|---|---|---|---|
| GM12878 | 961 | factorbook_gm12878 | 36 | 20 | 55.56 |
| GM12878 | 961 | dreme_gm12878 | 195 | 149 | 76.41 |
| GM12878 | 961 | Kheradpour | 67 | 43 | 64.18 |
| K562 | 953 | factorbook_k562 | 47 | 33 | 70.21 |
| K562 | 953 | dreme_k562 | 259 | 209 | 80.69 |
| K562 | 953 | Kheradpour | 64 | 43 | 67.19 |
Figure 2. (A) The number of motifs enriched in different types of regions. (B) The overlap of motifs enriched in distal and proximal regions in GM12878. (C) The overlap of motifs enriched in distal and proximal regions in K562.
Most known interacting TF pairs were included in our predictions
| Resources | #Interactions in the resource | #Interactions in the resource related to TFs in our study | #Known interactions discovered | % Known interactions discovered |
|---|---|---|---|---|
| BioGRID | 155 100 | 2769 | 2357 | 85.12% |
| DIP | 3060 | 90 | 87 | 96.67% |
| HPRD | 39 184 | 1397 | 1174 | 84.04% |
| IntAct | 267 000 | 931 | 673 | 72.29% |
| MINT | 25 756 | 198 | 174 | 87.88% |
| PIPs | 78 613 | 1265 | 1027 | 81.19% |
| Gerstein, M. B. | 519 691 | 13 211 | 10 876 | 82.33% |
| Ravasi, T. | 5238 | 1278 | 1078 | 84.35% |
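
The 84.23% average quoted in the abstract can be reproduced as the unweighted mean of the eight per-resource percentages in the table above. The sketch below recomputes each percentage from the counts and then averages them. The names `ROWS`, `pct`, `mean_pct`, and `pooled_pct` are illustrative, and the pooled figure is an additional derived quantity that is not reported in the source.

```python
# Values copied from the table above:
# resource -> (interactions related to TFs in the study, known interactions discovered)
ROWS = {
    "BioGRID":         (2769, 2357),
    "DIP":             (90, 87),
    "HPRD":            (1397, 1174),
    "IntAct":          (931, 673),
    "MINT":            (198, 174),
    "PIPs":            (1265, 1027),
    "Gerstein, M. B.": (13211, 10876),
    "Ravasi, T.":      (1278, 1078),
}

# Percentage of known interactions discovered, per resource.
pct = {name: 100.0 * found / related for name, (related, found) in ROWS.items()}

# Unweighted mean: every resource counts equally, whatever its size.
mean_pct = sum(pct.values()) / len(pct)

# Pooled percentage: total discovered over total related (not in the source).
total_related = sum(related for related, _ in ROWS.values())
total_found = sum(found for _, found in ROWS.values())
pooled_pct = 100.0 * total_found / total_related

for name, value in pct.items():
    print(f"{name:16s} {value:6.2f}%")
print(f"Unweighted mean: {mean_pct:.2f}%")
print(f"Pooled:          {pooled_pct:.2f}%")
```

The pooled value is lower than the unweighted mean because the largest resource, Gerstein, M. B., dominates the totals and sits slightly below the average.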
Figure 3. (A) Classification of 310 datasets using 2684 motifs. (B) Individual ROC curves for four tissue types.