Ivan V Kulakovskiy, Yulia A Medvedeva, Ulf Schaefer, Artem S Kasianov, Ilya E Vorontsov, Vladimir B Bajic, Vsevolod J Makeev.
Abstract
Transcription factor (TF) binding site (TFBS) models are crucial for computational reconstruction of transcription regulatory networks. In existing repositories, a TF often has several models (also called binding profiles or motifs), obtained from different experimental data. Having a single TFBS model per TF is more practical for applications. We show that integrating TFBS data from various types of experiments into a single model typically improves model quality, probably because it partially corrects the technique-specific bias of each source. We present the Homo sapiens comprehensive model collection (HOCOMOCO, http://autosome.ru/HOCOMOCO/, http://cbrc.kaust.edu.sa/hocomoco/), which contains carefully hand-curated TFBS models constructed by integrating binding sequences obtained by both low- and high-throughput methods. To construct the position weight matrices that represent these TFBS models, we used the ChIPMunk software in four computational modes, including a newly developed periodic positional prior mode associated with the DNA helix pitch. We selected only one TFBS model per TF, unless there was clear experimental evidence for two rather distinct TFBS models. We assigned a quality rating to each model. HOCOMOCO contains 426 systematically curated TFBS models for 401 human TFs, where 172 models are based on more than one data source.
Year: 2012 PMID: 23175603 PMCID: PMC3531053 DOI: 10.1093/nar/gks1089
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Basic statistics for sequences in each data source
| Data source | Total No. of sequences | Median sequence length (bp) | Average sequence length (bp) |
|---|---|---|---|
| TRANSFAC | 23 199 | 24 | 26 |
| JASPAR | 24 692 | 204 | 207 |
| Yale ChIP-Seq | 96 381 | 454 | 656 |
| HudsonAlpha ChIP-Seq | 65 081 | 559 | 561 |
| Parallel SELEX | 19 535 | 16 | 16 |
| All other datasets | 2655 | 1000 | 687 |
Figure 1. Comparison of AUC ratios for the TFBS models of JASPAR (green bars), TRANSFAC (red curve) and HOCOMOCO (blue curve). A value of 1 corresponds to the best model, i.e. the one with the highest AUC value. Points on the X-axis correspond to control sets for different TFs; the Y-axis shows AUC ratios. If a collection contained several TFBS models, the best result is shown. Details are given in the text.
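The AUC ratio plotted in Figure 1 can be illustrated with a minimal Python sketch. It assumes that the ratio is each collection's best AUC divided by the highest AUC obtained on the same control set, which is how the caption reads; the exact definition is given in the text of the paper and may differ. The function name `auc_ratios` and all AUC values below are hypothetical and serve only as an illustration.

```python
def auc_ratios(aucs_by_collection):
    """Return AUC ratios for one control set.

    aucs_by_collection maps a collection name to the AUC values of all
    its models for one TF. Assumption: the ratio is the collection's
    best AUC divided by the best AUC over all collections.
    """
    # If a collection has several models, keep only its best result.
    best_per_collection = {
        name: max(aucs) for name, aucs in aucs_by_collection.items()
    }
    # The overall best model defines the value 1 on the Y-axis.
    overall_best = max(best_per_collection.values())
    return {
        name: auc / overall_best
        for name, auc in best_per_collection.items()
    }


# Made-up AUC values for a single hypothetical control set.
example = {
    "JASPAR": [0.80],
    "TRANSFAC": [0.70, 0.72],
    "HOCOMOCO": [0.90],
}
ratios = auc_ratios(example)
```

Under this assumption the collection with the highest AUC always receives a ratio of exactly 1, and every other collection falls below it in proportion to its own best AUC.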
Overview of the HOCOMOCO TFBS models of different quality ratings
| Quality | TFs | Models | Sequences per TF (median) | Data sources per TF (median) |
|---|---|---|---|---|
| A | 49 | 52 | 2037 | 2 |
| B | 82 | 87 | 159.5 | 2 |
| C | 128 | 139 | 38.0 | 1 |
| D | 142 | 148 | 16.5 | 1 |
| E | 55 | 55 | 11.0 | 1 |
| F | 18 | — | 2047.0 | 1 |
Figure 2. TFBS model LOGOs for highly similar models within TF families. LOGOs for selected members of the CEBP, E2F and SP families are given. The Discrete Information Content is used for nucleotide scaling as in (29). Note that in our LOGO representation, the dominant nucleotides are placed at the bottom, which makes it easy to read off the sequence of the best-scoring binding site.