Fatemeh Behjati Ardakani, Florian Schmidt, Marcel H Schulz.
Abstract
Background: Understanding the location and cell-type-specific binding of transcription factors (TFs) is important in the study of gene regulation. Computational prediction of TF binding sites is challenging because TFs often bind only to short DNA motifs, and cell-type-specific co-factors may work together with the same TF to determine binding. Here, we consider the problem of learning a general model for the prediction of TF binding using DNase1-seq data and TF motif descriptions in the form of position-specific energy matrices (PSEMs).
Keywords: Chromatin accessibility; DNase1-seq; ENCODE-DREAM in vivo Transcription Factor binding site prediction challenge; Ensemble learning; Indirect-binding; TF-complexes; Transcription Factors
Year: 2018 PMID: 31723409 PMCID: PMC6823902 DOI: 10.12688/f1000research.16200.2
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Number of bins labeled as bound per transcription factor (TF) and tissue, deduced from TF ChIP-seq data.
| TF | Number of bins labelled as bound per tissue |
|---|---|
| ATF7 | 272,2234 (GM12878), 218,239 (HepG2), 345,775 (K562) |
| CREB1 | 164,968 (GM12878), 103,752 (H1-hESC), 178,080 (HepG2), 98,554 (K562) |
| CTCF | 179,672 (A549), 271,097 (H1-hESC), 206,336 (HeLa-S3), 208,868 (HepG2), 215,238 (K562), |
| E2F1 | 93,117 (GM12878), 55,391 (HeLa-S3) |
| EGR1 | 72,595 (GM12878), 52,733 (H1-hESC), 175,994 (HCT116), 58,793 (MCF-7) |
| EP300 | 126,409 (GM12878), 69,247 (H1-hESC), 157,629 (HeLa-S3), 168,173 (HepG2), 137,369 (K562) |
| GABPA | 26,467 (GM12878), 51,666 (H1-hESC), 31,202 (HeLa-S3), 60,552 (HepG2), 109,423 (MCF-7), 78,403 (SK-N-SH) |
| JUND | 203,665 (HCT116), 179,999 (HeLa-S3), 183,558 (HepG2), 193,814 (K562), 92,905 (MCF-7), 222,013 (SK-N-SH) |
| MAFK | 34,054 (GM12878), 97,659 (H1-hESC), 62,124 (HeLa-S3), 291,337 (HepG2), 201,157 (IMR90) |
| MAX | 301,615 (A549), 98,327 (GM12878), 224,379 (H1-hESC), 321,501 (HCT116), 211,590 (HeLa-S3), 317,579 (HepG2), 318,318 (K562), 250,775 (SK-N-SH) |
| MYC | 57,512 (A549), 91,325 (HeLa-S3), 183,627 (K562), 151,748 (MCF-7) |
| REST | 71,251 (H1-hESC), 47,654 (HeLa-S3), 67,453 (HepG2), 59,640 (MCF-7), 48,946 (Panc1), 94,082 (SK-N-SH) |
| RFX5 | 161,689 (GM12878), 22,948 (HeLa-S3), 54,961 (MCF-7) |
| SRF | 21,495 (GM12878), 40,201 (H1-hESC), 176,158 (HCT116), 22,593 (HepG2), 18,895 (K562) |
| TAF1 | 87,109 (GM12878), 185,027 (H1-hESC), 93,824 (HeLa-S3), 110,385 (K562), 83,276 (SK-N-SH) |
| TCF12 | 51,798 (GM12878), 104,834 (H1-hESC), 82,102 (MCF-7) |
| TCF7L2 | 100,926 (HCT116), 165,264 (HeLa-S3), 143,025 (Panc1) |
| TEAD4 | 66,198 (A549), 103,483 (H1-hESC), 174,716 (HCT116), 125,917 (HepG2), 186,759 (K562) |
| YY1 | 136,621 (GM12878), 195,489 (H1-hESC), 63,293 (HCT116), 133,943 (HepG2) |
| ZNF143 | 197,385 (GM12878), 178,088 (H1-hESC), 48,154 (HeLa-S3), 103,755 (HepG2) |
Test data used in this article, shown per transcription factor (TF) and tissue.
| TF | Tissues |
|---|---|
| CTCF | PC-3, Induced |
| E2F1 | K562 |
| EGR1 | liver |
| GABPA | liver |
| JUND | liver |
| MAX | liver |
| REST | liver |
| TAF1 | liver |
Figure 1. (a) Data pre-processing workflow using DNase1-seq Hypersensitive Sites (DHSs). Using JAMM, DHSs are called considering all available replicates for a distinct tissue. Transcription factor (TF) affinities in the identified DHSs are computed using TRAP for 557 TFs; the median signal of the DHSs is assessed using bedtools. (b) An alternative data pre-processing workflow without DHSs: TF affinities and the median DNase1-seq signal are computed per bin.
Figure 2. (a) An overview of model training for a distinct transcription factor, TF, with multiple training tissues. Using the full feature matrices T1, T2, T3, depicted in (b), TF- and tissue-specific random forest (RF) classifiers are trained. From those RF classifiers (RF1, RF2, RF3), we determine the union of the top 20 features from each RF. In this example, the union of top TFs comprises 24 TFs. Next, we design reduced tissue-specific feature matrices T'1, T'2, T'3, as shown in (c), based on the union of the top TF features. Subsequently, tissue-specific RF classifiers (RF'1, RF'2, RF'3) are trained on these reduced feature sets. The tissue-specific RF classifiers are applied to all training tissues and their predictions are aggregated to form the feature matrix T', visualized in (d), which is used to train an ensemble model (RF). In the testing phase, the feature matrix T' is fed to the trained ensemble model RF to predict the labels for the unseen data (e). Note that the column Tissue in (d) is not included in the model and is shown here for illustration purposes only. The feature matrices shown represent feature setup (1) using DNase1 Hypersensitive Sites (DHSs).
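The feature-reduction step described in this caption (taking the union of the top-ranked features across the tissue-specific classifiers) can be illustrated with a minimal sketch. The function names, the toy importance values, and the choice of k = 2 are assumptions made for illustration only; the caption specifies the top 20 features per classifier but not how the importances are computed or stored.

```python
def top_k_features(importances, k):
    """Return the names of the k features with the highest importance."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


def union_of_top_features(per_tissue_importances, k=20):
    """Union of the top-k features over all tissue-specific classifiers."""
    selected = set()
    for importances in per_tissue_importances.values():
        selected.update(top_k_features(importances, k))
    return sorted(selected)


# Hypothetical importances for two tissue-specific classifiers (toy values).
per_tissue = {
    "tissue_1": {"DNase1": 0.50, "MAX": 0.20, "MYC": 0.15, "CTCF": 0.05},
    "tissue_2": {"DNase1": 0.45, "MAX": 0.10, "MYC": 0.05, "CTCF": 0.30},
}

# With k=2 the two classifiers share one feature and each contributes one more.
reduced = union_of_top_features(per_tissue, k=2)
print(reduced)
```

Because the top-ranked sets overlap between tissues, the union is usually smaller than k times the number of tissues, which is consistent with the 24 TFs reported for three tissues in the example above.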
Figure 3. (a) PR-AUC and ROC-AUC for different sets of features: all features, the top 10, and the top 20 features. The difference in model performance between the top 20 features and all features is only marginal. (b) Comparison of the out-of-bag (OOB) error between ensemble models and tissue-specific random forest (RF) classifiers. The ensemble models show superior performance compared to the tissue-specific RF classifiers. (c) PR-AUC and ROC-AUC computed on unseen test data for ensemble and tissue-specific RF classifiers. Due to the imbalanced nature of the test data, the ROC-AUC values are overly optimistic, as they are biased by the numerous unbound sites. The PR-AUC, however, gives a more realistic view of the actual performance of the models. Note that the scales of the y-axes differ between the sub-figures.
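The point about class imbalance can be made concrete with a small, self-contained sketch. It computes ROC-AUC as the fraction of correctly ordered (bound, unbound) pairs and approximates PR-AUC by average precision; this approximation, the function names, and the toy scores are assumptions for illustration and are not taken from the article.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of the precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits  # assumes at least one positive label


# Toy data: 98 unbound bins and 2 bound bins (hypothetical scores).
labels = [0] * 98 + [1, 1]
scores = [i / 100 for i in range(98)] + [0.955, 0.925]

roc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"ROC-AUC = {roc:.3f}, average precision = {ap:.3f}")
```

In this toy setting only a handful of unbound bins outrank the two bound bins, so ROC-AUC stays close to 1, while the precision-based measure drops sharply because those few false positives sit at the top of the ranking.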
Figure 7. Top 5 features obtained from the average importance ranking of all tissue-specific classifiers for a given target TF, shown on the x-axis, for the peak setup (a) and the bin setup (b). Features related to DNase1 are the dominant ones.
Figure 4. Comparison of tissue number and classifier setups for the three TFs E2F6, MAX, and TEAD4.
(a) Model performance as a function of the number of tissues used for training. The OOB error decreases as more tissues are included in the ensemble learning. Red dots represent the mean classification error across all tissue-specific classifiers; black points represent individual models. (b) Comparison between two ensemble models: averaging (which takes the average of all individual RF predictions) and the RF ensemble model. In addition, one RF classifier was trained on a pooled data set comprising the training data for all available tissues for one target TF. The ensemble models perform better than the models based on aggregated data.
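The difference between the two ensemble variants compared in panel (b) can be sketched as follows. Averaging combines the per-classifier predictions directly, whereas the RF ensemble treats them as input features for a second-level classifier. The function names and prediction values are hypothetical, and the second-level classifier itself is not shown.

```python
def average_ensemble(predictions_per_model):
    """Average the predicted binding probabilities of all models, per bin."""
    n_models = len(predictions_per_model)
    return [sum(column) / n_models for column in zip(*predictions_per_model)]


def stacked_features(predictions_per_model):
    """Arrange predictions as one feature row per bin (one column per model)."""
    return [list(column) for column in zip(*predictions_per_model)]


# Hypothetical predictions of three tissue-specific classifiers for three bins.
preds = [
    [0.9, 0.2, 0.4],  # classifier trained on tissue 1
    [0.7, 0.1, 0.6],  # classifier trained on tissue 2
    [0.8, 0.3, 0.2],  # classifier trained on tissue 3
]

avg = average_ensemble(preds)      # one averaged score per bin
features = stacked_features(preds)  # rows that a second-level model could learn from
print(avg)
print(features)
```

Averaging weights every tissue-specific classifier equally, while a model trained on the stacked rows can learn to trust some classifiers more than others.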
Figure 5. (a) Log-transformed PPI scores computed for a set of TFs. In the Random case, we show the mean PPI score across 100 random draws and its standard deviation. The smaller the PPI score, the better. Only for three TFs (MAX, TAF1, ZNF143) is the randomly sampled PPI score better than or equal to the score derived for the TFs selected by the RF classifiers. (b) PPI network obtained from STRING centered around the TF MAFK, highlighting proteins that interact with MAFK with high confidence. Proteins colored in green were identified as important features in the RF classifiers; proteins shown in grey could not be retrieved by our model, because they are DNA-binding proteins, or we do not have a PWM for them in our set. Regulators shown in red could have been detected by the RF, but were not included in the top set of regulators.
Figure 6. Comparison of PR-AUC and ROC-AUC for both feature setups computed on test data.
In terms of PR-AUC, the peak-based models clearly perform better than the bin-based models. In terms of ROC-AUC, the picture is less clear; however, because the test data are highly unbalanced, ROC-AUC is less reliable than PR-AUC.