Sheng Liu1, Cristina Zibetti2, Jun Wan1, Guohua Wang1, Seth Blackshaw1,2,3,4,5, Jiang Qian6.
Abstract
BACKGROUND: Computational prediction of transcription factor (TF) binding sites in different cell types is challenging. Recent technological developments allow us to determine genome-wide chromatin accessibility in various cellular and developmental contexts. Chromatin accessibility profiles provide useful information for predicting TF binding events under various physiological conditions. Furthermore, ChIP-Seq analysis has been used to determine genome-wide binding sites for a range of different TFs in multiple cell types. Integrating these two types of genomic information can improve the prediction of TF binding events.
Keywords: Chromatin accessibility; Feature selection; Machine learning; Transcription factor binding prediction
Year: 2017 PMID: 28750606 PMCID: PMC5530957 DOI: 10.1186/s12859-017-1769-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Features used in the prediction
| Features | Description |
|---|---|
| PWM score | Score of the DNA sequence against the position weight matrix |
| Conservation score | PhastCons conservation score for multiple alignments of 99 vertebrate genomes to the human genome |
| Distance to TSS | Distance to transcription start site |
| Reads at site | Average reads at the binding site |
| Cut counts at site | Average cut counts at the binding site |
| Upstream reads | Average reads upstream of the binding site |
| Downstream reads | Average reads downstream of the binding site |
| Upstream cut counts | Average cut counts upstream of the binding site |
| Downstream cut counts | Average cut counts downstream of the binding site |
| Reads footprint score | Average footprint score based on reads profile |
| Cut counts footprint score | Average footprint score based on cut profile |
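
As an illustration of how two kinds of features in the table above might be computed, the following is a minimal sketch, not the authors' implementation. It assumes a log-odds PWM score against a uniform background and simple arithmetic means over the site and its flanks on the plus strand. The matrix values, the flank width, and the function names `pwm_score` and `window_means` are hypothetical choices made for this example.

```python
import math

# Hypothetical 4-position motif given as per-position base probabilities.
# These numbers are illustrative only and are not taken from the paper.
EXAMPLE_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

# Assumed uniform background base frequencies.
UNIFORM_BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def pwm_score(seq, pwm=EXAMPLE_PWM, background=UNIFORM_BACKGROUND):
    """Return the base-2 log-odds score of `seq` against `pwm`.

    `seq` must have the same length as the motif.
    """
    if len(seq) != len(pwm):
        raise ValueError("sequence length must equal motif length")
    return sum(
        math.log2(column[base] / background[base])
        for base, column in zip(seq.upper(), pwm)
    )


def window_means(counts, start, end, flank):
    """Average a per-base signal (reads or cut counts) in three windows.

    `counts` is a list of per-base values, the site is the half-open
    interval [start, end), and `flank` is the assumed flank width.
    Upstream and downstream are defined for the plus strand.
    """
    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return {
        "site": mean(counts[start:end]),
        "upstream": mean(counts[max(0, start - flank):start]),
        "downstream": mean(counts[end:end + flank]),
    }
```

Applied to a candidate motif site, `pwm_score` would give the static sequence feature, while `window_means` would be called once on the read profile and once on the cut-count profile to give the site, upstream, and downstream averages. The footprint scores are omitted here because their definition is not given in this section.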
Fig. 1 Cut profiles around motif sites show different patterns. The left panel shows the average cut counts around binding sites for bound sites (positive) and unbound sites (negative), respectively. The right panel shows cut counts for each individual site from the positive set
Fig. 2 Different scenarios of prediction using ChIP-Seq as ground truth
Fig. 3 Combination of static and dynamic features increases prediction performance. Boxplot of AUC of 34 different TF motifs using selected features
Fig. 4 Performance of cross-TF predictions. The TFs shown on the Y-axis were used for training and the binding sites of the TFs shown on the X-axis were predicted. The cells highlighted in blue boxes are the self-predictions, which were used as a benchmark. The models were constructed from a TF motif in GM12878. The color shows the AUC for each prediction. The bottom panel shows the results using CENTIPEDE, DNASE2TF and HINT-BC for these TFs
Fig. 5 Average AUC increases with the number of training motifs. As the number of motifs used for training increases, the average AUC of prediction over all motifs increases
Fig. 6 Combination of multiple TF motifs. Prediction combining the profiles of multiple TF motifs is significantly better than prediction using the profile of a single TF motif. Boxplots show cross-TF prediction using a single TF for training. Green asterisks denote cross-TF prediction using multiple TF motifs for training. Diamonds are the self-predictions
Fig. 7 Results of cross-cell-type prediction. Cross-cell prediction for 19 TFs. For comparison, the performance of the self-prediction is indicated by a green square
Fig. 8 Mixed prediction is comparable with prediction using the profiles of the target TF itself. For each target TF motif, 100 random repeats were made, each using data from a single TF motif for training regardless of cell line. The green square is the result of single-TF-motif binding prediction from a model constructed from all 34 TFs together
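
The figure captions above report performance as AUC with ChIP-Seq as ground truth. As a minimal sketch of that metric, assuming AUC here means the area under the ROC curve, the function below computes it from labels and predicted scores by pairwise comparison (the Mann-Whitney formulation). The function name `roc_auc` and the toy data in the usage note are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    `labels` holds 1 for bound (positive) sites and 0 for unbound
    (negative) sites; `scores` holds the predicted binding scores.
    The result is the fraction of positive/negative pairs in which the
    positive site has the higher score, with ties counted as one half.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative site")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

For example, `roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])` ranks three of the four positive/negative pairs correctly. The pairwise loop is quadratic in the number of sites, so a rank-based implementation would be preferable for genome-scale data.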