Yichao Li, Yating Liu, David Juedes, Frank Drews, Razvan Bunescu, Lonnie Welch.
Abstract
MOTIVATION: De novo motif discovery algorithms find statistically over-represented sequence motifs that may function as transcription factor binding sites. Current methods often report large numbers of motifs, making it difficult to perform further analyses and experimental validation. The motif selection problem seeks to identify a minimal set of putative regulatory motifs that characterize sequences of interest (e.g. ChIP-Seq binding regions).
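To make the motif selection problem concrete, here is a minimal sketch of a generic greedy set cover heuristic, in which each motif is represented by the set of sequences it occurs in. It is an illustration under stated assumptions, not the authors' greedy, tabu search or RILP implementation. The function name `greedy_motif_selection`, the `min_coverage` parameter and the toy data are all hypothetical.

```python
def greedy_motif_selection(motif_hits, n_sequences, min_coverage=0.9):
    """Greedily pick motifs until `min_coverage` of the sequences are covered.

    motif_hits: dict mapping a motif name to the set of sequence indices
                in which that motif occurs (hypothetical input format).
    Returns the selected motif names and the fraction of sequences covered.
    """
    covered = set()
    selected = []
    target = min_coverage * n_sequences
    while motif_hits and len(covered) < target:
        # Choose the motif that adds the most not-yet-covered sequences.
        best = max(motif_hits, key=lambda m: len(motif_hits[m] - covered))
        gain = motif_hits[best] - covered
        if not gain:
            break  # no remaining motif improves coverage
        selected.append(best)
        covered |= gain
    return selected, len(covered) / n_sequences


# Toy example: 8 sequences, 4 candidate motifs (made-up data).
toy_hits = {
    "motif_A": {0, 1, 2, 3},
    "motif_B": {3, 4, 5},
    "motif_C": {0, 1},
    "motif_D": {6},
}
selected, coverage = greedy_motif_selection(toy_hits, n_sequences=8, min_coverage=0.75)
print(selected, coverage)
```

In this toy run, `motif_C` is never chosen because `motif_A` already covers every sequence it hits. That redundancy is the kind a selection step is meant to remove.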
Year: 2020 PMID: 31665223 PMCID: PMC7703758 DOI: 10.1093/bioinformatics/btz697
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Motif selection evaluation pipeline using ENCODE datasets. The blue boxes represent the motif discovery steps in Kheradpour and Kellis (2014), from which the discovered motifs were obtained. All ChIP-Seq datasets from the same transcription factor group (as defined in Kheradpour and Kellis, 2014) were combined, and duplicate peaks were removed. The evaluation datasets contain 10 000 randomly selected peaks, 10 000 randomly selected background sequences and the discovered motifs. Two new motif selection algorithms (the tabu search algorithm and the RILP algorithm), the greedy algorithm (Al-Ouran ) and the enrichment method (Kheradpour and Kellis, 2014) were evaluated using nested cross-validation (CV).
Fig. 2. Boxplots of the four evaluation metrics. Median values and all the data points are shown. Each data point represents the dataset of a transcription factor group. Enrch: the enrichment method (Kheradpour and Kellis, 2014). Greedy: the greedy algorithm for motif selection (Al-Ouran ). RILP: the RILP algorithm for motif selection. Tabu: the tabu search algorithm for motif selection.
Shared motifs between the three set cover-based methods and the enrichment method
| Motif name | Motif logo | ForeCov | BackCov |
|---|---|---|---|
| TFAP2_disc2 | | 76.3% | 9.7% |
| POU5F1_disc1 | | 71.8% | 12.2% |
| REST_disc3 | | 60.7% | 8.4% |
| TAL1_disc1 | | 47.3% | 7.0% |
| ZNF143_disc3 | | 39.8% | 11.9% |
| PAX5_disc1 | | 37.8% | 5.6% |
| PBX3_disc2 | | 37.3% | 7.4% |
Note: Motif names used in this table are adopted from Kheradpour and Kellis (2014).
Putative cofactors discovered by the three set cover-based methods
| Factor group | Discovery tool | Motif logo | ForeCov | BackCov | Fisher | JASPAR match | TOMTOM |
|---|---|---|---|---|---|---|---|
| HEY1 | MEME | | 67.6% | 22.0% | 0 | MA0528.1 (ZNF263) | 3.1E-13 |
| BRCA1 | AlignACE | | 46.8% | 2.8% | 0 | MA0527.1 (ZBTB33) | 4.0E-06 |
| PBX3 | AlignACE | | 31.8% | 8.3% | 7.8E-281 | MA0516.1 (SP2) | 3.2E-08 |
| RXRA | MEME | | 39.1% | 22.8% | 2.7E-137 | MA1149.1 (RXRG) | 1.8E-11 |
| GATA | MEME | | 40.8% | 34.9% | 1.6E-17 | MA0528.1 (ZNF263) | 6.0E-08 |
| EP300 | MEME | | 30.1% | 25.0% | 1.0E-15 | MA0528.1 (ZNF263) | 9.2E-09 |
Note: These six motifs were matched to known TFBSs and were not reported by the enrichment method (Kheradpour and Kellis, 2014). The significance of motif enrichment (i.e. Fisher P-value) in the bound regions versus background sequences was calculated based on a Fisher exact test (Lin ). The top known motif matches based on TOMTOM (Gupta ) from the JASPAR (Khan ) database are shown.
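As a rough illustration of how the ForeCov, BackCov and Fisher columns relate, the sketch below computes foreground and background coverage from hit counts and a Fisher exact P-value for the resulting 2x2 table. It makes two assumptions: the source does not say whether the test was one-sided or two-sided, so a one-sided (enrichment) test is used here, and the counts are made-up toy values rather than data from the paper. The function name `fisher_right_tail` is hypothetical.

```python
from math import comb


def fisher_right_tail(fg_hit, fg_miss, bg_hit, bg_miss):
    """One-sided Fisher exact P-value for enrichment in the foreground.

    The 2x2 table is [[fg_hit, fg_miss], [bg_hit, bg_miss]].
    Assumption: a one-sided test; the source does not specify sidedness.
    """
    n_fg = fg_hit + fg_miss
    n_bg = bg_hit + bg_miss
    n_hit = fg_hit + bg_hit
    total = comb(n_fg + n_bg, n_hit)
    # Sum hypergeometric terms for tables with at least fg_hit foreground hits.
    tail = sum(
        comb(n_fg, k) * comb(n_bg, n_hit - k)
        for k in range(fg_hit, min(n_fg, n_hit) + 1)
    )
    return tail / total


# Toy counts (not from the paper): 10 foreground and 10 background sequences.
fg_hit, fg_total = 8, 10
bg_hit, bg_total = 2, 10

fore_cov = fg_hit / fg_total  # fraction of foreground sequences with the motif
back_cov = bg_hit / bg_total  # fraction of background sequences with the motif
p_value = fisher_right_tail(fg_hit, fg_total - fg_hit, bg_hit, bg_total - bg_hit)
print(f"ForeCov={fore_cov:.1%} BackCov={back_cov:.1%} P={p_value:.4g}")
```

With very large foreground and background sets and a strong coverage gap, a P-value computed this way can fall below the smallest representable double and print as 0. That is one possible reading of the zeros in the Fisher column. For real datasets, `scipy.stats.fisher_exact` with `alternative="greater"` performs the same one-sided test.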