Shuxiang Ruan, Gary D Stormo.
Abstract
BACKGROUND: Transcription factor (TF) binding site specificity is commonly represented by some form of matrix model in which the positions in the binding site are assumed to contribute independently to the site's activity. The independence assumption is known to be an approximation, often a good one but sometimes poor. Alternative approaches have been developed that use k-mers (DNA "words" of length k) to account for the non-independence, and more recently DNA structural parameters have been incorporated into the models. ChIP-seq data are often used to assess the discriminatory power of motifs and to compare different models. However, to measure the improvement due to using more complex models, one must compare to optimized matrix models.
Keywords: ChIP-seq; DNA shape features; Motif; Motif optimization; Position weight matrix
Year: 2018 PMID: 29510689 PMCID: PMC5840810 DOI: 10.1186/s12859-018-2104-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
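
The independence assumption described in the abstract can be illustrated with a minimal sketch: under a matrix model, the score of a candidate site is simply the sum of one weight per position. The function names (`score_site`, `best_window`) and the toy weights below are hypothetical and are not taken from the paper or from any of the evaluated programs.

```python
def score_site(pwm, site):
    """Sum one weight per position: the additive (independence) assumption."""
    if len(site) != len(pwm):
        raise ValueError("site length must match PWM length")
    return sum(column[base] for column, base in zip(pwm, site))


def best_window(pwm, sequence):
    """Return (score, offset) of the best-scoring window on the given strand.

    Raises ValueError if the sequence is shorter than the PWM.
    """
    width = len(pwm)
    scored = [
        (score_site(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)


# Toy 3-position PWM with made-up weights that favour the site "ACG".
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]

print(best_window(toy_pwm, "TTACGTT"))
```

A di-nucleotide or k-mer model would replace the per-position lookup with a lookup keyed on adjacent bases, which is what lets it capture non-independence between positions.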
Descriptions of the motif optimization algorithms evaluated
| Algorithm | Output | Description |
|---|---|---|
| JASPAR | Position frequency matrix | PFMs from the JASPAR database |
| DAMO | Position weight matrix | Modified DiMO program that outputs PWMs instead of PFMs |
| DAMO_PFM | Position frequency matrix | PFMs derived from the DAMO single-nucleotide PWMs |
| DAMO_dinuc | Position weight matrix | The adjacent di-nucleotide mode of DAMO |
| DNAshapedTFBS_4bit | Gradient boosting classifier | DNAshapedTFBS with 4-bit encoding |
| DNAshapedTFBS_4bit + shape | Gradient boosting classifier | DNAshapedTFBS_4bit plus DNA shape features |
| Shape_only | Gradient boosting classifier | The feature vector contains only DNA shape features |
| JASPAR + shape | Gradient boosting classifier | JASPAR PFM score plus DNA shape features |
| DAMO + shape | Gradient boosting classifier | DAMO single-nucleotide PWM score plus DNA shape features |
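
The table distinguishes position frequency matrices (PFMs) from position weight matrices (PWMs) and mentions a 4-bit sequence encoding. The sketch below shows one common way to relate these representations. The log2-odds conversion with a pseudocount and a uniform background is an assumption for illustration, not the procedure DAMO is stated to use, and the one-hot reading of "4-bit encoding" (with an A, C, G, T bit order) is likewise assumed.

```python
import math

BASES = "ACGT"


def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Convert per-position base counts to log2-odds weights.

    The pseudocount and uniform background are illustrative assumptions.
    """
    pwm = []
    for counts in pfm:
        total = sum(counts[b] for b in BASES) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2((counts[b] + pseudocount) / total / background)
            for b in BASES
        })
    return pwm


def one_hot(sequence):
    """Encode each base as four 0/1 values in A, C, G, T order (assumed)."""
    return [[1 if base == b else 0 for b in BASES] for base in sequence]


# Toy PFM with made-up counts: a fully conserved A, then an uninformative column.
toy_pfm = [
    {"A": 8, "C": 0, "G": 0, "T": 0},
    {"A": 2, "C": 2, "G": 2, "T": 2},
]
toy_pwm = pfm_to_pwm(toy_pfm)
print(toy_pwm[0]["A"], toy_pwm[1]["A"])
print(one_hot("AG"))
```

In this toy example the uninformative column receives a weight of zero for every base, so it contributes nothing to a site's score, while the conserved column rewards A and penalizes the other bases.
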
Mean AUPRC (and standard deviation) on ChIP-seq data
| Algorithm | Training | Testing |
|---|---|---|
| JASPAR | 0.812 (0.132) | 0.812 (0.132) |
| DAMO | 0.834 (0.119) | 0.832 (0.120) |
| DAMO_PFM | 0.825 (0.120) | 0.824 (0.122) |
| DAMO_dinuc | 0.844 (0.114) | 0.839 (0.119) |
| DNAshapedTFBS_4bit | 0.854 (0.105) | 0.842 (0.115) |
| DNAshapedTFBS_4bit + shape | 0.875 (0.090) | 0.845 (0.113) |
| Shape_only | 0.871 (0.089) | 0.840 (0.112) |
| JASPAR + shape | 0.878 (0.089) | 0.846 (0.112) |
| DAMO + shape | 0.879 (0.090) | 0.846 (0.113) |
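
The values above are areas under the precision-recall curve (AUPRC). As a rough sketch of how such a number can be obtained from labelled sequences and model scores, the function below computes the step-wise "average precision" estimate. The choice of estimator is an assumption, since this record does not say how the curve was integrated, and `average_precision` is a hypothetical helper name.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    labels: 1 for bound (positive) sequences, 0 for background sequences.
    scores: model scores, higher meaning more likely bound.
    Items with identical scores are treated as a single threshold.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    area = 0.0
    prev_recall = 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Consume every item tied at the current score.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        area += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return area


# Made-up labels and scores, for illustration only.
print(average_precision([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.6]))
```

Computing the same quantity separately on the training and testing sequences of a dataset gives the paired values summarized in the two columns of the table.
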
Fig. 1 Differences in AUPRC between training and testing datasets. For each model, the difference is shown for each of the 396 datasets. The box spans the 1st to 3rd quartiles, with a line at the median (2nd quartile), and the whiskers extend 1.5 times the interquartile range (IQR) below the 1st quartile and above the 3rd quartile
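
The box-plot quantities named in the Fig. 1 caption can be computed as in the sketch below. Two details are assumptions: the quartile interpolation rule (Python's "inclusive" method) and the common convention that a drawn whisker stops at the most extreme data point lying within 1.5 IQR of the box. The input values are made up and are not taken from the figure.

```python
import statistics


def box_stats(values, whisker=1.5):
    """Quartiles, whisker ends and outliers for a Tukey-style box plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Toy values standing in for per-dataset differences (made up).
toy_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
print(box_stats(toy_values))
```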
Fig. 2 Comparison of AUPRC scores for different models. a-h JASPAR + shape on the vertical axis and each of the other eight models on the horizontal axis. i Difference in AUPRC for the DAMO PWM with and without di-nucleotides. j Difference in AUPRC between the DAMO PWM and the DAMO PFM. k-l Differences from adding shape features to the 4bit model and the JASPAR PFM model