David Simcha, Nathan D Price, Donald Geman.
Abstract
A major challenge in molecular biology is reverse-engineering the cis-regulatory logic that plays a major role in the control of gene expression. This program includes searching through DNA sequences to identify "motifs" that serve as the binding sites for transcription factors or, more generally, are predictive of gene expression across cellular conditions. Several approaches have been proposed for de novo motif discovery (searching sequences without prior knowledge of binding sites or nucleotide patterns). However, unbiased validation is not straightforward. We consider two approaches to unbiased validation of discovered motifs: testing the statistical significance of a motif using a DNA "background" sequence model to represent the null hypothesis and measuring performance in predicting membership in gene clusters. We demonstrate that the background models typically used are "too null," resulting in overly optimistic assessments of significance, and argue that performance in predicting TF binding or expression patterns from DNA motifs should be assessed by held-out data, as in predictive learning. Applying this criterion to common motif discovery methods resulted in universally poor performance, although there is a marked improvement when motifs are statistically significant against real background sequences. Moreover, on synthetic data where "ground truth" is known, discriminative performance of all algorithms is far below the theoretical upper bound, with pronounced "over-fitting" in training. A key conclusion from this work is that the failure of de novo discovery approaches to accurately identify motifs is basically due to statistical intractability resulting from the fixed size of co-regulated gene clusters, and thus such failures do not necessarily provide evidence that unfound motifs are not active biologically. Consequently, the use of prior knowledge to enhance motif discovery is not just advantageous but necessary. 
An implementation of the LR and ALR algorithms is available at http://code.google.com/p/likelihood-ratio-motifs/.
Year: 2012 PMID: 23144830 PMCID: PMC3492406 DOI: 10.1371/journal.pone.0047836
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
The properties and sizes of the datasets used.
| Dataset | N Clusters | N Clusters (≥10 Seqs) | Clustering Method | Sequences |
| Beer et al. | 49 | 48 | K-means, Pearson correlation | 800 bp upstream of coding start |
| Harbison et al. | 175 | 128 | ChIP-chip TF binding | Binding seqs provided by Harbison et al. |
| Human CMap | 100 | 100 | K-means, Kendall’s tau | 2000 bp upstream of coding start |
N Clusters (≥10 Seqs) is the number of clusters that contained at least ten sequences.
Clusters with fewer than ten sequences were excluded from the analysis due to excessively small sample size.
Figure 1. Generative models are too null.
Panel (a): Quantile plot of MEME E-values for approximately 15,000 random runs, with E-values excluded. The X-axis represents the E-value as reported by MEME. The Y-axis represents the quantile. For example, under our null model E-values below are reported with probability slightly more than . Panels (b) and (c): Quantile plots of LR false discovery rates, similar to the MEME E-value quantile plots, for the Beer et al. and Human CMap datasets respectively. Panel (d): Z-score plots of A/T fraction of yeast and human intergenic sequences relative to the distribution expected under a 6th order Markov model, with the standard normal distribution (red) shown for reference.
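The comparison in panel (d) can be illustrated with a minimal sketch, assuming a plain maximum-likelihood Markov model and Monte Carlo sampling. The function names, the sampling-based estimate of the expected distribution, and the uniform fallback for unseen contexts are all assumptions made for illustration; the record above does not say how the authors fitted or evaluated their 6th order model.

```python
import random
from collections import Counter, defaultdict
from statistics import mean, stdev


def fit_markov(seqs, order):
    """Count next-base occurrences for every context of length `order`."""
    counts = defaultdict(Counter)
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return counts


def sample_sequence(counts, order, length, start, rng):
    """Draw one sequence from the fitted model, seeded with the first bases of `start`."""
    seq = list(start[:order])
    while len(seq) < length:
        context = "".join(seq[-order:]) if order else ""
        nxt = counts.get(context)
        if nxt:
            bases, weights = zip(*nxt.items())
            seq.append(rng.choices(bases, weights=weights)[0])
        else:
            # Context never seen in training: fall back to a uniform draw (assumption).
            seq.append(rng.choice("ACGT"))
    return "".join(seq)


def at_fraction(seq):
    """Fraction of bases in `seq` that are A or T."""
    return (seq.count("A") + seq.count("T")) / len(seq)


def at_zscore(seq, counts, order, n_samples=200, rng=None):
    """Z-score of the observed A/T fraction against model-sampled sequences."""
    if rng is None:
        rng = random.Random(0)
    sampled = [
        at_fraction(sample_sequence(counts, order, len(seq), seq, rng))
        for _ in range(n_samples)
    ]
    sd = stdev(sampled)
    if sd == 0:
        raise ValueError("sampled A/T fractions have zero variance")
    return (at_fraction(seq) - mean(sampled)) / sd
```

If real sequences were well described by the background model, z-scores computed this way would follow roughly a standard normal distribution; a much wider spread is the kind of mismatch the "too null" argument refers to.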
The fraction of variance in dimer frequency across sequences that is explained by expression profile or transcription factor binding sequence set, with the associated F-statistic P-value in parentheses.
| Dimer | Beer et al. | Harbison et al. | Human Cmap (Upstream) | Human Cmap (Introns) |
| AA/TT | 0.076 (8.94e-21) | 0.083 (4.47e-77) | 0.110 (8.39e-208) | 0.267 (0) |
| AC/GT | 0.053 (7.38e-11) | 0.057 (1.27e-32) | 0.030 (6.09e-29) | 0.068 (2.02e-102) |
| AG/CT | 0.033 (0.000485) | 0.056 (1.94e-30) | 0.076 (2.24e-127) | 0.210 (0) |
| AT | 0.070 (6.34e-18) | 0.148 (1.33e-222) | 0.088 (1.84e-155) | 0.231 (0) |
| CA/TG | 0.037 (3.12e-05) | 0.056 (5.35e-30) | 0.078 (5.43e-131) | 0.141 (2.01e-277) |
| CC/GG | 0.115 (2.73e-40) | 0.156 (7.28e-245) | 0.101 (1.28e-186) | 0.228 (0) |
| CG | 0.093 (4.85e-29) | 0.158 (2.15e-249) | 0.081 (1.85e-139) | 0.095 (4.04e-166) |
| GA/TC | 0.047 (1.44e-08) | 0.051 (1.25e-22) | 0.041 (2.92e-49) | 0.164 (0) |
| GC | 0.078 (1.36e-21) | 0.149 (1.56e-227) | 0.081 (3.88e-139) | 0.207 (0) |
| TA | 0.051 (5.05e-10) | 0.131 (5.53e-183) | 0.098 (1.37e-180) | 0.265 (0) |
For the Human Cmap data, this was assessed both for the 2,000 nucleotides upstream of the coding start site and for the intron sequences.
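A fraction of variance explained together with an F statistic corresponds to a one-way analysis of variance, with cluster (or binding set) membership as the grouping factor. The sketch below is a minimal illustration under that assumption. The helper names are hypothetical, and pooling each dimer with its reverse complement is inferred from the row labels (for example AA/TT), not stated in the text.

```python
def dimer_frequency(seq, dimer):
    """Fraction of overlapping positions holding `dimer` or its reverse complement."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    revcomp = "".join(comp[b] for b in reversed(dimer))
    targets = {dimer, revcomp}  # palindromic dimers such as AT collapse to one entry
    hits = sum(seq[i:i + 2] in targets for i in range(len(seq) - 1))
    return hits / (len(seq) - 1)


def variance_explained(values, groups):
    """One-way ANOVA: return (fraction of variance explained, F statistic)."""
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    n = len(values)
    grand_mean = sum(values) / n
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    ss_within = ss_total - ss_between
    df_between = len(by_group) - 1
    df_within = n - len(by_group)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between / ss_total, f_stat
```

The P-value would then be the upper tail of the F distribution with `df_between` and `df_within` degrees of freedom, obtainable for instance from `scipy.stats.f.sf`.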
The mean AUROC of all algorithms on all datasets using independent holdout data.
| | Mean AUROC (Holdout) | | | |
| Algorithm | Beer et al. | Harbison et al. | Human CMap | Synthetic |
| LR | 0.591 | 0.600 | 0.530 | 0.677 |
| ALR | 0.620 | 0.629 | 0.569 | 0.683 |
| MEME | 0.598 | 0.536 | 0.521 | 0.718 |
| AlignAce | 0.561 | 0.524 | 0.524 | 0.660 |
| DEME | 0.613 | 0.557 | 0.541 | 0.677 |
This validation is unbiased.
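For reference, AUROC can be computed directly from its probabilistic definition: the chance that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below assumes that each held-out sequence receives a single motif score; the scoring function itself is outside its scope and the function name is illustrative.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Scores of held-out sequences inside and outside a cluster (made-up numbers).
in_cluster = [0.9, 0.4]
out_of_cluster = [0.5, 0.1]
print(auroc(in_cluster, out_of_cluster))  # 0.75
```

An AUROC of 0.5 corresponds to chance, so the values in the Human CMap column (0.521 to 0.569) are only slightly better than random ranking.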
The mean AUROC of all algorithms on all datasets based on training and testing on the same data.
| | Mean AUROC (Resubstitution) | | | |
| Algorithm | Beer et al. | Harbison et al. | Human CMap | Synthetic |
| LR | 0.776 | 0.771 | 0.731 | 0.814 |
| ALR | 0.836 | 0.857 | 0.799 | 0.858 |
| MEME | 0.753 | 0.784 | 0.637 | 0.809 |
| AlignAce | 0.657 | 0.693 | 0.584 | 0.831 |
| DEME | 0.835 | 0.848 | 0.799 | 0.894 |
The optimistic bias reveals massive overfitting.
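The size of the optimistic bias can be read off by subtracting the holdout AUROC from the resubstitution AUROC for each algorithm and dataset. The short sketch below does this arithmetic on the values from the two tables; the variable names are illustrative only.

```python
datasets = ["Beer et al.", "Harbison et al.", "Human CMap", "Synthetic"]

# Mean AUROC values copied from the holdout and resubstitution tables.
holdout = {
    "LR": [0.591, 0.600, 0.530, 0.677],
    "ALR": [0.620, 0.629, 0.569, 0.683],
    "MEME": [0.598, 0.536, 0.521, 0.718],
    "AlignAce": [0.561, 0.524, 0.524, 0.660],
    "DEME": [0.613, 0.557, 0.541, 0.677],
}
resubstitution = {
    "LR": [0.776, 0.771, 0.731, 0.814],
    "ALR": [0.836, 0.857, 0.799, 0.858],
    "MEME": [0.753, 0.784, 0.637, 0.809],
    "AlignAce": [0.657, 0.693, 0.584, 0.831],
    "DEME": [0.835, 0.848, 0.799, 0.894],
}

# Optimism = resubstitution AUROC minus holdout AUROC, per algorithm and dataset.
optimism = {
    (algo, ds): resubstitution[algo][i] - holdout[algo][i]
    for algo in holdout
    for i, ds in enumerate(datasets)
}

for (algo, ds), gap in sorted(optimism.items(), key=lambda kv: -kv[1]):
    print(f"{algo:9s} {ds:16s} {gap:.3f}")
```

Every gap is positive. The largest is DEME on Harbison et al. (0.291) and the smallest is AlignAce on Human CMap (0.060).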
The mean holdout AUROC of the LR and ALR algorithms for non-significant (FDR > 0.05) and significant (FDR ≤ 0.05) motifs respectively.
| | Mean AUROC (Non-Significant/Significant) | | | |
| Algorithm | Beer et al. | Harbison et al. | Human CMap | Synthetic |
| LR | 0.531/0.722 | 0.562/0.656 | 0.510/0.569 | 0.536/0.796 |
| ALR | 0.571/0.727 | 0.580/0.697 | 0.562/0.587 | 0.521/0.790 |
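The improvement associated with statistical significance can be quantified by splitting each table cell into its two values and taking the difference. The sketch below parses the cells exactly as printed; the names are illustrative only.

```python
datasets = ["Beer et al.", "Harbison et al.", "Human CMap", "Synthetic"]

# Cells copied from the table above, in "non-significant/significant" form.
cells = {
    "LR": ["0.531/0.722", "0.562/0.656", "0.510/0.569", "0.536/0.796"],
    "ALR": ["0.571/0.727", "0.580/0.697", "0.562/0.587", "0.521/0.790"],
}

# Gain = holdout AUROC of significant motifs minus that of non-significant motifs.
gain = {}
for algo, row in cells.items():
    gain[algo] = {}
    for ds, cell in zip(datasets, row):
        non_sig, sig = (float(x) for x in cell.split("/"))
        gain[algo][ds] = sig - non_sig

for algo, per_dataset in gain.items():
    for ds, g in per_dataset.items():
        print(f"{algo:4s} {ds:16s} +{g:.3f}")
```

The gain is positive in all eight cells. It is largest on the synthetic data (0.260 for LR, 0.269 for ALR) and smallest on Human CMap (0.059 for LR, 0.025 for ALR).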