Limor Leibovich, Zohar Yakhini.
Abstract
BACKGROUND: Statistics in ranked lists is useful in analysing molecular biology measurement data, such as differential expression, resulting in ranked lists of genes, or ChIP-Seq, which yields ranked lists of genomic sequences. State-of-the-art methods study fixed motifs in ranked lists of sequences. More flexible models such as position weight matrix (PWM) motifs are more challenging in this context, partially because it is not clear how to avoid the use of arbitrary thresholds.
Year: 2014 PMID: 24708618 PMCID: PMC4021615 DOI: 10.1186/1748-7188-9-11
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Figure 1. Assessment of tightness. (A) Four lines are shown for N = 10: the mmHG score, which also serves as a lower bound for the p-value; the exact p-value calculated by enumerating all 10! permutations; our refined upper bound B2; and the Bonferroni corrected p-value. (B) Here again the four lines are shown, for N = 20. However, instead of an exact p-value, which cannot be calculated exhaustively, an empirical p-value is produced by randomly sampling 10^7 permutations. (C) In addition to the four lines shown in B, the upper bound B1 is shown (N = 20). (D) Four lines are shown for N = 100: the mmHG score, our upper bound B2, the bound B3 and the Bonferroni corrected p-value. The exact p-value line is positioned between the green and the blue lines. An empirical p-value was not calculated here because, even if we sample 10^7 permutations, a p-value smaller than 10^-7 cannot be obtained.
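The relationship between the score, the exact p-value and the empirical p-value in this caption can be illustrated with a small sketch. It is not the authors' implementation: it assumes the mmHG score is the minimum hypergeometric tail over all pairs of cutoffs (n1, n2) on the two rankings, and the function names (`hg_tail`, `mmhg_score`, `exact_p_value`, `empirical_p_value`) are hypothetical. It is only practical for very small N.

```python
import itertools
import math
import random


def hg_tail(N, K, n, k):
    """P(X >= k) for X hypergeometric: N items, K marked, n drawn."""
    upper = min(K, n)
    favourable = sum(math.comb(K, i) * math.comb(N - K, n - i)
                     for i in range(k, upper + 1))
    return favourable / math.comb(N, n)


def mmhg_score(perm):
    """Assumed definition: minimum hypergeometric tail over cutoff pairs.

    perm[i] is the 0-based rank in list 2 of the element ranked i in list 1.
    For cutoffs (n1, n2), b counts elements in the top n1 of list 1 that
    are also in the top n2 of list 2.
    """
    N = len(perm)
    best = 1.0
    for n2 in range(1, N):
        b = 0
        for n1 in range(1, N):
            if perm[n1 - 1] < n2:
                b += 1
            best = min(best, hg_tail(N, n2, n1, b))
    return best


def exact_p_value(score, N):
    """Fraction of all N! permutations scoring at least as well."""
    hits = sum(mmhg_score(p) <= score
               for p in itertools.permutations(range(N)))
    return hits / math.factorial(N)


def empirical_p_value(score, N, samples, rng):
    """Monte Carlo estimate; its smallest nonzero value is 1 / samples."""
    order = list(range(N))
    hits = 0
    for _ in range(samples):
        rng.shuffle(order)
        if mmhg_score(order) <= score:
            hits += 1
    return hits / samples


N = 6
score = mmhg_score(tuple(range(N)))      # two identical rankings
exact = exact_p_value(score, N)
sampled = empirical_p_value(score, N, 2000, random.Random(0))
union_bound = min(1.0, (N - 1) ** 2 * score)  # cutoff pairs scanned in this sketch
print(score, exact, sampled, union_bound)
```

Because the estimate is a count divided by the number of samples, sampling 10^7 permutations cannot resolve p-values below 10^-7, which is why panel D omits the empirical line. The `union_bound` line is a plain union bound over the cutoff pairs this sketch scans and is not necessarily the correction factor used in the paper.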
Performance of various bounds
| TNWMNG | 500 | 2.17e-18 | 2.75e-14 | 7.5e-14 | 6.28e-14 | 5.43e-13 |
| 0.274 min | 0.0028 min | 0.079 min | ||||
| CTNNNAT | 500 | 2.86e-27 | 1.32e-28 | 3.66e-28 | 2.37e-28 | 2.86e-27 |
| 0.155 min | 0.0029 min | 0.059 min | ||||
| MMMMMMMM | 500 | 1.08e-43 | 1.07e-39 | 3.47e-39 | 1.69e-39 | 2.71e-38 |
| 0.104 min | 0.003 min | 0.048 min | ||||
| REB1 | 4000 | 1.66e-137 | 9.18e-133 | 1.19e-131 | 1.54e-132 | 1.67e-131 |
| 17.25 min | 0.04 min | 2.753 min | ||||
| CBF1 | 4000 | 1.95e-80 | 9.15e-76 | 4.62e-75 | 1.84e-75 | 1.96e-74 |
| 26.05 min | 0.03 min | 3.409 min | ||||
| UME6 | 4000 | 5.42e-88 | 2.62e-83 | 3.04e-82 | 5.11e-83 | 5.43e-82 |
| 23.81 min | 0.03 min | 3.374 min | ||||
| TYE7 | 4000 | 1.62e-43 | 5.63e-39 | 2.83e-38 | 1.39e-38 | 1.62e-37 |
| 34.25 min | 0.02 min | 4.05 min | ||||
| GCN4 | 4000 | 2.04e-50 | 7.66e-46 | 4.62e-45 | 1.80e-45 | 2.04e-44 |
| 35.43 min | 0.03 min | 3.95 min | ||||
| Puf5 | 4795 | 7.91e-85 | 3.38e-80 | 5.60e-79 | 6.95e-80 | 7.93e-79 |
| 31.51 min | 0.027 min | 4.51 min | ||||
| Pub1 | 4251 | 1.49e-84 | 6.86e-80 | 1.33e-78 | 1.37e-79 | 1.5e-78 |
| 27.74 min | 0.033 min | 3.81 min | ||||
| Pab1 | 4142 | 2.46e-11 | 3.57e-7 | 5.17e-7 | 1.37e-6 | 2.46e-5 |
| 48.46 min | 0.007 min | 5.41 min | ||||
| Khd1 | 4773 | 2.74e-20 | 5.09e-16 | 1.46e-14 | 1.73e-15 | 2.74e-14 |
| 47.58 min | 0.015 min | 5.84 min | ||||
| Nab2 | 4101 | 2.09e-11 | 3.08e-7 | 1.48e-5 | 1.18e-6 | 2.09e-5 |
| 48.7 min | 0.016 min | 5.34 min | ||||
| Vts1 | 1787 | 1.44e-10 | 4.74e-6 | 1.33e-5 | 1.4e-5 | 1.45e-4 |
| 21.94 min | 0.003 min | 2.07 min | ||||
| Pin4 | 4261 | 8.16e-14 | 1.32e-9 | 8.08e-9 | 4.83e-9 | 8.18e-8 |
| 49.38 min | 0.011 min | 5.48 min | ||||
| Nrd1 | 3947 | 5.72e-12 | 9.09e-8 | 5.71e-6 | 3.36e-7 | 5.74e-6 |
| 47.67 min | 0.014 min | 5.11 min | ||||
| Yll032c | 2286 | 1.06e-9 | 2.62e-5 | 1.61e-4 | 8.3e-5 | 0.001 |
| 35.58 min | 0.003 min | 2.77 min |
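Four bounds are compared over 17 datasets (3 synthetic and 14 biological). For each dataset, the number of sequences (N) and the mmHG score are indicated, together with the performance of each bound (in terms of tightness and running time). In each dataset row, the values following the mmHG score are the bounds; the row beneath it lists running times in minutes.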
Figure 2. Comparison between mmHG-Finder and other motif discovery tools. We evaluated the performance of mmHG-Finder against other state-of-the-art methods: MEME, DREME and XXmotif. Almost all input examples consisted of ranked lists; the exception was p53, which comprised target and background sets. Since MEME, DREME and XXmotif expect a target set as input, we converted the ranked lists into target sets by taking the top 100 sequences for MEME (restricted by MEME's limitation of 60,000 characters) and the top 20% of sequences for the other tools. In the synthetic examples the entire ranked lists were used, as they are sufficiently small. Because the motif is planted in the top sequences, and to make the comparison with MEME meaningful, we provided MEME with the ranking information by assigning weights to the sequences, decreasing from 1 to 0 in proportion to rank. We used the default parameters in all comparisons with other tools (e.g. zero-or-one occurrence per sequence in MEME) and set the expected motif length to the range 6 to 8 where possible (DREME and XXmotif do not have an input parameter for motif length). Data and consensus motifs for p53 were taken from [31]; for REB1, CBF1, UME6, TYE7, GCN4 from [32]; and for the RNA binding proteins from [33]. Selected results are shown.
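The conversion from a ranked list to the inputs expected by the other tools can be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing code: the helper names (`top_k`, `top_percent`, `linear_weights`) are hypothetical, rounding the 20% cutoff upward is an assumption, and the weights are taken to fall linearly from 1 for the top-ranked sequence to 0 for the last.

```python
def top_k(ranked, k):
    """Target set made of the k best-ranked sequences."""
    return ranked[:k]


def top_percent(ranked, percent):
    """Target set made of the top `percent` of the list.

    Integer arithmetic avoids floating-point surprises; the count is
    rounded up (an assumption, the source does not state the rounding).
    """
    count = (len(ranked) * percent + 99) // 100
    return ranked[:max(1, count)]


def linear_weights(n):
    """Weights decreasing linearly from 1 (rank 1) to 0 (rank n)."""
    if n == 1:
        return [1.0]
    return [1.0 - i / (n - 1) for i in range(n)]


# Toy ranked list, best-ranked sequence first (made-up sequences).
ranked = ["ACGTGA", "TTGACG", "CACGTG", "GGGCCC", "ATATAT",
          "CCGGAA", "TGCATG", "AACCGG", "GTGTGT", "TTTTAA"]

meme_like_target = top_k(ranked, 3)          # the caption uses the top 100
other_tools_target = top_percent(ranked, 20)
weights = linear_weights(len(ranked))
print(meme_like_target, other_tools_target, weights)
```

Truncating a ranked list to a fixed target set discards the ordering below the cutoff, which is the arbitrary-threshold issue the abstract raises; the weights are a way to pass some of that ordering to a tool that otherwise only sees a set.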
Figure 3. Motifs in tissue-specific lncRNA promoter sequences. We analysed the promoter sequences of lncRNAs ranked according to tissue specificity. The motifs returned by mmHG-Finder are shown in the figure together with their p-values. We compared those motifs to known consensus motifs of transcription factors using TOMTOM [37] (motif database = JASPAR Vertebrates and UniPROBE Mouse) and the most significant results are shown (specifically, all similarity p-values are better than 0.018).
CpG hypo-methylation in tissue-specific lncRNA promoters
| Thyroid | 3.89e-31 | No methylation data | No methylation data |
| Prostate | 5.76e-22 | 4.16e-11 (PrEC) | 0.002 (LNCaP) |
| Adrenal | 5.46e-20 | No methylation data | No methylation data |
| Brain | 1.57e-14 | 1.21e-8 (NH-A) | 5.55e-5 (U87) |
| Ovary | 8.80e-12 | No methylation data | 0.0085 (ovcar-3) |
| Lymph node | 3.64e-6 | No methylation data | No methylation data |
| Adipose | 9.25e-6 | No methylation data | No methylation data |
| Foreskin | 2.25e-5 | 0.72 (BJ) | No methylation data |
| Breast | 4.40e-5 | 5.08e-5 (HMEC) | 8.45e-5 (MCF7) |
| | | 2.0e-10 (MCF10A) | 0.0065 (T-47D) |
| Kidney | 6.34e-5 | 1.56e-5 (HEK293) | No methylation data |
| White blood cell | 3.78e-4 | 0.6 (GM12878) | 0.21 (Jurkat) |
| Placenta | 0.011 | No methylation data | No methylation data |
| Colon | 0.012 | No methylation data | 1.0 (Caco-2) |
| Skeletal muscle | 0.04 | 0.34 (SKMC) | No methylation data |
| Lung | 0.33 | No methylation data | No methylation data |
| Heart | 1.0 | 1.0 (HCM) | No methylation data |
| | | 1.0 (HCF) | |
| Liver | 1.0 | 1.0 (Hepatocytes) | 1.0 (HepG2) |
| Testes | 1.0 | No methylation data | 1.0 (NT2-D1) |
| Lung fibroblasts | 1.0 | 1.0 (IMR90) | No methylation data |
| | | 1.0 (AGO4450) | |
We calculated the mutual enrichment between DNA hypo-methylation and tissue specificity for the lncRNA promoters. CpG methylation data was taken from UCSC Table Browser [39] (ENCODE/HAIB).