Yizhe Zhang, Yupeng He, Guangyong Zheng, Chaochun Wei.
Abstract
BACKGROUND: Motifs are regulatory elements that activate or inhibit the expression of related genes when proteins such as transcription factors (TFs) bind to them. Motif finding is therefore important for understanding the mechanisms of gene regulation. De novo discovery of regulatory elements such as transcription factor binding sites (TFBSs) has long been a major challenge in gaining insight into these mechanisms. Recent advances in experimental profiling of genome-wide signals, such as histone modifications and DNase I hypersensitivity sites, allow scientists to develop better computational methods for motif discovery. However, existing motif-finding methods suffer from high false positive rates and slow speed, and it is difficult to evaluate their performance systematically.
RESULT: Here we present MOST+, a motif finder that integrates genomic sequences with genome-wide signals, such as intensity and shape features from histone modification marks and DNase I hypersensitivity sites, to improve prediction accuracy. MOST+ can detect motifs in a large input of about 100 Mbp within a few minutes. A systematic comparison method has been established, and MOST+ has been compared with existing methods.
Year: 2015 PMID: 26099518 PMCID: PMC4474412 DOI: 10.1186/1471-2164-16-S7-S13
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. The pipeline of the MOST+ system. A set of target genomic sequences is extracted from a genome and indexed by a suffix tree to count the occurrences of each word (or K-mer). In MOST+ mode, histone modification marks and/or DNase I hypersensitivity signals (referred to as tag signals or mark distributions in this diagram) for each word are used to yield mark distribution scores. Top-ranked words are passed to clustering, and motifs are generated from the resulting clusters. The clustering strategy is illustrated in the right panel of the figure.
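The word-counting and ranking stage of this pipeline can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the source: a plain hash-based counter stands in for the suffix tree, the enrichment score is a simple smoothed odds ratio of target versus background frequency, and the names `count_kmers`, `rank_words`, and `pseudocount` are hypothetical. It does not model the mark distribution scores or the clustering step.

```python
from collections import Counter

def count_kmers(seqs, k):
    """Count every length-k word across a list of DNA sequences.

    A Counter is used here for brevity; the pipeline described above
    indexes sequences with a suffix tree instead.
    """
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):  # skip words containing N or other symbols
                counts[word] += 1
    return counts

def rank_words(target_seqs, background_seqs, k, pseudocount=1.0):
    """Rank target words by a smoothed odds ratio (an assumed score, not MOST+'s)."""
    tgt = count_kmers(target_seqs, k)
    bkg = count_kmers(background_seqs, k)
    tgt_total = sum(tgt.values())
    bkg_total = sum(bkg.values())
    vocab = 4 ** k  # number of possible words, used for additive smoothing
    scores = {}
    for word, n in tgt.items():
        p_t = (n + pseudocount) / (tgt_total + pseudocount * vocab)
        p_b = (bkg[word] + pseudocount) / (bkg_total + pseudocount * vocab)
        scores[word] = (p_t / (1 - p_t)) / (p_b / (1 - p_b))
    # Highest odds ratio first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    ranked = rank_words(["ACGTACGT"], ["TTTTTTTT"], k=4)
    print(ranked[0][0])  # the most enriched word in this toy input
```

In this toy input the word "ACGT" occurs twice in the target and never in the background, so it ranks first. A real run would use ChIP-seq peak sequences as targets and a matched genomic background.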
Figure 2. Comparison of different motif finding methods. The X-axis is the running time on a logarithmic scale, and the Y-axis is the total size (Mbp) of the input sequences.
The impact of genome-wide signals on prediction accuracy (MOST vs. MOST+, K=9)
| TF | ChIP-seq Peaks | Ranking | Predicted sites | Co-factors |
|---|---|---|---|---|
| CTCF | 39,609 | 1→1 | 27,458→43,150 | 0→2 |
| ESRRB | 21,647 | 1→1 | 17,144→18,998 | 3→4 |
| KLF4 | 10,875 | 1→1 | 7,662→10,900 | 5→8 |
| OCT4 | 3,761 | 1→1 | 2,051→2,802 | 3→7 |
| cMyc | 3,422 | 2→1 | 1,120→1,342 | 3→6 |
| nMyc | 7,182 | 2→1 | 1,853→2,519 | 3→4 |
| SMAD1 | 1,126 | 9→4 | 119→135 | 4→8 |
| E2F1 | 20,699 | - | - | 3→5 |
| NANOG | 10,343 | 2→1 | 2,844→3,012 | 2→3 |
| SOX2 | 4,526 | 1→1 | 3,515→3,490 | 3→6 |
| STAT3 | 2,546 | 1→1 | 1,486→1,560 | 4→7 |
| TCFCP2L1 | 26,910 | 1→1 | 14,568→14,780 | 3→7 |
| ZFX | 10,338 | 1→1 | 4,269→7,684 | 3→4 |
Columns 2-5: 2) the number of ChIP-seq peaks; 3) the change in rank of the major TFBS (from MOST to MOST+); 4) the change in the number of predicted binding sites for each TF; 5) the change in the number of co-factors predicted by MOST and MOST+.
Figure 3. Distributions of several highly enriched word instances found in the CTCF and Esrrb ChIP-seq datasets. (A) The upper three panels are from the CTCF dataset. Spurious words show irregular or flat patterns (CTCF word "CTGCCCTCT" versus the repeat words "CTCTCTCTC" and "TTTTAAAAA"). All three words have odds ratio scores between 3.4 and 4, i.e. a similar level of over-representation, indicating that tag signals can be used to discriminate motif words from background. (B) The lower three panels are from the Esrrb dataset. The distributions of words from the Esrrb motif ("CCAAGGTCA" and "CAGAGGTCA", both containing the core "AGGTCA") strongly resemble each other, while the MYF motif component word (lower right corner: "CGGGAGGGG") shows a distinct distribution pattern. Dotted lines show distributions smoothed by a DFT with the top 5/8 higher-frequency components removed.
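The DFT smoothing mentioned in this caption can be sketched as a simple low-pass filter. The sketch below assumes one reading of the caption: of the frequencies between zero and the Nyquist frequency, the highest 5/8 are zeroed and the remaining low-frequency components are kept. The function names and the `drop_fraction` parameter are hypothetical, and a naive O(n²) transform is used so that the example needs only the standard library.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]

def smooth_profile(signal, drop_fraction=5 / 8):
    """Low-pass filter a tag-signal profile by zeroing high-frequency DFT bins.

    drop_fraction is the share of the frequency range (0 .. Nyquist) removed
    from the top; 5/8 follows the caption, the exact cut-off rule is assumed.
    """
    n = len(signal)
    spectrum = dft(signal)
    max_freq = n // 2
    keep = int(max_freq * (1 - drop_fraction))  # highest frequency index retained
    for f in range(n):
        freq = min(f, n - f)  # bins f and n - f carry the same frequency
        if freq > keep:
            spectrum[f] = 0
    # Symmetric zeroing keeps the result real for real input
    return [v.real for v in idft(spectrum)]

if __name__ == "__main__":
    noisy = [3.0, 5.0, 2.0, 6.0, 3.0, 5.0, 2.0, 6.0]
    print([round(v, 3) for v in smooth_profile(noisy)])
```

Because the zero-frequency component is always kept, the mean of the profile is preserved while rapid position-to-position fluctuations are removed.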
Accuracy comparison with existing methods
| Algorithms | Detection ratio | Succeeded | Co-factors | Motif cluster |
|---|---|---|---|---|
| MOST | 43% | 8 | 3.7 | Y |
| MOST+ | 45% | 11 | 4.3 | Y |
| DREME | 25% | 10 | 5.6 | N |
| Trawler | 11% | 8 | 0.6 | N |
| nestedMICA | 21% | 10 | 2.1 | N |
| MEME | 5% | 10 | 0.9 | N |
| WEEDER | 6% | 10 | 0.5 | N |
| CisFinder | 76% | 10 | 3.6 | Y |
| HOMER | 38% | 11 | 3.0 | Y |
Columns 2-5: 2) Detection ratio: the number of clusters aligned to motifs in the database divided by the total cluster number found over 13 TFs; 3) Succeeded: numbers of major TFBSs ranked first in the results; 4) Co-factors: the average numbers of unique co-factors found in databases (e-value<0.05 given by TOMTOM) and 5) Motif cluster: whether a clustering step is used to merge results of a method.
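The detection ratio defined above is a pooled fraction over all TF datasets, which the following minimal sketch makes explicit. The input layout (one best database e-value per cluster, `None` when nothing aligned), the function name `detection_ratio`, and the reuse of the 0.05 e-value cut-off as the criterion for "aligned" are assumptions for illustration; the example values are made up.

```python
def detection_ratio(evalues_per_tf, threshold=0.05):
    """Fraction of predicted clusters that align to a database motif.

    evalues_per_tf maps a TF name to a list with one entry per predicted
    cluster: the best database e-value, or None if no alignment was found.
    Clusters are pooled over all TFs before dividing.
    """
    total = 0
    matched = 0
    for evalues in evalues_per_tf.values():
        for e in evalues:
            total += 1
            if e is not None and e < threshold:
                matched += 1
    return matched / total if total else 0.0

if __name__ == "__main__":
    toy = {"CTCF": [1e-10, 0.2, None], "SOX2": [0.01]}  # made-up e-values
    print(detection_ratio(toy))  # 2 of 4 clusters pass the cut-off
```

Pooling before dividing means that a dataset producing many unmatched clusters lowers the ratio more than a dataset producing few clusters.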
Co-factors found by MOST+
| TF | Co-factors uniquely found by MOST+ | Co-factors both found by DREME and MOST+ | Co-factors uniquely found by DREME |
|---|---|---|---|
| CTCF | E2F3a, Myf | Myc, GABPA, STAT | |
| ESRRB | Myfa, Sp1a, Srf | Klf4a | STAT3, Oct4, Myc, Sox2 |
| KLF4 | FEVa, CREB, CTCFa, Egr1a | Esrrb, STAT3, sp1a, Sox2, Oct4 | Oct4, Gata3, Myc, Zfp161a |
| Oct4 | GABPAa, Zfxa, CTCFa, Stat3, sp1a | Sox2a, Esrrba, Klf4a | CREB/ATF |
| cMyc | CREBa, GABPAa, Klf4a, Sox2, YY1a | STAT3a | Egr1a |
| nMyc | Elf3a, GABPAa, Zfp161a | CREB/ATFa | STAT3, Smad1, sfpid |
| SMAD1 | Sp1a, Sox11a, REST, FEV, Spib | Sox2a, Oct4a, Klf4a, Esrrba | Zic3a, Zfp740a |
| E2F1 | Sp1a, Myf, GABPAa | STAT3a, CREB/ATF | Myca, FOX |
| NANOG | Zic3a, Klf4a, Esrrba | Elf5a, Tead1 | |
| SOX2 | Sox10a, CTCFa, Myfa, Runx1a | Oct4a, Klf4a, Esrrba | Zic3a, STAT3 |
| STAT3 | Zic3a, Jundm2, FEVa | Esrrba, Klf4a, Oct4, Sox2, sp1a | Myc, Irf4 |
| TCFCP2L1 | Sox4a, Zic3a, Myf | Klf4a, Esrrba, Sox2, Oct4, Sp1a | Egr1a, Foxa, Myc, Tead1, CREB/ATF, STAT3 |
| ZFX | Myfa, Egr1a, sp1a | STAT3 | Myca, Esrrb |
a: Supported by CisFinder or HOMER (the marker appears as a trailing "a" on co-factor names in the table above).
Columns 2-4: Co-factors found by 2) MOST+ only; 3) both MOST+ and DREME; and 4) DREME only, for each dataset. When more than one motif in the database was aligned, only the one with the lowest p-value was counted.
Figure 4. Pipeline for parameter optimization and method comparison. A motif-finding step is followed by a TFBS identification step (by CisFinder) that takes motifs and genomic sequences as input. Training data (8 of 10 folds) are fed into the motif-finding tools, and accuracy is then evaluated by how well the recovered motifs pinpoint TFBSs. AUROC is used to represent the accuracy of each method.
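The evaluation idea in this pipeline can be sketched as follows. AUROC is computed here with its standard pairwise definition (the probability that a true site scores higher than a non-site, with ties counted as one half). The fold split assumes that the two folds not used for training serve as held-out data, which the caption does not state; `train_test_folds` and `auroc` are hypothetical helper names, and the scoring of sites by CisFinder is not modeled.

```python
def train_test_folds(items, n_folds=10, n_train=8):
    """Split items into n_folds interleaved folds; use the first n_train for training.

    Treating the remaining folds as held-out data is an assumption.
    """
    folds = [items[i::n_folds] for i in range(n_folds)]
    train = [x for fold in folds[:n_train] for x in fold]
    test = [x for fold in folds[n_train:] for x in fold]
    return train, test

def auroc(scores, labels):
    """Area under the ROC curve from per-site scores and 0/1 labels.

    Uses the pairwise (Mann-Whitney) formulation; O(positives * negatives),
    which is fine for a sketch but slow for genome-scale inputs.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    train, test = train_test_folds(list(range(20)))
    print(len(train), len(test))
    print(auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

An AUROC of 1.0 means every true site outscores every non-site, while 0.5 corresponds to random ranking.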
Figure 5. Comparison of site-level accuracy for different methods. The AUROC of each method in recovering motifs for the different assayed TFs is shown.
Figure 6. (A) The DNase I hypersensitivity signal shows an evident cleavage pattern around ChIP-seq peaks in human LCL datasets. (B) With the help of DNase I hypersensitivity signals, additional motifs were found by MOST+ in the VDR datasets. Some long motifs are similar to those reported by Xie et al. (2005, 2007) [2, 36].
Figure 7. Motifs discovered by MOST+ in all promoters of the mouse genome. Left panel: a motif discovered by MOST+ that resembles the GABPA motif in JASPAR. Right panel: examples of unknown motifs with an obvious kurtosis pattern in their histone modification distributions.
Impact of parameter K on MOST/MOST+.
| Mode | K | Time (s) | Motifs predicted | E-value |
|---|---|---|---|---|
| MOST | 7 | 110.34 | 2 | 0.09 |
| MOST | 8 | 111.85 | 5 | 4.87e-12 |
| MOST | 9 | 133.06 | 5 | 6.35e-21 |
| MOST | 10 | 448.22 | 5 | 1.02e-22 |
| MOST | 11 | 2786.11 | 6 | 4.78e-20 |
| MOST+ | 7 | 481 | 5 | 3.03e-05 |
| MOST+ | 8 | 510 | 8 | 3.95e-13 |
| MOST+ | 9 | 527 | 12 | 3.77e-21 |
| MOST+ | 10 | 753 | 14 | 7.50e-27 |
| MOST+ | 11 | 3244 | 6 | 1.07e-18 |
The test was run on the mouse CTCF dataset (8 Mbp). Columns 3-5: 3) Time (s): running time for each word width K; 4) Motifs predicted: number of unique motifs found in each run; and 5) E-value: the lowest e-value of alignments between a major motif found in each run and motifs in the mouse motif database.