Chien-Hua Peng, Jeh-Ting Hsu, Yun-Sheng Chung, Yen-Jen Lin, Wei-Yuan Chow, D Frank Hsu, Chuan Yi Tang.
Abstract
The identification of regulatory elements recognized by transcription factors and chromatin remodeling factors is essential to studying the regulation of gene expression. When no auxiliary data, such as orthologous sequences or expression profiles, are used, the accuracy of most motif discovery tools is strongly influenced by motif degeneracy and sequence length. Since suitable auxiliary data may not always be available, more work is needed to improve the ability of tools to identify transcription elements in metazoans. A non-alignment-based algorithm, MotifSeeker, is proposed to enhance the accuracy of discovering degenerate motifs. MotifSeeker exploits the property that variable sites of transcription elements are usually position-specific in order to reduce exposure to noise. Consequently, the efficiency and accuracy of motif identification are improved. Using data fusion, the ranking process integrates two measures of motif significance, resulting in a more robust significance measure. Test results on synthetic data reveal that the accuracy of MotifSeeker is less sensitive to motif degeneracy and to the length of the input sequences. Furthermore, MotifSeeker has been tested on a well-known benchmark [M. Tompa, N. Li, T.L. Bailey, G.M. Church, B. De Moor, E. Eskin, A.V. Favorov, M.C. Frith, Y. Fu, W.J. Kent, et al. (2005) Nat. Biotechnol., 23, 137-144], yielding a correlation coefficient of 0.262, which compares favorably with those of other tools. The high applicability of MotifSeeker to biological data is further demonstrated experimentally on regulons of Saccharomyces cerevisiae and liver-specific genes with experimentally verified regulatory elements.
Year: 2006 PMID: 17130169 PMCID: PMC1702486 DOI: 10.1093/nar/gkl658
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. (a) Take the TATA-box as an example. The black strings are true transcription elements and the red strings are false motif occurrences. These 10 strings [in G(W | X)] are collected from the initial position-restricted selection. In this group, the weakly conserved letters (denoted by underlines) can be observed in the fifth and the seventh positions. Motif occurrences that have weakly conserved letters are likely to be noise. (b) The matrix for relative frequency. Background letter probabilities are PA = 0.22, PT = 0.22, PC = 0.28 and PG = 0.28. A negative (p, q)-entry means that the letter p at position q is weakly conserved in G(W | X). Occurrences with weakly conserved letters are called pseudo-occurrences in this paper. (c) ‘TATAWAW’ is derived from the remaining occurrences.
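The filtering idea in Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MotifSeeker implementation: the relative frequency is assumed here to be the log2 ratio of a letter's observed column frequency to its background probability (the caption does not give the formula), and the ten 7-mers are hypothetical stand-ins for the strings shown in the figure. Only the background probabilities and the resulting consensus come from the caption.

```python
from collections import Counter
from math import log2

# Background letter probabilities quoted in the Figure 1 caption.
BACKGROUND = {"A": 0.22, "T": 0.22, "C": 0.28, "G": 0.28}

# IUPAC symbols for the letter sets that can appear in a column.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def relative_frequency(occurrences):
    """Per position, map each observed letter to log2(observed / background).

    ASSUMPTION: this log-ratio is one plausible reading of "relative
    frequency"; the source does not state the exact formula.
    """
    n = len(occurrences)
    columns = []
    for q in range(len(occurrences[0])):
        counts = Counter(s[q] for s in occurrences)
        columns.append({p: log2((c / n) / BACKGROUND[p]) for p, c in counts.items()})
    return columns

def drop_pseudo_occurrences(occurrences):
    """Keep only occurrences with no weakly conserved (negative-entry) letter."""
    rel = relative_frequency(occurrences)
    return [s for s in occurrences if all(rel[q][p] >= 0 for q, p in enumerate(s))]

def consensus(occurrences):
    """Collapse each column of the remaining occurrences into an IUPAC symbol."""
    return "".join(IUPAC[frozenset(col)] for col in zip(*occurrences))

# HYPOTHETICAL group of 10 occurrences: 7 TATA-like strings plus 3 noisy ones.
group = [
    "TATAAAA", "TATATAA", "TATAAAT", "TATATAT", "TATAAAA",
    "TATATAA", "TATAAAT", "TATACAA", "TATAAAG", "TATAGAT",
]
kept = drop_pseudo_occurrences(group)
motif = consensus(kept)
print(len(kept), motif)
```

In this toy group the three noisy strings each carry a letter seen only once at position 5 or 7, so their entries are negative and they are discarded before the consensus is formed.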
Figure 2. Comparison of performance coefficients. The lengths of embedded motifs range from 6 to 20 and the degeneracy is drawn from 10 to 50%. The average sequence identity is 0.65. The point in the figure for each degree of degeneracy represents the average performance coefficient for the various motif lengths tested for the degree of degeneracy. Among the five tools, only MotifSeeker and MEME have consistent performance when motif degeneracy is beyond 35%.
Figure 3. (a) Comparisons of specificities. (b) Comparisons of sensitivities. The average specificity of MotifSeeker is 1.0 and the average sensitivity is also close to 1.0. Sensitivities and specificities of the other tools except MEME show a clear trend of degradation over this range of degeneracy.
Figure 4. The performance coefficient, specificity and sensitivity of MotifSeeker are evaluated in highly degenerate cases to examine the limitations of MotifSeeker.
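The three measures named in these captions are not defined in this record, so the sketch below relies on assumptions: the performance coefficient is taken to be the overlap ratio |K ∩ P| / |K ∪ P| between known (K) and predicted (P) motif positions, as commonly used in motif-finding evaluations, and specificity is taken in the positive-predictive-value sense, TP / (TP + FP). The helper names and coordinates are hypothetical.

```python
def site(seq_id, start, length):
    """Nucleotide coordinates covered by one motif occurrence (hypothetical helper)."""
    return {(seq_id, i) for i in range(start, start + length)}

def motif_metrics(known, predicted):
    """Return (performance coefficient, sensitivity, specificity).

    ASSUMPTIONS: nucleotide-level counting; performance coefficient is
    |K & P| / |K | P|; specificity is TP / (TP + FP).
    """
    tp = len(known & predicted)   # positions both known and predicted
    fp = len(predicted - known)   # predicted but not known
    fn = len(known - predicted)   # known but missed
    pc = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return pc, sensitivity, specificity

# Hypothetical example: one prediction is shifted by 2 bases, one is exact.
known = site("seq1", 10, 7) | site("seq2", 40, 7)
predicted = site("seq1", 12, 7) | site("seq2", 40, 7)
pc, sensitivity, specificity = motif_metrics(known, predicted)
print(pc, sensitivity, specificity)
```

Under these assumptions a prediction scores 1.0 on all three measures only when it covers exactly the known positions.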
Figure 5. An example of a strongly uneven content distribution. Three types of occurrences (CCTAT, CATAT and CGTAT) are allowed by the degenerate form CVTAT (the IUPAC code symbol V represents the nucleotide A, C or G). The number of occurrences of each type is randomly determined in the experiment on synthetic data. Since only 2 out of 20 occurrences are CGTAT, the distribution of the occurrences is strongly uneven.
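As a small illustration of how a degenerate form such as CVTAT maps to concrete occurrences, the following sketch expands an IUPAC pattern into every string it allows. The function name `expand` is hypothetical; the symbol table is the standard IUPAC nucleotide code.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the letters they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand(pattern):
    """List every concrete string allowed by a degenerate IUPAC pattern."""
    return ["".join(letters) for letters in product(*(IUPAC[s] for s in pattern))]

occurrence_types = expand("CVTAT")
print(occurrence_types)  # ['CATAT', 'CCTAT', 'CGTAT']
```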
Comparison of MotifSeeker with other systems on S. cerevisiae datasets with consensus given in SCPD^a
| Group | Published motif | MotifSeeker pattern | MotifSeeker rank | YMF pattern | YMF rank | Projection pattern | Projection rank | Consensus pattern | Consensus rank | Gibbs Sampler pattern | Gibbs Sampler rank | MEME pattern | MEME rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CPF1 | TCACGTG | CACGTG | 1 | CACGTGGC | 1 | CACGTG | 1 | CACGTGR | 1 | — | — | CACGTGRC | 1 |
| GCN4 | TGANTN | TGABTC | 4 | TGACTS | 2 | TGAVTC | 2 | TGABTC | 1 | TGACTC | 1 | ABTGACTC | 2 |
| CAR1 | AGCCGCCR | AGCCGCCR | 5 | AGCCGCCG | 5 | AGCCGCCG | 1 | KAGCCGCC | 1 | GCCGCCR | 1 | KAGCCGCSSRVR | 3 |
| CSRE | YCGGAYRRAW | GGAVRRATK | 10 | CGGAYGRA | 2 | TCCGGATA | 9 | — | — | CGGATRR | 1 | CGGRCSGAKG | 3 |
| GCR1 | CWTCC | GCWTCCA | 3 | — | — | CTTCC | 6 | — | — | — | — | CGDSTTCC | 4 |
| HSE | GAANNTT | GAACSTTC | 2 | GAACSTTC | 8 | GAACSTT | 2 | — | — | — | — | YCYMGAAMBTYM | 2 |
| MATa | CRTGTWWWW | CRTGTAWW | 5 | CATGTWW | 5 | CATGTMWWW | 5 | — | — | CATGTAAWT | 1 | — | — |
| MCB | WCGCGW | ACGCGW | 4 | ACGCGW | 2 | ACGCGT | 1 | — | — | ACGCGT | 1 | MCGCGT | 1 |
| PDR3 | TCCGYGGA | TCCGYGGA | 1 | TCCGYGGA | 1 | TCCGCGGA | 1 | TCCGYGGA | 1 | TCCGCGGA | 1 | WSDTTCCGYGGA | 1 |
| PHO4 | CACGTK | CACGTG | 1 | CACGTGS | 1 | CACGTG | 1 | CACGTG | 1 | CACGTG | 1 | CACGTKSR | 2 |
| RAP1 | RMACCCA | RCACCCA | 1 | ACCCAGAC | 2 | RCACCCA | 2 | RMACCMA | 9 | — | — | RMACCCANACM | 3 |
| REB1 | YYACCCG | YTACCCG | 2 | YYACCCG | 6 | TACCCGC | 2 | YYACCCG | 1 | — | — | MTTACCCG | 7 |
| ROX1 | YYNATTGTTY | CCATTGTTS | 5 | — | — | GCCYATTGTT | 6 | SCCYATTGTT | 10 | — | — | CMTTGTTC | 3 |
| SCB | CNCGAAA | WCGAAAT | 5 | CRCGAAA | 1 | CKCGAAA | 3 | CGCGAAA | 9 | GTCACGAA | 1 | HCDCGAAA | 2 |
| SFF | GTMAACAA | GGTMAACAA | 7 | — | — | AGGTCAACA | 2 | ASGTMAAC | 8 | — | — | — | — |
| STE12 | ATGAAA | ATGAAACR | 9 | ATGAAAC | 1 | TGAAACA | 4 | TGAAAC | 5 | TGAAAC | 1 | TGAAAC | 6 |
| TBP | TATAWAW | TATAWAW | 5 | — | — | ATATAWA | 7 | — | — | — | — | — | — |
| MIG1 | CCCCRNNWWWWW | CCCCRSDHWW | 4 | CCCCRGR | 1 | b | — | CCCCRSA | 4 | — | — | CCCCRS | 1 |
| ABF1,BAF1 | TCRNNNNNNACG | TCAHDRHDVACG | 9 | TCANNNNNNACG | 1 | — | — | — | — | — | — | TCWCBNHWBACG | 7 |
| GAL4 | CGGNNNNNNNNNNNCCG | CGGVVVV | 1 | CGGNNNNNNNNNNNCCG | 1 | AGGCWSA | 7 | CGGMRSDCTBTY | 1 | — | — | CGGMVVDWBTY | 2 |
| HAP1 | CGGNNNTANCGG | CGGKRTTWMCGG | 5 | CGGNNNNNNCGG | 4 | TTATYY | 10 | CGDTMWYWSC | 1 | — | — | CCGDTMTYTCC | 1 |
| MCM1 | CCNNNWWRGG | TTWCCBD | 8 | — | — | c | — | MCNDNWNNGG | 2 | CCNNNWWVGK | 1 | CCYDHTWRGGAA | 1 |
| UASPHR | CTTCCT | SGWGGH | 5 | — | — | — | — | GTGGNN | 2 | — | — | — | — |
| SWI5 | KGCTGR | KGCTGR | 9 | GCTGRC | 4 | GGCTGA | 9 | TGCTGG | 7 | — | — | YGCTGG | 1 |
| HSTF | TTCNNGAA | TTCYAGAA | 1 | CYAGAA | 3 | TTCTAGAA | 2 | TTCYAGAA | 2 | TTCTVGAA | 1 | TTCTRGAA | 2 |
^a We used (6,12) as the range of lengths (l) to be investigated for all tools. In both MotifSeeker and Projection, [0, 0.6l] is used as the range of d. The parameter k of MotifSeeker and the parameter M of Projection are always set to m/2 (m is the number of sequences in the input). The number of degenerate symbols used in YMF is 4. For iterations where l ≤ 6, the maximum number w of middle spacers of YMF is set to 11. For those runs with l > 6, w = 5. For MEME, the total number of sites is set to the range [2, 100], and any number of repetitions is allowed on each sequence. For every tool except MEME, many runs are needed for each dataset. Other parameters required by YMF, Consensus, Projection, Gibbs Sampler and MEME are set to their default values. Ranks are evaluated in each single run. For each tool, the top 10 motifs in each run are considered, and the one with the highest performance coefficient is shown in this table. The estimated running time of Projection on MIG1 and MCM1 exceeds 60 days, so these executions were not completed. Missing entries indicate that no motif with a non-zero performance coefficient was found among the top 10 predictions.
^b The estimated running time of Projection on MIG1 is 69 days, so its execution could not be completed.
^c The estimated running time of Projection on MCM1 is 256 days, so its execution could not be completed.
The performance coefficients of the predicted motifs for the 25 regulons with consensus given in SCPD
| Group | MotifSeeker | YMF | Projection | Consensus | Gibbs sampler | MEME |
|---|---|---|---|---|---|---|
| CPF1 | 0.78 | 0.70 | 0.78 | 0.76 | 0.00 | 0.74 |
| GCN4 | 0.67 | 0.58 | 0.50 | 0.67 | 0.44 | 0.49 |
| CAR1 | 0.61 | 0.30 | 0.30 | 0.41 | 0.45 | 0.43 |
| CSRE | 0.54 | 0.32 | 0.16 | — | 0.14 | 0.20 |
| GCR1 | 0.74 | — | 0.32 | — | — | 0.49 |
| HSE | 0.60 | 0.60 | 0.56 | — | — | 0.52 |
| MATa | 0.55 | 0.48 | 0.62 | — | 0.47 | — |
| MCB | 0.83 | 0.83 | 0.42 | — | 0.67 | 0.75 |
| PDR3 | 0.86 | 0.86 | 0.52 | 0.86 | 0.52 | 0.50 |
| PHO4 | 0.64 | 0.74 | 0.64 | 0.64 | 0.64 | 0.73 |
| RAP1 | 0.83 | 0.30 | 0.83 | 0.75 | — | 0.37 |
| REB1 | 0.78 | 0.83 | 0.78 | 0.83 | — | 0.67 |
| ROX1 | 0.54 | — | 0.40 | 0.51 | — | 0.34 |
| SCB | 0.71 | 0.82 | 0.55 | 0.18 | 0.34 | 0.88 |
| SFF | 0.75 | — | 0.38 | 0.67 | — | — |
| STE12 | 0.89 | 0.78 | 0.44 | 0.76 | 0.76 | 0.76 |
| TBP | 0.66 | — | 0.56 | — | — | — |
| MIG1 | 0.83 | 0.29 | — | 0.44 | — | 0.38 |
| ABF1,BAF1 | 0.70 | 0.80 | — | — | — | 0.55 |
| GAL4 | 0.41 | 0.84 | 0.15 | 0.32 | — | 0.35 |
| HAP1 | 0.50 | 0.67 | 0.17 | 0.61 | — | 0.53 |
| MCM1 | 0.69 | — | — | 0.73 | 0.67 | 0.36 |
| UASPHR | 0.31 | — | — | 0.24 | — | — |
| SWI5 | 0.75 | 0.70 | 0.25 | 0.50 | — | 0.50 |
| HSTF | 0.80 | 0.63 | 0.71 | 0.80 | 0.71 | 0.50 |
| Average | 0.68 | 0.48 | 0.40 | 0.43 | 0.23 | 0.44 |
Each entry shows the highest performance coefficient among the 10 highest ranked predictions. An entry marked by ‘—’ indicates that the tool fails to find any motif with non-zero performance coefficient for the corresponding regulon within top 10. The average performance coefficients are computed by treating the missing entries as predictions with performance coefficients of 0.
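The averaging rule described above can be checked with a short calculation. The sketch below uses the Gibbs sampler column of this table, with `None` standing for a ‘—’ entry; the function name is hypothetical.

```python
def average_with_missing_as_zero(scores):
    """Average performance coefficients, counting missing entries (None) as 0."""
    return sum(0.0 if s is None else s for s in scores) / len(scores)

# Gibbs sampler column for the 25 regulons, in table order (None = missing entry).
gibbs_sampler = [
    0.00, 0.44, 0.45, 0.14, None,   # CPF1, GCN4, CAR1, CSRE, GCR1
    None, 0.47, 0.67, 0.52, 0.64,   # HSE, MATa, MCB, PDR3, PHO4
    None, None, None, 0.34, None,   # RAP1, REB1, ROX1, SCB, SFF
    0.76, None, None, None, None,   # STE12, TBP, MIG1, ABF1/BAF1, GAL4
    None, 0.67, None, None, 0.71,   # HAP1, MCM1, UASPHR, SWI5, HSTF
]

average = average_with_missing_as_zero(gibbs_sampler)
print(round(average, 2))  # 0.23, the value in the Average row
```

Averaging only the 11 entries that are present would instead give about 0.53, so the zero-filling convention matters when comparing tools that fail on different regulons.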
Comparison of MotifSeeker with other systems on S. cerevisiae datasets without consensus given in SCPD^a
| Group | MotifSeeker | YMF | Projection | Consensus | Gibbs Sampler | MEME |
|---|---|---|---|---|---|---|
| BAS1,PHO2 | 0.2 (3) | 0.13 (9) | 0.13 (1) | 0.07 (2) | — | 0.17 (1) |
| MATalpha1 | 0.23 (10) | — | 0.09 (2) | 0.09 (1) | — | — |
| TAF | — | — | — | 0.13 (5) | — | — |
| PHO2 | 0.27 (6) | — | 0.22 (7) | 0.17 (1) | — | 0.24 (4) |
| RP-A | 0.6 (4) | — | 0.6 (3) | 0.69 (1) | 0.15 (1) | 0.53 (4) |
| UASH | 0.18 (9) | — | 0.05 (4) | — | — | — |
| URSIH | 0.63 (3) | 0.41 (1) | 0.24 (1) | 0.63 (1) | 0.17 (1) | 0.81 (1) |
| GATA | 0.82 (1) | 0.72 (9) | 0.24 (1) | 0.29 (1) | — | 0.55 (1) |
| HAP2 | 0.44 (8) | 0.37 (5) | 0.22 (10) | 0.25 (4) | — | 0.4 (3) |
| PRE | 1 (2) | — | 1 (1) | 1 (1) | — | 0.51 (4) |
| UASCAR | 0.68 (4) | 0.12 (9) | 0.68 (2) | 0.23 (3) | — | 0.29 (1) |
| UIS | 0.65 (10) | — | 0.27 (1) | 0.46 (3) | — | 0.46 (2) |
| GLN3 | 0.44 (5) | 0.32 (3) | 0.84 (3) | 0.88 (5) | — | 0.43 (9) |
| PDR1 | 0.78 (4) | 0.74 (1) | 0.11 (2) | 0.3 (1) | 0.54 (1) | 0.36 (1) |
| Average | 0.49 | 0.22 | 0.33 | 0.37 | 0.06 | 0.34 |
^a The performance coefficients and the ranks (in parentheses) of the predicted motifs for the 14 regulons without consensus given in SCPD. Each entry shows the highest performance coefficient among the 10 highest ranked predictions. An entry marked by ‘—’ indicates that the tool fails to find any motif with non-zero performance coefficient for the corresponding regulon within top 10. The average performance coefficients are computed by treating the missing entries as predictions with performance coefficients of 0.
Performance coefficients and ranks of the best close matches to the published motifs in three stages^a
| Transcription elements | Stage 1: analysis only in human species | Stage 2: integration of comparative sequence analysis | Stage 3: further refinement by negative set |
|---|---|---|---|
| HNF-1 | 0.7 (5) | 0.73 (3) | 0.73 (3) |
| HNF-3 | 0.43 (9) | 0.58 (6) | 0.58 (2) |
| HNF-4 | 0.65 (3) | 0.74 (1) | 0.74 (1) |
| C/EBP | — | 0.44 (10) | 0.44 (6) |
^a The performance coefficients and the ranks (in parentheses) of the predicted motifs for the four liver-specific regulons. Each entry shows the highest performance coefficient among the 10 highest ranked predictions. An entry marked by ‘—’ indicates that MotifSeeker fails to find any motif with non-zero performance coefficient for the corresponding regulon within top 10.
Figure 6. The best close matches to the published HNF-1, HNF-3, HNF-4 and C/EBP motifs. The left column lists logos of the published motifs from JASPAR. The published HNF-1, HNF-3 and HNF-4 motifs contain conserved sub-patterns DGTTAWD, TRTTKRY and HCTTTGBHM, respectively (IUPAC code D: A, T or G; W: A or T; R: A or G; Y: T or C; H: A, T or C; B: C, T or G; M: A or C). On the other hand, the published C/EBP motif is very weakly conserved. The right column lists logos of the corresponding best close matches from MotifSeeker without the proposed post-processes.