
Identification of degenerate motifs using position restricted selection and hybrid ranking combination.

Chien-Hua Peng, Jeh-Ting Hsu, Yun-Sheng Chung, Yen-Jen Lin, Wei-Yuan Chow, D Frank Hsu, Chuan Yi Tang.

Abstract

The identification of regulatory elements recognized by transcription factors and chromatin remodeling factors is essential to studying the regulation of gene expression. When no auxiliary data, such as orthologous sequences or expression profiles, are used, the accuracy of most tools for motif discovery is strongly influenced by the motif degeneracy and the lengths of sequence. Since suitable auxiliary data may not always be available, more work must be conducted to enhance tool performance to identify transcription elements in the metazoan. A non-alignment-based algorithm, MotifSeeker, is proposed to enhance the accuracy of discovering degenerate motifs. MotifSeeker utilizes the property that variable sites of transcription elements are usually position-specific to reduce exposure to noise. Consequently, the efficiency and accuracy of motif identification are improved. Using data fusion, the ranking process integrates two measures of motif significance, resulting in a more robust significance measure. Testing results for the synthetic data reveal that the accuracy of MotifSeeker is less sensitive to the motif degeneracy and the length of input sequences. Furthermore, MotifSeeker has been tested on a well-known benchmark [M. Tompa, N. Li, T.L. Bailey, G.M. Church, B. De Moor, E. Eskin, A.V. Favorov, M.C. Frith, Y. Fu, W.J. Kent, et al. (2005) Nat. Biotechnol., 23, 137-144], yielding a correlation coefficient of 0.262, which compares favorably with those of other tools. The high applicability of MotifSeeker to biological data is further demonstrated experimentally on regulons of Saccharomyces cerevisiae and liver-specific genes with experimentally verified regulatory elements.

Substances: Transcription Factors

Year: 2006 PMID: 17130169 PMCID: PMC1702486 DOI: 10.1093/nar/gkl658

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

One of the current challenges in biological research is to understand the regulatory mechanisms of gene expression. Transcription initiation, generally controlled by interactions between transcription elements and proteins, is at the top of the hierarchy of gene expression control. The same transcription elements are frequently present in the regulatory regions of co-regulated genes and are conserved among the orthologous genes of closely related species. Identification of the transcription elements in the regulatory regions is critical to deciphering the transcriptional regulation. A transcription factor may recognize a highly diversified set of transcription elements, which share only a conserved core sequence. Such conserved core sequences are what we call motifs in this study. Degeneracy often tends to occur at specific positions of transcription elements. The significance of position specificity can be demonstrated by the severe effect of a point mutation in the Sp1-binding sequence present in PNH patients. The Sp1 factor recognizes 5′-YCCGYCCS-3′ where only the 5′- and 3′-most positions as well as the Y in the central core are tolerant of specific variations. A mutation of hTERC, the human telomerase RNA gene, changes the sequence from 5′-YCCGYCCS-3′ to 5′-YCCGYCGS-3′ (the mutation site is the seventh position). The C–G mutation located at the non-variable position disrupts Sp1 binding to the hTERC promoter and results in reduced transcription of hTERC (1). A number of motif-finding algorithms have been developed over the past few years. Many recent successful works are based on sequence alignment or are aided by the use of orthologous genes or microarray expression data (2–7). Some effective methods do not require auxiliary data and are not alignment-based. Of these, some ensure or favor the position specificity of output motifs, while others do not. Examples of the latter class include MEME (8), Consensus (9) and Gibbs Sampler (10,11).
Other well-known members of the latter class include (l, d)-motif finders which identify significant patterns over {A, T, C, G} of length l with at most d mismatches. The (l, d)-motif-finding algorithms include WINNOWER (12), SP-STAR (12), Multiprofiler (13), Projection (14) and PatternBranching (15). An (l, d)-motif is not position-specific because d or fewer mutations can occur at arbitrary positions in a motif occurrence. By employing (a subset of) IUPAC codes, YMF (16,17) ensures the position specificity of output motifs. YMF considers each possible motif over {A, T, C, G, R, Y, S, W, N} and evaluates motifs using the Z-score. Despite its exponential running time, YMF has been recognized as a highly effective tool. Wolfertstetter et al. (18) observed that functional motifs show a preferred pattern of mismatch locations. Position-specific motifs are awarded higher scores of consensus index (Ci) and are thus favored by CoreSearch (18). Unlike YMF, CoreSearch does not fix the positions or contents of mutations in an early stage. Consequently, CoreSearch is more sensitive to noise. Motif lengths and degeneracy, as well as sequence lengths, seriously limit the analysis of CoreSearch. Specifically, the motif must be ∼7 bases long with at most one mismatch within each of its occurrences to achieve reasonable performance of CoreSearch. Under such conditions, the recommended average input sequence length is 600 bp. Some of these restrictions arise from the high computational cost. All of the aforementioned algorithms work well in practice but have limitations. Increases in motif degeneracy, motif length and the lengths of input sequences strongly affect the performance of most tools because of increased random matches and computational costs. Most can tolerate 35% motif degeneracy and handle only a few thousand input nucleotides. 
However, the transcription elements of multicellular eukaryotes are often highly degenerated, with lengths of 5–25 bp, and are embedded in long regulatory regions. Hence, more sensitive methods are always sought to locate accurately meaningful transcription elements. This work presents a new method, MotifSeeker. Both the ability to separate noise from true motif occurrences and computational efficiency must be enhanced to ensure the successful detection of degenerate motifs embedded in long regulatory regions. The success of YMF reveals the possibility that noise may be reduced by restricting both the degenerate positions and the sequence contents in these positions. Since YMF exhaustively enumerates all possible IUPAC code motifs of some particular length, the computational demand is rather high when motifs of longer than 10 bp are analyzed. However, this issue may not be a problem for identifying motifs in the yeast genome, for which YMF was originally designed. MotifSeeker is designed to be applicable to motifs over a wider range of lengths. Each possible set of d variation positions, rather than each possible l-string over IUPAC code, is processed by MotifSeeker to reduce noise without sacrificing efficiency. Since the sequence contents of the degenerate positions are not initially restricted, MotifSeeker includes a further filtration step to take into account the degree of conservation among the collected candidate occurrences. Another approach for reducing noise uses the concept of data fusion (19,20), which is effective in providing a robust ranking method in a variety of application domains. This idea is realized by combining two different measures of motif significance: they are the degree of conservation relative to the background sequences and the copy number of the motif relative to the expected copy number of a random string of the same length. 
The combined ranking method reduces the variation of the performance of each measure and is shown to be effective. Synthetic datasets of various motif degeneration rates and sequence lengths are used to demonstrate the effectiveness and limits of MotifSeeker. Experiments on a commonly used yeast dataset and the well-known benchmark (21) yield results that show that MotifSeeker outperforms the other tools to which it is compared. Experiments also indicate that MotifSeeker can be successfully applied to identify highly degenerate transcription elements that direct liver-specific gene expression. These elements have lengths of at least 12 bp and are embedded in regulatory regions whose average length is ∼2.5 kb. These results demonstrate that MotifSeeker can be applied without the aid of auxiliary data; hence, it is more generally applicable than the other methods. However, as presented in Results, if suitable auxiliary data are available, including such auxiliary data can certainly enhance the performance.

METHODS

MotifSeeker consists of two main phases—motif generation and motif scoring, which integrates two scoring schemes. They are discussed in the following subsections.

Definition and properties of degenerate (l, d)-motifs

A degenerate (l, d)-motif is defined as a pattern of length l over the IUPAC code with no more than d degenerate positions. A degenerate position is a position occupied by a character other than A, G, C or T. A match of a degenerate (l, d)-motif in the input sequences is called an occurrence of this motif. We also require that each possible character over {A, T, C, G} allowed by the symbol at the i-th position of a degenerate (l, d)-motif must appear at position i of some occurrence of the motif. The important property of a degenerate (l, d)-motif is that its occurrences can differ only in the d degenerate positions of the motif. Unlike (l, d)-motifs (12–15), the mismatched positions in each occurrence of a degenerate (l, d)-motif are not independent of those in other occurrences. Consequently, a degenerate (l, d)-motif cannot be derived directly from the occurrences of an (l, d)-motif. Furthermore, (l, d)-motifs have been defined over {A, G, C, T} instead of the IUPAC code. Therefore, algorithms specialized for (l, d)-motifs cannot be expected to perform well for degenerate (l, d)-motifs and vice versa.
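The definition above can be illustrated with a minimal Python sketch. The function names (`degenerate_positions`, `occurrences`, `is_degenerate_ld_motif`) and the toy sequences are hypothetical and are not part of MotifSeeker; the sketch only checks the two stated conditions, namely at most d degenerate positions and every base allowed by each symbol being observed in some occurrence.

```python
# IUPAC nucleotide codes mapped to the concrete bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def degenerate_positions(motif):
    """0-based positions holding a symbol other than A, C, G or T."""
    return [i for i, sym in enumerate(motif) if sym not in "ACGT"]

def occurrences(motif, sequences):
    """All (sequence index, offset, word) matches of the motif."""
    l = len(motif)
    hits = []
    for s_idx, seq in enumerate(sequences):
        for off in range(len(seq) - l + 1):
            word = seq[off:off + l]
            if all(base in IUPAC[sym] for sym, base in zip(motif, word)):
                hits.append((s_idx, off, word))
    return hits

def is_degenerate_ld_motif(motif, d, sequences):
    """Check the bound on degenerate positions and that every base
    allowed at each position is seen in at least one occurrence."""
    if len(degenerate_positions(motif)) > d:
        return False
    words = [w for _, _, w in occurrences(motif, sequences)]
    return all(
        set(IUPAC[sym]) <= {w[i] for w in words}
        for i, sym in enumerate(motif)
    )
```

With the toy input `["GGTATAAATCC", "CCTATATAAGG"]`, the pattern TATAWAW qualifies as a degenerate (7, 2)-motif, whereas TATAWAN does not, because C and G never appear at its last position.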

Generating significant degenerate motifs

The discovery problem is addressed first. Degenerate motif discovery problem. Given a set of sequences $S = \{S_1, S_2, \ldots, S_m \mid S_i \in \{A, G, C, T\}^* \text{ for all } i\}$ and three non-negative integers k, l and d, find all degenerate (l, d)-motifs, each of which has occurrences in at least k sequences in S. For convenience of presentation, assume that all sequences in S are of length n. An l-substring of a sequence $S_i$ is a substring of length l of $S_i$. The number of l-substrings in any sequence $S_i$ is then r = n − l + 1. For convenience these l-substrings are denoted by $W_{i,1}, \ldots, W_{i,r}$. The Hamming distance between two l-substrings is the number of positions at which the two substrings disagree. For each $W_{i,j}$, where 1 ≤ i ≤ m − k + 1, the Hamming distance between $W_{i,j}$ and each of the l-substrings of all sequences in S is computed. Meanwhile, the sets of mismatched positions between each of these $W_{i,j}$ and all other l-substrings are kept. The Hamming distance between $W_{i,j}$ and $W_{i',j'}$ is denoted as $d_H(W_{i,j}, W_{i',j'})$, and the set of mismatched positions between them is $V(W_{i,j}, W_{i',j'})$. Then, for each possible set $X = \{p_1, \ldots, p_d\}$ of degenerate positions, all $W_{i',j'}$ with $V(W_{i,j}, W_{i',j'}) \subseteq X$ are collected. The set of l-substrings collected in this way is denoted by $G(W_{i,j} \mid X)$. Then, whether $G(W_{i,j} \mid X)$ contains l-substrings from at least k different sequences is determined. If so, then $G(W_{i,j} \mid X)$ is kept and used further to derive a degenerate motif. If not, it is discarded. Notably, any degenerate (l, d)-motif is required to have occurrences in at least k sequences. Hence, constructing $G(W_{i,j} \mid X)$ for 1 ≤ i ≤ m − k + 1 suffices. Since all of the occurrences of any degenerate (l, d)-motif must form a subset of some $G(W_{i,j} \mid X)$, with these $G(W_{i,j} \mid X)$ all possible degenerate motifs can be identified. The sets $V(W_{i,j}, W_{i',j'})$ and X can be conveniently stored in computer words so that bit operations can be utilized to achieve good efficiency. The time complexity of the above procedure is [expression missing in the source].
Once each $G(W_{i,j} \mid X)$ is computed, it can be moved to secondary storage, and after $G(W_{i,j} \mid X)$ has been computed for all X, $d_H(W_{i,j}, W_{i',j'})$ and $V(W_{i,j}, W_{i',j'})$ can be discarded before the method proceeds to $W_{i,j+1}$ (or $W_{i+1,1}$). Therefore, the demand for primary memory is linear in the input size, O(mn). In each remaining $G(W_{i,j} \mid X)$, noise may still exist. Suppose that $G(W_{i,j} \mid X)$ has 10 motif occurrences, and that TATAWAW is the correct motif for the TATA-box. Clearly, some noise (any occurrence that contains an underlined letter in Figure 1a) exists among those 10 motif occurrences.
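The position-restricted selection described above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the names `l_substrings`, `mismatch_mask` and `candidate_sets` are hypothetical, everything is kept in memory instead of being moved to secondary storage, and integers serve as the bitmasks mentioned in the text.

```python
from itertools import combinations

def l_substrings(sequences, l):
    """Yield (sequence index, offset, word) for every l-substring."""
    for i, seq in enumerate(sequences):
        for j in range(len(seq) - l + 1):
            yield i, j, seq[j:j + l]

def mismatch_mask(u, v):
    """Bitmask with bit q set when u and v disagree at position q."""
    mask = 0
    for q, (a, b) in enumerate(zip(u, v)):
        if a != b:
            mask |= 1 << q
    return mask

def candidate_sets(sequences, l, d, k):
    """Yield (reference word, X as a bitmask, group) for every candidate
    set whose words come from at least k different sequences."""
    m = len(sequences)
    words = list(l_substrings(sequences, l))
    position_sets = [sum(1 << p for p in X) for X in combinations(range(l), d)]
    for i, j, ref in words:
        if i > m - k:  # 0-based form of 1 <= i <= m - k + 1
            break
        masks = [(s, o, w, mismatch_mask(ref, w)) for s, o, w in words]
        for x_mask in position_sets:
            # Keep words whose mismatches all fall inside X.
            group = [(s, o, w) for s, o, w, v in masks if (v & ~x_mask) == 0]
            if len({s for s, _, _ in group}) >= k:
                yield ref, x_mask, group
```

The Hamming distance is the number of set bits in the mask, for example `bin(mismatch_mask("ACGT", "ACCT")).count("1")`, and the subset test reduces to a single AND operation per word.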

Figure 1

(a) Take the TATA-box as an example. The black strings are true transcription elements and the red strings are false motif occurrences. These 10 strings [in $G(W_{i,j} \mid X)$] are collected from the initial position-restricted selection. In this group, the weakly conserved letters (denoted by underlines) can be observed in the fifth and the seventh positions. Motif occurrences that have weakly conserved letters are likely to be noise. (b) The matrix for relative frequency. Background letter probabilities are PA = 0.22, PT = 0.22, PC = 0.28 and PG = 0.28. A negative (p, q)-entry means that the letter p at position q is weakly conserved in $G(W_{i,j} \mid X)$. Occurrences with weakly conserved letters are called pseudo-occurrences in this paper. (c) ‘TATAWAW’ is derived from the remaining occurrences.

Transcription elements evolve more slowly than non-functional background sequences and may co-exist in regulon sequences. Based on this premise, noise is reduced further by the following process. First, the background letter probabilities in the original set S are computed. Let PA, PT, PC and PG be the background probabilities of nucleotides A, T, C and G, respectively. Then, $G(W_{i,j} \mid X)$ is transformed into a 4 × l matrix L (Figure 1b). The entries $L_{p,q}$, representing the log-odds weights, are defined by the following formula: $L_{p,q} = \log(f_{p,q} / P_p)$, where p ∈ {A, T, C, G}, $f_{p,q}$ is the relative frequency of letter p at position q in $G(W_{i,j} \mid X)$, and $P_p$ is the background probability of p. A positive $L_{p,q}$ means that the probability of letter p at position q in $G(W_{i,j} \mid X)$ exceeds the background probability of the letter. Since real motif instances are generally more conserved than arbitrary background sequences, a word $x_1 x_2 \cdots x_l$ in $G(W_{i,j} \mid X)$ is said to be a pseudo-occurrence if for some p and q, $x_q = p$ and $L_{p,q} < 0$. All pseudo-occurrences in $G(W_{i,j} \mid X)$ are removed (Figure 1c), and the remaining elements are considered to be significant. The positive entries in L are taken into consideration in deriving the motif. The motif is generated column-wise from these positive entries. For example, if only $L_{A,3}$ and $L_{G,3}$ are positive, then the symbol in the third position of the resulting motif is ‘R’ (in IUPAC code).
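The filtration step can be sketched as below. The sketch assumes that each log-odds entry is the natural logarithm of the relative frequency of a letter in the group divided by its background probability; the base of the logarithm, the fallback to N when no entry in a column is positive, and the names `log_odds` and `purify` are assumptions made for illustration. The 10 words are invented and only mimic the situation in Figure 1.

```python
import math
from collections import Counter

# Map each alphabetically sorted set of bases to its IUPAC symbol.
CODE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}

def log_odds(words, background):
    """One dict per position: base -> log(relative frequency / background).
    Bases that never occur at a position are simply absent."""
    n = len(words)
    matrix = []
    for q in range(len(words[0])):
        counts = Counter(w[q] for w in words)
        matrix.append(
            {b: math.log((c / n) / background[b]) for b, c in counts.items()}
        )
    return matrix

def purify(words, background):
    """Remove pseudo-occurrences and derive the motif column-wise
    from the positive log-odds entries."""
    L = log_odds(words, background)
    kept = [w for w in words if all(L[q][b] >= 0 for q, b in enumerate(w))]
    motif = "".join(
        # Fallback "N" (an assumption) if no base exceeds the background.
        CODE.get("".join(sorted(b for b, v in col.items() if v > 0)), "N")
        for col in L
    )
    return kept, motif

background = {"A": 0.22, "T": 0.22, "C": 0.28, "G": 0.28}
group = ["TATAAAT"] * 4 + ["TATATAA"] * 3 + ["TATAAAA"] * 2 + ["TATACAA"]
kept, motif = purify(group, background)
```

Here the single word carrying C at the fifth position is discarded because C occurs there with frequency 0.1, below its background probability of 0.28, and the remaining positive entries yield TATAWAW.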

Motif scoring methods R1 and R2

Two scoring methods R1 and R2 are proposed to evaluate all reported motifs generated by the previously described degenerate motif discovery method. On the basis that the regulatory motifs are more conserved than surrounding background sequences, the first scoring function s1 for the method R1 is defined as [equation missing in the source], where the summation is over i ∈ {A, T, C, G} and j satisfying 1 ≤ j ≤ l such that $L_{i,j} > 0$, and $p_j$ is the number of positive entries in column j. The more the letter frequencies exceed the background probabilities, the higher the score is. This fact is used to measure the conservation and the significance of each reported motif. The second scoring function s2 for method R2 is formulated based on the observation that transcription elements often appear in statistically significant concentrations, suggesting that the transcription elements for a particular transcription factor may appear in multiple locations of a promoter region. Thus, the copy number of each reported motif is another important indicator of the significance of the motif. The second ranking method is as follows. The frequency of all possible consecutive two-symbol combinations in the input sequences is computed and used to compute background transition probabilities. For each predicted motif $M = x_1 x_2 \cdots x_l$, background transition probabilities are applied to compute the probability E that M occurs randomly in the background following a Markov process: $E = \sum_{y_1 \cdots y_l} P(y_1) \prod_{j=2}^{l} P(y_j \mid y_{j-1})$, where the summation is over all possible instantiations $y_1 \cdots y_l$ of M (recall that M can contain degenerate symbols), $P(y_1)$ is the background probability of the first letter, and $P(y_j \mid y_{j-1})$ is the probability that $y_j$ is present in the background given that the preceding nucleic acid is $y_{j-1}$. The value E can be computed using dynamic programming in O(l) time for each M. The scoring function s2 for method R2 is defined as follows: [equation missing in the source], where N is the observed copy number, including overlapping occurrences, of the reported motif $x_1 x_2 \cdots x_l$. A higher value implies a greater concentration of motif occurrences.
Thus, motifs of higher value are considered more important. A Markov process is frequently used to distinguish regulatory sequences from other neutral sequences (22). A general problem with the Markov model is that it cannot appropriately reflect local sequence composition, which is important for short motifs. This shortcoming of the Markov model is inevitable, but experiments herein demonstrated that it does not strongly affect the performance of MotifSeeker. These two scoring schemes R1 and R2 are adopted to assign scores to each of the reported motifs. These measures of motif significance, R1 and R2, are combined by the method of data fusion, which is presented in the next subsection.
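The dynamic-programming computation of the background probability of a degenerate motif can be sketched as follows. The sketch assumes a first-order Markov chain whose first letter is drawn from the overall base frequencies, which the text does not state explicitly; the names `markov_background` and `motif_probability` are hypothetical. It stops at the probability E and does not attempt to reproduce the scoring function s2.

```python
from collections import Counter

# IUPAC nucleotide codes mapped to the concrete bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def markov_background(sequences):
    """Estimate base frequencies and first-order transition probabilities
    from single-letter and consecutive two-letter counts."""
    singles = Counter()
    pairs = Counter()
    for seq in sequences:
        singles.update(seq)
        pairs.update(zip(seq, seq[1:]))
    total = sum(singles.values())
    start = {b: singles[b] / total for b in "ACGT"}
    trans = {}
    for a in "ACGT":
        row = sum(pairs[(a, b)] for b in "ACGT")
        trans[a] = {b: (pairs[(a, b)] / row if row else 0.0) for b in "ACGT"}
    return start, trans

def motif_probability(motif, start, trans):
    """Sum of chain probabilities over all instantiations of the motif.
    Only four running totals are kept, so the cost is linear in l."""
    # prob[b] = total probability of all allowed prefixes ending in base b.
    prob = {b: start[b] for b in IUPAC[motif[0]]}
    for sym in motif[1:]:
        prob = {
            b: sum(p * trans[a][b] for a, p in prob.items())
            for b in IUPAC[sym]
        }
    return sum(prob.values())
```

Under a uniform background, for instance, TATAWAW has four instantiations of probability 0.25^7 each, so the function returns 4 × 0.25^7.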

MotifSeeker uses methods of data fusion and hybrid ranking

The features of transcription elements are often fuzzy, making the elements hard to predict by computation. The two scoring methods R1 and R2 capture different properties of motifs. Intuitively, the use of two properties helps to evaluate the significance of motifs more thoroughly. A combination of different measures is also expected to be more robust than a single ranking method. Indeed, in the study of data fusion, a general observation is that one can often benefit from combining different methods when they exhibit ‘diversity’ [for a survey and a general framework see (20,21,23,24)]. The hybrid ranking method of MotifSeeker involves the following procedure: For each degenerate motif derived from the corresponding purified candidate set, two scores s1 and s2 are obtained using the scoring methods R1 and R2, respectively. Sorting by each of these scoring functions leads to the rank functions r1 and r2, respectively, where for a reported motif M, $r_i(M)$ is the rank of M with respect to $s_i$. Combine r1 and r2. The score function s12 for the resulting combination is the equally weighted combination of r1 and r2: s12(M) = r1(M) + r2(M). For each motif M, s12(M) is taken as the new score for the combined method R12. Sorting s12(M) into ascending order (the smaller the sum of the ranks, the better the position in the combined list) leads to a rank function r12 of the combined method.
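The rank combination can be written in a few lines of Python. The motif labels and score values below are invented for illustration, and the handling of ties (left to the stable sort) is an assumption, since the text does not say how ties are broken.

```python
def ranks(scores):
    """Rank motifs by descending score; rank 1 is the best."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {motif: pos for pos, motif in enumerate(ordered, start=1)}

def combine(s1, s2):
    """Return (motif, s12) pairs sorted by s12 = r1 + r2, ascending."""
    r1, r2 = ranks(s1), ranks(s2)
    s12 = {m: r1[m] + r2[m] for m in s1}
    return sorted(s12.items(), key=lambda item: item[1])

# Hypothetical scores for four reported motifs.
s1 = {"M_a": 0.9, "M_b": 0.8, "M_c": 0.7, "M_d": 0.1}
s2 = {"M_a": 3.0, "M_b": 9.0, "M_c": 5.0, "M_d": 1.0}
combined = combine(s1, s2)
```

In this toy case M_a is best under s1 but only third under s2, while M_b is second and first, so M_b heads the combined list with a rank sum of 3.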

RESULTS

This section compares the performance of MotifSeeker with that of other methods, such as YMF, MEME, Projection, Consensus and Gibbs Sampler, using synthetic data and biological data for 39 regulons of Saccharomyces cerevisiae. A performance coefficient (12) is used to evaluate the performance of different methods. The accuracy of MotifSeeker is assessed by a well-known benchmark (21) and a more difficult but useful set of regulatory sequences of liver-specific genes.

Evaluation of performance on synthetic data

MotifSeeker is compared with several effective methods, including Consensus, Projection, Gibbs Sampler and MEME, to evaluate the relative degrees of influence of motif degeneracy and input sequence length. YMF is excluded from this first experiment since it takes too much computational time when the motif is longer than 10 bp. Various test samples, each of which consists of 20 random sequences of 600 bp, are generated. The sequence identity in the samples is between 50 and 70% with an average identity of 65%. In each sample, degenerate (l, d)-motifs are generated and embedded (with random variations allowed by the degenerate form) into random positions in the sequences. To focus on the influence of motif degeneracy, each sequence in the sample contains at least one motif occurrence. Various motif lengths and degeneracy (d/l) are used in the test samples. The generated motifs range from 6 to 20 bp long, each with 10–50% degeneracy. Herein, for all tools, all of the parameters that are related to l, d and k are set to the exact values used for generating motifs. In practice, the parameters are unknown, as in the cases of the biological and benchmark datasets used in the following subsections. For unknown parameters, MotifSeeker simply iterates over possible ranges of parameter settings, and the ranking methods are applied to pick out the significant motifs. The measure used for comparison is the performance coefficient ∣K ∩ P∣/∣K ∪ P∣ (12), where K is the set of positions of the known motif occurrences in the input sequences, and P is the set of predicted positions. The best performance coefficients among the top 10 motifs found by these tools are compared. Most motif finders perform well when the motif degeneracy ranges from 10 to 35%. However, motifs become too subtle to be identified by most methods as the number of degenerate positions increases. Figure 2 shows that the performance coefficients of most tools decline with the growing degrees of motif degeneracy.
The performance of MEME and MotifSeeker tends to remain stable over this range of degeneracy. However, on average, the performance coefficient of MEME seems to be lower than that of the other tools tested in this experiment.

Figure 2

Comparison of performance coefficients. The lengths of embedded motifs range from 6 to 20 and the degeneracy is drawn from 10 to 50%. The average sequence identity is 0.65. The point in the figure for each degree of degeneracy represents the average performance coefficient for the various motif lengths tested for the degree of degeneracy. Among the five tools, only MotifSeeker and MEME have consistent performance when motif degeneracy is beyond 35%.

Figure 3 displays the test results for specificity ∣K ∩ P∣/∣P∣ (Figure 3a) and sensitivity ∣K ∩ P∣/∣K∣ (Figure 3b), as defined by Pevzner and Sze (12). The sensitivity and specificity curves are similar to those in Figure 2. Among all tools, the sensitivity and specificity of MEME tend to decrease more slowly and those of MotifSeeker appear to remain steady. The average specificity of MotifSeeker is 1.0, and the average sensitivity is also very close to 1.0.
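The three measures used in this comparison can be computed from two position sets, as in the sketch below. It assumes that K and P hold individual nucleotide positions, represented as (sequence index, position) pairs; the helper names and the example sites are hypothetical.

```python
def covered_positions(sites, l):
    """Set of (sequence index, position) pairs covered by sites of length l,
    where each site is given as (sequence index, start offset)."""
    return {(s, off + q) for s, off in sites for q in range(l)}

def evaluate(known, predicted):
    """Performance coefficient, specificity and sensitivity of a prediction.
    Both sets are assumed to be non-empty."""
    overlap = len(known & predicted)
    return {
        "performance": overlap / len(known | predicted),
        "specificity": overlap / len(predicted),
        "sensitivity": overlap / len(known),
    }

# Two known 6-bp sites; one prediction is shifted by 2 bp, the other is exact.
known = covered_positions([(0, 10), (1, 20)], 6)
predicted = covered_positions([(0, 12), (1, 20)], 6)
result = evaluate(known, predicted)
```

In this example 10 of the 12 predicted positions are correct and the union holds 14 positions, so the performance coefficient is 10/14 while specificity and sensitivity are both 10/12.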

Figure 3

(a) Comparisons of specificities. (b) Comparisons of sensitivities. The average specificity of MotifSeeker is 1.0 and the average sensitivity is also close to 1.0. Sensitivities and specificities of the other tools except MEME show a clear trend of degradation over this range of degeneracy.

The degeneracy tolerance of MotifSeeker is further tested on samples with motif degeneracy of >50%. All the other test conditions are the same as the aforementioned ones. As presented in Figure 4, the performance of MotifSeeker gradually decreases as the degree of degeneracy exceeds 50%. The performance coefficient falls to 0.96 in the group with degeneracy between 51 and 55%. At degeneracy of >70%, the performance declines rapidly from 0.825 to 0.56. The average performance coefficient is 0.42 when the degree of degeneracy exceeds 75%. Specificity and sensitivity also decrease sharply when the degeneracy exceeds 70%. However, transcription elements in biological systems seldom display degeneracy as high as that of degenerate (8, 7)-, (9, 7)- and (12, 10)-motifs. Therefore, the poor performance of MotifSeeker on highly degenerate motifs affects the identification of only very few transcription elements.

Figure 4

Performance coefficient, specificity and sensitivity of MotifSeeker are evaluated in highly degenerate cases to examine the limitations of MotifSeeker.

A test paradigm that consists of 20 input sequences with lengths of between 500 and 10 000 bp and an average sequence identity of 60% is established to test the influence of the sequence length on the five programs. The degeneration rates of the embedded motifs are 25–30%, and the lengths of the motifs are between 10 and 15 bp. Since the number of spurious motifs increases, the performance coefficients of Consensus, Gibbs Sampler and Projection are <0.5 when the length of each input sequence is 10 000 bp. These results are consistent with those of Wang and Stormo (25). For MEME, the performance coefficient averaged over all tests where the length of each input sequence is no longer than 3000 bp is 0.487; MEME prohibits inputs with a total length of >60 000 bp. In contrast, MotifSeeker has a good performance coefficient, 0.93, even when the length of each sequence is 10 000 bp. Accordingly, the accuracy of MotifSeeker is not susceptible to the sequence length. The degree of motif degeneracy affects the performance of tools more strongly. As presented in Figures 2 and 4, even when the length of each input sequence is only 600 bp, the performance still decreases as the degree of motif degeneracy increases. As expected, position restriction combined with further filtration by the pseudo-occurrence elimination step renders MotifSeeker less susceptible to noise. Occasionally, the candidate set has a certain number of embedded occurrences containing the character i in position j with $L_{i,j} < 0$, especially when the distribution of the contents of the randomly generated occurrences is strongly uneven (Figure 5). Such an insignificant occurrence is regarded as a pseudo-occurrence and is wrongly discarded.
Consequently, the specificity of MotifSeeker tends to be better than its sensitivity (Figure 4). When the degeneracy exceeds 70%, even the best possible occurrence set G(W ∣ X), where W is an embedded occurrence and X is the correct set of degenerate positions, would contain many random matches. Some embedded occurrences become insignificant relative to the background and may be wrongly eliminated, whereas random occurrences may be falsely kept. Hence, the numbers of false positives and false negatives increase considerably when the degree of degeneracy exceeds 70%. Similar phenomena are observed when the number of sequences with motif occurrences is less than one-third of the number of input sequences.

Figure 5

An example for a strongly uneven content distribution. Three types of occurrences (CCTAT, CATAT and CGTAT) are allowed by the degenerate form CVTAT (IUPAC code symbol V represents nucleic acid A, G or C). The number of occurrences for each type is randomly determined in the experiment on synthetic data. Since only 2 out of 20 occurrences are CGTAT, the distribution of the occurrences is strongly uneven.

The purpose of this experiment is to demonstrate potential limitations of MotifSeeker. However, the synthetic data are generated according to our motif model and inevitably favor MotifSeeker over other tools. To evaluate its true applicability we proceed to assess MotifSeeker on biological datasets.

Evaluation of performance on yeast promoters

MotifSeeker is applied to known regulons, which are sets of genes that are regulated by the same transcription factors. The material is obtained from the promoter database of S.cerevisiae (SCPD). SCPD (26) provides information on regulons of S.cerevisiae and lists the experimentally verified transcription elements. The degenerate motifs of the transcription elements in most of the regulons are also supplied. MotifSeeker, YMF, Consensus, Projection, Gibbs Sampler and MEME are tested on each of the regulons (see Tables 1 and 2) that has at least three genes in SCPD. Table 1 shows the results for the 25 regulons with a consensus reported in SCPD, whereas Table 2 shows those without. The parameters cannot be easily set because no prior knowledge of the transcription elements is available. Nevertheless, a majority of yeast regulatory motifs are between 6 and 12 bp long. We choose [6, 12] as the range of lengths to be investigated for each dataset. Since, by definition, a degenerate (l, rl)-motif is also a degenerate (l, 0.6l)-motif for r < 0.6, the default value of parameter d is set to 0.6l, instead of running over a range. A factor not considered in the previous synthetic model is that some of the regulatory motifs occur only in k out of m sequences in the input sample. A reasonable assumption is that most of the regulon sequences share the same consensus of transcription elements. The default value of k, therefore, is set to m/2 (where m is the number of sequences in the sample). If YMF (16,17), Consensus (9), Gibbs Sampler (10,11), Projection (14) and MEME (8) have any parameters that are related to l, d or k, then these parameters are also set according to the aforementioned ranges used in MotifSeeker. Other parameters in these five tools are set to default values. For each program, among the 10 highest ranked predictions, the motif with the highest non-zero performance coefficient is selected as its predicted motif.
Such predictions are referred to as the ‘best close matches’ in this study.

Table 1

Comparison of MotifSeeker with other systems on S.cerevisiae datasets with consensus given in SCPD^a

Group	Published motif	MotifSeeker		YMF		Projection		Consensus		Gibbs Sampler		MEME
		Pattern	Rank	Pattern	Rank	Pattern	Rank	Pattern	Rank	Pattern	Rank	Pattern	Rank
CPF1	TCACGTG	CACGTG	1	CACGTGGC	1	CACGTG	1	CACGTGR	1	—	—	CACGTGRC	1
GCN4	TGANTN	TGABTC	4	TGACTS	2	TGAVTC	2	TGABTC	1	TGACTC	1	ABTGACTC	2
CAR1	AGCCGCCR	AGCCGCCR	5	AGCCGCCG	5	AGCCGCCG	1	KAGCCGCC	1	GCCGCCR	1	KAGCCGCSSRVR	3
CSRE	YCGGAYRRAW	GGAVRRATK	10	CGGAYGRA	2	TCCGGATA	9	—	—	CGGATRR	1	CGGRCSGAKG	3
GCR1	CWTCC	GCWTCCA	3	—	—	CTTCC	6	—	—	—	—	CGDSTTCC	4
HSE	GAANNTT	GAACSTTC	2	GAACSTTC	8	GAACSTT	2	—	—	—	—	YCYMGAAMBTYM	2
MATa	CRTGTWWWW	CRTGTAWW	5	CATGTWW	5	CATGTMWWW	5	—	—	CATGTAAWT	1	—	—
MCB	WCGCGW	ACGCGW	4	ACGCGW	2	ACGCGT	1	—	—	ACGCGT	1	MCGCGT	1
PDR3	TCCGYGGA	TCCGYGGA	1	TCCGYGGA	1	TCCGCGGA	1	TCCGYGGA	1	TCCGCGGA	1	WSDTTCCGYGGA	1
PHO4	CACGTK	CACGTG	1	CACGTGS	1	CACGTG	1	CACGTG	1	CACGTG	1	CACGTKSR	2
RAP1	RMACCCA	RCACCCA	1	ACCCAGAC	2	RCACCCA	2	RMACCMA	9	—	—	RMACCCANACM	3
REB1	YYACCCG	YTACCCG	2	YYACCCG	6	TACCCGC	2	YYACCCG	1	—	—	MTTACCCG	7
ROX1	YYNATTGTTY	CCATTGTTS	5	—	—	GCCYATTGTT	6	SCCYATTGTT	10	—	—	CMTTGTTC	3
SCB	CNCGAAA	WCGAAAT	5	CRCGAAA	1	CKCGAAA	3	CGCGAAA	9	GTCACGAA	1	HCDCGAAA	2
SFF	GTMAACAA	GGTMAACAA	7	—	—	AGGTCAACA	2	ASGTMAAC	8	—	—	—	—
STE12	ATGAAA	ATGAAACR	9	ATGAAAC	1	TGAAACA	4	TGAAAC	5	TGAAAC	1	TGAAAC	6
TBP	TATAWAW	TATAWAW	5	—	—	ATATAWA	7	—	—	—	—	—	—
MIG1	CCCCRNNWWWWW	CCCCRSDHWW	4	CCCCRGR	1	^b	—	CCCCRSA	4	—	—	CCCCRS	1
ABF1,BAF1	TCRNNNNNNACG	TCAHDRHDVACG	9	TCANNNNNNACG	1	—	—	—	—	—	—	TCWCBNHWBACG	7
GAL4	CGGNNNNNNNNNNNCCG	CGGVVVV	1	CGGNNNNNNNNNNNCCG	1	AGGCWSA	7	CGGMRSDCTBTY	1	—	—	CGGMVVDWBTY	2
HAP1	CGGNNNTANCGG	CGGKRTTWMCGG	5	CGGNNNNNNCGG	4	TTATYY	10	CGDTMWYWSC	1	—	—	CCGDTMTYTCC	1
MCM1	CCNNNWWRGG	TTWCCBD	8	—	—	^c	—	MCNDNWNNGG	2	CCNNNWWVGK	1	CCYDHTWRGGAA	1
UASPHR	CTTCCT	SGWGGH	5	—	—	—	—	GTGGNN	2	—	—	—	—
SWI5	KGCTGR	KGCTGR	9	GCTGRC	4	GGCTGA	9	TGCTGG	7	—	—	YGCTGG	1
HSTF	TTCNNGAA	TTCYAGAA	1	CYAGAA	3	TTCTAGAA	2	TTCYAGAA	2	TTCTVGAA	1	TTCTRGAA	2

^a We used [6, 12] as the range of lengths (l) to be investigated for all tools. In both MotifSeeker and Projection, [0, 0.6l] is used as the range of d. The parameter k of MotifSeeker and the parameter M of Projection are always set to m/2 (m is the number of sequences in the input). The number of degenerate symbols used in YMF is 4. For iterations where l ≤ 6, the maximum number w of middle spacers of YMF is set to 11. For those runs with l > 6, w = 5. For MEME, the total number of sites is set to the range [2, 100], and any number of repetitions is allowed on each sequence. For every tool except MEME, many runs are needed for each dataset. Other parameters required by YMF, Consensus, Projection, Gibbs Sampler and MEME are set to their default values. Ranks are evaluated in each single run. For each tool, the top 10 motifs in each run are considered, and the one with the highest performance coefficient is shown in this table. The estimated running time of Projection on MIG1 and MCM1 exceeds 60 days and we do not complete the execution. Missing entries indicate that no motif with non-zero performance coefficient can be found among the top 10 predictions.

^b The estimated running time of MIG1 is 69 days and its execution cannot be completed.

^c The estimated running time of MCM1 is 256 days and its execution cannot be completed.

Table 2

The performance coefficients of the predicted motifs for the 25 regulons with consensus given in SCPD

Group	MotifSeeker	YMF	Projection	Consensus	Gibbs sampler	MEME
CPF1	0.78	0.70	0.78	0.76	0.00	0.74
GCN4	0.67	0.58	0.50	0.67	0.44	0.49
CAR1	0.61	0.30	0.30	0.41	0.45	0.43
CSRE	0.54	0.32	0.16	—	0.14	0.20
GCR1	0.74	—	0.32	—	—	0.49
HSE	0.60	0.60	0.56	—	—	0.52
MATa	0.55	0.48	0.62	—	0.47	—
MCB	0.83	0.83	0.42	—	0.67	0.75
PDR3	0.86	0.86	0.52	0.86	0.52	0.50
PHO4	0.64	0.74	0.64	0.64	0.64	0.73
RAP1	0.83	0.30	0.83	0.75	—	0.37
REB1	0.78	0.83	0.78	0.83	—	0.67
ROX1	0.54	—	0.40	0.51	—	0.34
SCB	0.71	0.82	0.55	0.18	0.34	0.88
SFF	0.75	—	0.38	0.67	—	—
STE12	0.89	0.78	0.44	0.76	0.76	0.76
TBP	0.66	—	0.56	—	—	—
MIG1	0.83	0.29	—	0.44	—	0.38
ABF1,BAF1	0.70	0.80	—	—	—	0.55
GAL4	0.41	0.84	0.15	0.32	—	0.35
HAP1	0.50	0.67	0.17	0.61	—	0.53
MCM1	0.69	—	—	0.73	0.67	0.36
UASPHR	0.31	—	—	0.24	—	—
SWI5	0.75	0.70	0.25	0.50	—	0.50
HSTF	0.80	0.63	0.71	0.80	0.71	0.50
Average	0.68	0.48	0.40	0.43	0.23	0.44

Each entry shows the highest performance coefficient among the 10 highest ranked predictions. An entry marked by ‘—’ indicates that the tool fails to find any motif with non-zero performance coefficient for the corresponding regulon within top 10. The average performance coefficients are computed by treating the missing entries as predictions with performance coefficients of 0.
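The averaging rule in the note above can be checked directly. The following is a minimal sketch, assuming missing entries are encoded as None (the function name is illustrative and not part of MotifSeeker); it uses the Gibbs sampler column of Table 2 as input.

```python
from typing import Optional, Sequence


def average_with_missing_as_zero(values: Sequence[Optional[float]]) -> float:
    """Mean of the entries, counting missing (None) entries as 0."""
    return sum(v if v is not None else 0.0 for v in values) / len(values)


# Gibbs sampler column of Table 2, top to bottom (CPF1 ... HSTF).
# None stands for a missing entry in the table.
gibbs = [0.00, 0.44, 0.45, 0.14, None, None, 0.47, 0.67, 0.52, 0.64,
         None, None, None, 0.34, None, 0.76, None, None, None, None,
         None, 0.67, None, None, 0.71]

print(round(average_with_missing_as_zero(gibbs), 2))  # 0.23
```

The result agrees with the value in the Average row of Table 2 for the Gibbs sampler.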

Tables 1–3 present the best close matches to each published motif. Missing entries indicate that no motif with a non-zero performance coefficient was found among the top 10 predictions. Of all the tested regulons, 24 out of 25 regulons with consensus reported in SCPD are identified within the top 10 by MotifSeeker (Table 1). For UASPHR, the consensus listed at SCPD is CTTCCT, but an alignment of all documented sites suggests another consensus, SGWGGH. MotifSeeker identifies this latter consensus at rank 5, with a performance coefficient of 0.31 (Table 2). A similar consensus, GTGGNN, is also discovered by the Consensus tool at rank 2 (Table 1).

Table 3

Comparison of MotifSeeker with other systems on S.cerevisiae datasets without consensus given in SCPDa

	MotifSeeker	YMF	Projection	Consensus	Gibbs Sampler	MEME
BAS1,PHO2	0.2 (3)	0.13 (9)	0.13 (1)	0.07 (2)	—	0.17 (1)
MATalpha1	0.23 (10)	—	0.09 (2)	0.09 (1)	—	—
TAF	—	—	—	0.13 (5)	—	—
PHO2	0.27 (6)	—	0.22 (7)	0.17 (1)	—	0.24 (4)
RP-A	0.6 (4)	—	0.6 (3)	0.69 (1)	0.15 (1)	0.53 (4)
UASH	0.18 (9)	—	0.05 (4)	—	—	—
URSIH	0.63 (3)	0.41 (1)	0.24 (1)	0.63 (1)	0.17 (1)	0.81 (1)
GATA	0.82 (1)	0.72 (9)	0.24 (1)	0.29 (1)	—	0.55 (1)
HAP2	0.44 (8)	0.37 (5)	0.22 (10)	0.25 (4)	—	0.4 (3)
PRE	1 (2)	—	1 (1)	1 (1)	—	0.51 (4)
UASCAR	0.68 (4)	0.12 (9)	0.68 (2)	0.23 (3)	—	0.29 (1)
UIS	0.65 (10)	—	0.27 (1)	0.46 (3)	—	0.46 (2)
GLN3	0.44 (5)	0.32 (3)	0.84 (3)	0.88 (5)	—	0.43 (9)
PDR1	0.78 (4)	0.74 (1)	0.11 (2)	0.3 (1)	0.54 (1)	0.36 (1)
Average	0.49	0.22	0.33	0.37	0.06	0.34

aThe performance coefficients and the ranks (in parentheses) of the predicted motifs for the 14 regulons without consensus given in SCPD. Each entry shows the highest performance coefficient among the 10 highest ranked predictions. An entry marked by ‘—’ indicates that the tool fails to find any motif with non-zero performance coefficient for the corresponding regulon within top 10. The average performance coefficients are computed by treating the missing entries as predictions with performance coefficients of 0.

The results obtained from the analysis of GCN4 are selected as an example to demonstrate the strength of data fusion. The ranks of GCN4 are 12 and 7 when scoring functions R1 and R2 are used, respectively. The rank of GCN4 is promoted to 4 after data fusion is performed. A similar improvement is also observed in the other regulons. Since none of the hypotheses underlying our ranking functions can be expected to be effective in all cases, data fusion provides the advantage of combining the merits of different scoring functions and balancing their individual demerits. With respect to the other tools tested on the regulons with consensus given in SCPD, Gibbs Sampler fails to report 14 out of 25 regulons within the top 10. Although Projection can identify motifs with non-zero performance coefficients within the top 10 for most of the regulons, it requires longer execution times in most cases. In particular, its estimated running times on MIG1 and MCM1 exceed 60 days, so these runs could not be completed (Table 1). Compared with the other five tools, YMF performs better on most motifs with middle spacers, such as GAL4, ABF1 and HAP1, as a result of its motif model. For the regulons without consensus given in SCPD, MotifSeeker fails to predict a motif with a non-zero performance coefficient within the top 10 for only 1 of the 14 regulons (Table 3). For this and the previous sets of regulons, the average performance coefficients of MotifSeeker also compare favorably with those of the other tools (Tables 2 and 3). In summary, this experiment shows that MotifSeeker appears to be a reliable tool for discovering transcription elements in yeast promoters.
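To illustrate how combining two rankings can promote a motif, the sketch below fuses two rank lists by average rank. This is only one common form of rank combination and is an assumption here; the exact fusion rule used by MotifSeeker is defined in the Methods. The motif names and ranks are hypothetical toy data, not results from the paper.

```python
from typing import Dict


def fuse_by_average_rank(r1: Dict[str, int], r2: Dict[str, int]) -> Dict[str, int]:
    """Combine two rankings (1 = best) by average rank and re-rank the result."""
    motifs = set(r1) & set(r2)
    # Sort by mean rank; ties are broken by the better single rank, then by name.
    order = sorted(
        motifs,
        key=lambda m: ((r1[m] + r2[m]) / 2, min(r1[m], r2[m]), m),
    )
    return {m: i + 1 for i, m in enumerate(order)}


# Hypothetical rankings of four candidate motifs under two scoring functions.
r1 = {"A": 1, "B": 4, "C": 2, "D": 3}
r2 = {"A": 4, "B": 1, "C": 2, "D": 3}

fused = fuse_by_average_rank(r1, r2)
print(fused)  # {'C': 1, 'A': 2, 'B': 3, 'D': 4}
```

In this toy case, motif C is ranked second by both functions yet first after fusion, because each function's top choice is ranked poorly by the other.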

Evaluation of performance on tissue-specific regulatory elements

The identification of regulatory elements within the human genome is yet another challenge. Since published experimental data on liver-specific genes are abundant, a predictive procedure is presented here to identify transcription elements associated with liver-specific transcription. The data used were collected by Krivan and Wasserman (27). This dataset includes four liver-specific factors: HNF-1, HNF-3, HNF-4 and C/EBP. Each regulon consists of at least five genes. Longer promoter sequences are retrieved from GenBank based on the gene names listed in Krivan's data. The average length of the analyzed promoter sequences is ∼2.5 kb. MotifSeeker is run on each set of the four regulons, which contain only human liver-specific genes. Since human regulatory elements may be longer and more degenerate than those of yeast, [6, 15] is set as the range of the motif length l, and parameter d is set to 0.7l. Other parameter settings are the same as those used in the previous experiment on the yeast dataset. Despite the lack of comparative analyses across other species, matches to the published motifs of HNF-1, HNF-3 and HNF-4 are identified within the top 10 predictions (Table 4 and Figure 6). MotifSeeker performs better on HNF-1, HNF-3 and HNF-4 than on C/EBP because most of the corresponding transcription elements for these three transcription factors have conserved sub-patterns: DGTTAWTNWWYDNH, MNTRTTKRYHY and NHCTTTGBHMND, respectively. The predicted motifs appear to contain these sub-patterns (Figure 6). For C/EBP, the best ranked prediction with a non-zero performance coefficient falls outside the top 10, and only half of the binding sites are located by this prediction. However, under the range of parameter settings, the other five tools fail to identify any published motif; they also often give an empty output for all four co-regulated liver-specific gene sets. 
The degenerate rate of the motifs recognized by the four factors exceeds 50%, resulting in the failure of the five tools. Additionally, long and highly degenerate motifs detrimentally affect the running time of YMF and Projection. These results are consistent with those observed on synthetic models. Although MotifSeeker already performs well without auxiliary data, utilizing suitable auxiliary data further improves its performance. Two optional simple post-processes are proposed below.

Table 4

Performance coefficients and ranks of the best close matches to the published motifs in three stagesa

Transcription elements	Stage 1	Stage 2	Stage 3
	Analysis only in human species	Integration of comparative sequences analysis	Further refinement by negative set
HNF-1	0.7 (5)	0.73 (3)	0.73 (3)
HNF-3	0.43 (9)	0.58 (6)	0.58 (2)
HNF-4	0.65 (3)	0.74 (1)	0.74 (1)
C/EBP	—	0.44 (10)	0.44 (6)

aThe performance coefficients and the ranks (in parentheses) of the predicted motifs for the four liver-specific regulons. Each entry shows the highest performance coefficient among the 10 highest ranked predictions. An entry marked by ‘—’ indicates that MotifSeeker fails to find any motif with non-zero performance coefficient for the corresponding regulon within top 10.

Figure 6

The best close matches to the published HNF-1, HNF-3, HNF-4 and C/EBP motifs. The left column lists logos of the published motifs from JASPAR. The published HNF-1, HNF-3 and HNF-4 motifs contain conserved sub-patterns DGTTAWD, TRTTKRY and HCTTTGBHM, respectively (IUPAC code D: A, T or G; W: A or T; R: A or G; Y: T or C; H: A, T or C; B: C, T or G; M: A or C). On the other hand, the published C/EBP motif is very weakly conserved. The right column lists logos of the corresponding best close matches from MotifSeeker without the proposed post-processes.
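The IUPAC codes listed in the caption can be applied mechanically to locate occurrences of a degenerate sub-pattern in a sequence. The sketch below is illustrative only: it expands each IUPAC symbol into a regular-expression character class and scans the forward strand. The function names and the example sequence are hypothetical and are not part of MotifSeeker.

```python
import re
from typing import List, Tuple

# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(pattern: str) -> str:
    """Translate an IUPAC consensus into an equivalent regular expression."""
    return "".join(f"[{IUPAC[ch]}]" for ch in pattern.upper())


def find_sites(pattern: str, sequence: str) -> List[Tuple[int, str]]:
    """Return (0-based start, matched site) pairs on the forward strand."""
    # A lookahead is used so that overlapping occurrences are reported too.
    regex = re.compile(f"(?=({iupac_to_regex(pattern)}))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence.upper())]


# Hypothetical sequence containing one occurrence of the HNF-1 sub-pattern.
print(find_sites("DGTTAWD", "CCAGTTAATCC"))  # [(2, 'AGTTAAT')]
```

A full scan of a promoter would also need to check the reverse complement, which is omitted here for brevity.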

Integrating knowledge from co-regulated genes in a single species and sequence conservation among orthologous genes of different organisms has been shown to help in identifying weak motifs (25). Therefore, the first post-process is to combine the original results with phylogenetic footprints. The sequences of orthologous genes are included in the input set. The differential conservation facilitates the identification of evolutionarily conserved functional elements and the filtration of false occurrences. This evolution-based filter improves the rank and the performance coefficient of the best close match to each published motif (Table 4). Most of the false occurrences of these motifs are also removed. Nevertheless, C/EBP remains the most difficult case because transcription elements for C/EBP, unlike those for the other three transcription factors, have no obvious conserved sub-patterns. To further improve the performance on C/EBP, promoters of liver-specific genes are used as positive sequences and promoters of genes not expressed in liver are regarded as negative sequences. A motif found in both positive and negative sequences generally cannot be specific to the positive set. Therefore, MotifSeeker is run on another well-known collection, which is a set of high-quality annotations of transcription elements for muscle-specific genes (28). Most of the sequences in this collection are ∼300 bases long, and may be too short to be an effective negative set. Hence, longer promoter sequences are retrieved from GenBank based on the gene names listed in the dataset. Motifs found in the muscle-specific genes are used as markers. The original outputs for liver-specific datasets are refined by eliminating each motif M1 whose edit distance to some marker M2 is at most 0.2 × max{∣M1∣, ∣M2∣}. (The scoring function is detailed in the Supplementary Data.) The rank of the best close match to the published C/EBP motif is advanced from 10 to 6 following the refinement (Table 4). 
In summary, these results indicate that MotifSeeker performs better on S.cerevisiae and liver-specific datasets than do the other tools tested herein. With the aid of suitable data, the output of MotifSeeker can be further improved by the two optional post-processes proposed above.
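The negative-set refinement described above, which removes each motif M1 whose edit distance to some marker M2 is at most 0.2 × max{∣M1∣, ∣M2∣}, can be sketched as follows. This sketch assumes a plain unit-cost Levenshtein distance over the motif strings; the actual scoring function is given in the Supplementary Data and may differ. The function names and example motifs are hypothetical.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (an assumed stand-in for the paper's scoring)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def remove_marked(motifs: Iterable[str], markers: Iterable[str]) -> List[str]:
    """Keep only motifs that are not within the 0.2 threshold of any marker."""
    marker_list = list(markers)
    return [
        m1 for m1 in motifs
        if not any(edit_distance(m1, m2) <= 0.2 * max(len(m1), len(m2))
                   for m2 in marker_list)
    ]


# Hypothetical motifs: the first is one substitution away from the marker.
print(remove_marked(["ACGTACGTAC", "TTTTTTTTTT"], ["ACGTACGTAA"]))  # ['TTTTTTTTTT']
```

With motifs of length 10, the threshold is 2 edits, so the first motif is discarded as non-specific and the second is retained.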

Evaluation of performance on a well-known benchmark

Finally, MotifSeeker is tested on a well-known public benchmark (21). All the datasets in the benchmark are used. This benchmark has three types of background sequences. One-third of all of the datasets used are ‘real’; one-third are ‘generic’ and the rest are ‘Markov’. The ranges of parameter settings over which the iterations are performed are the same as those for the experiments on liver-specific genes. For each tested dataset, the output of MotifSeeker is refined by taking all other sequences of the same species in the benchmark as a negative set to remove the spurious motifs. After this refinement, motif scores are adjusted by favoring a higher number of similar motifs found in all top 20 ranking lists in the iterations. (Please see the Supplementary Data for a detailed description of the post-processing and the results.) The highest-scoring motif is then selected. The same process is adopted for all datasets. The average correlation coefficient (nCC) of MotifSeeker is 0.262, which compares favorably with that of Weeder (29), the winner of the assessment on all 52 datasets. With respect to the human portion, the highest nCC obtained to date is 0.149 (30). Although the nCC of MotifSeeker for human, 0.212, is lower than its nCC for all other species considered, it improves on this previous best by 42%.
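For reference, the sketch below shows how a nucleotide-level correlation coefficient is computed from confusion counts and how the 42% figure follows from the two human nCC values quoted above. It assumes nCC is the correlation coefficient defined in the benchmark (21); returning 0 when the denominator vanishes is a convention chosen for this sketch only, and the function names are illustrative.

```python
import math


def ncc(ntp: int, nfp: int, ntn: int, nfn: int) -> float:
    """Nucleotide-level correlation coefficient from confusion counts."""
    denom = math.sqrt((ntp + nfn) * (ntn + nfp) * (ntp + nfp) * (ntn + nfn))
    # Assumption for this sketch: report 0 when the coefficient is undefined.
    return (ntp * ntn - nfn * nfp) / denom if denom else 0.0


def relative_improvement(new: float, old: float) -> float:
    """Fractional gain of `new` over `old`."""
    return (new - old) / old


# Human portion of the benchmark: 0.212 (MotifSeeker) versus 0.149 (previous best).
print(round(100 * relative_improvement(0.212, 0.149)))  # 42
```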

DISCUSSION

Applications to synthetic and biological data indicate that MotifSeeker is promising for identifying degenerate transcription elements with specific variable sites directly from sequences of a single species. The accuracy of MotifSeeker is less sensitive to the length of input sequences and the degree of motif degeneracy. Without auxiliary data, the performance of MotifSeeker is already satisfactory, enabling more general applications. If suitable auxiliary data are available, two optional post-processes can be incorporated. One combines the original input with orthologous sequences. The other refines the predicted results using the output for gene sets with expression patterns that differ from those of the input genes. As shown in the investigation of liver-specific gene sets, both refinement steps can further enhance performance. In the evaluation on the benchmark, of all the datasets used, the set of type ‘real’ is the most difficult for MotifSeeker. This difficulty was also encountered by the 13 tools evaluated by Tompa et al. (21). These evaluations should not be taken as an indictment of the motif discovery tools, a point also made by Tompa et al. (21). Despite this difficulty, MotifSeeker still compares favorably with other motif finders. The reasons for the good performance of MotifSeeker are as follows. MotifSeeker is designed based on the position specificity of transcription elements, since transcription elements often prefer certain variation patterns. Restricting variation positions makes the candidate occurrence less likely to be obscured by random matches. Pseudo-occurrences in candidate sets are further eliminated by considering the weakly conserved letters. In this manner, both the positions and the contents of the variations are explicitly considered, and noise is thus reduced. Furthermore, two scoring measures are combined by data fusion to improve the performance on ranking motifs. 
Since the performance of a single measure often varies from case to case, the hybrid ranking method provides a more general scheme with consistently high performance. To ensure position specificity, the distance between each pair of occurrences of a motif must be within d. A clique finding procedure is usually necessary to meet this requirement while keeping the search in the input sequences. However, clique finding takes exponential time with respect to the length of input sequences. If, instead, all possible patterns over IUPAC code or merely {A, T, C, G} are to be enumerated, then the computation time would at least be proportional to $4^l$. MotifSeeker represents a compromise between these two extremes. All possible sets of degenerate positions are considered, but $\binom{l}{d} \ll 4^l$. That is, $\binom{l}{d}$ is much smaller than $4^l$, which is the number of all nucleic acid patterns of length l, by an exponential factor. For each set of degenerate positions, MotifSeeker searches only in the input sequences and can find cliques in a manner that is as simple and efficient as finding stars. In experiments on synthetic samples, with $\binom{l}{d} < 10^3$ in most cases, the time required is only several seconds to minutes. When $\binom{l}{d}$ is ∼$10^4$, as in the (16, 8) or (20, 15) cases, or the total length of the input sequences is ∼$10^5$ bp, the running time is ∼1 h or above. MotifSeeker is designed for single motifs. However, extending MotifSeeker to identify composite motifs is not difficult. One reasonable step is to incorporate information on the distance between adjacent sites and the interactive relationships among transcription factors. A current trend for finding motifs involves genome-wide sequence analysis (31,32). Since large-scale situations have much more noise than others, generally, no single ranking method can satisfactorily reflect the significance of motifs. Data fusion is expected to provide a robust ranking method in genome-wide applications. 
Extensions in this direction include (i) considering the different properties of motifs (such as copy number and motif conservation considered herein, and the position of the transcription initiation site, etc.) and different measures for each property (such as various measures of motif conservation, including information content and s1); (ii) given these multiple scoring functions, finding the best subset to combine so as to optimize performance in reasonable running time; and (iii) finding the best method of combination (such as by rank or by score.) All such studies should help to find regulatory motifs more accurately and will allow us to have a better understanding of the mechanism of gene regulation.
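The size comparison made in the discussion of running time can be checked numerically. The sketch below assumes that the number of candidate sets of degenerate positions for an (l, d)-motif is the binomial coefficient C(l, d); this assumption is consistent with the ∼10^4 figure quoted for the (16, 8) and (20, 15) cases. The function names are illustrative.

```python
from math import comb


def position_sets(l: int, d: int) -> int:
    """Number of ways to choose d degenerate positions out of l (assumed count)."""
    return comb(l, d)


def all_patterns(l: int) -> int:
    """Number of nucleic acid patterns of length l over {A, C, G, T}."""
    return 4 ** l


for l, d in [(16, 8), (20, 15)]:
    print(l, d, position_sets(l, d), all_patterns(l))
# 16 8 12870 4294967296
# 20 15 15504 1099511627776
```

Both position-set counts are on the order of 10^4, several orders of magnitude below the corresponding number of exact patterns.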

PROGRAM AVAILABILITY

MotifSeeker is written in C and Perl. It is available at . Each output motif can be further explored on Patch™, a web-based tool integrated in TRANSFAC.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.
