
MOPAT: a graph-based method to predict recurrent cis-regulatory modules from known motifs.

Abstract

The identification of cis-regulatory modules (CRMs) can greatly advance our understanding of eukaryotic regulatory mechanism. Current methods to predict CRMs from known motifs either depend on multiple alignments or can only deal with a small number of known motifs provided by users. These methods are problematic when binding sites are not well aligned in multiple alignments or when the number of input known motifs is large. We thus developed a new CRM identification method MOPAT (motif pair tree), which identifies CRMs through the identification of motif modules, groups of motifs co-occurring in multiple CRMs. It can identify 'orthologous' CRMs without multiple alignments. It can also find CRMs given a large number of known motifs. We have applied this method to mouse developmental genes, and have evaluated the predicted CRMs and motif modules by microarray expression data and known interacting motif pairs. We show that the expression profiles of the genes containing CRMs of the same motif module correlate significantly better than those of a random set of genes do. We also show that the known interacting motif pairs are significantly included in our predictions. Compared with several current methods, our method shows better performance in identifying meaningful CRMs.

Substances:
Transcription Factors

Year: 2008 PMID: 18606616 PMCID: PMC2490743 DOI: 10.1093/nar/gkn407

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Identifying cis-regulatory modules (CRMs) is an important problem in this postgenomic era. CRMs are short DNA regions of a few hundred base pairs that contain multiple transcription factor-binding sites (TFBSs). It is estimated that there are five to ten times as many CRMs in a genome as there are genes (1). In higher eukaryotes, CRMs rather than individual TFBSs often determine the spatiotemporal expression patterns of neighboring genes. Therefore, identifying CRMs is important not only for understanding gene transcriptional regulation but also for annotating higher eukaryotic genomes. However, identifying CRMs in higher eukaryotes is challenging, for two reasons. First, the regions in which the CRMs of one gene may reside can span thousands or even hundreds of thousands of base pairs. Second, TFBSs are in general 6–14 bp long, and there is some degeneracy at almost every position of the TFBSs of a transcription factor (TF). Thus, scanning a DNA sequence with even one known motif yields many false motif hits, and scanning a sequence with a large number of known motifs yields motif hits at nearly every position, most of which are false. In the past several years, a number of computational approaches for CRM identification have been developed (2–12). Of them, those with high degrees of success are based on known motifs, represented by position weight matrices (PWMs). There are two types of methods based on known PWMs. The first type predicts CRMs from a small set of known PWMs that are expected to form a motif module (6,7,13–16). Here, a motif module is a group of motifs with instances co-occurring in many CRMs. The second type (9) depends on multiple sequence alignments to predict CRMs from a large set of PWMs, such as all motifs deposited in the Transfac database (17,18). 
In practice, it is often difficult to provide the motifs in a motif module since most motif modules are unknown. On the other hand, TFBSs are so short that the counterpart TFBSs in orthologous sequences are often not aligned in multiple sequence alignments. Thus, many CRMs can be missed by the current computational CRM identification methods based on known PWMs. Here, we developed a new method, motif pair tree (MOPAT) that identifies CRMs from known PWMs. Our method can handle a large number of input motifs. It does not rely on the multiple sequence alignments either. The major difference between our method and the published methods is that we predict CRMs through the identification of motif modules, while many current CRM prediction methods predict CRMs in one region independently from other regions. The identification of a motif module, the recurrent combination of motifs shared by many CRMs, greatly improves the accuracy of CRM predictions. By applying this method to mouse developmental genes, we found many motif modules that significantly overlap with known interacting motifs. We also found that the expression profiles of genes containing CRMs of the same motif modules significantly correlate better than those of random genes do. Compared with several available CRM identification methods based on known motifs, our method shows better performance in identifying motif modules. The software based on our method can be freely downloaded from the following link http://evolution.compbio.iupui.edu/li/page/software.

MATERIALS AND METHODS

Transfac motifs and upstream sequences of mouse developmental genes

All 522 vertebrate PWMs from Transfac 9.2 (17,18) were extracted for the CRM analysis. Pseudo counts are introduced to regularize these PWMs, as described subsequently. The mouse developmental genes are obtained from three sources: (i) those annotated with GO:0032502 (developmental process); (ii) those annotated with the offspring terms of GO:0032502 and (iii) those whose orthologs in 10 other species (human, rat, dog, chicken, frog, fugu, zebrafish, nematode, sea squirt and fly) are annotated with GO:0032502 or its offspring terms. The 5-kb long noncoding sequences around the transcription start sites (TSS) of the mouse developmental genes are extracted from the Ensembl website (http://www.ensembl.org/biomart/, release 46) with the BioMart software (19) according to the Ensembl gene IDs. If the annotated gene start codon is within 2.5 kb of the TSS, we only use the sequence from the −2 kb position to the start codon. The repeat sequences in these 5-kb long sequences are masked with the RepeatMasker software (http://www.repeatmasker.org). We apply our method to mouse because many mouse developmental microarray datasets are available for validating our predicted CRMs and motif modules.

Score of a DNA segment given a PWM

The score of a DNA segment $b_1 b_2 \ldots b_d$ given a PWM is the log-likelihood ratio

$$S = \sum_{i=1}^{d} \log \frac{f(b_i, i)}{f(b_i)},$$

where f(b) is the average frequency of nucleotide b in the above mouse developmental gene upstream sequences, f(b,i) is the frequency of nucleotide b at position i of the motif PWM under consideration, and d is the width of the motif. A pseudo count of 0.375 is added in all frequency computations, as was done by Claverie and Audic (20).
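As an illustration of the score described above, the following minimal Python sketch computes a log-likelihood ratio for a segment given a count matrix. The function names, the dictionary-based matrix layout, the use of the natural logarithm and the way the pseudo count enters the column totals are assumptions made for this example; they are not taken from the MOPAT software.

```python
import math

PSEUDOCOUNT = 0.375  # pseudo count quoted in the text
BASES = "ACGT"


def pwm_frequencies(counts, pseudocount=PSEUDOCOUNT):
    """Convert a count matrix (one {base: count} dict per position)
    into per-position frequencies f(b, i).

    Assumption: the pseudo count is added to every base count and the
    column total is increased by four pseudo counts.
    """
    freqs = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        freqs.append({b: (column.get(b, 0) + pseudocount) / total for b in BASES})
    return freqs


def llr_score(segment, freqs, background):
    """Log-likelihood ratio of a d-bp segment: sum_i log(f(b_i, i) / f(b_i))."""
    if len(segment) != len(freqs):
        raise ValueError("segment length must equal the motif width d")
    return sum(math.log(freqs[i][b] / background[b]) for i, b in enumerate(segment))
```

A segment matching the consensus receives a positive score and a segment made of disfavored bases a negative one, which is what allows a single cutoff per motif.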

Identification of candidate motif hits in each sequence

We scan the nonrepetitive regions of each sequence to identify hits of each of the known motifs, using the above defined log-likelihood ratio score. We first compute the nucleotide distribution in the above developmental gene sequences, and produce a 100-kb long random sequence with the same nucleotide distribution. We then compute the score of every motif at each position of the random sequence to obtain the score distribution of the motif. Finally, for every motif, we take the 99.99% quantile of its score distribution as its score cutoff. Some motifs may tend to occur together merely because their PWMs are similar. To deal with this, many programs that predict CRMs (7,14,15,21,22) or interacting motif pairs (23–25) require that the motifs do not overlap with each other. Similarly, we require that the start positions of any two motif hits be separated by at least 4 bp, a requirement also used by Sharan et al. (28). We sort the motif hits by their start positions and delete overlapping hits according to the following rule: when the start positions of two motif hits are <4 bp apart, the hit with the lower score is discarded.
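The cutoff selection and the overlap rule can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the empirical quantile is taken by simple index rounding, and close hits are resolved greedily in a single left-to-right pass, an order the text does not specify.

```python
import random


def random_sequence(length, base_freqs, seed=0):
    """Random sequence drawn from the given nucleotide distribution."""
    rng = random.Random(seed)
    bases = list(base_freqs)
    weights = [base_freqs[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


def score_cutoff(scores, quantile=0.9999):
    """Empirical quantile of the scores a motif obtains on a random sequence."""
    ordered = sorted(scores)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def remove_close_hits(hits, min_separation=4):
    """hits: list of (start, score, motif) tuples.

    When two hits start fewer than min_separation bp apart, only the
    higher-scoring one is kept (assumption: greedy pass over sorted starts).
    """
    kept = []
    for hit in sorted(hits):
        if kept and hit[0] - kept[-1][0] < min_separation:
            if hit[1] > kept[-1][1]:
                kept[-1] = hit  # the later hit wins on score
        else:
            kept.append(hit)
    return kept
```

For instance, `remove_close_hits([(0, 5.0, "m1"), (2, 7.0, "m2"), (10, 3.0, "m3")])` keeps the hits of `m2` and `m3`, because `m1` starts only 2 bp before the higher-scoring `m2`.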

Identification of CRMs and motif modules in a motif pair tree

We define the following parameters in our method. Kmin and Kmax are the minimum and maximum numbers of motifs in a motif module, respectively. The parameter w is the maximal allowed length of a CRM, and g is the minimal required number of genes that contain instances of a motif module. Our method carries out the following three steps to identify CRMs and motif modules. First, we extract the motif pair information from the motif hits by hashing. The motif pair information includes which two motifs have hits co-occurring within a w-bp window and how many sequences contain instances of such co-occurring motif pairs. For every such motif pair, we also store the list of names of the genes whose sequences contain instances of both motifs in the pair. For example, in seq1, assuming pos1 is the start position of the motif m6 hit and pos2 is the end position of the motif m0 hit, m6 and m0 form a motif pair since pos2 − pos1 is less than w. Therefore, we add the gene name seq1 to the gene list of m6m0. Note that seq1 is added to the gene list of m6m0 only once, no matter how many times m6m0 occurs in seq1 (Figure 1b). After all sequences are analyzed, we scan the hash table to delete the motif pairs that occur in fewer than g genes and update the motif degrees accordingly. For every motif in the remaining motif pairs, we then store the motif degree, which is the number of distinct motifs that form a motif pair with the motif under consideration. For example, in Figure 1a, the motif degree of motif m4 is one, because m4m3 is the only motif pair containing m4 that occurs in no fewer than g = 2 genes. In the following, motifs always mean those in the motif pairs that occur in at least g genes.
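The first step can be pictured with a small hash-table sketch in Python. The data layout (a dictionary from gene name to a list of `(start, end, motif)` hits) and the function name are assumptions for illustration only; the pairwise loop is quadratic in the number of hits per gene and is written for clarity, not speed.

```python
from collections import defaultdict
from itertools import combinations


def motif_pairs(hits_by_gene, w, g):
    """Collect motif pairs co-occurring within w bp and their gene lists.

    hits_by_gene: {gene: [(start, end, motif), ...]}
    Returns (pairs, degree), where pairs maps frozenset({m1, m2}) to the
    set of genes containing the pair, and degree maps each motif to the
    number of distinct partner motifs among the retained pairs.
    """
    pair_genes = defaultdict(set)
    for gene, hits in hits_by_gene.items():
        for (s1, e1, m1), (s2, e2, m2) in combinations(sorted(hits), 2):
            # span from the leftmost start to the rightmost end must be < w
            if m1 != m2 and max(e1, e2) - min(s1, s2) < w:
                pair_genes[frozenset((m1, m2))].add(gene)  # a set counts each gene once
    # drop pairs seen in fewer than g genes
    pairs = {pair: genes for pair, genes in pair_genes.items() if len(genes) >= g}
    degree = defaultdict(int)
    for pair in pairs:
        for motif in pair:
            degree[motif] += 1
    return pairs, dict(degree)
```

Using a set per pair mirrors the rule that a gene is recorded once regardless of how many times the pair occurs in it.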

Figure 1.

Construction of motif pair tree. (A) Motif hits of ten motifs in eight sequences. The motifs can overlap with each other as long as their start positions are separated by at least 4 bp. The motifs in the same box are all paired with each other. (B) Motifs and their paired motif lists. The number in parentheses is the number of genes that contain instances of the motif pair. (C) Motif pair tree. Each node in the motif pair tree represents a motif. Each path in the motif pair tree represents a potential motif module.

Second, we construct a motif pair tree and output potential motif modules. We define the motif degree of a node in the motif pair tree as the motif degree of the motif represented by the node. We first take null as the root node of the tree. We then construct a node for every motif as a child node of the root node. Here and in the following, the child nodes of a node are always sorted from left to right by increasing motif degree. For every leaf node in the current tree, we then take every other node with a higher motif degree as a child node, provided that the motif with the higher motif degree and the motif represented by the leaf node under consideration form a motif pair retained in the first step. We repeat this procedure until no child node can be added to any current leaf node. For instance, the tree in Figure 1c is the motif pair tree for the motif hits in Figure 1a. Any path in the motif pair tree starting from the root is a potential motif module, and each potential motif module is represented by one and only one path of the motif pair tree. In practice, we construct the motif pair tree in a depth-first manner. We also add a gene list to each node when we construct a path of the motif pair tree. The gene list of a child node is a subset of the gene list of its parent node. Taking the path null–m1–m2–m3–m0 in Figure 1c as an example, the gene list of the node m0 is the set of genes that contain instances of all of the motif pairs m1m2, m1m3, m1m0, m2m3, m2m0 and m3m0. With the gene list for each node, we can stop extending a path to a node if the number of genes in its gene list is less than g, which increases the efficiency of our method. In this way, we only need to store one branch of the tree at a time, which enables our method to handle a large number of known motifs and input gene noncoding sequences.

Third, we check whether the potential motif modules really have instances in at least g genes. For each motif module, we check the genes in its gene list one by one to see whether the motif module can be found in them. For each gene, we scan the sequence to see whether any w-bp window contains hits of all motifs of the potential motif module. The gene is considered to contain instances of the motif module if such a w-bp window exists. A motif module is output if the number of genes containing instances of this motif module is no less than g. The w-bp windows that contain instances of this motif module in these genes are the CRMs of this motif module.
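The second and third steps can be sketched as a depth-first enumeration followed by a window check. This is a simplified illustration, not the MOPAT implementation: ties in motif degree are broken by motif name (an assumption, since the text only speaks of "higher motif degree"), `pairs` and `degree` are assumed to have the layout `{frozenset({m1, m2}): gene_set}` and `{motif: degree}`, and Kmin is assumed to be at least 2.

```python
def enumerate_modules(pairs, degree, g, k_min, k_max):
    """Yield (motif_tuple, candidate_gene_set) for every tree path whose
    length lies in [k_min, k_max] and whose gene list has at least g genes."""
    order = sorted(degree, key=lambda m: (degree[m], m))  # increasing motif degree
    rank = {m: i for i, m in enumerate(order)}

    def extend(path, genes):
        if len(path) >= k_min:
            yield tuple(path), genes
        if len(path) >= k_max:
            return
        for motif in order[rank[path[-1]] + 1:]:
            if frozenset((path[-1], motif)) not in pairs:
                continue  # a child must form a retained pair with the current leaf
            shared = genes
            for previous in path:  # intersect gene lists of all pairs on the path
                pair_genes = pairs.get(frozenset((previous, motif)), set())
                shared = set(pair_genes) if shared is None else shared & pair_genes
            if len(shared) >= g:  # prune branches supported by too few genes
                yield from extend(path + [motif], shared)

    for motif in order:
        yield from extend([motif], None)


def has_instance(hits, module, w):
    """True if some w-bp window contains a hit of every motif in the module.
    hits: [(start, end, motif), ...] for one gene."""
    wanted = set(module)
    relevant = sorted(h for h in hits if h[2] in wanted)
    for i, (start, _, _) in enumerate(relevant):
        # a qualifying window can always be shifted to begin at a hit start
        seen = {m for s, e, m in relevant[i:] if e - start < w}
        if seen == wanted:
            return True
    return False


def confirmed_modules(hits_by_gene, candidates, w, g):
    """Keep candidate modules with verified instances in at least g genes."""
    for module, genes in candidates:
        carriers = {x for x in genes if has_instance(hits_by_gene[x], module, w)}
        if len(carriers) >= g:
            yield module, carriers
```

Because the generator recurses along one path at a time, only the current branch is held in memory, which is the property the text relies on to handle many motifs.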

Statistical evaluation

We use the Poisson clump heuristic to compute the P-value of a motif module, with the assumption that each motif occurs independently according to a Poisson process (26). Independent motif occurrence is assumed by many CRM identification methods when analyzing the significance of a motif cluster or CRM (13,27,28). Let N be the total number of motifs, G be the total number of sequences, L be the average sequence length and $\lambda_k$ be the rate parameter of the Poisson process for motif k. Suppose $m_1, m_2, \ldots, m_n$ is a motif module with instances co-occurring in g sequences. For each position of a sequence, the probability that $m_i$ occurs is $\lambda_{m_i}$. For a w-bp window covering this position and containing $m_i$, the probability that instances of the other n − 1 motifs each occur at least once is

$$\prod_{j \ne i} \left(1 - e^{-\lambda_{m_j} w}\right).$$

Therefore, the probability that this motif module occurs in this sequence is bounded by

$$p = L \sum_{i=1}^{n} \lambda_{m_i} \prod_{j \ne i} \left(1 - e^{-\lambda_{m_j} w}\right).$$

The probability that this motif module occurs in at least g sequences is

$$\sum_{k=g}^{G} \binom{G}{k} p^{k} (1 - p)^{G-k}.$$

N motifs can produce $\binom{N}{n}$ different motif clusters containing n distinct motifs. So if we take 0.05 as the significance level, after Bonferroni correction for multiple comparisons, the probability that we can find a motif module containing n distinct motifs and occurring in no fewer than g sequences, $P_{gc}$, should be smaller than

$$\frac{0.05}{\binom{N}{n}}.$$
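To make the multiple-testing step concrete, here is a hedged Python sketch. It assumes that the number of candidate clusters of n distinct motifs drawn from N motifs is the binomial coefficient C(N, n), and that sequences are independent so that "at least g of G sequences" follows a binomial tail with a per-sequence probability `p` supplied by the caller. Both function names are hypothetical, and neither function is taken from the MOPAT code.

```python
import math


def bonferroni_cutoff(total_motifs, module_size, alpha=0.05):
    """Per-module significance cutoff after Bonferroni correction,
    assuming C(N, n) candidate clusters of n distinct motifs."""
    return alpha / math.comb(total_motifs, module_size)


def prob_at_least(g, G, p):
    """Binomial tail P(X >= g) for X ~ Binomial(G, p), with 0 < p < 1.
    Terms are evaluated in log space so large G does not overflow."""
    def log_pmf(k):
        return (math.lgamma(G + 1) - math.lgamma(k + 1) - math.lgamma(G - k + 1)
                + k * math.log(p) + (G - k) * math.log1p(-p))
    return sum(math.exp(log_pmf(k)) for k in range(g, G + 1))
```

With N = 522 motifs and modules of n = 3 motifs there are 23 570 040 candidate clusters, so the per-module cutoff is already of order 10^-9; larger modules face a much stricter cutoff.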

RESULTS

We developed a method and software, MOPAT, to search for CRMs and motif modules in DNA sequences. MOPAT includes two parts. The first part, written in C++, takes a set of sequences in FASTA format and a set of motifs in count-matrix format as input and predicts the hits of the motifs in the input sequences. The second part, written in Perl, takes the set of motif hits as input and finds CRMs and motif modules. Given a set of sequences, a set of motifs and a motif probability cutoff, the number of predicted motif modules and the running time of MOPAT are mainly influenced by two factors: the window size cutoff w and the gene number cutoff g. The larger the w and the smaller the g, the more motif modules we can find and the more time is required to run the program. Our program also allows the user to set the minimum and maximum numbers of motifs in a motif module. Constraining these two parameters can further improve the speed of the program. When using our program, the user can set a loose parameter set (such as w = 200, g = 20) and spend more time to identify as many motif modules as possible, or set a stringent parameter set (such as w = 100, g = 20) and spend less time to identify only the motif modules with high significance. We have applied MOPAT to the 5-kb long sequences around the TSS of the 5530 mouse developmental genes by using the following parameters: P = 0.0001, w = 200, g = 20, Kmin = 3 and Kmax = 8. We use 3 as the minimum number of motifs allowed in a motif module, since motif modules composed of only two motifs occur very frequently. We use a window size of 200 bp and a gene number of 20 to ensure that most predicted motif modules are significant according to our analysis above. With these parameters, we have predicted 144 490 motif modules, which cover 494 motifs and 5492 genes (Supplementary data). Taking 0.05 as the Bonferroni corrected P-value cutoff, we divide the predicted motif modules into two parts according to their significance. 
The first part, Result I, contains all the motif modules with Bonferroni corrected P < 0.05. The second part, Result II, contains the remaining motif modules. Result I contains 33 361 motif modules, which cover 489 distinct motifs and 5486 distinct genes. Result II contains 111 129 motif modules, which cover 471 motifs and 5491 genes. Our analyses in the following are based on these two groups of predictions. Note that many motif modules in Result II are also significant although their Bonferroni corrected P-values are >0.05.

Validate predicted CRMs and motif modules by expression data

We attempt to validate our predicted motif modules by using microarray expression data. This validation strategy was used in a previous study by another group to support motif modules predicted in upstream 1-kb regions (28). The basic assumption of this validation strategy is that genes containing instances of the same motif module should have more similar expression patterns. That is, CRMs with the same combination of motifs will control similar spatiotemporal expression patterns (29,30). Three microarray datasets have been downloaded from the GEO database (http://www.ncbi.nlm.nih.gov/geo/): mouse epididymis development (GDS2202), mouse ovary development (GDS2203) and mouse cochlear nucleus postnatal development (GDS2144). We have evaluated our results with each of the three microarray datasets, and the performance of our method is similar across datasets. Therefore, we use GDS2202 as an example to show the quality of our predicted motif modules. As in the previous study (28), we first calculate Pearson's correlation coefficient to measure the similarity of gene expression. Note that some genes may have multiple expression measurements, so that there are multiple similarities between each pair of such genes. In this case, we take the maximal one as the expression similarity of the gene pair. In the following, we assess the significance of the predicted motif modules in two different ways. In the first way, we extract all gene pairs in GDS2202, compute the expression similarity of each gene pair and draw the histogram of these similarities. We call this distribution the background distribution. The background distribution resembles a normal distribution with a mean of 0.14 (Figure 2a). The nonzero mean shows that the dataset is not completely random and that some genes are coexpressed with each other. 
We next extract the gene pairs containing the CRMs of the same motif modules in our predicted results and calculate their expression similarities. We first calculate the similarities of the expression profiles of gene pairs in every gene set containing instances of the same motif module and then plot the histogram of these similarities from all gene sets (Figure 2b and c). As can be seen from Figure 2b and c, relative to the background, the distributions of the expression similarity of the genes containing instances of the same motif modules are clearly skewed toward the high correlation end. The mean expression similarity of the genes containing instances of the same motif modules from Result I is 0.31, higher than that from Result II, 0.27. The mean similarities from Result I and Result II are both significantly higher than that from the background distribution, 0.14 (P < 0.0001 by t-test). Note that for 39.7% and 14.5% of our predicted motif modules in Result I and Result II, respectively, the mean expression similarity of the genes containing their instances is higher than that of the typical example in ref. (28), 0.341, although Sharan et al. make use of prior biological knowledge and sequence conservation information to identify motif modules.
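The similarity measure described above (Pearson's correlation, taking the maximum when a gene has several expression profiles) can be sketched in a few lines of Python. The function names and the list-of-profiles representation are assumptions for this example.

```python
import math
from itertools import product


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def gene_similarity(profiles_a, profiles_b):
    """Expression similarity of two genes: the maximal correlation over
    all pairs of their expression measurements (e.g. multiple probes)."""
    return max(pearson(p, q) for p, q in product(profiles_a, profiles_b))
```

For example, a gene with one rising and one falling probe is still maximally similar to a gene with a rising profile, because the best-matching probe pair is used.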

Figure 2.

Histogram of the expression similarities of gene pairs. (A) Expression similarity of gene pairs in GDS2202, mean = 0.14. (B and C) Expression similarity of gene pairs containing instances of the same motif modules in Results I and II, mean = 0.31 and 0.27, respectively.

In the second way, we first construct two groups of random gene sets from GDS2202, Random I and Random II, such that Random I and Random II have the same distribution of gene set sizes as Result I and Result II, respectively. For instance, the percentage of gene sets containing 25 genes is 4.4% in both Result I and Random I. Random I includes 166 805 (33 361 × 5) gene sets, and Random II includes 555 645 (111 129 × 5) gene sets, five times as many gene sets as in Result I and Result II, respectively. We next calculate the median gene pair expression similarity in each gene set in Random I and Random II, and draw the histograms of the median similarities (Figure 3a and b). For Random I, only 5% of the median similarities are higher than 0.263, and for Random II, only 5% of the median similarities are higher than 0.277. We thus take 0.263 and 0.277 as the thresholds to define the significant motif modules in Result I and Result II, respectively. Under these significance levels, 31 125 of 33 361 motif modules (93%) of Result I are significant (Figure 3c), and 76 641 of 111 129 motif modules (69%) of Result II are significant (Figure 3d).

Figure 3.

Histogram of the median expression similarity of gene pairs in gene sets. (A) 166 805 random gene sets generated from the same distribution of gene set sizes as in Result I. The 95% quantile of this histogram is 0.263. (B) 555 645 random gene sets generated from the same distribution of gene set sizes as in Result II. The 95% quantile of this histogram is 0.277. (C) 33 361 target gene sets of the predicted motif modules in Result I. Ninety-three percent of the target gene sets have a median gene pair expression similarity larger than 0.263. (D) 111 129 target gene sets of the predicted motif modules in Result II. Sixty-nine percent of the target gene sets have a median gene pair expression similarity larger than 0.277.
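The random gene set comparison can be outlined as follows. This is a minimal sketch with hypothetical function names: `similarity` stands for any gene-pair similarity function, random sets are drawn without replacement with the same sizes as the predicted sets, and the quantile is taken by simple index rounding.

```python
import random
import statistics
from itertools import combinations


def median_similarity(genes, similarity):
    """Median pairwise expression similarity within one gene set."""
    return statistics.median(similarity(a, b) for a, b in combinations(genes, 2))


def null_threshold(all_genes, set_sizes, similarity,
                   replicates=5, quantile=0.95, seed=0):
    """Threshold from random gene sets matching the observed set sizes.

    For every predicted gene set size, `replicates` random sets of that
    size are drawn; the requested quantile of their median similarities
    is returned.
    """
    rng = random.Random(seed)
    medians = sorted(
        median_similarity(rng.sample(all_genes, size), similarity)
        for size in set_sizes
        for _ in range(replicates)
    )
    return medians[min(len(medians) - 1, int(quantile * len(medians)))]
```

A predicted motif module would then be called significant when the median similarity of its own gene set exceeds the returned threshold.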

Validate predicted CRMs and motif modules by experimentally verified composite regulatory elements in Transfac

Composite regulatory elements or composite elements (CEs) contain two closely located TFBSs of two different TFs and represent minimal functional units that provide combinatorial transcriptional regulation. Both factor–DNA and factor–factor interactions contribute to the function of a CE (31,32). In the past decades, many CEs have been experimentally verified and deposited in public databases. These CEs can be used to verify the predicted motif modules. In the Transfac database, CEs are represented as interacting factors. If some TFs interact with a certain factor TF0, then we can find these TFs among TF0's interacting factors in the Transfac database. We have thus extracted the TF interaction information for each TF from the Transfac database and translated it into motif pairs according to the mapping between TFs and PWMs. We call a motif pair a CE motif pair if the two factors that bind the two motifs interact with each other. In total, 2515 vertebrate CE motif pairs are found in the Transfac database, after the motif pairs consisting of the same motif PWM are removed. Without considering the order of the motifs, the predicted 144 490 motif modules above contain 27 546 distinct motif pairs, of which 625 are CE motif pairs. In detail, the 33 361 motif modules in Result I contain 15 263 distinct motif pairs, of which 440 are CE motif pairs; the 111 129 motif modules in Result II contain 22 479 distinct motif pairs, of which 481 are CE motif pairs. Our results contain only a small fraction of the CE motif pairs, which may be due to the following facts. First, we only analyzed mouse genes here, while the CE motif pairs are from many vertebrates. Second, we only analyzed developmental genes in mouse, which account for <20% of all mouse genes. Third, not all the CE motif pairs in the Transfac database require the physical binding of the TFs to DNA in order to interact. 
On the other hand, although not all CE motif pairs are included in our predictions, our predicted motif modules still significantly overlap with the CE motif pairs. The significance of the overlap between the CE motif pairs and our predicted results can be tested by a hypergeometric distribution:

$$P(N, M, n, m) = \sum_{k=m}^{\min(n, M)} \frac{\binom{M}{k} \binom{N-M}{n-k}}{\binom{N}{n}}.$$

Here, N = 522 × (522 − 1)/2 = 135 981 is the total number of distinct motif pairs formed by the 522 vertebrate motifs; M = 2515 is the number of CE motif pairs deposited in the Transfac database; n is the number of distinct motif pairs included in our predicted motif modules; m is the number of CE motif pairs included in our predicted motif modules. The P-values of observing so many CE motif pairs in all 144 490 motif modules, the 33 361 motif modules in Result I and the 111 129 motif modules in Result II are PA (135 981, 2515, 27 546, 625) = 2.29 × 10−9, PI (135 981, 2515, 15 263, 440) = 1.32 × 10−21 and PII (135 981, 2515, 22 479, 481) = 5.00 × 10−5, respectively (Table 1).

Table 1.

Significance of proportion of CE motif pairs in predicted motif modules

Result	N	M	n	m	P
I	135 981	2515	15 263	440	1.32E-21
II	135 981	2515	22 479	481	5E-5
I + II	135 981	2515	27 546	625	2.29E-09

See text for the meaning of N, M, n and m.
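The overlap test can be illustrated with an exact upper-tail hypergeometric computation. The sketch below assumes the P-value is the probability of observing at least m CE motif pairs among n predicted pairs; the function name is hypothetical, and exact integer arithmetic is used for clarity, so the paper-sized inputs involve very large integers and may take noticeable time.

```python
from math import comb


def hypergeom_tail(N, M, n, m):
    """P(X >= m) when n pairs are drawn without replacement from N pairs,
    M of which are CE motif pairs (exact integer arithmetic)."""
    total = comb(N, n)
    favourable = sum(comb(M, k) * comb(N - M, n - k)
                     for k in range(m, min(n, M) + 1))
    return favourable / total


# Example call with the quantities N, M, n, m defined in the text:
# hypergeom_tail(135981, 2515, 27546, 625)
```

On a toy case, drawing 2 of 4 pairs of which 2 are "special" gives a probability of 1/6 that both drawn pairs are special, which the function reproduces.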


An example of the predicted motif modules

To further show that the predicted CRMs and motif modules are biologically meaningful, we give an example of the predicted CRMs of a motif module. This motif module comprises three motifs, M00380, M00423 and M00724, which correspond to the TFs Pax4, Fox-j2 and hepatocyte nuclear factor-3alpha (Hnf-3α), respectively (Figure 4).

Figure 4.

One example of the predicted recurrent motif modules.

The three TFs are all involved in metabolic processes. Pax4 is a well-known diabetes-linked TF and plays an important role in diabetes-related metabolic processes (33–37). Fox-j2 belongs to the forkhead TF family, a large gene family known for its key functions in development and metabolism (38). It has been reported that the deregulation of forkhead family genes can lead to congenital disorders, diabetes mellitus or carcinogenesis (39). Hnf3 (also known as Fox-a2) is also a member of the forkhead TF family, and functions in diabetes-related metabolism (40,41). There are 21 genes containing instances of this motif module. Our GO analysis shows that 19 of them are related to cellular metabolism (GO:0044237). Further analysis indicates that a majority of them are related to diabetes. Ten of them have been reported in the literature to be related to diabetes: ctnnb1 (42,43), tgfbr1 (44), prox1 (45), bmp4 (46), mcm6 (47), pou2f1 (48), hspa5 (49), Nr4a3 (50), cdk5 (51,52) and hdac6 (53). Five of the rest, lcor, pcgf2, htatip, dzip1l and rere, are located in a region that has been implicated in susceptibility to type 1 diabetes (54). Lrrk2, hand2, vprbp, tcf21, sertad2 and hist1h1b, which are not well annotated, may represent as yet unknown diabetes-related genes as well.

Compare our method with other methods

We compared MOPAT with three widely used methods, Cbust (http://zlab.bu.edu/cluster-buster/cbust.html), Compel (http://compel.bionet.nsc.ru/FunSite/CompelPatternSearch.html) and Dire (http://dire.dcode.org/). These three methods can be classified into two groups according to their input data. Cbust and Compel belong to the first group. Cbust takes a set of sequences and a set of candidate motifs as input to predict CRMs. Compel takes a sequence set as input and uses the CE motif pair information deposited in the TRANSCompel database to predict CRMs. Dire belongs to the second group, which takes a list of gene names as input. To compare our method with the methods in the first group, we generate random sequences with implanted CRMs. We use random sequences rather than real sequences because it is difficult to know which motif modules and CRMs are contained in real sequences. We first generate 10 motif modules, with the motifs randomly selected from the 522 vertebrate motifs and the number of motifs in a motif module varying from 3 to 8. We then generate a sequence set for each motif module, and randomly insert instances of each motif in the motif module into a 200-bp window in each sequence. The sequences are generated randomly with the same nucleotide frequencies as the developmental sequences we used. The number of sequences for one motif module varies between 15 and 30, and the length of each sequence is 1 kb. We then mix the sequences of all motif modules together as our test dataset. In total, our test dataset includes implanted instances of 51 distinct motifs and 202 1-kb long sequences (Table 2). Taking the 202 sequences and the 522 vertebrate motifs as input, we then compare our method with the other methods. The default parameters are used for all methods.
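A test sequence with an implanted CRM could be generated along the following lines. This is a sketch under explicit assumptions, none of which is specified in the text: motif instances are sampled position by position from the PWM frequencies, the 200-bp window is split into equal slots (one per motif) so that instances do not overlap, and the function names are hypothetical.

```python
import random

BASES = "ACGT"


def sample_instance(freqs, rng):
    """Sample one motif instance from per-position base frequencies."""
    return "".join(
        rng.choices(BASES, weights=[column[b] for b in BASES])[0]
        for column in freqs
    )


def implant_module(length, base_freqs, module_pwms, window=200, seed=0):
    """Random sequence of `length` bp with one instance of every motif in
    `module_pwms` implanted inside a single `window`-bp region.

    Returns (sequence, [(start, end), ...]) with one span per motif.
    """
    rng = random.Random(seed)
    seq = rng.choices(BASES, weights=[base_freqs[b] for b in BASES], k=length)
    window_start = rng.randrange(length - window + 1)
    slot = window // len(module_pwms)  # assumption: one slot per motif
    placed = []
    for i, freqs in enumerate(module_pwms):
        width = len(freqs)
        if width > slot:
            raise ValueError("motif is wider than its slot in the window")
        start = window_start + i * slot + rng.randrange(slot - width + 1)
        seq[start:start + width] = sample_instance(freqs, rng)
        placed.append((start, start + width))
    return "".join(seq), placed
```

Recording the implanted spans makes it possible to compare predicted CRMs against the known positions afterwards.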

Table 2.

Ten groups of randomly generated CRMs and genes

Group	No. of motifs	No. of genes
1	3	16
2	3	24
3	4	15
4	4	29
5	5	26
6	5	17
7	6	25
8	6	17
9	7	17
10	8	16
Total	51	202

MOPAT predicted 43 motif modules and 633 CRMs, which in total include 61 distinct motifs (42 of them are true positives). The number of distinct motifs in a CRM is four on average, with three as the minimum and eight as the maximum (Table 3). Cbust predicted 190 CRMs, which in total include 515 distinct motifs, 51 of which are true positives. The number of distinct motifs in a CRM is 94 on average, with 17 as the minimum and 249 as the maximum. Thus, the result of Cbust includes many false positives. For Compel, 61 distinct motifs are predicted and 10 of them are true positives. The number of distinct motifs in a CRM is 8.5 on average, with 5 as the minimum and 13 as the maximum.

Table 3.

Comparison between MOPAT and others

Method	No. of motifs inserted	No. of candidate motifs	No. of CRMs predicted (true)	No. of motifs predicted (true)	No. of motifs in CRM
MOPAT	51	522	633 (109^a, 401^b)	60 (42)	4 (3^c, 8^d)
Cbust	51	522	190 (0, 0)	515 (51)	94 (17, 249)
Compel	51	–	202 (0, 0)	61 (10)	8.5 (5, 13)

^a The number of CRMs that match the implanted CRMs perfectly.

^b The number of CRMs that match the implanted CRMs with one mismatch.

^c Minimum number of motifs in a CRM.

^d Maximum number of motifs in a CRM.

See text for the details.

Comparison between MOPAT and others. a: The number of CRMs that match the implanted CRMs perfectly. b: The number of CRMs that match the implanted CRMs with one mismatch. c: Minimum number of motifs in a CRM. d: Maximum number of motifs in a CRM. See text for the details.

We further judge whether MOPAT can predict true motif modules and CRMs according to two criteria. First, if the genes containing CRMs of a predicted motif module are all included in the group of genes containing CRMs of one implanted motif module, we call the predicted motif module a true motif module. Second, if a predicted CRM contains exactly the same instances as an implanted CRM, we call the predicted CRM a true CRM. For MOPAT, all predicted motif modules satisfy the first criterion and 109 predicted CRMs satisfy the second criterion (Table 3). Furthermore, if we allow at most one instance to be deleted from or added to a predicted CRM, 401 predicted CRMs satisfy the second criterion. For Cbust and Compel, none of the predicted motif modules or CRMs satisfies the first or the second criterion, even when we allow one mismatch. Note that there are also false positives in the MOPAT results. This is because, given a large number of input motifs, there are always a few new motifs that occur frequently in random sequences.

Because Dire accepts only a list of gene names as input, and not sequence sets or motif sets, we had to compare it with MOPAT in a different way. We mixed the 21 diabetes metabolism-related genes in our example with 200 other mouse developmental genes, and used these 221 gene names as input to Dire to see which shared motifs it reports for the 21 genes. According to Dire, no motif is shared by the 21 genes. Even with only the 21 diabetes metabolism-related genes as input, Dire cannot identify any motif shared by them, which may be because it requires corresponding motif instances to be aligned. Note that MOPAT identifies a motif module shared by these 21 genes from the 5530 developmental genes input.
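The two evaluation criteria above can be written down as simple set comparisons. The following is a minimal sketch, not the authors' evaluation code: the representation of a motif instance as a `(motif_id, start, strand)` tuple and the function names `is_true_module`, `crm_mismatches` and `is_true_crm` are assumptions made for illustration. A "mismatch" is counted here as one instance that would have to be added to or deleted from the predicted CRM.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation of a motif instance: (motif_id, start, strand).
Instance = Tuple[str, int, str]


def is_true_module(predicted_genes: Set[str],
                   implanted_gene_groups: Iterable[Set[str]]) -> bool:
    """Criterion 1: all genes of the predicted module lie in one implanted group."""
    if not predicted_genes:
        return False  # an empty prediction is not counted as a true module
    return any(predicted_genes <= group for group in implanted_gene_groups)


def crm_mismatches(predicted: Set[Instance], implanted: Set[Instance]) -> int:
    """Number of instances to add or delete to turn `predicted` into `implanted`."""
    return len(predicted ^ implanted)  # size of the symmetric difference


def is_true_crm(predicted: Set[Instance],
                implanted_crms: Iterable[Set[Instance]],
                max_mismatch: int = 0) -> bool:
    """Criterion 2: the predicted CRM equals an implanted CRM, optionally
    allowing up to `max_mismatch` added or deleted instances."""
    return any(crm_mismatches(predicted, crm) <= max_mismatch
               for crm in implanted_crms)
```

With `max_mismatch=0` this corresponds to the strict form of the second criterion, and with `max_mismatch=1` to the relaxed form that allows one instance to be added or deleted.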

DISCUSSION

We have developed a method to identify CRMs based on a large set of known motifs deposited in Transfac. We show that our predicted motif modules significantly overlap with the known interacting motif pairs. We also show that the expression profiles of the genes containing CRMs of the same motif module correlate significantly better than those of the genes in random gene sets, which suggests that many predicted CRMs could be used to predict gene expression patterns.

Our method uses the observation that instances of the motifs in the same motif module co-occur in many CRMs. This observation helps us to differentiate ‘true’ CRMs from false ones, especially when the number of candidate motifs is very large, as shown by the comparisons between our method and others that do not use this observation. It also helps us find gene sets with similar expression patterns, as seen from the much higher expression correlations of gene pairs containing instances of a motif module.

Our method provides an option to identify CRMs without multiple sequence alignments. Many current CRM identification methods based on known motifs depend on sequence alignment. They first align noncoding sequences of orthologous genes to find conserved regions, then search for motif hits or CRMs in the conserved regions. In most cases, this approach is valid. However, in some cases it does not work well: there are reports showing that many functional elements in the human genome are seemingly unconstrained across mammalian evolution (55). A reasonable approach should therefore use ortholog information without multiple alignment, so that it can find conserved CRMs that cannot be aligned well in multiple alignments. Our method provides such an option. When extracting motif pair information (step one), we can require that a motif pair also occur in some or all of the orthologous genes, which enables the identification of conserved CRMs that were shuffled during genome rearrangements.
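The ortholog requirement described above can be illustrated as a filter on motif pairs. This is a minimal sketch under stated assumptions, not the MOPAT implementation: it assumes that the motif pairs found in each gene's noncoding sequence have already been extracted, and the names `conserved_pairs`, `pairs_by_gene`, `orthologs` and `min_count` are hypothetical. A pair is kept for a gene only if it is also found in enough of that gene's orthologs. No alignment or positional correspondence is used, so a pair that has moved within the orthologous sequence still counts.

```python
from typing import Dict, FrozenSet, List, Optional, Set

# A motif pair is an unordered pair of motif identifiers (hypothetical encoding).
MotifPair = FrozenSet[str]


def conserved_pairs(gene: str,
                    pairs_by_gene: Dict[str, Set[MotifPair]],
                    orthologs: Dict[str, List[str]],
                    min_count: Optional[int] = None) -> Set[MotifPair]:
    """Return the motif pairs of `gene` that also occur in its orthologs.

    min_count=None requires the pair in all orthologs; an integer requires
    it in at least that many (capped at the number of known orthologs).
    """
    orth = orthologs.get(gene, [])
    if not orth:
        return set()  # without ortholog information, conservation cannot be assessed
    needed = len(orth) if min_count is None else min(min_count, len(orth))
    kept: Set[MotifPair] = set()
    for pair in pairs_by_gene.get(gene, set()):
        # Count orthologs in which the pair occurs anywhere in the sequence.
        support = sum(pair in pairs_by_gene.get(o, set()) for o in orth)
        if support >= needed:
            kept.add(pair)
    return kept
```

Setting `min_count` to a small value corresponds to requiring the pair in "some" orthologs, and leaving it as `None` to requiring it in all of them.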

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.
