
Accuracy and application of the motif expression decomposition method in dissecting transcriptional regulation.

Zhihua Zhang, Jianzhi Zhang.

Abstract

Understanding transcriptional regulation is a major goal of molecular biology. Motif expression decomposition (MED) was recently introduced to describe the expression level of a gene as the sum of the products of the binding strengths of its cis-regulatory motifs and the activities of the corresponding trans-acting transcription factors (TFs). Here, we use computer simulation to examine the accuracy of MED. We found that although MED accurately rebuilds gene expression levels from decomposed motif binding strengths and TF activities, estimates of motif binding strengths and TF activities are unreliable. Nonetheless, MED provides accurate estimates of relative binding strengths of the same motif in different genes and relative activities of the same TF under different conditions. We found that reasonably accurate results are achievable with genome-wide expression data from only 30 conditions and that MED results are robust to the existence of unknown occurrences of known motifs, although they are less robust to the presence of unknown motifs. With these understandings, judicious use of MED will likely provide useful information about eukaryotic transcriptional regulation. As an example, MED results are used to demonstrate that motifs generally have higher binding strengths when appearing in multiple copies than appearing in one copy per promoter.

Year:  2008        PMID: 18411204      PMCID: PMC2425491          DOI: 10.1093/nar/gkn127

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Understanding how gene expression is regulated is a major task of molecular biology. Jacob and Monod (1) pioneered the study of transcriptional regulation at the level of interaction between cis-regulatory motifs (or elements) in a gene's promoter region and trans-acting transcription factors (TFs) in the cell. Based on their idea, one may describe the log-transformed expression level of a gene at a given cellular condition by a function of the motifs present in the gene's promoter region and the TF activities present in the condition, as given in Equation (1) in Methods section [see also (2–4)]. The availability of several high-throughput technologies such as gene-expression microarrays and chromatin immunoprecipitation on microarray chips (ChIP-chip), and rapid progress in genomics and computational biology make it possible to study patterns of transcriptional regulation at the genomic scale (5–8). For example, large architectural differences in the yeast regulatory network among different cellular conditions have been identified (7,9). Recently, Nguyen and D’Haeseleer used Jacob and Monod's model to analyze microarray gene expression data obtained from multiple conditions in order to decipher principles of transcriptional regulation (10). Their method, called motif expression decomposition (MED), decomposes a matrix (E) of gene expression levels at multiple conditions into the product of two matrices: the first (M) contains the condition-independent binding strength of each motif (in each promoter) with its corresponding TF, while the second (A) contains the activity of each TF at each condition studied. Some interesting patterns were observed from the analysis of the M matrix. For instance, the same motif with different orientations relative to the transcriptional direction may have different binding strengths, and the same motif with different physical distances from the transcriptional starting site may also have different strengths. 
Such findings, if correct, are invaluable for understanding the structure, function and evolution of promoters as well as those of transcriptional regulatory networks (11). Nguyen and D’Haeseleer examined the performance of MED by a cross-validation procedure, showing that the product of the decomposed M and A matrices is reasonably well correlated with the microarray gene expression levels. Although this result suggests that the method can be used to predict the expressions of some genes at a given condition when the expressions of many other genes are known at the same condition, it does not necessarily mean that the decomposed M and A matrices are accurate, as the same E may be decomposed into many different combinations of M and A (see below). Because it is the M and A matrices that are of interest to most biologists, we decided to examine whether these matrices decomposed by the MED method are reliable. Because the true values of the M and A matrices are unknown for any organism, here we employ a computer simulation approach. Our simulation results show that MED-derived M and A matrices are unreliable. Although this limitation of MED prohibits the direct use of the M and A matrices, we find that MED accurately predicts the relative binding strengths of the same motif in different genes and relative activities of the same TF under different conditions. The performance of MED was also examined under limited expression data or partial knowledge of motifs. With improved understanding of MED, we applied MED in yeast to demonstrate at the genomic scale that motifs with >1 copy per promoter have significantly higher binding strengths than the same motifs with 1 copy per promoter.

METHODS

Generation of gene expression data

Based on Jacob and Monod's model of transcriptional regulation (1), the log-transformed expression level (E) of gene g under condition c equals the sum of the products of the binding strength of each motif and the activity of its corresponding TF. That is,

$$E_{gc} = \sum_{j \in \Omega_g} M_{gj} A_{jc} \qquad (1)$$

Here, $\Omega_g$ is the set of motifs occurring in gene g's promoter region, $M_{gj}$ is the binding strength of motif j in the promoter of gene g, and $A_{jc}$ is the activity of TF j, which binds to motif j, under condition c. A positive $M_{gj}$ indicates an enhancer motif, whereas a negative $M_{gj}$ indicates a repressor motif. Similarly, a positive $A_{jc}$ means activation, whereas a negative $A_{jc}$ means suppression. Following Nguyen and D’Haeseleer, we write Equation (1) in a matrix format for all genes, all motifs and all conditions, as

$$\mathbf{E} = \mathbf{M}\mathbf{A} \qquad (2)$$

where E is an m × n matrix that gives m genes’ expression levels at n conditions, M is an m × k matrix that gives the condition-independent binding strengths of k motifs in m genes’ promoter regions and A is a k × n matrix that gives the activities of k TFs under n conditions. We randomly generate an m × k matrix designated as O; each element in column i of O is a random variable drawn from the normal distribution $N(b_i, \sigma)$, where i = 1, 2, 3, …, k, and $b_i$ and σ are the mean and standard deviation of the normal distribution, respectively. Each $b_i$ is a random variable drawn from the normal distribution N(B, σ). We set $H_g$, the number of motifs in gene g, by drawing a Poisson random variable with mean equal to 3. We then randomly pick $H_g$ of the k motifs in gene g and leave their corresponding entries in row g of O unchanged but set all other entries in row g of O to zero. We further make sure that each row and each column has at least one non-zero entry. If there is a row or column that contains all zeros, we randomly choose an entry and restore its value from the original O. The matrix generated after these steps is referred to as M. We randomly generate a k × n matrix designated as A. 
The elements in the ith row of A are random variables drawn from the normal distribution $N(C_i, \phi)$, where i = 1, 2, 3, …, k, and $C_i$ is a random variable drawn from the normal distribution N(C, ϕ). We then generate gene expression data using Equation (2). Because gene expression has stochastic variations (12) and because measurement of gene expression has errors, the observed gene expression level will differ from the above computed E. Hence, we add an error term to each expression value. For entry $E_{gc}$, the error is a random variable drawn from $N(0, \epsilon E_{gc})$, where ϵ is the noise level fixed in each simulation. We have used ϵ = 0, 5, 10, 20, 30, 40, 50, and 100% in different simulations. After this step, the matrix E is referred to as the observed or true expressions. MED requires an initial matrix designated as I to start the decomposition. We generate I by replacing all non-zero entries in M with 1. Unless otherwise stated, this I is used in our simulations. As will be described later, on some occasions, we also used an I where each non-zero entry is −1 and an I where each non-zero entry is either 1 or −1, with equal probabilities.
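The data-generation steps above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the authors' code: the function names `poisson` and `simulate`, the small default sizes and the seed are hypothetical, the second argument of each normal distribution is read as a standard deviation, and the noise standard deviation is taken as ϵ times the absolute value of the noise-free expression (an assumption, needed because log-transformed expression can be negative).

```python
import math
import random

def poisson(rng, lam):
    """Draw a Poisson variate with Knuth's method (adequate for small lam)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate(m=60, k=8, n=12, B=2.5, sigma=10.0, C=0.0, phi=10.0,
             eps=0.3, seed=0):
    """Return (M, A, E, I) for m genes, k motifs/TFs and n conditions."""
    rng = random.Random(seed)
    # Dense matrix O: column i ~ N(b_i, sigma), with b_i ~ N(B, sigma).
    b = [rng.gauss(B, sigma) for _ in range(k)]
    O = [[rng.gauss(b[i], sigma) for i in range(k)] for _ in range(m)]
    # Keep H_g ~ Poisson(3) randomly chosen motifs per gene, zero the rest.
    M = [[0.0] * k for _ in range(m)]
    for g in range(m):
        h = min(poisson(rng, 3.0), k)
        for j in rng.sample(range(k), h):
            M[g][j] = O[g][j]
    # Repair all-zero rows and columns by restoring one entry from O.
    for g in range(m):
        if not any(M[g]):
            j = rng.randrange(k)
            M[g][j] = O[g][j]
    for j in range(k):
        if not any(M[g][j] for g in range(m)):
            g = rng.randrange(m)
            M[g][j] = O[g][j]
    # TF activities: row i ~ N(C_i, phi), with C_i ~ N(C, phi).
    c = [rng.gauss(C, phi) for _ in range(k)]
    A = [[rng.gauss(c[i], phi) for _ in range(n)] for i in range(k)]
    # Equation (2), then noise with SD eps * |E_gc| (assumed reading).
    E = [[0.0] * n for _ in range(m)]
    for g in range(m):
        for cond in range(n):
            e = sum(M[g][j] * A[j][cond] for j in range(k))
            E[g][cond] = e + rng.gauss(0.0, eps * abs(e))
    # Initial matrix I: 1 wherever M is non-zero, 0 elsewhere.
    I = [[1.0 if x != 0.0 else 0.0 for x in row] for row in M]
    return M, A, E, I
```

Calling `simulate(m=4500, k=100, n=300)` would mirror the dimensions used in the paper, although plain Python lists become slow at that size and an array library would be the more practical choice.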

Simulation

Because Nguyen and D’Haeseleer's study focused on the yeast Saccharomyces cerevisiae, we use parameters appropriate for yeast in our simulation. Using the approach outlined in the above section, we randomly generate expression data for 4500 genes under 300 conditions. The total number of TFs in the organism is set to be 100. In the dataset analyzed by Nguyen and D’Haeseleer, there were expression data from 5719 genes under 255 conditions and the total number of TFs was 62. Using the MED method (10), we decompose the expression data (matrix E) into M′ and A′ matrices and then compute E′ using E′ = M′A′. We then compare E′ with E, M′ with M and A′ with A, as they represent the MED-derived matrices and the true matrices, respectively. At each noise level, we repeat the simulation 10 times. This number of replications is sufficient because our results are highly reproducible.

RESULTS

Performance in predicting expression levels

Using computer simulation as described in Methods section, we generated motif binding strength (M) and TF activity (A) matrices for 4500 genes under 300 conditions, including information for 100 different TFs and their corresponding motifs. We first used B = 2.5 and σ = 10 in generating the M matrix and used C = 0 and ϕ = 10 in generating A. Our B and σ values are similar to those of the M matrix decomposed from the yeast expression data (10). Our C and ϕ are different from the decomposed values in (10), because MED has a normalization step that artificially equalizes the average activity of each TF such that the actual TF activities cannot be seen from the decomposed A in (10). Nonetheless, even when we use C = 0 and ϕ = 0.1, similar to those observed from the decomposed A in (10), our results remain unchanged. We then generated the gene expression matrix E by multiplying the M and A matrices followed by addition of different levels of expression noise. The matrix E was decomposed into M′ and A′ matrices using the MED method. We conducted a total of 10 simulation replications. Because the results are essentially identical among the replicates, subsequently we describe our findings from the first replication. There are three expectations if the MED method performs well. First, predicted gene expressions (E′, or the product of M′ and A′) should be close to the observed expressions (E). Second, predicted motif binding strengths (M′) should be close to their true values (M). Third, predicted TF activities (A′) should be close to their true values (A). To measure the agreement between predicted and true values of expression levels, we computed Pearson's correlation coefficient (r) between E and E′ for each gene (row), and then computed the average r value across the 4500 genes and the standard deviation of r. 
Similarly, to measure the agreement between predicted and true values of motif binding strengths and TF activities, we computed r between M and M′ for each motif (column) and r between A and A′ for each TF (row), and then took averages across all motifs and all TFs, respectively. As shown in Table 1, r between E and E′ gradually declines as the noise level rises. Nonetheless, r > 0.80 even when the noise is as high as 50% of the true value and is greater than 0.90 when the noise level is <30%. These results suggest that expression levels predicted by MED are reliable. Indeed, for individual genes under individual conditions, Figure 1 shows that the predicted expression levels match the true values for the majority of genes under the majority of conditions. Figure 1 is based on the simulation results with a noise level of 30%. Qualitatively similar patterns were obtained when different levels of noise (5–100%) were introduced.
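The agreement measure described above (one Pearson's r per row or column, then the mean and standard deviation of those r values) can be sketched as follows. The helper names `pearson`, `rowwise_agreement` and `columns` are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not say which form was used.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rowwise_agreement(true_rows, pred_rows):
    """Mean and sample SD of the per-row correlations between two matrices."""
    rs = [pearson(t, p) for t, p in zip(true_rows, pred_rows)]
    mean = sum(rs) / len(rs)
    sd = math.sqrt(sum((r - mean) ** 2 for r in rs) / (len(rs) - 1))
    return mean, sd

def columns(matrix):
    """Transpose a list-of-lists matrix so that columns can be treated as rows."""
    return [list(col) for col in zip(*matrix)]
```

Under these assumptions, expression levels and TF activities would be compared row by row, for example `rowwise_agreement(E, E_pred)`, while motif binding strengths would be compared column by column via `rowwise_agreement(columns(M), columns(M_pred))`, where `E_pred` and `M_pred` stand for the MED-predicted matrices.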
Table 1.

Pearson's correlation coefficients (± standard deviation) between the true values and MED-predicted values of expression levels (E), motif binding strengths (M) and TF activities (A)

| Noise level (%) | E | M | M ratio (within-column) (a) | M ratio (between-column) (b) | A | A ratio (within-row) (c) | A ratio (between-row) (d) |
|---|---|---|---|---|---|---|---|
| 0 | 1.000 ± 0.000 | 0.120 ± 0.997 | 0.998 | 0.289 | 0.120 ± 0.997 | 0.996 | −0.044 |
| 5 | 0.997 ± 0.001 | 0.179 ± 0.988 | 0.986 | −0.028 | 0.179 ± 0.988 | 0.992 | 0.200 |
| 10 | 0.991 ± 0.005 | 0.119 ± 0.997 | 0.988 | 0.101 | 0.119 ± 0.997 | 0.964 | 0.020 |
| 20 | 0.962 ± 0.026 | 0.119 ± 0.996 | 0.942 | 0.004 | 0.119 ± 0.995 | 0.930 | 0.048 |
| 30 | 0.929 ± 0.036 | 0.199 ± 0.981 | 0.904 | 0.081 | 0.199 ± 0.981 | 0.862 | −0.045 |
| 40 | 0.872 ± 0.063 | 0.059 ± 0.997 | 0.848 | −0.028 | 0.059 ± 0.995 | 0.771 | 0.103 |
| 50 | 0.834 ± 0.067 | 0.178 ± 0.979 | 0.812 | −0.031 | 0.179 ± 0.977 | 0.714 | 0.110 |
| 100 | 0.606 ± 0.099 | 0.300 ± 0.890 | 0.587 | 0.064 | 0.303 ± 0.893 | 0.435 | 0.170 |

Note: The simulated expression data are from 300 conditions.

(a) Relative binding strengths of the same motif in two genes.

(b) Relative binding strengths of two different motifs.

(c) Relative activities of the same TF under two different conditions.

(d) Relative activities of two different TFs.

Figure 1.

Comparison between the true (E) and MED-predicted (E′) gene expression levels. The noise level is 30%. Note that the expression levels are log-transformed and thus can be negative.


Performance in predicting motif binding strengths and TF activities

To our disappointment, however, the r values between the M and M′ matrices are low (≤0.3) regardless of the level of noise (Table 1). Figure 2A shows that the motif binding strength values in M and M′ are dramatically different. Similarly, the r values between the A and A′ matrices are low (Table 1) and the TF activity values in A and A′ are quite different (Supplementary Figure S1A). These observations suggest that although E′ is close to E, M′ is not close to M and A′ is not close to A. It is easy to show that if M and A form one solution, multiplying column i of M by a and row i of A by 1/a (a ≠ 0) generates another solution. Because a can be 1, −1 or any non-zero number, there are infinitely many decomposition solutions. The original proof of the uniqueness of the MED decomposition solution was based on the arbitrary assumption that each TF has a mean activity of 1 across all conditions (i.e., the mean of each row in the A matrix is fixed at 1) (10). Although there is only one decomposition solution under this arbitrary assumption, the solution is not guaranteed to be the right one. In fact, our simulations showed that it is generally not the right solution. Nonetheless, our above consideration predicts that the ratio of any two entries within the same column (motif) of M′ can still be close to the corresponding ratio in M, while the ratio of any two entries from different columns of M′ should not correlate with the corresponding ratio in M. Similar predictions can be made for rows (TFs) of A′ and A. These predictions were indeed confirmed in our simulations. That is, between M′ and M, within-column ratios are highly correlated, whereas between-column ratios are not (Table 1; Figure 2B and C). In parallel, between A′ and A, within-row ratios are highly correlated, whereas between-row ratios are not (Table 1; Supplementary Figure S1B and C). Note that in this article, we measured Pearson's correlation between true and predicted ratios by using only ratios falling in the range of [−20, 20], which account for >95% of all ratios. 
This treatment is preferred over the use of all ratios because of the existence of a small number of ratios with extreme values, which affects the measure of Pearson's correlation coefficient. Similar results were obtained when all ratios were considered in Spearman's rank correlation.
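The rescaling argument in this section can be checked on a toy example. The sketch below uses small hypothetical matrices (the values and the names `M_alt`, `A_alt` and `matmul` are illustrative only) to show that rescaling a motif column and the matching TF row leaves the product unchanged, preserves ratios within a column, and alters ratios between columns.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

# Toy "true" matrices (hypothetical values):
# M is 3 genes x 2 motifs, A is 2 TFs x 2 conditions.
M = [[2.0, 0.0], [4.0, 3.0], [0.0, 6.0]]
A = [[1.0, -2.0], [0.5, 1.0]]

# Rescale motif/TF pair j by an arbitrary non-zero factor a[j].
a = [-4.0, 0.5]
M_alt = [[M[g][j] * a[j] for j in range(2)] for g in range(3)]
A_alt = [[A[j][c] / a[j] for c in range(2)] for j in range(2)]

# Both factorizations rebuild exactly the same expression matrix.
same_expression = matmul(M, A) == matmul(M_alt, A_alt)

# Within-column ratio (motif 0, gene 1 over gene 0): the factor a[0] cancels.
within_true = M[1][0] / M[0][0]
within_alt = M_alt[1][0] / M_alt[0][0]

# Between-column ratio (gene 1, motif 0 over motif 1): scaled by a[0] / a[1].
between_true = M[1][0] / M[1][1]
between_alt = M_alt[1][0] / M_alt[1][1]

print(same_expression, within_true, within_alt, between_true, between_alt)
```

Because one of the factors is negative, the between-column ratio even changes sign here, which illustrates why enhancers and repressors cannot be told apart from the decomposition alone.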
Figure 2.

Comparison between true (M) and MED-predicted (M′) motif binding strengths. The noise level is 30%. (A) The scatter plot for true and predicted motif binding strengths. Note the difference in scale between X-axis and Y-axis. (B) True and predicted relative binding strengths of the same motifs in different genes. (C) True and predicted relative binding strengths of pairs of different motifs.

As stated earlier, if M and A form one solution, multiplying column i of M by a and row i of A by 1/a (a ≠ 0) generates another solution. Because a can be either positive or negative, it is expected that the r between a column in M′ and its corresponding column in M should be close to 1 or −1 when the noise level is low. This is indeed the case. For example, in the simulation with 30% noise, between M′ and M, 60% of columns have r > 0.98, while 40% of columns have r lower than −0.98 (same for rows between A′ and A). This is why we observed low average r values and high standard deviations for both motif binding strengths and TF activities (Table 1). Because MED only supplies one of infinitely many solutions of M and A, which particular solution does it provide? This question is equivalent to asking what a values MED uses. We found that the initial matrix (I) used to start the decomposition process affects a. We conducted three sets of simulations, each containing 50 individual simulations. In the first set of 50 simulations, we started with an I where every non-zero entry was set to be 1, as used by the original authors of MED (10). The M matrix was generated with parameter B changing from −5 to 5 in a step size of 0.2 in the 50 simulations. The A matrix was generated as usual. In the second set of 50 simulations, we started with an I where every non-zero entry was set to be −1. In the third set of 50 simulations, we started with an I where every non-zero entry was randomly set to be either 1 or −1, with equal probabilities. 
Figure 3A–C shows the distributions of Pearson's correlation coefficients between columns of M′ and M for all the simulations in the three sets, respectively. They clearly show that the entries in M′ tend to have the same sign as in I. For example, when B is positive and most entries in M are positive, use of the I with positive entries tends to give more positive r values (Figure 3A) than use of the I with negative entries (Figure 3B). Similar patterns are observed in A′ (Supplementary Figure S2).
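The link between the 60%/40% split of column-wise correlations quoted above and the low mean and high standard deviation reported in Table 1 can be illustrated with simple arithmetic. The sketch below idealizes the situation by assuming exactly 60 columns with r = 0.98 and 40 columns with r = −0.98; these are not the simulated values themselves.

```python
import math

# Idealized column-wise correlations: 60% near +1, 40% near -1 (assumed values).
rs = [0.98] * 60 + [-0.98] * 40

mean_r = sum(rs) / len(rs)
sd_r = math.sqrt(sum((r - mean_r) ** 2 for r in rs) / len(rs))

# Prints approximately 0.196 and 0.960.
print(round(mean_r, 3), round(sd_r, 3))
```

Under this idealization the mean is about 0.196 and the standard deviation about 0.96, of the same order as the 0.199 ± 0.981 listed for M at 30% noise in Table 1.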
Figure 3.

The distribution of Pearson's correlation coefficient between columns (motifs) of M′ and M, when all non-zero entries in I are (A) 1, (B) −1, and (C) randomly assigned to be either 1 or −1, with equal probabilities. B is the mean motif binding strength in M.

Combining all the simulation results, we now have a better understanding of MED. The MED algorithm is designed in such a way that only one of infinitely many solutions is provided and this solution depends on the initial values used in decomposition. Knowing this property, it becomes clear that the MED-decomposed binding strengths for a given motif (across genes) are not true strengths, but are expected to be true strengths multiplied by an unknown number. Furthermore, this unknown number can be different for different motifs. The relative binding strengths of the same motif in different genes can be reliably estimated by MED. However, MED cannot distinguish between enhancers and repressors, nor can it distinguish between activation and suppression TF activities. Moreover, MED-predicted binding strengths cannot be compared among different motifs, and MED-predicted TF activities cannot be compared among different TFs.

Robustness of MED

MED relies on the input of gene expression data and cis-motif information. It is important to examine the influences of these factors on the performance of MED. In the above simulations, we simulated expression data from 4500 genes at 300 conditions. A practical question is how large the expression data have to be for MED to produce reliable values of E′, M′ and A′. We do not reduce the gene number because most eukaryotes have >4500 genes. Rather, we reduce the number of conditions from 300 to 100 and 30, respectively, with the rationale that the cost for generating expression data can be significantly reduced if 100 or even 30 conditions are sufficient for predicting motif binding strengths and TF activities. Table 2 gives the results for 30 and 100 conditions, in comparison with 300 conditions. One can see that the reliability of the MED method in rebuilding E is not reduced when fewer conditions are used. But, for predicting relative binding strengths and TF activities, use of fewer conditions worsens the MED performance. However, if the noise level is <10%, use of 30 conditions can still provide reasonably good predictions (Table 2).
Table 2.

Pearson's correlation coefficients between true values and MED-predicted values of expression levels (E), relative motif binding strengths (M) and relative TF activities (A), when the expression data are obtained from 300, 100 and 30 conditions, respectively

| Noise level (%) | E (300) | E (100) | E (30) | M ratio (within-column) (a) (300) | M ratio (100) | M ratio (30) | A ratio (within-row) (b) (300) | A ratio (100) | A ratio (30) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.000 ± 0.000 | 0.997 ± 0.004 | 1.000 ± 0.000 | 0.998 | 0.993 | 0.976 | 0.996 | 0.998 | 0.996 |
| 5 | 0.997 ± 0.001 | 0.997 ± 0.001 | 0.998 ± 0.001 | 0.986 | 0.989 | 0.946 | 0.992 | 0.987 | 0.993 |
| 10 | 0.991 ± 0.005 | 0.990 ± 0.006 | 0.989 ± 0.009 | 0.988 | 0.933 | 0.867 | 0.964 | 0.956 | 0.976 |
| 20 | 0.962 ± 0.026 | 0.964 ± 0.025 | 0.967 ± 0.027 | 0.942 | 0.845 | 0.699 | 0.930 | 0.906 | 0.916 |
| 30 | 0.929 ± 0.036 | 0.930 ± 0.037 | 0.934 ± 0.049 | 0.904 | 0.840 | 0.586 | 0.862 | 0.873 | 0.818 |
| 40 | 0.872 ± 0.063 | 0.880 ± 0.061 | 0.887 ± 0.076 | 0.848 | 0.760 | 0.579 | 0.771 | 0.798 | 0.744 |
| 50 | 0.834 ± 0.067 | 0.833 ± 0.078 | 0.841 ± 0.098 | 0.812 | 0.611 | 0.404 | 0.714 | 0.680 | 0.623 |
| 100 | 0.606 ± 0.099 | 0.595 ± 0.125 | 0.652 ± 0.164 | 0.587 | 0.361 | 0.224 | 0.435 | 0.359 | 0.314 |

Numbers in parentheses in the column headers are the numbers of conditions.

(a) Relative binding strengths of the same motif in two genes.

(b) Relative activities of the same TF under two different conditions.

Detection of TF-binding sites has been a much studied topic in the past decade (13–16). However, not all cis-regulatory motifs can be detected by current methods (13). We examined the accuracy of MED in two situations when some motifs in the genome are undetected. In the first situation, for a given TF, a fraction of its corresponding cis-motifs in the genome are assumed to be undetected. In the simulation, we fixed a random set of non-zero entries in I at 0. We repeated the simulation 10 times, and in each replication a different set of non-zero entries from the same I were fixed at 0. We examined r between M′ and M for relative binding strengths of the same motif in two genes. Note that presumably undetected motifs were not considered in computing r. We assumed that 0, 5, 10, 20, 30, 40 and 50% of motifs are undetected in seven sets of simulations, respectively. The results show that undetected motifs slightly worsen the performance of MED in predicting relative motif binding strengths (Figure 4A). The same is true for the relative TF activities (Supplementary Figure S3A).
Figure 4.

Performance of the MED method in predicting relative motif binding strength when some motifs in the genome are undetected. The mean correlation coefficient from 10 simulations and the associated standard deviation are presented for each condition examined. In (A), a fraction of motifs (from 0% to 50%) for each TF are undetected in the genome. In (B), all motifs of a fraction of TFs (from 0% to 50%) are undetected in the genome. Different colors show different fractions.

In the second situation, we assumed that for most TFs, all of their corresponding motifs are known, while for the rest of the TFs, none of their motifs are known. In the simulation, we fixed all the entries of a random set of columns in I at 0. We repeated the simulation 10 times, and in each replication a different set of columns from the same I were fixed at 0. We examined r between M′ and M for relative binding strengths of the same motif in two genes. Again, presumably undetected motifs were not considered in computing r. We also assumed that the motifs of 0, 5, 10, 20, 30, 40 and 50% of TFs are undetected in seven sets of simulations, respectively. The results show that this type of ignorance of motifs has a great impact on the prediction of relative motif binding strengths (Figure 4B). The same is true for the relative TF activities (Supplementary Figure S3B). Nonetheless, the predictions are not too bad (mean r > 0.65) when motifs corresponding to up to 10% of TFs are completely unknown and the noise level is not >30%.
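The two kinds of missing motif information examined above amount to two ways of zeroing the initial matrix I before decomposition. The following sketch shows one possible reading of both; the function names `mask_entries` and `mask_columns` are hypothetical, and rounding the number of hidden items to the nearest integer is an assumption.

```python
import random

def mask_entries(I, fraction, rng):
    """Situation 1: hide a random fraction of the non-zero entries of I."""
    nonzero = [(g, j) for g, row in enumerate(I)
               for j, x in enumerate(row) if x != 0]
    hidden = set(rng.sample(nonzero, round(fraction * len(nonzero))))
    masked = [[0.0 if (g, j) in hidden else x for j, x in enumerate(row)]
              for g, row in enumerate(I)]
    return masked, hidden

def mask_columns(I, fraction, rng):
    """Situation 2: hide every motif occurrence of a random fraction of TFs."""
    k = len(I[0])
    hidden = set(rng.sample(range(k), round(fraction * k)))
    masked = [[0.0 if j in hidden else x for j, x in enumerate(row)]
              for row in I]
    return masked, hidden
```

Both functions return the set of hidden positions (or columns) alongside the masked matrix, so that hidden motifs can be excluded when the correlation between true and predicted ratios is computed, as the text specifies.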

An application of MED

Knowing what MED can and cannot do, we decided to use MED to address an important question in gene regulation. It is frequently observed in eukaryotic promoters that a motif appears with multiple tandem copies (6). Although it has been frequently assumed that a motif with multiple copies in a promoter has stronger binding strength than the same motif with only one copy (2,17), whether this assumption is valid at the genomic scale has not been empirically tested. This question is ideal for MED to tackle, because it only requires the mean binding strength of a given motif in one set of genes, relative to that in another set of genes. Using the same yeast dataset used by Nguyen and D’Haeseleer (10), we separated the genes into two groups for each motif. The first group includes genes that each have only one copy of this motif, whereas the second group includes genes that each have multiple copies of the motif. Of the 62 motifs that can be separated into two groups, we found 18 motifs for which the average binding strengths for the two groups have opposite signs (i.e. one is positive and the other is negative). These inconsistent results are likely due to MED errors and thus are removed. For each of the remaining 44 motifs, we calculated the ratio (R) between the average binding strength of the second group and that of the first group. We then tested the null hypothesis that R = 1, against the alternative hypothesis that R > 1. We found that the average R of the 44 motifs is 4.517 ± 0.897, significantly greater than 0 (P < 10^−5; t-test; Figure 5). Furthermore, 37 motifs, significantly more than half of the 44 motifs, have R > 1 (P = 3 × 10^−6; binomial test; Figure 5). These results indicate that motifs with multiple copies in promoters generally have greater binding strengths than the same motifs with single copies (Figure 5).
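The per-motif ratio R and the binomial test described above can be sketched as follows. The function names `copy_number_ratio` and `binomial_upper_tail` are hypothetical, the input lists stand for MED-derived binding strengths of one motif in the two gene groups, and the one-sided exact binomial tail is an assumed reading of the test, since the text does not state which variant was used.

```python
from math import comb

def mean(xs):
    return sum(xs) / len(xs)

def copy_number_ratio(single_copy, multi_copy):
    """R = mean strength with multiple copies / mean strength with one copy.

    Returns None when the two group means differ in sign (or one is zero),
    mirroring the removal of inconsistent motifs. Because R is a ratio of
    strengths of the same motif, the unknown per-motif scale factor cancels.
    """
    s, m = mean(single_copy), mean(multi_copy)
    if s * m <= 0:
        return None
    return m / s

def binomial_upper_tail(successes, trials, p=0.5):
    """Exact one-sided probability of at least `successes` out of `trials`."""
    return sum(comb(trials, i) * p ** i * (1 - p) ** (trials - i)
               for i in range(successes, trials + 1))

# 37 of 44 motifs with R > 1, tested against a chance level of one half.
p_value = binomial_upper_tail(37, 44)
print(p_value)
```

Under this one-sided reading the tail probability for 37 of 44 is on the order of 10^−6, in line with the P value reported in the text.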
Figure 5.

Frequency distribution of the ratio (R) between the mean binding strength of a motif in promoters where it has multiple copies to the mean binding strength of the same motif in promoters where it has one copy. The distribution is from 44 different motifs in yeast.


DISCUSSION

The exponential growth of available functional genomic data opens the possibility to understand biological processes at the genomic and systems levels (6,18,19). One major advance in this endeavor is the development of methods for identifying cis-regulatory motifs in promoters of all genes in a genome. Using genome-wide microarray gene expression data and motif information, Nguyen and D’Haeseleer invented the MED method, which decomposes the gene expression data into motif binding strength data and TF activity data (10). The knowledge of binding strengths and TF activities can be used to decipher principles of transcriptional regulation. Thus, it is important to know how well MED performs. In this work, we conducted computer simulations to evaluate the MED method. Our results showed that at realistic levels of noise, which include both expression stochasticity and microarray errors, MED-predicted gene expression levels are highly reliable. This result is not unexpected, as MED decomposes E into M′ and A′, which are then used to rebuild E′. For both binding strengths and TF activities, however, MED cannot provide accurate predictions. Furthermore, MED cannot differentiate between enhancer and repressor motifs and cannot differentiate between activation and suppression TF activities. MED results cannot be used to compare binding strengths among different motifs or to compare activities among different TFs. Nevertheless, the relative binding strengths of the same motif in different genes and the relative activities of the same TF under different conditions can be estimated with fairly high accuracy. If we have external information that a motif is an enhancer or repressor or that a TF activity under a given condition is activation or suppression (relative to the control condition), such information can be combined with MED results to provide better predictions. 
We note that relative binding strengths of the same motif in different genes and relative activities of the same TF under different conditions can provide much information that is valuable to our understanding of principles of transcriptional regulation. One such example is the comparison between binding strengths of the same motif when it has one copy per promoter versus multiple copies per promoter. Using MED results, we demonstrated that for the majority of motifs (84%), the binding strength is greater when a motif appears in multiple copies than when it appears in one copy. This may explain why many motifs have multiple copies in a promoter. However, we caution that this result was based on an analysis of motifs corresponding to only 62 TFs, about a third of all TFs in yeast. Because our simulation showed that MED is not robust to the ignorance of all motifs of even 10% of TFs in the genome, the validity of our result should be further examined when larger datasets become available. An encouraging finding from our simulations is that at realistic levels of noise, MED requires expression data from as few as 30 conditions to provide reasonably accurate predictions of relative motif binding strengths and relative TF activities. Thus, even a small lab may be able to generate sufficient data for a genome-wide estimation of motif binding strengths in a non-model organism. Another encouraging finding is that even when some motifs (e.g. 20%) in the genome are undetected, MED can still make reasonably good predictions, as long as the majority of motifs are detected for each TF. When all motifs of some TFs are unknown, MED will have much reduced accuracy. Thus, from the perspective of MED performance, it is more important to identify most motifs for each TF than to identify all motifs for some TFs. It should be noted, however, that the simulation results presented here were based on a number of simplified assumptions that warrant discussion. 
First, we assumed a simple logic of transcriptional regulation as described by Equation (1) in Methods section. If this assumption is violated, MED predictions will be less accurate. One potentially important violation is interaction between motifs or between TFs, both of which have been observed (20,21). Second, our model ignores epigenetic factors, which are known to affect gene expression differently for different genes under different conditions (22). Third, we assumed a relatively simple form of expression stochasticity and microarray noise. If expression errors are much larger and/or more complex, MED predictions may be less accurate. We believe that a better understanding of the molecular mechanisms of gene expression regulation will assist the development of more powerful computational tools, which will in turn help us further understand gene expression regulation.
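The first caveat can be illustrated with a toy calculation. The sketch below is not the simulation used in this study: it assumes the additive model, treats the TF activities as known (a best case that MED itself does not enjoy), and adds a hypothetical interaction term proportional to the product of two TF activities. All names (`M`, `A`, `coupling`, `additive_fit_residual`) are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_tfs, n_conditions = 5, 3, 8

# Toy binding strengths (genes x TFs) and TF activities (TFs x conditions).
M = rng.normal(size=(n_genes, n_tfs))
A = rng.normal(size=(n_tfs, n_conditions))

def additive_fit_residual(E, A):
    """Fit binding strengths by least squares with A held fixed and
    return the norm of what the additive model cannot explain."""
    # Solve A.T @ X = E.T for X = M_hat.T in the least-squares sense.
    M_hat_T, *_ = np.linalg.lstsq(A.T, E.T, rcond=None)
    return np.linalg.norm(E - M_hat_T.T @ A)

# Purely additive expression: the additive model fits it exactly.
E_additive = M @ A

# Hypothetical interaction: an extra contribution proportional to the
# product of the activities of TF 0 and TF 1 in each condition.
pair_activity = A[0] * A[1]                 # shape (n_conditions,)
coupling = rng.normal(size=(n_genes, 1))    # per-gene interaction weight
E_interacting = E_additive + coupling * pair_activity

res_add = additive_fit_residual(E_additive, A)
res_int = additive_fit_residual(E_interacting, A)
print(res_add, res_int)
```

In this toy setting the residual is essentially zero for the additive data and clearly nonzero once the interaction term is present, which is the sense in which a violated Equation (1) would be expected to degrade the decomposition.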

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.