Mamoru Kato, Tatsuhiko Tsunoda.
Abstract
BACKGROUND: Gene expression in eukaryotes often requires a combination of multiple types of transcription factors and cis-regulatory elements, and this combinatorial regulation confers tissue-specific or environment-specific gene expression. To reveal combinatorial regulation, computational methods have been developed that efficiently infer combinations of cis-regulatory motifs important for gene expression as measured by DNA microarrays. One promising type of method applies regression analysis between expression levels and the scores of motifs in input sequences. This approach takes full advantage of the information in expression levels because it does not require the expression level of each gene to be dichotomized according to whether or not it reaches a certain threshold. However, there is no web-based tool that employs regression methods to systematically search for motif combinations and that can practically handle combinations of more than two or three motifs.
Year: 2007 PMID: 17378935 PMCID: PMC1838919 DOI: 10.1186/1471-2105-8-100
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Views of the tool. (A) View at the motif-cutting step. The cluster figure shows the relationships between uploaded motifs. Users can remove redundant, similar motifs by clicking a button. (B) View at the combination-search step. Two figures are displayed on the web screen. The first figure ("History of evaluation score") shows the best goodness-of-fit value at each iteration of the combination search. The drop line in blue marks the iteration at which the optimum motif combination was found. The second figure ("Landscape of evaluation score") shows all goodness-of-fit values obtained during the search versus the distance of each motif combination from the reference motif (see the text). Different distance values are intended to represent different motif combinations. The blue point near the bottom corresponds to the drop line in the first figure and to the optimum motif combination.
Figure 2. Procedures of the genetic algorithm. M1, M2, ... indicate motifs. A series of motifs (e.g., |M1|M7|M5|M2|) is a motif combination. At the initialization step, motifs are randomly selected to generate motif combinations. For each motif combination, matching scores are calculated from upstream sequences, and the expression levels of the mRNAs with those upstream sequences are obtained from microarray data. Regression between the scores and the expression levels is performed to obtain the goodness-of-fit (AIC or GCV), and the motif combinations with the best goodness-of-fit values are selected to undergo crossover. Mutation is then performed to replace motifs with other motifs. These procedures are executed iteratively on the updated motif combinations.
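The loop in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's implementation: ordinary least-squares regression with a Gaussian AIC stands in for the regression methods the tool uses, and all function names (`solve`, `aic`, `search`, `make_demo`), the elitism step, the crossover rule, and the parameter defaults are hypothetical choices made for this example.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def aic(combo, scores, expr):
    """Goodness-of-fit of one motif combination (lower is better).

    Assumption: linear least squares with an intercept; the AIC is
    n * ln(RSS / n) + 2 * (number of coefficients), up to a constant.
    """
    n, p = len(expr), len(combo) + 1
    X = [[1.0] + [scores[g][mo] for mo in combo] for g in range(n)]
    # Normal equations; the tiny diagonal term only guards against singularity.
    xtx = [[sum(X[g][i] * X[g][j] for g in range(n)) + (1e-9 if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(X[g][i] * expr[g] for g in range(n)) for i in range(p)]
    beta = solve(xtx, xty)
    rss = sum((expr[g] - sum(b * x for b, x in zip(beta, X[g]))) ** 2
              for g in range(n))
    return n * math.log(max(rss / n, 1e-300)) + 2 * p


def search(scores, expr, n_motifs, k=3, pop_size=30, n_iter=40,
           mut_rate=0.3, seed=0):
    """Genetic search for a k-motif combination with the lowest AIC."""
    rng = random.Random(seed)
    fit = lambda c: aic(c, scores, expr)
    # Initialization: random combinations of k distinct motifs.
    pop = [tuple(sorted(rng.sample(range(n_motifs), k))) for _ in range(pop_size)]
    for _ in range(n_iter):
        ranked = sorted(set(pop), key=fit)
        parents = ranked[:max(1, len(ranked) // 2)]  # selection of the best fits
        children = [ranked[0]]                       # elitism (an assumption)
        while len(children) < pop_size:
            a, b = rng.choice(parents), rng.choice(parents)
            # Crossover: draw k motifs from the union of both parents.
            child = rng.sample(sorted(set(a) | set(b)), k)
            if rng.random() < mut_rate:
                # Mutation: replace one motif with a motif not in the child.
                others = [mo for mo in range(n_motifs) if mo not in child]
                child[rng.randrange(k)] = rng.choice(others)
            children.append(tuple(sorted(child)))
        pop = children
    return min(pop, key=fit)


def make_demo(n_genes=60, n_motifs=12, seed=1):
    """Synthetic scores and expression levels with motifs 1, 4, 7 planted."""
    rng = random.Random(seed)
    scores = [[rng.random() for _ in range(n_motifs)] for _ in range(n_genes)]
    expr = [2.0 * s[1] - 1.5 * s[4] + 1.0 * s[7] + rng.gauss(0, 0.1)
            for s in scores]
    return scores, expr


if __name__ == "__main__":
    scores, expr = make_demo()
    best = search(scores, expr, n_motifs=12, k=3)
    print("selected combination:", best, "AIC:", round(aic(best, scores, expr), 2))
```

Here each combination has a fixed size `k` and contains distinct motifs, which keeps the regression well-posed; a search over variable-size combinations would need a different crossover rule.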
Evaluation of the tool by simulated data sets
| Expression type | Noise (%) | Average Log10(P) | Average C.R. | Best Log10(P) | Best C.R. |
| --- | --- | --- | --- | --- | --- |
| Linear | 0 | -2.9 | 0.83 | -4.8 | 0.89 |
| Linear | 20 | -2.9 | 0.80 | -4.9 | 0.85 |
| Linear | 40 | -2.7 | 0.73 | -4.5 | 0.78 |
| Linear | 60 | -2.7 | 0.63 | -4.3 | 0.68 |
| Linear | 80 | -2.2 | 0.52 | -3.7 | 0.55 |
| Quadratic | 0 | -3.9 | 0.82 | -5.8 | 0.88 |
| Quadratic | 20 | -3.7 | 0.78 | -5.8 | 0.85 |
| Quadratic | 40 | -3.6 | 0.73 | -5.5 | 0.77 |
| Quadratic | 60 | -3.0 | 0.63 | -4.7 | 0.69 |
| Quadratic | 80 | -2.9 | 0.57 | -4.6 | 0.58 |
We tested whether the tool can correctly recover planted motif combinations, using simulated data sets. We evaluated the results based on P values of the hypergeometric test (see the text). "Linear" and "Quadratic" indicate the types of simulated expression levels given by eq. 2 and eq. 3, respectively, and "Noise" indicates the level of noise added to the simulated expression levels. "Average" and "Best" correspond to the P value (logarithmically) averaged across the top 10 motif combinations and the best P value among the top 10, respectively (see the text). "Log10(P)" indicates log10 of the P values. "C.R." indicates the contribution rate, which is the proportion of the variance of input expression levels explained by the scores of a motif combination. The "Log10(P)" and "C.R." values in each row are averaged across ten data sets.
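The two quantities reported in the table can be illustrated with a short Python sketch. The contribution rate follows the definition given above (proportion of variance explained). The exact setup of the hypergeometric test is described in the main text, which is not reproduced here, so the upper-tail formulation below (probability of recovering at least `k` planted motifs by chance) is an assumption, and the function names are hypothetical.

```python
from math import comb, log10


def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) when n motifs are drawn without replacement from N motifs,
    K of which are planted. Assumed formulation of the test."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


def mean_log10_p(p_values):
    """Logarithmic average of P values, as used for the "Average" columns."""
    return sum(log10(p) for p in p_values) / len(p_values)


def contribution_rate(observed, fitted):
    """Proportion of the variance of observed expression levels explained
    by the fitted values: 1 - RSS / TSS."""
    mean = sum(observed) / len(observed)
    tss = sum((y - mean) ** 2 for y in observed)
    rss = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return 1.0 - rss / tss
```

For example, `hypergeom_upper_tail(10, 3, 3, 3)` is the chance of picking exactly the three planted motifs when three motifs are drawn at random from ten, which is 1/120.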
Evaluation of the tool by muscle-specific transcripts
| Examined tissue | Best log10(P) | C.R. | Selected muscle motifs |
| --- | --- | --- | --- |
| Muscle, abdominal | -2.02 | 0.43 | MEF2, SRF |
| Muscle, abdominal | -2.02 | 0.42 | MYF, SRF |
| Muscle, right calf | -3.75 | 0.52 | MEF2, SP1, SRF |
We used upstream sequences and expression levels of muscle-specific transcripts [24] to see whether the tool can select, from all JASPAR [18] motifs, a motif combination composed of motifs involved in muscle-specific expression (MEF2, MYF, SP1, SRF, and TEF). We employed MARS as the regression method. We evaluated the results based on P values of the hypergeometric test (see the text). "Examined tissue" indicates the tissues that the study [24] examined as skeletal muscle tissues. We list here the results for the motif combination(s) with the best P value among the top dozen motif combinations selected by the tool. "Log10(P)" indicates log10 of the P values. "C.R." indicates the contribution rate, which is the proportion of the variance of input expression levels explained by the scores of a motif combination. "Selected muscle motifs" indicates the muscle-related motifs included in the motif combination selected by the tool.