Ryung S Kim, Hongkai Ji, Wing H Wong.
Abstract
BACKGROUND: Many statistical algorithms combine microarray expression data and genome sequence data to identify transcription factor binding motifs in the genomes of lower eukaryotes. Finding cis-regulatory elements in higher eukaryote genomes, however, remains a challenge, as searching the promoter regions of genes with similar expression patterns often fails. The difficulty is partially attributable to the poor performance of the similarity measures used to compare expression profiles. The widely accepted measures are inadequate for distinguishing genes transcribed under distinct regulatory mechanisms in the complex genomes of higher eukaryotes.
Year: 2006 PMID: 16438730 PMCID: PMC1403805 DOI: 10.1186/1471-2105-7-44
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Correlation of median expression distance with regulatory similarity in seven data sets. Each point is the observed median expression distance of gene pairs as a function of the number of TFBMs common to the pair. Two expression distance measures are used: (a) 1 minus correlation, and (b) the new expression distance measure. For each data set, the correlation between median expression distance and regulatory similarity is computed. To assess the significance of these correlations, the mapping between genes and their promoter regions was permuted 500 times. When fewer than 5 gene pairs have a given regulatory similarity, neighboring regulatory similarity values are combined before the median is computed, so that each point in the plots represents at least 5 gene pairs. Genes that share a large number of TFBMs are more likely to have correlated expression patterns; in some data sets the effect appears only when enough TFBMs are shared. Table 1 summarizes the results for 7 different distance measures. The figure and Table 1 show that, while the other distance measures perform similarly to one another, the new distance measure correlates best with regulatory similarity. Only the new distance measure gives a significant correlation in all seven data sets.
Table 1. Correlations between median expression distance and regulatory similarity. The performance of different distance measures was compared in each of seven mouse experiments: Su et al. (Su), Storch et al. (Circadian), the neocortex development (Cortex), the murine model of human asthma (Lung), the hippocampus samples from the neurofibromin-1 heterozygous study (NF), Zhao et al. (Muscle), and Wang et al. (PI). The number of microarrays used in each data set is shown in the header row. The p-values in parentheses were obtained by permuting the mapping between genes and their promoter regions 500 times.
| Expression distance measure | Su (89 arrays) | Circadian (24 arrays) | Cortex (17 arrays) | Lung (39 arrays) | NF (30 arrays) | Muscle (54 arrays) | PI (35 arrays) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 – correlation | -0.165 (0.422) | -0.261 (0.158) | -0.636 (0.008) | -0.423 (0.124) | -0.826 (0.000) | -0.358 (0.170) | -0.384 (0.198) |
| 1 – cosine correlation | -0.802 (0.000) | -0.392 (0.106) | -0.679 (0.002) | -0.456 (0.066) | -0.878 (0.000) | -0.047 (0.488) | 0.016 (0.666) |
| Square root 1 – correlation | -0.177 (0.416) | -0.280 (0.230) | -0.636 (0.008) | -0.412 (0.162) | -0.836 (0.000) | -0.401 (0.148) | -0.396 (0.200) |
| Square root 1 – cosine correlation | -0.783 (0.000) | -0.401 (0.166) | -0.683 (0.002) | -0.464 (0.064) | -0.869 (0.000) | -0.104 (0.490) | -0.007 (0.670) |
| 1 – correlation after log2 transformation | -0.178 (0.310) | -0.254 (0.124) | -0.459 (0.032) | -0.534 (0.030) | -0.798 (0.000) | -0.136 (0.314) | -0.035 (0.428) |
| 1 – cosine correlation after log2 transformation | -0.685 (0.006) | -0.026 (0.346) | -0.833 (0.000) | -0.830 (0.000) | -0.874 (0.000) | -0.540 (0.032) | 0.346 (0.812) |
| The new distance measure | -0.777 (0.000) | -0.789 (0.000) | -0.735 (0.002) | -0.776 (0.000) | -0.859 (0.000) | -0.774 (0.000) | -0.656 (0.000) |
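The procedure behind these correlations and p-values can be sketched roughly as follows. This is a minimal illustration on toy data, not the authors' code: the function names, the use of the mean similarity as the x-value of a merged bin, the merging of leftover pairs into the last bin, and the one-sided p-value (fraction of permutations with a correlation at least as negative as the observed one) are all assumptions made for the example.

```python
import random
from itertools import combinations
from statistics import mean, median


def pearson(u, v):
    """Pearson correlation of two equal-length lists (0.0 if degenerate)."""
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den if den else 0.0


def binned_medians(similarity, distance, min_pairs=5):
    """Median distance per similarity level; adjacent levels are merged
    until every bin holds at least min_pairs gene pairs (assumed rule)."""
    levels = {}
    for s, d in zip(similarity, distance):
        levels.setdefault(s, []).append(d)
    groups, sims, dists = [], [], []
    for s in sorted(levels):
        sims += [s] * len(levels[s])
        dists += levels[s]
        if len(dists) >= min_pairs:
            groups.append((sims, dists))
            sims, dists = [], []
    if dists:  # leftover pairs at the top end join the last bin
        if groups:
            groups[-1] = (groups[-1][0] + sims, groups[-1][1] + dists)
        else:
            groups.append((sims, dists))
    return [(mean(s), median(d), len(d)) for s, d in groups]


def similarity_distance_correlation(motifs, dist, min_pairs=5):
    """motifs: one set of motif ids per gene; dist: {(i, j): expression distance}."""
    pairs = list(combinations(range(len(motifs)), 2))
    sim = [len(motifs[i] & motifs[j]) for i, j in pairs]
    points = binned_medians(sim, [dist[p] for p in pairs], min_pairs)
    return pearson([p[0] for p in points], [p[1] for p in points])


def permutation_p_value(motifs, dist, n_perm=500, seed=0):
    """Permute the gene-to-promoter mapping and compare correlations."""
    rng = random.Random(seed)
    observed = similarity_distance_correlation(motifs, dist)
    hits = 0
    for _ in range(n_perm):
        shuffled = motifs[:]
        rng.shuffle(shuffled)  # breaks the link between genes and promoters
        if similarity_distance_correlation(shuffled, dist) <= observed:
            hits += 1
    return observed, hits / n_perm


# Toy data (hypothetical): distances shrink as genes share more motifs.
toy_rng = random.Random(1)
toy_motifs = [set(toy_rng.sample(range(12), toy_rng.randint(1, 6))) for _ in range(30)]
toy_dist = {(i, j): 1.0 / (1 + len(toy_motifs[i] & toy_motifs[j])) + toy_rng.uniform(0, 0.3)
            for i, j in combinations(range(30), 2)}
observed, p_value = permutation_p_value(toy_motifs, toy_dist, n_perm=200)
print(observed, p_value)
```

In the sketch, only the motif sets are shuffled while the expression distances stay attached to the gene indices, which is one way to read "permuting the mapping between genes and their promoter regions".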
Figure 2. Typical co-expressed gene cluster with high correlation. The tightest gene cluster in the mouse cortex developmental data is shown as a heatmap; a sophisticated clustering algorithm is used with one minus correlation as the distance measure. The cluster consists of 65 down-regulated genes. The green column on the right side of the diagram shows the fold-change between two cortex samples, one at embryonic day 8 and one at adult age. The expression level matrix is standardized: the mean is subtracted and the result is divided by the standard deviation. The color scheme ranges from -3 (blue, below the mean) to 3 (red, above the mean); white represents the mean (value 0). The rows correspond to different genes, and the columns represent the experimental samples. The genes have a tight linear expression pattern, but their fold-changes between samples are highly variable. Such variability is a general phenomenon when one minus correlation is the distance measure.
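A small numerical sketch may help show why a correlation-based cluster can hide large differences in fold-change. The two profiles below are hypothetical, and the per-gene standardization with the population standard deviation and clipping at ±3 are assumptions chosen to mirror the heatmap color range.

```python
from statistics import mean, pstdev


def standardize(profile, clip=3.0):
    """Subtract the mean, divide by the standard deviation, clip to [-clip, clip]."""
    m, s = mean(profile), pstdev(profile)
    return [max(-clip, min(clip, (v - m) / s)) for v in profile]


def pearson(u, v):
    """Pearson correlation of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den


def fold_change(profile):
    """Ratio of the first sample to the last sample."""
    return profile[0] / profile[-1]


# Two hypothetical down-regulated genes; gene_b is a linear function of gene_a.
gene_a = [16.0, 8.0, 4.0, 2.0, 1.0]
gene_b = [16.0 + v / 4 for v in gene_a]

print(pearson(gene_a, gene_b))                    # correlation of the raw profiles
print(standardize(gene_a), standardize(gene_b))   # rows as drawn in a heatmap
print(fold_change(gene_a), fold_change(gene_b))   # fold-changes differ widely
```

Because `gene_b` is a positive linear function of `gene_a`, one minus correlation is essentially zero and the standardized rows coincide, yet the fold-changes are 16 and about 1.23.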
Figure 3. Overview of the binding data. (a) Histogram of the number of known TFBMs in the promoter regions of 12,079 non-redundant genes. (b) Distribution of the number of common known TFBSs in the promoter regions of all 72,945,081 gene pairs in 2 mouse chips.
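Distance measures between two expression profiles. Two expression profiles $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ have medians $m(x)$, $m(y)$ and means $\bar{x}$, $\bar{y}$. The arithmetic relationships between the measures are as follows: $1 - \mathrm{Cor}(x, y) = E(\tilde{x}, \tilde{y})^2 / 2$, $1 - \mathrm{Cos}(x, y) = E(\hat{x}, \hat{y})^2 / 2$, and $E(x', y') = d(x, y)$, where $\tilde{x} = (x - \bar{x}) / \lVert x - \bar{x} \rVert$, $\hat{x} = x / \lVert x \rVert$, and $x' = \log_2(x / m(x))$, with $\tilde{y}$, $\hat{y}$, and $y'$ defined similarly.
| Distance Measures | Definition |
| --- | --- |
| Correlation | $\mathrm{Cor}(x, y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$ |
| Cosine correlation | $\mathrm{Cos}(x, y) = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$ |
| Euclidean distance | $E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ |
| New distance measure | $d(x, y) = \sqrt{\sum_{i=1}^{n} \left( \log_2 \dfrac{x_i}{m(x)} - \log_2 \dfrac{y_i}{m(y)} \right)^2}$ |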
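The four measures can be sketched in a few lines of Python. This is a minimal illustration, assuming that the Euclidean distance is the unnormalized root sum of squared differences and that the new measure is the Euclidean distance between the profiles after the transformation log2(x/m(x)), as the relation E(x', y') = d(x, y) indicates. The function names and the example profiles are hypothetical, and all expression values must be positive for the logarithm to be defined.

```python
import math
from statistics import mean, median


def correlation(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)) * math.sqrt(sum((b - my) ** 2 for b in y))
    return num / den


def cosine_correlation(x, y):
    """Cosine of the angle between the two profiles (no mean centering)."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return num / den


def euclidean(x, y):
    """Unnormalized Euclidean distance (assumed form)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def log_median_ratio(x):
    """x' = log2(x / m(x)), applied element by element."""
    m = median(x)
    return [math.log2(a / m) for a in x]


def new_distance(x, y):
    """d(x, y) = E(x', y')."""
    return euclidean(log_median_ratio(x), log_median_ratio(y))


def centered_unit(x):
    """Mean-center a profile and scale it to unit length."""
    mx = mean(x)
    c = [a - mx for a in x]
    norm = math.sqrt(sum(a * a for a in c))
    return [a / norm for a in c]


def unit(x):
    """Scale a profile to unit length without centering."""
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]


x = [120.0, 80.0, 300.0, 45.0, 200.0]
y = [60.0, 90.0, 150.0, 30.0, 110.0]

# Each printed pair should agree up to rounding error.
print(1 - correlation(x, y), euclidean(centered_unit(x), centered_unit(y)) ** 2 / 2)
print(1 - cosine_correlation(x, y), euclidean(unit(x), unit(y)) ** 2 / 2)
# Doubling every value leaves the log-ratios to the median unchanged.
print(new_distance(x, [2 * v for v in x]))
```

The first two printed pairs illustrate the standard identity that one minus a correlation equals half the squared Euclidean distance between unit-length versions of the profiles. The last line shows that the new measure, as sketched here, ignores a constant scale factor but, unlike correlation, still reacts to differences in the magnitude of change.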