Richard S Savage, Zoubin Ghahramani, Jim E Griffin, Bernard J de la Cruz, David L Wild.
Abstract
MOTIVATION: We present a method for directly inferring transcriptional modules (TMs) by integrating gene expression and transcription factor binding (ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture model to allow data fusion on a gene-by-gene basis. This encodes the intuition that co-expression and co-regulation are not necessarily equivalent and hence we do not expect all genes to group similarly in both datasets. In particular, it allows us to identify the subset of genes that share the same structure of transcriptional modules in both datasets.
Year: 2010 PMID: 20529901 PMCID: PMC2881394 DOI: 10.1093/bioinformatics/btq210
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Graphical representation of the model presented in this article. The parameters are defined in Section 2.
The BHI scores for galactose utilization with Harbison et al. ChIP data, for comparison with Table 1
| Similarity matrix | w | No. of genes | BHI (all) | BHI (bp) | BHI (mf) | BHI (cc) |
|---|---|---|---|---|---|---|
| Fused genes | 0.5 | 72 | 0.49 ± 0.01 | 0.42 ± 0.01 | 0.35 ± 0.01 | 0.49 ± 0.01 |
| Fused genes | 1 | 205 | 0.39 ± 0.01 | 0.22 ± 0.01 | 0.19 ± 0.01 | 0.37 ± 0.01 |
| Fused genes | sampled | 56 | 0.49 ± 0.01 | 0.40 ± 0.01 | 0.32 ± 0.01 | 0.49 ± 0.01 |
The Lee et al. ChIP data are used in this article to mimic the Liu et al. analysis. The results here show that the Harbison et al. data yield a greater number of fused genes, with similar overall BHI scores. Also shown are results for a run in which w is sampled using a Gibbs sampler. This shows a small degradation relative to the w = 0.5 case.
The BHI scores for the galactose utilization dataset
| Similarity matrix | w | No. of genes | BHI (all) | BHI (bp) | BHI (mf) | BHI (cc) |
|---|---|---|---|---|---|---|
| Fused genes | 0.5 | 51 | 0.49 ± 0.03 | 0.43 ± 0.05 | 0.40 ± 0.04 | 0.49 ± 0.03 |
| Fused genes | 1 | 205 | 0.37 ± 0.01 | 0.22 ± 0.01 | 0.19 ± 0.01 | 0.37 ± 0.01 |
| Unfused (expression only) | 0 | 205 | 0.38 ± 0.01 | 0.26 ± 0.02 | 0.22 ± 0.02 | 0.38 ± 0.01 |
| Unfused (expression only) | 0.5 | 154 | 0.37 ± 0.03 | 0.30 ± 0.02 | 0.23 ± 0.02 | 0.37 ± 0.03 |
| Unfused (ChIP-chip only) | 0 | 205 | 0.28 ± 0.03 | 0.13 ± 0.01 | 0.11 ± 0.02 | 0.25 ± 0.03 |
| Unfused (ChIP-chip only) | 0.5 | 154 | 0.20 ± 0.06 | 0.06 ± 0.03 | 0.07 ± 0.04 | 0.19 ± 0.07 |
| Context-averaged (Liu et al.) | 0 | 205 | 0.38 ± 0.01 | 0.26 ± 0.02 | 0.22 ± 0.01 | 0.38 ± 0.01 |
| Context-averaged | 0.5 | 205 | 0.40 ± 0.01 | 0.24 ± 0.01 | 0.20 ± 0.01 | 0.40 ± 0.02 |
| Context-averaged | 1 | 205 | 0.37 ± 0.01 | 0.22 ± 0.01 | 0.19 ± 0.01 | 0.37 ± 0.01 |
We compute the BHI scores for each GO ontology (biological process, molecular function and cellular component) and an overall value. The fused genes are those with a posterior probability of being fused of at least 0.5. All other genes are classed as unfused. Context-averaged similarity matrices are constructed by averaging the posterior similarity matrix over both contexts (i.e. datasets). This is the method used by Liu et al. For comparison, the result obtained using the BHC algorithm on the gene expression data alone is 0.323.
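The two constructions described in this footnote, thresholding the posterior fusion probability at 0.5 and averaging the posterior similarity matrices over the two contexts, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `split_fused` and `context_average`, the gene names and all numeric values are hypothetical.

```python
def split_fused(fusion_prob, threshold=0.5):
    """Split genes into (fused, unfused) by posterior fusion probability.

    A gene counts as fused when its probability is at least `threshold`.
    """
    fused = [g for g, p in fusion_prob.items() if p >= threshold]
    unfused = [g for g, p in fusion_prob.items() if p < threshold]
    return fused, unfused


def context_average(sim_expression, sim_chip):
    """Element-wise mean of two square posterior similarity matrices."""
    n = len(sim_expression)
    return [
        [(sim_expression[i][j] + sim_chip[i][j]) / 2.0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy inputs for three genes (not taken from the paper).
fusion_prob = {"geneA": 0.9, "geneB": 0.5, "geneC": 0.2}
fused, unfused = split_fused(fusion_prob)

sim_expr = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.3], [0.1, 0.3, 1.0]]
sim_chip = [[1.0, 0.4, 0.5], [0.4, 1.0, 0.1], [0.5, 0.1, 1.0]]
averaged = context_average(sim_expr, sim_chip)
```

In this toy case `geneB` sits exactly on the threshold and is therefore classed as fused, matching the "at least 0.5" wording of the footnote.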
Fig. 2. Plots of the BHI for the galactose dataset, showing the variation with different numbers of fused genes. Shown are the BHI results for each GO separately, plus all three combined. In all cases, selecting 100 or fewer genes leads to an increase in the BHI score. The error bars show a distribution of randomized BHI scores where the cluster sizes and number of clusters are kept the same but gene names are drawn randomly from the 205 genes in the galactose dataset. By comparison, this gives us a measure of the enrichment of the fused gene clusters.
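The randomized baseline described in this caption can be sketched as follows. The sketch assumes that BHI is the Biological Homogeneity Index in its usual pairwise form (for each cluster, the fraction of annotated gene pairs that share at least one GO term, averaged over clusters); the source does not spell out the definition, so treat this as an assumption. Skipping clusters with fewer than two annotated genes is likewise a choice made here for illustration. The function names and toy data are hypothetical.

```python
import random
from itertools import combinations


def bhi(clusters, annotations):
    """Mean over clusters of the fraction of annotated gene pairs sharing a term."""
    scores = []
    for genes in clusters:
        annotated = [g for g in genes if annotations.get(g)]
        pairs = list(combinations(annotated, 2))
        if not pairs:
            continue  # assumption: skip clusters with < 2 annotated genes
        shared = sum(1 for a, b in pairs if annotations[a] & annotations[b])
        scores.append(shared / len(pairs))
    return sum(scores) / len(scores) if scores else 0.0


def randomized_bhi(clusters, annotations, gene_pool, n_draws=200, seed=0):
    """BHI values for random gene sets with the same cluster sizes and count."""
    rng = random.Random(seed)
    total = sum(len(c) for c in clusters)
    draws = []
    for _ in range(n_draws):
        # Draw gene names without replacement from the full pool,
        # then cut the sample into clusters of the original sizes.
        sample = rng.sample(gene_pool, total)
        random_clusters, start = [], 0
        for c in clusters:
            random_clusters.append(sample[start:start + len(c)])
            start += len(c)
        draws.append(bhi(random_clusters, annotations))
    return draws


# Hypothetical toy annotations: gene -> set of GO terms.
annotations = {
    "g1": {"glycolysis"}, "g2": {"glycolysis"},
    "g3": {"translation"}, "g4": {"translation"},
    "g5": {"transport"}, "g6": {"transport"},
}
clusters = [["g1", "g2"], ["g3", "g4"]]
observed = bhi(clusters, annotations)
null_scores = randomized_bhi(clusters, annotations, list(annotations), n_draws=100)
```

Comparing `observed` against the spread of `null_scores` gives the kind of enrichment measure the error bars are meant to convey.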
Fig. 3. Graphical representation of the significantly over-represented GO terms for each cluster of genes, for the galactose utilization (left) and cell cycle (right) datasets. Black indicates that a given gene is annotated with the relevant GO term and that the term is over-represented in that cluster.
Over-represented GO terms for one of the fused clusters extracted from the galactose utilization dataset (with w = 0.5)
| GO ID | Cluster | P-value | Count (fused) | Count (Liu) | GO term |
|---|---|---|---|---|---|
| 4365 | 1 | 9.8 × 10⁻⁶ | | 3/9 | Glyceraldehyde-3-phosphate dehydrogenase (phosphorylating) activity |
| 16620 | 1 | 5.0 × 10⁻⁴ | | 3/9 | Oxidoreductase activity, acting on aldehyde/oxo donors, NAD/NADP acceptor |
| 51287 | 1 | 1.1 × 10⁻³ | | 3/9 | NAD or NADH binding |
| 6096 | 1 | 3.6 × 10⁻⁸ | 4/4 | 9/9 | Glycolysis |
| 19320 | 1 | 3.3 × 10⁻⁷ | 4/4 | 9/9 | Hexose catabolic process |
| 46164 | 1 | 7.4 × 10⁻⁷ | 4/4 | 9/9 | Alcohol catabolic process |
| 16052 | 1 | 3.3 × 10⁻⁶ | 4/4 | 9/9 | Carbohydrate catabolic process |
| 6094 | 1 | 1.9 × 10⁻⁵ | 3/4 | 7/9 | Gluconeogenesis |
| 46364 | 1 | 1.2 × 10⁻⁴ | 3/4 | 7/9 | Monosaccharide biosynthetic process |
| 19752 | 1 | 5.7 × 10⁻⁴ | 4/4 | 8/9 | Carboxylic acid metabolic process |
| 42180 | 1 | 6.4 × 10⁻⁴ | 4/4 | 8/9 | Cellular ketone metabolic process |
| 6082 | 1 | 6.8 × 10⁻⁴ | 4/4 | 8/9 | Organic acid metabolic process |
| 6800 | 1 | 1.2 × 10⁻³ | | 3/9 | Oxygen and reactive oxygen species metabolic process |
| 1950 | 1 | 1.7 × 10⁻⁴ | 3/4 | 7/9 | Plasma membrane enriched fraction |
| 5626 | 1 | 2.1 × 10⁻³ | 3/4 | 7/9 | Insoluble fraction |
| 5811 | 1 | 3.3 × 10⁻³ | | 3/9 | Lipid particle |
| 16251 | 2 | 8.9 × 10⁻⁵ | | 7/84 | General RNA polymerase II transcription factor activity |
| 30528 | 2 | 1.6 × 10⁻³ | 6/23 | 26/84 | Transcription regulator activity |
| 31202 | 2 | 1.6 × 10⁻³ | 4/23 | | RNA splicing factor activity, transesterification mechanism |
| 3677 | 2 | 6.1 × 10⁻³ | | 14/84 | DNA binding |
| 16070 | 2 | 9.2 × 10⁻¹² | 19/23 | 53/84 | RNA metabolic process |
| 10467 | 2 | 1.1 × 10⁻⁷ | 22/23 | 78/84 | Gene expression |
| 398 | 2 | 1.5 × 10⁻⁴ | 6/23 | 24/84 | Nuclear mRNA splicing via spliceosome |
| 375 | 2 | 2.3 × 10⁻⁴ | 6/23 | 26/84 | RNA splicing, via transesterification reactions |
| 45449 | 2 | 4.1 × 10⁻⁴ | 6/23 | 29/84 | Regulation of transcription |
| 80090 | 2 | 7.5 × 10⁻⁴ | 13/23 | 40/84 | Regulation of primary metabolic process |
| 34961 | 2 | 1.2 × 10⁻³ | 17/23 | 52/84 | Cellular biopolymer biosynthetic process |
| 9059 | 2 | 2.9 × 10⁻³ | 17/23 | 52/84 | Macromolecule biosynthetic process |
| 51171 | 2 | 6.1 × 10⁻³ | 8/23 | 31/84 | Regulation of nitrogen compound metabolic process |
| 5634 | 2 | 2.6 × 10⁻⁸ | | 15/84 | Nucleus |
| 32991 | 2 | 7.9 × 10⁻⁵ | | 24/84 | Macromolecular complex |
| 5681 | 2 | 1.3 × 10⁻⁴ | 5/23 | 19/84 | Spliceosomal complex |
| 43227 | 2 | 3.9 × 10⁻⁴ | 23/23 | 83/84 | Membrane-bounded organelle |
| 3735 | 3 | 1.3 × 10⁻²¹ | 16/17 | 75/75 | Structural constituent of ribosome |
| 6412 | 3 | 3.4 × 10⁻¹⁰ | 13/17 | 49/75 | Translation |
| 43284 | 3 | 4.4 × 10⁻⁸ | 17/17 | 75/75 | Biopolymer biosynthetic process |
| 34645 | 3 | 1.2 × 10⁻⁷ | 17/17 | 75/75 | Cellular macromolecule biosynthetic process |
| 9058 | 3 | 4.2 × 10⁻⁶ | | 49/75 | Biosynthetic process |
| 19538 | 3 | 7.2 × 10⁻⁶ | 13/17 | 49/75 | Protein metabolic process |
| 34960 | 3 | 4.8 × 10⁻⁴ | 17/17 | 75/75 | Cellular biopolymer metabolic process |
| 43170 | 3 | 9.4 × 10⁻⁴ | 17/17 | 75/75 | Macromolecule metabolic process |
| 33279 | 3 | 3.7 × 10⁻²¹ | 16/17 | 75/75 | Ribosomal subunit |
| 5829 | 3 | 1.1 × 10⁻¹³ | 16/17 | 74/75 | Cytosol |
| 22627 | 3 | 3.2 × 10⁻¹¹ | 8/17 | 33/75 | Cytosolic small ribosomal subunit |
| 43232 | 3 | 8.3 × 10⁻¹⁰ | 16/17 | 75/75 | Intracellular non-membrane-bounded organelle |
| 22625 | 3 | 1.1 × 10⁻⁹ | 8/17 | 41/75 | Cytosolic large ribosomal subunit |
| 32991 | 3 | 1.5 × 10⁻⁶ | 16/17 | 75/75 | Macromolecular complex |
| 44422 | 3 | 1.1 × 10⁻⁴ | 16/17 | 75/75 | Organelle part |
| 32040 | 3 | 5.4 × 10⁻³ | | 7/75 | Small-subunit processome |
| 51119 | 4 | 2.8 × 10⁻⁹ | 4/4 | 11/12 | Sugar transmembrane transporter activity |
| 5353 | 4 | 3.2 × 10⁻⁷ | 3/4 | 10/12 | Fructose transmembrane transporter activity |
| 15578 | 4 | 3.2 × 10⁻⁷ | 3/4 | 10/12 | Mannose transmembrane transporter activity |
| 5355 | 4 | 4.6 × 10⁻⁷ | 3/4 | 10/12 | Glucose transmembrane transporter activity |
| 22891 | 4 | 3.3 × 10⁻⁵ | 4/4 | 12/12 | Substrate-specific transmembrane transporter activity |
| 5215 | 4 | 1.2 × 10⁻⁴ | 4/4 | 12/12 | Transporter activity |
| 8645 | 4 | 1.2 × 10⁻⁹ | 4/4 | 9/12 | Hexose transport |
| 8643 | 4 | 9.7 × 10⁻⁹ | 4/4 | 11/12 | Carbohydrate transport |
| 55085 | 4 | 1.4 × 10⁻⁵ | 4/4 | 12/12 | Transmembrane transport |
| 51234 | 4 | 8.3 × 10⁻³ | 4/4 | 12/12 | Establishment of localization |
| 5886 | 4 | 5.0 × 10⁻³ | 3/4 | 11/12 | Plasma membrane |
Also shown is a comparison with the GO terms extracted by the Liu et al. method. There is a general trend that the fused clusters are more highly GO enriched. For example, we have highlighted in bold all the cases where a cluster from one method shows a percentage of GO enrichment (for a given term) that is at least 1.5 times higher than the other method. Note that only GO terms appearing in both cases are shown.
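The 1.5-times criterion used for the highlighting can be illustrated with a short Python sketch that compares the two count columns as exact fractions. The function name `enrichment_flag` and its return labels are hypothetical, and the sketch makes no claim about which rows were actually highlighted in the original table.

```python
from fractions import Fraction


def enrichment_flag(count_fused, count_liu, factor=Fraction(3, 2)):
    """Return 'fused' or 'liu' if that method's proportion is at least
    `factor` times the other's, otherwise None.

    Counts are strings of the form 'k/n' as in the table columns.
    """
    def parse(count):
        k, n = count.split("/")
        return Fraction(int(k), int(n))

    f, l = parse(count_fused), parse(count_liu)
    if f == 0 and l == 0:
        return None  # nothing to compare
    if f >= factor * l:
        return "fused"
    if l >= factor * f:
        return "liu"
    return None


# Row "Hexose transport" from the table: 4/4 (fused) versus 9/12 (Liu).
hexose = enrichment_flag("4/4", "9/12")

# Hypothetical counts, chosen only to show a flagged case.
example = enrichment_flag("4/4", "6/12")
```

For the hexose transport row the proportions are 1 and 0.75, a ratio of about 1.33, so neither method is flagged; the hypothetical second call has a ratio of 2 and is flagged for the fused clusters.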
Fig. 4. The ChIP-chip data for the fused genes of the galactose utilization (left) and cell cycle (right) dataset analyses. The data have been sorted by the clustering partition. Black pixels indicate a transcription factor that binds to that gene. The different shades of grey show the clustering partition.
The BHI scores for the cell cycle dataset
| Similarity matrix | w | No. of genes | BHI (all) | BHI (bp) | BHI (mf) | BHI (cc) |
|---|---|---|---|---|---|---|
| Fused genes | 0.5 | 266 | 0.33 ± 0.01 | 0.18 ± 0.02 | 0.17 ± 0.01 | 0.23 ± 0.01 |
| Fused genes | 1 | 1165 | 0.30 ± 0.01 | 0.09 ± 0.01 | 0.14 ± 0.01 | 0.20 ± 0.01 |
| Unfused (expression only) | 0 | 1165 | 0.28 ± 0.01 | 0.07 ± 0.01 | 0.14 ± 0.01 | 0.19 ± 0.01 |
| Unfused (expression only) | 0.5 | 898 | 0.31 ± 0.01 | 0.08 ± 0.01 | 0.16 ± 0.01 | 0.20 ± 0.01 |
| Unfused (ChIP-chip only) | 0 | 1165 | 0.30 ± 0.01 | 0.05 ± 0.01 | 0.12 ± 0.02 | 0.24 ± 0.02 |
| Unfused (ChIP-chip only) | 0.5 | 898 | 0.25 ± 0.03 | 0.06 ± 0.01 | 0.13 ± 0.03 | 0.21 ± 0.02 |
| Context-averaged (Liu et al.) | 0 | 1165 | 0.29 ± 0.01 | 0.09 ± 0.01 | 0.14 ± 0.01 | 0.20 ± 0.01 |
| Context-averaged | 0.5 | 1165 | 0.30 ± 0.01 | 0.08 ± 0.01 | 0.15 ± 0.01 | 0.20 ± 0.01 |
| Context-averaged | 1 | 1165 | 0.30 ± 0.01 | 0.09 ± 0.01 | 0.14 ± 0.01 | 0.20 ± 0.01 |
The fused genes are those with a posterior probability of being fused ≥0.5. All other genes are classed as unfused. Context-averaged similarity matrices are simply constructed by averaging the posterior similarity matrix over both contexts (i.e. datasets). This is the method used by Liu et al. For comparison, the result obtained using the BHC algorithm on just the gene expression data is 0.285.