Paul Kirk, Jim E Griffin, Richard S Savage, Zoubin Ghahramani, David L Wild.
Abstract
MOTIVATION: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct, but often complementary, information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously (including the ability to model time series data explicitly using Gaussian processes). Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets.
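As a rough illustration of the per-dataset building block mentioned in the abstract, the following minimal sketch draws component allocations from a symmetric Dirichlet-multinomial prior for a single dataset. It does not include the between-dataset dependence parameters that define MDI, and the function name, the `concentration` argument and the toy sizes are assumptions made for illustration rather than anything taken from the paper.

```python
import random


def sample_dma_allocations(n_items, n_components, concentration, rng):
    """Draw mixture weights and allocations from a symmetric
    Dirichlet-multinomial prior (illustrative sketch only)."""
    # A symmetric Dirichlet draw is obtained by normalising independent
    # Gamma(concentration, 1) variates.
    gammas = [rng.gammavariate(concentration, 1.0) for _ in range(n_components)]
    total = sum(gammas)
    weights = [g / total for g in gammas]
    # Each item is allocated to a component with probability given by
    # the mixture weights.
    allocations = rng.choices(range(n_components), weights=weights, k=n_items)
    return weights, allocations


# Hypothetical toy settings: 20 items, 5 components.
rng = random.Random(0)
weights, allocations = sample_dma_allocations(
    n_items=20, n_components=5, concentration=1.0, rng=rng
)
```

In a full mixture model each allocated component would also carry its own parameters for generating the observations; that part is omitted here.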
Year: 2012 PMID: 23047558 PMCID: PMC3519452 DOI: 10.1093/bioinformatics/bts595
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Graphical representation of three DMA mixture models. (a) Independent case. (b) The MDI model. In both (a) and (b), each observation in dataset k is generated by one mixture component. The prior probabilities associated with the distinct component allocation variables are collected in a vector, which is itself assigned a symmetric Dirichlet prior. The parameter vector for component c in dataset k is assigned its own prior. In (b), there are additional parameters, each of which models the dependence between the component allocations of observations in dataset k and those in another dataset.
Fig. 2. (a) The data for the six-dataset synthetic example, separated into seven clusters. (b) A representation of how the cluster labels associated with each gene vary from dataset to dataset. Genes are ordered so that the clustering of Dataset 1 is the one that appears coherent. (c) A table showing the number of genes having the same cluster labels in datasets i and j. (d) A heatmap depiction of the similarity matrix formed by calculating the adjusted Rand index (ARI) between pairs of datasets.
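The two pairwise summaries described in panels (c) and (d) can be sketched as follows. This is a minimal illustration using the standard ARI formula, not the authors' code; the function names and the toy label vectors are hypothetical, and the same-label count assumes that cluster labels are directly comparable across datasets.

```python
from collections import Counter
from math import comb


def same_label_count(labels_a, labels_b):
    """Number of items carrying the same cluster label in both datasets."""
    return sum(a == b for a, b in zip(labels_a, labels_b))


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same items."""
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treat as trivially identical
    # Pairs co-clustered in both partitions, in the first, and in the second.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


def pairwise_matrix(datasets, score):
    """Apply a pairwise score to every pair of label vectors."""
    return [[score(a, b) for b in datasets] for a in datasets]


# Hypothetical toy labels for three datasets over four genes.
toy = [[0, 0, 1, 1], [0, 0, 1, 2], [0, 0, 1, 1]]
ari_matrix = pairwise_matrix(toy, adjusted_rand_index)
count_matrix = pairwise_matrix(toy, same_label_count)
```

Unlike the raw count, the ARI does not depend on how the clusters are numbered, which is why the two summaries can disagree when labels are permuted between datasets.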
Fig. 3. (a) Densities fitted to the sampled parameter values. (b) Heatmap representation of the matrix whose entries are the corresponding posterior mean values.
BHI scores for the fused clusters obtained using the method of Savage et al., together with those obtained using MDI
| Method | BHI (all) | BHI (bp) | BHI (mf) | BHI (cc) | Number of genes |
|---|---|---|---|---|---|
| Savage et al. | 0.98 | 0.85 | 0.71 | 0.98 | 72 |
| MDI (bag-of-words) | 0.98 | 0.85 | 0.72 | 0.97 | 172 |
| MDI (multinomial) | 1.00 | 0.89 | 0.77 | 1.00 | 52 |
GOTO scores for fused clusters obtained for all combinations of the expression, ChIP and PPI datasets
| Dataset(s) | GOTO (bp) | GOTO (mf) | GOTO (cc) | Number of genes |
|---|---|---|---|---|
| ChIP | 6.36 | 0.97 | 8.53 | 551 |
| PPI | 11.04 | 1.51 | 11.11 | 551 |
| Expression | 7.66 | 1.15 | 9.48 | 551 |
| ChIP + PPI | 27.04 | 3.47 | 18.99 | 31 |
| ChIP + Expression | 24.46 | 2.93 | 16.87 | 48 |
| PPI + Expression | 26.04 | 3.69 | 22.35 | 32 |
| ChIP + PPI + Expression | 34.81 | 2.46 | 26.70 | 16 |
Fig. 4. (a) Pairwise fusion probabilities for the 31 genes identified as fused across the ChIP and PPI datasets in the ‘Expression + ChIP + PPI’ example. Colours correspond to fused clusters and the dashed line indicates the fusion threshold. (b) Three-way fusion probabilities for the same 31 genes. Genes that do not exceed the fusion threshold have white bars. (c) The expression profiles for genes identified as fused according to the ChIP and PPI datasets. The coloured lines indicate genes that are also fused across the expression dataset.
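One plausible way to obtain quantities like those plotted in panel (a) is sketched below. It assumes that a gene's pairwise fusion probability is estimated as the fraction of posterior samples in which the gene receives the same component label in both datasets, and that a gene counts as fused when this fraction exceeds a threshold. The function names, the toy samples and the threshold value of 0.5 are all illustrative assumptions; the record above does not state the threshold used.

```python
def fusion_probabilities(samples_a, samples_b):
    """Per-gene fraction of samples with matching labels in two datasets.

    samples_x[s][g] is the component label of gene g in posterior
    sample s for dataset x (assumed layout).
    """
    n_samples = len(samples_a)
    n_genes = len(samples_a[0])
    return [
        sum(samples_a[s][g] == samples_b[s][g] for s in range(n_samples)) / n_samples
        for g in range(n_genes)
    ]


def fused_genes(probabilities, threshold):
    """Indices of genes whose fusion probability exceeds the threshold."""
    return [g for g, p in enumerate(probabilities) if p > threshold]


# Hypothetical toy samples: 4 posterior samples, 3 genes, 2 datasets.
samples_chip = [[0, 1, 2], [0, 1, 2], [0, 1, 1], [0, 2, 1]]
samples_ppi = [[0, 1, 0], [0, 1, 1], [0, 2, 1], [0, 1, 2]]

probabilities = fusion_probabilities(samples_chip, samples_ppi)
fused = fused_genes(probabilities, threshold=0.5)  # 0.5 is an assumed value
```

A three-way version, as in panel (b), would require the labels to match across all three datasets within each sample.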
Clusters formed by the genes fused across all 3 datasets
| ID | Gene | Brief description |
|---|---|---|
| 2 | | Involved in synthesis of 40S ribosomal subunits |
| 2 | | Required for biogenesis of the small ribosomal subunit |
| 2 | | Involved in assembly of 60S ribosomal subunit |
| 2 | | Component of the SSU processome |
| 2 | | Involved in biogenesis of 60S ribosomal subunit |
| 3 | | Histone H4, core histone protein |
| 3 | | Histone H2B, core histone protein |
| 3 | | Histone H2A, core histone protein |
| 3 | | Histone H3, core histone protein |
| 3 | | Histone H2B, core histone protein |
| 3 | | Histone H3, core histone protein |
| 3 | | Histone H4, core histone protein |
| 5 | | Subunit of the cohesin complex |
| 5 | | Subunit of the cohesin complex |
Descriptions were derived from the Saccharomyces Genome Database (Cherry et al.). The IDs in this table correspond to the cluster IDs in Figure 4, with singletons omitted.