Cécile Bazot, Nicolas Dobigeon, Jean-Yves Tourneret, Aimee K Zaas, Geoffrey S Ginsburg, Alfred O Hero.
Abstract
BACKGROUND: This paper introduces a new constrained model and the corresponding algorithm, called unsupervised Bayesian linear unmixing (uBLU), to identify biological signatures from high dimensional assays like gene expression microarrays. The basis for uBLU is a Bayesian model for the data samples which are represented as an additive mixture of random positive gene signatures, called factors, with random positive mixing coefficients, called factor scores, that specify the relative contribution of each signature to a specific sample. The particularity of the proposed method is that uBLU constrains the factor loadings to be non-negative and the factor scores to be probability distributions over the factors. Furthermore, it also provides estimates of the number of factors. A Gibbs sampling strategy is adopted here to generate random samples according to the posterior distribution of the factors, factor scores, and number of factors. These samples are then used to estimate all the unknown parameters.
Year: 2013 PMID: 23506672 PMCID: PMC3681645 DOI: 10.1186/1471-2105-14-99
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
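The constrained mixture described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' sampler: it only draws data from a model with non-negative factor loadings and factor scores that sum to one per sample. The Gamma prior on the loadings, the flat Dirichlet prior on the scores, the Gaussian noise, and the function name `simulate_mixture` are all assumptions made for illustration.

```python
import random


def simulate_mixture(n_genes, n_samples, n_factors, noise_sd=0.1, seed=0):
    """Draw Y = M A + noise with M >= 0 and each sample's scores on the simplex.

    All distributional choices here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Factor loadings M (genes x factors): non-negative by construction.
    loadings = [[rng.gammavariate(2.0, 1.0) for _ in range(n_factors)]
                for _ in range(n_genes)]
    # Factor scores: one probability vector per sample, drawn from a flat
    # Dirichlet by normalising independent Gamma(1, 1) variates.
    scores = []
    for _ in range(n_samples):
        draws = [rng.gammavariate(1.0, 1.0) for _ in range(n_factors)]
        total = sum(draws)
        scores.append([d / total for d in draws])
    # Observations Y (genes x samples): additive mixture plus Gaussian noise.
    observations = [
        [sum(loadings[i][r] * scores[j][r] for r in range(n_factors))
         + rng.gauss(0.0, noise_sd)
         for j in range(n_samples)]
        for i in range(n_genes)
    ]
    return observations, loadings, scores
```

Here `scores[j]` is the score vector of sample `j`, so the sum-to-one constraint can be checked per sample with `sum(scores[j])`.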
Synthetic datasets
| Dataset | Factor type |
|---|---|
| 1 | Peaky factors |
| 2 | Realistic factors |
| 3 | Orthogonal factors |
| 4 | Orthogonal and positive factors |
Simulation results for dataset
| | |||||
| N/A | N/A | 205.99 | 267.42 | ||
| | 6.04 | 61.12 | N/A | N/A | |
| | 0.97 | 9.78 | 325.58 | 67.14 | |
| N/A | N/A | 64.39 | 226.58 | ||
| | 2.00 | 2.00 | N/A | N/A | |
| | 0.30 | 0.28 | 75.87 | 41.33 | |
| N/A | N/A | 21.69 | 12.48 | ||
| | 3.49 | 3.50 | N/A | N/A | |
| | 1.49 | 1.50 | 23.24 | 27.43 | |
| GSAD (×10 | 20.38 | 20.38 | 24.04 | 37.35 | |
| RE | 9.12 | 9.12 | 1.94 | 9.16 | |
| Time ( | 1.24×10 | 0.71 | 47.15 | 0.39×10 | |
| | |||||
| 6.01 | 0.48 | 212.30 | 40.27 | ||
| | 0.60 | 6.53 | 681.42 | 147.74 | |
| | 0.54 | 5.86 | 137.22 | 94.90 | |
| 6.62 | 0.19 | 76.09 | 45.29 | ||
| | 0.04 | 2.40 | 142.72 | 17.37 | |
| | 0.84 | 0.05 | 76.22 | 33.78 | |
| 1.86 | 0.53 | 10.68 | 11.86 | ||
| | 1.18 | 0.31 | 15.18 | 12.50 | |
| | 0.28 | 1.36 | 5.33 | 13.96 | |
| GSAD (×10 | 3.39 | 3.38 | 24.23 | 33.38 | |
| RE | 0.18 | 1.84 | 0.18 | ||
| Time ( | 1.24×10 | 0.95 | 53.60 | 0.56×10 | |
| | |||||
| 6.02 | 87.78 | 205.66 | 195.89 | ||
| | 0.60 | 6.53 | 247.96 | 101.34 | |
| | 0.54 | 8.03 | 330.01 | 68.69 | |
| 23.82 | 26.56 | 64.59 | 57.58 | ||
| | 11.70 | 0.23 | 114.02 | 3.10 | |
| | 6.37 | 18.04 | 75.47 | 27.72 | |
| 1.86 | 6.14 | 9.74 | 8.84 | ||
| | 1.18 | 0.31 | 22.15 | 26.80 | |
| | 0.28 | 1.36 | 8.17 | 27.32 | |
| GSAD (×10 | 3.39 | 3.36 | 28.62 | 29.23 | |
| RE | 0.18 | 2.08 | 0.18 | ||
| Time ( | 1.24×10 | 0.96 | 63.88 | 0.70×10 | |
MSEs, GMSEs, SADs, GSADs, REs and computational times between the proposed uBLU algorithm and PCA, NMF, BFRM and GB-GMF methods.
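The metric definitions are not reproduced in this record. As a rough guide, the sketch below shows one plausible reading of two of them, assuming that SAD denotes the spectral angle distance commonly used in linear unmixing and that the MSE is the squared error between a true factor and its estimate. The function names and the absence of any normalisation are assumptions, not the paper's definitions.

```python
import math


def squared_error(true_factor, estimated_factor):
    """Sum of squared differences between a true factor and its estimate."""
    return sum((t - e) ** 2 for t, e in zip(true_factor, estimated_factor))


def spectral_angle(true_factor, estimated_factor):
    """Angle (radians) between two factors; zero when they are collinear."""
    dot = sum(t * e for t, e in zip(true_factor, estimated_factor))
    norm_true = math.sqrt(sum(t * t for t in true_factor))
    norm_est = math.sqrt(sum(e * e for e in estimated_factor))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cosine = max(-1.0, min(1.0, dot / (norm_true * norm_est)))
    return math.acos(cosine)
```

Under this reading, the spectral angle ignores the scale of the estimate, whereas the squared error does not, which is one reason the two kinds of metric are often reported together.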
Simulation results for dataset
| | |||||
| 1.97 | N/A | N/A | N/A | ||
| | N/A | 1.06 | 37.67 | 58.75 | |
| | 0.14 | 26.68 | 52.09 | 150.09 | |
| 0.34 | N/A | N/A | N/A | ||
| | N/A | 1.12 | 1.17 | 22.37 | |
| | 0.94 | 6.24 | 0.62 | 1.18 | |
| 0.44 | N/A | N/A | N/A | ||
| | N/A | 1.32 | 16.53 | 13.34 | |
| | 0.47 | 3.72 | 15.21 | 18.14 | |
| GSAD (×10 | 1.51 | 1.53 | 37.99 | 129.40 | |
| RE (×10 | 1.62 | 1.65 | 0.65 | 5.47 | |
| Time ( | 22.06×10 | 32.02 | 4.07×10 | 9.24×10 | |
| | |||||
| 1.97 | 14.87 | 24.41 | 61.00 | ||
| | 0.14 | 20.53 | 50.59 | 58.31 | |
| | 0.14 | 14.02 | 35.89 | 65.11 | |
| 0.34 | 0.34 | 1.41 | 4.80 | ||
| | 0.15 | 2.44 | 0.65 | 9.40 | |
| | 0.09 | 0.92 | 1.19 | 5.40 | |
| 0.44 | 2.84 | 14.35 | 13.72 | ||
| | 0.48 | 4.75 | 15.47 | 13.62 | |
| | 0.47 | 4.00 | 17.50 | 15.82 | |
| GSAD (×10 | 1.02 | 1.49 | 29.29 | 129.29 | |
| RE (×10 | 0.64 | 1.55 | 0.75 | 1.62 | |
| Time ( | 22.06×10 | 45.91 | 5.37×10 | 16.59×10 | |
| | |||||
| 1.97 | 13.13 | 24.25 | 64.90 | ||
| | 0.14 | 20.53 | 50.52 | 64.09 | |
| | 0.14 | 14.02 | 28.32 | 69.99 | |
| 0.34 | 0.20 | 1.42 | 15.12 | ||
| | 0.48 | 1.00 | 0.65 | 9.55 | |
| | 0.09 | 0.44 | 1.31 | 7.73 | |
| 0.44 | 2.54 | 14.74 | 14.53 | ||
| | 0.48 | 5.52 | 15.45 | 14.55 | |
| | 0.47 | 4.79 | 16.45 | 16.17 | |
| GSAD (×10 | 1.02 | 1.06 | 40.36 | 129.29 | |
| RE (×10 | 0.64 | 0.69 | 0.86 | 1.50 | |
| Time ( | 22.06×10 | 55.86 | 5.59×10 | 16.59×10 | |
MSEs, GMSEs, SADs, GSADs, REs and computational times between the proposed uBLU algorithm and PCA, NMF, BFRM and GB-GMF methods.
Simulation results for dataset
| | |||||
| 0.83 | 0.82 | N/A | 1.14 | ||
| | 0.85 | 0.92 | 1.34 | 2.30 | |
| | N/A | N/A | 1.36 | N/A | |
| 7.75 | 7.72 | N/A | 8.94 | ||
| | 7.76 | 0.48 | 12.30 | 11.86 | |
| | N/A | N/A | 11.05 | N/A | |
| 7.09 | 7.04 | N/A | 15.55 | ||
| | 7.13 | 7.19 | 8.41 | 16.43 | |
| | 8.71 | N/A | N/A | N/A | |
| GSAD (×10 | 3.23 | 2.59 | 6.59 | 15.26 | |
| RE (×10 | 3.11 | 0.70 | 0.70 | 2.50 | |
| Time ( | 1.59×10 | 0.70 | 42.02 | 0.40×10 | |
| | |||||
| 0.15 | 0.15 | 1.74 | 1.20 | ||
| | 0.85 | 1.02 | 1.76 | 2.26 | |
| | 1.15 | 1.57 | 1.55 | 2.40 | |
| 7.75 | 14.89 | 11.40 | 14.09 | ||
| | 7.76 | 0.40 | 12.11 | 12.33 | |
| | 9.84 | 0.30 | 10.94 | 12.76 | |
| 2.60 | 2.47 | 11.34 | 15.76 | ||
| | 7.13 | 7.16 | 9.45 | 16.40 | |
| | 8.71 | 8.80 | 9.06 | 15.66 | |
| GSAD (×10 | 3.23 | 1.71 | 6.88 | 15.20 | |
| RE (×10 | 3.11 | 0.29 | 0.49 | 2.44 | |
| Time ( | 1.59×10 | 1.24 | 59.72 | 0.54×10 | |
| | |||||
| 0.02 | 1.43 | 1.43 | 1.19 | ||
| | 1.48 | 5.49 | 3.92 | 2.06 | |
| | 1.15 | 1.68 | 1.88 | 2.33 | |
| 13.78 | 20.56 | 16.66 | 13.15 | ||
| | 7.76 | 12.36 | 15.34 | 11.75 | |
| | 9.84 | 3.99 | 11.25 | 13.29 | |
| 0.97 | 10.27 | 10.24 | 15.97 | ||
| | 7.93 | 15.78 | 16.45 | 14.92 | |
| | 8.71 | 8.66 | 10.98 | 15.89 | |
| GSAD (×10 | 3.23 | 1.20 | 5.51 | 15.98 | |
| RE (×10 | 3.11 | 0.16 | 0.41 | 2.45 | |
| Time ( | 1.59×10 | 1.15 | 67.71 | 0.69×10 | |
MSEs, GMSEs, SADs, GSADs, REs and computational times between the proposed uBLU algorithm and PCA, NMF, BFRM and GB-GMF methods.
Simulation results for dataset
| | |||||
| N/A | 5.12 | N/A | N/A | ||
| | 1.61 | 3.59 | 15.35 | 18.69 | |
| | 0.44 | N/A | 14.42 | 19.20 | |
| N/A | 3.23 | N/A | N/A | ||
| | 0.87 | 2.65 | 0.33 | 1.62 | |
| | 0.69 | 0.76 | N/A | 1.30 | |
| N/A | 4.25 | N/A | N/A | ||
| | 3.08 | 3.71 | 14.90 | 14.89 | |
| | 0.68 | N/A | 15.59 | 15.70 | |
| GSAD (×10 | 5.24 | 5.25 | 157.09 | 156.19 | |
| RE (×10 | 4.88 | 4.89 | 19.34 | 8.48 | |
| Time ( | 1.61×10 | 1.36 | 35.29 | 0.40×10 | |
| | |||||
| 0.02 | 6.18 | 18.38 | 21.63 | ||
| | 1.61 | 4.79 | 16.10 | 19.55 | |
| | 0.09 | 4.21 | 15.04 | 19.85 | |
| 0.28 | 1.67 | 1.44 | 1.29 | ||
| | 0.87 | 1.01 | 0.37 | 1.75 | |
| | 0.69 | 0.94 | 0.26 | 1.17 | |
| 0.34 | 4.12 | 15.21 | 15.65 | ||
| | 3.08 | 4.09 | 15.26 | 15.90 | |
| | 0.51 | 4.16 | 16.07 | 15.36 | |
| GSAD (×10 | 4.97 | 4.99 | 157.08 | 154.80 | |
| RE (×10 | 4.49 | 4.36 | 25.00 | 8.48 | |
| Time ( | 1.61×10 | 1.78 | 41.05 | 0.55×10 | |
| | |||||
| 0.02 | 6.98 | 17.51 | 21.60 | ||
| | 1.61 | 7.30 | 15.07 | 19.03 | |
| | 0.07 | 4.27 | 14.55 | 19.14 | |
| 0.28 | 0.65 | 0.75 | 1.29 | ||
| | 0.87 | 0.91 | 0.77 | 1.18 | |
| | 0.69 | 0.56 | 0.56 | 1.33 | |
| 0.34 | 4.41 | 15.61 | 15.51 | ||
| | 3.08 | 4.81 | 16.31 | 14.77 | |
| | 0.51 | 4.00 | 15.84 | 15.26 | |
| GSAD (×10 | 4.97 | 4.94 | 156.76 | 162.63 | |
| RE (×10 | 4.49 | 4.33 | 13.48 | 8.29 | |
| Time ( | 1.61×10 | 1.56 | 48.22 | 0.70×10 | |
MSEs, GMSEs, SADs, GSADs, REs and computational times between the proposed uBLU algorithm and PCA, NMF, BFRM and GB-GMF methods.
Figure 1. Experimental results on the H3N2 viral challenge dataset of gene expression profiles. (a) Estimated posterior distribution of the number of factors R. (b) Factor loadings ranked by decreasing dominance. (c) Heatmap of the factor scores of the inflammatory component, which clearly separates symptomatic subjects (bottom 9 rows) and shows the time course of their molecular inflammatory response. The five black-colored pixels indicate samples that were not assayed.
Figure 2. Reconstruction error and estimated number of factors as a function of the number of iterations (H3N2 challenge data). Top: Reconstruction error $\mathrm{RE}^{(t)}$ computed from the observation matrix $\mathbf{Y}$ and the estimated matrices $\mathbf{M}^{(t)}$ and $\mathbf{A}^{(t)}$ as a function of the iteration index $t$. Bottom: Estimated number of factors $R^{(t)}$ as a function of the iteration number $t$.
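The caption above says the reconstruction error is computed from the observation matrix Y and the current estimates of M and A. A minimal sketch follows, assuming the error is the mean squared residual of Y minus the product M A. The exact normalisation used in the paper is not given here, and the function name is hypothetical.

```python
def reconstruction_error(Y, M, A):
    """Mean squared residual between Y (G x N) and the product M (G x R) A (R x N).

    The normalisation by G * N is an assumption made for illustration.
    """
    n_genes, n_samples, n_factors = len(Y), len(Y[0]), len(A)
    total = 0.0
    for i in range(n_genes):
        for j in range(n_samples):
            # Fitted value for gene i in sample j under the linear mixture.
            fitted = sum(M[i][r] * A[r][j] for r in range(n_factors))
            total += (Y[i][j] - fitted) ** 2
    return total / (n_genes * n_samples)
```

Evaluating this quantity at each iteration of a sampler gives a trace like the one described in the top panel.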
NCI-curated pathway associations of the group of genes contributing to the uBLU inflammatory component
| Pathway | Genes | p-value |
|---|---|---|
| IFN-gamma pathway | CASP1, CEBPB, IL1B, IRF1, IRF9, PRKCD, SOCS1, STAT1, STAT3 | 1.34e-09 |
| PDGFR-beta signaling pathway | DOCK4, EIF2AK2, FYN, HCK, LYN, PRKCD, SLA, SRC, STAT1, STAT3, STAT5A, STAT5B | 3.26e-08 |
| IL23-mediated signaling events | CCL2, CXCL1, CXCL9, IL1B, STAT1, STAT3, STAT5A | 2.18e-07 |
| Signaling events mediated by TCPTP | EIF2AK2, SRC, STAT1, STAT3, STAT5A, STAT5B, STAT6 | 6.38e-07 |
| Signaling events mediated by PTP1B | FYN, HCK, LYN, SRC, STAT3, STAT5A, STAT5B | 2.40e-06 |
| GMCSF-mediated signaling events | CCL2, LYN, STAT1, STAT3, STAT5A, STAT5B | 3.70e-06 |
| IL12-mediated signaling events | HLA-A, IL1B, SOCS1, STAT1, STAT3, STAT5A, STAT6 | 1.32e-05 |
| IL6-mediated signaling events | CEBPB, HCK, IRF1, PRKCD, STAT1, STAT3 | 1.80e-05 |
NCI-curated pathway associations of the group of genes contributing to the uBLU inflammatory component, whose factor scores are shown in Figure 1 (Source: NCI pathway interaction database http://pid.nci.nih.gov). Genes in the uBLU factor are significantly better represented in the NCI-curated pathways than the genes in the NMF factor (compare p-values here to those in Table 8).
NCI-curated pathway associations of the group of genes contributing to the NMF inflammatory component
| Pathway | Genes | p-value |
|---|---|---|
| IL23-mediated signaling events | CCL2, CXCL1, CXCL9, IL1B, JAK2, STAT1, STAT5A | 2.18e-07 |
| IL12-mediated signaling events | GADD45B, IL1B, JAK2, MAP2K6, SOCS1, STAT1, STAT5A, STAT6 | 1.10e-06 |
| IFN-gamma pathway | CASP1, IL1B, IRF9, JAK2, SOCS1, STAT1 | 1.07e-05 |
| Signaling events mediated by TCPTP | EIF2AK2, PIK3R2, STAT1, STAT5A, STAT5B, STAT6 | 1.07e-05 |
| IL27-mediated signaling events | IL1B, JAK2, STAT1, STAT2, STAT5A | 1.22e-05 |
| CXCR3-mediated signaling events | CXCL10, CXCL11, CXCL13, CXCL9, MAP2K6, PIK3R2 | 1.23e-05 |
| GMCSF-mediated signaling events | CCL2, JAK2, STAT1, STAT5A, STAT5B | 6.24e-05 |
| PDGFR-beta signaling pathway | EIF2AK2, JAK2, PIK3R2, ARAP1, DOCK4, STAT1, STAT5A, STAT5B | 1.38e-04 |
NCI-curated pathway associations of the group of genes contributing to the NMF inflammatory component, whose factor scores are shown in Figure 4 (Source: NCI pathway interaction database http://pid.nci.nih.gov). Genes in the uBLU factor are significantly better represented in the NCI-curated pathways than the genes in the NMF factor (compare p-values here to those in Table 6).
Figure 3. Factor loadings ranked by decreasing dominance for H3N2 challenge data. uBLU shows a particularly strong component (Figure 1b), group #1, that corresponds to the well-known inflammatory pathway. NMF and PCA algorithms also reveal an inflammatory component, but it includes fewer relevant genes than uBLU. See Figure 4 for the corresponding factor scores.
Figure 4. Heatmaps of the factor scores of the inflammatory component for H3N2 challenge data. The inflammatory factor determined by the proposed uBLU method (a) shows higher contrast between symptomatic and asymptomatic subjects than the other methods. The five black-colored pixels of the heatmaps indicate samples that were not assayed.
Simulation results for real H3N2 dataset
| Fisher criteria (×10⁻²) (22) | 2.03 | 6.17 | 4.68 | 2.30 | |
| RE | 4.89 | 7.31×10⁻² | 4.82 | 9.51×10⁻² | |
| Time | ≈ 12 | 116 | ≈ 47 | ≈ 10 | |
| Number of iterations | 10 000 | N/A | 5 000 | 10 000 | 500 |
Fisher linear discriminant measure ([23], p. 119) between post-onset symptomatic samples and the other samples on the heatmaps (Figure 4); reconstruction error (RE) between the observed data and the MAP estimators; computational times (for an implementation in MATLAB 7.8.0 (R2009a) on a 3 GHz Intel(R) Core(TM)2 Duo processor); and corresponding number of iterations.
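For readers unfamiliar with the Fisher linear discriminant measure, the sketch below shows its usual one-dimensional form: the squared difference between the two group means divided by the sum of the within-group scatters. It assumes each sample is summarised by a single factor score and uses scatter (sum of squared deviations) rather than variance. Whether this matches the paper's equation (22) exactly is not confirmed here, and the function name is hypothetical.

```python
def fisher_criterion(group_a, group_b):
    """Squared mean difference over summed within-group scatter (1-D case)."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    # Within-group scatter: sum of squared deviations from the group mean.
    scatter_a = sum((x - mean_a) ** 2 for x in group_a)
    scatter_b = sum((x - mean_b) ** 2 for x in group_b)
    return (mean_a - mean_b) ** 2 / (scatter_a + scatter_b)
```

A larger value indicates that the two groups, for example post-onset symptomatic samples and all other samples, are better separated by the factor score.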
Figure 5. Chip clouds after demixing for H3N2 challenge data. These figures show the scatter of the four-dimensional factor score vectors (projected onto the plane using MDS) for each algorithm that was compared to uBLU. uBLU, NMF and BFRM obtain a clean separation of samples of symptomatic (red points) and asymptomatic (blue points) subjects, whereas the separation is less clear with PCA. In these scatter plots, the size of a point is proportional to the time at which the sample was taken during the challenge study.
Figure 6. Contribution of each constraint on the scores of the inflammatory factor (H3N2 challenge data). The five black-colored pixels of the heatmaps indicate samples that were not assayed. Note that when only the sum-to-one constraint is applied, non-negativity is not guaranteed. However, for this dataset the sum-to-one factor scores turn out to take on non-negative values for the inflammatory factor (but not for the other factors).
Contribution of each of uBLU’s constraints
| | ||||
|---|---|---|---|---|
| P-value of the “IFN-gamma pathway” | 6.00×10⁻² | 2.05×10⁻² | 2.17×10⁻¹ | |
| P-value of the “IL23-mediated signaling events” | 2.60×10⁻¹ | 8.37×10⁻² | 2.28×10⁻² |
Benefit of constraints in uBLU in terms of gene enrichment in the NCI-curated IFN-gamma and IL23-mediated pathways. As in Tables 6 and 8, the top 200 genes in the inflammatory components, whose scores are shown in Figures 6(a-d), were analyzed using the NCI Pathway Interaction Database. Both positivity and sum-to-one constraints are necessary for uBLU to reveal these two pathways with high significance (p-value less than 10⁻⁶).