Katrijn Van Deun, Age K Smilde, Mariët J van der Werf, Henk A L Kiers, Iven Van Mechelen.
Abstract
BACKGROUND: Data integration is currently one of the main challenges in the biomedical sciences. Often different pieces of information are gathered on the same set of entities (e.g., tissues, culture samples, biomolecules), with the different pieces stemming, for example, from different measurement techniques. As a result, more and more data sets consist of two or more data arrays that share a mode. An integrative analysis of such coupled data should be based on a simultaneous analysis of all data arrays. In this respect, the family of simultaneous component methods (e.g., SUM-PCA, unrestricted PCovR, MFA, STATIS, and SCA-P) is a natural choice. Yet, different simultaneous component methods may lead to quite different results.
Year: 2009 PMID: 19671149 PMCID: PMC2752463 DOI: 10.1186/1471-2105-10-246
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Illustrations of coupled data. Illustrations of coupled two-way two-mode data that share a single mode: in the left panel, three data matrices (gene expression data, motif data, and ChIP-chip data) share the row mode (genes); in the right panel, two data blocks (gas chromatography and liquid chromatography mass spectrometry data) share the column mode (fermentation batches).
Characterization of published simultaneous component methods in terms of the general framework.
| Method | Common mode | Pre-processing | Matrix-specific weights | Identification constraint |
| SUM-PCA | object | Variables: auto-scaling | All | Principal axes |
| unr. PCovR | object | Variables: auto-scaling | Minimize cross-validation error | Principal axes |
| MFA | object | Variables: auto-scaling | Inverse of largest singular value | Principal axes |
| STATIS | object | | Compromise weights | Principal axes |
| SCA-P | variable | Variables: auto-scaling | All | Principal axes |
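As a rough illustration of how the columns of this table fit together for object-mode coupling, the sketch below auto-scales the variables, multiplies each matrix by a matrix-specific weight, concatenates the weighted matrices, and takes a truncated SVD so that the components lie along the principal axes. By default it uses the MFA weight (inverse of the largest singular value). The function names, the NumPy implementation, and the option to pass other weights are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def autoscale(X):
    """Center each variable (column) and scale it to unit standard deviation."""
    Xc = X - X.mean(axis=0)
    return Xc / Xc.std(axis=0, ddof=1)

def simultaneous_components(blocks, n_components, weights=None):
    """Weighted SVD of matrices that share the object (row) mode.

    blocks  : list of (objects x variables_k) arrays with the same row count
    weights : optional list of matrix-specific weights (hypothetical
              parameter); if None, the MFA choice from the table is used,
              i.e. the inverse of each matrix's largest singular value.
    """
    scaled = [autoscale(np.asarray(X, dtype=float)) for X in blocks]
    if weights is None:
        weights = [1.0 / np.linalg.svd(X, compute_uv=False)[0] for X in scaled]
    # Concatenate the weighted matrices along the variable mode
    concatenated = np.hstack([w * X for w, X in zip(weights, scaled)])
    U, s, Vt = np.linalg.svd(concatenated, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # shared component scores
    loadings = Vt[:n_components].T                     # stacked block loadings
    return scores, loadings, weights
```

For data coupled in the variable mode (the SCA-P row), the matrices would be stacked vertically instead of side by side; that case is not covered by this sketch.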
Principles to realize a fair integration of different data matrices.
| Principle | Methods aiming at this principle |
| Same weight for all matrices (naive approach) | SCA-P |
| More weight for smaller matrices | SUM-PCA, MFA |
| More weight for less redundant matrices | MFA |
| More weight for matrices with more stable predictive information | PCovR |
| More weight for matrices that share more information with other matrices | STATIS |
Weights put on GC versus LC by different SCA methods.
| Method | GC | LC |
| PCovR GC¹ | 1.00 | 0 |
| STATIS | 0.99 | 0.01 |
| SCA-P | 0.77 | 0.23 |
| MFA | 0.66 | 0.34 |
| SUM-PCA | 0.50 | 0.50 |
| PCovR LC¹ | 0 | 1.00 |
1: Matrix-specific weights obtained by leave-one-out cross-validation, using the observed scores in the LC data to reproduce the scores of the GC data (PCovR GC) or the other way around (PCovR LC).
These weights have been calculated as the matrix-specific sum of squares divided by the sum of squares of the concatenated data.
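The calculation described in the note above can be sketched as follows: each matrix's share is its own sum of squares divided by the sum of squares of the concatenated data, which equals the total over all matrices. The function name and the toy matrices are made up for illustration; in practice the inputs would be the pre-processed, weighted GC and LC matrices.

```python
import numpy as np

def block_weight_shares(weighted_blocks):
    """Share of the total sum of squares contributed by each matrix."""
    ss = np.array([np.sum(np.square(X)) for X in weighted_blocks], dtype=float)
    # The concatenated data has sum of squares equal to the sum over matrices
    return ss / ss.sum()

# Toy example (invented numbers): sums of squares are 25 and 75
X1 = np.array([[3.0, 0.0], [0.0, 4.0]])
X2 = np.array([[5.0, 5.0, 5.0], [0.0, 0.0, 0.0]])
shares = block_weight_shares([X1, X2])   # array([0.25, 0.75])
```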
Tucker's coefficient of congruence between the component scores (R = 5).
| | PCovR GC | SCA-P | MFA | SUM-PCA | PCovR LC | LC² | STATIS |
| GC¹ | 1 | 0.91 | 0.86 | 0.81 | 0.55 | 0.55 | 0.13 |
| PCovR GC³ | | 0.91 | 0.86 | 0.81 | 0.55 | 0.55 | 0.13 |
| SCA-P | | | 0.99 | 0.96 | 0.73 | 0.73 | 0.12 |
| MFA | | | | 0.99 | 0.79 | 0.79 | 0.12 |
| SUM-PCA | | | | | 0.84 | 0.84 | 0.11 |
| PCovR LC⁴ | | | | | | 1 | 0.08 |
| LC | | | | | | | 0.08 |
1: Ordinary principal component analysis of GC data only.
2: Ordinary principal component analysis of LC data only.
3, 4: Matrix-specific weights obtained by leave-one-out cross-validation, using the observed scores in the LC data to reproduce the scores of the GC data (PCovR GC) or the other way around (PCovR LC).
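Tucker's coefficient of congruence between two vectors x and y is the sum of their elementwise products divided by the square root of the product of their sums of squares; unlike a correlation, it does not center the vectors. A minimal sketch is given below. The function name is an assumption, and applying it to flattened score matrices is only one possible way to summarize R = 5 components in a single number; the record does not state how the components were matched or aggregated.

```python
import numpy as np

def tucker_congruence(x, y):
    """Tucker's congruence: sum(x*y) / sqrt(sum(x**2) * sum(y**2))."""
    x = np.asarray(x, dtype=float).ravel()   # matrices are flattened
    y = np.asarray(y, dtype=float).ravel()
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

# Proportional vectors are perfectly congruent, orthogonal ones are not
same_direction = tucker_congruence([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # 1.0
orthogonal = tucker_congruence([1.0, 0.0], [0.0, 1.0])                 # 0.0
```

Because the coefficient depends on the sign and order of the components, score matrices from different methods would need to be aligned before comparison.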
Figure 2. Proportion of variance accounted for by the MFA solution. Proportion of variance accounted for by the MFA solution in each matrix (bars) and proportion of variance accounted for by separate component analyses (lines).
Component scores (labeled 'SC1' to 'SC5') after VARIMAX rotation of the MFA solution with five components and, on the last three lines, the variance accounted for by these components in the GC data (the 'GC' line), the LC data (the 'LC' line), and the concatenated data (the 'TOTAL' line).
| | SC1 | SC2 | SC3 | SC4 | SC5 |
| | 0.07 | -0.04 | 0.01 | -0.12 | |
| | -0.05 | -0.02 | -0.10 | -0.09 | -0.16 |
| | -0.01 | -0.05 | -0.07 | -0.06 | |
| | 0.10 | -0.01 | -0.09 | 0.01 | |
| | 0.18 | -0.17 | 0.17 | -0.03 | 0.10 |
| | 0.18 | -0.11 | -0.10 | ||
| | 0.03 | -0.07 | -0.10 | -0.03 | |
| | -0.17 | 0.03 | -0.09 | ||
| | -0.10 | 0.05 | -0.10 | 0.03 | |
| | 0.04 | 0.02 | 0.00 | -0.06 | |
| | 0.04 | 0.01 | -0.11 | -0.06 | |
| | -0.02 | -0.02 | -0.08 | -0.16 | |
| | -0.13 | 0.09 | -0.05 | 0.08 | |
| | -0.10 | 0.11 | -0.02 | 0.07 | |
| | 0.01 | -0.02 | -0.04 | 0.03 | |
| | -0.09 | 0.04 | -0.10 | 0.14 | |
| | 0.00 | -0.04 | -0.01 | -0.01 | |
| | -0.04 | -0.12 | 0.01 | 0.01 | -0.15 |
| | -0.03 | 0.06 | 0.00 | -0.19 | -0.09 |
| | -0.03 | -0.02 | -0.12 | 0.07 | |
| | 0.09 | -0.01 | 0.14 | ||
| | 0.07 | 0.10 | -0.02 | -0.13 | |
| | -0.01 | -0.02 | -0.01 | -0.03 | |
| | -0.05 | -0.05 | -0.05 | 0.11 | |
| | 0.19 | -0.10 | 0.06 | ||
| | 0.02 | -0.11 | -0.12 | 0.04 | |
| | 0.08 | -0.06 | |||
| | 0.11 | -0.06 | -0.03 | ||
| GC | 0.14 | 0.12 | 0.08 | 0.14 | 0.06 |
| LC | 0.10 | 0.14 | 0.20 | 0.07 | 0.13 |
| TOTAL | 0.13 | 0.13 | 0.12 | 0.12 | 0.09 |
In the part with component scores, the first column describes the experiments in relation to the reference condition and the number of hours the samples were in the bioreactor. Component scores ≥ .25 (in absolute value) are in boldface.
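VARIMAX searches for an orthogonal rotation that pushes the rotated entries toward either large or near-zero values, which is what makes a threshold such as .25 useful for reading the table. The sketch below shows the commonly used SVD-based iteration in NumPy. The function name, the stopping rule, and the choice of which matrix to rotate are assumptions; the record does not say which implementation the authors used.

```python
import numpy as np

def varimax(A, max_iter=100, tol=1e-8):
    """Rotate the columns of A orthogonally toward simple structure.

    Returns the rotated matrix A @ R and the orthogonal rotation R.
    """
    A = np.asarray(A, dtype=float)
    n, r = A.shape
    R = np.eye(r)
    objective = 0.0
    for _ in range(max_iter):
        B = A @ R
        # Target matrix derived from the gradient of the varimax criterion
        G = A.T @ (B**3 - B @ np.diag(np.sum(B**2, axis=0)) / n)
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt                      # closest orthogonal matrix to G
        new_objective = s.sum()
        if new_objective < objective * (1 + tol):
            break                       # no meaningful improvement
        objective = new_objective
    return A @ R, R
```

Because R is orthogonal, the rotation leaves the total variance accounted for by the five components unchanged and only redistributes it among them.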
Figure 3. Heatmap. Heat map of the metabolite loadings on the five components. Labels were used for 130 of the 188 metabolites (the labels 'unknown' were dropped).