Sri Priya Ponnapalli, Michael A Saunders, Charles F Van Loan, Orly Alter.
Abstract
The number of high-dimensional datasets recording multiple aspects of a single phenomenon is increasing in many areas of science, accompanied by a need for mathematical frameworks that can compare multiple large-scale matrices with different row dimensions. The only such framework to date, the generalized singular value decomposition (GSVD), is limited to two matrices. We mathematically define a higher-order GSVD (HO GSVD) for $N \ge 2$ matrices $D_i \in \mathbb{R}^{m_i \times n}$, each with full column rank. Each matrix is exactly factored as $D_i = U_i \Sigma_i V^T$, where $V$, identical in all factorizations, is obtained from the eigensystem $SV = V\Lambda$ of the arithmetic mean $S$ of all pairwise quotients $A_i A_j^{-1}$ of the matrices $A_i = D_i^T D_i$, $i \ne j$. We prove that this decomposition extends to higher orders almost all of the mathematical properties of the GSVD. The matrix $S$ is nondefective with $V$ and $\Lambda$ real. Its eigenvalues satisfy $\lambda_k \ge 1$. Equality holds if and only if the corresponding eigenvector $v_k$ is a right basis vector of equal significance in all matrices $D_i$ and $D_j$, that is, $\sigma_{i,k}/\sigma_{j,k} = 1$ for all $i$ and $j$, and the corresponding left basis vector $u_{i,k}$ is orthogonal to all other vectors in $U_i$ for all $i$. The eigenvalues $\lambda_k = 1$, therefore, define the "common HO GSVD subspace." We illustrate the HO GSVD with a comparison of genome-scale cell-cycle mRNA expression from S. pombe, S. cerevisiae and human. Unlike existing algorithms, a mapping among the genes of these disparate organisms is not required. We find that the approximately common HO GSVD subspace represents the cell-cycle mRNA expression oscillations, which are similar among the datasets. Simultaneous reconstruction in the common subspace, therefore, removes the experimental artifacts, which are dissimilar, from the datasets. 
In the simultaneous sequence-independent classification of the genes of the three organisms in this common subspace, genes of highly conserved sequences but significantly different cell-cycle peak times are correctly classified.
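The construction described in the abstract can be sketched numerically. The following is a minimal NumPy illustration, not the authors' implementation: the function name `ho_gsvd` is hypothetical, the explicit matrix inverses are used only for readability, and the step that recovers the left basis vectors and singular values from the shared right basis vectors (solving for $U_i \Sigma_i$ and taking column norms) is an assumption that is merely consistent with the exact factorization stated above. The descending ordering of the eigenvalues is likewise an assumption.

```python
import numpy as np

def ho_gsvd(datasets):
    """Sketch of an HO GSVD for matrices D_i (m_i x n) of full column rank.

    Returns ([(U_i, sigma_i), ...], V, eigenvalues), so that each D_i is
    recovered as (U_i * sigma_i) @ V.T.
    """
    n_mats = len(datasets)
    grams = [D.T @ D for D in datasets]              # A_i = D_i^T D_i
    inverses = [np.linalg.inv(A) for A in grams]     # A_i^{-1} (fine for a sketch)

    # Arithmetic mean S of all pairwise quotients A_i A_j^{-1}, i != j.
    S = np.zeros_like(grams[0])
    for i in range(n_mats):
        for j in range(n_mats):
            if i != j:
                S += grams[i] @ inverses[j]
    S /= n_mats * (n_mats - 1)

    # Eigensystem S V = V Lambda; the abstract states V and Lambda are real,
    # so any tiny imaginary round-off is discarded.
    eigenvalues, V = np.linalg.eig(S)
    eigenvalues, V = eigenvalues.real, V.real
    order = np.argsort(eigenvalues)[::-1]            # assumed ordering: descending
    eigenvalues, V = eigenvalues[order], V[:, order]
    V = V / np.linalg.norm(V, axis=0)                # unit-norm right basis vectors

    # Assumed recovery step: solve V B_i^T = D_i^T, so that B_i = U_i Sigma_i.
    factors = []
    for D in datasets:
        B = np.linalg.solve(V, D.T).T
        sigma = np.linalg.norm(B, axis=0)            # sigma_{i,k} = |b_{i,k}|
        factors.append((B / sigma, sigma))           # columns of U_i have unit norm
    return factors, V, eigenvalues

# Usage with synthetic data: three matrices sharing n = 4 columns.
rng = np.random.default_rng(0)
datasets = [rng.standard_normal((m, 4)) for m in (30, 20, 25)]
factors, V, eigenvalues = ho_gsvd(datasets)
print(eigenvalues)
```

Eigenvalues close to one in the printed output would indicate right basis vectors of nearly equal significance in all matrices, in the sense the abstract describes.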
Year: 2011 PMID: 22216090 PMCID: PMC3245232 DOI: 10.1371/journal.pone.0028072
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Higher-order generalized singular value decomposition (HO GSVD).
In this raster display of Equation (1) with overexpression (red), no change in expression (black), and underexpression (green) centered at gene- and array-invariant expression, the S. pombe, S. cerevisiae and human global mRNA expression datasets are tabulated as organism-specific genes × 17-arrays matrices $D_1$, $D_2$ and $D_3$. The underlying assumption is that there exists a one-to-one mapping among the 17 columns of the three matrices but not necessarily among their rows. These matrices are transformed to the reduced diagonalized matrices $\Sigma_1$, $\Sigma_2$ and $\Sigma_3$, each of 17-“arraylets,” i.e., left basis vectors × 17-“genelets,” i.e., right basis vectors, by using the organism-specific genes × 17-arraylets transformation matrices $U_1$, $U_2$ and $U_3$ and the shared 17-genelets × 17-arrays transformation matrix $V^T$. We prove that with our particular $V$ of Equations (2)–(4), this decomposition extends to higher orders all of the mathematical properties of the GSVD except for complete column-wise orthogonality of the arraylets, i.e., left basis vectors that form the matrices $U_1$, $U_2$ and $U_3$. We therefore mathematically define, in analogy with the GSVD, the “common HO GSVD subspace” of the matrices to be the subspace spanned by the genelets, i.e., right basis vectors that correspond to higher-order generalized singular values that are equal, $\sigma_{1,k} = \sigma_{2,k} = \sigma_{3,k}$, where, as we prove, the corresponding arraylets, i.e., the left basis vectors $u_{1,k}$, $u_{2,k}$ and $u_{3,k}$, are orthonormal to all other arraylets in $U_1$, $U_2$ and $U_3$. We show that like the GSVD for two organisms [7], the HO GSVD provides a sequence-independent comparative mathematical framework for datasets from more than two organisms, where the mathematical variables and operations represent biological reality: Genelets of common significance in the multiple datasets, and the corresponding arraylets, represent cell-cycle checkpoints or transitions from one phase to the next, common to S. pombe, S. cerevisiae and human. 
Simultaneous reconstruction and classification of the three datasets in the common subspace that these patterns span outline the biological similarity in the regulation of their cell-cycle programs. Notably, genes of significantly different cell-cycle peak times [19] but highly conserved sequences [20], [21] are correctly classified.
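To make the idea of reconstruction in a subspace concrete, the sketch below keeps only a chosen set of genelet/arraylet pairs and sums their rank-one contributions. It is an illustration under assumptions: the function name `reconstruct_in_subspace` is hypothetical, the factors are assumed to be given as a matrix of arraylets, a vector of higher-order generalized singular values and a matrix whose columns are the genelets, and the synthetic factors stand in for a real decomposition.

```python
import numpy as np

def reconstruct_in_subspace(U, sigma, V, keep):
    """Rebuild a dataset from selected basis vectors only.

    U     : (genes x n) arraylets, one per column
    sigma : (n,) higher-order generalized singular values
    V     : (arrays x n) genelets, one per column
    keep  : 0-based indices of the basis vectors that span the subspace
    """
    keep = np.asarray(list(keep))
    # Sum over k in `keep` of sigma_k * u_k * v_k^T.
    return (U[:, keep] * sigma[keep]) @ V[:, keep].T

# Synthetic stand-in for one dataset's factors (6 genes, 4 arrays).
rng = np.random.default_rng(0)
U = rng.standard_normal((6, 4))
sigma = rng.uniform(0.5, 2.0, size=4)
V = rng.standard_normal((4, 4))

full = reconstruct_in_subspace(U, sigma, V, range(4))     # all vectors: the full dataset
partial = reconstruct_in_subspace(U, sigma, V, [2, 3])    # restricted to two vectors
print(full.shape, np.linalg.matrix_rank(partial))
```

With 17 basis vectors ordered as in the figures, the 13th through 17th genelets would correspond to `keep=range(12, 17)` in this 0-based convention; applying the same `keep` to every dataset gives a simultaneous reconstruction.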
Figure 2. Genelets or right basis vectors.
(a) Raster display of the expression of the 17 genelets, i.e., HO GSVD patterns of expression variation across time, with overexpression (red), no change in expression (black) and underexpression (green) around the array-, i.e., time-invariant expression. (b) Bar chart of the corresponding inverse eigenvalues $\lambda_k^{-1}$, showing that the 13th through the 17th genelets correspond to the eigenvalues closest to $\lambda_k = 1$. (c) Line-joined graphs of the 13th (red), 14th (blue) and 15th (green) genelets in the two-dimensional subspace that approximates the five-dimensional HO GSVD subspace (Figure S4 and Section 2.4), normalized to zero average and unit variance. (d) Line-joined graphs of the projected 16th (orange) and 17th (violet) genelets in the two-dimensional subspace. The five genelets describe expression oscillations of two periods in the three time courses.
Arraylets or left basis vectors.
| Dataset | Arraylet | Overexpression annotation | Underexpression annotation |
|---|---|---|---|
| S. pombe | 13 | G2 | G1 |
| S. pombe | 14 | M | G2 |
| S. pombe | 15 | M | S |
| S. pombe | 16 | G2 | G1 |
| S. pombe | 17 | G2 | S |
| S. cerevisiae | 13 | S/G2 | M/G1 |
| S. cerevisiae | 14 | M/G1 | G2/M |
| S. cerevisiae | 15 | G1 | S |
| S. cerevisiae | 16 | G2/M | G1 |
| S. cerevisiae | 17 | G2/M | G1 |
| Human | 13 | G1/S | G2 |
| Human | 14 | M/G1 | G2 |
| Human | 15 | G2 | None |
| Human | 16 | G1/S | G2 |
| Human | 17 | G2 | M/G1 |
Probabilistic significance of the enrichment of the arraylets, i.e., HO GSVD patterns of expression variation across the S. pombe, S. cerevisiae and human genes, that span the common HO GSVD subspace in each dataset, in over- or underexpressed cell cycle-regulated genes. The P-value of each enrichment is calculated as described [39] (Section 2.2 in Appendix S1), assuming a hypergeometric distribution of the annotations (Datasets S1, S2, S3) among the genes, including the 100 genes most over- or underexpressed in each arraylet.
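A hypergeometric enrichment P-value of the kind mentioned in this legend can be sketched as the upper-tail probability of drawing at least the observed number of annotated genes. This is a generic illustration, not the procedure of reference [39]: the function name `enrichment_p_value` and all gene counts other than the 100 selected genes are hypothetical.

```python
from math import comb

def enrichment_p_value(total_genes, annotated_genes, selected_genes, annotated_in_selection):
    """Upper-tail hypergeometric probability P(X >= annotated_in_selection).

    total_genes            : genes in the dataset
    annotated_genes        : genes carrying the annotation (e.g., one cell-cycle phase)
    selected_genes         : genes picked (e.g., the 100 most overexpressed in an arraylet)
    annotated_in_selection : annotated genes observed among the selected ones
    """
    upper = min(annotated_genes, selected_genes)
    denominator = comb(total_genes, selected_genes)
    # comb(n, k) is 0 when k > n, so impossible draws contribute nothing.
    tail = sum(
        comb(annotated_genes, k) * comb(total_genes - annotated_genes, selected_genes - k)
        for k in range(annotated_in_selection, upper + 1)
    )
    return tail / denominator

# Hypothetical counts: 4000 genes, 500 annotated, 100 selected, 40 of them annotated.
p = enrichment_p_value(4000, 500, 100, 40)
print(f"{p:.3e}")
```

Exact integer arithmetic via `math.comb` avoids an external dependency here; a statistics library's hypergeometric survival function would be a reasonable alternative for large-scale use.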
Figure 3. Common HO GSVD subspace represents similar cell-cycle oscillations.
(a–c) S. pombe, S. cerevisiae and human array expression, projected from the five-dimensional common HO GSVD subspace onto the two-dimensional subspace that approximates it (Sections 2.3 and 2.4 in Appendix S1). The arrays are color-coded according to their previous cell-cycle classification [15]–[18]. The arrows describe the projections of the arraylets of each dataset. The dashed unit and half-unit circles outline 100% and 50% of added-up (rather than canceled-out) contributions of these five arraylets to the overall projected expression. (d–f) Expression of 380, 641 and 787 cell cycle-regulated genes of S. pombe, S. cerevisiae and human, respectively, color-coded according to previous classifications. (g–i) The HO GSVD pictures of the S. pombe, S. cerevisiae and human cell-cycle programs. The arrows describe the projections of the shared genelets and organism-specific arraylets that span the common HO GSVD subspace and represent cell-cycle checkpoints or transitions from one phase to the next.
Figure 4. Simultaneous HO GSVD classification of homologous genes of different cell-cycle peak times.
(a) The S. pombe gene BFR1, and (b) its closest S. cerevisiae homologs. (c) The S. pombe and (d) S. cerevisiae closest homologs of the S. cerevisiae gene PLB1. (e) The S. pombe cyclin-encoding gene CIG2 and its closest S. pombe, (f) S. cerevisiae and (g) human homologs.