Atsushi Fukushima, Masayoshi Wada, Shigehiko Kanaya, Masanori Arita.
Abstract
Gene co-expression analysis has been widely used in recent years for predicting unknown gene functions and their regulatory mechanisms. The predictive accuracy depends on the quality and diversity of the data set used. In this report, we applied singular value decomposition (SVD) to array experiments in public databases and found that co-expression linkage could be estimated from a much smaller number of arrays. Correlations of co-expressed genes were assessed using two regulatory mechanisms (the feedback loop of the fundamental circadian clock and the global transcription factor Myb28), as well as metabolic pathways in the AraCyc database. Our conclusion is that a smaller number of informative arrays across tissues can suffice to reproduce results comparable to those of a state-of-the-art co-expression software tool. In our SVD analysis of the Arabidopsis data set, the array experiments that contributed most to the principal components included stamen development, germinating seed and stress responses in leaf.
Year: 2008 PMID: 18931094 PMCID: PMC2608847 DOI: 10.1093/dnares/dsn025
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
Figure 1. Pie chart of the biomaterials of array data in each data repository.
Figure 2. Distribution of correlation coefficients from five types of data matrices (with and without SVD compression) normalized by RMA. Data matrices were reconstructed from the largest 20 SVs (solid line), 40 SVs (lower dotted line), 300 and 700 SVs (upper dotted lines), and without SVD (outermost dotted line). The SDs of the distributions are 0.34, 0.31, 0.27, 0.26 and 0.26, respectively.
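The comparison in Figure 2 can be illustrated with a minimal sketch, assuming a genes-by-arrays matrix and NumPy. The data below are synthetic and the function names are hypothetical; this is not the authors' pipeline (no RMA normalization is performed), only the general idea of reconstructing a matrix from its largest k singular values and then looking at the spread of the gene-gene Pearson correlations.

```python
import numpy as np

def reconstruct_rank_k(X, k):
    """Rank-k approximation of X built from its k largest singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def offdiag_correlations(X):
    """Pearson correlations between all distinct pairs of rows (genes) of X."""
    R = np.corrcoef(X)
    upper = np.triu_indices_from(R, k=1)  # skip the diagonal and duplicates
    return R[upper]

# Synthetic stand-in for an expression matrix (assumption, not real array data):
# 200 genes x 60 arrays, a 5-dimensional signal plus unit-variance noise.
rng = np.random.default_rng(0)
signal = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 60))
X = signal + rng.normal(size=(200, 60))

# SD of the correlation distribution for several numbers of retained SVs.
sd_by_k = {}
for k in (5, 20, 60):
    r = offdiag_correlations(reconstruct_rank_k(X, k))
    sd_by_k[k] = float(r.std())
    print(f"k={k:3d}  SD of correlations = {sd_by_k[k]:.3f}")
```

In this toy setting, keeping fewer SVs removes noise dimensions and widens the correlation distribution, which is the same direction as the SDs reported in the caption (0.34 for 20 SVs versus 0.26 without SVD).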
Figure 3. Scatter plots (white circles) for three major central oscillator-related genes in Arabidopsis: (A) CCA1 versus LHY, (B) LHY versus TOC1 and (C) CCA1 versus TOC1. Heavily overlapping regions appear black. (D) The simplest model of the central mechanism of the circadian oscillator. Co-expression was calculated by Pearson's correlation. See the main text for abbreviations.
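The correlation ranks reported in the tables below can be sketched as follows. This is a minimal illustration under stated assumptions: rows are genes, columns are arrays, no row is constant, and rank 1 means the largest Pearson r with the query gene (ranking by descending r is an assumption here, as the record does not spell out the convention). The function name `correlation_rank` is hypothetical.

```python
import numpy as np

def correlation_rank(X, query, target):
    """Return (r, rank) of `target` relative to `query`.

    r is the Pearson correlation between the two rows of X; rank is the
    1-based position of `target` when all other genes are sorted by
    descending correlation with `query`.
    """
    Xc = X - X.mean(axis=1, keepdims=True)             # center each gene
    Xc /= np.linalg.norm(Xc, axis=1, keepdims=True)    # scale to unit length
    r = Xc @ Xc[query]                                 # Pearson r with the query
    r_target = float(r[target])
    r[query] = -np.inf                                 # exclude the query itself
    order = np.argsort(-r)                             # strongest first
    rank = int(np.where(order == target)[0][0]) + 1
    return r_target, rank

# Toy data (assumption): gene 1 is an exact linear function of gene 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 30))
X[1] = 2.0 * X[0] + 3.0
r01, rank01 = correlation_rank(X, query=0, target=1)
print(r01, rank01)
```

The same function could be applied to a matrix reconstructed from a reduced number of SVs to see how a gene pair's rank shifts with the amount of compression.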
Rank of correlations (in parentheses) between three basal genes (CCA1, LHY and TOC1) in the central circadian clock
| SVs used | |||
|---|---|---|---|
| 20 | |||
| 40 | |||
| 300 | |||
| 700 | |||
| 2364 |
Correlation coefficients and their ranks (in parentheses) among Myb28-regulated GSL biosynthetic genes [NS, not significant (P ≥ 1E−300)]
| Probe name | AGI code | Description | 20 SVs | 40 SVs | 300 SVs | 700 SVs | All SVs | ATTED-II |
|---|---|---|---|---|---|---|---|---|
| 247549_at | At5g61420 | Myb family transcription factor (Myb28) | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 266395_at | At2g43100 | Aconitase C-terminal domain-containing protein (AtLeuD1) | 0.89 (7) | 0.85 (7) | 0.80 (8) | 0.79 (8) | 0.79 (8) | 0.74 |
| 251524_at | At3g58990 | Aconitase C-terminal domain-containing protein (AtLeuD2) | 0.89 (6) | 0.86 (6) | 0.83 (5) | 0.82 (5) | 0.82 (4) | 0.78 |
| 254687_at | At4g13770 | Cytochrome P450 family protein (CYP83A1) | 0.95 (1) | 0.93 (1) | 0.90 (1) | 0.89 (1) | 0.89 (1) | 0.80 |
| 249866_at | At5g23010 | 2-Isopropylmalate synthase 3 (IMS3) (MAM-1) | 0.88 (8) | 0.86 (5) | 0.82 (6) | 0.81 (6) | 0.81 (6) | 0.70 |
| 257021_at | At3g19710 | Branched-chain amino acid transaminase, putative (AtBCAT-4) (MAAT) | 0.86 (9) | 0.84 (8) | 0.80 (7) | 0.79 (7) | 0.79 (7) | 0.68 |
| 262717_s_at | At1g16410 | Cytochrome P450, putative (CYP79F1) | 0.85 (12) | 0.82 (11) | 0.76 (12) | 0.74 (12) | 0.74 (12) | 0.67 |
| | At1g16400 | No entry (CYP79F2) | | | | | | |
| 260745_at | At1g78370 | Glutathione S-transferase, putative (ATGSTU20) | 0.77 (29) | 0.75 (18) | 0.72 (16) | 0.71 (16) | 0.71 (15) | 0.52 |
| 263477_at | At2g31790 | UDP-glucoronosyl/UDP-glucosyl transferase family protein (UGT74C1) | 0.92 (3) | 0.89 (3) | 0.86 (2) | 0.85 (2) | 0.84 (2) | 0.72 |
| 255437_at | At4g03060 | 2-Oxoglutarate-dependent dioxygenase, putative (AOP2) | 0.61 (274) | 0.60 (156) | 0.52 (303) | 0.51 (332) | 0.50 (328) | 0.43 |
| 255773_at | At1g18590 | Sulfotransferase family protein (AtSOT17) | 0.80 (19) | 0.77 (16) | 0.73 (15) | 0.72 (15) | 0.71 (16) | 0.61 |
| 264873_at | At1g24100 | UDP-glucoronosyl/UDP-glucosyl transferase family protein (UGT74B1) | 0.61 (307) | 0.58 (249) | 0.53 (223) | 0.52 (257) | 0.52 (257) | 0.43 |
| 260385_at | At1g74090 | Sulfotransferase family protein (AtSOT18) | 0.90 (5) | 0.87 (4) | 0.84 (4) | 0.83 (4) | 0.82 (5) | 0.76 |
| 263706_s_at | At5g14200 | AtIMD1 | 0.77 (30) | 0.74 (23) | 0.70 (18) | 0.70 (18) | 0.69 (18) | NS |
| 249867_at | At5g23020 | 2-Isopropylmalate synthase 2 (IMS2) (MAM3) | NS | NS | NS | NS | NS | 0.41 |
| 263714_at | At2g20610 | Aminotransferase, putative (SUR1) | 0.73 (43) | 0.71 (33) | 0.67 (24) | 0.66 (25) | 0.66 (25) | 0.54 |
| 250633_at | At5g07460 | Peptide methionine sulfoxide reductase, putative (PMSR2) | NS | NS | NS | NS | NS | 0.44 |
| 258851_at | At3g03190 | Glutathione S-transferase, putative (ATGSTF11) | 0.80 (16) | 0.78 (13) | 0.73 (14) | 0.72 (14) | 0.72 (14) | 0.71 |
| 254742_at | At4g13430 | Aconitate hydratase family protein (AtLeuC1) | 0.68 (97) | 0.65 (64) | 0.60 (53) | 0.60 (55) | 0.59 (59) | 0.62 |
| 259343_s_at | At3g03780 | Cobalamin-independent methionine synthase, putative (AtMS2) | 0.54 (813) | NS | NS | NS | NS | NS |
| 252274_at | At3g49680 | Branched-chain amino acid transaminase 3 (AtBCAT-3) | 0.67 (106) | 0.65 (68) | 0.62 (48) | 0.62 (41) | 0.61 (44) | 0.53 |
Figure 4. Evaluation of AraCyc genes in co-expression rankings against various thresholds (r = 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.9). Average ranks of intra-pathway correlations using reconstructed matrices were calculated across the 78 AraCyc pathways that contain ≥10 genes on the ATH1 GeneChip.
Figure 5. Plot of the number of arrays (y-axis) against λ (x-axis, from 1 to 10) for different numbers of SVs. The bars correspond to 10, 20, 30, 40 and 50 SVs from left to right. The number of significant columns decreases rapidly as λ increases, and the contributing arrays are independent of the number of SVs.
Figure 6. Hierarchical clustering of the data matrices reconstructed using only one SV δ. (A–E) show the matrices reconstructed from the largest through the fifth-largest SV δ. Columns are experimental series and rows are genes; both are hierarchically clustered in each panel. Magenta denotes positive values of the reconstructed matrix B and cyan denotes negative values.
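The single-SV matrices in Figure 6 can be sketched as rank-one terms of the SVD. This is a minimal illustration, assuming a genes-by-arrays matrix and NumPy; the function name `single_sv_component` is hypothetical, the data are synthetic, and the hierarchical clustering step of the figure is not reproduced here.

```python
import numpy as np

def single_sv_component(X, i):
    """Rank-one matrix built from the i-th largest singular value of X.

    With X = U diag(delta) V^T, the returned matrix is
    delta[i] * outer(U[:, i], V[:, i]); i = 0 is the largest SV.
    """
    U, delta, Vt = np.linalg.svd(X, full_matrices=False)
    return delta[i] * np.outer(U[:, i], Vt[i, :])

# Synthetic stand-in for an expression matrix (assumption): 40 genes x 12 arrays.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 12))

# Components for the five largest SVs, analogous to panels A to E.
components = [single_sv_component(X, i) for i in range(5)]
for i, B in enumerate(components):
    positive = int((B > 0).sum())
    print(f"SV {i + 1}: {positive} positive entries, {B.size - positive} non-positive")
```

Each component carries both positive and negative entries because the signs of the left and right singular vectors multiply, which is what the magenta and cyan coloring in the figure distinguishes. Summing the components over all SVs recovers the original matrix.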