Jeremy A Miller, Chaochao Cai, Peter Langfelder, Daniel H Geschwind, Sunil M Kurian, Daniel R Salomon, Steve Horvath.
Abstract
BACKGROUND: Genomic and other high dimensional analyses often require one to summarize multiple related variables by a single representative. This task is also variously referred to as collapsing, combining, reducing, or aggregating variables. Examples include summarizing several probe measurements corresponding to a single gene, representing the expression profiles of a co-expression module by a single expression profile, and aggregating cell-type marker information to de-convolute expression data. Several standard statistical summary techniques can be used, but network methods also provide useful alternative ways to find representatives. Currently, few collapsing functions have been developed and widely applied.
Year: 2011 PMID: 21816037 PMCID: PMC3166942 DOI: 10.1186/1471-2105-12-322
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Example pipeline for using collapseRows functions. The collapseRows function could be used in two steps of a pipeline for finding predictors of a clinical outcome. First, probes could be collapsed into genes by taking the probe with the highest expression (1.max strategy) to allow comparability of data run using several microarray platforms (or RNAseq). These data could then be combined into a consensus module. Second, modules from the resulting network could be summarized using the most highly connected gene (3.kMax strategy), some of which will likely be related to clinical outcomes.
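The first step of this pipeline (1.max) can be illustrated with a minimal Python sketch. It is not the collapseRows implementation from the paper: the function name `collapse_max_mean`, the array layout (rows are probes, columns are samples), and the toy values are assumptions made for illustration.

```python
import numpy as np

def collapse_max_mean(expr, probe_genes):
    """Keep, for each gene, the probe (row) with the highest mean expression.

    expr: 2-D array, rows are probes, columns are samples (assumed layout).
    probe_genes: gene label for each row of expr.
    Returns (genes, collapsed_matrix, selected_row_indices).
    """
    expr = np.asarray(expr, dtype=float)
    row_means = expr.mean(axis=1)
    best = {}  # gene -> index of the probe with the highest mean seen so far
    for i, gene in enumerate(probe_genes):
        if gene not in best or row_means[i] > row_means[best[gene]]:
            best[gene] = i
    genes = sorted(best)
    rows = [best[g] for g in genes]
    return genes, expr[rows], rows

# Toy data (hypothetical): two probes for gene A, one probe for gene B.
expr = np.array([
    [1.0, 2.0, 3.0],  # probe 0, gene A, mean 2.0
    [4.0, 5.0, 6.0],  # probe 1, gene A, mean 5.0
    [2.0, 2.0, 2.0],  # probe 2, gene B, mean 2.0
])
genes, collapsed, rows = collapse_max_mean(expr, ["A", "A", "B"])
```

In this toy case gene A is represented by probe 1, the one with the higher mean, and gene B keeps its only probe.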
Summary of data sets and corresponding collapsing strategies
| Fig | Analysis | Data sets used | 1.max | 2.var | 3.kMax | 4.kVar | 5.ME | 6.Avg |
|---|---|---|---|---|---|---|---|---|
| 1 | Summary | Hypothetical data | X | - | X | - | - | - |
| 2 | Collapsing probes to genes | 18 Human Brain #; 20 Mouse Brain %; 5 Human Blood $ | X | X | X | X | - | - |
| 3 | Choosing module centroids | 7 Human Brain #; 8 Mouse Brain %; 5 Human Blood $ | X | - | X | - | X | - |
| 4 | Predicting cell type proportions | Abbas et al 2009 (cell lines) | X | - | X | - | X | X |
| 5 | Predicting cell type proportions | Grigoryev et al 2010 (whole blood) | X | - | X | - | X | X |
"#" - The 18 human brain data sets were the following GSE numbers: 1133, 1297, 1572, 2164B, 3526A, 3526B, 3790A, 3790B, 3790C, 4036, 4757, 5281A, 5281B, 5388A, 5388B, 7621, 8397, and 9770. "%" - The 20 mouse brain data sets were the following GSE numbers: 1482, 1782A, 1782B, 2392, 3248, 3327A, 3327B, 3594C, 3963A, 3963B, 4269, 4734, 5429, 6285, 6514A, 6514B, 9444A, 9444B, 9444C, and 10263. For "#" and "%", underlined data sets were used in Figure 3 as well as Figure 2. See Miller et al 2010 for more details on these data sets. "$" - The 5 human blood data sets were from Dumeaux et al 2010, Goring et al 2007, Pankla et al 2009, and Saris et al 2009.
Figure 2. When collapsing probes to genes, 1.max is usually the optimal collapsing strategy to choose. A) A typical example of ranked expression (left column) and ranked connectivity (right column) correlation between two data sets. Each dot represents a gene in common between data sets, with the x and y axes representing that gene's ranked expression or connectivity in data sets 1 and 2, respectively. B-D) Across several studies in human brain (B), mouse brain (C), and human blood (D), the MaxMean (1.max) parameter generally produces better ranked expression correlations (left column) than maxVariance (2.var). For both MaxMean and maxVariance, use of connectivityBasedCollapsing (3.kMax and 4.kVar) decreases the between-study correlations. Similar results hold, to a lesser extent, with connectivity correlations (right column). Y-axes correspond to the average expression and connectivity correlation between data sets. Error bars represent standard error. Percentages indicate the percentage of assessments in which the relevant strategy had the highest overall between-set correlation.
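A connectivity-based choice such as 3.kMax can be sketched in Python as follows. The sketch assumes a signed correlation network with adjacency ((1 + cor) / 2) ** power, and defines a row's connectivity as the sum of its adjacencies to the other rows in the same group. These definitions, the helper name `most_connected_row`, and the toy data are assumptions for illustration and are not taken from the caption.

```python
import numpy as np

def most_connected_row(expr, power=1.0):
    """Return the index of the most connected row and all row connectivities.

    expr: 2-D array, rows are probes (or genes) in one group, columns are samples.
    power: soft-thresholding exponent applied to the adjacency (assumed default 1).
    """
    expr = np.asarray(expr, dtype=float)
    cor = np.corrcoef(expr)                  # row-by-row Pearson correlations
    adjacency = ((1.0 + cor) / 2.0) ** power  # signed network adjacency in [0, 1]
    np.fill_diagonal(adjacency, 0.0)          # ignore self-connections
    connectivity = adjacency.sum(axis=1)
    return int(np.argmax(connectivity)), connectivity

# Toy group (hypothetical): rows 0 and 2 are perturbed copies of row 1.
group = np.array([
    [2.0, 1.0, 3.0, 4.0, 5.0, 6.0],
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.0, 2.0, 3.0, 4.0, 6.0, 5.0],
])
hub, connectivity = most_connected_row(group)
```

Here row 1 correlates well with both of the other rows, while rows 0 and 2 correlate less well with each other, so row 1 has the highest connectivity.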
Figure 3. When collapsing genes to modules, the optimal method depends on the goal of the analysis. A-C) Across several studies in human brain (A), mouse brain (B), and human blood (C), the MaxMean parameter generally produces better ranked expression correlations (left column) when setting the connectivityBasedCollapsing parameter to FALSE. On the other hand, better connectivity correlations (right column) are found when setting connectivityBasedCollapsing to TRUE, in some cases producing higher correlations than even the module eigengene (5.ME). Labelling as in Figure 2 B-D.
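For comparison, a module eigengene (5.ME) is commonly computed as the first principal component of the standardized module expression matrix. The Python sketch below follows that common definition. The function name `module_eigengene`, the sign convention (oriented to correlate positively with the mean standardized expression), and the toy module are assumptions, not details given in the caption.

```python
import numpy as np

def module_eigengene(expr):
    """First principal component across samples of a module's expression.

    expr: 2-D array, rows are genes in the module, columns are samples.
    Every row must vary across samples (non-zero standard deviation).
    """
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=1, keepdims=True)
    scaled = centered / expr.std(axis=1, ddof=1, keepdims=True)
    # Rows of vt are the right singular vectors, one value per sample.
    _, _, vt = np.linalg.svd(scaled, full_matrices=False)
    eigengene = vt[0]
    # The sign of a singular vector is arbitrary; align it with mean expression.
    if np.corrcoef(eigengene, scaled.mean(axis=0))[0, 1] < 0:
        eigengene = -eigengene
    return eigengene

# Toy module (hypothetical): three co-expressed genes across five samples.
module = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 11.0],
    [0.0, 1.0, 1.0, 2.0, 3.0],
])
eigengene = module_eigengene(module)
```

The eigengene has unit length and no natural expression scale, so it is compared with other profiles through correlation rather than through absolute values.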
Figure 4. collapseRows accurately predicts the relative quantity of blood cell lines across mixed samples. Four prediction methods were used on data from Abbas et al 2009, for which both gene expression data and actual blood cell counts were known: A) maximum mean expression (1.max), B) maximum connectivity (3.kMax), C) module eigengene (5.ME), and D) average (6.Avg) expression of all marker genes. Each dot represents one cell type in one sample. The X-axes correspond to the predicted proportion of each cell type, while the Y-axes correspond to the actual proportion of each cell type across samples. Values are scaled so that the sum of the proportions for a single cell type across all samples is 1. For all methods (except ME), the x = y line (representing perfect agreement) is plotted. Note that choosing the gene with the highest connectivity (B) most accurately predicts the true cell type proportions.
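The scaling described in this caption, where the values for one cell type sum to 1 across samples, can be sketched in Python as below. The function name `scale_across_samples`, the layout (rows are cell types, columns are samples), and all numbers are hypothetical and are not data from Abbas et al 2009.

```python
import numpy as np

def scale_across_samples(values):
    """Scale each row (one cell type) so its values sum to 1 across samples."""
    values = np.asarray(values, dtype=float)
    return values / values.sum(axis=1, keepdims=True)

# Hypothetical collapsed marker signal: rows are cell types, columns are samples.
predicted_signal = np.array([
    [2.0, 4.0, 6.0, 8.0],
    [9.0, 6.0, 3.0, 2.0],
])
# Hypothetical measured cell counts with the same layout.
actual_counts = np.array([
    [10.0, 20.0, 30.0, 40.0],
    [45.0, 30.0, 15.0, 10.0],
])

predicted = scale_across_samples(predicted_signal)
actual = scale_across_samples(actual_counts)
```

After scaling, predicted and actual values share a unit-free scale, which is what makes the x = y line a meaningful reference. In this toy case the two matrices coincide, so every point would fall on that line.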
Figure 5. collapseRows accurately predicts the relative quantity of cell types across samples of whole blood. Using data from a realistic blood model (Grigoryev et al 2010), the 1.max, 3.kMax, 5.ME, and 6.Avg collapseRows aggregation strategies can still predict the relative proportion of several major cell types. Each point represents the correlation between true and predicted proportions for one of the four strategies. The X-axis corresponds to the number of marker genes used for the predictor, while the Y-axis corresponds to the correlation between true and predicted proportions. Note that all methods other than MaxMean (1.max) are relatively robust to the choice of the number of marker genes.
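The kind of evaluation summarized in this figure can be sketched in Python for the averaging strategy (6.Avg): build a predictor from the first n marker genes and correlate it with the true proportions. The function name `correlation_by_marker_count`, the assumption that markers are already ordered by preference, and the toy values are illustrative only and are not data from Grigoryev et al 2010.

```python
import numpy as np

def correlation_by_marker_count(marker_expr, true_prop, counts):
    """Pearson correlation between true proportions and an averaged predictor.

    marker_expr: 2-D array, rows are marker genes (assumed pre-ordered),
                 columns are samples.
    true_prop: true proportion of the cell type in each sample.
    counts: iterable of marker counts n; the first n rows are averaged.
    """
    marker_expr = np.asarray(marker_expr, dtype=float)
    true_prop = np.asarray(true_prop, dtype=float)
    result = {}
    for n in counts:
        predictor = marker_expr[:n].mean(axis=0)  # average of the first n markers
        result[n] = float(np.corrcoef(predictor, true_prop)[0, 1])
    return result

# Hypothetical markers for one cell type across five samples.
markers = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 2.0, 4.0, 4.0, 6.0],
    [5.0, 1.0, 4.0, 2.0, 3.0],
])
true_prop = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
correlations = correlation_by_marker_count(markers, true_prop, [1, 2, 3])
```

Plotting `correlations` against n would give one curve of the type shown in the figure; the other strategies would swap the averaging step for a different choice of representative.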