Weizhuang Zhou, Lichy Han, Russ B Altman.
Abstract
Microarray measurements of gene expression constitute a large fraction of publicly shared biological data, and are available in the Gene Expression Omnibus (GEO). Many studies use GEO data to shape hypotheses and improve statistical power. Within GEO, the Affymetrix HG-U133A and HG-U133 Plus 2.0 are the two most commonly used microarray platforms for human samples; the HG-U133 Plus 2.0 platform contains 54 220 probes and the HG-U133A array contains a proper subset (21 722 probes). When samples from different platforms are combined, the simplest approach is to compare only the subset of genes common to both. This approach excludes a substantial amount of measured data and can limit downstream analysis. To predict the expression values for the genes unique to the HG-U133 Plus 2.0 platform, we constructed a series of gene expression inference models based on genes common to both platforms. Our model predicts gene expression values that are within the variability observed in controlled replicate studies and are highly correlated with measured data. Using six previously published studies, we also demonstrate the improved performance of the enlarged feature space generated by our model in downstream analysis. Availability and Implementation: The gene inference model described in this paper is available as an R package (affyImpute), which can be downloaded at http://simtk.org/home/affyimpute. Contact: rbaltman@stanford.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Year: 2017 PMID: 27797771 PMCID: PMC5408923 DOI: 10.1093/bioinformatics/btw664
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Number of GEO samples from each platform over time. The sample counts were limited to only human samples.
Fig. 2. Venn diagram depicting the common and imputed gene sets.
Fig. 3. Test and training CV(RMSE). Each colored circle represents a gene model. The marginal histograms show the distribution of errors across the 9986 gene models. The 365 gene models from the Human Disease Network are depicted in orange.
Fig. 4. Coefficient of variation of MAQC data (in comparison with test CV(RMSE)).
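The captions for Fig. 3 and Fig. 4 report CV(RMSE) without defining it. The sketch below assumes the usual definition, the root-mean-square error between measured and predicted values divided by the mean of the measured values; the function name `cv_rmse` and the sample values are hypothetical and are not taken from the paper or from affyImpute.

```python
import math


def cv_rmse(measured, predicted):
    """Return the RMSE between the two sequences divided by the mean of `measured`.

    Assumed definition: CV(RMSE) = RMSE / mean(measured).
    """
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured)
    return math.sqrt(mse) / (sum(measured) / len(measured))


# Hypothetical log2 expression values for one gene across four held-out samples.
measured = [8.0, 10.0, 9.0, 9.0]
predicted = [8.5, 9.5, 9.0, 9.0]
score = cv_rmse(measured, predicted)
print(f"CV(RMSE) = {score:.4f}")
```

Because the error is scaled by the mean expression level, the value is unitless, which is what allows it to be set beside the coefficient of variation of replicate measurements as in Fig. 4.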
Gene overlap between MSigDB collections and platforms
| MSigDB collection | H | C1 | C2 | C3 | C4 | C5 | C6 | C7 |
|---|---|---|---|---|---|---|---|---|
| Total number of valid signatures | 50 | 271 | 2949 | 752 | 702 | 737 | 186 | 1910 |
| Overlap with HG-U133A | 50 | 156 | 2752 | 747 | 680 | 664 | 184 | 1910 |
| Overlap with HG-U133 Plus 2.0 | 50 | 215 | 2880 | 752 | 689 | 715 | 186 | 1910 |
A valid signature contains between 25 and 500 genes.
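The counts in the table above are easier to compare as fractions of each collection. The following is a minimal sketch that reads each "Overlap" row as the number of signatures that remain valid on that platform; that reading is an interpretation of the table, and the variable names are illustrative only.

```python
# Counts copied from the table above, in collection order.
collections = ["H", "C1", "C2", "C3", "C4", "C5", "C6", "C7"]
total = [50, 271, 2949, 752, 702, 737, 186, 1910]
u133a = [50, 156, 2752, 747, 680, 664, 184, 1910]
plus2 = [50, 215, 2880, 752, 689, 715, 186, 1910]


def coverage(overlap, totals):
    """Fraction of signatures in each collection that overlap the platform."""
    return [o / t for o, t in zip(overlap, totals)]


cov_a = dict(zip(collections, coverage(u133a, total)))
cov_p = dict(zip(collections, coverage(plus2, total)))

for name in collections:
    print(f"{name}: HG-U133A {cov_a[name]:.1%}, HG-U133 Plus 2.0 {cov_p[name]:.1%}")

# Signatures covered by HG-U133 Plus 2.0 but not by HG-U133A, summed over collections.
gained = sum(p - a for p, a in zip(plus2, u133a))
```

Under this reading, the gap between the platforms is widest for C1 and vanishes for H and C7, where both platforms cover every valid signature.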
Fig. 5. Tumor grade and estrogen receptor (ER), progesterone receptor (PR) and HER2/neu immunoreactive scores for patients in GSE3893 (A). Hierarchical clustering of breast cancer tumors from GSE3893 using the original HG-U133A probes (B) and the imputed HG-U133 Plus 2.0 array (C).
Fig. 6. Hierarchical clustering of kidney tumors using the HG-U133A array and the imputed HG-U133 Plus 2.0 array.
Performance results of gene signature models in distinguishing long from short survival on external validation data
| Gene set | Accuracy | Precision | Recall | F1 Measure | Chi-squared |
|---|---|---|---|---|---|
| Original Probes ( | 0.759 | 0.789 | 0.833 | 0.811 | 0.029 |
| Common Genes ( | 0.643 | 0.786 | 0.611 | 0.689 | 0.236 |
| Imputed Genes ( |