Abstract
BACKGROUND: Microarrays have become a routine tool to address diverse biological questions. Consequently, several manufacturers have produced different types and generations of microarrays over time. Likewise, the diversity of raw data deposited in public databases such as NCBI GEO or EBI ArrayExpress has grown enormously. These databases now contain several hundred thousand microarray samples, clustered by species, manufacturer and chip generation. One of the original goals of these databases was to make the data available to other researchers for independent analysis and, where appropriate, integration with their own data, but current software implementations could not provide that feature. Only data sets generated on the same chip platform can be readily combined, and even then batch effects have to be taken care of. A straightforward approach to deal with multiple chip types and batch effects has been missing. The software presented here was designed to solve both of these problems in a convenient and user-friendly way.
Year: 2013 PMID: 23452776 PMCID: PMC3599117 DOI: 10.1186/1471-2105-14-75
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Scheme of the steps performed by virtualArray. Seven distinct steps are needed to create a virtual array. The first two require user input, while all others are performed without user intervention. See “Implementation” for detailed descriptions.
Example identifier coverages and overlaps between selected chip platforms
| Manufacturer | Chip | Species | Identifier | Raw features | Collapsed features | Overlapping features | Overlap fraction |
|---|---|---|---|---|---|---|---|
| Agilent | G4112F | H. sapiens | gene symbols | 41078 | 18575 | 17981 | 96.8% |
| Affymetrix | U133Plus2 | H. sapiens | gene symbols | 54675 | 19798 | 17981 | 90.8% |
| Agilent | G4112F | H. sapiens | gene symbols | 41078 | 18575 | 16976 | 91.4% |
| Affymetrix | U133Plus2 | H. sapiens | gene symbols | 54675 | 19798 | 16976 | 85.7% |
| Illumina | HumanRef8v3 | H. sapiens | gene symbols | 24526 | 21090 | 16976 | 80.5% |
| Agilent | G4112F | H. sapiens | ENTREZ ID | 41078 | 18575 | 17981 | 96.8% |
| Affymetrix | U133Plus2 | H. sapiens | ENTREZ ID | 54675 | 20723 | 17981 | 86.8% |
| Agilent | G4112F | H. sapiens | ENTREZ ID | 41078 | 18575 | 16976 | 91.4% |
| Affymetrix | U133Plus2 | H. sapiens | ENTREZ ID | 54675 | 20723 | 16976 | 81.9% |
| Illumina | HumanRef8v3 | H. sapiens | ENTREZ ID | 24526 | 21090 | 16976 | 80.5% |
| Agilent | G4112F | H. sapiens | Unigene | 41078 | 19712 | 19163 | 97.2% |
| Affymetrix | U133Plus2 | H. sapiens | Unigene | 54675 | 21505 | 19163 | 89.1% |
| Agilent | G4112F | H. sapiens | Unigene | 41078 | 19712 | 18189 | 92.3% |
| Affymetrix | U133Plus2 | H. sapiens | Unigene | 54675 | 21505 | 18189 | 84.6% |
| Illumina | HumanRef8v3 | H. sapiens | Unigene | 24526 | 21153 | 18189 | 86.0% |
| Agilent | G4112F | H. sapiens | ENSEMBL | 41078 | 17899 | 17574 | 98.2% |
| Affymetrix | U133Plus2 | H. sapiens | ENSEMBL | 54675 | 18618 | 17574 | 94.4% |
| Agilent | G4112F | H. sapiens | ENSEMBL | 41078 | 17899 | 17281 | 96.5% |
| Affymetrix | U133Plus2 | H. sapiens | ENSEMBL | 54675 | 18618 | 17281 | 92.8% |
| Illumina | HumanRef8v3 | H. sapiens | ENSEMBL | 24526 | 19291 | 17281 | 89.6% |
| Illumina | MouseRef8v2 | M. musculus | gene symbols | 25697 | 22221 | 18037 | 81.2% |
| Affymetrix | M430.2 | M. musculus | gene symbols | 45101 | 22114 | 18037 | 81.6% |
| Illumina | MouseRef8v2 | M. musculus | ENTREZ ID | 25697 | 22221 | 18037 | 81.2% |
| Affymetrix | M430.2 | M. musculus | ENTREZ ID | 45101 | 22114 | 18037 | 81.6% |
| Illumina | MouseRef8v2 | M. musculus | Unigene | 25697 | 22663 | 19510 | 86.1% |
| Affymetrix | M430.2 | M. musculus | Unigene | 45101 | 22261 | 19510 | 87.6% |
| Illumina | MouseRef8v2 | M. musculus | ENSEMBL | 25697 | 20126 | 17384 | 86.4% |
| Affymetrix | M430.2 | M. musculus | ENSEMBL | 45101 | 17780 | 17384 | 97.8% |
Several major microarray chip platforms have been tested with virtualArray. Probes/probesets were collapsed by gene symbol, ENTREZ ID, Unigene ID or ENSEMBL ID, which yields a different reduced feature number (collapsed feature number) for each identifier type. When two or three platforms are merged, the feature number is reduced further to the identifiers shared by all of them. The overlap fraction is this shared feature number divided by the collapsed feature number of each single chip, and it was always above 80%.
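The collapsing and overlap arithmetic described above can be illustrated with a minimal Python sketch. This is not virtualArray's code: the function names, the toy probe annotations and the use of the median to collapse probes are assumptions made purely for illustration.

```python
from statistics import median


def collapse_probes(probe_values, probe_to_id):
    """Collapse probe-level values to one value per identifier.

    The median is an assumed collapsing function for this sketch.
    Probes without an annotation are dropped.
    """
    grouped = {}
    for probe, value in probe_values.items():
        identifier = probe_to_id.get(probe)
        if identifier is None:
            continue
        grouped.setdefault(identifier, []).append(value)
    return {identifier: median(values) for identifier, values in grouped.items()}


def overlap_fractions(collapsed_by_platform):
    """Return the shared identifiers and, per platform, shared / collapsed."""
    shared = set.intersection(*(set(c) for c in collapsed_by_platform.values()))
    fractions = {
        name: len(shared) / len(collapsed)
        for name, collapsed in collapsed_by_platform.items()
    }
    return shared, fractions


# Hypothetical toy data: two platforms with partly overlapping gene symbols.
annotation_a = {"a1": "TP53", "a2": "TP53", "a3": "MYC", "a4": "GAPDH"}
values_a = {"a1": 2.0, "a2": 4.0, "a3": 5.0, "a4": 7.0, "a5": 9.0}
annotation_b = {"b1": "TP53", "b2": "MYC", "b3": "SOX2"}
values_b = {"b1": 6.0, "b2": 3.0, "b3": 1.0}

collapsed = {
    "platform_a": collapse_probes(values_a, annotation_a),
    "platform_b": collapse_probes(values_b, annotation_b),
}
shared_ids, fractions = overlap_fractions(collapsed)
print(sorted(shared_ids), fractions)
```

Applied to the table's gene-symbol counts for the Agilent G4112F chip, the same ratio is 17981 / 18575, or about 96.8%.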
Figure 2. Hierarchical clusterings and principal component analyses of ExpressionSets output by virtualArray. A hierarchical clustering based on Euclidean distance matrices was calculated from the combined dataset of three different platforms. Samples from GSE23402 are marked in red, samples from GSE26428 in green and samples from GSE28688 in blue. ESC, human embryonic stem cells; iPSC, human induced pluripotent stem cells. A, clustering of combined data without batch effect removal; B, clustering of combined data with batch effect removed in non-supervised mode; C, clustering of combined data with batch effect removed in supervised mode. The direct analysis of the combined dataset exhibits strong batch effects (A), which can be reduced by using EBM in non-supervised mode (B). The benefit of the supervised mode can be seen in the PCA plots (D, E) but not in the hierarchical clustering (C). Principal component analyses were performed on the combined dataset after batch effect removal in non-supervised (D) and supervised mode (E), respectively.
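Why a batch effect distorts a clustering based on Euclidean distances can be shown with a small Python sketch. It uses per-batch mean-centering as a deliberately simple stand-in; it is not the EBM procedure named in the caption, and the sample names and expression values are invented for illustration.

```python
import math
from statistics import mean


def distance_matrix(samples):
    """Euclidean distance between every pair of samples."""
    return {
        (a, b): math.dist(samples[a], samples[b])
        for a in samples
        for b in samples
    }


def center_by_batch(samples, batch_of):
    """Subtract each batch's per-feature mean from its samples.

    A naive adjustment that assumes similar group composition in every batch.
    """
    names_by_batch = {}
    for name, batch in batch_of.items():
        names_by_batch.setdefault(batch, []).append(name)
    centered = {}
    for names in names_by_batch.values():
        n_features = len(samples[names[0]])
        batch_means = [mean(samples[n][i] for n in names) for i in range(n_features)]
        for n in names:
            centered[n] = [v - m for v, m in zip(samples[n], batch_means)]
    return centered


# Hypothetical data: batch2 carries a constant offset of +10 on every feature.
samples = {
    "esc_1": [1.0, 5.0], "fib_1": [5.0, 1.0],
    "esc_2": [11.0, 15.0], "fib_2": [15.0, 11.0],
}
batch_of = {"esc_1": "batch1", "fib_1": "batch1", "esc_2": "batch2", "fib_2": "batch2"}

before = distance_matrix(samples)
after = distance_matrix(center_by_batch(samples, batch_of))
print(before[("esc_1", "esc_2")], before[("esc_1", "fib_1")])
print(after[("esc_1", "esc_2")], after[("esc_1", "fib_1")])
```

Before centering, each sample is closest to the other sample of its own batch; afterwards, the two ESC samples are closer to each other than to either fibroblast sample.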
Contents of an exemplary “sample_info.txt” file
| # | Sample name (column 1) | Sample name (column 2) | Batch (column 3) | Group (column 4) |
|---|---|---|---|---|
| 1 | GSM574058 | GSM574058 | GSE23402 | fibroblast |
| 2 | GSM574059 | GSM574059 | GSE23402 | fibroblast |
| 3 | GSM574060 | GSM574060 | GSE23402 | fibroblast |
| 4 | GSM574061 | GSM574061 | GSE23402 | ESC |
| 5 | GSM574062 | GSM574062 | GSE23402 | ESC |
| 6 | GSM574063 | GSM574063 | GSE23402 | ESC |
| 7 | GSM574064 | GSM574064 | GSE23402 | ESC |
| 8 | GSM574065 | GSM574065 | GSE23402 | ESC |
| 9 | GSM574066 | GSM574066 | GSE23402 | ESC |
| 10 | GSM574067 | GSM574067 | GSE23402 | ESC |
| 11 | GSM574068 | GSM574068 | GSE23402 | ESC |
| 12 | GSM574069 | GSM574069 | GSE23402 | ESC |
| 13 | GSM574070 | GSM574070 | GSE23402 | ESC |
| 14 | GSM574071 | GSM574071 | GSE23402 | ESC |
| 15 | GSM574072 | GSM574072 | GSE23402 | ESC |
| 16 | GSM574073 | GSM574073 | GSE23402 | ESC |
| 17 | GSM574074 | GSM574074 | GSE23402 | ESC |
| 18 | GSM574075 | GSM574075 | GSE23402 | ESC |
| 19 | GSM574076 | GSM574076 | GSE23402 | ESC |
| 20 | GSM574077 | GSM574077 | GSE23402 | ESC |
| 21 | GSM574078 | GSM574078 | GSE23402 | iPSC |
| 22 | GSM574079 | GSM574079 | GSE23402 | iPSC |
| 23 | GSM574080 | GSM574080 | GSE23402 | iPSC |
| 24 | GSM574081 | GSM574081 | GSE23402 | iPSC |
| 25 | GSM648497 | GSM648497 | GSE26428 | iPSC |
| 26 | GSM648498 | GSM648498 | GSE26428 | iPSC |
| 27 | GSM648499 | GSM648499 | GSE26428 | fibroblast |
| 28 | GSM710513 | GSM710513 | GSE28688 | fibroblast |
| 29 | GSM710514 | GSM710514 | GSE28688 | fibroblast |
| 30 | GSM710515 | GSM710515 | GSE28688 | fibroblast |
| 31 | GSM710516 | GSM710516 | GSE28688 | fibroblast |
| 32 | GSM710517 | GSM710517 | GSE28688 | fibroblast |
| 33 | GSM710518 | GSM710518 | GSE28688 | fibroblast |
| 34 | GSM710519 | GSM710519 | GSE28688 | fibroblast |
| 35 | GSM710520 | GSM710520 | GSE28688 | fibroblast |
| 36 | GSM710521 | GSM710521 | GSE28688 | ESC |
| 37 | GSM710522 | GSM710522 | GSE28688 | ESC |
| 38 | GSM710523 | GSM710523 | GSE28688 | iPSC |
| 39 | GSM710524 | GSM710524 | GSE28688 | iPSC |
| 40 | GSM710525 | GSM710525 | GSE28688 | iPSC |
| 41 | GSM710526 | GSM710526 | GSE28688 | iPSC |
Column numbers below do not count the leading row index. The first two columns must correspond to the sample names used in the ExpressionSets. Column 3 records the batch to which each sample belongs. Column 4 contains the user-defined group assignment of each sample. Group assignments (covariates) can span more than one column, for example to record source tissue, sex or age.
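The column layout described above can be checked programmatically before any merging is attempted. The following Python sketch parses a few rows taken from the table and counts samples per batch and group. The tab delimiter, the absence of a header line and the field names `array_name`, `sample_name`, `batch` and `covariates` are assumptions made for this illustration, not the documented file format.

```python
import csv
import io
from collections import Counter

# A subset of the rows shown in the table, written as tab-separated text.
ROWS = [
    ("GSM574058", "GSM574058", "GSE23402", "fibroblast"),
    ("GSM574061", "GSM574061", "GSE23402", "ESC"),
    ("GSM574078", "GSM574078", "GSE23402", "iPSC"),
    ("GSM648497", "GSM648497", "GSE26428", "iPSC"),
    ("GSM648499", "GSM648499", "GSE26428", "fibroblast"),
    ("GSM710513", "GSM710513", "GSE28688", "fibroblast"),
    ("GSM710521", "GSM710521", "GSE28688", "ESC"),
    ("GSM710523", "GSM710523", "GSE28688", "iPSC"),
]
SAMPLE_INFO_TEXT = "\n".join("\t".join(row) for row in ROWS) + "\n"


def read_sample_info(text):
    """Parse rows of: sample name, sample name, batch, one or more covariates."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields:
            continue
        array_name, sample_name, batch, *covariates = fields
        rows.append({
            "array_name": array_name,
            "sample_name": sample_name,
            "batch": batch,
            "covariates": covariates,
        })
    return rows


def unknown_samples(rows, expression_sample_names):
    """List rows whose names do not occur in the expression data."""
    known = set(expression_sample_names)
    return [
        r["array_name"]
        for r in rows
        if r["array_name"] not in known or r["sample_name"] not in known
    ]


sample_rows = read_sample_info(SAMPLE_INFO_TEXT)
per_batch = Counter(r["batch"] for r in sample_rows)
per_batch_and_group = Counter((r["batch"], r["covariates"][0]) for r in sample_rows)
missing = unknown_samples(sample_rows, [row[0] for row in ROWS])
print(per_batch, missing)
```

Counting batch and group combinations this way makes it easy to see whether every group is represented in each batch.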