Abstract
Microarray technologies have been an increasingly important tool in cancer research in the last decade, and a number of initiatives have sought to stress the importance of the provision and sharing of raw microarray data. Illumina BeadArrays provide a particular problem in this regard, as their random construction simultaneously adds value to analysis of the raw data and obstructs the sharing of those data.

We present a compression scheme for raw Illumina BeadArray data, designed to ease the burdens of sharing and storing such data, that is implemented in the BeadDataPackR BioConductor package (http://bioconductor.org/packages/release/bioc/html/BeadDataPackR.html). It offers two key advantages over off-the-peg compression tools. First, it uses knowledge of the data formats to achieve greater compression than other approaches; second, the compressed file does not need to be decompressed for analysis, because the values held within it can be accessed directly.
Keywords: Illumina BeadArray; compression; microarray; open data
Year: 2010 PMID: 20981138 PMCID: PMC2956622
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Figure 1. Showing the physical layout of a typical Illumina BeadChip (in this case a Whole Genome 6 expression array). Illustrated are the multiple arrays (samples) on the chip, the multiple sections within an array, the multiple segments within a section, and the hexagonal grid structure within the segments (only one corner of a segment is illustrated). The ordering of beads within the .locs file is also indicated on the grid.
Figure 2. An overview of the structure of the compressed (.bab) files. The structure consists of a header section followed by several blocks of data (one per bead-type). Overviews of the structures of the header and of a ‘bead-type block’ are also given.
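To make the idea of "a header followed by one block per bead-type" concrete, the following is a minimal sketch of how such a layout can allow one bead-type to be read without unpacking the whole file. The byte layout here (an index of ID, offset and bead count in the header, and 16-bit intensities in each block) is entirely hypothetical and is not the actual .bab format; the function names are illustrative only.

```python
import io
import struct

# Hypothetical toy layout, NOT the real .bab format:
#   header: block count, then one index entry per bead-type
#   body:   one block of 16-bit intensities per bead-type
HEADER_FMT = "<I"    # number of bead-type blocks
INDEX_FMT = "<IQI"   # bead-type ID, byte offset of its block, number of beads


def write_toy_file(blocks):
    """Serialise {bead_type_id: [intensity, ...]} into the toy layout."""
    header_size = struct.calcsize(HEADER_FMT) + len(blocks) * struct.calcsize(INDEX_FMT)
    index, payload = [], b""
    for bead_type, intensities in sorted(blocks.items()):
        # Record where this block will start, measured from the start of the file.
        index.append((bead_type, header_size + len(payload), len(intensities)))
        payload += struct.pack(f"<{len(intensities)}H", *intensities)
    header = struct.pack(HEADER_FMT, len(blocks))
    header += b"".join(struct.pack(INDEX_FMT, *entry) for entry in index)
    return header + payload


def read_bead_type(data, wanted):
    """Read the intensities of one bead-type by seeking straight to its block."""
    stream = io.BytesIO(data)
    (n_blocks,) = struct.unpack(HEADER_FMT, stream.read(struct.calcsize(HEADER_FMT)))
    for _ in range(n_blocks):
        entry = stream.read(struct.calcsize(INDEX_FMT))
        bead_type, offset, n_beads = struct.unpack(INDEX_FMT, entry)
        if bead_type == wanted:
            stream.seek(offset)
            return list(struct.unpack(f"<{n_beads}H", stream.read(2 * n_beads)))
    raise KeyError(wanted)
```

In this toy version only the header index and the requested block are touched when a bead-type is read, which is the property that direct access relies on.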
Showing the performance of BeadDataPackR for four varieties of single-channel array (HumanWG6-V2, HumanRef8-V2, HumanWG6-V3, Human HT12), and four varieties of dual-channel array (CNV370-Duo, Infinium II, DASL, Human 1M). The sizes of the original files, the zipped files, and the files compressed using BeadDataPackR are given in MB for differing degrees of precision in the storage of the bead-coordinates. These values are representative and will show small variations between arrays of the same type.
| | WG6-V2 | Ref8-V2 | WG6-V3 | HT12 | CNV370 | Inf. II | DASL | 1M |
|---|---|---|---|---|---|---|---|---|
| Array class | Single-colour | Single-colour | Single-colour | Single-colour | Two-colour | Two-colour | Two-colour | Two-colour |
| Original (MB) | 39.0 | 37.5 | 38.4 | 40.2 | 35.6 | 66.9 | 68.6 | 69.1 |
| Zipped (MB, % of original) | 18.2 (47%) | 17.4 (46%) | 17.9 (47%) | 18.7 (47%) | 17.6 (49%) | 33.9 (51%) | 34.3 (50%) | 34.2 (49%) |
| Compressed by BeadDataPackR using X bytes to store fractional parts of coordinates | | | | | | | | |
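The "X bytes to store fractional parts of coordinates" idea can be illustrated with a small sketch. It assumes, purely for illustration, that a coordinate is split into an integer part kept exactly and a fractional part truncated to a fixed-point code of `n_bytes` bytes; the function name and the truncation rule are assumptions, not the package's documented behaviour.

```python
import math


def quantize_fraction(coord, n_bytes):
    """Return coord as it would be recovered after storing its fractional
    part in n_bytes bytes (hypothetical fixed-point scheme)."""
    whole = math.floor(coord)
    frac = coord - whole
    if n_bytes == 0:
        # No bytes for the fraction: only the integer part survives.
        return float(whole)
    levels = 256 ** n_bytes                      # representable steps in [0, 1)
    code = min(int(frac * levels), levels - 1)   # integer code that would be stored
    return whole + code / levels


# The worst-case error shrinks by a factor of 256 for every extra byte.
for n in (0, 1, 2):
    print(n, quantize_fraction(1234.56789, n))
```

Under this scheme the recovered coordinate never differs from the original by more than 1/256^n_bytes of a pixel, which is why a few bytes are enough to leave downstream results nearly unchanged.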
Impact of reducing the precision of the stored bead-coordinates on the downstream analysis of a 12-array expression experiment. For 19443 well-annotated probes, the mean-squared errors (relative to full precision) for individual beads, for summarized gene intensities, and for log-ratios between arrays are presented. The mean variance from 4 sets of 3 technical replicates is also presented, as is the ‘correct’ proportion (relative to full precision) of the top 1200 returned genes in a differential expression analysis. There are approximately 1200 significantly differentially expressed genes in an analysis using the full precision.
| Number of bytes used for storage | MSE bead log-intensity | MSE summarized intensities | Mean variance of 3 tech reps | MSE log-ratio of two arrays | Proportion of first 1200 ER-driven genes returned |
|---|---|---|---|---|---|
| 4 | 0 | 0 | 0.0167 | 0 | 1.000 |
| 3 | 4.3 × 10⁻⁸ | 9.6 × 10⁻⁷ | 0.0169 | 1.9 × 10⁻⁶ | 0.999 |
| 2 | 6.4 × 10⁻⁶ | 1.6 × 10⁻⁵ | 0.0169 | 3.2 × 10⁻⁵ | 0.998 |
| 1 | 1.5 × 10⁻⁴ | 1.8 × 10⁻⁴ | 0.0169 | 3.6 × 10⁻⁴ | 0.984 |
| 0 | 2.4 × 10⁻² | 2.4 × 10⁻³ | 0.0179 | 4.8 × 10⁻³ | 0.931 |
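The two kinds of summary in this table, a mean-squared error relative to the full-precision values and the proportion of a top gene list that is recovered, can be sketched as follows. The function names and the toy inputs are illustrative assumptions; the actual analysis pipeline is not reproduced here.

```python
def mean_squared_error(reference, approx):
    """MSE of approx against the full-precision reference values."""
    if len(reference) != len(approx):
        raise ValueError("inputs must have the same length")
    return sum((r - a) ** 2 for r, a in zip(reference, approx)) / len(reference)


def top_list_overlap(full_ranking, reduced_ranking, n):
    """Proportion of the top n genes from the full-precision ranking that
    also appear in the top n of the reduced-precision ranking."""
    top_full = set(full_ranking[:n])
    top_reduced = set(reduced_ranking[:n])
    return len(top_full & top_reduced) / n


# Toy example with four genes ranked by evidence of differential expression.
full = ["geneA", "geneB", "geneC", "geneD"]
reduced = ["geneA", "geneC", "geneB", "geneD"]
print(top_list_overlap(full, reduced, 2))
```

With `n` set to 1200, `top_list_overlap` corresponds to the quantity reported in the final column of the table.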
Figure 3. Impact of reducing the precision of the stored bead-coordinates on the downstream analysis of a 12-array expression experiment. For 19443 well-annotated probes, top gene-lists are compared across analyses based upon full precision and reduced precision. The left-hand panel gives, for varying lengths of gene list, the proportion of genes from the full-precision analysis that are returned in the reduced-precision analysis. The right-hand panel gives, for varying lengths of gene list, Cohen’s kappa score of agreement for the two partitions (one from the full-precision analysis and one from the reduced-precision analysis). The full-precision analysis suggests that approximately 1200 probes show differential expression and this length of gene list is indicated.
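As a sketch of the agreement measure in the right-hand panel, Cohen's kappa can be computed by treating each analysis as a partition of all probes into "in the top list" and "not in the top list". The function name and the toy numbers below are illustrative assumptions rather than the authors' code.

```python
def cohens_kappa_top_lists(full_top, reduced_top, n_total):
    """Cohen's kappa for two in-list / not-in-list partitions of n_total probes."""
    full_top, reduced_top = set(full_top), set(reduced_top)
    both = len(full_top & reduced_top)
    only_full = len(full_top - reduced_top)
    only_reduced = len(reduced_top - full_top)
    neither = n_total - both - only_full - only_reduced

    # Observed agreement: probes given the same label by both analyses.
    p_observed = (both + neither) / n_total
    # Agreement expected by chance from the two marginal list sizes.
    p_full = len(full_top) / n_total
    p_reduced = len(reduced_top) / n_total
    p_expected = p_full * p_reduced + (1 - p_full) * (1 - p_reduced)
    if p_expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example: 10 probes, lists of length 4 that share 3 members.
print(cohens_kappa_top_lists({0, 1, 2, 3}, {0, 1, 2, 4}, 10))
```

Unlike the raw overlap proportion, kappa corrects for the agreement that two lists of the same length would show by chance alone.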