Konstantin Tretyakov, Sven Laur, Geert Smant, Jaak Vilo, Pjotr Prins.
Abstract
BACKGROUND: Biological data acquisition is raising new challenges, both in data analysis and handling. Not only is it proving hard to analyze the data at the rate it is generated today, but simply reading and transferring data files can be prohibitively slow due to their size. This primarily concerns logistics within and between data centers, but is also important for workstation users in the analysis phase. Common usage patterns, such as comparing and transferring files, are proving computationally expensive and are tying down shared resources.
Year: 2013 PMID: 23445565 PMCID: PMC3582436 DOI: 10.1186/1471-2164-14-S2-S8
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Variability in biological data
| Dataset | Description | File type | Number of files | Total size (in GB) | Min file size (in MB) | Max file size (in MB) | δ |
|---|---|---|---|---|---|---|---|
| E61/dat | Ensembl v61 genome annotation (DAT) and DNA sequence (FASTA) files in both compressed (gzip) and uncompressed forms. | dat | 5544 | 169.57 | 5.04 | 1385.14 | 0.782 |
| E61/dat.gz | | dat.gz | 5544 | 42.92 | 1.02 | 400.21 | 0.996 |
| E61/fa | | fa | 1484 | 498.51 | 3.47 | 13306.96 | 0.015 |
| E61/fa.gz | | fa.gz | 1484 | 95.25 | 1.0 | 973.15 | 0.594 |
| GPL570/cel | Microarray files for the HG U133 Plus chip from GEO (all files of GPL570 platform as of 03.2011). Affymetrix CEL and CHP format files, in compressed (gzip) and uncompressed form. | cel | 59892 | 1022.29 | 1.92 | 173.27 | 0.000 |
| GPL570/cel.gz | | cel.gz | 59892 | 330.09 | 1.13 | 48.84 | 0.000 |
| GPL570/chp | | chp | 2535 | 63.30 | 1.67 | 36.50 | 0.209 |
| GPL570/chp.gz | | chp.gz | 2535 | 26.36 | 1.02 | 23.05 | 0.995 |
| BioC2.7/BSGenome | Raw DNA sequence from the Bioconductor package BSGenome, in compressed and uncompressed forms. | rda | 513 | 8.45 | 1.00 | 117.17 | 0.981 |
| BioC2.7/BSGenome/u | | unpacked | 513 | 32.41 | 1.62 | 447.40 | 0.000 |
| YaleTFBS/bedGraph4 | Raw ChIP-seq data from the YaleTFBS dataset of the ENCODE project. Four different file types, both in compressed and uncompressed forms. | bedGraph4 | 171 | 139.91 | 216.73 | 2447.62 | 0.924 |
| YaleTFBS/bedGraph4.gz | | bedGraph4.gz | 171 | 31.45 | 52.89 | 551.80 | 0.996 |
| YaleTFBS/fastq | | fastq | 388 | 541.99 | 199.25 | 4469.89 | 0.919 |
| YaleTFBS/fastq.gz | | fastq.gz | 388 | 160.75 | 49.55 | 1564.84 | 0.996 |
| YaleTFBS/tagAlign | | tagAlign | 520 | 279.45 | 79.95 | 2357.32 | 0.544 |
| YaleTFBS/tagAlign.gz | | tagAlign.gz | 520 | 96.70 | 27.86 | 815.63 | 0.994 |
| YaleTFBS/wig | | wig | 33 | 10.66 | 188.92 | 693.66 | 0.912 |
| YaleTFBS/wig.gz | | wig.gz | 33 | 3.27 | 59.76 | 207.93 | 0.996 |
Measurements of δ-variability in several biological datasets. An exact description of the experiment is available in the Supplementary material online [14].
Detailed inspection of similar file pairs
| Dataset | File pair and remarks | File sizes (in MB) | δ |
|---|---|---|---|
| E61/fa | Homo_sapiens.GRCh37.61.dna_rm.chromosome.HSCHR6_MHC_SSTO.fa | 166.04 | 0.015 |
| | | 166.06 | |
| | These are two alternative haplotype "patch" files for the same chromosome locus. The dataset contains 11 other examples of similar file pairs with | | |
| GPL570/cel | GSM405175.CEL | 12.93 | 8e-6 |
| | GSM341406.CEL | 12.93 | |
| | The second file differs from the first by a single Affymetrix probe measurement. According to GEO metadata, the two files are simply different packagings of the same experimental data by two researchers. The GEO570 dataset contains 9 other examples of similar file pairs with | | |
| GPL570/cel.gz | GSM405175.CEL.gz | 4.31 | 6e-4 |
| | GSM341406.CEL.gz | 4.31 | |
| | A gzip-compressed version of the pair above. The same remarks apply. The most similar pair of actually different data files has | | |
| BioC2.7/BSGenome/u | BSgenome.Athaliana.TAIR.01222004/extdata/chr1.rda | 29.04 | 2e-4 |
| | BSgenome.Athaliana.TAIR.04232008/extdata/chr1.rda | 29.04 | |
| | Consecutive versions of the A. thaliana reference genome. The next most similar file pair in this dataset has | | |
The table lists the suspiciously similar pairs of files from the studied datasets.
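The δ values in the table can be read as a measure of how much two files differ. This section does not define δ, so the sketch below assumes one plausible reading: the fraction of byte positions at which the two files differ, with positions beyond the end of the shorter file counted as differing. The function name `differing_fraction` and the `chunk_size` parameter are illustrative and do not come from the paper.

```python
def differing_fraction(path_a, path_b, chunk_size=1 << 20):
    """Fraction of byte positions at which two files differ.

    Assumption: this is one plausible reading of a pairwise delta,
    not a definition taken from the paper.
    """
    differing = 0
    total = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break
            longest = max(len(a), len(b))
            shortest = min(len(a), len(b))
            # Mismatches within the overlapping part of the two chunks.
            differing += sum(x != y for x, y in zip(a, b))
            # Bytes present in only one of the files count as differing.
            differing += longest - shortest
            total += longest
    # Two empty files are treated as identical.
    return differing / total if total else 0.0
```

Under this reading, a pair of same-size files that differ in a handful of bytes yields a very small value, which is the kind of pair the table flags as suspicious.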
Figure 1. Comparison of PFFF to conventional hashing. The plots show the time needed to hash a single file of a given size with MD5 and with PFFF. Note that the axis scales differ across the four plots.
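To make the comparison in the figure concrete, the sketch below times a conventional full-file MD5 hash against a fingerprint built from a fixed number of sampled bytes. It is not the PFFF algorithm, which this section does not specify. It only illustrates the general idea that a sampling-based fingerprint reads a bounded number of bytes regardless of file size. The names `md5_full`, `sampled_fingerprint`, and `timed`, and the parameters `n_samples` and `seed`, are assumptions made for illustration.

```python
import hashlib
import os
import random
import time


def md5_full(path, chunk_size=1 << 20):
    """Conventional hashing: MD5 over every byte of the file."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def sampled_fingerprint(path, n_samples=1024, seed=0):
    """Hypothetical sampling fingerprint: file size plus bytes read at
    pseudo-random offsets derived from a fixed seed."""
    size = os.path.getsize(path)
    h = hashlib.md5()
    h.update(str(size).encode())
    if size == 0:
        return h.hexdigest()
    # The same seed and size always give the same offsets, so two
    # identical files always produce the same fingerprint.
    rng = random.Random(seed)
    offsets = sorted(rng.randrange(size) for _ in range(n_samples))
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            h.update(f.read(1))
    return h.hexdigest()


def timed(fn, path):
    """Return (result, elapsed seconds) for a single call of fn(path)."""
    start = time.perf_counter()
    result = fn(path)
    return result, time.perf_counter() - start
```

Calling `timed(md5_full, path)` and `timed(sampled_fingerprint, path)` on the same large file gives a rough version of the measurement plotted in the figure. In this sketch, two same-size files that differ in a fraction δ of positions go undetected with probability (1 - δ)^n for n independently sampled positions, so very similar pairs need many samples to be told apart.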