Guy Cochrane, Charles E Cook, Ewan Birney.
Abstract
Archives operating under the International Nucleotide Sequence Database Collaboration currently preserve all submitted sequences equally, but rapid increases in the rate of global sequence production will soon require differentiated treatment of DNA sequences submitted for archiving. Here, we propose a graded system in which the ease of reproduction of a sequencing-based experiment and the relative availability of a sample for resequencing define the level of lossy compression applied to stored data.
Year: 2012 PMID: 23587147 PMCID: PMC3617450 DOI: 10.1186/2047-217X-1-2
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Relative cost of regenerating sequences for different classes of experiments
| Class | Description | Sequencing example | Imaging example |
|---|---|---|---|
| 1 | Historical sampling of environment or time point-specific elements | Environmental genomics studies with a longitudinal component; pathogen sequencing from epidemics | Earth imaging; environmental imaging for longitudinal studies |
| 2 | Very rare objects | Ancient DNA specimens; forensic samples | Fossils; rare meteorites |
| 3 | Longitudinal studies which could in theory be rerun in the future but have a > 10 year horizon to recreate | RNA-seq and DNA-seq from a prospective cohort; environmental sequencing of a specific field trial/intervention in an environment | MRI scans from a prospective cohort; cell imaging from a cohort |
| 4 | Samples acquired from patients or animals with a high individual acquisition cost, but a conceptually continuous generation | Cancer DNA sequencing | Histology samples from cancer |
| 5 | A complex experiment with > 6 month resource development | RNA-seq on a specifically created mouse gene knockout (mouse colonies stored) | Cell imaging on a specific RNAi library |
| 6 | A routine experiment with < 6 month resource development | RNA-seq of a standard cell line | Routine imaging of Drosophila embryos |
| 7 | Verification experiment as a component in an overall flow | Resequencing of insert vector | Imaging of cell lines to determine confluence levels |
Relative costs decrease from class 1 through class 7.
One possible set of DNA sequence data compression factors for the various experimental classes
| Class | Description | | |
|---|---|---|---|
| 1 | Historical sampling of environment or time specific elements | 1.0 | 1.0 |
| 2 | Very rare objects | 1.0 | 1.0 |
| 3 | Longitudinal studies which could in theory be rerun in the future but have a > 10 year horizon to recreate | 1.0 | 2.0 |
| 4 | Samples acquired from patients or animals with a high individual acquisition cost, but a conceptually continuous generation | 1.0 | 10.0 |
| 5 | A complex experiment with > 6 month resource development | 10.0 | 100.0 |
| 6 | A routine experiment with < 6 month resource development | 20.0 | 200.0 |
| 7 | Verification experiment as a component in an overall flow | 1000.0 | ∞ (Infinite compression of data indicates no data archiving; it may, however, be useful simply to record that the experiment was carried out.) |
Compression is higher for data that are easy or inexpensive to reproduce, and lower for data derived from unique or irreproducible samples.
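
As a rough illustration of how such a graded scheme might be applied, the sketch below looks up the compression factor for an experimental class and estimates the resulting archived size. It is not taken from the paper. The function name `stored_size`, the `column` parameter, and the reading of a compression factor as "raw size divided by stored size" are assumptions made for this example. The factors are copied from the table above, with classes numbered 1 through 7 in table order, and the two numeric columns are kept in table order because their labels are not given here.

```python
import math

# Compression factors per experimental class, copied from the table above.
# Each tuple holds the two factors listed for that class, in table order.
# Class numbering (1-7) follows the row order of the table.
COMPRESSION_FACTORS = {
    1: (1.0, 1.0),
    2: (1.0, 1.0),
    3: (1.0, 2.0),
    4: (1.0, 10.0),
    5: (10.0, 100.0),
    6: (20.0, 200.0),
    7: (1000.0, math.inf),
}


def stored_size(raw_bytes: float, experiment_class: int, column: int = 0) -> float:
    """Estimate archived size in bytes for a data set of a given class.

    Assumption (hypothetical helper): a factor f means the stored data
    occupy raw_bytes / f. An infinite factor means nothing is archived.
    """
    factor = COMPRESSION_FACTORS[experiment_class][column]
    if math.isinf(factor):
        return 0.0  # no data archived; only a record of the experiment remains
    return raw_bytes / factor


if __name__ == "__main__":
    one_terabyte = 1e12
    for cls in sorted(COMPRESSION_FACTORS):
        sizes = [stored_size(one_terabyte, cls, col) for col in (0, 1)]
        print(f"class {cls}: {sizes[0]:.3g} B, {sizes[1]:.3g} B")
```

Under these assumptions, a 1 TB data set from a class 5 experiment would occupy about 100 GB with the first factor, while a class 1 data set would be stored at full size.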