Guillermo Gonzalez-Calderon, Ruizheng Liu, Rodrigo Carvajal, Jamie K Teer.
Abstract
Falling sequencing costs and large initiatives are resulting in increasing amounts of data available for investigator use. However, there are informatics challenges in being able to access genomic data. Performance and storage are well-appreciated issues, but precision is critical for meaningful analysis and interpretation of genomic data. There is an inherent accuracy vs. performance trade-off with existing solutions. The most common approach (Variant-only Storage Model, VOSM) stores only variant data. Systems must therefore assume that everything not variant is reference, sacrificing precision and potentially accuracy. A more complete model (Full Storage Model, FSM) would store the state of every base (variant, reference and missing) in the genome thereby sacrificing performance. A compressed variation of the FSM can store the state of contiguous regions of the genome as blocks (Block Storage Model, BLSM), much like the file-based gVCF model. We propose a novel approach by which this state is encoded such that both performance and accuracy are maintained. The Negative Storage Model (NSM) can store and retrieve precise genomic state from different sequencing sources, including clinical and whole exome sequencing panels. Reduced storage requirements are achieved by storing only the variant and missing states and inferring the reference state. We evaluate the performance characteristics of FSM, BLSM and NSM and demonstrate dramatic improvements in storage and performance using the NSM approach.
Keywords: data storage; genetic variation; next-generation sequencing
Year: 2020 PMID: 32293013 PMCID: PMC7157186 DOI: 10.1093/database/baz158
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Overview of genetic variation storage models. This example shows the same 7 samples and 11 positions using (A) Full Storage Model, (B) Variant Only Storage Model, (C) Block Storage Model and (D) NSM. The features and information stored in each model are listed to the right. FSM: Full Storage Model, VOSM: Variant Only Storage Model, BLSM: Block Storage Model, NSM: Negative Storage Model.
Figure 2. NSM data loading process. The process for determining what information should actually be stored in the database.
Figure 3. NSM data querying process. The process for extracting precise information for a given position based on the model and information stored in the database.
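The loading and querying ideas described in the abstract and in the captions for Figures 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `NegativeStore`, its methods, the in-memory dictionaries and the choice to keep missing positions as inclusive intervals are all assumptions made for illustration. It only shows the core idea that variant and missing states are stored while the reference state is inferred at query time.

```python
from dataclasses import dataclass, field

REF, VAR, MISSING = "reference", "variant", "missing"


@dataclass
class NegativeStore:
    """Toy NSM-style store (hypothetical): reference calls are never written."""
    # (sample, chrom, pos) -> alternate allele
    variants: dict = field(default_factory=dict)
    # (sample, chrom) -> list of inclusive (start, end) missing intervals
    missing: dict = field(default_factory=dict)

    def load(self, sample, chrom, calls):
        """calls: iterable of (pos, state, alt) for one sample and chromosome."""
        runs = self.missing.setdefault((sample, chrom), [])
        for pos, state, alt in sorted(calls, key=lambda c: c[0]):
            if state == VAR:
                self.variants[(sample, chrom, pos)] = alt
            elif state == MISSING:
                if runs and runs[-1][1] == pos - 1:
                    runs[-1] = (runs[-1][0], pos)  # extend the current run
                else:
                    runs.append((pos, pos))        # start a new run
            # REF positions are deliberately skipped

    def query(self, sample, chrom, pos):
        """Return (state, alt); anything not stored is inferred as reference."""
        alt = self.variants.get((sample, chrom, pos))
        if alt is not None:
            return VAR, alt
        for start, end in self.missing.get((sample, chrom), []):
            if start <= pos <= end:
                return MISSING, None
        return REF, None

    def record_count(self):
        return len(self.variants) + sum(len(r) for r in self.missing.values())


# One made-up sample over 11 positions (values invented for illustration).
calls = [
    (1, REF, None), (2, REF, None), (3, VAR, "T"), (4, REF, None),
    (5, MISSING, None), (6, MISSING, None), (7, MISSING, None),
    (8, REF, None), (9, REF, None), (10, VAR, "G"), (11, REF, None),
]
store = NegativeStore()
store.load("S1", "chr1", calls)

print(store.record_count())            # records kept, versus 11 per-base records
print(store.query("S1", "chr1", 3))    # a stored variant
print(store.query("S1", "chr1", 6))    # inside a stored missing interval
print(store.query("S1", "chr1", 8))    # nothing stored, so inferred reference
```

In this toy example a store that kept every base would hold 11 records for the sample, while the sketch keeps two variant records and one missing interval. Unlike a variant-only store, it can still distinguish a reference call at position 8 from a missing call at position 6.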
Performance metrics of genetic variant storage models
| Approach | Load time | Use case 1 | Use case 2 | Use case 3 | Use case 4 | DB size (GB) | DB records (rows) |
|---|---|---|---|---|---|---|---|
| | 216 h | 600–4000 | 600–4000 | 1.1 M–10.5 M | 216 | 1100 | 15.8 B |
| | 1296 h | | | 2628 | 33 | 1665 | “ |
| | 148 h | | | | | 876 | “ |
| | 453 h | 5.5 | 1.9 | 87.1 | | 14.6 | 164.9 M |
| | 2 h | 13.2 | 12.1 | 635 | | 13.8 | 140.7 M |
| | | | | | | 43.9 | “ |
Time in seconds, unless noted
1) 16× Xeon E7-4480, 256 GB RAM, SSD
2) 4× Xeon E5-2650, 64 GB RAM, SSD