Marc Bouffard, Michael S Phillips, Andrew M K Brown, Sharon Marsh, Jean-Claude Tardif, Tibor van Rooij.
Abstract
Data generation, driven by rapid advances in genomic technologies, is fast outpacing our analysis capabilities. Faced with this flood of data, more hardware and software resources are added to accommodate data sets whose structure has not specifically been designed for analysis. This leads to unnecessarily lengthy processing times and excessive data handling and storage costs. Current efforts to address this have centered on developing new indexing schemas and analysis algorithms, whereas the root of the problem lies in the format of the data itself. We have developed a new data structure for storing and analyzing genotype and phenotype data. By leveraging data normalization techniques, database management system capabilities and the use of a novel multi-table, multidimensional database structure we have eliminated the following: (i) unnecessarily large data set size due to high levels of redundancy, (ii) sequential access to these data sets and (iii) common bottlenecks in analysis times. The resulting novel data structure horizontally divides the data to circumvent traditional problems associated with the use of databases for very large genomic data sets. The resulting data set required 86% less disk space and performed analytical calculations 6248 times faster compared to a standard approach without any loss of information. Database URL: http://castor.pharmacogenomics.ca.
Year: 2010 PMID: 21159730 PMCID: PMC3004464 DOI: 10.1093/database/baq029
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Genomic data structure with a large amount of duplicate data
| Sample identifier | SNP | SNP value | SNP position |
|---|---|---|---|
| Sample 1 | rs3094315 | CC | 742 429 |
| Sample 1 | rs41480945 | CC | 21 227 772 |
| Sample 1 | rs4040617 | CG | 95 952 929 |
| Sample 2 | rs3094315 | TT | 742 429 |
| Sample 2 | rs41480945 | AT | 21 227 772 |
| Sample 2 | rs4040617 | CC | 95 952 929 |
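The redundancy in the table above (each SNP position repeated for every sample) can be illustrated with a small normalization sketch. This is only one possible layout under stated assumptions, not necessarily the CASTOR schema: the names `snp_dim` and `genotypes` are hypothetical, and positions are written as plain integers.

```python
# Rows copied from the table above: (sample, SNP, genotype, SNP position).
rows = [
    ("Sample 1", "rs3094315", "CC", 742429),
    ("Sample 1", "rs41480945", "CC", 21227772),
    ("Sample 1", "rs4040617", "CG", 95952929),
    ("Sample 2", "rs3094315", "TT", 742429),
    ("Sample 2", "rs41480945", "AT", 21227772),
    ("Sample 2", "rs4040617", "CC", 95952929),
]

# Hypothetical SNP dimension: each SNP and its position is stored once,
# instead of once per sample.
snp_dim = {}
for _sample, snp, _genotype, position in rows:
    snp_dim.setdefault(snp, position)

# Fixed column order for the per-sample genotype rows.
snp_order = list(snp_dim)

# Hypothetical fact layout: one row per sample, one column per SNP.
genotypes = {}
for sample, snp, genotype, _position in rows:
    row = genotypes.setdefault(sample, [None] * len(snp_order))
    row[snp_order.index(snp)] = genotype

# Count stored fields before and after normalization.
fields_before = len(rows) * 4
fields_after = len(snp_dim) * 2 + sum(1 + len(r) for r in genotypes.values())
```

In this toy case the SNP identifiers and positions are stored three times rather than six, and the saving grows with the number of samples.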
Genotype dimension table (see genotypes_dim in Figure 1)
| Code | Genotype | Allele_a | Allele_c | Allele_g | Allele_t |
|---|---|---|---|---|---|
| 1 | AA | 2 | 0 | 0 | 0 |
| 2 | CC | 0 | 2 | 0 | 0 |
| 3 | GG | 0 | 0 | 2 | 0 |
| 4 | TT | 0 | 0 | 0 | 2 |
| 5 | AC | 1 | 1 | 0 | 0 |
| 6 | AG | 1 | 0 | 1 | 0 |
| 7 | AT | 1 | 0 | 0 | 1 |
| 8 | CG | 0 | 1 | 1 | 0 |
| 9 | CT | 0 | 1 | 0 | 1 |
| 10 | GT | 0 | 0 | 1 | 1 |
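As a rough illustration of how such a lookup table might be used, the sketch below regenerates the ten rows shown above and sums allele counts for a list of genotype codes. The function names `build_genotypes_dim` and `allele_totals` are hypothetical, and the code ordering (homozygous genotypes first, then heterozygous) is inferred from the table rather than from any published implementation.

```python
from itertools import combinations

ALLELES = "ACGT"

def build_genotypes_dim():
    """Return rows shaped like the genotype dimension table above."""
    homozygous = [a + a for a in ALLELES]
    heterozygous = [a + b for a, b in combinations(ALLELES, 2)]
    rows = []
    for code, genotype in enumerate(homozygous + heterozygous, start=1):
        row = {"code": code, "genotype": genotype}
        for allele in ALLELES:
            # Number of copies of this allele in the genotype (0, 1 or 2).
            row["allele_" + allele.lower()] = genotype.count(allele)
        rows.append(row)
    return rows

def allele_totals(codes, dim_rows):
    """Sum per-allele counts for a sequence of genotype codes."""
    by_code = {row["code"]: row for row in dim_rows}
    totals = {"allele_" + a.lower(): 0 for a in ALLELES}
    for code in codes:
        for key in totals:
            totals[key] += by_code[code][key]
    return totals

genotypes_dim = build_genotypes_dim()
# Sample 1 from the first table has genotypes CC, CC, CG, i.e. codes 2, 2, 8.
sample_1_totals = allele_totals([2, 2, 8], genotypes_dim)
```

Storing a small integer code in place of the genotype string means allele counts can be obtained by lookup and addition rather than by parsing text.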
Figure 1. CASTOR data diagram.
Phenotype dimension table (see phenotypes_dim, Figure 1)
| Id | Name | Discrete | Description | Column |
|---|---|---|---|---|
| 1 | Medication dosing (units) | 0 | Medication dose per day in units | ptype1 |
| 2 | Pain severity | 0 | Severity of patient pain | ptype2 |
| 3 | Smoking status | 1 | Never, former, current | ptype3 |
Discrete phenotype dimension table (see phenotypes_discrete_dim, Figure 1)
| Code | Phenotypes_dim_id | Label |
|---|---|---|
| 1 | 3 | Never smoked |
| 2 | 3 | Former smoker |
| 3 | 3 | Current smoker |
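The relationship between the two phenotype dimension tables can be sketched with an in-memory SQLite database. This is a minimal illustration under assumptions: the study used Oracle 11G, not SQLite; the column called "Column" in the table above is renamed `column_name` here to avoid the SQL keyword; and the helper `decode_phenotype` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE phenotypes_dim (
    id INTEGER PRIMARY KEY,
    name TEXT,
    discrete INTEGER,      -- 1 if the phenotype takes coded values
    description TEXT,
    column_name TEXT       -- "Column" in the table above
);
CREATE TABLE phenotypes_discrete_dim (
    code INTEGER,
    phenotypes_dim_id INTEGER REFERENCES phenotypes_dim(id),
    label TEXT
);
""")
conn.executemany(
    "INSERT INTO phenotypes_dim VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Medication dosing (units)", 0, "Medication dose per day in units", "ptype1"),
        (2, "Pain severity", 0, "Severity of patient pain", "ptype2"),
        (3, "Smoking status", 1, "Never, former, current", "ptype3"),
    ],
)
conn.executemany(
    "INSERT INTO phenotypes_discrete_dim VALUES (?, ?, ?)",
    [(1, 3, "Never smoked"), (2, 3, "Former smoker"), (3, 3, "Current smoker")],
)

def decode_phenotype(conn, phenotype_name, code):
    """Return the label for a coded value, or None if the phenotype is not discrete."""
    row = conn.execute(
        """
        SELECT d.label
        FROM phenotypes_dim AS p
        JOIN phenotypes_discrete_dim AS d ON d.phenotypes_dim_id = p.id
        WHERE p.name = ? AND p.discrete = 1 AND d.code = ?
        """,
        (phenotype_name, code),
    ).fetchone()
    return row[0] if row else None

label = decode_phenotype(conn, "Smoking status", 2)
```

Here the stored phenotype value stays a compact integer, and the human-readable label is recovered by a join only when needed.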
Query return times of common statistical analyses based on a single table query (genotype or phenotype)
| Query | CASTOR (s) | Original (s) |
|---|---|---|
| Query (gtypes) avg(int) | 0.347017 | 871.454132 |
| Query (gtypes) sqrt(int) | 0.096701 | 0.050104 |
| Query (gtypes) min(int) | 0.319485 | 716.520514 |
| Query (ptypes) stddev(int) | 0.014837 | 1341.081771 |
| Query (ptypes) avg(float) | 0.010675 | 1417.641397 |
| Query (ptypes) sqrt(float) | 0.003062 | 12.227921 |
| Query (ptypes) min(float) | 0.009296 | 0.014992 |
| Query (ptypes) stddev(float) | 0.013895 | 3.202807 |
| Query (ptypes) var_pop(float) | 0.010164 | 16.41966 |
| Query (gtypes) count(int) where int is 1 | 0.325984 | 16.669058 |
| Query (gtypes) count(int) where int is 3 | 0.358017 | 641.80022 |
| Query (gtypes) count(int) where int is 4 and patient_id = 1234 | 0.027244 | 668.470442 |
All queries performed on the Oracle 11G DBMS.
Performance comparison results
| | Original | CASTOR |
|---|---|---|
| Size of genotype data | 97 GB | 6.8 MB × 2000 tables = 13.3 GB |
| Size of phenotype data | 1.4 GB | 34 MB × 14 tables = 476 MB |
| Total data set size | 98.4 GB | 13.776 GB |
| Oracle 11G: load time (min) | 493.5 (8.23 h) | 90.3 (1.51 h) |
| Oracle 11G: Total time to run all performance tests (min) | 937.2 (15.62 h) | 0.15 (9.1 s) |
| Oracle 11G: Total time to perform evaluation (min) | 1430.7 (23.85 h) | 90.18 (1.50 h) |
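The headline figures in the abstract appear to follow from the totals in this table. The short calculation below is a sanity check using only values copied from the table; the variable names are illustrative.

```python
# Total data set size, in GB, from the table above.
original_gb = 98.4
castor_gb = 13.776
space_saving = (original_gb - castor_gb) / original_gb

# Total time to run all performance tests, in minutes.
original_tests_min = 937.2
castor_tests_min = 0.15
test_speedup = original_tests_min / castor_tests_min

# Load time, in minutes.
load_speedup = 493.5 / 90.3

print(f"Disk space saved: {space_saving:.0%}")
print(f"Performance test speed-up: {test_speedup:.0f}x")
print(f"Load time speed-up: {load_speedup:.1f}x")
```

The first two results match the 86% and 6248-fold figures quoted in the abstract, which suggests the speed-up refers to the total time to run all performance tests rather than to any single query.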