Patrick P Putnam, Ge Zhang, Philip A Wilsey.
Abstract
BACKGROUND: In recent years, genetic data analysis has seen a rapid increase in the scale of data to be analyzed. Schadt et al. (NRG 11:647-657, 2010) noted that, with data sets approaching the petabyte scale, data-related challenges such as formatting, management, and transfer are increasingly important topics that need to be addressed. The use of succinct data structures is one method of reducing the physical size of a data set without the use of expensive compression techniques. In this work, we consider the use of 2- and 3-bit encoding schemes for genotype data. We compare the computational performance of allele or genotype counting algorithms utilizing genotype data encoded in both schemes.
Year: 2013 PMID: 24359123 PMCID: PMC3879196 DOI: 10.1186/1471-2105-14-369
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Frequency table for raw input from Tables 3, 4 and 5
| Marker | AA | Aa | aa | NN |
|---|---|---|---|---|
| MA | 2 | 1 | 1 | 1 |
| MB | 2 | 1 | 2 | 0 |
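As an illustration of how such a frequency table follows from raw genotype calls, here is a minimal Python sketch. It uses the example genotype input for markers MA and MB given later in this section; the names `GENOTYPES`, `CATEGORIES`, and `frequency_table` are hypothetical and not taken from the paper.

```python
from collections import Counter

# Example genotype calls for two markers across five individuals.
# NN marks a missing call. Values are copied from the example input.
GENOTYPES = {
    "MA": ["AA", "Aa", "AA", "aa", "NN"],
    "MB": ["AA", "AA", "aa", "aa", "Aa"],
}
CATEGORIES = ["AA", "Aa", "aa", "NN"]


def frequency_table(genotypes):
    """Return {marker: [count for each entry of CATEGORIES, in order]}."""
    table = {}
    for marker, calls in genotypes.items():
        counts = Counter(calls)  # missing categories count as 0
        table[marker] = [counts[c] for c in CATEGORIES]
    return table


freq = frequency_table(GENOTYPES)
for marker, row in freq.items():
    print(marker, dict(zip(CATEGORIES, row)))
```

Counting directly over strings like this is the baseline that the bit-encoded schemes aim to replace with bitwise operations.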
Pairwise genotype count table for two markers
| MA \ MB | AA | Aa | aa | NN | Total |
|---|---|---|---|---|---|
| AA | 1 | 0 | 1 | 0 | 2 |
| Aa | 1 | 0 | 0 | 0 | 1 |
| aa | 0 | 0 | 1 | 0 | 1 |
| NN | 0 | 1 | 0 | 0 | 1 |
| Total | 2 | 1 | 2 | 0 | |
Note that the marginal sums of this table are the individual marker frequencies from Table 1.
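The relationship between the pairwise counts and the per-marker frequencies can be checked with a short Python sketch. It assumes the example calls for MA and MB from this section; `contingency_table` is a hypothetical helper name, and rows are taken to index MA while columns index MB.

```python
CATEGORIES = ["AA", "Aa", "aa", "NN"]

# Example calls for five individuals (NN marks a missing call).
MA = ["AA", "Aa", "AA", "aa", "NN"]
MB = ["AA", "AA", "aa", "aa", "Aa"]


def contingency_table(first, second):
    """Count genotype pairs; rows follow `first`, columns follow `second`."""
    index = {g: i for i, g in enumerate(CATEGORIES)}
    table = [[0] * len(CATEGORIES) for _ in CATEGORIES]
    for a, b in zip(first, second):
        table[index[a]][index[b]] += 1
    return table


table = contingency_table(MA, MB)
row_sums = [sum(row) for row in table]        # per-genotype counts for MA
col_sums = [sum(col) for col in zip(*table)]  # per-genotype counts for MB

for genotype, row, total in zip(CATEGORIES, table, row_sums):
    print(genotype, row, total)
print("column sums", col_sums)
```

The row sums reproduce the MA frequencies and the column sums reproduce the MB frequencies, which is the marginal property noted above.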
Example genotype input
| Marker | I1 | I2 | I3 | I4 | I5 |
|---|---|---|---|---|---|
| MA | AA | Aa | AA | aa | NN |
| MB | AA | AA | aa | aa | Aa |
I1-5 represent individuals, and MA and MB are markers.
3-bit encoding scheme
| Marker | Bit vector | I1 | I2 | I3 | I4 | I5 |
|---|---|---|---|---|---|---|
| MA | AA | 1 | 0 | 1 | 0 | 0 |
| MA | Aa | 0 | 1 | 0 | 0 | 0 |
| MA | aa | 0 | 0 | 0 | 1 | 0 |
| MB | AA | 1 | 1 | 0 | 0 | 0 |
| MB | Aa | 0 | 0 | 0 | 0 | 1 |
| MB | aa | 0 | 0 | 1 | 1 | 0 |
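One way to read the 3-bit scheme: each marker keeps three bit vectors, one per genotype, and an individual's bit is set only in the vector matching their call, so a missing call (NN) sets no bit. The Python sketch below uses plain integers as bit vectors and is illustrative only; `encode_3bit`, `popcount`, and `as_row` are hypothetical names, and mapping individual I1 to the lowest bit is an assumption of this sketch.

```python
def encode_3bit(calls):
    """Return {"AA": int, "Aa": int, "aa": int}; bit i belongs to individual i."""
    vectors = {"AA": 0, "Aa": 0, "aa": 0}
    for i, call in enumerate(calls):
        if call in vectors:  # NN (missing) leaves all three bits at 0
            vectors[call] |= 1 << i
    return vectors


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def as_row(vector, n):
    """Render bits in individual order I1..In, matching the table layout."""
    return "".join(str((vector >> i) & 1) for i in range(n))


MA = ["AA", "Aa", "AA", "aa", "NN"]
encoded = encode_3bit(MA)

# Each genotype count is a single popcount; NN is whatever remains.
counts = {g: popcount(v) for g, v in encoded.items()}
counts["NN"] = len(MA) - sum(counts.values())

for genotype, vector in encoded.items():
    print(genotype, as_row(vector, len(MA)))
print(counts)
```

With this layout, a genotype count needs one population count per vector, at the cost of storing three bits per individual per marker.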
2-bit encoding scheme
| Marker | Bit vector | I1 | I2 | I3 | I4 | I5 |
|---|---|---|---|---|---|---|
| MA | AA OR aa | 1 | 0 | 1 | 1 | 0 |
| MA | Aa OR aa | 0 | 1 | 0 | 1 | 0 |
| MB | AA OR aa | 1 | 1 | 1 | 1 | 0 |
| MB | Aa OR aa | 0 | 0 | 1 | 1 | 1 |
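Read row by row, the table implies the per-individual codes AA = (1, 0), Aa = (0, 1), aa = (1, 1), and NN = (0, 0), where the first bit comes from the "AA OR aa" vector and the second from the "Aa OR aa" vector. The Python sketch below encodes and decodes under that reading; `CODES`, `encode_2bit`, `decode_2bit`, and `as_row` are hypothetical names, and placing individual I1 in the lowest bit is an assumption of this sketch.

```python
# (bit in "AA OR aa" vector, bit in "Aa OR aa" vector) for each call
CODES = {"AA": (1, 0), "Aa": (0, 1), "aa": (1, 1), "NN": (0, 0)}


def encode_2bit(calls):
    """Return (x, y): x is the "AA OR aa" vector, y the "Aa OR aa" vector."""
    x = y = 0
    for i, call in enumerate(calls):
        bx, by = CODES[call]
        x |= bx << i
        y |= by << i
    return x, y


def decode_2bit(x, y, n):
    """Recover the list of calls for n individuals from the two vectors."""
    lookup = {bits: call for call, bits in CODES.items()}
    return [lookup[((x >> i) & 1, (y >> i) & 1)] for i in range(n)]


def as_row(vector, n):
    """Render bits in individual order I1..In, matching the table layout."""
    return "".join(str((vector >> i) & 1) for i in range(n))


MA = ["AA", "Aa", "AA", "aa", "NN"]
MB = ["AA", "AA", "aa", "aa", "Aa"]

xa, ya = encode_2bit(MA)
xb, yb = encode_2bit(MB)
print("MA", as_row(xa, 5), as_row(ya, 5))
print("MB", as_row(xb, 5), as_row(yb, 5))
```

All four categories, including missing data, fit in two bits, so the encoding stays lossless while using one bit less per genotype than the 3-bit scheme.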
Figure 1. Constructing a frequency table from 2-bit encoded genotypes.
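The figure itself is not reproduced here, so the following Python sketch shows only one plausible way to derive genotype counts from the two 2-bit vectors, and it may differ from the authors' procedure. It relies on the reading that aa sets both bits, so aa is the overlap of the two vectors; `genotype_counts` and `popcount` are hypothetical names.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def genotype_counts(x, y, n):
    """Counts from x ("AA OR aa") and y ("Aa OR aa") for n individuals."""
    aa = popcount(x & y)       # both bits set
    AA = popcount(x) - aa      # x covers AA and aa
    Aa = popcount(y) - aa      # y covers Aa and aa
    NN = n - (AA + Aa + aa)    # neither bit set
    return {"AA": AA, "Aa": Aa, "aa": aa, "NN": NN}


# Bit vectors copied from the 2-bit encoding table (I1 is the leftmost bit).
MA = (int("10110", 2), int("01010", 2))
MB = (int("11110", 2), int("00111", 2))

counts_a = genotype_counts(*MA, 5)
counts_b = genotype_counts(*MB, 5)
print("MA", counts_a)
print("MB", counts_b)
```

This variant needs three population counts and one AND per marker; the bit order inside the integers does not matter as long as both vectors use the same order.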
Figure 2. Constructing a contingency table from 2-bit encoded genotypes.
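As with the frequency table, the figure is not available here, so this Python sketch is an assumption about how a pairwise table could be built from 2-bit vectors rather than the authors' exact method. It first expands each marker into four per-genotype indicator masks and then counts the overlap of every mask pair; `indicator_masks` and `contingency` are hypothetical names.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def indicator_masks(x, y, n):
    """Per-genotype masks [AA, Aa, aa, NN] from the two 2-bit vectors."""
    full = (1 << n) - 1  # keeps complements within n individuals
    return [x & ~y & full, ~x & y & full, x & y, ~(x | y) & full]


def contingency(first, second, n):
    """4x4 table: rows follow `first`, columns follow `second`."""
    rows = indicator_masks(*first, n)
    cols = indicator_masks(*second, n)
    return [[popcount(r & c) for c in cols] for r in rows]


# Bit vectors copied from the 2-bit encoding table (I1 is the leftmost bit).
MA = (int("10110", 2), int("01010", 2))
MB = (int("11110", 2), int("00111", 2))

table = contingency(MA, MB, 5)
for genotype, row in zip(["AA", "Aa", "aa", "NN"], table):
    print(genotype, row)
```

Each cell costs one AND and one population count, with extra bitwise work up front to separate the genotypes that the 2-bit vectors store in combined form.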
Figure 3. Average Case/Control frequency table construction using simulated data following Affy6 SNPs of HapMap CEU individuals.
Figure 4. Average Case/Control contingency table construction using simulated data following Affy6 SNPs of HapMap CEU individuals.
Epistasis runtime comparison
| 1000 | 28.56 s | 28.45 s | 0.37 |
| 5000 | 92.07 s | 93.32 s | -1.33 |
| 10000 | 173.12 s | 177.46 s | -2.45 |
| 25000 | 418.31 s | 420.71 s | -0.57 |
| 50000 | 810.71 s | 820.26 s | -1.16 |
| 150000 | 2408.05 s | 2427.84 s | -0.81 |
Speedup is measured relative to the 3-bit runtime.
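The column headers of this table are not preserved, so which runtime column belongs to which encoding is not stated here. As a plain arithmetic check, and taking the last runtime as 2427.84 s, the final column is reproduced to within a few hundredths of a percentage point by the percent difference of the first runtime relative to the second. The Python sketch below performs that check; `ROWS` and `relative_difference` are hypothetical names, and the formula is an inference from the numbers, not a definition given in the source.

```python
# (problem size, first runtime [s], second runtime [s], reported last column)
ROWS = [
    (1000, 28.56, 28.45, 0.37),
    (5000, 92.07, 93.32, -1.33),
    (10000, 173.12, 177.46, -2.45),
    (25000, 418.31, 420.71, -0.57),
    (50000, 810.71, 820.26, -1.16),
    (150000, 2408.05, 2427.84, -0.81),  # last runtime taken as 2427.84 s
]


def relative_difference(first, second):
    """Percent difference of the first runtime relative to the second."""
    return 100.0 * (first - second) / second


deviations = []
for size, first, second, reported in ROWS:
    computed = relative_difference(first, second)
    deviations.append(abs(computed - reported))
    print(f"{size:>6}  computed {computed:+.2f}  reported {reported:+.2f}")

max_dev = max(deviations)
print(f"largest deviation: {max_dev:.3f} percentage points")
```

Under this reading every entry in the last column is a percentage of small magnitude, that is, the two runtimes differ by less than 3% at every problem size.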
Figure 5. Average epistasis runtime using the BOOST [6] algorithm.