G V Roshchupkin1,2, H H H Adams1,3, M W Vernooij1,3, A Hofman3, C M Van Duijn3, M A Ikram1,3,4, W J Niessen1,2,5.
Abstract
High-throughput technology can now provide rich information on a person's biological makeup and environmental surroundings. Important discoveries have been made by relating these data to various health outcomes in fields such as genomics, proteomics, and medical imaging. However, cross-investigations between several high-throughput technologies remain impractical because of demanding computational requirements (hundreds of years of computing resources) and unsuitability for collaborative settings (terabytes of data to share). Here we introduce the HASE framework, which overcomes both of these issues. Our approach reduces computational time from years to hours and requires only several gigabytes to be exchanged between collaborators. We implemented a novel meta-analytical method that yields the same power as pooled analyses without the need to share individual participant data. The efficiency of the framework is illustrated by associating 9 million genetic variants with 1.5 million brain imaging voxels in three cohorts (total N = 4,034), followed by meta-analysis, on a standard computational infrastructure. These experiments indicate that HASE facilitates high-dimensional association studies, enabling large multicenter association studies for future discoveries.
Year: 2016 PMID: 27782180 PMCID: PMC5080584 DOI: 10.1038/srep36076
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Comparison of complexity and speed between the HASE framework and a classical workflow.
| Stage | Complexity, classical workflow | Complexity, HASE | Time, classical workflow | Time, HASE | Time, classical workflow | Time, HASE |
|---|---|---|---|---|---|---|
| Single site analysis | | | 2.46 | 0.63 | 2.46 × 10⁶ | 0.70 |
| Data transfer | | | 0.04 | 0.07 | 4 × 10⁴ | 11.6 |
| Meta-analysis | | | 0.06 | 0.03 | 6 × 10⁴ | 1.7 × 10³ |
a. Based on a model with three covariates and 9 million genetic variants, for a total of 4034 participants from three sites. For the classical workflow we used the PLINK software for single site analysis and METAL for the meta-analysis.
b. For single site analysis and meta-analysis the time is given in CPU hours; for the data transfer stage it is given in hours, assuming an average network speed of 10 Mbps.
c. Complexity for CPU hours is given in terms of classical computational time complexity; complexity for data transfer is given in terms of how the size of the data to be transferred depends on the size of the input data.
\* This time is derived from the transfer of partial derivatives only, because for an association analysis with relatively few phenotypes it is not necessary to transfer encoded data.
Complexity is expressed in terms of the number of individuals in the study, the number of phenotypes of interest, the number of tests (genetic variants), and the number of sites in the meta-analysis.
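The time figures in the table and footnotes can be sanity-checked with simple arithmetic. The sketch below assumes that the 0.04 h and 11.6 h entries are data-transfer times at the stated 10 Mbps and that 2.46 × 10⁶ is a CPU-hour count; the helper names are hypothetical and the conversions use decimal units and a 365-day year.

```python
# Back-of-the-envelope conversions for the table entries.
# Assumption: network speed of 10 megabits per second, as stated in footnote b.
NETWORK_MBPS = 10.0


def transferred_gigabytes(hours: float, mbps: float = NETWORK_MBPS) -> float:
    """Data volume in decimal gigabytes moved in `hours` at `mbps` megabits/s."""
    megabits = hours * 3600 * mbps
    return megabits / 8 / 1000  # megabits -> megabytes -> gigabytes


def cpu_hours_to_years(cpu_hours: float) -> float:
    """Convert CPU hours to years of continuous single-core computation."""
    return cpu_hours / (24 * 365)


print(f"0.04 h of transfer  ~ {transferred_gigabytes(0.04):.2f} GB")
print(f"11.6 h of transfer  ~ {transferred_gigabytes(11.6):.1f} GB")
print(f"2.46e6 CPU hours    ~ {cpu_hours_to_years(2.46e6):.0f} years")
```

Under these assumptions the last conversion is consistent with the "280 years" quoted for computing summary statistics with the classical workflow.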
Figure 1. Analysis time (HASE versus RegScan) with 2,172,718 variants.
(A) for 1 phenotype; (B) for 100 phenotypes; (C) for 1000 phenotypes.
Figure 2. Manhattan plot of the hippocampus voxel with the most significant association after screening all 7030 hippocampal voxels.
The most significant association (rs77956314; p = 3 × 10⁻⁹) corresponded to a previously identified locus on chromosome 12q24. Such a voxel-wise hippocampus screening would take less than 8 hours on a standard laptop.
Figure 3. Correlation plot of voxel GWAS t-statistics estimated from pooled data against voxel GWAS t-statistics estimated from a meta-analysis of partial derivatives and the encoded matrix.
Pre-computing the data took 40 min for a single site, instead of the 280 years needed to compute summary statistics.
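The agreement between pooled and meta-analysed t-statistics can be illustrated with ordinary least squares. The following is a minimal sketch, not the HASE implementation: it assumes the shared quantities are the per-site cross-products X'X, X'y and y'y, uses simulated data with hypothetical function names, and omits the encoding step entirely.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate_site(n):
    """Hypothetical site data: intercept, two covariates, one SNP coded 0/1/2."""
    covariates = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    genotype = rng.integers(0, 3, size=n).astype(float)
    X = np.column_stack([covariates, genotype])
    y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(size=n)
    return X, y


def t_from_cross_products(A, B, C, n):
    """OLS t-statistics from A = X'X, B = X'y, C = y'y and the sample size n."""
    A_inv = np.linalg.inv(A)
    beta = A_inv @ B
    rss = C - B @ beta  # residual sum of squares: y'y - y'X beta
    sigma2 = rss / (n - len(B))
    return beta / np.sqrt(sigma2 * np.diag(A_inv))


def t_from_raw(X, y):
    """Reference OLS t-statistics computed directly from individual-level data."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return beta / np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))


sites = [simulate_site(n) for n in (500, 800, 300)]

# Each site contributes only small summary matrices, never its raw rows.
A = sum(X.T @ X for X, _ in sites)
B = sum(X.T @ y for X, y in sites)
C = sum(y @ y for _, y in sites)
n_total = sum(len(y) for _, y in sites)
t_meta = t_from_cross_products(A, B, C, n_total)

# Pooled analysis on the stacked individual-level data, for comparison.
X_all = np.vstack([X for X, _ in sites])
y_all = np.concatenate([y for _, y in sites])
t_pooled = t_from_raw(X_all, y_all)

print("max |t_meta - t_pooled| =", np.max(np.abs(t_meta - t_pooled)))
```

Because the cross-products are additive over participants, summing them across sites reproduces the pooled estimates up to floating-point error, which is why such a comparison is expected to fall on the identity line.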
Figure 4. Explanation of the speed-up achieved in the HASE framework by removing redundant computations.
In HASE, the multi-dimensional matrices A and B need to be calculated to perform a GWAS. Grey elements do not need to be calculated, because the A matrix is symmetric. Green elements need to be calculated only once. Blue elements have to be calculated once for every SNP, and yellow elements once for every phenotype. Red marks the most computationally expensive element, which needs to be calculated for every combination of phenotype and genotype. N denotes the number of subjects in the study.
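The reuse pattern described in the caption can be sketched in NumPy. This is an illustrative sketch under assumptions, not the HASE code: it takes A to be the cross-product of the design matrix (covariates plus one SNP) and B the cross-product of the design matrix with the phenotypes, and all variable names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_cov, n_snps, n_pheno = 200, 3, 5, 4  # small hypothetical sizes

C = np.column_stack([np.ones(n), rng.normal(size=(n, n_cov - 1))])  # covariates
G = rng.integers(0, 3, size=(n, n_snps)).astype(float)              # genotypes
Y = rng.normal(size=(n, n_pheno))                                   # phenotypes

CtC = C.T @ C                        # computed once
CtG = C.T @ G                        # one column per SNP
GtG = np.einsum("ij,ij->j", G, G)    # one value per SNP
CtY = C.T @ Y                        # one column per phenotype
YtY = np.einsum("ij,ij->j", Y, Y)    # one value per phenotype
GtY = G.T @ Y                        # one value per SNP-phenotype pair (costliest)

dof = n - (n_cov + 1)
t = np.empty((n_snps, n_pheno))
for j in range(n_snps):
    # Assemble the symmetric A for SNP j from the cached blocks.
    A = np.empty((n_cov + 1, n_cov + 1))
    A[:n_cov, :n_cov] = CtC
    A[:n_cov, -1] = CtG[:, j]
    A[-1, :n_cov] = CtG[:, j]
    A[-1, -1] = GtG[j]
    A_inv = np.linalg.inv(A)
    B = np.vstack([CtY, GtY[j]])               # all phenotypes at once
    beta = A_inv @ B
    rss = YtY - np.einsum("ij,ij->j", B, beta)
    t[j] = beta[-1] / np.sqrt(rss / dof * A_inv[-1, -1])


def naive_t(g, y):
    """Reference: fit one regression from scratch and return the SNP t-statistic."""
    X = np.column_stack([C, g])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / dof
    return beta[-1] / np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[-1, -1])


print(t[2, 1], naive_t(G[:, 2], Y[:, 1]))
```

In this sketch the only term that scales with the number of SNP-phenotype pairs is the single matrix product `G.T @ Y`; everything else is computed once, once per SNP, or once per phenotype.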