Andrew Hill, Po-Ru Loh, Ragu B Bharadwaj, Pascal Pons, Jingbo Shang, Eva Guinan, Karim Lakhani, Iain Kilty, Scott A Jelinsky.
Abstract
Background: The association of differing genotypes with disease-related phenotypic traits offers great potential to both help identify new therapeutic targets and support stratification of patients who would gain the greatest benefit from specific drug classes. Development of low-cost genotyping and sequencing has made collecting large-scale genotyping data routine in population and therapeutic intervention studies. In addition, a range of new technologies is being used to capture numerous new and complex phenotypic descriptors. As a result, genotype and phenotype datasets have grown exponentially. Genome-wide association studies associate genotypes and phenotypes using methods such as logistic regression. As existing tools for association analysis limit the efficiency by which value can be extracted from increasing volumes of data, there is a pressing need for new software tools that can accelerate association analyses on large genotype-phenotype datasets.
Keywords: Crowdsourcing; Genome-wide association study; Logistic regression; Open innovation; PLINK
Year: 2017 PMID: 28327993 PMCID: PMC5467032 DOI: 10.1093/gigascience/gix009
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: Iterative open source contests to accelerate logistic regression for GWAS analysis. Workflow and code outputs of our project. First, a 10-day marathon crowdsourcing competition was hosted to accelerate the logistic regression code from PLINK 1.07, yielding code C1. The accelerated logistic regression code was integrated back into PLINK 1.07, yielding code PLINK-FLR. The logistic regression elements were donated and integrated into the PLINK2 project. A first-to-finish contest was run to speed up data initialization in the C1 code, yielding code C2. Another first-to-finish contest was run to multithread the C2 code, yielding code C3. C3 was then combined with coarse-grained HPC parallelization and PLINK-FLR, yielding mPLINK.
Acceleration of logistic regression
| Code | n (# replicated runs) | Avg time (sec) | SD time (sec) | Fold-speedup vs PLINK 1.07 |
|---|---|---|---|---|
| PLINK 1.07 (logistic regression only) | 5 | 68.8 | 14.58 | |
| LRC4 | 5 | 1.5 | 0.03 | |
| LRC5 | 5 | 1.9 | 0.05 | |
| LRC1 | 5 | 2.3 | 0.48 | |
| LRC3 | 5 | 3.8 | 1.17 | |
| LRC2 | 5 | 3.9 | 0.06 | |
All results are on a test set with N = 6000, M = 7000, P = 1, and C = 5 in the HPC environment, where N is the number of subjects, M is the number of genetic markers (variants), P is the number of phenotypes, and C is the number of covariates. The first row of the table indicates the end-to-end run time of PLINK 1.07, for context. Subsequent rows indicate run times of isolated logistic regression routines.
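To make concrete what one of these logistic regression routines computes for a single marker, here is a minimal sketch of a Newton-Raphson fit in plain Python. It is a generic textbook formulation, not the PLINK 1.07 code or any of the contest submissions; the function names `solve_linear` and `fit_logistic` and the toy data are assumptions made for illustration. Each input row is assumed to hold an intercept term, the genotype at one marker, and any covariates.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def fit_logistic(rows, y, max_iter=25, tol=1e-8):
    """Newton-Raphson fit; each row is [1, genotype, covariates...]."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, yi in zip(rows, y):
            eta = sum(b * v for b, v in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))   # predicted case probability
            w = p * (1.0 - p)                  # per-subject weight
            for i in range(k):
                grad[i] += (yi - p) * x[i]
                for j in range(k):
                    hess[i][j] += w * x[i] * x[j]
        step = solve_linear(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Toy data (hypothetical): 8 subjects, one marker coded 0/1, no covariates.
genotypes = [0, 0, 0, 0, 1, 1, 1, 1]
phenotype = [1, 0, 0, 0, 1, 1, 1, 0]
beta = fit_logistic([[1.0, float(g)] for g in genotypes], phenotype)
```

A GWAS repeats a fit of this kind once per marker and phenotype, which is why the cost of the inner loop dominates total run time at the scales described here.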
Figure 2: Run time of codes in AWS environment. Run times of a test case with dimensions N = 6678, M = 645 863, C = 7, P = 1 were determined. Shown are run times for PLINK 1.07 (P1.07), PLINK-FLR (P-FLR), C1, C2, and C3. Values above the columns represent run times in seconds. See text for detailed description of codes.
mPLINK wall-clock runtimes (seconds) in HPC environment
| Test case | N6678 × M645863 | N6678 × M3200000 | N7000 × M7000000 |
|---|---|---|---|
| M*N | 4 313 073 114 | 21 369 600 000 | 49 000 000 000 |
| Software run | | | |
| PLINK-1.07 | 17 146 | 70 617 | 172 602 |
| mPLINK (1 process) | 94 | NA (RAM)* | NA (RAM)* |
| mPLINK (5 processes) | 34 | 109 | 281 |
| mPLINK (10 processes) | 29 | 111 | 199 |
| mPLINK (50 processes) | 39 | 60 | 119 |
| Max speedup (fold) | 591× | 1177× | 1450× |
*NA (RAM) signifies that the dataset was too large to load into memory, so no run time was measured.
N refers to the number of subjects; M is the number of genetic markers (variants).
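The "Max speedup" row can be reproduced from the other rows as the PLINK-1.07 run time divided by the fastest completed mPLINK run for the same test case. The short sketch below does that arithmetic; the dictionary keys and the helper name `max_fold_speedup` are illustrative assumptions, and the numbers are copied from the table.

```python
# Wall-clock run times in seconds, copied from the table above.
# None marks runs that could not be loaded into memory (NA (RAM)).
plink_107 = {
    "N6678xM645863": 17146,
    "N6678xM3200000": 70617,
    "N7000xM7000000": 172602,
}
mplink = {  # 1, 5, 10, and 50 processes
    "N6678xM645863": [94, 34, 29, 39],
    "N6678xM3200000": [None, 109, 111, 60],
    "N7000xM7000000": [None, 281, 199, 119],
}

def max_fold_speedup(baseline_sec, candidate_secs):
    """Baseline time divided by the fastest completed candidate run."""
    completed = [t for t in candidate_secs if t is not None]
    return baseline_sec / min(completed)

speedups = {
    case: round(max_fold_speedup(plink_107[case], mplink[case]))
    for case in plink_107
}
```

Note that the fastest configuration is not always the one with the most processes: for the smallest test case, 10 processes (29 s) beat 50 processes (39 s).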
Figure 3: GWAS analysis results. (A) Scatter plot comparison of –log10 P values for a synthetic test case with dimensions N = 6678, M = 645 863, C = 7, P = 1, and no missing values. PLINK 1.07 output was compared to the output of C3; 97% of P values computed by C3 are within a 0.1% relative tolerance of the reference P values from PLINK 1.07. (B) Manhattan plots for a real-world test case from the COPDGene study with the same dimensions as (A). Top panel: all P values as computed by PLINK. P values above the user-set threshold of P = 0.001 are colored red. Bottom panel: second-pass (final) mPLINK P values for markers meeting the P = 0.001 threshold in the first round. A small number of markers fall below the P = 0.001 cutoff due to differences in missing value handling and convergence criteria in C3 versus PLINK-FLR. Compute time was approximately 29 seconds for mPLINK compared to 4.7 hours for PLINK 1.07. (C) Two-way clustering of SNPs and phenotypes according to SNP-phenotype association P values. 164 binary phenotypes from the COPDGene study were associated against each of the M = 645 863 SNPs in the study. Results were filtered to variants that had any logistic association P value <4.81e-9 (i.e., a Bonferroni adjusted P value of 0.05, for N = 645 863 SNPs and P = 164 traits).
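The agreement figure in panel (A) is a fraction of P values falling within a relative tolerance of a reference. As a rough sketch of how such a check might be computed, assuming the tolerance is measured relative to the reference value (the caption does not spell out the exact definition), with a hypothetical helper name and made-up toy values:

```python
def fraction_within_tolerance(reference, candidate, rel_tol=1e-3):
    """Share of candidate values within rel_tol (relative) of the reference."""
    pairs = list(zip(reference, candidate))
    close = sum(1 for r, c in pairs if abs(c - r) <= rel_tol * abs(r))
    return close / len(pairs)

# Toy values (hypothetical), not data from the study.
reference_p = [0.5, 0.01, 1e-6, 0.2]
candidate_p = [0.5002, 0.01, 1.01e-6, 0.2]
agreement = fraction_within_tolerance(reference_p, candidate_p)
```

A relative rather than absolute tolerance matters here because GWAS P values span many orders of magnitude, so a fixed absolute tolerance would be meaningless for the smallest, most interesting values.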