Gad Abraham, Adam Kowalczyk, Justin Zobel, Michael Inouye.
Abstract
BACKGROUND: A central goal of genomics is to predict phenotypic variation from genetic variation. Fitting predictive models to genome-wide and whole-genome single nucleotide polymorphism (SNP) profiles allows us to estimate the predictive power of the SNPs and potentially develop diagnostic models for disease. However, many current datasets cannot be analysed with standard tools due to their large size.
Year: 2012 PMID: 22574887 PMCID: PMC3483007 DOI: 10.1186/1471-2105-13-88
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. SparSNP analysis pipeline. An example pipeline for analysing a SNP discovery dataset with SparSNP and testing the model on a validation dataset. Most of the data preparation and processing can be done with PLINK.
Comparison of the evaluated methods
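| Method | Memory (a) | Speed (b) | Prediction (c) | Fitted largest data (d) | Models (e) | Ease of use (f) |
|---|---|---|---|---|---|---|
| SparSNP | ••• | (1) | (1) | yes | •∘∘∘ | ••••• |
| glmnet | ••∘ | (2) | (1) | no | •∘∘∘ | •••∘∘ |
| HyperLasso | ••• | (5) | (3) | yes | •••∘ | ∘∘∘∘∘ |
| LIBLINEAR | ••∘ | (3) | (2) | no | •∘∘∘ | ••∘∘∘ |
| LIBLINEAR-CDBLOCK | ••• | (4) | (4) | yes | ••∘∘∘ | ••∘∘∘ |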
We evaluated each method in terms of the following criteria:
(a) Memory requirements: maximum GiB required to complete the prediction experiment. Three points: ≤4 GiB, as is commonly available on laptops. Two points: >4 GiB and ≤32 GiB, commonly available on compute servers. One point: >32 GiB, typically available on higher-end servers.
(b) Speed: time to complete in the timing experiments with 50,000 SNPs (Figure 2).
(c) Prediction: best cross-validated AUC in the prediction experiment (Figure 3).
(d) Fitted largest data: whether the tool successfully completed the largest timing experiment, consisting of p = 500,000 SNPs and N = 10,000 samples.
(e) Models: one point for each natively supported model type: (i) additive, (ii) dominant/recessive, (iii) heterozygous, and (iv) interaction models.
(f) Ease of use: one point for each of the following: (i) the tool supports input in formats commonly used in the genetics community, such as PLINK BED or PED files; (ii) the tool implements cross-validation; (iii) the tool estimates the AUC, R², or explained variance from the cross-validation; (iv) the tool produces plots of the resulting AUC, R², or explained variance, for easy model selection and evaluation; and (v) the tool implements native imputation of missing genotypes.
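To give a feel for the thresholds in criterion (a), the following sketch estimates the raw size of a genotype matrix for the largest timing experiment (N = 10,000 samples, p = 500,000 SNPs) and maps a memory figure to the point scale. The function names are illustrative and not part of any of the evaluated tools, and the estimate covers only the genotype data, not a tool's actual peak memory. It assumes 2 bits per genotype for packed storage (as in PLINK BED files) and one 64-bit float per genotype for a dense in-memory matrix.

```python
GIB = 2**30  # bytes per GiB


def genotype_matrix_gib(n_samples, n_snps, bits_per_genotype):
    """Approximate size in GiB of a genotype matrix at a given storage width."""
    return n_samples * n_snps * bits_per_genotype / 8 / GIB


def memory_points(peak_gib):
    """Criterion (a): 3 points for <=4 GiB, 2 for >4 and <=32 GiB, 1 otherwise."""
    if peak_gib <= 4:
        return 3
    if peak_gib <= 32:
        return 2
    return 1


n, p = 10_000, 500_000
packed = genotype_matrix_gib(n, p, 2)   # assumed: 2 bits per genotype (packed)
dense = genotype_matrix_gib(n, p, 64)   # assumed: 64-bit float per genotype

print(f"packed: {packed:.2f} GiB -> {memory_points(packed)} points")
print(f"dense:  {dense:.2f} GiB -> {memory_points(dense)} points")
```

Under these assumptions the packed representation needs roughly 1.2 GiB, while a dense double-precision matrix needs roughly 37 GiB, which is one plausible reason why in-memory tools may fail on the largest dataset.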
Figure 2. Timing experiments. Time (in seconds) for model fitting, over sub-samples of the celiac disease dataset, taken as the minimum time over 10 independent runs. The inset panel shows the results for 50,000 SNPs in more detail; note the different scales. For in-memory methods we included the time to read the data into memory. For SparSNP and glmnet we used a penalty grid of size 20 and a maximum model size of 2048 SNPs. LIBLINEAR (denoted “LL-L1”) and LIBLINEAR-CDBLOCK (denoted “LL-CD-L2”) induced one model with C = 1. LIBLINEAR-CDBLOCK used m = 50 blocks. For some datasets, glmnet and LIBLINEAR did not complete, and these running times are not shown. HyperLasso is not shown since it took much longer to complete than the other methods.
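The "minimum over 10 independent runs" protocol can be sketched as follows. This is a minimal illustration, not the authors' benchmarking code: `min_runtime` and `toy_fit` are hypothetical names, and `toy_fit` merely stands in for a call that loads the data and fits a model.

```python
import time


def min_runtime(fit, n_runs=10):
    """Return the minimum wall-clock time in seconds of fit() over n_runs calls."""
    best = float("inf")
    for _ in range(n_runs):
        start = time.perf_counter()
        fit()
        best = min(best, time.perf_counter() - start)
    return best


def toy_fit():
    # Hypothetical stand-in for reading the data and fitting a model.
    return sum(i * i for i in range(10_000))


elapsed = min_runtime(toy_fit)
print(f"minimum over 10 runs: {elapsed:.6f} s")
```

Taking the minimum rather than the mean reduces the influence of transient system load on the reported time.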
Figure 3. Prediction experiments. LOESS-smoothed AUC and explained phenotypic variance (denoted “VarExp”) for the Finnish celiac disease dataset, for increasing model sizes. AUC is estimated over 20×3-fold cross-validation, except for HyperLasso, for which we ran only 2×3-fold cross-validation due to the high computational cost. The explained phenotypic variance is estimated from the AUC using the method of [11], assuming a population prevalence of celiac disease of K = 1%. Note that glmnet, HyperLasso, LIBLINEAR (denoted “LL-L1”), and SparSNP used an L1-penalised model, whereas LIBLINEAR-CDBLOCK (denoted “LL-CD-L2”) used an L2-penalised (non-sparse) model, inducing a model using all 516,504 SNPs; it is therefore shown as a horizontal line across all model sizes. Note that tuning the penalty for LIBLINEAR-CDBLOCK resulted in very similar AUC.
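As a rough illustration of what "20×3-fold cross-validation" involves, the sketch below generates 20 independent shuffles of a 3-fold split (60 train/test splits in total) and computes the AUC as the probability that a randomly chosen case receives a higher score than a randomly chosen control. The function names, the seed, and the unstratified splitting are assumptions made for illustration; the evaluated tools may split the data differently, and the quadratic-time AUC here is only suitable for small examples.

```python
import random


def auc(scores, labels):
    """AUC as P(case score > control score), counting ties as 1/2."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def repeated_kfold(n, k=3, repeats=20, seed=0):
    """Yield (train_idx, test_idx) for `repeats` independent shuffles into k folds."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for i in range(k):
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            yield train, folds[i]


splits = list(repeated_kfold(12))                 # 20 repeats x 3 folds
example_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
print(len(splits), example_auc)
```

In a full experiment, a model would be fitted on each training index set and the AUC computed on the corresponding test set, giving 60 AUC estimates per model size to average or smooth.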