Elizabeth R Piette, Jason H Moore.
Abstract
BACKGROUND: Machine learning methods and conventions are increasingly employed for the analysis of large, complex biomedical data sets, including genome-wide association studies (GWAS). Reproducibility of machine learning analyses of GWAS can be hampered by biological and statistical factors, particularly in the investigation of non-additive genetic interactions. Applying traditional cross validation to a GWAS data set may result in poor consistency between the training and testing data set splits, because the interaction genotypes in each split can be imbalanced relative to the data as a whole. We propose a new cross validation method, proportional instance cross validation (PICV), that preserves the original distribution of an independent variable when splitting the data set into training and testing partitions.
Keywords: Cross validation; Epistasis; GWAS; Machine learning
Year: 2018 PMID: 29713384 PMCID: PMC5907739 DOI: 10.1186/s13040-018-0167-7
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Data set simulation parameters, prevalence = 0.5
| Scenario | SNP1 MAF | SNP2 MAF | Penetrance, row 1 | Penetrance, row 2 | Penetrance, row 3 |
|---|---|---|---|---|---|
| 1 | 0.1 | 0.1 | 0.493, 0.531, 0.522 | 0.526, 0.387, 0.410 | 0.611, 0.008, 0.358 |
| 2 | 0.2 | 0.1 | 0.507, 0.480, 0.556 | 0.471, 0.590, 0.249 | 0.485, 0.532, 0.482 |
| 3 | 0.2 | 0.2 | 0.514, 0.481, 0.425 | 0.467, 0.544, 0.674 | 0.539, 0.447, 0.304 |
| 4 | 0.3 | 0.1 | 0.513, 0.494, 0.456 | 0.438, 0.530, 0.696 | 0.520, 0.475, 0.506 |
| 5 | 0.3 | 0.2 | 0.488, 0.525, 0.450 | 0.527, 0.455, 0.562 | 0.478, 0.458, 0.814 |
| 6 | 0.3 | 0.3 | 0.481, 0.533, 0.446 | 0.525, 0.468, 0.513 | 0.483, 0.470, 0.734 |
| 7 | 0.4 | 0.1 | 0.484, 0.501, 0.535 | 0.570, 0.494, 0.359 | 0.545, 0.551, 0.245 |
| 8 | 0.4 | 0.2 | 0.490, 0.523, 0.455 | 0.512, 0.468, 0.568 | 0.565, 0.395, 0.668 |
| 9 | 0.4 | 0.3 | 0.502, 0.523, 0.425 | 0.499, 0.472, 0.588 | 0.495, 0.503, 0.501 |
| 10 | 0.4 | 0.4 | 0.476, 0.535, 0.449 | 0.506, 0.473, 0.568 | 0.536, 0.503, 0.410 |
| 11 | 0.5 | 0.1 | 0.306, 0.333, 0.341 | 0.428, 0.314, 0.256 | 0.322, 0.198, 0.595 |
| 12 | 0.5 | 0.2 | 0.476, 0.521, 0.482 | 0.521, 0.472, 0.536 | 0.715, 0.392, 0.502 |
| 13 | 0.5 | 0.3 | 0.500, 0.520, 0.459 | 0.477, 0.480, 0.563 | 0.608, 0.482, 0.429 |
| 14 | 0.5 | 0.4 | 0.422, 0.515, 0.547 | 0.548, 0.491, 0.470 | 0.531, 0.492, 0.485 |
| 15 | 0.5 | 0.5 | 0.440, 0.560, 0.440 | 0.522, 0.484, 0.509 | 0.515, 0.472, 0.542 |
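As a rough illustration of how parameters like these can drive a simulation, the sketch below draws two SNP genotypes and a case/control label for each individual. It is not the authors' simulation procedure. It assumes Hardy-Weinberg genotype frequencies, assumes that the penetrance grid is indexed by the minor-allele counts of the two SNPs (the record does not say which axis belongs to which SNP), and uses hypothetical function names throughout. The values are those listed for Scenario 15 (both MAFs 0.5).

```python
import random

# Penetrance grid listed for Scenario 15 (SNP1 MAF 0.5, SNP2 MAF 0.5).
# ASSUMPTION: rows and columns are indexed by minor-allele count (0, 1, 2).
SCENARIO_15 = [
    [0.440, 0.560, 0.440],
    [0.522, 0.484, 0.509],
    [0.515, 0.472, 0.542],
]

def hwe_frequencies(maf):
    """Genotype frequencies for 0, 1, 2 minor alleles (ASSUMES Hardy-Weinberg)."""
    major = 1.0 - maf
    return [major * major, 2.0 * major * maf, maf * maf]

def expected_prevalence(maf1, maf2, penetrance):
    """Population prevalence implied by the MAFs and the penetrance grid."""
    f1, f2 = hwe_frequencies(maf1), hwe_frequencies(maf2)
    return sum(f1[i] * f2[j] * penetrance[i][j]
               for i in range(3) for j in range(3))

def simulate(n, maf1, maf2, penetrance, seed=0):
    """Return n tuples (snp1, snp2, case) drawn from the assumed model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Each genotype is the sum of two independent allele draws.
        g1 = (rng.random() < maf1) + (rng.random() < maf1)
        g2 = (rng.random() < maf2) + (rng.random() < maf2)
        case = int(rng.random() < penetrance[g1][g2])
        rows.append((g1, g2, case))
    return rows

rows = simulate(2000, 0.5, 0.5, SCENARIO_15)
print(expected_prevalence(0.5, 0.5, SCENARIO_15))  # close to 0.5 for this scenario
```

Under these assumptions the implied prevalence for Scenario 15 comes out close to 0.5. Because both MAFs are equal in that scenario, the result does not depend on which axis of the grid is assigned to which SNP.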
Fig. 1 Comparing traditional cross validation and proportional instance cross validation (PICV). a The overall distribution of the 9 SNP-SNP interaction genotypes (the 9 categories that result from the interaction of two SNPs) in a hypothetical population of individuals. Note: only one possible allocation is depicted. b Traditional cross validation, in which 2/3 of observations are randomly allocated to the training set and the remaining 1/3 to the testing set, can result in draws with imbalanced genotype proportions. c PICV randomly allocates 2/3 of the observations of each genotype to the training set and the remaining 1/3 to the testing set, ensuring that the relative proportions of genotypes are maintained
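The per-genotype allocation described in panel c can be sketched as a stratified split. This is a minimal illustration, not the authors' implementation: the function name `picv_split`, the rounding rule for class sizes that are not divisible by three, and the toy genotype counts are all assumptions.

```python
import random
from collections import defaultdict

def picv_split(genotypes, train_fraction=2 / 3, seed=0):
    """Split indices so each interaction genotype keeps the same train/test ratio.

    genotypes: one hashable label per individual, e.g. a (snp1, snp2) tuple.
    Returns (train_indices, test_indices).
    """
    rng = random.Random(seed)
    by_genotype = defaultdict(list)
    for index, genotype in enumerate(genotypes):
        by_genotype[genotype].append(index)

    train, test = [], []
    for genotype in sorted(by_genotype):
        indices = by_genotype[genotype]
        rng.shuffle(indices)  # random allocation within the genotype class
        cut = round(len(indices) * train_fraction)  # ASSUMED rounding rule
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

# Toy population with one common and two rare interaction genotypes.
genotypes = [(0, 0)] * 90 + [(1, 1)] * 9 + [(2, 2)] * 3
train_idx, test_idx = picv_split(genotypes)
```

With the toy counts above, each genotype contributes two thirds of its observations to the training set, so a rare genotype cannot end up entirely in one partition, which a purely random split does not guarantee.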
Fig. 2 Consistency of training and testing performance measures for models with and without the interaction term, comparing a traditional cross validation procedure to PICV. Experimental scenario in which both SNPs have a MAF of 0.5, n = 2000. PPV: positive predictive value, NPV: negative predictive value, N.S.: not significant
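The four performance measures named in this caption follow from the confusion matrix, and one simple way to express train/test consistency is the absolute difference of each measure between the two partitions. The sketch below uses standard definitions. The helper names and the choice of absolute difference as the consistency summary are assumptions for illustration and are not taken from the paper.

```python
def performance_measures(y_true, y_pred):
    """Sensitivity, specificity, PPV and NPV for binary labels (1 = case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Undefined when a class or prediction is absent from the partition.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }

def train_test_gap(train_measures, test_measures):
    """Absolute train/test difference per measure (ASSUMED consistency summary)."""
    return {name: abs(train_measures[name] - test_measures[name])
            for name in train_measures}

train = performance_measures([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
test = performance_measures([1, 0, 1, 0], [1, 0, 1, 0])
gap = train_test_gap(train, test)
```

A smaller gap for a given measure indicates that the model behaves more similarly on the training and testing partitions.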
Summary of performance measures across minor allele frequency combinations, n = 2000
| Scenario | Sensitivity, without interaction | Sensitivity, with interaction | Specificity, without interaction | Specificity, with interaction | PPV, without interaction | PPV, with interaction | NPV, without interaction | NPV, with interaction |
|---|---|---|---|---|---|---|---|---|
| SNP1 MAF: 0.1, SNP2 MAF: 0.1 | 3.06e-17 | 9.89e-08 | N.S. | N.S. | 1.90e-18 | 7.67e-08 | N.S. | N.S. |
| SNP1 MAF: 0.2, SNP2 MAF: 0.1 | 7.04e-20 | 4.54e-05 | 3.88e-02 | N.S. | 3.68e-11 | 5.56e-06 | 4.35e-02 | 1.89e-02 |
| SNP1 MAF: 0.2, SNP2 MAF: 0.2 | 1.69e-10 | 1.69e-10 | N.S. | 6.87e-03 | 4.06e-09 | 4.06e-09 | N.S. | N.S. |
| SNP1 MAF: 0.3, SNP2 MAF: 0.1 | 1.59e-08 | 2.47e-05 | 4.35e-02 | N.S. | 9.27e-09 | 2.47e-05 | 3.46e-02 | N.S. |
| SNP1 MAF: 0.3, SNP2 MAF: 0.2 | 6.14e-04 | 5.02e-11 | N.S. | N.S. | 3.07e-16 | 1.22e-14 | N.S. | N.S. |
| SNP1 MAF: 0.3, SNP2 MAF: 0.3 | 5.16e-04 | 4.33e-04 | N.S. | N.S. | 1.75e-04 | 1.75e-04 | N.S. | N.S. |
| SNP1 MAF: 0.4, SNP2 MAF: 0.1 | 9.94e-05 | 7.67e-08 | N.S. | N.S. | 3.52e-08 | 5.53e-10 | N.S. | N.S. |
| SNP1 MAF: 0.4, SNP2 MAF: 0.2 | 6.65e-17 | 1.45e-04 | N.S. | N.S. | 5.36e-09 | 2.42e-02 | N.S. | N.S. |
| SNP1 MAF: 0.4, SNP2 MAF: 0.3 | 2.71e-08 | 4.54e-05 | N.S. | N.S. | 8.97e-07 | 4.46e-06 | N.S. | N.S. |
| SNP1 MAF: 0.4, SNP2 MAF: 0.4 | 1.63e-05 | 1.41e-03 | N.S. | N.S. | 2.66e-03 | 8.62e-04 | N.S. | N.S. |
| SNP1 MAF: 0.5, SNP2 MAF: 0.1 | 8.97e-07 | 7.06e-09 | N.S. | N.S. | 2.27e-06 | 1.27e-07 | 4.85e-03 | N.S. |
| SNP1 MAF: 0.5, SNP2 MAF: 0.2 | 9.42e-18 | 6.75e-05 | 1.28e-02 | N.S. | 4.00e-12 | 8.60e-06 | N.S. | N.S. |
| SNP1 MAF: 0.5, SNP2 MAF: 0.3 | 4.38e-07 | 2.47e-05 | N.S. | N.S. | 7.67e-08 | 4.12e-10 | N.S. | 1.46e-02 |
| SNP1 MAF: 0.5, SNP2 MAF: 0.4 | 2.69e-07 | 5.54e-05 | N.S. | N.S. | 7.06e-09 | 8.62e-04 | N.S. | N.S. |
| SNP1 MAF: 0.5, SNP2 MAF: 0.5 | 1.27e-07 | 6.92e-06 | N.S. | 1.54e-02 | 2.89e-12 | 9.27e-09 | N.S. | N.S. |
Number of scenarios for which PICV yielded smaller median and maximum differences between training and testing performance than traditional cross validation
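| Measure, Model | PICV median less than traditional CV median (out of 15), prevalence 0.02 | PICV median less than traditional CV median (out of 15), prevalence 0.1 | PICV median less than traditional CV median (out of 15), prevalence 0.5 | PICV maximum less than traditional CV maximum (out of 15), prevalence 0.02 | PICV maximum less than traditional CV maximum (out of 15), prevalence 0.1 | PICV maximum less than traditional CV maximum (out of 15), prevalence 0.5 |
|---|---|---|---|---|---|---|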
| Specificity, without interaction | 15 | 15 | 12 | 15 | 15 | 15 |
| Specificity, with interaction | 15 | 15 | 15 | 15 | 15 | 15 |
| NPV, without interaction | 14 | 9 | 9 | 11 | 8 | 8 |
| NPV, with interaction | 8 | 9 | 10 | 8 | 9 | 9 |
Interaction analysis summary
| Data set | ALX4 variant | RBFOX1 variant | LRT |
|---|---|---|---|
| eMERGE | rs10838251 | rs653127 | 7.29E-06 |
| NEIGHBOR | rs7126447 | rs11077011 | 1.62E-06 |
| GLAUGEN | rs7126447 | rs11077011 | 0.327 |