Badri Padhukasahasram, Chandan K Reddy, Albert M Levin, Esteban G Burchard, L Keoki Williams.
Abstract
Multi-marker approaches have received considerable attention recently in genome-wide association studies and can enhance power to detect new associations under certain conditions. Gene-, gene-set- and pathway-based association tests are increasingly viewed as useful supplements to the more widely used single-marker association analyses, which have successfully uncovered numerous disease variants. A major drawback of single-marker methods is that they do not consider the joint effects of multiple genetic variants that individually may have weak or moderate signals. Here, we describe novel tests for multi-marker association analyses that are based on phenotype predictions obtained from machine learning algorithms. Instead of assuming a linear or logistic regression model, we propose the use of ensembles of diverse machine learning algorithms for prediction. We show that phenotype predictions obtained from ensemble learning algorithms provide a new framework for multi-marker association analysis. They can be used for constructing tests for the joint association of multiple variants, adjusting for covariates and testing for the presence of interactions. To demonstrate the power and utility of this new approach, we first apply our method to simulated SNP datasets. We show that the proposed method has the correct Type-1 error rates and can be considerably more powerful than alternative approaches in some situations. Then, we apply our method to previously studied asthma-related genes in two independent asthma cohorts to conduct association tests.
Year: 2015 PMID: 26619286 PMCID: PMC4664402 DOI: 10.1371/journal.pone.0143489
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Comparison of empirical power and Type-1 error rates of gene-based association tests for simulated datasets assuming linkage equilibrium.
| Value | #SNP (#DSL) | Logistic Regression | Fisher | Vegas-Sum | Original Simes | Vegas-Max | GATES | SKAT | Machine-Learning Ensemble |
|---|---|---|---|---|---|---|---|---|---|
| Linkage Equilibrium | | | | | | | | | |
| Type-1 Error | 3(0) | 4.66 [3.4–6.0] | 4.67 [3.4–6.0] | 4.70 [3.4–6.2] | 4.61 [3.5–6.1] | 4.62 [3.5–6.1] | 4.61 [3.5–6.1] | 5.15 [4.7–5.6] | 5.94 [5.3–6.6] |
| Type-1 Error | 10(0) | 5.10 [3.8–6.7] | 5.00 [3.7–6.5] | 5.04 [3.8–6.5] | 5.06 [3.8–6.5] | 5.07 [3.8–6.5] | 5.06 [3.8–6.5] | 4.82 [4.4–5.3] | 6.29 [5.6–7.0] |
| Type-1 Error | 30(0) | 5.26 [4.0–6.8] | 4.96 [3.7–6.4] | 4.97 [3.7–6.4] | 4.97 [3.7–6.4] | 5.04 [3.8–6.5] | 4.97 [3.7–6.4] | 4.86 [4.5–5.3] | 4.22 [3.7–4.8] |
| Power Additive | 3(1) | 43.71 [40.7–46.8] | 41.79 [38.7–44.8] | 42.67 [39.6–45.7] | 45.28 [42.2–48.3] | 45.22 [42.2–48.3] | 45.28 [42.2–48.3] | 45.1 [42–48.2] | 56.00 [51.5–60.4] |
| Power Additive | 10(2) | 56.88 [53.8–59.9] | 53.32 [50.3–56.4] | 54.56 [51.5–57.6] | 54.76 [51.7–57.8] | 54.00 [50.9–57.1] | 54.76 [51.7–57.8] | 60.8 [57.7–63.8] | 57.60 [53.1–62] |
| Power Additive | 30(6) | 65.32 [62.4–68.2] | 61.5 [58.4–64.5] | 63.28 [60.2–66.2] | 47.18 [44.1–50.3] | 45.62 [42.6–48.8] | 47.18 [44.1–50.3] | 69.8 [66.8–72.6] | 69.00 [64.7–73.0] |
| Power Multiplicative | 3(1) | 46.61 [43.5–49.8] | 44.72 [41.6–47.8] | 45.54 [42.5–48.7] | 48.39 [45.3–51.5] | 48.3 [45.2–51.5] | 48.39 [45.3–51.5] | 43.3 [40.2–46.4] | 53.00 [48.5–57.5] |
| Power Multiplicative | 10(2) | 69.00 [66.0–71.9] | 65.25 [62.3–68.2] | 66.88 [63.9–69.7] | 67.00 [64.0–69.9] | 66.26 [63.3–69.1] | 67.00 [64.0–69.9] | 70.9 [68–73.7] | 69.00 [64.7–73.0] |
| Power Multiplicative | 30(6) | 93.45 [91.8–94.9] | 91.44 [89.6–93.1] | 92.28 [90.5–93.8] | 82.21 [79.8–84.5] | 80.18 [77.6–82.5] | 82.21 [79.8–84.5] | 94.3 [92.7–95.7] | 94.60 [92.2–96.4] |
DSL denotes the number of disease susceptibility markers. Machine learning test is based on ensemble learning variation 1 with the following components: logistic regression, support vector machine with linear kernel and random forests with mtry = 1 and ntree = 1000.
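To illustrate how cross-validated ensemble predictions could drive a gene-based test of this kind, here is a minimal pure-Python sketch. It rests on several assumptions that are not taken from the paper: the two learners (`fit_marginal_score` and `fit_knn`) are simplified stand-ins rather than logistic regression, a linear-kernel support vector machine or a random forest; the test statistic is the correlation between the phenotype and the averaged out-of-fold predictions; and significance is assessed by refitting the ensemble on permuted phenotypes. All function names and the toy data are hypothetical.

```python
import random
import statistics

def pearson(a, b):
    """Pearson correlation; returns 0.0 when either input is constant."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5 if saa > 0 and sbb > 0 else 0.0

def fit_marginal_score(X, y):
    """Stand-in learner 1: sum of per-SNP simple-regression effects."""
    my = statistics.fmean(y)
    cols = list(zip(*X))
    means = [statistics.fmean(c) for c in cols]
    slopes = []
    for c, m in zip(cols, means):
        var = sum((g - m) ** 2 for g in c)
        cov = sum((g - m) * (t - my) for g, t in zip(c, y))
        slopes.append(cov / var if var > 0 else 0.0)
    return lambda row: my + sum(s * (g - m) for s, g, m in zip(slopes, row, means))

def fit_knn(X, y, k=5):
    """Stand-in learner 2: mean phenotype of the k nearest training genotypes."""
    def predict(row):
        order = sorted(range(len(X)),
                       key=lambda i: sum(abs(a - b) for a, b in zip(X[i], row)))
        return statistics.fmean(y[i] for i in order[:k])
    return predict

def cv_ensemble_stat(X, y, learners, n_folds=5):
    """Correlation between phenotype and averaged out-of-fold predictions.

    Folds are interleaved by index, which assumes samples are in random order.
    """
    n = len(y)
    preds = [0.0] * n
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        X_train, y_train = [X[i] for i in train], [y[i] for i in train]
        models = [fit(X_train, y_train) for fit in learners]
        for i in range(fold, n, n_folds):  # held-out samples of this fold
            preds[i] = statistics.fmean(m(X[i]) for m in models)
    return pearson(preds, y)

def permutation_pvalue(X, y, learners, n_perm=99, seed=0):
    """One-sided permutation p value; the ensemble is refit for every shuffle."""
    rng = random.Random(seed)
    observed = cv_ensemble_stat(X, y, learners)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if cv_ensemble_stat(X, shuffled, learners) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data: 3 SNPs coded 0/1/2, phenotype driven by the first SNP plus noise.
rng = random.Random(1)
X = [[rng.choice([0, 1, 2]) for _ in range(3)] for _ in range(120)]
y = [row[0] + rng.gauss(0, 0.5) for row in X]
p_value = permutation_pvalue(X, y, [fit_marginal_score, fit_knn])
print(p_value)
```

Refitting inside every permutation keeps the null distribution honest, because out-of-fold predictions depend on the phenotypes used for training. The same code accepts a 0/1 case-control phenotype, although a dedicated classifier would be the more natural choice there.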
Comparison of empirical power and Type-1 error rates of gene-based association tests on simulated datasets for strong linkage disequilibrium.
| Value | #SNP (#DSL) | Logistic Regression | Fisher | Vegas-Sum | Original Simes | Vegas-Max | GATES | SKAT | Machine-Learning Ensemble |
|---|---|---|---|---|---|---|---|---|---|
| Linkage Disequilibrium | | | | | | | | | |
| Type-1 Error | 3(0) | 4.96 [3.73–6.43] | 11.49 [9.6–13.5] | 5.23 [3.99–6.76] | 3.88 [2.8–5.2] | 5.22 [4.0–6.8] | 5.35 [4.1–6.9] | 4.86 [4.5–5.3] | 6.05 [5.4–6.7] |
| Type-1 Error | 10(0) | 5.33 [4.08–6.88] | 15.68 [13.5–18.0] | 4.84 [3.7–6.3] | 3.37 [2.4–4.6] | 4.88 [3.7–6.3] | 5.34 [4.1–6.9] | 5.03 [4.6–5.5] | 5.05 [4.5–5.7] |
| Type-1 Error | 30(0) | 5.57 [4.26–7.10] | 17.9 [15.6–20.4] | 4.89 [3.7–6.3] | 3.38 [2.4–4.6] | 4.89 [3.7–6.3] | 5.64 [4.4–7.2] | 5.04 [4.6–5.5] | 3.78 [3.3–4.4] |
| Power Additive | 3(1) | 45.03 [42–48.1] | --- | 58.81 [55.8–61.9] | 53.88 [50.8–56.9] | 58.2 [55.1–61.3] | 60.43 [57.4–63.5] | 57.1 [54–60.2] | 61.00 [56.6–65.3] |
| Power Additive | 10(2) | 57.20 [54.1–60.3] | --- | 75.74 [73–78.3] | 66.39 [63.4–69.2] | 71.71 [68.9–74.5] | 74.3 [71.5–77] | 77.9 [75.2–80.4] | 59.00 [54.6–63.4] |
| Power Additive | 30(6) | 65.56 [62.6–68.5] | --- | 86.3 [84–88.4] | 62.84 [59.8–65.8] | 66.80 [63.8–69.7] | 72.75 [69.9–75.4] | 86.0 [83.7–88.1] | 65.80 [61.5–70] |
| Power Multiplicative | 3(1) | 47.13 [44.1–50.3] | --- | 60.88 [57.8–63.8] | 56.28 [53.2–59.3] | 60.74 [57.7–63.7] | 62.77 [59.7–65.7] | 59.7 [56.6–62.8] | 65.00 [60.6–69.2] |
| Power Multiplicative | 10(2) | 68.45 [65.5–71.3] | --- | 84.89 [82.5–87] | 77.14 [74.5–79.7] | 80.59 [78–82.9] | 83.00 [80.5–85.3] | 88.1 [85.9–90] | 74.40 [70.3–78.2] |
| Power Multiplicative | 30(6) | 93.4 [91.7–94.9] | --- | 99.2 [98.4–99.7] | 91.42 [89.6–93.1] | 92.24 [90.5–93.8] | 95.38 [93.9–96.5] | 98.8 [97.9–99.4] | 94.00 [91.5–95.9] |
DSL denotes the number of disease susceptibility markers. Machine learning test is based on ensemble learning variation 1 with the following components: logistic regression, support vector machine with linear kernel and random forests with mtry = 1 and ntree = 1000.
Comparison of empirical power and Type-1 error rates of gene-based association tests for a quantitative trait simulated under models with interactions.
| Value | Phenotype distribution | #SNP (#TAS) | GATES | Linear Regression | SKAT | Machine Learning |
|---|---|---|---|---|---|---|
| Type-1 error | | 5(0) | 4.68 [4.11–5.30] | 5.12 [4.53–5.77] | 4.96 [4.37–5.60] | 5.02 [4.43–5.66] |
| Type-1 error | | 10(0) | 5.02 [4.43–5.66] | 5.02 [4.43–5.66] | 4.84 [4.26–5.47] | 4.22 [3.68–4.81] |
| Power | | 10(4) | 4.0 [2.46–6.11] | 6.6 [4.59–9.14] | 5.4 [3.59–7.76] | 9.0 [6.64–11.86] |
| Power | | 5(4) | 99.6 [98.56–99.95] | 99.6 [98.56–99.95] | 100 [99.26–100] | 97.4 [95.59–98.61] |
| Power | | 5(3) | 8.2 [5.95–10.96] | 9.0 [6.64–11.86] | 17.6 [14.36–21.23] | 34.2 [30.05–38.54] |
| Power | | 5(3) | 8.6 [6.29–11.41] | 8.6 [6.29–11.41] | 18.2 [14.91–21.87] | 42.0 [37.63–46.46] |
| Power | | 5(3) | 8.8 [6.47–11.63] | 9.8 [7.34–12.75] | 26.4 [22.59–30.50] | 55.4 [50.92–59.81] |
| Power | | 10(6) | 7.0 [4.92–9.60] | 5.6 [3.75–7.99] | 7.4 [5.26–10.06] | 5.6 [3.75–7.99] |
| Power | | 5(4) | 98.2 [96.61–99.17] | 98.4 [96.87–99.31] | 98.6 [97.14–99.44] | 92.4 [89.72–94.57] |
| Power | | 5(4) | 99.8 [98.89–99.99] | 100.0 [99.26–100] | 100.0 [99.26–100] | 100.0 [99.26–100] |
TAS denotes the number of trait associated SNPs. Machine learning test is based on ensemble learning variation 1 with the following components: multiple linear regression, support vector machine with linear kernel and random forests with mtry = 1 and ntree = 1000.
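The bracketed ranges in these tables are not defined in this excerpt. The power entries of 100 with a lower bound of 99.26 are consistent with an exact (Clopper-Pearson) 95% binomial interval computed on 500 simulation replicates, but both the method and the replicate count are inferences, not statements from the source. Under that assumption, a minimal standard-library sketch of the calculation (all function names are hypothetical) could look like this:

```python
import math

def log_pmf(k, n, p):
    """Log probability of exactly k successes for Binomial(n, p), 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def tail(first, last, n, p):
    """P(first <= X <= last) for X ~ Binomial(n, p)."""
    return sum(math.exp(log_pmf(k, n, p)) for k in range(first, last + 1))

def solve(func, target, increasing):
    """Bisection on (0, 1) for a monotone func, returning p with func(p) = target."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_interval(successes, n, alpha=0.05):
    """Clopper-Pearson interval for a binomial proportion, as fractions."""
    # Lower bound: the p at which P(X >= successes) equals alpha / 2.
    lower = 0.0 if successes == 0 else solve(
        lambda p: tail(successes, n, n, p), alpha / 2, increasing=True)
    # Upper bound: the p at which P(X <= successes) equals alpha / 2.
    upper = 1.0 if successes == n else solve(
        lambda p: tail(0, successes, n, p), alpha / 2, increasing=False)
    return lower, upper

# Assumed example: all 500 of 500 replicates reject the null hypothesis.
low, high = exact_interval(500, 500)
print(f"{100 * low:.2f}-{100 * high:.0f}")  # prints 99.26-100
```

When every replicate rejects, the lower bound has the closed form (alpha/2)^(1/n), which gives 0.025^(1/500) ≈ 0.9926 here. For Type-1 error rows, a rate is usually read as well calibrated when its interval contains the nominal level being tested.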
Empirical power and Type-1 error rate of a gene-based test of interactions for a simulated quantitative trait.
| Value | Phenotype distribution | #SNP (#TAS) | Machine learning ensemble |
|---|---|---|---|
| Type-1 error | | 5(0) | 6.10 [4.70–7.77] |
| Type-1 error | | 10(0) | 5.10 [3.82–6.65] |
| Power | | 10(4) | 14.8 [12.66–17.15] |
| Power | | 5(4) | 56.3 [53.16–59.40] |
| Power | | 5(3) | 30.8 [27.95–33.76] |
| Power | | 5(3) | 41.0 [37.93–44.12] |
| Power | | 5(3) | 59.4 [56.28–62.46] |
| Power | | 10(6) | 14.7 [12.56–17.05] |
| Power | | 5(4) | 94.4 [92.79–95.74] |
| Power | | 5(4) | 95.8 [94.36–96.96] |
TAS denotes the number of trait associated SNPs. Machine learning test is based on ensemble learning variation 1 with the following components: multiple linear regression, support vector machine with linear kernel and random forests with mtry = 1 and ntree = 1000.
Gene-based p values for previously reported asthma-related genes in 1,427 African American individuals from the SAPPHIRE cohort.
| Chromosome | Gene | Length in base pairs | Number of SNPs tested | Gene-based | Gene-based | Gene-based |
|---|---|---|---|---|---|---|
| 1 | | 45513 | 13 | 0.198 | 0.230 | 0.130 |
| 2 | | 7466 | 6 | 0.982 | 0.982 | 0.832 |
| 5 | | 6333 | 5 | 0.063 | 0.064 | 0.533 |
| 9 | | 42198 | 12 | 0.408 | 0.180 | 0.130 |
| 17 | | 14056 | 15 | 0.401 | 0.401 | 0.870 |
| 5 | | 2937 | 3 | 0.156 | 0.164 | 0.387 |
| 15 | | 57175 | 23 | 0.323 | 0.323 | 0.359 |
| 5 | | 25906 | 15 | 0.0095 | 0.162 | 0.0076 |
| 5 | | 87698 | 34 | 0.010 | 0.010 | 0.367 |
Gene-based p values for previously reported asthma-related genes in 3,772 Latino individuals from the GALA study.
| Chromosome | Gene | Length in base pairs | Number of SNPs tested | Gene-based | Gene-based | Gene-based |
|---|---|---|---|---|---|---|
| 1 | | 45513 | 15 | 0.320 | 0.320 | 0.530 |
| 2 | | 7466 | 16 | 0.038 | 0.038 | 0.046 |
| 5 | | 6333 | 7 | 0.270 | 0.270 | 0.250 |
| 9 | | 42198 | 14 | 0.0014 | 0.095 | 0.069 |
| 17 | | 14056 | 13 | 2.33E-09 | 4.20E-08 | 6.24E-11 |
| 5 | | 2937 | 10 | 0.280 | 0.280 | 0.100 |
| 15 | | 57175 | 28 | 0.464 | 0.464 | 0.063 |
| 5 | | 25906 | 12 | 0.838 | 0.838 | 0.956 |
| 5 | | 87698 | 33 | 0.217 | 0.217 | 0.050 |
Comparison of empirical power and Type-1 error rates of gene-based association tests in simulated datasets for moderate linkage disequilibrium.
| Value | #SNP (#DSL) | Logistic Regression | Fisher | Vegas-Sum | Original Simes | Vegas-Max | GATES | SKAT | Machine-Learning Ensemble |
|---|---|---|---|---|---|---|---|---|---|
| Linkage Disequilibrium | | | | | | | | | |
| Type-1 Error | 3(0) | 4.86 [3.7–6.3] | 7.17 [5.7–8.9] | 4.91 [3.7–6.4] | 4.54 [3.4–6.0] | 4.81 [3.7–6.3] | 4.98 [3.7–6.4] | 4.71 [4.3–5.1] | 6.02 [5.4–6.7] |
| Type-1 Error | 10(0) | 4.88 [3.7–6.3] | 9.8 [8.0–11.8] | 4.83 [3.7–6.3] | 4.55 [3.4–6.0] | 4.92 [3.7–6.4] | 5.00 [3.7–6.5] | 4.70 [4.3–5.1] | 6.16 [5.5–6.9] |
| Type-1 Error | 30(0) | 5.63 [4.4–7.2] | 11.09 [9.2–13.1] | 5.03 [3.8–6.5] | 4.97 [3.7–6.4] | 5.29 [4.0–6.8] | 5.56 [4.3–7.1] | 5.05 [4.6–5.5] | 3.80 [3.3–4.4] |
| Power Additive | 3(1) | 44.59 [41.5–47.6] | --- | 49.36 [46.3–52.5] | 49.71 [46.7–52.9] | 50.51 [47.5–53.6] | 51.23 [48.2–54.3] | 46.9 [43.8–50.1] | 55.20 [50.7–59.6] |
| Power Additive | 10(2) | 56.25 [53.2–59.3] | --- | 61.36 [58.3–64.3] | 58.39 [55.3–61.4] | 59.12 [56.1–62.2] | 60.72 [57.7–63.7] | 64.2 [61.1–67.2] | 63.80 [59.4–68.0] |
| Power Additive | 30(6) | 65.47 [62.5–68.4] | --- | 71.96 [69.1–74.7] | 53.29 [50.2–56.3] | 52.24 [49.2–55.3] | 55.65 [52.6–58.7] | 74.3 [71.5–77] | 68.00 [63.7–72.1] |
| Power Multiplicative | 3(1) | 46.52 [43.5–49.7] | --- | 50.98 [47.9–54.0] | 51.19 [48.1–54.2] | 52.00 [48.9–55.1] | 52.65 [49.6–55.7] | 48.0 [44.9–51.2] | 53.40 [48.9–57.8] |
| Power Multiplicative | 10(2) | 68.42 [65.5–71.3] | --- | 72.48 [69.6–75.2] | 70.66 [67.8–73.4] | 70.9 [68.0–73.7] | 72.4 [69.5–75.2] | 75.8 [73.0–78.4] | 70.20 [66–74.2] |
| Power Multiplicative | 30(6) | 93.68 [92.0–95.0] | --- | 95.59 [94.1–96.7] | 86.07 [83.8–88.1] | 84.34 [82.0–86.5] | 87.52 [85.4–89.5] | 94.7 [93.1–96.0] | 94.70 [92.5–96.4] |
DSL denotes the number of disease susceptibility markers. Machine learning test is based on ensemble learning variation 1 with the following components: logistic regression, support vector machine with linear kernel and random forests with mtry = 1 and ntree = 1000.