Nuno Sepúlveda, Alphaxard Manjurano, Chris Drakeley, Taane G Clark.
Abstract
Multiple imputation based on chained equations (MICE) is an alternative method for imputing missing genotypes that can use genetic and nongenetic auxiliary data to inform the imputation process. Previously, MICE was successfully tested on strongly linked genetic data. We have now tested it on data on the HBA2 gene which, owing to the experimental design used in a malaria association study in Tanzania, shows a high percentage of missing data and is weakly linked with the remaining genetic markers in the data set. We constructed different imputation models and studied their performance under different missing data conditions. Overall, MICE failed to accurately predict the true genotypes. However, using the best imputation model for the data, we obtained unbiased estimates for the genetic effects and association signals of the HBA2 gene on malaria positivity. When the whole data set was analyzed with the same imputation model, the association signal increased from 0.80 before imputation to 2.70 after imputation. Conversely, postimputation estimates for the genetic effects remained the same as in the complete case analysis but showed increased precision. We argue that these postimputation estimates are reasonably unbiased, as a result of a good study design based on matching key socio-environmental factors.
Keywords: Genotype imputation; HBA2 gene; malaria positivity; multiple imputation based on chained equations
Year: 2014 PMID: 24942080 PMCID: PMC4140543 DOI: 10.1111/ahg.12065
Source DB: PubMed Journal: Ann Hum Genet ISSN: 0003-4800 Impact factor: 1.670
Background information of the 13 study sites where α3.7-globin genotyping was attempted
| Transect (region), village | Altitude, m | Sample size, n | Major ethnic group, % | Females, % | Malaria parasite positivity, % | Mild anemia prevalence, % | 0 deletions, n (%) | 1 deletion, n (%) | 2 deletions, n (%) | Missing genotype, n (%) | Deletion frequency | HWE p-value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kilimanjaro (Kilimanjaro) | ||||||||||||
| Mokala | 1702 | 378 | Wachaga (98.7) | 61.9 | 4.5 | 14.8 | 154 (84.6) | 27 (14.8) | 1 (0.5) | 196 (51.9) | 0.080 | 0.387 |
| Machame Aleni | 1421 | 242 | Wachaga (99.6) | 54.5 | 1.7 | 9.0 | 168 (87.0) | 25 (13.0) | 0 (0.0) | 49 (20.2) | 0.065 | 0.188 |
| Ikuini | 1160 | 318 | Wachaga (98.1) | 60.3 | 10.7 | 19.9 | 157 (82.6) | 32 (16.8) | 1 (0.5) | 128 (40.3) | 0.089 | 0.235 |
| Kileo | 723 | 242 | Wapare (84.3) | 61.2 | 6.6 | 22.8 | 175 (75.1) | 53 (22.7) | 5 (2.1) | 9 (3.7) | 0.135 | 0.221 |
| South Pare (Kilimanjaro) | ||||||||||||
| Bwambo | 1598 | 375 | Wapare (98.4) | 57.3 | 3.2 | 20.5 | 178 (81.7) | 37 (17.0) | 3 (1.4) | 157 (41.9) | 0.099 | 0.504 |
| Mpinji | 1445 | 361 | Wapare (95.0) | 59.0 | 2.8 | 18.3 | 175 (75.1) | 55 (23.6) | 3 (1.3) | 128 (35.5) | 0.131 | 0.074 |
| Goha | 1162 | 389 | Wapare (95.6) | 60.3 | 10.9 | 20.1 | 172 (72.3) | 62 (26.1) | 4 (1.7) | 151 (38.8) | 0.147 | 0.047 |
| Kadando | 528 | 381 | Wapare (70.5) | 59.5 | 23.9 | 34.7 | 136 (57.9) | 86 (36.6) | 13 (5.5) | 146 (38.3) | 0.238 | 0.008 |
| West Usambara (Tanga) | ||||||||||||
| Kwadoe | 1523 | 404 | Wasambaa (94.1) | 61.6 | 7.7 | 33.4 | 166 (77.6) | 46 (21.5) | 2 (0.9) | 190 (47.0) | 0.117 | 0.106 |
| Funta | 1279 | 303 | Wasambaa (97.0) | 67.0 | 24.1 | 42.6 | 129 (61.1) | 72 (34.1) | 10 (4.7) | 92 (30.4) | 0.218 | 0.025 |
| Tamota | 1176 | 403 | Wasambaa (93.5) | 54.1 | 24.8 | 43.4 | 130 (58.6) | 86 (38.7) | 6 (2.7) | 181 (44.9) | 0.221 | <0.001 |
| Mgila | 432 | 382 | Wasambaa (67.7) | 69.9 | 38.9 | 51.0 | 125 (54.8) | 92 (40.4) | 11 (4.8) | 154 (40.3) | 0.250 | 0.001 |
| Tanga Coast (Tanga) | ||||||||||||
| Mgome | 196 | 236 | Other (86.3) | 53.6 | 48.9 | 44.1 | 100 (43.7) | 105 (45.9) | 24 (10.5) | 7 (3.0) | 0.334 | <0.001 |
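The deletion frequency column can be recomputed from the genotype counts as (n1 + 2·n2) / (2·N), where N is the number of genotyped individuals. The sketch below does this for three rows of the table and also computes an asymptotic Pearson goodness-of-fit statistic for Hardy-Weinberg equilibrium (HWE). The function names are illustrative, and the chi-square test is an assumption: the source does not say which HWE test produced the reported p-values, so the p-values from this sketch are not expected to match the table.

```python
import math


def deletion_frequency(n0, n1, n2):
    """Frequency of the deletion allele from counts of 0, 1 and 2 deletions."""
    return (n1 + 2 * n2) / (2 * (n0 + n1 + n2))


def hwe_chi_square(n0, n1, n2):
    """Pearson goodness-of-fit statistic and p-value (1 df) for HWE.

    Assumption: asymptotic chi-square test; the table's p-values may come
    from a different (for example exact) test.
    """
    n = n0 + n1 + n2
    q = deletion_frequency(n0, n1, n2)
    p = 1.0 - q
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n0, n1, n2)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Genotype counts copied from the Mokala, Kadando and Mgome rows.
villages = {
    "Mokala": (154, 27, 1),
    "Kadando": (136, 86, 13),
    "Mgome": (100, 105, 24),
}
for name, counts in villages.items():
    print(name, round(deletion_frequency(*counts), 3))
```

The three printed frequencies agree with the deletion frequency column (0.080, 0.238 and 0.334).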
Association analysis between the number of α3.7-globin deletions and malaria parasite positivity using complete case data. “Log-likelihood” refers to the maximum value of the log-likelihood function after maximum likelihood parameter estimation. Association analysis was performed adjusting or not for putative confounders (age, gender, ethnicity, altitude, and transect). Association signals refer to −log10(p-value), where p-value is from Pearson's χ2 test for two-way contingency tables in the unadjusted analysis, and from the likelihood ratio test for lack of genetic association in the adjusted analysis
| Parameter | Unadjusted analysis: estimate | Unadjusted analysis: SE | Adjusted analysis: estimate | Adjusted analysis: SE |
|---|---|---|---|---|
| Average no. of deletions | 0.337 | 0.010 | − | − |
| λ1 | 0.408 | 0.121 | −0.225 | 0.138 |
| λ2 | 1.165 | 0.250 | 0.179 | 0.283 |
| Log-likelihood | −1069.16 | − | −876.99 | − |
| Association signal | 5.71 | − | 0.80 | − |
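The association signal is −log10(p-value), so the two signals in the table can be converted back to p-values directly. The sketch below shows that conversion together with a Pearson χ2 statistic for a two-way table. It assumes that both tests have 2 degrees of freedom (three genotype classes by two outcomes, or λ1 = λ2 = 0 under the null), in which case the chi-square survival function is exactly exp(−x/2). All function names are illustrative and not taken from the source.

```python
import math


def signal_from_p(p_value):
    """Association signal as defined in the caption: -log10(p-value)."""
    return -math.log10(p_value)


def p_from_signal(signal):
    """Inverse of signal_from_p."""
    return 10.0 ** (-signal)


def chi2_sf_2df(stat):
    """Survival function of a chi-square variable with 2 df (assumed df)."""
    return math.exp(-stat / 2.0)


def pearson_chi_square(table):
    """Pearson statistic for a two-way contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Signals reported in the table above.
print(p_from_signal(5.71))  # unadjusted analysis, roughly 2e-6
print(p_from_signal(0.80))  # adjusted analysis, roughly 0.16

# Under the 2-df assumption, a signal of 0.80 corresponds to a
# likelihood ratio statistic of 2 * ln(10) * 0.80.
print(2 * math.log(10) * 0.80)
```

Under these assumptions the adjusted signal of 0.80 corresponds to a p-value of about 0.16, which is why the complete case analysis shows no clear evidence of association after adjustment.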
Figure 1. (A) Testing a missing completely at random (MCAR) hypothesis under the basic assumption of the missing at random (MAR) model for the data resulting from the cross-tabulation of the number of α3.7-globin deletions with malaria parasite positivity. Each dot represents the p-value for the corresponding likelihood ratio test. The horizontal dotted line marks the 5% significance level. In this analysis, we accepted the MCAR hypothesis for data from villages where the p-value was >0.05. Rejection of MCAR led to the acceptance of an MAR mechanism. (B) Association analysis between α3.7-globin deletions and different variables (phenotypes at the left, SNPs at the centre, and socioenvironmental factors at the right) using complete data. The association signal is expressed as −log10(p-value) for the corresponding association test: χ2 tests for categorical explanatory variables (SNPs, low Hb, anemia, parasite positivity, gender, transect, village, and ethnicity) and score tests for quantitative explanatory variables (Hb levels, parasite density, age, and altitude) using a three-category logistic regression framework. The horizontal dashed line marks −log10(0.001), corresponding to a 0.1% significance level.
Genotype-based performance of different imputation models: IM0 refers to imputation carried out using the observed frequencies of α3.7-globin deletions, IM1 includes four SNPs as imputation covariates (rs1800629, rs3211938, rs334, and rs542998), IM2 includes eight phenotypes and socio-environmental factors (Hb, mild anemia, malaria parasite positivity, transect, altitude, and ethnicity), and IM3 includes all variables in IM1 and IM2
| Missing data scenario | IM0: genotype error rate (range), % | IM0: average no. of deletions (range) | IM1: genotype error rate (range), % | IM1: average no. of deletions (range) | IM2: genotype error rate (range), % | IM2: average no. of deletions (range) | IM3: genotype error rate (range), % | IM3: average no. of deletions (range) |
|---|---|---|---|---|---|---|---|---|
| Missing completely at random | ||||||||
| Pmiss = 10% | 44.1 (41.1–47.0) | 0.34 (0.33–0.35) | 44.1 (41.1–47.3) | 0.337 (0.33–0.35) | 44.3 (40.8–47.4) | 0.34 (0.33–0.35) | 44.1 (41.0–47.5) | 0.34 (0.33–0.35) |
| Pmiss = 25% | 44.3 (41.9–46.1) | 0.34 (0.33–0.35) | 44.2 (41.7–46.3) | 0.338 (0.33–0.35) | 44.3 (41.9–46.3) | 0.34 (0.32–0.35) | 44.4 (41.8–46.9) | 0.34 (0.32–0.35) |
| Pmiss = 50% | 44.2 (42.8–45.8) | 0.34 (0.31–0.37) | 44.1 (42.5–45.9) | 0.337 (0.31–0.37) | 44.2 (42.5–45.9) | 0.34 (0.31–0.37) | 44.3 (42.6–45.9) | 0.34 (0.31–0.37) |
| Missing data from one village | ||||||||
| Kilimanjaro | ||||||||
| Mokala | 37.7 (28.9–47.4) | 0.35 (0.34–0.36) | 36.4 (27.7–47.4) | 0.38 (0.33–0.36) | 26.8 (20.8–32.9) | 0.35 (0.34–0.35) | 26.5 (19.1–35.8) | 0.32 (0.32–0.33) |
| Machame | 37.1 (28.6–46.9) | 0.35 (0.35–0.36) | 37.1 (28.6–46.3) | 0.36 (0.36–0.38) | 27.6 (21.1–34.9) | 0.38 (0.37–0.38) | 26.8 (18.3–34.3) | 0.36 (0.36–0.37) |
| Ikuini | 38.6 (27.7–45.1) | 0.35 (0.34–0.36) | 36.9 (27.7–48.9) | 0.35 (0.34–0.36) | 30.5 (30.5–22.8) | 0.32 (0.32–0.33) | 29.8 (22.8–38.6) | 0.33 (0.33–0.34) |
| Kileo | 41.8 (34.6–49.6) | 0.34 (0.34–0.35) | 40.9 (32.0–47.4) | 0.34 (0.33–0.36) | 38.7 (28.1–53.5) | 0.34 (0.32–0.38) | 36.1 (28.1–49.6) | 0.35 (0.34–0.38) |
| South Pare | ||||||||
| Bwambo | 39.1 (32.2–45.8) | 0.35 (0.34–0.36) | 37.3 (29.9–44.4) | 0.35 (0.34–0.36) | 32.2 (25.7–40.7) | 0.34 (0.33–0.35) | 31.8 (24.3–39.7) | 0.34 (0.33–0.35) |
| Mpinji | 41.5 (32.5–49.6) | 0.34 (0.34–0.36) | 41.0 (31.6–47.8) | 0.35 (0.34–0.37) | 34.6 (26.8–40.8) | 0.34 (0.34–0.35) | 37.4 (26.3–45.6) | 0.33 (0.33–0.33) |
| Goha | 43.0 (35.7–48.7) | 0.34 (0.33–0.35) | 42.8 (36.6–49.1) | 0.34 (0.33–0.35) | 39.4 (31.7–46.0) | 0.32 (0.32–0.33) | 40.5 (32.1–46.4) | 0.33 (0.33–0.34) |
| Kadando | 49.6 (41.9–59.0) | 0.33 (0.32–0.33) | 49.2 (40.1–55.8) | 0.34 (0.33–0.35) | 52.0 (45.2–60.4) | 0.33 (0.32–0.34) | 51.6 (45.2–59.4) | 0.34 (0.33–0.35) |
| West Usambara | ||||||||
| Kwadoe | 40.1 (31.5–46.9) | 0.35 (0.34–0.35) | 40.2 (33.8–48.4) | 0.35 (0.34–0.37) | 39.0 (31.9–46.5) | 0.37 (0.36–0.38) | 38.9 (30.5–45.1) | 0.36 (0.36–0.37) |
| Funta | 47.6 (41.3–56.8) | 0.33 (0.32–0.34) | 48.6 (41.3–56.8) | 0.34 (0.33–0.35) | 47.3 (38.8–52.9) | 0.33 (0.32–0.33) | 48.3 (39.8–56.3) | 0.33 (0.33–0.34) |
| Tamota | 48.8 (41.4–55.8) | 0.33 (0.32–0.34) | 48.1 (41.4–55.3) | 0.34 (0.34–0.36) | 48.4 (43.3–56.7) | 0.34 (0.33–0.34) | 48.7 (40.9–58.1) | 0.36 (0.36–0.37) |
| Mgila | 51.3 (43.8–58.1) | 0.32 (0.32–0.33) | 50.7 (43.8–56.7) | 0.35 (0.35–0.37) | 53.3 (43.3–60.6) | 0.39 (0.38–0.40) | 55.0 (46.8–63.1) | 0.37 (0.36–0.39) |
| Tanga coast | ||||||||
| Mgome | 57.3 (50.2–66.7) | 0.31 (0.30–0.32) | 55.9 (48.9–64.4) | 0.33 (0.32–0.35) | 56.6 (47.0–67.1) | 0.35 (0.32–0.40) | 55.2 (46.6–65.8) | 0.35 (0.32–0.42) |
Results based on 100 MCAR data sets in which each data set was analyzed by MICE using 25 imputed data sets generated from chains of 25 iterations and random initial conditions.
Results based on 100 imputed data sets generated by MICE using chains of 25 iterations and random initial conditions.
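The IM0 error rates under MCAR can be sanity-checked with a short calculation. If a missing genotype is imputed by a random draw from the observed genotype frequencies, and the true genotype follows the same frequencies independently of the draw, the expected error rate is 1 − Σ p_g². The sketch below applies this to the genotype counts of the 13 study sites. The independence assumption and the function names are ours, and the counts come from the study-site table, which may not be exactly the sample used for the simulations.

```python
# Observed genotype counts (0, 1, 2 deletions) per village, copied from
# the study-site table.
GENOTYPE_COUNTS = [
    (154, 27, 1), (168, 25, 0), (157, 32, 1), (175, 53, 5),
    (178, 37, 3), (175, 55, 3), (172, 62, 4), (136, 86, 13),
    (166, 46, 2), (129, 72, 10), (130, 86, 6), (125, 92, 11),
    (100, 105, 24),
]


def pooled_counts(rows):
    """Sum the genotype counts over all villages."""
    return tuple(sum(column) for column in zip(*rows))


def expected_error_rate(counts):
    """Expected mismatch rate when truth and imputed value are independent
    draws from the same genotype frequencies: 1 - sum(p_g ** 2)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def mean_deletions(counts):
    """Average number of deletions per individual."""
    n = sum(counts)
    return sum(g * c for g, c in enumerate(counts)) / n


totals = pooled_counts(GENOTYPE_COUNTS)
print(totals)
print(round(100 * expected_error_rate(totals), 1))  # in percent
print(round(mean_deletions(totals), 3))
```

Under these assumptions the expected error rate is about 44% and the average number of deletions about 0.33, in line with the IM0 rows for the MCAR scenarios.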
Performance of different imputation models in terms of genetic effect estimation: IM0 refers to imputation carried out using the observed frequencies of the α3.7-globin deletions, IM1 includes four SNPs as imputation covariates (rs1800629, rs3211938, rs334, and rs542998), IM2 includes eight phenotypes and socio-environmental factors (Hb, mild anemia, malaria parasite positivity, transect, altitude, and ethnicity), and IM3 includes all variables in IM1 and IM2
| Missing data scenario | IM0: log-likelihood bias (%) | IM0: λ1 estimation bias (CI coverage, %) | IM0: λ2 estimation bias (CI coverage, %) | IM1: log-likelihood bias (%) | IM1: λ1 estimation bias (CI coverage, %) | IM1: λ2 estimation bias (CI coverage, %) | IM2: log-likelihood bias (%) | IM2: λ1 estimation bias (CI coverage, %) | IM2: λ2 estimation bias (CI coverage, %) | IM3: log-likelihood bias (%) | IM3: λ1 estimation bias (CI coverage, %) | IM3: λ2 estimation bias (CI coverage, %) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Missing completely at random | ||||||||||||
| Pmiss = 10% | −8.80 (1.00) | 0.04 (100) | −0.06 (100) | −8.80 (1.00) | 0.03 (100) | −0.06 (100) | −1.16 (0.13) | 0.10 (100) | −0.07 (100) | −1.14 (0.13) | 0.10 (100) | −0.08 (100) |
| Pmiss = 25% | −8.90 (1.02) | 0.07 (100) | −0.04 (100) | −8.95 (1.02) | 0.07 (100) | −0.06 (100) | −1.02 (0.12) | 0.11 (100) | −0.09 (100) | −1.00 (0.11) | 0.10 (100) | −0.10 (100) |
| Pmiss = 50% | −9.12 (1.04) | 0.12 (100) | −0.09 (100) | −9.16 (1.05) | 0.12 (100) | 0.13 (100) | −0.93 (0.11) | 0.13 (100) | −0.15 (100) | −0.93 (0.11) | 0.13 (100) | −0.16 (100) |
| Missing data from one village | ||||||||||||
| Kilimanjaro | ||||||||||||
| Mokala | −0.19 (0.02) | 0.02 (100) | 0.01 (100) | −1.47 (0.17) | 0.24 (80) | 0.13 (100) | 7.83 (0.89) | −0.31 (8) | 0.41 (100) | 2.63 (0.30) | 0.19 (100) | 0.64 (0) |
| Machame | −0.06 (0.01) | 0.01 (100) | <−0.01 (100) | −0.90 (0.11) | 0.06 (100) | −0.08 (100) | 4.39 (0.50) | −0.16 (100) | 0.25 (100) | 1.62 (0.19) | −0.02 (100) | 0.20 (100) |
| Ikuini | 0.21 (0.02) | −0.02 (100) | −0.01 (100) | −0.80 (0.09) | 0.07 (100) | −0.01 (100) | 1.20 (0.13) | −0.09 (100) | −0.05 (100) | 0.51 (0.06) | −0.02 (100) | 0.07 (100) |
| Kileo | 0.46 (−0.05) | −0.05 (100) | −0.07 (100) | −0.19 (0.02) | 0.00 (100) | −0.05 (100) | 2.71 (0.31) | −0.15 (100) | 0.05 (100) | 1.98 (0.23) | −0.07 (100) | 0.14 (100) |
| South Pare | ||||||||||||
| Bwambo | −0.30 (0.03) | 0.01 (100) | −0.07 (100) | −0.30 (0.03) | 0.01 (100) | −0.07 (100) | −0.26 (0.03) | 0.01 (100) | −0.06 (100) | −0.25 (0.02) | 0.01 (100) | −0.06 (100) |
| Mpinji | 0.21 (−0.02) | −0.02 (100) | −0.01 (100) | −1.50 (0.17) | 0.22 (90) | 0.10 (100) | 8.66 (0.99) | −0.34 (0) | 0.42 (100) | 2.80 (0.32) | 0.16 (100) | 0.64 (0) |
| Goha | −0.15 (0.03) | 0.03 (100) | 0.04 (100) | −0.46 (0.05) | 0.03 (100) | −0.05 (100) | 1.07 (0.12) | −0.08 (100) | −0.03 (100) | 0.26 (0.26) | 0.01 (100) | 0.10 (100) |
| Kadando | −0.33 (0.04) | 0.06 (100) | 0.07 (100) | −0.94 (0.11) | 0.11 (100) | 0.03 (100) | 1.27 (0.15) | −0.06 (100) | 0.06 (100) | 1.14 (0.13) | 0.01 (100) | 0.22 (100) |
| West Usambara | ||||||||||||
| Kwadoe | 0.02 (<0.01) | <−0.01 (100) | −0.02 (100) | −0.20 (0.02) | −0.00 (100) | −0.08 (100) | 4.05 (0.46) | −0.10 (100) | 0.35 (99) | 2.73 (0.31) | −0.08 (100) | 0.20 (100) |
| Funta | 0.04 (<0.01) | 0.02 (100) | 0.05 (100) | −0.30 (0.03) | 0.02 (100) | −0.04 (100) | 0.00 (0.00) | 0.00 (100) | −0.02 (100) | 0.12 (0.01) | −0.01 (100) | −0.03 (100) |
| Tamota | −0.66 (0.07) | 0.05 (100) | −0.05 (100) | −1.60 (0.18) | 0.25 (62) | −0.02 (100) | 9.63 (1.10) | −0.38 (4) | 0.38 (96) | 2.74 (0.31) | 0.23 (83) | 0.65 (13) |
| Mgila | 0.24 (0.03) | 0.07 (100) | 0.21 (100) | −0.65 (0.07) | 0.13 (100) | 0.16 (100) | 8.56 (0.98) | −0.21 (87) | 0.47 (58) | 4.35 (0.50) | −0.02 (100) | 0.43 (70) |
| Tanga coast | ||||||||||||
| Mgome | −0.75 (0.09) | 0.04 (100) | −0.32 (100) | −1.32 (0.15) | 0.18 (88) | 0.05 (100) | 8.15 (0.93) | −0.33 (23) | 0.41 (94) | 3.38 (0.39) | 0.10 (100) | 0.59 (8) |
Results based on 100 MCAR data sets in which each data set was analyzed by MICE using 25 imputed data sets generated from chains of 25 iterations and random initial conditions.
Results based on 100 imputed data sets generated by MICE using chains of 25 iterations and random initial conditions.
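The two quantities reported per parameter, estimation bias and confidence interval (CI) coverage, can be computed from a set of simulated estimates as sketched below. This assumes that bias is the mean estimate minus a reference value and that coverage is the percentage of 95% Wald intervals containing that reference value; the source does not spell out either definition. The estimates and standard errors in the example are hypothetical toy values, and only the reference value −0.225 is taken from the complete case analysis.

```python
def bias_and_coverage(estimates, std_errors, reference, z=1.96):
    """Return (bias, coverage in %) over a set of simulated data sets.

    Assumptions: bias = mean estimate - reference value; coverage uses
    Wald intervals estimate +/- z * SE.
    """
    m = len(estimates)
    bias = sum(estimates) / m - reference
    covered = sum(
        1
        for est, se in zip(estimates, std_errors)
        if est - z * se <= reference <= est + z * se
    )
    return bias, 100.0 * covered / m


# Hypothetical estimates of lambda1 from three simulated data sets.
toy_estimates = [-0.20, -0.25, 0.30]
toy_std_errors = [0.10, 0.10, 0.10]
reference_lambda1 = -0.225  # adjusted complete case estimate

bias, coverage = bias_and_coverage(toy_estimates, toy_std_errors, reference_lambda1)
print(round(bias, 3), round(coverage, 1))
```

In the toy example two of the three intervals contain the reference value, so a large bias in a single data set lowers coverage even when the other estimates are close to the reference.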
Genetic association analysis using IM3 (100 imputed data sets) under different data settings: (i) data of 13 villages where genotyping of the HBA2 gene was attempted in the majority of the individuals, (ii) data of the same 13 villages and an additional village where genotyping was not attempted, and (iii) all data from the 24 villages
| Analysis | Total sample size, n | Missing genotypes, % | Mean association signal (range) | Average no. of deletions, estimate (SE) | λ1, estimate (SE) | λ2, estimate (SE) |
|---|---|---|---|---|---|---|
| 13 villages | 4143 | 34.9 | 1.47 (0.42–3.22) | 0.326 (0.01) | −0.226 (0.12) | 0.181 (0.27) |
| 13 villages and an additional village | ||||||
| North Pare | ||||||
| Kilomeni | 4433 | 39.1 | 1.47 (0.19–3.68) | 0.32 (0.01) | −0.22 (0.13) | 0.19 (0.27) |
| Lambo | 4405 | 38.7 | 1.59 (0.18–4.28) | 0.32 (0.01) | −0.24 (0.13) | 0.18 (0.27) |
| Ngulu | 4499 | 40.0 | 1.62 (0.29–4.75) | 0.33 (0.01) | −0.23 (0.13) | 0.17 (0.28) |
| Kambi ya Simba | 4356 | 38.0 | 1.65 (0.13–4.42) | 0.33 (0.01) | −0.23 (0.13) | 0.20 (0.26) |
| West Usambara 1 | ||||||
| Emmao | 4321 | 37.5 | 1.51 (0.20–3.98) | 0.33 (0.01) | −0.23 (0.13) | 0.17 (0.27) |
| Handei | 4489 | 39.9 | 1.67 (0.38–4.87) | 0.33 (0.01) | −0.23 (0.12) | 0.21 (0.27) |
| Tewe | 4463 | 39.5 | 1.88 (0.21–5.21) | 0.33 (0.01) | −0.23 (0.13) | 0.20 (0.27) |
| Mn'galo | 4490 | 39.9 | 1.75 (0.27–4.56) | 0.33 (0.01) | −0.22 (0.13) | 0.21 (0.27) |
| West Usambara 2 | ||||||
| Magamba | 4351 | 38.0 | 1.56 (0.59–3.51) | 0.32 (0.01) | −0.23 (0.12) | 0.20 (0.27) |
| Ubiri | 4297 | 37.2 | 1.77 (0.07–5.08) | 0.33 (0.01) | −0.25 (0.13) | 0.17 (0.27) |
| Kwemasimba | 4374 | 38.3 | 1.83 (0.21–4.63) | 0.34 (0.01) | −0.24 (0.13) | 0.18 (0.27) |
| 24 villages | 7048 | 61.7 | 2.70 (0.10–6.75) | 0.34 (0.03) | −0.23 (0.12) | 0.18 (0.25) |
Association signal is calculated as −log10(p-value) using either the mean or the median of the log-likelihood ratio statistic across all imputed data sets.
Results of model IM3 were obtained from imputed data using chains of 25 iterations and random initial conditions.
Results of model IM3 were obtained from imputed data using chains of 100 iterations and random initial conditions.
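Estimates and standard errors from multiple imputation are commonly pooled across imputed data sets with Rubin's rules. The source does not state which pooling method produced the estimates (SE) above, so the sketch below is an assumption: it shows the standard rules, in which the pooled estimate is the mean and the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name and the toy inputs are illustrative.

```python
import math


def pool_rubin(estimates, std_errors):
    """Pool per-imputation estimates and standard errors with Rubin's rules.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2:
        raise ValueError("Rubin's rules need at least two imputed data sets")
    pooled = sum(estimates) / m
    within = sum(se ** 2 for se in std_errors) / m
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total_variance = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total_variance)


# Hypothetical estimates from three imputed data sets.
estimate, std_error = pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(estimate, round(std_error, 4))
```

The between-imputation term is what makes the pooled standard error reflect the uncertainty added by the missing genotypes, rather than treating imputed values as observed.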