Jinzhuang Dou1, Baoluo Sun1, Xueling Sim2, Jason D Hughes3, Dermot F Reilly3, E Shyong Tai2,4,5, Jianjun Liu5,6, Chaolong Wang1,4.
Abstract
Knowledge of biological relatedness between samples is important for many genetic studies. In large-scale human genetic association studies, the estimated kinship is used to remove cryptic relatedness, control for family structure, and estimate trait heritability. However, estimation of kinship is challenging for sparse sequencing data, such as those from off-target regions in target sequencing studies, where genotypes are largely uncertain or missing. Existing methods often assume accurate genotypes at a large number of markers across the genome. We show that these methods, without accounting for the genotype uncertainty in sparse sequencing data, can yield a strong downward bias in kinship estimation. We develop a computationally efficient method called SEEKIN to estimate kinship for both homogeneous samples and heterogeneous samples with population structure and admixture. Our method models genotype uncertainty and leverages linkage disequilibrium through imputation. We test SEEKIN on a whole exome sequencing dataset (WES) of Singapore Chinese and Malays, which involves substantial population structure and admixture. We show that SEEKIN can accurately estimate kinship coefficient and classify genetic relatedness using off-target sequencing data down sampled to ~0.15X depth. In application to the full WES dataset without down sampling, SEEKIN also outperforms existing methods by properly analyzing shallow off-target data (~0.75X). Using both simulated and real phenotypes, we further illustrate how our method improves estimation of trait heritability for WES studies.
Year: 2017 PMID: 28961250 PMCID: PMC5636172 DOI: 10.1371/journal.pgen.1007021
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Fig 2. Cryptic relatedness among 2,452 individuals in the Singapore Living Biobank Project.
We estimated kinship coefficient ϕ and the proportion of zero-IBD-sharing π0 for each pair of individuals using PC-Relate. Relatedness types were determined using the inference criteria of ϕ and π0 given by [18]. An ambiguous relationship was inferred if the criteria of ϕ and π0 were not met simultaneously. (A) Results for pairs of Chinese. (B) Results for pairs of Malays. (C) Results for pairs that consist of a Chinese and a Malay.
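The two-criterion rule described in this caption can be sketched in Python as follows. This is only an illustration: the text does not reproduce the criteria from [18], so the cut-offs below are an assumption based on the commonly used power-of-two kinship boundaries and matching zero-IBD ranges. The function name `classify_pair` and the `CRITERIA` table are hypothetical, and parent-offspring and full-sibling pairs are merged into one PO/FS class as in the tables of this record.

```python
import math

# Each row: (label, phi_low, phi_high, pi0_low, pi0_high); intervals are [low, high).
# ASSUMPTION: these cut-offs are illustrative and are not quoted from the source.
CRITERIA = [
    ("PO/FS",      2 ** -2.5, 2 ** -1.5, -math.inf,     0.365),
    ("2nd degree", 2 ** -3.5, 2 ** -2.5, 0.365,         1 - 2 ** -1.5),
    ("3rd degree", 2 ** -4.5, 2 ** -3.5, 1 - 2 ** -1.5, 1 - 2 ** -2.5),
    ("unrelated",  -math.inf, 2 ** -4.5, 1 - 2 ** -2.5, math.inf),
]


def classify_pair(phi: float, pi0: float) -> str:
    """Return a relatedness label, or 'ambiguous' when the kinship
    coefficient and the zero-IBD proportion do not agree on one class."""
    for label, phi_lo, phi_hi, pi0_lo, pi0_hi in CRITERIA:
        if phi_lo <= phi < phi_hi and pi0_lo <= pi0 < pi0_hi:
            return label
    return "ambiguous"
```

Requiring both criteria in the same row is what produces the ambiguous class: a pair with a first-degree kinship coefficient but a zero-IBD proportion typical of unrelated pairs matches no row.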
Fig 6. Off-target sequencing data improve kinship estimation in WES of 762 Chinese and Malays.
In each panel, we plotted the difference between sequence-based estimates and array-based estimates (ϕseq–ϕarray, y-axis) versus the array-based estimates from PC-Relate (ϕarray, x-axis). Colored circles represent kinship coefficients between pairs of individuals, with relatedness types as determined in Fig 2. Grey crosses represent self-kinship coefficients. The analyses were based on the BEAGLE+1KG3 call set at SNPs overlapping with the SGVP dataset. We evaluated SEEKIN (A, C) and PC-Relate (B, D) using 40,824 SNPs within the WES target regions or 1,054,229 SNPs across both target and off-target regions.
Performance of homogeneous kinship estimators in ~0.15X sequencing data of 254 Chinese.
| Call set | Method | Unrelated (31,925 pairs), RMSE | Unrelated, BIAS | 3rd degree (22 pairs), RMSE | 3rd degree, BIAS | 2nd degree (36 pairs), RMSE | 2nd degree, BIAS | PO/FS (146 pairs), RMSE | PO/FS, BIAS | Self-kinship (254 individuals), RMSE | Self-kinship, BIAS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Bcftools | lcMLkin | 0.035 | 0.035 | 0.019 | 0.016 | 0.013 | -0.004 | 0.028 | -0.026 | — | — |
| Bcftools | GCTA | 0.007 | -0.006 | 0.034 | -0.033 | 0.062 | -0.061 | 0.116 | -0.116 | 0.123 | -0.122 |
| Bcftools | KING | 0.053 | 0.053 | 0.018 | -0.016 | 0.088 | -0.087 | 0.225 | -0.225 | — | — |
| BEAGLE | SEEKIN | 0.007 | -0.004 | 0.012 | -0.009 | 0.018 | -0.016 | 0.028 | -0.023 | 0.043 | -0.003 |
| BEAGLE | GCTA | 0.005 | -0.003 | 0.027 | -0.026 | 0.050 | -0.049 | 0.094 | -0.093 | 0.014 | 0.011 |
| BEAGLE | KING | 0.017 | -0.014 | 0.036 | -0.036 | 0.054 | -0.054 | 0.099 | -0.099 | — | — |
| BEAGLE+1KG3 | SEEKIN | 0.005 | -0.004 | 0.004 | -0.001 | 0.006 | -0.001 | 0.013 | 0.008 | 0.032 | 0.002 |
| BEAGLE+1KG3 | GCTA | 0.004 | -0.003 | 0.014 | -0.014 | 0.027 | -0.027 | 0.047 | -0.046 | 0.007 | -0.009 |
| BEAGLE+1KG3 | KING | 0.005 | -0.002 | 0.014 | -0.013 | 0.022 | -0.022 | 0.044 | -0.043 | — | — |
RMSE is the root mean squared error and BIAS is defined as the mean difference to the array-based estimates from PC-Relate for each type of relatedness. Negative values of BIAS suggest underestimation for results based on sparse sequencing data and vice versa.
* Smallest magnitude of RMSE or BIAS in each call set and each type of relatedness.
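The RMSE and BIAS definitions in the table note can be written out as a short Python sketch. The function name `rmse_and_bias` and the toy numbers in the usage note are illustrative assumptions, not values from the study; the array-based estimates play the role of the reference, as stated in the note.

```python
import math


def rmse_and_bias(seq_estimates, array_estimates):
    """Compare sequence-based kinship estimates with array-based reference
    estimates for the same pairs (or individuals, for self-kinship)."""
    if not seq_estimates or len(seq_estimates) != len(array_estimates):
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [s - a for s, a in zip(seq_estimates, array_estimates)]
    bias = sum(diffs) / len(diffs)                            # mean difference
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))  # root mean squared error
    return rmse, bias
```

For two hypothetical PO/FS pairs with array-based estimates of 0.25 and sequence-based estimates of 0.20 and 0.24, `rmse_and_bias([0.20, 0.24], [0.25, 0.25])` gives a negative BIAS, which by the note's convention indicates underestimation from the sequencing data.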
Performance of heterogeneous kinship estimators in ~0.15X sequencing data of 762 Chinese and Malays.
| Call set | Method | Unrelated (289,205 pairs), RMSE | Unrelated, BIAS | 3rd degree (148 pairs), RMSE | 3rd degree, BIAS | 2nd degree (147 pairs), RMSE | 2nd degree, BIAS | PO/FS (437 pairs), RMSE | PO/FS, BIAS | Self-kinship (762 individuals), RMSE | Self-kinship, BIAS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BEAGLE | SEEKIN | 0.007 | -0.002 | 0.010 | -0.001 | 0.014 | -0.007 | 0.025 | -0.010 | 0.058 | 0.006 |
| BEAGLE | PC-Relate | 0.005 | 0.000 | 0.022 | -0.021 | 0.044 | -0.043 | 0.084 | -0.083 | 0.035 | 0.018 |
| BEAGLE | REAP | 0.004 | -0.001 | 0.024 | -0.023 | 0.048 | -0.048 | 0.091 | -0.090 | 0.033 | -0.005 |
| BEAGLE | RelateAdmix | 0.004 | 0.002 | 0.023 | -0.022 | 0.046 | -0.045 | 0.088 | -0.087 | — | — |
| BEAGLE+1KG3 | SEEKIN | 0.004 | -0.002 | 0.006 | 0.004 | 0.009 | 0.006 | 0.021 | 0.015 | 0.041 | 0.018 |
| BEAGLE+1KG3 | PC-Relate | 0.002 | 0.000 | 0.011 | -0.011 | 0.025 | -0.024 | 0.049 | -0.048 | 0.014 | -0.008 |
| BEAGLE+1KG3 | REAP | 0.002 | -0.001 | 0.015 | -0.015 | 0.030 | -0.029 | 0.054 | -0.053 | 0.020 | -0.014 |
| BEAGLE+1KG3 | RelateAdmix | 0.002 | 0.001 | 0.013 | -0.013 | 0.026 | -0.025 | 0.048 | -0.047 | — | — |
RMSE is the root mean squared error and BIAS is defined as the mean difference to the array-based estimates from PC-Relate for each type of relatedness. Negative values of BIAS suggest underestimation for results based on sparse sequencing data and vice versa.
* Smallest magnitude of RMSE or BIAS in each call set and each type of relatedness.
Comparison of kinship estimation with and without off-target data in WES of 762 Chinese and Malays.
| Dataset | Method | Unrelated (289,205 pairs), RMSE | Unrelated, BIAS | 3rd degree (148 pairs), RMSE | 3rd degree, BIAS | 2nd degree (147 pairs), RMSE | 2nd degree, BIAS | PO/FS (437 pairs), RMSE | PO/FS, BIAS | Self-kinship (762 individuals), RMSE | Self-kinship, BIAS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Target | SEEKIN | 0.006 | -0.001 | 0.007 | -0.001 | 0.008 | -0.003 | 0.010 | -0.004 | 0.018 | -0.005 |
| Target | PC-Relate | 0.005 | 0.000 | 0.007 | -0.001 | 0.008 | -0.004 | 0.012 | -0.008 | 0.022 | -0.017 |
| Target + off-target | SEEKIN | 0.003 | -0.001 | 0.003 | 0.001 | 0.004 | 0.001 | 0.007 | 0.003 | 0.017 | 0.002 |
| Target + off-target | PC-Relate | 0.002 | 0.000 | 0.004 | -0.004 | 0.009 | -0.009 | 0.019 | -0.019 | 0.026 | -0.021 |
Evaluation was based on SNPs overlapped with the SGVP dataset in the BEAGLE+1KG3 call set of 762 individuals for both SEEKIN and PC-Relate. 40,824 SNPs within target regions and 1,054,229 SNPs across target and off-target regions were included in the analyses. RMSE is the root mean squared error and BIAS is defined as the mean difference to the array-based estimates from PC-Relate for each type of relatedness. Negative values of BIAS suggest underestimation for results based on sparse sequencing data and vice versa.
Heritability estimation for 10 metabolic traits in 762 related Chinese and Malays.
| Trait | Sample size | OmniExpress array (435,314 SNPs) | WES target + off-target (1,054,229 SNPs) | WES target only (40,824 SNPs) |
|---|---|---|---|---|
| BMI | 762 | 0.587 (0.091) | 0.554 (0.090) | 0.553 (0.087) |
| WHR | 762 | 0.355 (0.096) | 0.343 (0.092) | 0.319 (0.087) |
| SBP | 752 | 0.172 (0.098) | 0.164 (0.098) | 0.157 (0.090) |
| DBP | 734 | 0.262 (0.099) | 0.265 (0.097) | 0.187 (0.089) |
| TC | 761 | 0.523 (0.086) | 0.517 (0.086) | 0.438 (0.083) |
| LDL | 761 | 0.602 (0.087) | 0.593 (0.084) | 0.470 (0.086) |
| HDL | 761 | 0.658 (0.077) | 0.632 (0.077) | 0.576 (0.079) |
| TG | 628 | 0.609 (0.101) | 0.588 (0.010) | 0.534 (0.099) |
| FBG | 628 | 0.402 (0.105) | 0.378 (0.103) | 0.338 (0.101) |
| HbA1C | 683 | 0.572 (0.092) | 0.570 (0.090) | 0.549 (0.089) |
The pairwise relatedness matrix (2Φ) was estimated by PC-Relate for array genotyping data and by SEEKIN for sequencing data, based on common SNPs overlapped with the SGVP dataset. Trait heritability was estimated using a linear mixed model, adjusting for age, age², sex, and the first two ancestry PCs.
Abbreviations of traits: BMI, body-mass index; WHR, waist-to-hip ratio; SBP, systolic blood pressure; DBP, diastolic blood pressure; TC, total cholesterol; LDL, low-density lipoprotein; HDL, high-density lipoprotein; TG, triglycerides; FBG, fasting blood glucose; HbA1C, hemoglobin A1C.
* Values in parentheses are standard errors of the heritability estimates.
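The table notes do not spell out the linear mixed model, so the following is a sketch of the standard variance-component form such an analysis usually takes, stated as an assumption rather than as the authors' exact specification. Here $y$ is the trait vector, $X$ holds the covariates named in the note (age, age², sex, and the first two ancestry PCs), $2\Phi$ is the pairwise relatedness matrix, and $\sigma_g^2$, $\sigma_e^2$ are hypothetical symbols for the genetic and residual variance components.

```latex
\begin{align}
  y &= X\beta + g + \varepsilon, \\
  g &\sim \mathcal{N}\!\left(0,\; \sigma_g^2 \cdot 2\Phi\right), \qquad
  \varepsilon \sim \mathcal{N}\!\left(0,\; \sigma_e^2 I\right), \\
  h^2 &= \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}.
\end{align}
```

Under this form, a downward-biased $\Phi$ among relatives distorts the covariance assumed for $g$, which is one way kinship estimation error can carry over into the heritability estimate.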
Computational costs for kinship estimation software programs.
| Estimator type | Method | Version | No. of CPUs | Wall-clock time (M = 100,000 SNPs) | Peak memory (M = 100,000 SNPs) | Wall-clock time (M = 1,000,000 SNPs) | Peak memory (M = 1,000,000 SNPs) |
|---|---|---|---|---|---|---|---|
| For homogeneous samples | SEEKIN-hom | v1.0 | 10 | 13 mins | 2.8 GB | 116 mins | 2.8 GB |
| For homogeneous samples | KING | v2.09 | 10 | 3.0 mins | 0.6 GB | 30.3 mins | 4.8 GB |
| For homogeneous samples | GCTA | v1.25.3 | 10 | 3.3 mins | 5.8 GB | - | >50 GB |
| For heterogeneous samples with population structure and admixture | SEEKIN-het | v1.0 | 10 | 55 mins | 3.8 GB | 662 mins | 3.8 GB |
| For heterogeneous samples with population structure and admixture | REAP | v1.2 | 10 | 1168 mins | 3.5 GB | >100 hours | - |
| For heterogeneous samples with population structure and admixture | PC-Relate | v2.1.6 | 1 | 2550 mins | 15.0 GB | >100 hours | - |
| For heterogeneous samples with population structure and admixture | RelateAdmix | v0.14 | 1 | >100 hours | - | >100 hours | - |
Evaluations were based on two synthetic datasets of 10,000 individuals sampled with replacement from the WES data of 762 Chinese and Malays. We set the number of CPUs to 10 if the software program supports multi-threading. For all methods, we only evaluated computational cost for kinship estimation, excluding data preparation steps such as genotype calling and calculation of individual allele frequencies. For SEEKIN, we processed SNPs in blocks of size L = 10,000. PC-Relate was implemented in the R package “GENESIS” and the version number is for the “GENESIS” package. Tests were run on a high-performance computing cluster with Intel Xeon CPUs (2.8 GHz). Jobs were terminated if the memory usage exceeded 50 gigabytes (GB) or the run time exceeded 100 hours.
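The block-wise processing mentioned in the note can be illustrated with a minimal Python sketch. It uses a generic ratio-of-sums kinship estimator on allele dosages, which is an assumption chosen for brevity and is not SEEKIN's estimator (SEEKIN additionally models genotype uncertainty). The names `blockwise_kinship` and `Snp` are hypothetical. The point of the sketch is only that per-block accumulation keeps working memory tied to the block size and the number of individuals rather than to the total number of SNPs.

```python
from typing import Iterable, List, Tuple

# One SNP: (allele frequency p, dosages in [0, 2] for all n individuals).
Snp = Tuple[float, List[float]]


def blockwise_kinship(blocks: Iterable[List[Snp]], n: int) -> List[List[float]]:
    """Accumulate a generic kinship estimate one SNP block at a time.

    phi[i][j] = sum_m (g_im - 2 p_m)(g_jm - 2 p_m) / sum_m 4 p_m (1 - p_m)
    ASSUMPTION: illustrative estimator, not the one used by SEEKIN.
    """
    numer = [[0.0] * n for _ in range(n)]
    denom = 0.0
    for block in blocks:  # only the current block needs to be in memory
        for p, dosages in block:
            centered = [g - 2.0 * p for g in dosages]
            for i in range(n):
                for j in range(i, n):
                    numer[i][j] += centered[i] * centered[j]
            denom += 4.0 * p * (1.0 - p)
    if denom == 0.0:
        raise ValueError("no polymorphic SNPs were provided")
    phi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            phi[i][j] = phi[j][i] = numer[i][j] / denom
    return phi
```

Because the numerator and denominator are plain running sums, the result does not depend on how the SNPs are split into blocks; `blocks` can be any iterator that reads, for example, L SNPs at a time from disk.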