Monica D Ramstetter1, Thomas D Dyer2, Donna M Lehman3, Joanne E Curran2, Ravindranath Duggirala2, John Blangero2, Jason G Mezey4,5, Amy L Williams1.
Abstract
Inferring relatedness from genomic data is an essential component of genetic association studies, population genetics, forensics, and genealogy. While numerous methods exist for inferring relatedness, thorough evaluation of these approaches in real data has been lacking. Here, we report an assessment of 12 state-of-the-art pairwise relatedness inference methods using a data set with 2485 individuals contained in several large pedigrees that span up to six generations. We find that all methods have high accuracy (92-99%) when detecting first- and second-degree relationships, but their accuracy dwindles to <43% for seventh-degree relationships. However, most identical by descent (IBD) segment-based methods inferred seventh-degree relatives correct to within one relatedness degree for >76% of relative pairs. Overall, the most accurate methods are Estimation of Recent Shared Ancestry (ERSA) and approaches that compute total IBD sharing using the output from GERMLINE and Refined IBD to infer relatedness. Combining information from the most accurate methods provides little accuracy improvement, indicating that novel approaches, such as new methods that leverage relatedness signals from multiple samples, are needed to achieve a sizeable jump in performance.
Keywords: admixture; identical by descent; relatedness estimation
Year: 2017 PMID: 28739658 PMCID: PMC5586387 DOI: 10.1534/genetics.117.1122
Source DB: PubMed Journal: Genetics ISSN: 0016-6731 Impact factor: 4.562
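The abstract refers to approaches that compute total IBD sharing and use it to infer a relatedness degree. A minimal sketch of that idea follows, under stated assumptions: the kinship coefficient is taken as φ = k1/4 + k2/2 (k1 and k2 being the genome fractions shared IBD1 and IBD2), and degrees are assigned with the common powers-of-two binning of kinship values. The function names, the `genome_cm` value, and the binning are illustrative and are not confirmed as the procedure used in the study.

```python
import math


def kinship_from_ibd(ibd1_cm, ibd2_cm, genome_cm):
    """Kinship coefficient from total IBD1 and IBD2 lengths (in cM).

    phi = k1/4 + k2/2, where k1 and k2 are the fractions of the
    genome shared IBD1 and IBD2. genome_cm is an assumed total map length.
    """
    k1 = ibd1_cm / genome_cm
    k2 = ibd2_cm / genome_cm
    return k1 / 4 + k2 / 2


def degree_from_kinship(phi, max_degree=7):
    """Map a kinship coefficient to a relatedness degree.

    Degree d covers 2**-(d + 1.5) < phi <= 2**-(d + 0.5) (assumed binning).
    Returns 0 for MZ twins/duplicates and None for pairs that are
    unrelated or more distant than max_degree.
    """
    if phi <= 0:
        return None
    degree = max(math.floor(-math.log2(phi) - 0.5), 0)
    return degree if degree <= max_degree else None


# Hypothetical genome length, for illustration only.
GENOME_CM = 3500.0

half_sibs = degree_from_kinship(kinship_from_ibd(1750.0, 0.0, GENOME_CM))
full_sibs = degree_from_kinship(kinship_from_ibd(1750.0, 875.0, GENOME_CM))
first_cousins = degree_from_kinship(kinship_from_ibd(875.0, 0.0, GENOME_CM))
```

With these toy inputs, half siblings (half the genome IBD1) fall in the second-degree bin, full siblings in the first-degree bin, and first cousins in the third-degree bin.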
Numbers of pairs of individuals in the SAMAFS data set that passed sample filters and are reported to have relatedness between first- and seventh-degree or as unrelated
| Degree | Number of pairs |
|---|---|
| 1 | 4969 |
| 2 | 6625 |
| 3 | 8241 |
| 4 | 7636 |
| 5 | 3794 |
| 6 | 816 |
| 7 | 73 |
| Unrelated | 3,051,598 |
| Total | 3,083,752 |
We combined reported monozygotic (MZ) twins with the set of first-degree relatives.
Supplemental Note in File S1.
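As a quick consistency check on the table above, the per-degree counts can be summed and compared with the reported total. This is only an arithmetic sketch using the numbers listed in the table; the variable names are illustrative.

```python
# Pair counts per reported degree, copied from the table above.
pair_counts = {1: 4969, 2: 6625, 3: 8241, 4: 7636, 5: 3794, 6: 816, 7: 73}
unrelated_pairs = 3_051_598
reported_total = 3_083_752

related_pairs = sum(pair_counts.values())
computed_total = related_pairs + unrelated_pairs

# Share of all analyzed pairs that are first- to seventh-degree relatives.
related_share = related_pairs / computed_total

print(f"related pairs: {related_pairs}")
print(f"computed total matches table: {computed_total == reported_total}")
print(f"related share of all pairs: {related_share:.2%}")
```

The related pairs sum to 32,154, which together with the unrelated pairs reproduces the reported total; relatives make up roughly 1% of all pairs, and seventh-degree pairs are by far the smallest class.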
Properties of the 12 relationship inference methods we analyzed
| Method | Version | Type | Output | Parallelized? | Runtime (× cores if > 1) [× number of runs] | Requires independent markers | Input required from outside program | Accounts for population structure |
|---|---|---|---|---|---|---|---|---|
| ERSA | 2.0 | IBD segment-based | Degree of relatedness | N | 14.3 + 96.3 hr (×16) | N | IBD segments | NA |
| fastIBD | Beagle 3.3.2 | IBD segment-finding | IBD segments | N | 55.2 hr [×10] | N | NA | NA |
| GERMLINE (-haploid) | 1.5.1 | IBD segment-finding (distinguishes IBD1 and IBD2) | IBD segments | N | 19.2 min + 96.0 hr (×16) | N | Phased genotypes | NA |
| HaploScore | NA | IBD segment-based | IBD segments | N | 2.4 + 96.3 hr (×16) | N | IBD segments; phased genotypes | NA |
| IBDseq | r1206 | IBD segment-finding | IBD segments | Y | 33.1 hr (×16) | N | NA | NA |
| KING (KING-robust) | 1.4 | Allele frequency-based IBD estimate | IBD 0,1,2 proportions | N | 4.6 min | Y | NA | Y |
| PC-Relate | 2.0.1 | Allele frequency-based IBD estimate | IBD 0,1,2 proportions | N | 8.9 hr + 4.6 min | Y | Pairwise kinship coefficients | Y |
| PLINK 1.9 | 1.90b2k | Allele frequency-based IBD estimate | IBD 0,1,2 proportions | N | 18.1 sec | Y | NA | N |
| PREST-plus | 4.1 | Allele frequency-based; uses linkage model | IBD 0,1,2 proportions | N | 178.9 hr | N | NA | N |
| REAP | 1.2 | Allele frequency-based IBD estimate | IBD 0,1,2 proportions | N | 3.8 + 2.8 hr | Y | Ancestral population allele frequencies; sample ancestry proportions | Y |
| Refined IBD | Beagle 4.1 | IBD segment-finding (distinguishes IBD1 and IBD2) | IBD segments | Y | 96.0 hr (×16) [×3] | N | NA | NA |
| RelateAdmix | 0.1 | Allele frequency-based IBD estimate | IBD 0,1,2 proportions | Y | 15.8 hr (×16) + 2.8 hr | Y | Ancestral population allele frequencies; sample ancestry proportions | Y |
Type indicates the inference methodology the program uses. Runtime is wall clock time to run the program, plus any additional time to run programs needed to produce its input, as indicated. We ran parallelized programs using the numbers of cores indicated in parentheses, and ran fastIBD and Refined IBD multiple times as recommended by the authors, with counts indicated in square brackets. Input required from outside program indicates external information needed to run the program. Programs that use principal components or sample ancestral population proportions, or that use a model designed for multiple populations, are indicated as accounting for population structure. “Y” indicates yes, “N” indicates no, and “NA” indicates not applicable. Runtimes are from a machine with four AMD Opteron 6176 2.30 GHz processors (64 cores total) and 256 GB memory.
Additional time to phase the data using Beagle 4.1 and run GERMLINE.
Additional time to phase the data using Beagle 4.1.
Additional time to obtain KING relatedness estimates; base PC-Relate time is the sum of time to run this method and PC-AiR (Conomos ).
Additional time to obtain ancestral population proportions using ADMIXTURE (Alexander ).
Figure 1. Performance comparison of the evaluated methods using the SAMAFS data set. Bar plots denote the percentage of sample pairs that are reported to have a given degree of relatedness and that are inferred to be related as the indicated degree. The bar plots are separated on the horizontal axis by the reported relatedness degree and on the vertical axis by inferred relatedness degree. For clarity, the plots list above each bar the inferred percentage that the corresponding bar depicts. Program names listed in red are IBD segment-based methods while those in black use allele frequencies for inference. Red horizontal bars under a bar plot indicate that the corresponding inferences agree with the reported relationships.
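The percentages plotted in the figure, and the "correct to within one relatedness degree" rate quoted in the abstract, can be sketched as a simple tabulation of reported versus inferred degree. The example below is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, an inferred value of `None` stands for "unrelated", and the toy pairs are invented purely to exercise the functions.

```python
from collections import Counter, defaultdict


def degree_confusion(pairs):
    """Percentage of pairs per (reported, inferred) degree combination.

    pairs: list of (reported_degree, inferred_degree) tuples.
    Returns {reported: {inferred: percent of that reported class}}.
    """
    counts = defaultdict(Counter)
    for reported, inferred in pairs:
        counts[reported][inferred] += 1
    return {
        reported: {inf: 100.0 * n / sum(row.values()) for inf, n in row.items()}
        for reported, row in counts.items()
    }


def accuracy(pairs, reported_degree, tolerance=0):
    """Percent of pairs with the given reported degree whose inferred
    degree is within `tolerance` degrees of the reported one."""
    hits = total = 0
    for reported, inferred in pairs:
        if reported != reported_degree:
            continue
        total += 1
        if inferred is not None and abs(inferred - reported) <= tolerance:
            hits += 1
    return 100.0 * hits / total if total else float("nan")


# Invented toy data: four reported seventh-degree pairs.
toy_pairs = [(7, 7), (7, 6), (7, 8), (7, None)]

exact = accuracy(toy_pairs, 7)                      # exact-degree agreement
within_one = accuracy(toy_pairs, 7, tolerance=1)    # within one degree
table = degree_confusion(toy_pairs)
```

Each row of `table` corresponds to one column of bar plots in the figure (a fixed reported degree), and `tolerance=1` gives the looser within-one-degree measure.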