Literature DB >> 25519375

PREST-plus identifies pedigree errors and cryptic relatedness in the GAW18 sample using genome-wide SNP data.

Abstract

Pedigree errors and cryptic relatedness often appear in families or population samples collected for genetic studies. If not identified, these issues can lead to either increased false negatives or false positives in both linkage and association analyses. To identify pedigree errors and cryptic relatedness among individuals from the 20 San Antonio Family Studies (SAFS) families and cryptic relatedness among the 157 putatively unrelated individuals, we apply PREST-plus to the genome-wide single-nucleotide polymorphism (SNP) data and analyze estimated identity-by-descent (IBD) distributions for all pairs of genotyped individuals. Based on the given pedigrees alone, PREST-plus identifies the following putative pairs: 1091 full-sib, 162 half-sib, 360 grandparent-grandchild, 2269 avuncular, 2717 first cousin, 402 half-avuncular, 559 half-first cousin, 2 half-sib+first cousin, 957 parent-offspring and 440,546 unrelated. Using the genotype data, PREST-plus detects 7 mis-specified relative pairs, with their IBD estimates clearly deviating from the null expectations, and it identifies 4 cryptic related pairs involving 7 individuals from 6 families.

Entities: CellLine Chemical Disease Gene

Year: 2014 PMID： 25519375 PMCID： PMC4143714 DOI： 10.1186/1753-6561-8-S1-S23

Source DB: PubMed Journal: BMC Proc ISSN： 1753-6561

Background

Mis-specified pedigree relationship and cryptic relatedness often occur in family and population data. The potential causes of such errors are numerous, including undocumented nonpaternity, nonmaternity, adoption, mating between relatives, sample duplication or swap. It is well known that such errors, if undetected, can affect the accuracy or power of both linkage and association studies, as well as have adverse effects on other aspects of the analyses such as population stratification [1-6]. Genome-wide marker data can provide accurate information on the genetic relatedness among individuals. For linkage scans, a number of statistical methods have been proposed and implemented, including RELCHECK [1], RELATIVE [7], PEDCHECK [8], SIBERROR [9], PREST [10,11], GRR [12], and ECLIPSE [13], among others. More recently, PLINK [14] and PREST-plus [5,6,15] have been developed for analysing the high-throughput SNP data collected from genome-wide association studies (GWAS) or next-generation sequencing (NGS) experiments. There are 3 main differences between PREST-plus and PLINK. First, PREST-plus uses the maximum likelihood-based IBD estimation method, which has more statistical power, whereas PLINK relies on the method-of-moments approach, which is computationally more efficient. Second, PREST-plus identifies both pedigree errors and cryptic relatedness in linkage or association studies with family data, population sample, or both, whereas PLINK is primarily suitable for detecting cryptic relatedness in GWAS with population sample. Third, PREST-plus provides a formal hypothesis testing framework, which can be useful when a potential error has been being identified, whereas PLINK provides point IBD estimation only [5,6]. The Genetic Analysis Workshop 18 (GAW18) data, similar to many emerging large-scale GWAS and NGS studies, include multigeneration large pedigrees. The genetic relationships between individuals in such data are not limited to simple types such as parent-offspring or siblings. Consequently, we focus here on the methodology implemented in PREST-plus [5,6,15] and discuss PLINK [14] when relevant.

Methods

Relationship estimation and testing

The relatedness between a pair of individuals can be summarized by the IBD probability distribution, . It describes the probability of a randomly sampled marker to have 0, 1, or 2 common ancestry alleles between 2 individuals. Using the available genotype data, PREST-plus estimates the most likely IBD distribution by obtaining that maximizes the quantity where is the genotype at marker m and denotes the number of alleles shared IBD by the pair at that marker. The maximization is efficiently achieved by an application of the expectation-maximization (EM) algorithm [10,16,17]. In contrast, PLINK [14] estimates the IBD distribution using the method-of-moments approach, which is less powerful than the likelihood-based method. Estimation of IBD distribution can often provide sufficient information to identify pedigree errors and cryptic relatedness [5,6]. However, it is useful to provide statistical evidence beyond point estimation. To this end, we apply the maximum likelihood ratio test (MLRT) to formally evaluate whether the observed genotypes G are compatible with the null putative relationship type, R[10]. Briefly, where is the likelihood calculated for the hypothesized R, and is the maximum likelihood calculated over a set of alternatives as given in Table 1. Statistical significance of MLRT is then assessed using simulation, because 2*MLRT does not follow the usual Chisq distribution [10]. Efficient implementation of the test to high-throughput genotype data requires pruning of the SNPs so that they are not in linkage disequilibrium (LD) [5,6].

Table 1

IBD distribution and kinship coefficient for the relationship types (reltype) considered by PREST-plus

reltype coding in PREST-plus	Relationship type(abbreviation)	Distribution of IBD sharing			Kinship coefficient, φ

		p₀	p₁	p₂
11	MZ-twin (MZ)	0.000	0.000	1.000	0.50000
10	parent-offspring (PO)	0.000	1.000	0.000	0.25000
1	full-sib (FS)	0.250	0.500	0.250	0.25000
9	half-sib+first cousin (HSFC)	0.375	0.500	0.125	0.18750
2	half-sib (HS)	0.500	0.500	0.000	0.12500
3	grandparent-grandchild (GPC)	0.500	0.500	0.000	0.12500
4	avuncular (AV)	0.500	0.500	0.000	0.12500
5	first cousin (FC)	0.750	0.250	0.000	0.06250
7	half-avuncular (HAV)	0.750	0.250	0.000	0.06250
8	half-first cousin (HFC)	0.875	0.125	0.000	0.03125
6	unrelated (UN)	1.000	0.000	0.000	0.00000
99	other types (Others)	NA	NA	NA	NA

Data analyses

The GWAS data of GAW18 consist of 959 individuals genotyped at 472,049 SNPs. Among the 959 individuals, 4 are removed for low genotyping rate (plink -MIND >0.8), and 141 individuals are in the "UNREL.txt" file that contains the maximum set of putatively unrelated individuals. Among the 472,049 SNPs, ~50,000 remain after minor allele frequency (MAF) and LD pruning (plink -indep-pairwise 200 50 0.2 -maf 0.05). We then conduct 3 sets of analyses, with analyses 1 and 2 focusing on relationship estimation and analysis 3 performing hypothesis testing. Analysis 1 detects pedigree errors and cryptic relatedness in the 955 genotyped individuals from the 20 SAFS families using PREST-plus (prest -geno datafamily.ped -map datafamily.map -wped -aped). It estimates the IBD distribution for any pair of individuals within a pedigree (-wped), as well as for any pair of individuals across pedigrees (-aped). Analysis 2 detects cryptic relatedness in the 141 putatively unrelated individuals, using both PREST-plus (prest -geno dataunrel.ped -map dataunrel.map) and PLINK (plink -file plink -genome). Analysis 3 performs formal hypothesis testing on the problematic pairs identified in analyses 1 and 2. For computational efficiency, we first randomly select ~2000 SNPs from the set of ~50,000 SNPs. We then obtain the base pair (bp) physical map of the SNPs using build 36 coordinates and their corresponding centimorgan (cM) genetic map using the Rutgers combined linkage-physical map and linear interpolation. Finally, we perform the MLRT test as implemented in PREST-plus (prest -file data2k.ped -map data2k.map -pair fID1 indID1 fID2 indID2 -mlrt -c).

Results

Analysis 1: Relationship estimation within and across the 20 SAFS families

PREST-plus identifies 455,535 pairs of genotyped individuals, with the total number of genotyped SNPs (commark) ranging from 31,120 to 49,020. Most of the 455,535 pairs have the putative relationship types considered by PREST-plus (see Table 1). (See Figure 1 of reference [5] for graphical illustrations of these relationship types.) IBD distribution and kinship coefficient for the relationship types (reltype) considered by PREST-plus Figure 1 shows the estimated IBD distributions, vs , stratified by the putative relationship, R, with the red cross marking the the IBD distribution expected for R. We observe a substantial number of pairs with their IBD estimates deviating from the null expectations. Table 2 provides detailed information for 7 clear outliers, indicating mis-specified relationships. For example, 2 putative half-sib pairs have close to (0.25, 0.5, 0.25) of full-sib, while 1 putative avuncular pair has close to (1, 0, 0) of unrelated. Analysis 3 below is to determine the statistical significance of the apparent deviation in the IBD estimates.

Figure 1

Table 2

Relationship estimation results for clear outliers in Figure 1 identified by analysis 1.

						Estimated

FID1^a	IID1^b	FID2^a	IID2^b	reltype^c	commark^d	p₀	p₁	p₂
3	T2DG0300174	3	T2DG0300175	1	49009	0.0000	0.0000	1.0000
4	T2DG0400281	4	T2DG0400282	1	48996	0.0000	0.0000	1.0000
4	T2DG0400265	4	T2DG0400266	2	48994	0.358	0.4511	0.1909
21	T2DG2100946	21	T2DG2100947	2	48957	0.3112	0.5566	0.1322
21	T2DG2100952	21	T2DG2100966	4	48949	0.9876	0.0109	0.0015
4	T2DG0400207	4	T2DG0400260	6	48955	0.4759	0.5157	0.0084
4	T2DG0400207	4	T2DG0400247	6	47503	0.6094	0.3723	0.0182

These 7 pairs of individuals have their estimated IBD distributions clearly deviating from the null expected values as specified in Table 1.

a Family ID.

b Individual ID.

c Relationship type as in Table 1.

d The number of common markers genotyped for both individuals.

Results of analysis 1: relationship IBD estimation within and between the 20 SAFS families using PREST-plus. The figures are stratified by the null putative relationship, R, as defined by the given pedigrees. The red cross marks the expected IBD distribution for Ras provided in Table 1. Each black dot shows the estimated pvs. pbased on the obseved genotype data for each of the 455,535 genotyped pairs analyzed, inlucidng 1091 full-sib, 162 half-sib, 360 grandparent-grandchild, 2269 avuncular, 2717 first-cousin, 440,546 unrelated (from both within and across families), 402 half-avuncular, 559 half-first cousin, 2 half-sib+first cousin, 957 parent-offspring, and 6470 other types of pairs. Blue circles mark the obvious outliers as detailed in Table 2. Relationship estimation results for clear outliers in Figure 1 identified by analysis 1. These 7 pairs of individuals have their estimated IBD distributions clearly deviating from the null expected values as specified in Table 1. a Family ID. b Individual ID. c Relationship type as in Table 1. d The number of common markers genotyped for both individuals.

Analysis 2: Relationship estimation among individuals in the "UNREL.txt" file

In this analysis, there are 141*140/2 = 9780 putatively unrelated pairs and the commark ranges from 32,280 to 49,010. Figure 2 displays vs for these pairs based on PREST-plus (left) and PLINK (right). Results clearly demonstrate the statistical efficiency of PREST-plus with overall less variation in the IBD estimates as compared to PLINK: 9338 pairs with PREST vs. 6538 pairs with PLINK . For the 4 potential cryptic relateded pairs, the PREST-plus IBD estimates provide a better identification of the outliers (Table 3).

Figure 2

Table 3

Relationship estimation results for clear outliers in Figure 2 identified by analysis 2

						PREST-plus estimated			PLINK estimated

FID1	IID1	FID2	IID2	reltype	commark	p₀	p₁	p₂	p₀	p₁	p₂
9	T2DG0901244	10	T2DG1000566	6	48912	0.8159	0.1735	0.0105	1.0000	0.0000	0.0000
8	T2DG0800497	9	T2DG0901244	6	48957	0.8304	0.1696	0.0000	1.0000	0.0000	0.0000
21	T2DG2100951	25	T2DG2501033	6	48940	0.8174	0.1826	0.0000	0.7713	0.2287	0.0000
4	T2DG0400207	4	T2DG0400247	6	47503	0.6142	0.3673	0.0185	0.7460	0.1972	0.0568

These four pairs of individuals have their estimated IBD distributions clearly deviating from the values expected for unrelated pairs.

Results of analysis 2: relationship IBD estimation among the 141 genotyped putatively unrelated individuals in the "UNREL.txt" file. The red cross marks the the IBD distribution expected for unrelated. Each black dot shows the estimated pvs. pbased on the obseved genotype data for each of the 9870 putatively unrelated pairs analyzed by PREST-plus (left) and PLINK(right). Blue circles mark the obvious outliers as detailed in Table 3. Relationship estimation results for clear outliers in Figure 2 identified by analysis 2 These four pairs of individuals have their estimated IBD distributions clearly deviating from the values expected for unrelated pairs.

Analysis 3: Relationship hypothesis testing of problematic pairs

As a proof of principle, we focus on the 7 clear outliers identified in analysis 1 and the 4 pairs identified in analysis 2 (Table 4). Among the putatively unrelated pairs, the possible alternative relationship types range from half-first cousin to avuncular.

Table 4

Relationship testing results for clear outliers in Figure 1 identified by analysis 1, and in Figure 2 identified by analysis 2

FID1	IID1	FID2	IID2	null reltype		p value	plausible reltype		p value
The 7 outliers identified by analysis 1
3	T2DG0300174	3	T2DG0300175	1	full-sib	0	11	MZ-twins	N/A
4	T2DG0400281	4	T2DG0400282	1	full-sib	0	11	MZ-twins	N/A
4	T2DG0400265	4	T2DG0400266	2	half-sib	0	9	half-sib+first cousin	0.254
21	T2DG2100946	21	T2DG2100947	2	half-sib	0	9	half-sib+first cousin	0.432
21	T2DG2100952	21	T2DG2100966	4	avunuclar	0	6	unrelated	0.891
4	T2DG0400207	4	T2DG0400260	6	unrelated	0	2	half-sib	0.328
4	T2DG0400207	4	T2DG0400247	6	unrelated	0	5	first cousin	0.752
The 4 outliers identified by analysis 2
9	T2DG0901244	10	T2DG1000566	6	unrelated	0	8	half-first cousin	0.112
8	T2DG0800497	9	T2DG0901244	6	unrelated	0.007	8	half-first cousin	0.673
21	T2DG2100951	25	T2DG2501033	6	unrelated	0	5	first cousin	0.633
4	T2DG0400207	4	T2DG0400247	6	unrelated	0	5	first cousin	0.712

Empirical p-values are based on 25,000 simulated replicates, with genotype data simulated under a specified relationship type. The simulating relationship type can be the null relationship defined by the given pedigrees (i.e. the null reltype) or another relationship type (i.e. the plausible reltype) as listed in Table 1. The possible plausible relationship types are not unique and the table provides the one with the highest p-values. Small p-value for testing the null reltype suggests that the observed genotype data are not compatible with the null relationship defined by the given pedigree, whereas large p-value for testing the plausible reltype suggests that the observed genotype data are compatible with the proposed alternative.

Relationship testing results for clear outliers in Figure 1 identified by analysis 1, and in Figure 2 identified by analysis 2 Empirical p-values are based on 25,000 simulated replicates, with genotype data simulated under a specified relationship type. The simulating relationship type can be the null relationship defined by the given pedigrees (i.e. the null reltype) or another relationship type (i.e. the plausible reltype) as listed in Table 1. The possible plausible relationship types are not unique and the table provides the one with the highest p-values. Small p-value for testing the null reltype suggests that the observed genotype data are not compatible with the null relationship defined by the given pedigree, whereas large p-value for testing the plausible reltype suggests that the observed genotype data are compatible with the proposed alternative.

Discussion

The GAW18 "pedigree information was [previously] verified by estimated kinship coefficients, principal component analysis (PCA), and number of mendelian errors between parent and offspring samples." However, no other details are provided and our analyses show that pedigree errors and cryptic relatedness exist in the data. The results presented here are based on the ~50,000 GWAS SNPs that have MAF greater than 5% and pair-wise LD less than 0.2. Our experience with PREST-plus shows that there is little improvement in estimation accuracy, once more than ~50,000 SNPs were used (typically with MAF >5%). Additional analyses with denser sets of SNPs confirmed this (results not shown). However, substantially more SNPs are needed for PLINK to achieve similar estimation efficiency. Although PREST-plus is more powerful than PLINK, there is a trade-off between statistical power and computational efficiency [5,6]. For large data sets involving analyzing millions of pairs, PLINK could be used as a screening tool for further analysis with PREST-plus. Both PREST-plus and PLINK are sensitive to mis-specified allele frequencies, therefore sensitive to population stratification and population admixture, in contrast to some recent work [eg, [18]]. However, results from other GAW18 study groups suggest that population admixture is not a major concern here. Nevertheless, robust relationship estimation and testing methods warrant further research. The methods considered here focus on global estimation of IBD distribution, which is powerful and efficient to distinguish distinct relationships, for example, full-sibs versus unrelated. However, such global methods are not adequate to distinguish similar relationship types, for example, second-cousin versus unrelated. To this end, the recent local estimation methods [eg, [19]] provide useful research direction.

Conclusions

Pedigree errors and cryptic relatedness often occur in sample despite the best practice in data collection. Genome-wide marker data, collected for linkage or association studies, can provide accurate genealogy information between individuals. Using the GWAS SNP data, PREST-plus analyses the GAW18 sample that had been previously "cleaned," and it identifies 7 clearly mis-specified relative pairs in the 20 SAFS families and 4 cryptic-related pairs in the set of putatively unrelated individuals.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Study design: LS. Analysis: AD and LS. Manuscript drafting: LS and AD. All authors read and approved the final manuscript.

15 in total

1. Statistical tests for detection of misspecified relationships by use of genome-screen data.

Authors: M S McPeek; L Sun
Journal: Am J Hum Genet Date: 2000-03 Impact factor: 11.025

2. Relationship inference from trios of individuals, in the presence of typing error.

Authors: Solveig K Sieberts; Ellen M Wijsman; Elizabeth A Thompson
Journal: Am J Hum Genet Date: 2001-11-28 Impact factor: 11.025

3. GRR: graphical representation of relationship errors.

Authors: G R Abecasis; S S Cherny; W O Cookson; L R Cardon
Journal: Bioinformatics Date: 2001-08 Impact factor: 6.937

4. Enhanced pedigree error detection.

Authors: Lei Sun; Kenneth Wilder; Mary Sara McPeek
Journal: Hum Hered Date: 2002 Impact factor: 0.444

5. Estimating kinship in admixed populations.

Authors: Timothy Thornton; Hua Tang; Thomas J Hoffmann; Heather M Ochs-Balcom; Bette J Caan; Neil Risch
Journal: Am J Hum Genet Date: 2012-06-28 Impact factor: 11.025

6. The estimation of pairwise relationships.

Authors: E A Thompson
Journal: Ann Hum Genet Date: 1975-10 Impact factor: 1.670

7. PedCheck: a program for identification of genotype incompatibilities in linkage analysis.

Authors: J R O'Connell; D E Weeks
Journal: Am J Hum Genet Date: 1998-07 Impact factor: 11.025

8. Relationship estimation in affected sib pair analysis of late-onset diseases.

Authors: H H Göring; J Ott
Journal: Eur J Hum Genet Date: 1997 Mar-Apr Impact factor: 4.246

Review 9. New approaches to population stratification in genome-wide association studies.

Authors: Alkes L Price; Noah A Zaitlen; David Reich; Nick Patterson
Journal: Nat Rev Genet Date: 2010-07 Impact factor: 53.242

10. Confounding from cryptic relatedness in case-control association studies.

Authors: Benjamin F Voight; Jonathan K Pritchard
Journal: PLoS Genet Date: 2005-09 Impact factor: 5.917

18 in total

1. Fast and Accurate Shared Segment Detection and Relatedness Estimation in Un-phased Genetic Data via TRUFFLE.

Authors: Apostolos Dimitromanolakis; Andrew D Paterson; Lei Sun
Journal: Am J Hum Genet Date: 2019-06-06 Impact factor: 11.025

2. Joint Estimation of Pedigrees and Effective Population Size Using Markov Chain Monte Carlo.

Authors: Amy Ko; Rasmus Nielsen
Journal: Genetics Date: 2019-05-22 Impact factor: 4.562

3. Value of Mendelian laws of segregation in families: data quality control, imputation, and beyond.

Authors: Elizabeth M Blue; Lei Sun; Nathan L Tintle; Ellen M Wijsman
Journal: Genet Epidemiol Date: 2014-09 Impact factor: 2.135

4. De novo RRAGC mutation activates mTORC1 signaling in syndromic fetal dilated cardiomyopathy.

Authors: Michael T Zimmermann; Maengjo Kim; Pamela A Long; Jared M Evans; Xiaolei Xu; Timothy M Olson
Journal: Hum Genet Date: 2016-05-27 Impact factor: 4.132

5. Mapping autosomal recessive intellectual disability: combined microarray and exome sequencing identifies 26 novel candidate genes in 192 consanguineous families.

Authors: R Harripaul; N Vasli; A Mikhailov; M A Rafiq; K Mittal; C Windpassinger; T I Sheikh; A Noor; H Mahmood; S Downey; M Johnson; K Vleuten; L Bell; M Ilyas; F S Khan; V Khan; M Moradi; M Ayaz; F Naeem; A Heidari; I Ahmed; S Ghadami; Z Agha; S Zeinali; R Qamar; H Mozhdehipanah; P John; A Mir; M Ansar; L French; M Ayub; J B Vincent
Journal: Mol Psychiatry Date: 2017-04-11 Impact factor: 15.992

6. Impact of polygenic risk for coronary artery disease and cardiovascular medication burden on cognitive impairment in psychotic disorders.

Authors: Lusi Zhang; Scot Kristian Hill; Bin Guo; Baolin Wu; Ney Alliey-Rodriguez; Seenae Eum; Paulo Lizano; Elena I Ivleva; James L Reilly; Richard S E Keefe; Sarah K Keedy; Carol A Tamminga; Godfrey D Pearlson; Brett A Clementz; Matcheri S Keshavan; Elliot S Gershon; John A Sweeney; Jeffrey R Bishop
Journal: Prog Neuropsychopharmacol Biol Psychiatry Date: 2021-10-29 Impact factor: 5.067

7. A Pipeline for Classifying Relationships Using Dense SNP/SNV Data and Putative Pedigree Information.

Authors: Zhen Zeng; Daniel E Weeks; Wei Chen; Nandita Mukhopadhyay; Eleanor Feingold
Journal: Genet Epidemiol Date: 2015-12-28 Impact factor: 2.135

8. Benchmarking Relatedness Inference Methods with Genome-Wide Data from Thousands of Relatives.

Authors: Monica D Ramstetter; Thomas D Dyer; Donna M Lehman; Joanne E Curran; Ravindranath Duggirala; John Blangero; Jason G Mezey; Amy L Williams
Journal: Genetics Date: 2017-07-24 Impact factor: 4.562

9. Genome-wide association studies of smooth pursuit and antisaccade eye movements in psychotic disorders: findings from the B-SNIP study.

Authors: R Lencer; L J Mills; N Alliey-Rodriguez; R Shafee; A M Lee; J L Reilly; A Sprenger; J E McDowell; S A McCarroll; M S Keshavan; G D Pearlson; C A Tamminga; B A Clementz; E S Gershon; J A Sweeney; J R Bishop
Journal: Transl Psychiatry Date: 2017-10-24 Impact factor: 6.222

10. Genome-wide association study accounting for anticholinergic burden to examine cognitive dysfunction in psychotic disorders.

Authors: Seenae Eum; S Kristian Hill; Ney Alliey-Rodriguez; James M Stevenson; Leah H Rubin; Adam M Lee; Lauren J Mills; James L Reilly; Rebekka Lencer; Sarah K Keedy; Elena Ivleva; Richard S E Keefe; Godfrey D Pearlson; Brett A Clementz; Carol A Tamminga; Matcheri S Keshavan; Elliot S Gershon; John A Sweeney; Jeffrey R Bishop
Journal: Neuropsychopharmacology Date: 2021-06-18 Impact factor: 8.294