Ralph E McGinnis, Panos Deloukas, William M McLaren, Michael Inouye.
Abstract
We describe a novel approach for evaluating SNP genotypes of a genome-wide association scan to identify "ethnic outlier" subjects whose ethnicity is different or admixed compared to most other subjects in the genotyped sample set. Each ethnic outlier is detected by counting a genomic excess of "rare" heterozygotes and/or homozygotes whose frequencies are low (<1%) within genotypes of the sample set being evaluated. This method also enables simple and striking visualization of non-Caucasian chromosomal DNA segments interspersed within the chromosomes of ethnically admixed individuals. We show that this visualization of the mosaic structure of admixed human chromosomes gives results similar to another visualization method (SABER) but with much less computational time and burden. We also show that other methods for detecting ethnic outliers are enhanced by evaluating only genomic regions of visualized admixture rather than diluting outlier ancestry by evaluating the entire genome considered in aggregate. We have validated our method in the Wellcome Trust Case Control Consortium (WTCCC) study of 17,000 subjects as well as in HapMap subjects and simulated outliers of known ethnicity and admixture. The method's ability to precisely delineate chromosomal segments of non-Caucasian ethnicity has enabled us to demonstrate previously unreported non-Caucasian admixture in two HapMap Caucasian parents and in a number of WTCCC subjects. Its sensitive detection of ethnic outliers and simple visual discrimination of discrete chromosomal segments of different ethnicity imply that this method of rare heterozygotes and homozygotes (RHH) is likely to have diverse and important applications in humans and other species.
Year: 2010 PMID: 20211853 PMCID: PMC2883336 DOI: 10.1093/hmg/ddq102
Source DB: PubMed Journal: Hum Mol Genet ISSN: 0964-6906 Impact factor: 6.150
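The counting step described in the abstract can be illustrated with a minimal sketch. It assumes genotypes are coded as 0, 1, or 2 copies of one allele (with `None` for missing calls) and treats a genotype as "rare" when fewer than 1% of the called genotypes at that SNP in the sample set match it. The function name `rhh_counts`, the input layout, and the handling of missing calls are illustrative assumptions, not the authors' implementation, and the permutation-derived significance thresholds are not reproduced here.

```python
from collections import Counter


def rhh_counts(genotypes, rare_threshold=0.01):
    """Count rare heterozygotes and rare homozygotes per subject.

    genotypes: dict mapping subject ID -> list of genotype calls, one per SNP,
               coded 0/1/2 (copies of one allele) or None for a missing call.
               This coding is an assumption made for the sketch.
    Returns a dict mapping subject ID -> (rare_het_count, rare_hom_count).
    """
    subjects = list(genotypes)
    n_snps = len(genotypes[subjects[0]])
    counts = {s: [0, 0] for s in subjects}

    for j in range(n_snps):
        calls = [genotypes[s][j] for s in subjects]
        called = [g for g in calls if g is not None]
        if not called:
            continue
        freq = Counter(called)
        n_called = len(called)
        for s, g in zip(subjects, calls):
            if g is None:
                continue
            # A genotype is "rare" if its frequency in this sample set is below 1%.
            if freq[g] / n_called < rare_threshold:
                if g == 1:
                    counts[s][0] += 1  # rare heterozygote
                else:
                    counts[s][1] += 1  # rare homozygote (0 or 2)

    return {s: tuple(c) for s, c in counts.items()}
```

Subjects whose counts sit far out in the tail of the resulting distribution would be the candidates for ethnic outliers; how far is "far" depends on the significance threshold, which this sketch leaves open.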
Figure 1. Admixed chromosome mosaicism in subjects A1-1-1-1 and B5-4-12-2. The mosaicism is shown by the chromosomal positions of each subject's rare hets (red dashes beside chromosomes in whole genome view; red crosses above chromosome in fine-scale view). The positions of the red dashes and red crosses should be compared to all possible genomic locations of rare hets derived empirically by mapping all rare-het positions observed in the sample set (A or B) of the subject (gray crosses immediately above fine-scale chromosome; gray shading inside whole-genome chromosomes). (A) Subject A1-1-1-1 is the most extreme ethnic outlier in set A as judged by both RHH and PLINK but lacks rare hets in a number of chromosomal regions (see fine-scale view and Table 2), implying that these are regions of unadmixed Caucasian ancestry. (B) Mosaicism in subject B5-4-12-2 is more typical of outliers and is visually obvious, with rare hets densely packed into a few discrete segments that mark the chromosomal locations of non-Caucasian DNA. Tiny triangles in whole-genome and fine-scale views denote the positions of “ethnic” SNPs which are monomorphic in HapMap CEU subjects but have MAF ≥0.4 in HapMap YRI subjects (“YRI SNPs”) or in CHB subjects (“CHB SNPs”). Triangles are enlarged if the subject carries the “non-Caucasian” allele as a heterozygote or homozygote at a YRI SNP (purple triangle) or CHB SNP (green triangle), whereas homozygotes for the “Caucasian” allele are unenlarged gray triangles. The rarity of non-Caucasian alleles (purple/green triangles) outside rare-het segments and their far higher frequency inside the segments confirm the non-Caucasian origin of segments with dense rare hets and the Caucasian ethnicity of regions in which rare hets are largely absent.
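The definition of “ethnic” SNPs in the caption above (monomorphic in HapMap CEU, MAF ≥0.4 in YRI or CHB) can be expressed as a short filter. This is a sketch under assumptions: genotypes are coded 0/1/2 with `None` for missing calls, and the function names and dictionary layout are hypothetical rather than taken from the paper.

```python
def minor_allele_frequency(calls):
    """Return the minor allele frequency for genotype calls coded 0/1/2.

    Missing calls (None) are ignored; returns None if nothing was called.
    """
    called = [g for g in calls if g is not None]
    if not called:
        return None
    p = sum(called) / (2 * len(called))
    return min(p, 1 - p)


def select_ethnic_snps(ceu, other, min_maf=0.4):
    """Pick SNPs that are monomorphic in `ceu` but common in `other`.

    ceu, other: dicts mapping SNP ID -> list of genotype calls (assumed layout).
    min_maf:    minimum MAF required in the non-CEU panel (0.4 per the caption).
    """
    selected = []
    for snp, ceu_calls in ceu.items():
        ceu_maf = minor_allele_frequency(ceu_calls)
        other_maf = minor_allele_frequency(other.get(snp, []))
        if ceu_maf is None or other_maf is None:
            continue
        if ceu_maf == 0 and other_maf >= min_maf:
            selected.append(snp)
    return selected
```

Running the filter once with a YRI panel and once with a CHB panel would give the two SNP sets that the caption calls “YRI SNPs” and “CHB SNPs”.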
Figure 2. Rare-het chromosomal mosaicism in four RHH-detected outliers, two of whom were also detected by PLINK. Subjects A8-9-13-7 (A) and B7-7-18-5 (B) exhibit dense rare hets on only a few chromosomes. They are not detected as outliers by PLINK when genotypes are evaluated for the whole genome but are strongly detected when PLINK considers only genotypes from the subject's longest rare-het segment (see Table 2). Subjects A2-6-2-5 (C) and B2-5-2-3 (D) exhibit dense rare hets on most chromosomes and are strongly detected as ethnic outliers when PLINK evaluates the whole genome, but PLINK provides no evidence that the two subjects are outliers when genotypes are included only from regions that lack rare hets (Table 2). These results imply that outlier DNA is largely confined to segments marked by dense rare hets. (See Fig. 1 for definitions of figure annotations.)
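The figures identify dense rare-het segments visually. As a rough computational analogue, one could bin a subject's rare-het positions into fixed windows and report the windows holding many rare hets. The sketch below does this; the window size, the minimum count, and the function name are arbitrary illustrative choices, not values or procedures from the paper.

```python
from collections import Counter


def dense_windows(rare_het_positions, window_mb=5.0, min_count=3):
    """Report fixed-size windows that contain many rare hets.

    rare_het_positions: iterable of (chromosome, position_in_Mb) for one subject.
    window_mb, min_count: hypothetical tuning parameters for this sketch.
    Returns a sorted list of (chromosome, start_Mb, end_Mb, count).
    """
    bins = Counter(
        (chrom, int(pos // window_mb)) for chrom, pos in rare_het_positions
    )
    return sorted(
        (chrom, idx * window_mb, (idx + 1) * window_mb, n)
        for (chrom, idx), n in bins.items()
        if n >= min_count
    )
```

Adjacent dense windows on the same chromosome could then be merged into candidate segments and compared against the plotted rare-het positions.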
Recalculated ethnic-outlier Z-scores for chromosomal region(s) with or without dense rare hets in subjects exhibiting rare-het mosaicism^a
| Subject ID^b | PLINK Z-score^c: Whole Genome | PLINK Z-score^c: Partial Genome | Chromosome Region(s) Included^d | Putative Ethnicity of Included Region(s) | Viewable genomic rare-hets |
|---|---|---|---|---|---|
| A8-9-13-7 | −1.2 | | 1(40–115 Mb) | Non-Caucasian | Figure |
| A19-9-30-7 | −0.4 | | 2(145–200 Mb) | Non-Caucasian | Figure S1 |
| A14-10-25-8 | −0.9 | | 1(180–230 Mb) | Non-Caucasian | Figure S2 |
| B5-4-12-2 | −1.4 | | 2(65–140 Mb) | Non-Caucasian | Figure |
| B7-7-18-5 | −1.1 | | 11(0–75 Mb) | Non-Caucasian | Figure |
| B14-7-22-5 | −0.8 | | 6(120–165 Mb) | Non-Caucasian | Figure S3 |
| A1-1-1-1 | | −0.7 | 2(0–6 Mb); 5(150–165 Mb); 6(0–10 Mb); 7(35–55 Mb); 8(60–70 Mb); 14(70–80 Mb) | Caucasian | Figure |
| A2-6-2-5 | | −1.2 | 2(0–140 Mb); 5(55–100 Mb); 7(50–70 Mb); 8(0–100 Mb); 9(25–70 Mb); 10(20–50 Mb); 11(0–95 Mb); 12(15–95 Mb); 13(0–65 Mb); 15(70–90 Mb); 18(0–50 Mb); 19(0–50 Mb); 20(10–50 Mb); 22(0–45 Mb) | Caucasian | Figure |
| B1-3-1-1 | | −2.6 | 1(60–175 Mb); 2(10–40, 140–150, 180–190 Mb); 6(105–120 Mb); 7(105–145 Mb); 8(65–85 Mb); 10(55–115 Mb); 11(25–70 Mb); 13(30–65 Mb); 14(30–55 Mb); 17(55–80 Mb) | Caucasian | Figure S4 |
| B2-5-2-3 | | −1.0 | 1(15–30 Mb); 2(40–220 Mb); 3(0–40, 70–125, 145–185 Mb); 4(0–20, 115–165 Mb); 5(15–70, 125–135, 165–185 Mb); 7(0–140 Mb); 8(20–115 Mb); 10(75–115 Mb); 11(90–115 Mb); 12(75–135 Mb); 14(65–105 Mb); 15(65–90 Mb); 17(0–50 Mb); 18(15–80 Mb); 19(20–80 Mb); 21(30–50 Mb) | Caucasian | Figure |
^a PLINK Z-scores are based on all Affy500K SNPs in the “Whole Genome” or in a “Partial Genome” marked in subjects 1–6 by their largest rare-het segment and defined in subjects 7–10 by pooling all major regions not marked by dense rare hets. Each subject shows a dramatic change in Z-score statistical significance for “Whole” versus “Partial” Genome, thus showing that non-Caucasian and Caucasian DNA are respectively marked by presence or absence of dense rare hets.
^b Same subject ID as in Table 1.
^c Lowest PLINK Z-score from 1st through 10th nearest-neighbor distributions. Z-scores are bold and underlined if statistically significant (Z<−4.0).
^d Chromosome number and boundaries of included region(s).
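The whole-versus-partial genome comparison can be sketched in two steps: restrict each subject's genotypes to the listed chromosome regions, then score each subject by how similar it is to its k-th nearest neighbor, standardized across the sample. The code below is a simplified analogue of a nearest-neighbor Z-score, not PLINK's implementation. The identity-by-state similarity measure, the use of the population standard deviation, and all function names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def restrict_to_regions(calls, snp_positions, regions):
    """Keep only genotype calls whose SNP lies inside one of the regions.

    snp_positions: list of (chromosome, position_in_Mb), parallel to `calls`.
    regions:       list of (chromosome, start_Mb, end_Mb), e.g. [(1, 40, 115)].
    """
    return [
        g
        for g, (chrom, pos) in zip(calls, snp_positions)
        if any(chrom == rc and lo <= pos <= hi for rc, lo, hi in regions)
    ]


def ibs_similarity(g1, g2):
    """Mean proportion of alleles shared identical-by-state (genotypes 0/1/2)."""
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not pairs:
        return 0.0
    return sum(2 - abs(a - b) for a, b in pairs) / (2 * len(pairs))


def neighbour_z_scores(genotypes, k=1):
    """Z-score of each subject's similarity to its k-th nearest neighbor.

    genotypes: dict mapping subject ID -> list of genotype calls.
    Needs at least k + 1 subjects. A strongly negative Z marks a subject
    whose nearest neighbors are unusually dissimilar.
    """
    subjects = list(genotypes)
    kth = {}
    for s in subjects:
        sims = sorted(
            (ibs_similarity(genotypes[s], genotypes[t]) for t in subjects if t != s),
            reverse=True,
        )
        kth[s] = sims[k - 1]
    mu, sd = mean(kth.values()), pstdev(kth.values())
    return {s: (v - mu) / sd if sd > 0 else 0.0 for s, v in kth.items()}
```

Taking the minimum of `neighbour_z_scores` over k = 1 to 10 would mirror the "lowest Z-score from 1st through 10th nearest-neighbor distributions" summary in footnote c, and applying `restrict_to_regions` first would correspond to the "Partial Genome" column.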
Extreme “tail” of RHH count distribution containing outliers from sample set B, HapMap, and simulated matings of HapMap individuals^a
^a RHH analysis of simulated HapMap ethnic outliers combined with sample set B; subjects are sorted from highest to lowest rare-het counts under “All SNPs” (column 2) with some set B subjects omitted to show all HapMap-derived outliers. “CEU”, “CHB”, “YRI” denote HapMap subjects of Caucasian, Chinese, and African Yoruban ancestry respectively.
^b Each “unadmixed” outlier is a HapMap YRI or CHB subject; other HapMap-derived outliers are progeny of simulated matings denoted by “×”; for example, “(YRI×CEU)×CEU [2 backcrosses]” denotes offspring from mating of a HapMap YRI and CEU subject followed by mating (“backcross”) in the next two generations with a CEU subject; set B subjects have the same ID used in Table 1.
^c Unthinned counts under “All SNPs” and thinned counts under “1 Mb apart” are from 401,430 HapMap SNPs genotyped on Affy500K and having resolvable strand for HapMap versus Affy. Counts exceeding the permutation-derived threshold are bold and underlined (signifying p<0.001) or only underlined (signifying p<0.05).
^d Lowest PLINK Z-score from 1st through 10th nearest-neighbor distributions. Z-scores are bold and underlined if statistically significant (Z<−4.0).
^e Mosaicism using Affy500K chip: “Yes” if rare-het mosaicism is visually obvious; “No” if dense rare hets cover the entire genome; “Sparse hets” if rare-het density is too sparse to clearly discern mosaicism (as in simulated subjects of CHB ancestry).
^f “Yes” if subject shows obvious rare-het mosaicism when the Affy500K chip is augmented with ∼40,400 HapMap SNPs monomorphic in HapMap CEU but with minor allele frequency above 0.1 in CHB.
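The backcross notation in the footnotes above implies a simple expectation for how much non-CEU ancestry remains in a simulated subject: the first mating (for example YRI×CEU) gives one half, and each subsequent backcross to a CEU subject halves it again on average. The helper below computes that expectation; the function name is hypothetical, and the value is an average over the genome, since any individual offspring will deviate from it because of random segregation and recombination.

```python
from fractions import Fraction


def expected_outlier_ancestry(n_backcrosses):
    """Expected genome fraction of non-CEU ancestry after an initial
    (non-CEU x CEU) mating followed by `n_backcrosses` matings with CEU.
    """
    if n_backcrosses < 0:
        raise ValueError("n_backcrosses must be non-negative")
    return Fraction(1, 2) ** (n_backcrosses + 1)
```

Under this expectation, a “(YRI×CEU)×CEU [2 backcrosses]” subject carries one eighth YRI ancestry on average.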
Subjects in the extreme “tail” of the rare-het and rare-hom count distributions of two typical sample sets
| Subject ID^a | RHH Counts and p-values^b: All SNPs, Hets | RHH Counts and p-values^b: All SNPs, Homs | RHH Counts and p-values^b: 1 Mb apart, Hets | RHH Counts and p-values^b: 1 Mb apart, Homs | Genotype Confidence^c: All 500K | Genotype Confidence^c: Rare Hets | “Ethnic” Het Counts^d: YRI | “Ethnic” Het Counts^d: CHB | WTCCC PC-MDS^e | PLINK Z-score^f |
|---|---|---|---|---|---|---|---|---|---|---|
| Sample set A (UKBS controls) | ||||||||||
| A1-1-1-1 | 0.06 | 0.04 | YES | |||||||
| A2-6-2-5 | 0.04 | 0.04 | YES | |||||||
| A3-9-6-7 | 1 | 1 | 0.04 | 0.04 | 5 | YES | −3.2 | |||
| A4-3-3-3 | 0.04 | 0.04 | YES | |||||||
| A5-4-4-2 | 0.03 | 0.03 | YES | |||||||
| A6-2-5-2 | 0.03 | 0.02 | YES | |||||||
| A7-5-8-4 | 0.04 | 0.04 | YES | −2.7 | ||||||
| A8-9-13-7 | 1 | 1 | 0.03 | 0.03 | 6 | −1.2 | ||||
| A9-10-9-8 | 0 | 0 | 0.03 | 0.04 | 6 | YES | −2.0 | |||
| A10-10-7-8 | 0 | 0 | 0.06 | 6 | −1.4 | |||||
| A11-9-10-7 | 1 | 1 | 0.05 | 0.04 | YES | −3.7 | ||||
| A12-8-12-6 | 2 | 2 | 0.03 | 0.02 | YES | |||||
| A13-7-11-6 | 3 | 2 | 0.05 | 0.05 | YES | |||||
| A14-10-25-8 | 0 | 0 | 0.04 | 0.04 | 3 | −0.9 | ||||
| A15-9-15-7 | 1 | 1 | 0.03 | 0.04 | 21 | YES | ||||
| A16-10-15-8 | 0 | 0 | 0.03 | 0.03 | YES | |||||
| A17-9-18-7 | 1 | 1 | 0.04 | 0.06 | 4 | −1.9 | ||||
| A18-10-20-8 | 0 | 0 | 0.05 | 0.04 | 8 | −2.0 | ||||
| A19-9-30-7 | 1 | 1 | 0.04 | 0.05 | 0 | −0.4 | ||||
| A20-9-24-7 | 1 | 1 | 0.04 | 0.05 | 16 | 2 | −1.6 | |||
| A20-10-26-8 | 0 | 0 | 0.03 | 0.04 | 1 | −1.1 | ||||
| A21-10-22-8 | 0 | 0 | 0.04 | 0.05 | 21 | 3 | −2.2 | |||
| A22-10-21-8 | 0 | 0 | 0.04 | 0.06 | 7 | YES | ||||
| A23-9-28-7 | 1 | 1 | 0.05 | 0.07 | 21 | 7 | −0.6 | |||
| A24-10-26-8 | 0 | 0 | 0.03 | 0.02 | 6 | −1.0 | ||||
| Sample set B (58BC controls) | ||||||||||
| B1-3-1-1 | 0.04 | 0.03 | YES | |||||||
| B2-5-2-3 | 2 | 2 | 0.03 | 0.03 | 9 | YES | ||||
| B3-4-4-2 | 3 | 3 | 0.03 | 0.03 | 5 | YES | −2.4 | |||
| B4-6-5-4 | 1 | 1 | 0.06 | 0.05 | 3 | YES | −2.5 | |||
| B5-4-12-2 | 3 | 3 | 0.04 | 0.05 | 2 | −1.4 | ||||
| B6-7-3-5 | 0 | 0 | 0.07 | 7 | −2.1 | |||||
| B7-7-18-5 | 0 | 0 | 0.03 | 0.03 | 1 | −1.1 | ||||
| B8-6-16-4 | 1 | 1 | 0.04 | 0.04 | 2 | −0.9 | ||||
| B9-7-11-5 | 0 | 0 | 0.07 | 0.07 | 9 | −1.5 | ||||
| B10-5-8-3 | 2 | 2 | 0.06 | 0.06 | YES | −2.5 | ||||
| B11-2-17-2 | 3 | 0.02 | 0.04 | 2 | ||||||
| B12-7-15-5 | 0 | 0 | 0.06 | 0.07 | 3 | −0.5 | ||||
| B13-7-13-5 | 0 | 0 | 0.06 | 0.08 | 1 | −1.1 | ||||
| B14-7-22-5 | 0 | 0 | 0.03 | 0.03 | 1 | −0.8 | ||||
| B15-7-17-5 | 0 | 0 | 0.05 | 0.08 | 19 | 3 | −1.4 | |||
| B16-1-18-1 | 0.03 | 0.03 | 5 | −3.7 | ||||||
| B17-7-15-5 | 0 | 0 | 0.03 | 0.03 | 20 | 5 | −2.7 | |||
| B18-6-24-4 | 1 | 1 | 0.03 | 0.04 | 14 | 6 | −0.3 | |||
| B19-7-14-5 | 0 | 0 | 0.05 | 0.06 | 8 | −2.3 | ||||
| B20-7-7-5 | 0 | 0 | 0.07 | 14 | 7 | −0.7 | ||||
| B21-7-19-5 | 0 | 0 | 0.03 | 0.03 | 3 | −2.1 | ||||
| B22-7-25-5 | 0 | 0 | 0.03 | 0.03 | 2 | −1.2 | ||||
| B23-7-6-5 | 0 | 0 | 0.05 | 14 | 4 | −1.2 | ||||
| B24-7-20-5 | 0 | 0 | 0.03 | 0.03 | 19 | 5 | −2.3 | |||
| B24-7-26-5 | 0 | 0 | 0.04 | 0.04 | 15 | 5 | YES | −1.8 | ||
^a Subjects are sorted from highest to lowest rare-het counts for “All SNPs” (column 2); subject ID is the sample set followed by count rank in columns 2, 3, 4 and 5.
^b Counts from all Affy500K SNPs or “thinned” to derive only from SNPs at least 1 Mb apart. Counts exceeding the permutation-derived threshold are bold and underlined (signifying p<0.001) or only underlined (signifying p<0.05).
^c Mean BRLMM confidence for subject genotypes at all Affy500K SNPs and at all rare hets. Mean rare-het confidence above 0.1 is highlighted to indicate doubtful genotype accuracy and a likely false-positive ethnic outlier.
^d Counts of heterozygotes at “ethnic” SNPs at least 1 Mb apart. Statistically excess counts (p<0.001 or p<0.05) are denoted by bold and underline as in footnote b.
^e Subject identified by WTCCC as having “non-Caucasian ancestry” based on PC-MDS analysis (12).
^f Lowest PLINK Z-score from 1st through 10th nearest-neighbor distributions. Z-scores are bold and underlined if statistically significant (Z<−4.0).
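The “1 Mb apart” thinning mentioned in the footnotes can be sketched as a greedy pass along each chromosome that keeps a SNP only when it lies at least 1 Mb beyond the last SNP kept. The footnotes do not specify how the thinned SNP set was chosen, so the greedy rule, the function name, and the input layout here are assumptions for illustration only.

```python
def thin_snps(snps, min_gap_bp=1_000_000):
    """Greedily select SNPs that are at least `min_gap_bp` apart per chromosome.

    snps: iterable of (chromosome, position_in_bp); chromosome labels must be
          mutually comparable (all ints or all strings) so they can be sorted.
    Returns the kept SNPs sorted by chromosome and position.
    """
    kept = []
    last_kept = {}  # chromosome -> position of the most recently kept SNP
    for chrom, pos in sorted(snps):
        if chrom not in last_kept or pos - last_kept[chrom] >= min_gap_bp:
            kept.append((chrom, pos))
            last_kept[chrom] = pos
    return kept
```

Counting rare hets only at the kept SNPs reduces the influence of tightly linked markers, so a single admixed segment cannot inflate the count with many correlated rare hets.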