Fuli Yu, Alon Keinan, Hua Chen, Russell J Ferland, Robert S Hill, Andre A Mignault, Christopher A Walsh, David Reich.
Abstract
Historical episodes of natural selection can skew the frequencies of genetic variants, leaving a signature that can persist for many tens or even hundreds of thousands of years. However, formal tests for selection based on allele frequency skew require strong assumptions about demographic history and mutation, which are rarely well understood. Here, we develop an empirical approach to test for signals of selection that compares patterns of genetic variation at a candidate locus with matched random regions of the genome collected in the same way. We apply this approach to four genes that have been implicated in syndromes of impaired neurological development, comparing the pattern of variation in our re-sequencing data with a large-scale, genomic data set that provides an empirical null distribution. We confirm a previously reported signal at FOXP2, and find a novel signal of selection centered at AHI1, a gene that is involved in motor and behavior abnormalities. The locus is marked by many high frequency derived alleles in non-Africans that are of low frequency in Africans, suggesting that selection at this or a closely neighboring gene occurred in the ancestral population of non-Africans. Our study also provides a prototype for how empirical scans for ancient selection can be carried out once many genomes are sequenced.
Year: 2009 PMID: 19783549 PMCID: PMC2778377 DOI: 10.1093/hmg/ddp457
Source DB: PubMed Journal: Hum Mol Genet ISSN: 0964-6906 Impact factor: 6.150
Re-sequenced segments in this study
| Gene | Re-sequenced region | Span in base pairs | Physical coordinates (HG16) | Reason for ascertainment |
|---|---|---|---|---|
| AHI1 | Exon 5–12 | 23 127 | Chr6: 135,748,716-135,771,842 | Non-synonymous mutations, causal for Joubert Syndrome; elevated d |
| AHI1 | Exon 15–17 | 4875 | Chr6: 135,731,090-135,735,964 | Functionally important coding region |
| ASPM | Exon 2–4 | 6334 | Chr1: 194,396,156-194,402,489 | Two frameshift deletions in exon 3 causing microcephaly |
| ASPM | Exon 18 | 6755 | Chr1: 194,356,820-194,363,574 | One nonsense mutation causing microcephaly |
| ASPM | Exon 20–25 | 6251 | Chr1: 194,346,319-194,352,569 | One nonsense deletion in exon 21 causing microcephaly |
| FOXP2 | Exon 4–8 | 26 250 | Chr7: 113,818,098-113,844,347 | Two human lineage specific mutations |
| FOXP2 | Exon 14–16 | 4372 | Chr7: 113,855,123-113,859,494 | Non-synonymous mutation putatively causal for a severe speech and language disorder |
| GPR56 | Exon 3–15 | 15 780 | Chr16: 57,458,562-57,474,341 | Numerous non-synonymous/missense mutations |
Figure 1. DAF distribution by bin in SNPs from re-sequenced regions compared with SNPs from the ∼2.5 Mb of ENCODE data (black lines) in (A) CEU and (B) YRI. The DAF distribution is shown for each gene, merging the different interrogated segments. The derived allele is inferred by comparison with the chimpanzee allele.
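As a rough illustration of the quantity plotted in Figure 1, the sketch below computes a derived allele frequency (DAF) by treating the chimpanzee allele as ancestral, then counts SNPs per frequency bin. The function names, the ten-bin default and the toy genotypes are assumptions made for illustration; they are not taken from the study.

```python
from collections import Counter


def derived_allele_frequency(sample_alleles, chimp_allele):
    """Fraction of sampled chromosomes carrying a non-chimpanzee (derived) allele."""
    if not sample_alleles:
        raise ValueError("no chromosomes sampled")
    derived = sum(1 for allele in sample_alleles if allele != chimp_allele)
    return derived / len(sample_alleles)


def daf_histogram(dafs, n_bins=10):
    """Count SNPs per equal-width DAF bin; a DAF of 1.0 falls in the last bin."""
    counts = Counter(min(int(d * n_bins), n_bins - 1) for d in dafs)
    return [counts.get(i, 0) for i in range(n_bins)]


# Toy data (hypothetical): alleles on eight sampled chromosomes, then the chimp allele.
snps = [
    ("AAAAAAAG", "G"),  # 7 of 8 chromosomes carry the derived allele
    ("CCCTTTTT", "C"),  # 5 of 8 derived
    ("GGGGGGGA", "G"),  # 1 of 8 derived
]
dafs = [derived_allele_frequency(list(alleles), chimp) for alleles, chimp in snps]
hist = daf_histogram(dafs)
```

Comparing such a histogram for a candidate gene with the same histogram built from control regions is the kind of contrast the figure displays.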
Figure 2. DAF and FST distributions of SNPs ascertained in the regions we re-sequenced (yellow) and HapMap SNPs (grey) within the physical genomic positions spanned by the genes, in (A) AHI1, (B) ASPM, (C) FOXP2 and (D) GPR56. The exon/intron map for the gene is shown at the top; the DAF plotted against physical position for CEU and YRI in the middle; and FST compared with percentiles from all of HapMap (90th, 99th and 99.9th) at the bottom. The scale for each of the genes is different (and hence the density of SNPs is different) because of their different physical distance spans.
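The FST comparison in Figure 2 can be sketched for a single biallelic SNP as follows. This example assumes a simple Wright-style estimator with two equally weighted populations; the estimator actually used in the study is not stated in this section, and both function names are hypothetical.

```python
def fst_two_pops(p1, p2):
    """Wright-style F_ST for one biallelic SNP from two population allele frequencies."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity of the pooled sample
    if h_total == 0.0:
        return 0.0  # the SNP is monomorphic in both populations
    # Mean expected heterozygosity within each population
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def empirical_percentile(value, background):
    """Percentage of background values that are less than or equal to value."""
    return 100.0 * sum(1 for b in background if b <= value) / len(background)


# DAFs of rs1052502 in CEU (95%) and YRI (6%), as listed in the AHI1 table of this section.
fst_ahi1 = fst_two_pops(0.95, 0.06)
```

A genome-wide list of per-SNP FST values passed as `background` would then place a candidate SNP relative to the 90th, 99th and 99.9th percentiles.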
Statistical tests in CEU for unusual allele frequencies in re-sequenced segments compared with ENCODE control data
| Gene | Re-sequenced region | Genetic distance span (cM) | #Segregating sites | Matches in empirical comparison | Tajima's D; empirical | Fu and Li's F; empirical | Fay and Wu's H; empirical |
|---|---|---|---|---|---|---|---|
| AHI1 | Exon 5–12 | 0.0017 | 27 | 28 | −1.6; 0.071*, 0.0062; (−2.74) | −1.5; 0.071*; (−4.17) | −9.6; 0.36; (−1.81) |
| AHI1 | Exon 15–17 | 0.00040 | 9 | 105 | −1.3; 0.019*, 0.020; (−2.33) | −1.1; 0.057; (−3.44) | −2.7; 0.29; (−1.03) |
| ASPM | Exon 2–4 | 0.00024 | 8 | 97 | −1.3; 0.021*, 0.056; (−1.91) | −0.08; 0.16; (−1.72) | −2.3; 0.33; (−0.82) |
| ASPM | Exon 18 | 0.0028 | 9 | 174 | 0.7; 0.97, 0.93; (−0.085) | 0.2; 0.33; (−1.16) | −3.4; 0.22; (−1.35) |
| ASPM | Exon 20–25 | 0.000097 | 9 | 56 | −0.2; 0.32, 0.16; (−1.41) | 1.0; 0.64; (−0.37) | −2.9; 0.36; (−0.96) |
| FOXP2 | Exon 4–8 | 0.0093 | 14 | 93 | −1.9; 0.022*, 0.0025; (−3.03) | −0.4; 0.11; (−2.40) | −6.0; 0.15; (−1.70) |
| FOXP2 | Exon 14–16 | 0.0043 | 6 | 186 | 3.4; 0.011*, 0.0052; (2.79) | 2.2; 0.011*; (1.90) | 0.1; 0.82; (0.55) |
| GPR56 | Exon 3–15 | 0.093 | 25 | 22 | 1.9; 0.36, 0.18; (1.33) | 1.8; 0.64; (0.53) | 0.9; 0.36; (0.81) |
a In each cell, the first value reports the value of the statistic. The second value reports the P-value based on rank-ordering compared with ENCODE data (for Tajima's D, this is followed by a P-value for a two-tailed z-test based on the MBB procedure). The third value, in parentheses, gives the number of standard deviations (σ) from the mean.
*Indicates observed value that is more extreme than all empirical comparisons. Since P-values are 2-sided, the most extreme rank-ordering P-value that is possible is 2/n, where n is the number of windows in the control data.
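The rank-ordering P-value described in these footnotes can be sketched as below. The sketch assumes the two-sided P-value is 2(k + 1)/n, where k is the number of control windows more extreme than the observed value in the nearer tail and n is the number of control windows; this reproduces the stated minimum of 2/n, but the exact tie handling used in the study is not given here, and the function name is hypothetical.

```python
def rank_p_value(observed, controls):
    """Two-sided empirical P-value from the rank of `observed` among control windows."""
    n = len(controls)
    if n == 0:
        raise ValueError("at least one control window is required")
    below = sum(1 for c in controls if c < observed)  # controls more extreme on the low side
    above = sum(1 for c in controls if c > observed)  # controls more extreme on the high side
    k = min(below, above)
    # Doubling makes the test two-sided; the result is capped at 1.
    return min(1.0, 2.0 * (k + 1) / n)


# Hypothetical control values: 28 matched windows, all above the observed statistic.
controls = [0.1 * i for i in range(28)]
p_min = rank_p_value(-1.6, controls)
```

With 28 matched windows the smallest attainable value is 2/28 ≈ 0.071, which is the starred P-value reported for AHI1 exon 5–12 in the CEU table.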
Statistical tests in YRI for unusual allele frequencies in re-sequenced segments compared with ENCODE control data
| Gene | Re-sequenced region | Genetic distance span (cM) | #Segregating sites | Matches in empirical comparison | Tajima's D; empirical | Fu and Li's F; empirical | Fay and Wu's H; empirical |
|---|---|---|---|---|---|---|---|
| AHI1 | Exon 5–12 | 0.0017 | 45 | 7 | −0.3; 0.86, 0.73; (−0.35) | 1.2; 0.86; (0.38) | −1.9; 0.86; (−0.11) |
| AHI1 | Exon 15–17 | 0.00040 | 14 | 56 | 0.4; 0.96, 0.98; (0.020) | 1.0; 0.75; (−0.17) | 0.3; 0.75; (0.43) |
| ASPM | Exon 2–4 | 0.00024 | 12 | 58 | 0.6; 0.97, 0.94; (0.080) | 1.4; 0.79; (0.51) | −1.6; 0.38; (−0.34) |
| ASPM | Exon 18 | 0.0028 | 9 | 190 | −0.7; 0.32, 0.19; (−1.32) | 0.8; 0.89; (−0.45) | −4.8; 0.074; (−2.68) |
| ASPM | Exon 20–25 | 0.000097 | 10 | 53 | 0.8; 0.79, 0.74; (0.33) | 1.5; 0.68; (0.59) | 0.1; 0.98; (0.39) |
| FOXP2 | Exon 4–8 | 0.0093 | 40 | 56 | −1.1; 0.036*, 0.056; (−1.91) | 0.5; 0.18; (−1.49) | −10.2; 0.18; (−1.87) |
| FOXP2 | Exon 14–16 | 0.0043 | 9 | 165 | −0.8; 0.16, 0.15; (−1.45) | −0.4; 0.061; (−2.10) | −3.3; 0.12; (−1.79) |
| GPR56 | Exon 3–15 | 0.093 | 41 | 21 | 0.1; 0.86, 0.44; (−0.78) | 0.4; 0.095*; (−2.28) | −2.3; 0.86; (−0.36) |
a In each cell, the first value reports the value of the statistic. The second value reports the P-value based on rank-ordering compared with ENCODE data (for Tajima's D, this is followed by a P-value for a two-tailed z-test based on the MBB procedure). The third value, in parentheses, gives the number of standard deviations (σ) from the mean.
*Indicates observed value that is more extreme than all empirical comparisons. Since P-values are 2-sided, the most extreme rank-ordering P-value that is possible is 2/n, where n is the number of windows in the control data.
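The parenthesized σ values can be read as a standardized distance from the control mean. The sketch below assumes the mean and the sample standard deviation are taken over the matched control windows; it does not implement the MBB procedure behind the z-test P-value, and the function name and control values are hypothetical.

```python
import statistics


def sigma_from_controls(observed, controls):
    """Number of standard deviations separating `observed` from the control mean."""
    mean = statistics.mean(controls)
    sd = statistics.stdev(controls)  # sample standard deviation (n - 1 denominator)
    return (observed - mean) / sd


# Hypothetical control statistics from five matched windows.
z = sigma_from_controls(-1.0, [0.0, 1.0, 2.0, 3.0, 4.0])
```

A strongly negative value indicates a statistic well below what the matched control windows show.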
Simulations of empirical tests of selection
| Simulated selection scenario | Empirical approach: Genomic Control power (%) | Empirical approach: Rank method power (%) |
|---|---|---|
| | 94 | 92.7 |
| | 91.9 | 88.5 |
| | 60.9 | 54.3 |
| | 35.8 | 29.1 |
| | 13 | 10 |
| | 3.4 | 2.7 |
| | 2 | 1 |
| | 1.4 | 0.8 |
| False positive rate for simulated neutral regions under a constant-size population model | 2.3 | 0.2 |
| False positive rate for simulated neutral regions under a population growth model | 0.5 | 1 |
| False positive rate for simulated neutral regions under a population bottleneck growth model | 2.8 | 0.8 |
| False positive rate for simulated neutral regions under a population growth model compared with control regions under a constant-size model (this highlights the false positives that arise under traditional tests of selection) | 41.0 | 35.3 |
Derived allele frequencies for SNPs in the vicinity of AHI1 that are highly differentiated between African and non-African populations
| SNP | Build34 | Region | HapMap CEU (%) | HapMap CHB+JPT (%) | HapMap YRI (%) | HGDP European (%) | HGDP West Asian (%) | HGDP Central and South Asian (%) | HGDP East Asian (%) | HGDP East Oceanian (%) | HGDP Native American (%) | HGDP African (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| rs7453135 | 134,439,442 | proximal | 85 | 99 | 10 | 79 | 72 | 77 | 96 | 65 | 98 | 19 |
| rs9688660 | 134,445,499 | proximal | 85 | 99 | 10 | |||||||
| rs9321439 | 134,708,764 | proximal | 84 | 77 | 6 | |||||||
| rs6922545 | 134,709,544 | proximal | 84 | 77 | 8 | |||||||
| rs7775514 | 134,712,805 | proximal | 84 | 77 | 6 | |||||||
| rs9493942 | 134,713,104 | proximal | 85 | 77 | 7 | 77 | 60 | 83 | 85 | 100 | 98 | 18 |
| rs726948 | 134,714,168 | proximal | 85 | 77 | 13 | |||||||
| rs2327484 | 134,714,298 | proximal | 85 | 77 | 13 | 78 | 60 | 83 | 86 | 96 | 98 | 26 |
| rs1052502 | 135,587,135 | | 95 | 86 | 6 | 93 | 85 | 90 | 89 | 30 | 96 | 25 |
| rs7741046 | 135,595,272 | | 95 | 91 | 8 | |||||||
| rs2327612 | 135,597,189 | | 97 | 97 | 13 | |||||||
| rs2142956 | 135,597,202 | | 97 | 97 | 13 | |||||||
| rs7766656 | 135,598,171 | | 92 | 91 | 8 | 92 | 87 | 90 | 92 | 30 | 97 | 30 |
| rs6933077 | 135,598,904 | | 97 | 97 | 15 | |||||||
| rs9483826 | 135,600,086 | | 97 | 97 | 8 | |||||||
| rs7765602 | 135,601,578 | | 97 | 97 | 14 | |||||||
| rs7765971 | 135,601,742 | | 97 | 97 | 14 | |||||||
| rs7756167 | 135,603,575 | | 95 | 91 | 8 | |||||||
| rs9389294 | 135,787,775 | | 94 | 92 | 17 | |||||||
| rs9402709 | 135,793,821 | | 94 | 92 | 17 | |||||||
| rs4896149 | 135,797,073 | | 95 | 92 | 17 | |||||||
| rs958072 | 135,808,136 | distal | 94 | 92 | 17 | |||||||
| rs9494266 | 135,832,143 | distal | 94 | 92 | 15 | 92 | 86 | 94 | 91 | 43 | 96 | 28 |
| rs7752627 | 135,856,515 | distal | 94 | 92 | 15 | |||||||
| rs9483910 | 136,461,542 | distal | 99 | 86 | 7 | |||||||
| rs9321552 | 136,462,182 | distal | 99 | 86 | 7 | |||||||
| rs3823159 | 136,463,297 | distal | 99 | 86 | 7 | 100 | 94 | 95 | 84 | 46 | 57 | 25 |
| rs6570067 | 136,477,401 | distal | 99 | 86 | 7 | |||||||
| rs1480642 | 136,480,098 | distal | 99 | 86 | 8 | |||||||
| rs3734548 | 136,488,969 | distal | 98 | 78 | 7 | |||||||
| rs3799396 | 136,492,042 | distal | 98 | 78 | 7 | |||||||
| rs7753890 | 136,496,827 | distal | 99 | 79 | 7 | 97 | 91 | 89 | 84 | 37 | 98 | 24 |
| rs11154872 | 136,778,327 | distal | 87 | 64 | 15 | |||||||
| rs3778308 | 136,786,352 | distal | 87 | 64 | 15 | |||||||
| rs9399183 | 136,798,058 | distal | 87 | 64 | 15 | 82 | 81 | 79 | 61 | 32 | 68 | 24 |
Note: This table reports all HapMap Phase II SNPs in the AHI1 region where we observe DAF < 17% in YRI and DAF > 83% in CEU. Most of these SNPs have a similarly elevated DAF in CHB+JPT, suggesting that the selective sweep at this locus occurred in the common ancestral population of North Europeans and East Asians after the split from West Africans. Where available, we also report data for the Human Genome Diversity Panel (HGDP) (51) for a wider range of populations, pooling samples into seven geographical regions following Ref. (61). The only non-African populations that do not consistently exhibit high derived allele frequencies across these regions are East Oceanians.
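The selection criterion in this note (DAF < 17% in YRI and DAF > 83% in CEU) amounts to a simple filter. The sketch below applies it to three rows copied from the table plus one invented row that fails the filter; the record layout and the function name are assumptions made for illustration.

```python
def highly_differentiated(snps, max_yri=17.0, min_ceu=83.0):
    """Return IDs of SNPs with YRI DAF below `max_yri` and CEU DAF above `min_ceu` (percent)."""
    return [s["snp"] for s in snps if s["yri"] < max_yri and s["ceu"] > min_ceu]


snps = [
    {"snp": "rs7453135", "ceu": 85, "yri": 10},    # values from the table
    {"snp": "rs9321439", "ceu": 84, "yri": 6},     # values from the table
    {"snp": "rs1052502", "ceu": 95, "yri": 6},     # values from the table
    {"snp": "example_snp", "ceu": 60, "yri": 30},  # hypothetical row that fails both cutoffs
]
selected = highly_differentiated(snps)
```

Because the table reports percentages rounded to whole numbers, rows listed at a YRI DAF of 17% sit on the boundary and would not pass this filter when it is run on the rounded values.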
Figure 3. Empirical significance of the LRH test for all SNPs in the four genes for (A) CEU and (B) YRI. We merged the SNPs discovered in the re-sequenced genic segments with the HapMap SNPs within each gene, and used each SNP across the span as a core to carry out an LRH test, separately reporting the scores on both sides of the core (2,52). LRH values were split into 20 bins (0–5, 5–10, 10–15, … , 95–100%) by their respective allele frequency, and compared to the empirical LRH distribution obtained from the HapMap SNPs using the same LRH methods. A few SNPs in ASPM and FOXP2 in YRI exceed the 99.9th percentile when compared with HapMap. However, after accounting for multiple hypothesis testing, none of the genes stands out as showing an unusual LRH test compared with the genome-wide distribution (Supplementary Material, Table S1).
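The frequency-matched comparison described in this caption can be sketched as follows: scores are grouped into 20 allele-frequency bins of 5% each, and a candidate score is ranked only against background scores from the same bin. The function names and the toy background scores are hypothetical, and the computation of the LRH score itself is not shown.

```python
from collections import defaultdict

N_BINS = 20  # 0-5%, 5-10%, ..., 95-100%


def frequency_bin(freq):
    """Index of the 5%-wide bin for an allele frequency in [0, 1]; 1.0 goes in the last bin."""
    return min(int(freq * N_BINS), N_BINS - 1)


def build_background(background):
    """Group (allele_frequency, score) pairs from the genome-wide data by frequency bin."""
    bins = defaultdict(list)
    for freq, score in background:
        bins[frequency_bin(freq)].append(score)
    return bins


def matched_percentile(freq, score, bins):
    """Percentage of same-bin background scores less than or equal to `score`."""
    peers = bins.get(frequency_bin(freq), [])
    if not peers:
        return None  # no background SNPs at a comparable frequency
    return 100.0 * sum(1 for s in peers if s <= score) / len(peers)


# Toy background (hypothetical scores): four SNPs in the 10-15% bin, one in the 60-65% bin.
background = [(0.12, 0.5), (0.13, 1.0), (0.14, 2.0), (0.11, 3.0), (0.62, 9.0)]
bins = build_background(background)
pct = matched_percentile(0.12, 2.5, bins)
```

Matching on allele frequency keeps a core SNP from being ranked against background SNPs whose frequencies, and hence expected haplotype lengths, differ.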