Alex Coventry, Lara M Bull-Otterson, Xiaoming Liu, Andrew G Clark, Taylor J Maxwell, Jacy Crosby, James E Hixson, Thomas J Rea, Donna M Muzny, Lora R Lewis, David A Wheeler, Aniko Sabo, Christine Lusk, Kenneth G Weiss, Humeira Akbar, Andrew Cree, Alicia C Hawes, Irene Newsham, Robin T Varghese, Donna Villasana, Shannon Gross, Vandita Joshi, Jireh Santibanez, Margaret Morgan, Kyle Chang, Walker Hale IV, Alan R Templeton, Eric Boerwinkle, Richard Gibbs, Charles F Sing.
Abstract
Accurately determining the distribution of rare variants is an important goal of human genetics, but resequencing of a sample large enough for this purpose has been unfeasible until now. Here, we applied Sanger sequencing of genomic PCR amplicons to resequence the diabetes-associated genes KCNJ11 and HHEX in 13,715 people (10,422 European Americans and 3,293 African Americans) and validated amplicons potentially harbouring rare variants using 454 pyrosequencing. We observed far more variation (expected variant-site count ∼578) than would have been predicted on the basis of earlier surveys, which could only capture the distribution of common variants. By comparison with earlier estimates based on common variants, our model shows a clear genetic signal of accelerating population growth, suggesting that humanity harbours a myriad of rare, deleterious variants, and that disease risk and the burden of disease in contemporary populations may be heavily influenced by the distribution of rare variants.
Year: 2010 PMID: 21119644 PMCID: PMC3060603 DOI: 10.1038/ncomms1130
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Figure 1. Physical location of selected variants.
For each variant shown, the figure shows the reference residue, the location, the variant residue and, in parentheses, the variant's posterior probability. Variants identified by Polyphen (ref. 13) as potentially damaging to the protein product are shown in magenta, others are in cyan. (a) Variants that change the protein structure in KCNJ11. (b) Variants in HHEX. No sufficiently homologous crystal structure for HHEX is available for homology modelling; hence, we show the gene structure instead. Blue regions depict exons. Green regions depict neighbouring intronic/untranslated regions (30 base pairs in both directions). Black bars indicate excluded intronic sequence. Non-coding variants are shown in grey, and show the reference allele, the build 36 coordinate on chromosome 10, the variant allele and the posterior probability of the variant.
Figure 2. Number of variants as a function of sample size.
Counts of the number of observed segregating sites as a function of sample size for (a) HHEX and (b) KCNJ11. Solid blue line shows the total number of segregating sites. Red shows singletons, and yellow, brown and purple lines show the numbers of variants with relative minor allele frequency <0.01, 0.01–0.05 and more than 0.05, respectively. Roughness in these curves indicates stochasticity in the number of variants observed across multiple sample populations. Dashed lines show extrapolations of the expected number of segregating sites in larger samples according to Watterson's classical estimate. In all cases, we found far more segregating sites at larger sample sizes than Watterson's estimate would have predicted.
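The extrapolation described in this caption can be illustrated with a minimal sketch of Watterson's estimator, under which the expected number of segregating sites in n chromosomes is θ·a_n with a_n = 1 + 1/2 + ... + 1/(n−1). The function names and the input numbers below are hypothetical and chosen only for illustration; they are not values from the paper.

```python
from math import fsum


def harmonic(n_chromosomes: int) -> float:
    """Return a_n = sum_{i=1}^{n-1} 1/i for a sample of n chromosomes."""
    return fsum(1.0 / i for i in range(1, n_chromosomes))


def watterson_theta(segregating_sites: float, n_chromosomes: int) -> float:
    """Watterson's estimate of theta from the observed count of segregating sites."""
    return segregating_sites / harmonic(n_chromosomes)


def expected_segregating_sites(theta: float, n_chromosomes: int) -> float:
    """Expected number of segregating sites under a constant-size neutral model."""
    return theta * harmonic(n_chromosomes)


# Hypothetical inputs (assumptions, not data from the study):
# 20 segregating sites seen in a small sample of 200 chromosomes.
theta = watterson_theta(segregating_sites=20, n_chromosomes=200)

# Extrapolate to a sample 100 times larger.
extrapolated = expected_segregating_sites(theta, n_chromosomes=20000)
print(f"theta = {theta:.3f}, extrapolated sites = {extrapolated:.1f}")
```

Because a_n grows only logarithmically in n, a constant-size model predicts a slow increase in segregating sites with sample size, which is the baseline the dashed lines represent.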
Figure 3. Site-frequency spectra.
Site-frequency spectra in (a) HHEX and (b) KCNJ11 over 'neutral sites' (see Methods) in the two genes for the European sub-population. The x axis depicts the number of variants observed at a site; the y axis depicts the expected number of sites at which that many variants were seen. Green bars show the expected number of sites, as determined by sampling from the posterior genotypic distributions for each sampled individual, and error bars show the 99% confidence intervals from these samples. The black line shows the expected site-frequency spectrum (SFS), given the Wright–Fisher constant population size model and mutation rate Θ estimated by Watterson's method (Equation 4.16, Hartl & Clark (2007)). The blue line shows the mean posterior SFS given the population model used to calculate the mutation rate in Figure 4.
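As a rough illustration of the constant-size baseline in this figure, the standard neutral expectation is θ/i sites at which the derived allele appears i times in n chromosomes. The sketch below computes that spectrum and its folded (minor-allele) form. Whether the figure uses derived or minor allele counts is not stated here, so both are shown; the function names and the values of θ and n are assumptions for illustration.

```python
def expected_sfs(theta: float, n_chromosomes: int) -> list[float]:
    """Unfolded neutral SFS: entry i-1 is the expected number of sites
    where the derived allele is seen i times, for i = 1 .. n-1."""
    return [theta / i for i in range(1, n_chromosomes)]


def expected_folded_sfs(theta: float, n_chromosomes: int) -> list[float]:
    """Folded neutral SFS: entry i-1 is the expected number of sites
    where the minor allele is seen i times, for i = 1 .. floor(n/2)."""
    n = n_chromosomes
    folded = []
    for i in range(1, n // 2 + 1):
        value = theta * (1.0 / i + 1.0 / (n - i))
        if i == n - i:
            # The middle class would otherwise be counted twice.
            value /= 2.0
        folded.append(value)
    return folded


# Hypothetical parameters, not estimates from the paper.
theta, n = 3.0, 10
unfolded = expected_sfs(theta, n)
folded = expected_folded_sfs(theta, n)
print(unfolded[:3])  # expected counts of singletons, doubletons, tripletons
```

Summing either spectrum gives θ·a_n, the expected total number of segregating sites, so the spectrum and Watterson's estimate are two views of the same model.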
Figure 4. Mutation rate estimates.
These estimates are based on drawing an average over 100 coalescent trees per grid point. (a) Estimated marginal posterior distribution over growth rates per generation during the exponential growth phase. Red error bar in the lower left-hand corner shows the 95% confidence interval of the growth rate in the European lineage estimated in Table 1 of Gutenkunst et al. (ref. 10), which is much lower because the more common variants used in that estimate pertain to a more remote time in our history. (b) Estimated marginal posterior distributions on the time when variants of various relative minor allele frequencies arose in the population, relative to the logarithm of number of generations ago. Blue, green, red, cyan and magenta lines correspond to distributions for variants with relative minor allele frequency (RMAF) of 5×10⁻⁵, 5×10⁻⁴, 5×10⁻³, 5×10⁻² and 5×10⁻¹, respectively. An RMAF of 5×10⁻⁵ corresponds to singletons in our data set, which, according to our model, mostly arose in the last 2,500 years. Most previous analyses have dealt with SNPs with an RMAF on the order of 5×10⁻², corresponding to much earlier mutations. (c) Estimated marginal posterior distribution over mutation rates given the SFS in the two genes. Blue and green lines are for HHEX and KCNJ11, respectively.
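The link between singletons and an RMAF of about 5×10⁻⁵ can be checked with simple arithmetic. The sketch below assumes that the relevant sample is the 10,422 European Americans reported in the abstract and that both genes are autosomal, so each person contributes two chromosomes. Both points are assumptions made here for illustration rather than statements from the caption.

```python
# Assumption: the European-American subsample from the abstract is the relevant one.
n_individuals = 10_422

# Assumption: autosomal loci, so two chromosomes per person.
n_chromosomes = 2 * n_individuals

# A singleton is an allele seen exactly once among all sampled chromosomes.
singleton_frequency = 1 / n_chromosomes

print(f"{n_chromosomes} chromosomes, singleton frequency = {singleton_frequency:.2e}")
```

Under these assumptions the singleton frequency is roughly 4.8×10⁻⁵, consistent with the 5×10⁻⁵ class described in the caption.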