Lauren Alpert Sugden, Elizabeth G Atkinson, Annie P Fischer, Stephen Rong, Brenna M Henn, Sohini Ramachandran.
Abstract
Statistical methods for identifying adaptive mutations from population genetic data face several obstacles: assessing the significance of genomic outliers, integrating correlated measures of selection into one analytic framework, and distinguishing adaptive variants from hitchhiking neutral variants. Here, we introduce SWIF(r), a probabilistic method that detects selective sweeps by learning the distributions of multiple selection statistics under different evolutionary scenarios and calculating the posterior probability of a sweep at each genomic site. SWIF(r) is trained using simulations from a user-specified demographic model and explicitly models the joint distributions of selection statistics, thereby increasing its power to both identify regions undergoing sweeps and localize adaptive mutations. Using array and exome data from 45 ‡Khomani San hunter-gatherers of southern Africa, we identify an enrichment of adaptive signals in genes associated with metabolism and obesity. SWIF(r) provides a transparent probabilistic framework for localizing beneficial mutations that is extensible to a variety of evolutionary scenarios.
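As a rough illustration of the posterior calculation the abstract describes, the sketch below applies Bayes' rule to two classes (sweep and neutral). It is not the authors' implementation: the function name `posterior_sweep_probability` is hypothetical, and the two likelihoods are assumed to be supplied by the caller (in practice they would come from distributions learned from simulations).

```python
def posterior_sweep_probability(lik_sweep: float, lik_neutral: float, prior: float) -> float:
    """Posterior P(sweep | statistics) for a two-class model via Bayes' rule.

    lik_sweep   -- assumed likelihood of the observed statistics under a sweep
    lik_neutral -- assumed likelihood of the same statistics under neutrality
    prior       -- per-site prior probability of a sweep (pi)
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if lik_sweep < 0.0 or lik_neutral < 0.0:
        raise ValueError("likelihoods must be non-negative")
    numerator = prior * lik_sweep
    denominator = numerator + (1.0 - prior) * lik_neutral
    if denominator == 0.0:
        raise ValueError("both likelihoods are zero; posterior is undefined")
    return numerator / denominator


if __name__ == "__main__":
    # With a small prior, the sweep likelihood must exceed the neutral
    # likelihood by roughly 1/prior before the posterior reaches about 50%.
    print(posterior_sweep_probability(1e5, 1.0, 1e-5))
```

Under these assumptions, a small per-site prior means a site is only called a sweep when the evidence from the statistics is very strong.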
Year: 2018 PMID: 29459739 PMCID: PMC5818606 DOI: 10.1038/s41467-018-03100-7
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 SWIF(r) outcompetes existing site-based sweep-detection methods for a range of sweep model parameters. For comparison against window-based sweep-detection methods, see Supplementary Figs. 9 and 10. a Receiver operating characteristic (ROC) curves comparing SWIF(r) with CMS[9], FST, iHS, XP-EHH, and ΔDAF across all simulated neutral and sweep scenarios (see Methods). False-positive rate is the fraction of simulated neutral sites that are incorrectly classified as adaptive, and the true-positive rate is the fraction of simulated sites of adaptive mutations that are correctly classified. SWIF(r) constitutes an improvement in the tradeoff between true and false positives. SweepFinder[4] is not visible here or in b (Supplementary Fig. 6). b ROC curves for incomplete sweeps in which the beneficial allele has a population frequency of 20%. SWIF(r) reduces the false-positive rate by up to 50% relative to CMS. c ROC curves for SWIF(r) and the four component ODEs for incomplete sweeps in which the beneficial allele has a frequency of 20%. Since the AODE is an average of the ODEs, there will always be individual ODEs that match or outperform SWIF(r); in this case, the ODEs conditioned on FST and ΔDAF both achieve this. d The two highest-performing ODEs for different sweep parameters. Performance is defined as area under the ROC curve. The statistic that leads to the highest-performing ODE is listed first in bold, followed by the second best. Colors correspond to ROC curves in panels a–c. While ODEs conditioned on ΔDAF tend to perform extremely well for sweeps that are at lower frequency, ODEs conditioned on FST and XP-EHH tend to perform better for sweeps that are near-complete or complete in the population of interest. By averaging the ODEs, SWIF(r) is robust to uncertainty about the true parameters of the sweep. e Power-FDR curves for SWIF(r), CMS, FST, iHS, XP-EHH, ΔDAF, and SweepFinder. Power is equivalent to true-positive rate as defined in a–c, and false discovery rate is defined as the fraction of sites classified as adaptive that are actually neutral. These curves assume a training set composed of 99.95% neutral variants and 0.05% adaptive variants. See Supplementary Fig. 11 for curves based on other training set compositions.
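The relationship between the ROC quantities and the false discovery rate defined in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `false_discovery_rate` is hypothetical, and the default class composition is the 0.05% adaptive fraction quoted in the caption.

```python
def false_discovery_rate(tpr: float, fpr: float, adaptive_fraction: float = 0.0005) -> float:
    """Expected FDR at a given operating point (TPR, FPR).

    FDR is the fraction of sites called adaptive that are actually neutral,
    so it depends on how rare adaptive sites are, not only on TPR and FPR.
    """
    if not 0.0 <= adaptive_fraction <= 1.0:
        raise ValueError("adaptive_fraction must lie in [0, 1]")
    expected_true_calls = adaptive_fraction * tpr            # adaptive sites called adaptive
    expected_false_calls = (1.0 - adaptive_fraction) * fpr   # neutral sites called adaptive
    total_calls = expected_true_calls + expected_false_calls
    if total_calls == 0.0:
        return 0.0  # nothing is called adaptive, so there are no false discoveries
    return expected_false_calls / total_calls


if __name__ == "__main__":
    # Hypothetical operating point: half of adaptive sites recovered,
    # 0.05% of neutral sites misclassified.
    print(false_discovery_rate(tpr=0.5, fpr=0.0005))
```

With adaptive sites this rare, even a very small false-positive rate yields a large false discovery rate, which is why the caption reports Power-FDR curves in addition to ROC curves.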
Fig. 2 SWIF(r) localizes canonical adaptive mutations and provides additional evidence for suspected sweeps in YRI, CHB and JPT, and CEU[63]. Plotted points indicate the calibrated posterior sweep probability calculated by SWIF(r) at each site, using prior sweep probability π = 10⁻⁵. Plots were made with LocusZoom[64]. Where available, functionally verified adaptive SNPs are depicted as filled diamonds and labeled with rsids. a, b In YRI, two loci where SWIF(r) reports high sweep probabilities are DARC and DOCK3. DARC encodes the Duffy antigen, located on the surface of red blood cells, and is the receptor for malaria parasites. The derived allele of the causal SNP shown has been determined to be protective against Plasmodium vivax malaria infection[25]. DOCK3, along with neighboring genes MAPKAPK3 and CISH, are all associated with variation in height, and have previously been shown to harbor signals of selection in Pygmy populations[65]. CISH may also play a role in susceptibility to infectious diseases, including malaria[66]. c, d In CEU, we uncover multiple loci in genes involved in pigmentation, including rs1426654 in SLC24A5, which is involved in light skin color[24], and rs12913832 in the promoter region of OCA2, which is functionally linked with eye color and correlates with skin and hair pigmentation[26]. rs1426654 has the highest sweep probability reported by SWIF(r) in SLC24A5 (0.9992 after smoothed calibration; see Supplementary Data 1); note each panel depicts genomic windows containing multiple genes. e, f In CHB and JPT, SWIF(r) recovers a strong adaptive signal in the vicinity of EDAR; multiple GWA studies have shown rs3827760 to be associated with hair and tooth morphology[67,68]. SWIF(r) also identifies variants with high sweep probability in ADAM17, which is involved in pigmentation[69], and has been identified in other positive selection scans in East Asian individuals[70,71].
Fig. 3 Genome-wide SWIF(r) scan for adaptation in ‡Khomani San SNP array data. a Empirical genome-wide univariate distributions of three of the component statistics, XP-EHH, ΔDAF, and iHS are shown in gray, as is the empirical joint distribution of ΔDAF and iHS (darker bins have more observations than lighter bins). The number of sites in each genome-wide univariate distribution differs due to some component statistics being undefined more often than others. In pink are the corresponding distributions for the 108 variants that SWIF(r) identifies as having posterior sweep probabilities of >50% (variants above the dashed pink line in b). The full set of distributions can be found in Supplementary Fig. 25. b The value plotted for each position along the genome is the calibrated posterior probability of adaptation computed by SWIF(r) (per-site prior for a selective sweep is π = 10⁻⁴ to detect signals of relatively old sweeps given the high long-term Ne of the ‡Khomani San); only SNPs with a calibrated posterior sweep probability >1% are plotted and the horizontal line indicates a probability cutoff of 50%. A strong signal of adaptation over the major histocompatibility complex on chromosome 6 is shown in black. Gene names are listed for genes previously associated with metabolism-related and obesity-related traits (colors match categories in c; open circles denote genes of interest that are not in any category in c). c We used gene set enrichment analysis tool Enrichr[35] to identify categories that had an overrepresentation of genes containing SWIF(r) signals (Supplementary Data 3). We found multiple enriched dbGaP categories related to metabolism and obesity, including adiponectin (a protein hormone that influences multiple metabolic processes, including glucose regulation and fatty acid oxidation), body mass index, and triglycerides. Genes in these categories containing SWIF(r) signals are listed next to category names. p values, q values, and the total number of genes are shown for each category, and categories are ranked by a combined score computed by Enrichr[35]. Adiponectin, body mass index, and γ-glutamyltransferase all have q values below 5%.
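For readers unfamiliar with q values, the sketch below shows one common way to derive them from a list of p values, the Benjamini-Hochberg adjustment. Whether the q values in this figure were computed exactly this way is an assumption here, since the caption does not say; the function name `benjamini_hochberg` and the example p values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p values (q values), in input order."""
    m = len(pvalues)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity of q values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


if __name__ == "__main__":
    # Hypothetical p values for four categories.
    for p, q in zip([0.01, 0.04, 0.03, 0.005],
                    benjamini_hochberg([0.01, 0.04, 0.03, 0.005])):
        print(f"p = {p:.3f}  q = {q:.3f}")
```

A category would then be reported as significant at the 5% level when its q value, not its raw p value, falls below 0.05.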
Multiple published functional and association studies link genes identified by SWIF(r) to metabolism-related and obesity-related phenotypes
| Gene | SNP | DAF | Calibrated posterior sweep probability | Studies relating gene to metabolism-obesity phenotype | Refs. |
|---|---|---|---|---|---|
| | rs12121064 | 54% | 70% | Associated with waist–hip ratio in Europeans (rs1011731, GWA | |
| | | | | Associated with waist circumference in Hispanic women (GWA replication | |
| | | | | Associated with weight loss after gastric bypass surgery | |
| | rs11808388 | 62% | 99.2% | Regulates adipocyte differentiation by modulating the expression of adipogenesis-related genes | |
| | | | | Candidate obesity-susceptibility gene based on epigenetic profile and association with BMI | |
| | | | | May be involved in increasing the potential for energy expenditure in brown adipocytes | |
| | | | | Mediates hepatic gluconeogenesis | |
| | | | | Contributes toward maintenance of hepatic glucose homeostasis | |
| | | | | Necessary for metabolic maturation of pancreatic | |
| | | | | Significantly up-regulated under treatment with cholesterol drug fenofibrate | |
| | rs16866534 | 44% | 76% | Isoform composition in cardiac tissue is regulated by insulin signaling, possibly contributing to altered diastolic function in diabetic cardiomyopathy | |
| | rs3957559 | 49% | 76% | Variant rs3900940, along with four other variants, contributes to elevated risk for coronary heart disease | |
| ADIPOQ | rs6444174 | 56% | 90% | | |
| | | | | Associated with plasma adiponectin levels in Europeans (rs17366568, GWA | |
| | | | | Associated with plasma adiponectin levels in African Americans (rs4686807, GWA | |
| | | | | Associated with plasma adiponectin levels in East Asians (rs822391, GWA | |
| | | | | Associated with coronary heart disease, BMI, childhood obesity, metabolic syndrome, and type II diabetes | |
| | | | | Serum adiponectin levels are associated with metabolic health and cardiovascular risk | |
| | rs4530695 | 64% | 75% | Used in molecular biology as a marker for white adipocytes | |
| | | | | Controls pancreatic | |
| | | | | Plays a role in the link between obesity and inhibited placental development | |
| | rs16934033 | 50% | 76% | Contributes to genetic variation of plasma triglyceride concentrations | |
| | | | | Marginally associated with childhood obesity in Hispanic individuals (GWA | |
| | rs11605217 | 27% | 57% | Important regulator of insulin secretion | |
| | | | | Mice without the gene are glucose intolerant and have decreased serum insulin | |
| | | | | Associated with triglyceride levels (rs1242229, GWA | |
| | rs16929850 | 61% | 95.3% | Expression is significantly altered by fasting in mice | |
| | rs16929965 | 64% | 99.9% | | |
| | rs2729646 | 68% | 98.4% | | |
| | rs956627 | 64% | 99.9% | | |
| | rs11637235 | 33% | 76% | Missense variant causes a syndrome characterized in part by early onset diabetes mellitus | |
| | rs12975240 | 62% | 76% | Associated with adiponectin levels in multiple populations (rs731839, rs4805885, rs8182584, rs889139, rs889140, GWA | |
| | | | | Associated with type II diabetes (rs3786897, GWA | |
| | | | | Associated with fasting insulin levels (rs731839, GWA | |
| | | | | Associated with serum lipid levels | |
| | | | | Expression is modulated by n-3 fatty acids | |
| | rs1182507 | 54% | 76% | Regulation in adipose tissue is BMI-dependent | |
| | | | | Candidate obesity gene based on epigenetic profile | |
The second column contains all variants within the genes listed that have posterior sweep probability ≥50% as calculated by SWIF(r). Column 3 shows the DAF in the ‡Khomani San at the SNP in column 2, and column 4 shows the calibrated posterior sweep probability calculated by SWIF(r) at that site. For GWA studies, GWA p values are given for the strongest SNP associations. Bold rsid indicates a result about the specific SNP identified by SWIF(r) in column 2. All genes highlighted in Fig. 3b are included in this table except MCC and MLIP, for which additional associations to metabolism-related and obesity-related phenotypes could not be found beyond the dbGaP categories in Fig. 3c
BMI, body mass index; DAF, derived allele frequency; GWA, genome-wide association; SNP, single-nucleotide polymorphism; SWIF(r), SWeep Inference Framework (controlling for correlation)
Fig. 4 Missense mutation rs11316447 is a potential causal mutation in ADIPOQ. Worldwide distribution of rs11316447 generated by the Geography of Genetic Variants Browser[72] (http://popgen.uchicago.edu/ggv/) shows that the T allele carried by 27% of ‡Khomani San individuals (pie chart outlined in black) is extremely rare throughout phase 3 of the 1000 Genomes[63], at a maximum of 0.5% in the Luhya population of Kenya. The diagram of ADIPOQ highlights the positions of the variant identified by SWIF(r) (rs6444174) and the nearby missense variant (rs11316447). These two variants are within 1 kb of each other, suggesting that the SWIF(r) signal at rs6444174 is tagging this missense variant.