Literature DB >> 29846678

Adaptive Landscape of Protein Variation in Human Exomes.

Ravi Patel^1,2, Laura B Scheinfeldt^1,2,3, Maxwell D Sanderford¹, Tamera R Lanham¹, Koichiro Tamura⁴, Alexander Platt^1,2,5, Benjamin S Glicksberg⁶, Ke Xu⁶, Joel T Dudley⁶, Sudhir Kumar^1,2,7.

Abstract

The human genome contains hundreds of thousands of missense mutations. However, only a handful of these variants are known to be adaptive, which implies that adaptation through protein sequence change is an extremely rare phenomenon in human evolution. Alternatively, existing methods may lack the power to pinpoint adaptive variation. We have developed and applied an Evolutionary Probability Approach (EPA) to discover candidate adaptive polymorphisms (CAPs) through the discordance between allelic evolutionary probabilities and their observed frequencies in human populations. EPA reveals thousands of missense CAPs, which suggest that a large number of previously optimal alleles experienced a reversal of fortune in the human lineage. We explored nonadaptive mechanisms to explain CAPs, including the effects of demography, mutation rate variability, and negative and positive selective pressures in modern humans. Many nonadaptive hypotheses were tested, but failed to explain the data, which suggests that a large proportion of CAP alleles have increased in frequency due to beneficial selection. This suggestion is supported by the fact that a vast majority of adaptive missense variants discovered previously in humans are CAPs, and hundreds of CAP alleles are protective in genotype-phenotype association data. Our integrated phylogenomic and population genetic EPA approach predicts the existence of thousands of nonneutral candidate variants in the human proteome. We expect this collection to be enriched in beneficial variation. The EPA approach can be applied to discover candidate adaptive variation in any protein, population, or species for which allele frequency data and reliable multispecies alignments are available.

Entities: Chemical Disease Gene Mutation Species

Mesh：

Year: 2018 PMID： 29846678 PMCID： PMC6063297 DOI： 10.1093/molbev/msy107

Source DB: PubMed Journal: Mol Biol Evol ISSN： 0737-4038 Impact factor: 16.240

Introduction

Over half a million missense variants have been identified in human populations, of which a substantial number occurs at significant frequency (>1%; 33,369 missense variants) (1000 Genomes Project Consortium 2015). Although previous studies have shown the potential for ample adaptive coding variation in the human genome (Boyko et al. 2008; Enard et al. 2014), they have pinpointed only a few missense polymorphisms to be adaptive (Hernandez et al. 2011; Grossman et al. 2013) (table 1). It is possible that virtually all of the common human missense polymorphisms are either selectively neutral or deleterious (i.e., subject to purifying selection), but an alternative explanation is that existing methods lack sufficient power to locate adaptive coding variation. Furthermore, population genomic approaches to date are typically designed to identify recent selective pressures acting on candidate genes or genetic regions that vary within modern human populations, a segment of time that is only a minor fraction of the depth of the human lineage. We, therefore, have the opportunity to discover thousands of novel adaptive changes by using complementary approaches.

Table 1.

Known Adaptive Missense Polymorphisms and Their Candidate Adaptive Polymorphism (CAP) Status with Empirical Probability (Pneu).

Protein	SNP Identifier	CAP?	P value
ALMS1	rs10193972	Yes	<0.02
	rs2056486	Yes	<0.02
	rs3813227	Yes	<0.02
	rs6546837	Yes	<0.02
	rs6546838	Yes	<0.02
	rs6546839	Yes	<0.02
	rs6724782	Yes	<0.02
APOL1	rs73885319	No	n/a
DARC	rs12075	Yes	<0.02
EDAR	rs3827760	Yes	<0.03
G6PD	rs1050828	Marginal	n/a
G6PD	rs1050829	Yes	<0.03
HBB	rs334	Marginal	n/a
MC1R	rs1805007	No	n/a
	rs1805008	No	n/a
	rs885479	Yes	<0.03
SLC24A5	rs1426654	Yes	<0.02
SLC45A2	rs16891982	Yes	<0.02
TLR4	rs4986790	Yes	<0.04
TLR4	rs4986791	Marginal	n/a
TLR5	rs5744174	No	n/a
TRPV6	rs4987657	Yes	<0.01
	rs4987667	Yes	<0.01
	rs4987682	Yes	<0.01

Note.—A candidate adaptive polymorphism (CAP) is an amino acid polymorphism with the evolutionary probability (EP)< 0.05 and population allele frequency (AF)> 5%. n/a marks alleles for which at least one of these two conditions was not met. Supplementary table 1, Supplementary Material online, presents more details on each of these polymorphisms and the source references. Marginal status is given to alleles with EP< 0.05 and global AF> 2%.

Known Adaptive Missense Polymorphisms and Their Candidate Adaptive Polymorphism (CAP) Status with Empirical Probability (Pneu). Note.—A candidate adaptive polymorphism (CAP) is an amino acid polymorphism with the evolutionary probability (EP)< 0.05 and population allele frequency (AF)> 5%. n/a marks alleles for which at least one of these two conditions was not met. Supplementary table 1, Supplementary Material online, presents more details on each of these polymorphisms and the source references. Marginal status is given to alleles with EP< 0.05 and global AF> 2%. In this article, we integrate phylogenomics and population genomics to discover candidate adaptive polymorphisms (CAPs) in the human exome. This integrative approach advances beyond the current phylogenomic methods that compare patterns across species, but are blind to variation segregating within a given species (Goldman and Yang 1994; Muse and Gaut 1994; Yang and Bielawski 2000; Hurst 2002; Nielsen et al. 2005; Pollard et al. 2006; Anisimova and Yang 2007; Shapiro and Alm 2008; Lindblad-Toh et al. 2011; Peter et al. 2012). It is also distinct from the current population genomic methods that utilize patterns of population variation to identify candidate adaptive genes or genetic regions, but do not distinguish specific amino acid variants (Akey et al. 2002; Li and Stephan 2006; Teshima et al. 2006; Voight et al. 2006; Sabeti et al. 2007; Akey 2009; Grossman et al. 2013; Moon and Akey 2016). Together, evolutionary information from both short- and long-term time scales is harnessed in our approach. We applied this new approach to over 500,000 polymorphic missense alleles (1000 Genomes Project Consortium 2015) reported in human proteins, which revealed over 18,000 variants that exhibit nonneutral evolutionary patterns. We explored a wide variety of nonadaptive phenomena to explain the existence of these variants and investigated available genotype–phenotype association studies to determine if the nonneutral variants revealed by our new approach have had significant impact on human phenotypic variation. The result is a large catalog of polymorphisms that will be interesting to consider in future evolutionary and functional analysis of human genomes.

New Approaches

Our approach exploits the neutral theory framework, where variation arising from long-term molecular evolution among species informs a null model of observed within-species patterns of selectively neutral variation (i.e., no fitness effect) (Kimura 1983). This relationship is useful to identify adaptive proteins that deviate from neutral expectations and have undergone adaptive evolution (Hudson et al. 1987; McDonald and Kreitman 1991). In our novel allelic approach, we first capture long-term evolutionary history with estimates of the neutral evolutionary probability (EP) of observing each of the possible 20 segregating amino acid residues (alleles) at a given amino acid position. EP is computed using a Bayesian framework and a multispecies alignment; it is an average of posterior probabilities weighted by the divergence time of each of the species relative to humans in the species timetree used (Liu et al. 2016). The sum of all allelic EPs is 1.0 for each amino acid position. Importantly, EP for an amino acid allele at a given protein position is not affected by the presence of a consensus base at that position in the human reference genome or by the corresponding alleles that segregate in humans, because this information is excluded from the multispecies alignment when EP is calculated (Liu et al. 2016). EP of an allele at a given position is, therefore, completely independent of intraspecific variation. Under neutral theory, alleles with low EP (<0.05) are not expected to persist within populations and are, therefore, predicted to impact function and fitness (Liu et al. 2016). Indeed, <1% of simulated neutral EPs fall <0.05 in computer simulations, where we used the 46 species time tree in figure 1, branch lengths from UCSC (Murphy et al. 2001; Kent et al. 2002; Siepel and Haussler 2005; Liu et al. 2016), and pyvolve (Spielman and Wilke 2015) to simulate amino acid sequences (see Materials and Methods).

. 1.

Evolutionary Probability Approach. The evolutionary probabilities (EPs) and their application to discover candidate adaptive polymorphisms (CAPs). (a) Timetree of 46 vertebrates (Hedges et al. 2015), which was used along with alignments of orthologous amino acid sequences for all human proteins (Kent et al. 2002) to compute the probability of observing each amino acid residue at a given position. Under neutral theory, we expect a strong relationship between EP and allele frequency (AF) such that evolutionarily unexpected alleles (EP < 0.05) will be rare. (b) Relationship between EP and AF. Average EP (y axis) was calculated for 0.05 sized AF bins (x axis) for all polymorphic missense alleles in the 1000 Genomes Project Phase 3 whole genome sequencing data, which confirms the general relationship between EP and AF to be consistent with neutral expectations. The standard deviation is visualized with gray lines (averages are in blue), which is expected to be large because contemporary AFs are a product of time of origin, natural selection, and genetic drift experienced by a mutation. (c) Distribution of empirical P values (–log10) generated from the empirical framework (AF | EP < 0.05). The cutoff used to identify CAPs is shown with a dashed red line and is more extreme than a false positive rate of 0.05. Therefore, EP can serve as a null expectation that predicts the neutral probability of observed within-species variation. Contrasting the former against the latter produces a direct neutrality comparison, for example, nonneutral alleles with low EP (<0.05) are expected to correspond to missense mutations that are found at low allele frequencies (AFs) due to purifying selection (Liu et al. 2016). Consistent with this expectation, 91% of disease-associated missense variants in HumVar (Adzhubei et al. 2010) have low EP (<0.05) and low AF (<1%). More generally, EP shows agreement with observed global AFs calculated from the 1000 Genomes data (fig. 1R2 = 0.83, P < 10−15). We used the above considerations to build an Evolutionary Probability Approach (EPA) to identify nonneutral (EP < 0.05) alleles that occur with unexpectedly high population AF. When applied to protein sequence variation, such alleles are predicted to impact protein function, and their prevalence may be due to adaptive pressures. Therefore, we refer to them as CAPs. An observed allele is designated a CAP, if it has an EP < 0.05 and AF > 5%. These thresholds were chosen because the empirical probability of observing a CAP for neutral alleles, Pneu, falls below 0.05 for 1000 Genomes Project data (fig. 1), which represents a significant departure from selective neutrality and forms the basis of EPA. EPA is analogous to empirical outlier approaches frequently utilized in population genomics, including those that identify CAPs with metrics such as FST or Tajima’s D (Lewontin and Krakauer 1973; Tajima 1989). A critical difference is that we use information from both phylogenomics (EP) and population genetics (AF) to identify CAPs, which makes EPA a two-dimensional approach and complementary to available methods.

Results and Discussion

We applied EPA to 515,700 polymorphic missense alleles (1000 Genomes Project Consortium 2015) reported in human proteins. We retrieved EPs for each allele from http://www.mypeg.info; last accessed November 10, 2015 (Kumar et al. 2012; Liu and Kumar 2013). The EPs were calculated by Liu et al. (2016) using a 46 species alignment of orthologous amino acid sequences (Kent et al. 2002; Liu et al. 2016). The timetree of these species covers a very large evolutionary timespan (∼5.8 billion years; fig. 1), such that each amino acid position has had ample time to experience mutation and purifying selection. EPA revealed 18,724 CAPs (EP < 0.05) whose allele frequencies showed significant departure from neutrality (Pneu < 0.05). These CAPs were found in 7,815 proteins (see www.mypeg.info/caps for a list of residues) distributed across all autosomal chromosomes (fig. 2). Many proteins harbor multiple CAPs (fig. 3) and have a large number of CAPs per amino acid (fig. 3), but protein length was not strongly correlated with the number of CAPs (correlation coefficient = 0.32). More than 20 CAPs were found in MUC4 and multiple HLA genes (fig. 2), which were among the top 0.2% of the CAP rich proteins (fig. 3). Both of these gene families play a role in immune response (Parham 2005; Pelaseyed et al. 2014) and are implicated in human adaptations (Andres et al. 2009; Vahdati and Wagner 2016). Moreover, as expected, the functional categories of “antigen processing and presentation” and “sensory perception” were among the most enriched in terms of the number of CAPs per amino acid coding position (fig. 4).

. 2.

. 3.

Properties of candidate adaptive alleles. Distribution of all (red bars) and phenotype-associated (pink bars) (a) CAP counts across proteins, and (b) number of CAPs found per amino acid position in each protein coding gene.

. 4.

Functional distribution of CAPs. The top 75 GO-slim biological process categories with the most CAPs per amino acid position (red bars). The number of proteins found in each biological process annotation is in parentheses. The number of CAPs found in each biological process is listed next to the corresponding bar. Additional information for all PANTHER GO-slim biological process categories can be found as a Supplementary Material.

Chromosomal distribution of CAPs. (a) The distribution of candidate adaptive alleles (CAPs) across autosomal chromosomes (red points). Chromosomal banding patterns are also visualized for reference. (b) A plot of −log10(Pneu) generated from the Evolutionary Probability Approach (y axis) against chromosome position (x axis) for the MHC region of chromosome 6. CAPs are shaded red and non-CAPs are shaded gray. The CAP Pneu cutoff is shown with a dashed red line. Notable HLA genes with >20 CAPs are indicated. Properties of candidate adaptive alleles. Distribution of all (red bars) and phenotype-associated (pink bars) (a) CAP counts across proteins, and (b) number of CAPs found per amino acid position in each protein coding gene. Functional distribution of CAPs. The top 75 GO-slim biological process categories with the most CAPs per amino acid position (red bars). The number of proteins found in each biological process annotation is in parentheses. The number of CAPs found in each biological process is listed next to the corresponding bar. Additional information for all PANTHER GO-slim biological process categories can be found as a Supplementary Material. Furthermore, a vast majority (> 70%) of known adaptive amino acid polymorphisms were found to be CAPs (table 1 and supplementary table 1, Supplementary Material online), which is a significant enrichment (permutation P < 107). EPA also discovers a majority of the protein polymorphisms predicted to be adaptive in previous population genomic analyses (supplementary table 2, Supplementary Material online), which suggests that the CAP catalog contains many truly adaptive alleles. Still, the size of the CAP catalog is over 200 times larger than the number of previously identified adaptive polymorphisms. One potential explanation for the high frequency of low EP alleles is that they are mildly deleterious or selectively neutral alleles whose frequencies are primarily driven by genetic drift (Kryukov et al. 2007; Zhu et al. 2011). Another possibility is that their high-allele frequencies reflect some combination of drift, compensatory variation, and epistasis. In addition, several nonadaptive phenomena could artificially inflate neutral or deleterious missense allele frequencies. We, therefore, examined the extent to which genomic features and demographic processes could have given rise to CAPs.

Mutation Rate Differences and Biased Gene Conversion

Given that mutation rates are known to affect allele frequencies (Harpak et al. 2016), we investigated the potential for mutation rate variation to result in false positive CAPs. We first examined if mutation rates were elevated in codons containing CAPs by comparing the rate of occurrence of synonymous variants in codons that contained CAPs with codons that did not contain CAPs. These two rates were very similar, as 5.7% of the CAP-containing codons also harbored a synonymous polymorphism and 5.4% of non-CAP codons harbored a synonymous polymorphism. This result suggests that mutation rate differences do not explain the observed distribution of CAP allele frequencies. In addition, the hypermutability of CpG sites did not explain the persistence of low EP alleles at high frequency due to recurrent mutations. We found a smaller proportion of CpG overlapping CAPs relative to non-CAPs (26% and 33%, respectively). Furthermore, we considered whether biased gene conversion could result in false positive CAPs (Ratnakumar et al. 2010). However, fewer than 1% of CAPs were within regions of known biased gene conversion (Capra et al. 2013; Rosenbloom et al. 2015), and the frequencies of weak to strong (W→S) and strong to weak (S→W) changes (Lachance and Tishkoff 2014) for non-CAP alleles (with EP < 0.05 and AF < 5%) were not significantly different from CAP alleles (P = 0.90). In addition, the relative frequencies of different base changes were similar between CAP and non-CAP codons (R2 > 0.99, regression slope > 0.95), which showed that CAP-containing codons show nucleotide substitution patterns similar to non-CAP codons at each position in the codon.

Relaxation of Purifying Selection

We also examined the possibility that CAP-containing human proteins have experienced relaxation of function in the human lineage. Although we think this is unlikely, because it would require a vast fraction of human proteins (> 7,000 out of 22,000) to be under reduced selection, we investigated missense mutations that cause Mendelian diseases and compared the frequency of these mutations in CAP-containing proteins and non-CAP proteins (see Materials and Methods). We did not find a significant difference in the preponderance of disease mutations in CAP and non-CAP proteins. Therefore, it is unlikely that CAP-containing proteins have become less functionally important relative to other human proteins.

Adaptive Hitchhiking

Deleterious alleles located in genomic regions, which have undergone selective sweeps, can hitchhike to higher than expected frequencies merely due to proximity to and linkage disequilibrium with nearby adaptive alleles (Chun and Fay 2011). Only a small number of CAPs (6.7%) are located in selective sweep regions (Schrider and Kern 2017). This observation is supported by previous studies (Chun and Fay 2011) that investigated the impact of hitchhiking on deleterious allele frequencies and found only a few hundred deleterious hitchhiking nonsynonymous SNPs with common allele frequencies (≥ 5.9%) in the 1000 Genomes Project data. Therefore, hitchhiking of deleterious alleles with selective sweeps does not appear to explain an overwhelming majority of CAPs.

Human Demography

Human demographic history may explain the prevalence of CAPs, because the migration of modern humans out of Africa and subsequent population expansions could have resulted in higher than expected frequencies of deleterious and mildly deleterious alleles. However, it is not likely that these alleles overwhelm the set of CAPs identified, since even a purely neutral model of human evolution does not explain the fraction of alleles found at high-allele frequencies: the SFS of empirical CAPs shows a dramatic skew toward high-frequency alleles relative to neutral expectation (fig. 5). We then tested if the CAPs SFS can be generated by human demographic history in combination with various models of selection. We employed a model based on differential equations to approximate the evolution of allele frequencies (Jouganous et al. 2017) and simulated a wide range of negative and positive selection coefficients for a demographic model of recent human history (Gravel et al. 2011) with a range of gamma parameter values (see Materials and Methods). A model containing negative and positive selections provided the best fit for the CAPs SFS (lnL = −3,080; P ≪ 10−10; fig. 5). In this model, 47% of the observed alleles were predicted to be weakly deleterious (s = −8 × 10−4) and the remaining 53% were beneficial (s = +1 × 10−3).

. 5.

Selection model fits to observed CAPs. Site frequency spectra (SFS) for SNPs with AF > 5%. Site frequency spectra (SFS) were scaled to have the same number of sites for AF > 5%. Black bars represent all EP < 0.05 alleles observed in 1000G Phase 3 individuals. (a) Observed and fitted SFS for all candidate adaptive polymorphisms (CAPs). A neutral model (blue) does not explain the preponderance of alleles found at very high AF, and does not fit the observed data well (lnL= −4,124). (b) Observed and fitted SFS for all CAPs. A model with weakly deleterious (purple) and beneficial (green) showed the best fit (lnL = −3,080). It was significantly better than any other combination of models (LRT P ≪ 10−10). All CAP alleles shared with great apes (5%) were excluded from observed SFS. However, even the best-fit simulated selection model failed to explain the preponderance of polymorphisms with very high frequency (>95%). The number of empirical CAPs in this category was over three times greater than expected (fig. 5). In order to determine whether the highest frequency class (AF > 95%) was driving the signal of positive selection during our model fitting, we conducted SFS analysis by removing all CAPs in the 95–99.99% frequency class (observed as well as simulated). For this comparison, the signal of positive selection persisted, as the best fit model was the one with s = +2 × 10−2, which was significantly better than a purely neutral model (P ≪ 10−10). This result further indicated that positive selection would need to be invoked to explain the observed distribution of CAPs. The above results led us to consider whether CAPs represent ancestral standing variation, for example, found in the ancestors of modern humans. We examined the proportion of CAPs that were shared with archaic hominins (Neanderthals and Denisovans) (Green et al. 2010; Meyer et al. 2012; Prufer et al. 2014) and found that a large percentage of CAPs (43%) are shared with modern humans. This proportion is significantly higher than what is expected by chance (permutation P < 10−7). Although some of the shared CAPs could have resulted from archaic gene flow, the majority of these CAPs were likely present in the last common ancestor of modern humans and archaic hominids, because most (93.6%) shared CAPs occur at very high frequencies (AF > 95%) in modern humans. One such possibility is a CAP (rs4987682) in TRPV6, which is present in the Altai Neanderthal genome (Prufer et al. 2014). TRPV6 is involved in calcium absorption (Hughes et al. 2008) and located in a region of the genome that has been identified in several previous genome-wide scans for selection (Akey et al. 2006; Hughes et al. 2008). This region is hypothesized to have been subjected to multiple selective events (Hughes et al. 2008). The EPA approach is able to detect these more ancient adaptive variants by integrating inter- and intraspecies information. This is in contrast to common methods that are primarily powered to detect recent adaptive events, as evident in the list of known human adaptive variants, which consists of relatively recent polymorphisms.

Validating CAPs

Generally, traditional functional evaluation of CAPs that arose in the human lineage is challenging, because in vitro and in vivo approaches are low-throughput, require a priori functional information for experimental design, and do not provide the impact of individual alleles on higher level human phenotypes. Furthermore, it is not possible to test human fitness in a controlled/laboratory setting, and it is often not relevant to test the functional impact of CAPs in nonhuman model systems. It is, however, possible to take an organismal approach to investigate allelic impact on natural, population-level human variation using phenotype-association studies. For example, many well-known adaptive missense variants (table 1) are also significantly associated with phenotypes in genome-wide studies: rs334 with malaria and severe malaria (Timmann et al. 2012; Band et al. 2013), rs4987667 with intermediate gene expression phenotypes involving HLA (Fehrmann et al. 2011), and rs1426654 with skin pigmentation (Stokowski et al. 2007). Therefore, we searched the Human Gene Mutation Database (HGMD) (Stenson et al. 2009) for high EP alleles associated with reduced fitness, that is, the low EP CAP alleles associated with fitness benefits. That is, the evolutionarily preferred allele prior to the divergence of humans and chimpanzees (high EP, EP > 0.5) has experienced a reversal of fortune and become detrimental. We found 253 high EP alleles to be associated with disease phenotypes in contemporary humans, where the low EP CAP allele occurs with AF > 5%. We also scanned the NHGRI-EBI catalog (MacArthur et al. 2017) of curated GWAS studies to identify additional associations and found 158 CAPs. Of these, 101 showed odds ratio (OR) <1 for at least one discrete trait related to reduction in the incidence of the associated abnormal phenotype. That is, 60% of the CAPs are protective against increased disease risk (supplementary table 3, Supplementary Material online). One such example is a CAP found in the LOXL1 protein that confers a 20-fold decrease in risk for developing exfoliation glaucoma, a leading cause of irreversible blindness (Thorleifsson et al. 2007). Another example is a CAP found in APOE genotypes ɛ2 and ɛ3. The APOE genotype ɛ4, which does not contain the CAP allele, is known to confer 5-fold higher risk of cerebral amyloid deposition as compared with the CAP-containing genotypes ɛ2 and ɛ3 (Li et al. 2015). In these cases, the CAP allele is protective, and the non-CAP allele is associated with a detrimental phenotype. These findings not only suggest functional implications of CAPs but also that some CAPs may be associated with health benefits. Beyond the limited number of variants in the NHGRI-EBI GWAS catalog, we investigated phenotypic associations in GWAS database that contains a large catalog of genotype–phenotype association studies. We mined data available from GRASP2 (Leslie et al. 2014) to determine whether CAPs have had significant impact on human phenotypes more broadly. We found that 11% of CAPs were significantly associated with tested phenotypes (2,073 alleles at a significance threshold of P < 10−8), which we refer to as pheno-CAPs. This prevalence of pheno-CAPs is significantly higher than what is expected by chance (permutation P < 10−7). Moreover, <1% of frequency matched non-CAP alleles are significantly phenotype-associated in GRASP2 (P < 10−8). We tested the possibility that low-EP deleterious recessive alleles have persisted at significant population frequencies. If this had been the case, we would expect an excess of heterozygote CAPs relative to neutral expectations. However, very few CAPs (2.5%) displayed a significant excess of heterozygosity (χ2P value < 0.05). Moreover, after excluding pheno-CAPs that are not shared across all 1000 Genomes continental samples (1000 Genomes Project Consortium 2015), that are located in previously identified selective sweeps (Schrider and Kern 2017) and that are located in previously identified regions containing CpG sites and biased gene conversion regions (Rosenbloom et al. 2015), over 1000 proteins contain one or more pheno-CAPs. We expect pheno-CAPs to be enriched for causal alleles. There are many reasons for this expectation. First, amino acid polymorphisms alter the sequence of functional genome entities (proteins). Second, if pheno-CAPs are causal alleles then we would expect them to show the strongest association P values among all tested missense variants. This is indeed the case for 92% of CAP proteins, where a pheno-CAP has the strongest association of all missense variants in that protein for a given phenotype in the GRASP2 database (Leslie et al. 2014). Third, a vast majority of putative adaptive variants in humans are CAPs (table 1) and are derived variants in modern humans; they are not shared with archaic hominins. In conclusion, we have found over 18,000 missense human polymorphisms that are candidates of beneficial selection. This new adaptive allele catalog is made possible by the EP approach, which is sensitive to a timeframe that predates the out of Africa migration of modern humans, but is not limited to fixed differences between species (Goldman and Yang 1994; Muse and Gaut 1994; Yang and Bielawski 2000; Hurst 2002; Nielsen et al. 2005; Pollard et al. 2006; Anisimova and Yang 2007; Holt et al. 2008; Shapiro and Alm 2008; Lindblad-Toh et al. 2011; Peter et al. 2012). The former timeframe has been addressed by methods that are sensitive to recent classic sweeps and regionally restricted adaptation, which have been the focus of the majority of human adaptation studies to date (Akey et al. 2002; Li and Stephan 2006; Teshima et al. 2006; Voight et al. 2006; Sabeti et al. 2007; Akey 2009; Grossman et al. 2013; Moon and Akey 2016). These studies have yielded only a few adaptive coding variants, leading some to argue that regulatory variation is the predominant raw material for adaptive change (Akey 2009; Fraser 2013; Grossman et al. 2013). Our results suggest that the temporal sensitivity of the EP approach is able to generate a catalog of CAPs that is enriched in functional as well as beneficial variation. We expect many CAPs to be involved in compensatory evolution and synergistic epistasis to counter genetic load exerted by deleterious variants that have risen to high frequencies due to human demography and genetic drift. Therefore, CAPs provide ready hypotheses to test in future computational and experimental investigations.

Materials and Methods

1000 Genomes Allele Frequencies

Global allele frequencies (AFs) for all missense single nucleotide polymorphisms (SNPs) (n = 515,700) in the 1000 Genomes Project phase 3 data (1000 Genomes Project Consortium 2015) were calculated for all unrelated individuals (n = 2,405). More specifically, one of each related pair of individuals identified in the Phase 3 release (ftp://ftp.1000genomes.ebi.ac.uk/Vol03508/ftp/release/20130502/20140625_related_individuals.txt) was removed before calculating global allele frequencies. For each polymorphic nucleotide position, EP estimates for the codons corresponding to the reference (hg19) and nonreference nucleotides were used. For each allele, we tested for an overrepresentation of potentially deleterious recessive CAP heterozygotes and evaluated the proportion of CAPs that were in Hardy–Weinberg (HW) disequilibrium (HW χ2P value < 0.05). We note that variants found in genes with duplicated homologs in the genome (e.g., multigene families) are expected to be mapped to the correct genomic location as the Phase 3 1000 Genomes Project data set only included unambiguously mapped reads during variant calling (1000 Genomes Project Consortium 2010).

Evolutionary Probabilities

Evolutionary probabilities (EPs) were calculated for each amino acid residue using the method of Liu et al. (2016) and a 46 species alignment of orthologous amino acid sequences (Kent et al. 2002; Liu et al. 2016). They are available from http://www.mypeg.info (Kumar et al. 2012; Liu and Kumar 2013). The timetree (Hedges et al. 2006) of these species covers a very large evolutionary timespan (∼5.8 billion years; Hedges et al. 2015; fig. 1), such that each amino acid position has had ample time to experience mutation and purifying selection. We designed a simulation to verify that the EP was over 0.05 for neutral alleles, by using the 46 species time tree in figure 1 and branch lengths from UCSC (Murphy et al. 2001; Kent et al. 2002; Siepel and Haussler 2005; Liu et al. 2016). Using pyvolve v0.8.7 (Spielman and Wilke 2015), we generated 1000 replicate data sets of proteins with 500 amino acid positions and calculated EP for alleles at each site.

Evolutionary Probability Approach Framework

We began with the premise that for a given amino acid position, the probability the position has been neutral (EP) over long-term evolutionary history (inferred from interspecies comparisons as described in Liu et al. 2016) combined with the orthogonal shorter term intraspecific purifying and directional selective pressures (captured by population allele frequency, AF) produces a categorical framework for genome-wide variation. This framework distinguishes neutral, potentially deleterious, and potentially adaptive variation. The sum of all allelic EPs is 1 at each amino acid position, and residues with low EP (< 0.05) are unexpected under neutral theory (Liu et al. 2016). We developed an empirical framework to identify CAPs: Prob(AF | EP < 0.05), and for each allele, calculated a one-sided cumulative empirical P value using a cumulative distribution function (CDF) implemented with a custom R script (R Core Team 2014).

Misinference of Ancestral State

In genomic scans for selection, misidentification of ancestral states may cause false signatures of selection (Baudry and Depaulis 2003). EPA fortunately does not suffer from this problem, because it requires EP < 0.05. An allele with such a low EP will likely arise in the human lineage after their divergence from chimpanzees. Additionally, EP calculation utilizes a probabilistic model that integrates over all the outgroup species in an alignment, which makes it better than methods that utilize one or a few outgroups to properly identify the derived allele (Hernandez et al. 2007; Keightley et al. 2016). Consistent with this property, we did not find any CAP alleles in all three of the Great Ape species (chimpanzee, gorilla, and orangutan) in our multispecies protein alignments. A comparison with chimpanzee proteins revealed 3.5% CAP allele sharing, and gorilla and orangutan showed 0.7% and 1.1% CAP allele sharing, respectively, with humans. We excluded all of these alleles from all the population genetic analyses, because these CAP residues may have arisen prior to the origin of human lineage.

Identifying Allele Sharing with Archaic Genomes

To determine allele sharing among modern humans and archaic hominins, we collected genome sequencing data for five archaic hominins (four Neanderthal individuals, and one Denisovan individual). One Neanderthal sequence and one Denisovan sequence were acquired from the Max Planck Institute for Evolutionary Anthropology site (http://cdna.eva.mpg.de/neandertal/altai/Denisovan; last accessed November 10, 2015). The three remaining Neanderthal alignments were retrieved from the UCSC Neanderthal Sequence Track (https://genome.ucsc.edu/cgi-bin/hgTrackUi? db=hg19&g=ntSeqReads; last accessed November 10, 2015). We only used sequences that provided > 45% genomic coverage. We defined an allele as shared if it was present in any of these five archaic individuals. A shared allele can be polymorphic or fixed in this aggregated archaic sample.

Scanning Genotype–Phenotype Association Catalogs

We scanned 75,810 phenotype associated missense mutations in the Human Gene Mutation Database (HGMD) (Stenson et al. 2009) for those that occur at CAP sites. We found 973 such mutations, which we checked for high-EP risk alleles (causing the abnormal phenotype). A high-EP risk allele at a CAP site was considered a “reversal,” since this previously favored allele (based on EP) leads to an unfavorable phenotype. We also scanned the NHGRI-EBI GWAS catalog (MacArthur et al. 2017) (January 16, 2018 update) for similar reversals. Filtering the SNPs, we find 158 missense mutations at CAP sites. The NHGRI-EBI GWAS Catalog always reports the risk-allele (the allele that increases phenotypic measurement, e.g., increases disease risk). In order to determine the odds ratio (OR) for the CAP allele, which is often not the reported risk allele, we calculated the inverse (1/reported OR) when the risk allele was in fact the reversal (high EP allele). An OR < 1 indicates that the allele confers a decrease in abnormal phenotype risk, whereas an OR > 1 indicates that the allele increases risk for the associated abnormal or case phenotype. Multiple associations were occasionally found for CAPs in the GWAS catalog. We simply reported the study that had the lowest risk factor (OR) for abnormal phenotypes per CAP allele found.

Gene Ontology Analysis

We downloaded the PANTHER sequence classifications (Mi et al. 2017) containing GO biological processes for all genes in the database for humans (ftp://ftp.pantherdb.org/sequence_classifications/13.1/PANTHER_Sequence_Classification_files/PTHR13.1_human; last accessed April 10, 2018). For each biological process category, we counted the number of total protein sequence length (amino acid positions) and the number of CAPs in all the proteins in the category.

Demographic Simulations

We performed 10,000 forward simulations of human history for 58,000 generations before current time; the simulation scheme includes the out-of-Africa migration of humans (OoA), as well as a subsequent split between simulated European and East Asian populations. The population model includes three representative continental groups (African, European, East Asian). SLiM2 (Haller and Messer 2017) was used for the simulations, with parameters obtained by Gravel et al. (2011). Using a modified SliM2 script to output MS (Hudson) format chromosomes, we sampled individual sequences (50,000 base pairs in length) from the simulated populations at each of the following time points: 1) the generation immediately before the OoA split (ancestral population), 2) the generation immediately before the European and East Asian split, 3) the contemporary African population, 4) the contemporary European population, and 5) the contemporary East Asian population. Using allele frequencies (AF) from these samples, we followed variants at different AF (0.1%, 1%, and 10%) in the ancestral population and traced their trajectories into the modern day human populations (contemporary populations). For each of these variants, we determined the fraction that achieved > 5% AF (required for CAP status), and were shared among one, two, and three of the contemporary population samples.

Simulating Selection and Fitting Distributions of Fitness Effects

We simulated site frequency spectra (SFS) using Moments (Jouganous et al. 2017) to infer distributions of fitness effects (DFE) that explain CAPs for which the human alleles were not shared with any of the three great ape species (chimpanzee, gorilla, and orangutan). Using dadi (Gutenkunst et al. 2009), we calculated multinomial log-likelihoods (lnLs) of the observed data (CAPs) for simulated deleterious, neutral, and beneficial selection models (as above). We also calculated lnL of DFE fit for all possible combinations: deleterious and neutral; neutral and positive; deleterious and beneficial; and, deleterious and, neutral, and beneficial. In this case, we used a single point mass fixed for each type of selection and explored various 2Nes values. The model with the highest lnL provides the best fit for the observed data. We excluded all CAPs shared with great apes in these analyses. The best fit model and lnL values for all the CAPs are shown in figure 5. We used likelihood fits and Akaike information criterion (AIC) to select the best model.

Examination of the Relaxation of Purifying Selection

We examined the possibility that CAP-containing human proteins have experienced relaxation of function in the human lineage. We investigated missense mutations that cause Mendelian diseases and compared the frequency of these mutations in CAP-containing proteins and non-CAP proteins. This analysis used the HumVar (Adzhubei et al. 2010) data set and obtained the number of disease mutations normalized by the total sequence length and evolutionary rate of CAP and non-CAP proteins. This normalization is required because longer proteins are known to contain more disease mutations as do slower evolving proteins (Miller and Kumar 2001). The ratio of two normalized counts was 0.98, which is close to the expected value of 1.0 corresponding to no difference in the preponderance of disease mutations in CAP and non-CAP proteins.

Permutation Testing

In order to determine whether the observed proportion of CAPs that have been previously identified as adaptive in humans is higher than would be expected by chance, we randomly sampled 18,724 variants from the set of all human missense variants (regardless of EP), and calculated Nsim, which captures how often the simulated proportion of phenotype-associated variants was as high or higher than the empirical result. In total, we ran 106 permutations, and calculated a permutation P value with the following equation: (Nsim + 1)/1000001. Similarly, we tested whether the observed proportion of CAPs that are shared with archaic genomes is higher than would be expected by chance. We randomly sampled 18,724 variants from the set of all human missense variants, and calculated Nsim, which captures how often the simulated proportion of archaic-shared variants was as high or higher than the empirical result (6,916 for P < 0.05 and 2,075 for P < 10−8). In total, we ran 106 permutations, and calculated a permutation P value with the following equation: (Nsim + 1)/1000001. In order to determine whether the observed proportion of CAPs that are also associated with phenotypes in the GRASP2 database (Leslie et al. 2014) is higher than would be expected by chance, we randomly sampled 18,724 variants from the set of all human missense variants with an AF > 1% (regardless of EP), and calculated Nsim, which captures how often the simulated proportion of phenotype-associated variants was as high or higher than the empirical result (6,916 for P < 0.05 and 2,075 for P < 10−8). In total, we ran 106 permutations, and calculated a permutation P value with the following equation: (Nsim + 1)/1000001.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online. Click here for additional data file.

74 in total

1. TRPV6 exhibits unusual patterns of polymorphism and divergence in worldwide populations.

Authors: Joshua M Akey; Willie J Swanson; Jennifer Madeoy; Michael Eberle; Mark D Shriver
Journal: Hum Mol Genet Date: 2006-05-22 Impact factor: 6.150

2. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.

Authors: F Tajima
Journal: Genetics Date: 1989-11 Impact factor: 4.562

3. Evolutionary diagnosis method for variants in personal exomes.

Authors: Sudhir Kumar; Maxwell Sanderford; Vanessa E Gray; Jieping Ye; Li Liu
Journal: Nat Methods Date: 2012-09 Impact factor: 28.547

4. High-throughput sequencing provides insights into genome variation and evolution in Salmonella Typhi.

Authors: Kathryn E Holt; Julian Parkhill; Camila J Mazzoni; Philippe Roumagnac; François-Xavier Weill; Ian Goodhead; Richard Rance; Stephen Baker; Duncan J Maskell; John Wain; Christiane Dolecek; Mark Achtman; Gordon Dougan
Journal: Nat Genet Date: 2008-07-27 Impact factor: 38.330

5. Evidence for hitchhiking of deleterious mutations within the human genome.

Authors: Sung Chun; Justin C Fay
Journal: PLoS Genet Date: 2011-08-25 Impact factor: 5.917

6. A high-resolution map of human evolutionary constraint using 29 mammals.

Authors: Kerstin Lindblad-Toh; Manuel Garber; Or Zuk; Michael F Lin; Brian J Parker; Stefan Washietl; Pouya Kheradpour; Jason Ernst; Gregory Jordan; Evan Mauceli; Lucas D Ward; Craig B Lowe; Alisha K Holloway; Michele Clamp; Sante Gnerre; Jessica Alföldi; Kathryn Beal; Jean Chang; Hiram Clawson; James Cuff; Federica Di Palma; Stephen Fitzgerald; Paul Flicek; Mitchell Guttman; Melissa J Hubisz; David B Jaffe; Irwin Jungreis; W James Kent; Dennis Kostka; Marcia Lara; Andre L Martins; Tim Massingham; Ida Moltke; Brian J Raney; Matthew D Rasmussen; Jim Robinson; Alexander Stark; Albert J Vilella; Jiayu Wen; Xiaohui Xie; Michael C Zody; Jen Baldwin; Toby Bloom; Chee Whye Chin; Dave Heiman; Robert Nicol; Chad Nusbaum; Sarah Young; Jane Wilkinson; Kim C Worley; Christie L Kovar; Donna M Muzny; Richard A Gibbs; Andrew Cree; Huyen H Dihn; Gerald Fowler; Shalili Jhangiani; Vandita Joshi; Sandra Lee; Lora R Lewis; Lynne V Nazareth; Geoffrey Okwuonu; Jireh Santibanez; Wesley C Warren; Elaine R Mardis; George M Weinstock; Richard K Wilson; Kim Delehaunty; David Dooling; Catrina Fronik; Lucinda Fulton; Bob Fulton; Tina Graves; Patrick Minx; Erica Sodergren; Ewan Birney; Elliott H Margulies; Javier Herrero; Eric D Green; David Haussler; Adam Siepel; Nick Goldman; Katherine S Pollard; Jakob S Pedersen; Eric S Lander; Manolis Kellis
Journal: Nature Date: 2011-10-12 Impact factor: 49.962

7. The complete genome sequence of a Neanderthal from the Altai Mountains.

Authors: Kay Prüfer; Fernando Racimo; Nick Patterson; Flora Jay; Sriram Sankararaman; Susanna Sawyer; Anja Heinze; Gabriel Renaud; Peter H Sudmant; Cesare de Filippo; Heng Li; Swapan Mallick; Michael Dannemann; Qiaomei Fu; Martin Kircher; Martin Kuhlwilm; Michael Lachmann; Matthias Meyer; Matthias Ongyerth; Michael Siebauer; Christoph Theunert; Arti Tandon; Priya Moorjani; Joseph Pickrell; James C Mullikin; Samuel H Vohr; Richard E Green; Ines Hellmann; Philip L F Johnson; Hélène Blanche; Howard Cann; Jacob O Kitzman; Jay Shendure; Evan E Eichler; Ed S Lein; Trygve E Bakken; Liubov V Golovanova; Vladimir B Doronichev; Michael V Shunkov; Anatoli P Derevianko; Bence Viola; Montgomery Slatkin; David Reich; Janet Kelso; Svante Pääbo
Journal: Nature Date: 2013-12-18 Impact factor: 49.962