Michelle P Harwood1,2, Isabel Alves3, Hilary Edgington4, Mawusse Agbessi1, Vanessa Bruat1, David Soave1,5, Fabien C Lamaze1,6, Marie-Julie Favé1, Philip Awadalla1,2,7. 1. Ontario Institute for Cancer Research, Toronto, ON, Canada. 2. Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada. 3. Université de Nantes, CHU Nantes, CNRS, INSERM, L'Institut du thorax, F-44000 Nantes, France. 4. Department of Biology, College of Wooster, Wooster, OH, USA. 5. Department of Mathematics, Wilfrid Laurier University, Waterloo, ON, Canada. 6. Institut universitaire de cardiologie et de pneumologie de Québec, Université Laval, Québec, QC, Canada. 7. Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada.
Abstract
How the genetic composition of a population changes through stochastic processes, such as genetic drift, in combination with deterministic processes, such as selection, is critical to understanding how phenotypes vary in space and time. Here, we show how evolutionary forces affecting selection, including recombination and effective population size, drive genomic patterns of allele-specific expression (ASE). Integrating tissue-specific genotypic and transcriptomic data from 1500 individuals from two different cohorts, we demonstrate that ASE is less often observed in regions of low recombination, and loci in high or normal recombination regions are more efficient at using ASE to underexpress harmful mutations. By tracking genetic ancestry, we discriminate between ASE variability due to past demographic effects, including subsequent bottlenecks, versus local environment. We observe that ASE is not randomly distributed along the genome and that population parameters influencing the efficacy of natural selection alter ASE levels genome wide.
How the genetic composition of a population changes through stochastic processes, such as genetic drift, in combination with deterministic processes, such as selection, is critical to understanding how phenotypes vary in space and time. Here, we show how evolutionary forces affecting selection, including recombination and effective population size, drive genomic patterns of allele-specific expression (ASE). Integrating tissue-specific genotypic and transcriptomic data from 1500 individuals from two different cohorts, we demonstrate that ASE is less often observed in regions of low recombination, and loci in high or normal recombination regions are more efficient at using ASE to underexpress harmful mutations. By tracking genetic ancestry, we discriminate between ASE variability due to past demographic effects, including subsequent bottlenecks, versus local environment. We observe that ASE is not randomly distributed along the genome and that population parameters influencing the efficacy of natural selection alter ASE levels genome wide.
The consequences of mutations on individual health are influenced by the regulation of gene expression, causing individuals with the same genotype to display highly variable phenotypes. Allele-specific approaches are used to investigate gene regulation by detecting differences in expression of each allele at a particular locus (). While previous studies have investigated the specific genomic location, tissue specificity, and disease phenotype impact of allele-specific expression (ASE) (–), few have documented how evolutionary forces, such as natural selection and genetic drift, regulate gene expression evolution genome wide (, ). In addition to evolutionary forces and population demographics, variability in gene expression explained by genic interactions with the environment has been limited to only a few instances where external environmental exposures have been measured (–).Natural selection acts on phenotypes, causing underlying genotype and allele frequencies to shift over time; however, the strength of selection in a population is affected by many factors, including effective population size (Ne) (, ). In general, purifying selection depletes haplotypes that are likely pathogenic (–). Castel et al. () observed that the overexpression of haplotypes with disease-associated variants was enriched in cohorts of patients with autism and cancer. Despite evidence of selection acting on cis-regulation (, ), little is known about the degree to which population and genomic parameters influence selection efficiency in relation to ASE. Selection efficiency varies across the genome based on local recombination rates (–). At regions with low levels of recombination, genetic variation and effective population size are greatly reduced, making the population less able to respond to selection and eliminate deleterious mutations (, ). Without recombination, deleterious mutations accumulate irreversibly, where a mutation-free haplotype cannot be regenerated once lost—a process termed as “Muller’s ratchet” (). In addition, functional diversity is affected when mutations compete with each other to become fixed in a population, known as Hill-Robertson (HR) interference (, ). We have previously shown that recombination heterogeneity can have profound consequences on how genomic variation is structured across genomes, as selection appears more efficient at removing harmful mutations in recombination hotspots and elicit an enrichment of damaging mutations in cold spots (CSs) (). Consistent with this model is the observation that, in larger populations, selection is more effective at removing damaging mutations in recombination hotspots (, –). Here, we demonstrate how recombination and Ne collectively affect the expression of deleterious variants, specifically through the regulation of ASE. To improve our understanding of the role of evolution on gene expression variability, we investigated how selection efficiency in different demographic populations and environments alters levels of ASE genome wide. We demonstrate that selection efficiency, through recombination and population histories influencing Ne, contributes to ASE evolution and maintenance by reducing the functional effects of deleterious variants.
RESULTS
ASE controls the expression of deleterious mutations
We confirm that ASE within individuals can diminish the relative expression of pathogenic mutations at heterozygous loci (). We tested whether derived alleles that accumulated in human genomes were under stronger ASE, reflecting negative selection compared to ancestral alleles. We quantified ASE by combining RNA sequencing and genotyping of 844 individuals from three regions in Quebec of multiple ancestries from the CARTaGENE cohort (, , ) and 752 individuals from the Genotype-Tissue Expression (GTEx) project (). We quantified ASE at missense and synonymous single-nucleotide polymorphisms (SNPs), which included 3083 SNPs from CARTaGENE whole blood and 5974, 5738, 4431, 6795, 9035, and 5091 from whole blood, muscle, brain, ovarian, lung, and liver tissue from GTEx, respectively (Materials and Methods and Fig. 1) (, ). In CARTaGENE, we identified 23,589 (4.9%) individual-SNP pairs with significant ASE, as determined by binomial tests after correcting for multiple testing and mapping bias using WASP software (Materials and Methods and table S1) (). We investigated the magnitude of ASE relative to the derived allele (Materials and Methods). Among the 23,589 individual-SNP pairs with significant ASE, 12,133 pairs (51.4%) underexpress the derived allele, which were relatively consistent across populations and tissues (tables S1 and S2). Near-equal proportions of directional ASE also aligned with previous observations in humans ().
Fig. 1.
The ASE framework: Investigating genome-wide ASE across two ethnic groups, three French-Canadian regions, and six tissues.
A total of 844 individuals were genotyped and RNA sequenced from the whole blood from individuals in Quebec in the CARTaGENE cohort, with ancestry from Africa (n = 19), Europe (n = 140), Quebec City (n = 180), Montreal (n = 298), and Saguenay (n = 182) (, ). We also use the whole blood (n = 654), muscle (n = 288), brain (n = 175), ovarian (n = 91), lung (n = 210), and liver (n = 84) tissue from GTEx v8 release (, ). The ASE pipeline for the two cohorts differs in their methods for genotyping, but the downstream approaches are consistent (Materials and Methods). ASE is tested using binomial tests with Benjamini-Hochberg (BH) multiple testing correction, and ancestral alignments from 1000 Genomes Phase 1 are used to identify the ancestral and derived alleles (Materials and Methods). Recombination annotations are from population-specific linkage disequilibrium (LD) genetic maps by Hussin et al. (). Missense and synonymous variants were retained for both cohorts (Materials and Methods), resulting in 3083 SNPs used from CARTaGENE and 5974, 4431, 5738, 6795, 9035, and 5091 used from GTEx blood, brain, muscle, ovarian, lung, and liver tissue, respectively.
The ASE framework: Investigating genome-wide ASE across two ethnic groups, three French-Canadian regions, and six tissues.
A total of 844 individuals were genotyped and RNA sequenced from the whole blood from individuals in Quebec in the CARTaGENE cohort, with ancestry from Africa (n = 19), Europe (n = 140), Quebec City (n = 180), Montreal (n = 298), and Saguenay (n = 182) (, ). We also use the whole blood (n = 654), muscle (n = 288), brain (n = 175), ovarian (n = 91), lung (n = 210), and liver (n = 84) tissue from GTEx v8 release (, ). The ASE pipeline for the two cohorts differs in their methods for genotyping, but the downstream approaches are consistent (Materials and Methods). ASE is tested using binomial tests with Benjamini-Hochberg (BH) multiple testing correction, and ancestral alignments from 1000 Genomes Phase 1 are used to identify the ancestral and derived alleles (Materials and Methods). Recombination annotations are from population-specific linkage disequilibrium (LD) genetic maps by Hussin et al. (). Missense and synonymous variants were retained for both cohorts (Materials and Methods), resulting in 3083 SNPs used from CARTaGENE and 5974, 4431, 5738, 6795, 9035, and 5091 used from GTEx blood, brain, muscle, ovarian, lung, and liver tissue, respectively.We confirmed that deleterious mutations tend to underexpress the derived allele in general human populations, as demonstrated by Castel et al. (fig. S1) (). We tested whether putatively deleterious mutations were more commonly over- or underexpressed by scoring variants using the Combined Annotation Dependent Depletion (CADD) score (). The CADD score estimates the level of deleteriousness of an SNP using a combination of functional predictions and conservation measurements (). Using a binomial mixed-effects regression treating individuals as a random effect (Materials and Methods), we observe that sites underexpressing the derived allele have larger CADD scores compared to sites overexpressing the derived allele (β = −4.190 ×10−3, P < 2 ×10−16; fig. S1) despite having comparable proportions of synonymous and missense variants (fig. S2). These findings suggest that ASE can reduce the exposure of putatively deleterious variants to natural selection by reducing their allelic expression.
High recombination regions are more efficient at regulating the expression of deleterious mutations
Next, we exposed how natural selection acts on ASE by testing how different evolutionary forces that contribute to selection efficiency—specifically recombination, per-site expression level, and population histories—are associated with ASE directionality. Regions of the genome with higher recombination rates have more efficient selection, because recombination both increases local Ne () and reduces HR interference (). SNPs were classified as either being located in low recombination, referred to as CS, normal recombination, or high recombination region (HRR) using population-specific linkage disequilibrium-based genetic maps (, ). We grouped normal recombination with HRR in several downstream analyses (referred to as HRR/Normal) due to the relatively low number of segregating sites in HRR (6.7%) and the similar direction of odds ratio and β estimates observed between normal and HRR (fig. S3). We modeled recombination class as a function of derived allele expression (Materials and Methods) and identified that SNPs with statistical evidence of ASE are more likely to be observed in HRR/Normal compared to CS (Fig. 2A). Recombination rates in human genomes are known to correlate with genomic features such as guanine-cytosine (GC) content, average exon expression, and exon size (, , ); however, it is unlikely that these genomic features explain the enrichment of ASE in HRR/Normal regions, as (i) models stratified by low, medium, and high values of these features had comparable odds ratios to original estimates (table S3); (ii) odds ratio estimates were comparable with models including these genomic features (table S4); and (iii) models tested with the various genomic features alone were significantly improved when adding recombination (table S5). Enrichment of ASE in HRR/Normal regions was observed across the whole blood, brain, muscle, ovarian, lung, and liver tissue (Fig. 2A). These results suggest that ASE is not randomly distributed along the genome and that varying recombination rates and its impact on the efficiency of selection contributes to genomic variation in ASE evolution.
Fig. 2.
ASE is enriched in regions of high recombination, regulating the expression of deleterious and disease associated variants.
(A) We modeled recombination class as a function of derived allele expression for CARTaGENE whole blood and GTEx whole blood, brain, muscle, ovarian, lung, and liver tissue (Materials and Methods). Odds ratio (OR) > 1 means increased odds that variants exhibit the ASE behavior in HRR/Normal compared to CS. Error bars represent 95% confidence intervals. (B) Equivalent models of recombination class as a function of derived allele expression for variants found in ClinVar and OMIM databases. (C) ORs modeling recombination class as a function of ASE-eQTL concordance, stratified by CADD score, where CADD > 15 represents putatively deleterious variants. ASE-eQTL concordance is defined as haplotype combinations, where a cis-regulatory variant underexpresses the ASE SNP (Materials and Methods). OR > 1 means haplotype combinations with ASE-eQTL concordance are more likely to be in HRR/Normal compared to CS. Error bars represent 95% confidence intervals.(D) We modeled the proportion of the derived allele expression to total expression as a function of CADD score using binomial mixed-effects regression (Materials and Methods), stratified by HRR (β = −0.049725, P < 2 ×10−16) and CS (β = −0.001245, P < 2 ×10−16), with 95% confidence bands. Equivalent models of the proportion of the derived allele as a function of CADD score on (E) GTEx whole blood, stratified by HRR (β = −0.0026557, P < 2 ×10−16) and CS (β = −0.0019145, P < 2 ×10−16) and (F) GTEx liver, stratified by HRR (β = −0.0166216, P < 2 ×10−16) and CS (β = −3.13 ×10−3, P < 2×10−16). Equivalent models for the brain, ovarian, lung, and muscle tissue are found in fig S4.
ASE is enriched in regions of high recombination, regulating the expression of deleterious and disease associated variants.
(A) We modeled recombination class as a function of derived allele expression for CARTaGENE whole blood and GTEx whole blood, brain, muscle, ovarian, lung, and liver tissue (Materials and Methods). Odds ratio (OR) > 1 means increased odds that variants exhibit the ASE behavior in HRR/Normal compared to CS. Error bars represent 95% confidence intervals. (B) Equivalent models of recombination class as a function of derived allele expression for variants found in ClinVar and OMIM databases. (C) ORs modeling recombination class as a function of ASE-eQTL concordance, stratified by CADD score, where CADD > 15 represents putatively deleterious variants. ASE-eQTL concordance is defined as haplotype combinations, where a cis-regulatory variant underexpresses the ASE SNP (Materials and Methods). OR > 1 means haplotype combinations with ASE-eQTL concordance are more likely to be in HRR/Normal compared to CS. Error bars represent 95% confidence intervals.(D) We modeled the proportion of the derived allele expression to total expression as a function of CADD score using binomial mixed-effects regression (Materials and Methods), stratified by HRR (β = −0.049725, P < 2 ×10−16) and CS (β = −0.001245, P < 2 ×10−16), with 95% confidence bands. Equivalent models of the proportion of the derived allele as a function of CADD score on (E) GTEx whole blood, stratified by HRR (β = −0.0026557, P < 2 ×10−16) and CS (β = −0.0019145, P < 2 ×10−16) and (F) GTEx liver, stratified by HRR (β = −0.0166216, P < 2 ×10−16) and CS (β = −3.13 ×10−3, P < 2×10−16). Equivalent models for the brain, ovarian, lung, and muscle tissue are found in fig S4.The effect of ASE and recombination rate was observed for both over- and underexpression of the derived allele (Fig. 2A); however, on the basis of the results from Castel et al. (), we would expect to have a stronger effect of underexpression of the derived allele in disease-associated sites. Therefore, we filtered the data to only include variants with putative disease associations in ClinVar () and Online Mendelian Inheritance in Man (OMIM) (). Here, we found that variants with putative disease associations had significant odds of underexpressing the derived allele in HRR/Normal regions; however, sites overexpressing the derived allele were more likely to be observed in CS compared to HRR/Normal (Fig. 2B).We next tested the hypothesis that protective haplotype combinations would be observed at higher frequencies in HRR/Normal regions because of more efficient selection removing haplotype combinations that overexpress deleterious alleles in these regions (). To test this hypothesis, we identified ASE SNPs that are in genes associated with expression quantitative trait loci (eQTLs) (Materials and Methods). Using haplotype phasing information, we identified ASE-eQTL concordant sites, where an eQTL cis-regulatory variant that decreases expression is on the same haplotype as the derived SNP under ASE and therefore underexpressing the derived allele at the ASE site (Materials and Methods). We observed that concordant haplotypes that underexpress deleterious variants (CADD > 15) are more likely to be observed in HRR/Normal regions compared to CS (Fig. 2C). This result supports the hypothesis that protective haplotype combinations of ASE underexpressing a deleterious variant is more common in HRR/Normal regions, likely due to the higher efficiency of selection acting on both the cis-regulatory variant and the ASE variant.We modeled the proportion of the derived allele expression to total expression as a function of CADD score using binomial mixed-effects regression and tested for interaction between CADD score and recombination region (Materials and Methods). We observe that HRRs underexpress putatively deleterious mutations more efficiently (β = −0.049725) compared to CS (β = −0.001245, P < 2 × 10−16) (Fig. 2D). This finding of HRR underexpressing deleterious mutations more efficiently was replicated in the whole blood, liver, (Fig. 2, E and F), lung, and muscle tissue from GTEx (fig. S4, C and D) but not observed in the brain and ovarian tissue (fig. S4, A and B). Together, these results suggest that regions of high recombination are more efficient at minimizing the expression of transcripts harboring deleterious and/or disease-associated variants.
Total expression level contributes to ASE variation at sites with deleterious mutations
Next, we found that per-site total expression is positively associated with ASE, potentially because genes with higher expression are under stronger evolutionary pressures (). Mean expression can influence the efficiency of selection, because evolvability by natural selection is associated with increased variance, as described in Fisher’s fundamental theorem (). We tested whether the previous result of HRR efficiently minimizing the expression of deleterious variants was consistent across SNPs with different per-site total expression levels (Materials and Methods). We observed an increased odds of exhibiting ASE in HRR/Normal compared to CS for SNPs in the high-expression bin (above third quartile) compared to medium (between first and third quartiles) and low-expression (below first quartile) bins (Fig. 3A). Variants with lower per-site total expression do not efficiently underexpress deleterious mutations, regardless of recombination region (HRR: β = 0.008657, P = 1.33 × 10−12; CS: β = 0.0008008, P = 0.008466); however, sites with higher per-site expression more efficiently underexpress the derived allele at highly deleterious mutations in HRR (β = −0.0509201, P < 2.2 × 10−16) compared to CS (β = −1.71 × 10−3, P < 2 × 10−16) (Fig. 3B). Since it is possible that the difference in expression observed is related to read coverage, we used a sampling approach to randomly sample reads, with replacement, from each position, such that all positions have 250 reads (Materials and Methods). We fit the mixed-effects regression models, stratified by expression, on the randomly sampled reads for 25 iterations and took the mean β, intercept, and 95% confidence intervals from all iterations (Materials and Methods). This random sampling approach observed the same relationship of HRR more efficiently underexpressing the derived allele at deleterious variants for sites with high expression, supporting the notion that the observed relationship is due to total expression and likely not an artifact of read coverage (fig. S5). This relationship with HRR more efficiently underexpressing highly deleterious mutations at sites with high expression was replicated in the blood, muscle, brain, lung, and liver tissue, stratified by per-site total expression (fig. S7). These results suggest that sites with higher expression and recombination regulate ASE more efficiently, likely because selection is more efficient at these sites, resulting in the removal of haplotype combinations that overexpress deleterious mutations.
Fig. 3.
Per-site total expression contributes to ASE variability and tissue specificity.
(A) We modeled recombination class as a function of derived allele expression, stratified by total expression (Materials and Methods). “High” expression categorizes SNPs above the third quartile, and “low” expression categorizes below the first quartile. OR > 1 indicates that variants in an expression class have increased odds of exhibiting ASE in HRR/Normal compared to CS. Error bars represent 95% confidence intervals. Data used from CARTaGENE whole blood. Equivalent models for GTEx whole blood, brain, muscle, ovarian, lung, and liver tissue are found in fig S6. (B) We modeled the proportion of the derived allele expression as a function of CADD score, stratified by expression for HRR (high: β = −0.0509201, P < 2.2 × 10−16; medium: β = −5.82 × 10−6, P = 0.993; and low: β = 0.008657, P = 1.33 × 10−12) and CS (high: β = −1.71 × 10−3, P < 2 × 10−16; medium: β = −8.99 × 10−5, P = 0.492; and low: β = 0.0008008, P = 0.008466). Data used from CARTaGENE whole blood. Equivalent models for GTEx blood, brain, muscle, ovarian, lung, and liver tissue are found in fig S7. (C) We identified 79 individuals with measurements in the blood, muscle, and brain tissue at 448 sites with significant ASE in at least one tissue. Heatmap demonstrates each individual-SNP pair in columns and ASE levels for each tissue in rows. (D) Using the same individual-SNP pairs from (C), ORs were calculated after modeling recombination class as a function of the total expression for tissue-specific ASE (Materials and Methods). OR > 1 indicates that variants in an expression class have increased odds of having tissue-specific ASE in HRR/Normal compared to CS. Error bars represent 95% confidence intervals.
Per-site total expression contributes to ASE variability and tissue specificity.
(A) We modeled recombination class as a function of derived allele expression, stratified by total expression (Materials and Methods). “High” expression categorizes SNPs above the third quartile, and “low” expression categorizes below the first quartile. OR > 1 indicates that variants in an expression class have increased odds of exhibiting ASE in HRR/Normal compared to CS. Error bars represent 95% confidence intervals. Data used from CARTaGENE whole blood. Equivalent models for GTEx whole blood, brain, muscle, ovarian, lung, and liver tissue are found in fig S6. (B) We modeled the proportion of the derived allele expression as a function of CADD score, stratified by expression for HRR (high: β = −0.0509201, P < 2.2 × 10−16; medium: β = −5.82 × 10−6, P = 0.993; and low: β = 0.008657, P = 1.33 × 10−12) and CS (high: β = −1.71 × 10−3, P < 2 × 10−16; medium: β = −8.99 × 10−5, P = 0.492; and low: β = 0.0008008, P = 0.008466). Data used from CARTaGENE whole blood. Equivalent models for GTEx blood, brain, muscle, ovarian, lung, and liver tissue are found in fig S7. (C) We identified 79 individuals with measurements in the blood, muscle, and brain tissue at 448 sites with significant ASE in at least one tissue. Heatmap demonstrates each individual-SNP pair in columns and ASE levels for each tissue in rows. (D) Using the same individual-SNP pairs from (C), ORs were calculated after modeling recombination class as a function of the total expression for tissue-specific ASE (Materials and Methods). OR > 1 indicates that variants in an expression class have increased odds of having tissue-specific ASE in HRR/Normal compared to CS. Error bars represent 95% confidence intervals.Since tissues vary in expression level, we next tested whether expression and recombination contributes to tissue-specific ASE. To test this, we identified 79 individuals with ASE calculations in the blood, muscle, and brain tissues at 448 sites with ASE in at least one tissue (Materials and Methods). We confirm that ASE is highly variable across tissues within the same individual (Fig. 3C) (). Variable tissue-specific ASE did not appear to be directly linked to recombination alone (Fig. 3C); however, sites that had tissue-specific ASE and were highly expressed had increased odds of being in HRR/Normal recombination compared to CS, which was not observed for medium and lowly expressed sites (Fig. 3D). This result suggests that sites under stronger selective pressures, specifically higher expression and recombination, are more likely to exhibit tissue-specific ASE.
Populations with smaller Ne have weaker enrichment of ASE in recombination regions
As recombination influences Ne, we next investigated whether ancestral populations with smaller Ne had a reduced enrichment of ASE in HRR/Normal. Higher recombination rates and larger Ne together increase the number of haplotype combinations, making natural selection more efficient (). We made use of the data from historically founding populations and contrasted other populations with smaller Ne to explore the relationship of Ne in association with ASE variation (). The French-Canadian population originates from a founder event 400 years ago from the French population, and smaller subpopulations were established in remote areas through subsequent founder events, such as in the Saguenay–Lac-Saint-Jean region (). The Saguenay region remained relatively isolated for many decades and now shows higher levels of relatedness and homozygosity (, ). In addition, we examined samples of African and European descent in the analysis to reveal the impact of Ne on ASE profiles. Larger and older populations, such as Africa (), have more efficient selection and reduced genetic drift, whereas founder populations with small Ne, such as the three Quebec subpopulations, have less efficient selection and stronger effects of genetic drift (, , ). Using data from CARTaGENE and GTEx blood, we found increased odds of significant ASE in HRR/Normal regions compared to CS, which showed greater estimates in populations with larger Ne, such as Africa (Ne = 13,900) (, –), compared to Europe (Ne = 10,427), Montreal (Ne = 11,401), Quebec City (Ne = 11,202), and Saguenay (Ne = 9123), which have smaller Ne (Fig. 4A) (, ). The three subpopulations of Quebec have overlapping confidence intervals for Ne estimates, and hence, the differences that we observed are among continents, as we do not have the resolution to observe differences within subpopulations studied here. To test the impact of sample size, we subsampled the CARTaGENE data, such that each population had equal sample size (n = 19) and observed similar signatures of African individuals having increased odds in HRR/Normal for overexpressing the derived allele (fig. S8). The signature of African individuals having increased odds of ASE in HRR/Normal compared to CS was also demonstrated in GTEx muscle, brain, ovarian, lung, and liver tissue (fig. S9). We tested individuals from East Asia from GTEx (n = 9) and CARTaGENE (n = 2) and observed consistent results of significant ASE in HRR/Normal regions with odds ratio estimates between European and African populations, which is consistent with Ne expectations, although the large confidence intervals are very likely due to small sample sizes (fig. S10). In addition, when only including pathogenic variants in OMIM, the observed pattern of increased odds of ASE in HRR/Normal compared to CS was specifically captured for sites that underexpress the derived allele, although population differences consistent with expected difference in Ne among populations were still observed (Fig. 4B). The signature of increased odds of overexpressing the derived allele for pathogenic variants in CS was observed for all populations except for Africa, which may be due to an overall larger effective population size in Africa genome wide (Fig. 4B). These findings additionally support the notion that recombination can influence the evolution of gene expression regulation by affecting Ne at genomic regions in some populations and implicate variation in demographic histories at a global scale of a population contributing to differences in ASE.
Fig. 4.
Larger effective populations are more efficient at regulating ASE.
(A) We modeled recombination class as a function of the derived allele expression, stratified by ancestry (Materials and Methods). OR > 1 indicates increased odds of exhibiting ASE in HRR/Normal compared to CS. Error bars represent 95% confidence intervals. Data are from CARTaGENE and GTEx blood. Equivalent models for GTEx brain, ovarian, lung, and liver are found in fig. S9. (B) Equivalent model of recombination class as a function of derived allele expression, stratified by population for OMIM variants. (C) Differential expression (DE) analyses were computed to determine population-varying ASE sites at 133 genes (Materials and Methods). Upset plot demonstrates the number of genes that vary in significance (FDR < 0.05) across populations, indicated by the connecting dots. (D) Population-level ASE was compared between Europe and French-Canadian subpopulations using the log2 fold change values from the DE analyses. Each dot represents a gene. The log2 fold change values from European individuals are on the x axis, and log2 fold change values from one of the three French-Canadian populations are on the y axis. Positive log2 fold change demonstrates overexpression of the derived allele, negative values demonstrate underexpression, and 0 was assigned for no significance. Dashed line shows 1:1 line where the log2 fold change values of the populations would be theoretically equal. Genes were scored on the basis of CADD scores at ASE sites, where CADD > 15 signifies putatively deleterious variants. (E) Comparison of genes with CADD > 15 that demonstrated different ASE between European (EU) and French-Canadians (FC) based on DE analysis.
Larger effective populations are more efficient at regulating ASE.
(A) We modeled recombination class as a function of the derived allele expression, stratified by ancestry (Materials and Methods). OR > 1 indicates increased odds of exhibiting ASE in HRR/Normal compared to CS. Error bars represent 95% confidence intervals. Data are from CARTaGENE and GTEx blood. Equivalent models for GTEx brain, ovarian, lung, and liver are found in fig. S9. (B) Equivalent model of recombination class as a function of derived allele expression, stratified by population for OMIM variants. (C) Differential expression (DE) analyses were computed to determine population-varying ASE sites at 133 genes (Materials and Methods). Upset plot demonstrates the number of genes that vary in significance (FDR < 0.05) across populations, indicated by the connecting dots. (D) Population-level ASE was compared between Europe and French-Canadian subpopulations using the log2 fold change values from the DE analyses. Each dot represents a gene. The log2 fold change values from European individuals are on the x axis, and log2 fold change values from one of the three French-Canadian populations are on the y axis. Positive log2 fold change demonstrates overexpression of the derived allele, negative values demonstrate underexpression, and 0 was assigned for no significance. Dashed line shows 1:1 line where the log2 fold change values of the populations would be theoretically equal. Genes were scored on the basis of CADD scores at ASE sites, where CADD > 15 signifies putatively deleterious variants. (E) Comparison of genes with CADD > 15 that demonstrated different ASE between European (EU) and French-Canadians (FC) based on DE analysis.Next, we explored population-specific ASE patterns using differential expression (DE) analyses. DE analyses were run separately on each ancestral region, comparing the derived and ancestral allele counts at each gene (Materials and Methods). Significant DE suggests population-wide ASE at the gene, allowing us to compare differential ASE genes across four of the geographic regions. Africa was removed from the analyses due to a small sample size, with concerns of observing large quantities of false negatives and incorrectly identifying population-specific ASE (Materials and Methods). We investigated 1045 genes that had expression data across all four regions, 133 of which demonstrated significant DE [false discovery rate (FDR) < 0.05] and an absolute log2 fold change above the threshold of 0.25 for at least one population. We detected 86, 61, 40, and 37 significant ASE genes in Europe, Quebec City, Montreal, and Saguenay, respectively (Fig. 4C). Of those significant ASE genes, 33, 24, 9, and 5 of the genes demonstrate population-specific ASE in Europe, Quebec City, Saguenay, and Montreal, respectively (Fig. 4C).To compare DE genes between populations, we analyze differences in significance and effect size (log2 fold change values) between regions. Positive log2 fold change values demonstrate overexpression of the derived allele, negative values demonstrate underexpression, and a value of 0 was reassigned for genes that were not significant. We observed that the European population had more variation in ASE effect size compared to the three French-Canadian regions, as there are more genes that are not significant in the French-Canadian regions but are significant in Europe (Fig. 4D). Gene ontology analysis did not demonstrate significant over- or underrepresentation of a particular gene category for genes that are significant for ASE in European individuals but not French-Canadians. To investigate genes with differential ASE across populations, we annotated the genes based on CADD scores and explored genes with putatively deleterious variants (CADD > 15), including NDUFS2, SERPINB8, TOR1A, TMEM176B, URGCP, GBP3, TAPBPL, and UBP1 (Fig. 4D). NDUFS2 encodes a protein that is a core subunit of the mitochondrial membrane respiratory chain, SERPINB8 belongs to a family of serine protease inhibitors, TOR1A is in a family of adenosine triphosphatases, TMEM176B likely plays a role in maturation of dendritic cells, URGCP may promote hepatocellular growth and survival, GBP3 encodes a member of the guanylate-binding protein (GBP) family, TAPBPL is a member of the immunoglobulin superfamily that links major histocompatibility complex class I to the transporter for antigen processing, and UBP1 encodes a transcriptional activator in a promoter context-dependent manner (). Some genes with differential ASE across populations were also located at genomic regions with differing estimates of recombination rate across populations (Fig. 4E). For example, genes SAMM50 and TRAF3IP3 both demonstrate significant ASE in the European population but have no statistical evidence of ASE in the three French-Canadian populations. SAMM50 and TRAF3IP3 are in a normal recombination region in Europeans, and, in a CS for French-Canadians, perhaps demonstrating how changes in population recombination histories could influence population-specific ASE patterns. The variation in ASE and recombination rates at these genes across populations suggests that Ne and recombination influence the efficiency of genome regulation at a continental scale. Overall, these findings highlight genes with population-specific ASE, which may be explained by recombination and Ne variability within and across genomes.
Environmental modifiers affect the ASE regulation at heterozygous loci containing a deleterious mutation
Population differences in ASE may not only be attributable to the haplotypic genetic variation as a consequence of Ne and recombination but may also be modulated by epigenetic modifiers through environmental exposure. Evolutionary change through natural selection is a driver of genetic adaptation in many populations, and these evolutionary forces, in combination with the interaction between genetic and environmental variables, contribute to the epigenetic nature of disease ().To test for potential environmental effects on ASE regulation at specific genes, we investigated ASE differences between French-Canadian individuals whose regional ancestry is different from their current residential region, in which we previously identified environmental variables associated with profiles of gene expression, including pollutant levels (). We computed DE analysis using subpopulations of Quebec with differing demographic bottleneck and migratory histories using the method previously described for ancestral comparisons, separating individuals as either population locals, those whose ancestry are identical to the current region where they reside, or population migrants, whose ancestry and current region differ (Materials and Methods). We investigated 92 genes that had expression data across all regions and environments. We identified genes with differential ASE across individuals having the same ancestry but living in different environments by comparing individuals with ancestry from Saguenay but living in Montreal, Quebec City, or Saguenay (Fig. 5A). We captured genes with deleterious variants that show different ASE profiles between individuals with the same ancestry but living in different geographic environments, indicating a possible effect of the environment on ASE regulation. These genes include GBP3, PTPRA, UBE3B, and NDUFS2 (Fig. 5, C and D). GBP3 encodes a member of the GBP family, PTPRA encodes a protein tyrosine phosphatase, UBE3B is involved in protein degradation, and NDUFS2 encodes a protein that is a core subunit of the mitochondrial membrane respiratory chain (). Conversely, we also identified genes with differential ASE between individuals living in the same regional environment but having different genetic ancestries, indicating a possible effect of recent ancestry itself on ASE regulation (Fig. 5B). These genes include CHPT1, GBP3, CTSB, KIAA0922, and MCM3AP (Fig. 5D). CHPT1 plays a role in formation and maintenance of vesicular membrane, CTSB is involved in intracellular degradation and turnover of proteins, KIAA0922 antagonizes canonical Wnt signaling, and MCM3AP is essential for the initiation of DNA replication (). Similarly with gene ontology analyses, we did not observe significant overrepresentation of a particular gene category. GBP3 was observed to have variable ASE in both the ancestrality and environmental analyses (Figs. 4D and 5, C and D). GBP3 exhibits antiviral activity against influenza virus, which might suggest that ASE changes may be in response to changing antiviral activity (). Together, these results suggest that environmental modifiers also contribute to ASE variability in addition to the genetic ancestry of an individual.
Fig. 5.
Environment contributes to differential ASE across populations.
DE analyses were computed at 92 genes (Materials and Methods). Ancestry-environment relationships were evaluated using migrant French-Canadian individuals. (A) Upset plot demonstrating the number of overlapping significant DE genes (FDR < 0.05) from individuals with Saguenay ancestry but varying environments. (B) Upset plot demonstrating the number of overlapping significant DE genes (FDR < 0.05) from individuals living in Montreal with different ancestry. (C) Log2 fold change from individuals with Saguenay ancestry and environment compared to individuals with Saguenay ancestry living in Montreal or Quebec City. Each dot represents a gene. Log2 fold change on the x axis is based on individuals with ancestry and environment in Saguenay, whereas the y axis includes individuals with ancestry from Saguenay with Montreal or Quebec City environments. Positive log2 fold change demonstrates overexpression of the derived allele, negative demonstrates underexpression, and zero is not significant. Dashed line shows if the values were theoretically equal. (D) Log2 fold change from individuals with Montreal ancestry and environment compared to individuals with Montreal environment with ancestry from Quebec or Saguenay. Each dot represents a gene. Log2 fold change on the x axis is based on individuals whose ancestry and environment is Montreal, whereas the y axis includes individuals with Montreal environment with Saguenay or Quebec City ancestry. Dashed line demonstrates if the population values were theoretically equal. (E) Genes with deleterious variants (CADD > 15) that demonstrated different ASE from (C). (F) Genes with deleterious variants (CADD > 15) that demonstrated different ASE from (D).
Environment contributes to differential ASE across populations.
DE analyses were computed at 92 genes (Materials and Methods). Ancestry-environment relationships were evaluated using migrant French-Canadian individuals. (A) Upset plot demonstrating the number of overlapping significant DE genes (FDR < 0.05) from individuals with Saguenay ancestry but varying environments. (B) Upset plot demonstrating the number of overlapping significant DE genes (FDR < 0.05) from individuals living in Montreal with different ancestry. (C) Log2 fold change from individuals with Saguenay ancestry and environment compared to individuals with Saguenay ancestry living in Montreal or Quebec City. Each dot represents a gene. Log2 fold change on the x axis is based on individuals with ancestry and environment in Saguenay, whereas the y axis includes individuals with ancestry from Saguenay with Montreal or Quebec City environments. Positive log2 fold change demonstrates overexpression of the derived allele, negative demonstrates underexpression, and zero is not significant. Dashed line shows if the values were theoretically equal. (D) Log2 fold change from individuals with Montreal ancestry and environment compared to individuals with Montreal environment with ancestry from Quebec or Saguenay. Each dot represents a gene. Log2 fold change on the x axis is based on individuals whose ancestry and environment is Montreal, whereas the y axis includes individuals with Montreal environment with Saguenay or Quebec City ancestry. Dashed line demonstrates if the population values were theoretically equal. (E) Genes with deleterious variants (CADD > 15) that demonstrated different ASE from (C). (F) Genes with deleterious variants (CADD > 15) that demonstrated different ASE from (D).
DISCUSSION
The regulation of gene expression has a leading role in the link between genotypes and phenotypes. In diploid organisms, the independent regulation of two alleles of a gene can have profound implications on gene expression, ultimately affecting the phenotypic penetrance. Regions of low recombination are enriched for deleterious variants in human populations due to consequences of HR interference (); however, the effects of HR interference and recombination rate on the regulation of gene expression have remained largely unexplored. Recently, it has been documented that haplotype combinations that reduce the expression of deleterious alleles are depleted in healthy individuals, providing evidence of purifying selection on gene expression regulation (). Here, we asked how natural selection influences the genomic distribution of ASE. We demonstrate that ASE is not randomly distributed along the genome and that population parameters such as varying recombination rate, per-site and tissue-specific expression level, and population histories may facilitate ASE evolution due to their impact on the efficiency of natural selection.In this work, we demonstrate that patterns of ASE are also affected by HR effects, as we observe less ASE regulation in regions of recombination CS. In CS, fewer recombination events are expected between each ASE variant and their regulatory variants, leading to a faster loss of diversity in regulatory regions and thus a smaller Ne (). A possible mechanism based on our results may be that the loss of diversity in regulatory regions in CS can lead to an equal ratio of allelic expression in these regions versus HRR because of regulatory polymorphisms eventually becoming fixed in the population. In addition, we demonstrate that protective haplotype combinations, where a cis-regulatory variant is linked to the underexpression of a deleterious coding variant, are more commonly observed in HRR/Normal regions compared to CS. This result supports and extends previous findings by Castel et al. (), suggesting that selection removes haplotype combinations that are more damaging, as HRR/Normal regions have more efficient selection and have more haplotype combinations that underexpress a deleterious allele. These results show how purifying selection and recombination have acted together on the combination of cis-regulatory variation with coding variation, which influences the penetrance of pathogenic variants in human populations.Within the CARTaGENE population cohort and replicated across several tissues in GTEx, we show that deleterious derived alleles are generally underexpressed, which likely tampers downstream effects on phenotypes. We observe a higher prevalence of ASE in genomic regions with HRR/Normal recombination, specifically underexpressing the derived allele for sites with pathogenic variants, and consistent with HRR/Normal regions of the genome having higher Ne and more efficiently underexpressing pathogenic variants. This pattern is observed at variants with high total expression levels, as this is an additional factor that contributes to selection efficiency. We also observe more ASE in HRR/Normal regions across a number of human populations.The influence of natural selection through recombination on ASE was documented across tissue types; however, we did not observe HRR underexpressing deleterious variants more efficiently in brain or ovarian tissue. The divergence of pattern for some tissues is likely attributed to the tissue specificity of gene expression at many loci, where brain tissue often has the largest deviation from other tissue types (). In addition, brain and ovarian tissues had the lowest proportion of sites with statistically significant ASE, with 3.7 and 3.8%, respectively (table S2), which may also contribute to altered results. We also hypothesize that tissue specificity observed in the blood may be due to allele-specific changes during immune response (), as immunity is strongly targeted by natural selection. We observed that GBP3, a gene involved in antiviral activity against influenza virus (), had variable ASE across ancestral populations and environments, further supporting the notion that immune-related genes may demonstrate variable ASE during immune response.We further highlight numerous genes with population-specific ASE based on genetic ancestry or geographic environment. Population-specific ASE may be partially explained by past demographic histories, such as recent bottlenecks, or different exposures (). Strong bottlenecks lead to reduced genetic diversity, and any disease-associated alleles that are present in the founder individuals may rise in frequency in the descending population due to strong forces of genetic drift and weaker selection, which consequently explains the higher prevalence of many genetic diseases in these founder populations (). Similarly to low recombination regions of the genome, small post-bottleneck populations have a smaller Ne, and thus the reduced genetic diversity in regulatory regions leads to increased probability of equal allelic expression.There is a long-standing investigation in evolutionary biology about how past demographic changes have affected modern phenotypes and disease risk (, ). The impact of evolutionary factors, such as natural selection and genetic drift, on gene expression variation has been understudied, likely due to the lack of empirical data previously available (). Using large population cohorts, we have assessed ASE variation across the genome and global populations with different population sizes. Gene expression is a key intermediate step in translating genotypes into phenotypes, and thus understanding how gene expression is regulated and evolves is critical for deconvoluting the relationship between phenotypic variation and disease penetrance across human populations.
MATERIALS AND METHODS
Data
The CARTaGENE biobank comprises more than 40,000 participants aged between 40 and 60 years, recruited at random among three urban centers in the province of Quebec: Montreal, Quebec City, and Saguenay. Data used in this study have been previously published by Favé et al. (). The study protocol was approved by the University of Toronto, and all participants provided informed consent. In total, 1000 samples were genotyped on the Illumina Omni2.5 array to obtain high-density SNP genotyping data. A total of 1,213,103 SNPs were retained after filtering and quality control (Hardy-Weinberg P value > 0.001, minor allele frequency > 5%, and percent of missing data < 1%). Paired-end RNA sequencing was performed on a HiSeq 2000 platform at the Genome Quebec Innovation Center (Montreal, Canada). Sequencing was performed for freeze 1 (708 samples) using three samples per lane and using six samples per lane for freeze 2 (292 samples), yielding about 60 million reads per sample. Mapping bias was corrected using WASP v0.3.4 (), which also corrects for GC content affecting read depth inconsistencies by fitting polynomials to the read counts and calculating a corrected read depth per region (). Possible CNVs were identified from the Database of Genomic Variants (), and genes overlapping with these variants were removed. Missense or synonymous SNPs with >30 total reads were retained for ASE calculations (n = 3083 SNPs).
Genetic ancestry
Favé et al. () determined genetic ancestry from CARTaGENE participants using ChromoPainter v0.04 () to detect fine-scale genetic structure, followed by the fineSTRUCTURE algorithm. Regional ancestry for each French-Canadian was determined on the basis of the three clusters obtained from the fineSTRUCTURE tree, which was in agreement with Quebec settlement history revealing subpopulations of individuals that follow a North-South structure.
Ancestral allele identification
Ancestral allele annotations were captured from the 1000 Genomes Phase 1 (). Ancestral alleles were defined as the allele found in human, chimpanzee, and rhesus macaque genomes. Only alleles that overlapped all three genomes were used, resulting in 289,212 individual-SNP pairs retained, where 51.3% had the reference allele match the ancestral allele, and 48.6% had the alternative allele match the ancestral allele.
ASE calculations
ASE estimates were calculated for germline heterozygous positions from CARTaGENE. Significant ASE sites were determined using two-tailed binomial tests with Benjamini-Hochberg multiple testing correction (FDR < 0.05) to test the null hypothesis that P = 0.5, where P is the proportion of reads with the derived allele divided by the total read count for the site. For the 289,212 individual-SNP pairs with ancestral allele annotations, significant ASE was separated into overexpression and underexpression of the derived allele by dividing the derived allele count by the total read count. A derived allele proportion of >0.5 demonstrates overexpression of the derived allele and <0.5 demonstrates underexpression of the derived allele.
Tissue replication with GTEx
GTEx was used for replication of the results observed in CARTaGENE. We used the whole blood (n = 654), muscle (n = 288), brain (n = 175), ovarian (n = 91), lung (n = 210), and liver (n = 84) tissue for 752 individuals from GTEx v8 release () using ASE calculations published by Castel et al. (). GTEx also uses WASP for mapping bias correction. The total number of sites tested for ASE significance using binomial tests and after filtering were 5974, 5738, 4431, 6795, 9035, and 5091 SNPs for the whole blood, muscle, brain, ovarian, lung, and liver tissue, respectively. All individuals included had ancestry from Europe or Africa.
Recombination mapping
Recombination maps were built by Hussin et al. (), who built linkage disequilibrium-based genetic maps using genotyping data from French-Canadian populations from CARTaGENE and CEU (Northern Europeans from Utah) and YRI (Yorubans from Ibadan, Nigeria) populations from HapMap 3 (). Maps were used to locate CSs and hotspots of recombination. CSs are defined as regions longer than 50 kb, with a recombination rate below 0.5 cM/Mb. Hotspots are defined as short segments (<15 kb), with recombination rates above 5 cM/Mb. HRRs are regions with high density of hotspots, where the distance between neighboring hotspots is less than 50 kb. CSs and HRR were identified for each population independently. We identified 1560 CSs and 165 HRR in French-Canadians, 1140 CS and 81 HRR in Europeans, and 288 CSs and 16 HRR in Africans. Many analyses grouped normal recombination with HRR (referred to as HRR/Normal) due to the low number of sites in HRR and the similar direction of effect observed between normal and HRRs (fig. S2). Estimates of effective population size were also obtained by Hussin et al. () and McEvoy et al. ().
Variant annotation
We annotated SNPs based on function using SeattleSeq Annotation v9.10 (). We retained missense and synonymous variants for analysis. Nonsense and splicing variants can also result in ASE through nonsense-mediated decay and alternative splicing; however, it may have different selective pressures compared to ASE as a result of regulatory mechanisms. To remove potential noise from nonsense mediated decay and alternative splicing, we removed nonsense and splice variants from downstream analyses. We annotated SNPs using CADD v1.6 for a measure of deleteriousness, where a deleterious mutation was classified as CADD > 15, as previously reported (). Variants with potential and validated disease associations were retrieved from the ClinVar () and OMIM databases (). We retrieved 653 variants in the CARTaGENE dataset in ClinVar and 340 sites in OMIM genes. Absolute gene expression for each SNP is determined by the total read count, correcting for library size. Per-site total gene expression was calculated on the basis of tissue-specific quartiles, where low expression is observed within the first quartile and high expression is observed above the third quartile. GC content, exon size, average exon expression, and CpG islands were obtained from Hussin et al. (), which determined these metrics based on the same CARTaGENE French-Canadian samples.
Recombination class as a function of derived allele expression class
We modeled the recombination class as a function of the observed behavior of derived allele expression (overexpressed/no ASE/underexpressed). Specifically, three separate models were fit by regressing the binary outcome of recombination class (CS versus HRR/Normal) separately on each of three binary covariates for derived allele expression behavior: (i) overexpressed versus no ASE, (ii) no ASE versus ASE, and (iii) under expressed versus no ASE. Random effects due to individual were included in the models to account for repeated measurements using the glmer() function with “binomial” link function (), with confidence intervals calculated using the Wald method. Significance (P values) of the estimated main effect (odds ratio) of derived allele expression behavior on recombination class was estimated using likelihood ratio tests. These models were fit to the datasets stratified by population/tissue type and data subsets for putatively disease-associated variants (defined by ClinVar and OMIM).
Recombination class as a function of ASE-eQTL concordance
We tested for eQTLs using an additive model with Matrix eQTL v2.1.0 () with a Bonferroni correction to identify eQTL sites and their associated eGenes. We removed the effects of hidden covariates using surrogate variable analysis, as described by Favé et al. (), where five surrogate variables (technical and biological) and variance associated to batch (technical) were used for correction. One significant eQTL per eGene was selected on the basis of having the smallest P value (FDR < 0.05), and we identified ASE SNPs within each eGenes. Ancestral alignments for the eQTL site were used to determine ancestral and derived alleles using the same method described above. Haplotype phasing was performed using SHAPEIT (v2.r64410) () using default parameters, as described in Favé et al. (). A total of 333 ASE SNPs and their associated eQTLs were used to identify ASE-eQTL concordance. ASE-eQTL haplotype concordance were defined as combinations that either (i) ASE-derived allele and eQTL-derived allele were on the same haplotype, the eQTL β value was negative, and ASE was underexpressing the derived allele or (ii) ASE-derived allele and eQTL-derived allele were on different haplotypes, the eQTL β value was positive, and ASE was underexpressing the derived allele. All other combinations were considered discordant. We modeled recombination class (CS versus HRR/Normal) as a function of concordance, stratified by CADD score, where CADD of >15 signifies likely deleterious variants. Random effects due to individual were included in the models to account for repeated measurements using the glmer() function with binomial link function in the lme4 package () in the R Project software (), with confidence intervals calculated using the Wald method. Significance (P values) of the estimated main effect (odds ratio) of derived allele expression behavior on recombination class was estimated using likelihood ratio tests.
Proportion of derived allele expression and CADD score
We modeled the proportion of the derived allele expression to the total expression as a function of CADD score using binomial mixed-effects regression. The primary model regressed the bivariate vector of the derived and total allele expression on CADD score with the inclusion of random effect due to individual, thus accounting for repeated measurements using the generalized linear mixed-effects regression function glmer() function with binomial link function from the lme4 package v1.1-27.1 () in the R Project software (). In addition to the primary model with the fixed main effect of interest (CADD score), we additionally stratified by population. Figures visualize the predictions from the models using ggpredict() in the ggeffects package v1.1.0 () in R, with 95% confidence bands. Significance (P values) of the estimated main effect (odds ratio) of CADD score in the primary model was estimated using a likelihood ratio test comparing “full” models with nested submodels, where the covariate of interest was removed.
Proportion of derived allele expression and CADD score with total expression as an interaction effect
In addition to the primary model with the fixed main effect of interest (CADD score), we additionally tested for the presence of an interaction effect between CADD score and the level of total expression (coded as a factor variable with three levels: low, medium, and high). These models were fit to the datasets stratified by recombination region (CS/HRR). Figures visualize the predictions from the models using ggpredict() in the ggeffects package v1.1.0 () in R, with 95% confidence bands. In addition, to determine whether the results observed with total expression as an interaction effect was explained by read coverage, we randomly sampled 250 reads, with replacement, from each position, regardless of expression class, and fit the mixed-effects regression models stratified by expression on the randomly sampled read proportions. The random sampling procedure was performed for 25 iterations, with the mean β, intercept, and 95% confidence intervals being compared after all iterations.
DE by ancestry and environment
DE per gene was computed using DESeq2 v1.32.0 () using the model ~Derived/Ancestral + ID, where Derived/Ancestral represents the read counts for each allele and ID represents the individual. Genes were retained only if (i) there were only one SNP per gene for each individual or multiple SNPs with ASE directionality concordance where the read counts were summed and (ii) with expression data from all populations. These restrictions resulted in a total of 1045 genes tested. The model was fit after stratifying for population ancestry or environment. Genes with FDR < 0.05 and where the absolute[log2(Derived) − log2(Ancestral)] > 0.25 were considered differentially expressed. Genes with DE in one or more populations were compared for a total of 133 genes.Africa was removed from analyses due to the small sample size causing concerns regarding large quantities of false negatives. Since the DE analysis calculates across all individuals in the population, we would only be comparing 19 data points, as that is the sample size that we have for African individuals. We would therefore observe less significant genes from African individuals due to low power at most genes. Thus, when comparing significant genes directly across the populations, we may incorrectly identify certain genes with ASE in other populations but not Africa, which may be an artifact of sample size and not biological. This was not a concern for previous analyses, because we calculated ASE within each individual; therefore, we are more confident with the ASE estimates because the read coverage that we have per individual has passed quality checks and thresholds, providing enough power to accurately test for ASE, regardless of sample size.DE analyses were also computed to compare ancestry and environmental effects on ASE using migrant French-Canadian individuals. DE analysis on ancestry-environment effects used the same model but stratified by ancestry-environment groups. We compared Saguenay ancestry living in Saguenay (n = 93), Saguenay ancestry living in Montreal (n = 27), and Saguenay ancestry living in Quebec City (n = 38) in addition to Montreal ancestry living in Montreal (n = 215), Montreal ancestry living in Saguenay (n = 4), and Montreal ancestry living in Quebec (n = 53). A total of 93 genes with expression across all geographic regions were used.DE analyses were compared investigating the effect size (log2 fold change) for each group, where a positive effect size signifies overexpression of the derived allele and negative values signify underexpression of the derived allele, and genes that were not significant were reassigned a value of 0. CADD scores for variants in the genes identified were also investigated to identify genes with highly deleterious variants (CADD > 15) that had altered ASE across populations.
Authors: Albert Tenesa; Pau Navarro; Ben J Hayes; David L Duffy; Geraldine M Clarke; Mike E Goddard; Peter M Visscher Journal: Genome Res Date: 2007-03-09 Impact factor: 9.043
Authors: Ana Prohaska; Fernando Racimo; Andrew J Schork; Martin Sikora; Aaron J Stern; Melissa Ilardo; Morten Erik Allentoft; Lasse Folkersen; Alfonso Buil; J Víctor Moreno-Mayar; Thorfinn Korneliussen; Daniel Geschwind; Andrés Ingason; Thomas Werge; Rasmus Nielsen; Eske Willerslev Journal: Cell Date: 2019-03-21 Impact factor: 41.582