Literature DB >> 25714927

The genetic architecture of gene expression levels in wild baboons.

Jenny Tung¹, Xiang Zhou¹, Susan C Alberts², Matthew Stephens¹, Yoav Gilad¹.

Abstract

Primate evolution has been argued to result, in part, from changes in how genes are regulated. However, we still know little about gene regulation in natural primate populations. We conducted an RNA sequencing (RNA-seq)-based study of baboons from an intensively studied wild population. We performed complementary expression quantitative trait locus (eQTL) mapping and allele-specific expression analyses, discovering substantial evidence for, and surprising power to detect, genetic effects on gene expression levels in the baboons. eQTL were most likely to be identified for lineage-specific, rapidly evolving genes; interestingly, genes with eQTL significantly overlapped between baboons and a comparable human eQTL data set. Our results suggest that genes vary in their tolerance of genetic perturbation, and that this property may be conserved across species. Further, they establish the feasibility of eQTL mapping using RNA-seq data alone, and represent an important step towards understanding the genetic architecture of gene expression in primates.

Entities: Chemical Disease Gene Species

Keywords: RNA-seq; allele-specific expression; baboon; evolutionary biology; expression quantitative trait locus; genomics; yellow baboon (papio cynocephalus)

Mesh：

Year: 2015 PMID： 25714927 PMCID： PMC4383332 DOI： 10.7554/eLife.04729

Source DB: PubMed Journal: Elife ISSN： 2050-084X Impact factor: 8.140

Introduction

Gene regulatory variation has been shown to make fundamental contributions to phenotypic variation in every species examined to date. This relationship has been demonstrated most clearly at the level of gene expression, which captures the integrated output of a large suite of other regulatory mechanisms. Variation in gene expression levels has been linked to fitness-related morphological, physiological, and behavioral variation in both lab settings and natural populations (e.g., Abzhanov et al., 2004; Hammock and Young, 2005; Tishkoff et al., 2006; Chan et al., 2010; reviewed in Wray, 2007), and is a robust biomarker of disease in humans (e.g., Golub et al., 1999; Borovecki et al., 2005). In addition, patterns of gene expression are often associated with signatures of natural selection (Rifkin et al., 2003; Denver et al., 2005; Gilad et al., 2006a; Blekhman et al., 2008), suggesting their functional importance even when their phenotypic significance remains unknown. In primates, the majority of research on the evolution of gene expression has concentrated on cross species comparisons, particularly using humans, chimpanzees, and rhesus macaques (Enard et al., 2002; Cáceres et al., 2003; Khaitovich et al., 2004; Gilad et al., 2005, 2006b; Haygood et al., 2007; Blekhman et al., 2008; Babbitt et al., 2010; Barreiro et al., 2010; Blekhman et al., 2010; Brawand et al., 2011; Perry et al., 2012). These studies—motivated by a long-standing argument about the importance of gene regulation in primate evolution (King and Wilson, 1975)—have been important for identifying patterns of constraint on gene expression phenotypes over long evolutionary time scales, and for suggesting candidate loci that might contribute to phenotypic uniqueness in humans or other species. For example, gene expression patterns associated with neurological development appear to have experienced an accelerated rate of change in primates relative to other mammals, with axonogenesis-related and cell adhesion-related genes accelerated specifically in the human lineage (Brawand et al., 2011). Similarly, differentially expressed genes in human liver are enriched for metabolic function (Blekhman et al., 2008), suggesting a potential molecular basis for arguments implicating dietary shifts in the emergence of modern humans (Kaplan et al., 2000; Ungar and Teaford, 2002; Wrangham, 2009). Adaptively relevant changes in gene expression levels across species implicate selection on gene expression phenotypes within species, and particularly within populations, the basic unit of evolutionary change. However, in contrast to cross species comparisons, we still know little about the genetic architecture of gene expression levels in natural nonhuman primate populations. No estimates of the heritability of gene expression traits are available, even for populations that have been intensively studied for many decades. We also do not know whether segregating genetic variation that affects gene expression is common or rare, how the effect sizes of such variants are distributed, or whether they carry a signature indicative of natural selection. If gene regulatory variation has indeed been key to primate evolution, as classic arguments suggest (King and Wilson, 1975), then large gaps therefore remain in our understanding of this process. Three primary reasons combine to account for the absence of such data. First, until relatively recently, the only feasible approach for measuring genome-wide gene expression levels on a population scale was microarray technology. This constraint limited the diversity of systems that could be assessed because cost-effective, commercially available arrays have only been developed for a handful of taxa. Second, genomic resources, especially detailed catalogs of known genetic variants (e.g., 1000 Genomes Project Consortium et al., 2010, International HapMap Consortium, 2005), are also limited to a small set of species. The lack of such resources creates major barriers to genome-scale studies of the genetics of gene expression in other organisms, which rely on complementary gene expression and genotype data. Finally, for many taxa, samples suitable for gene expression profiling can be challenging to collect. In nonhuman primates, for example, RNA samples are rarely available even for the most intensively studied natural populations. Recently, sequencing-based methods for measuring gene expression levels (e.g., RNA-seq) have eliminated the need for species-specific arrays. Comparative genomic studies using RNA-seq have thus vastly expanded the set of taxa for which genome-wide expression data are available (including primates: Brawand et al., 2011; Perry et al., 2012). Importantly, because fragments of expressed genes are resequenced many times in RNA-seq studies, data on genetic variation are also generated in the process. Although these data can be affected by technical biases, several studies have demonstrated the generally high reliability of genotypes inferred from RNA-seq reads (Perry et al., 2012; Piskol et al., 2013). Such data can provide important insight into genetic diversity in species for which little other information exists (Perry et al., 2012). Additionally, they provide the two ingredients necessary for mapping gene expression traits to genotype, at moderate cost and without the requirement for previously ascertained genetic variants. Here, we evaluate the potential for such work in an intensively studied wild primate population, the baboons (Papio cynocephalus) of the Amboseli basin in Kenya. 43 years of prior research on this population have established it as an important model for human social behavior, health, and aging (Alberts and Altmann, 2012), and have facilitated the development of protocols for collecting samples appropriate for gene expression analysis (Tung et al., 2009; Babbitt et al., 2012; Runcie et al., 2013). We generated RNA-seq data for 63 individually recognized members of the Amboseli study population. We used these data to explore the frequency, impact, and potential selective relevance of variants associated with variation in gene expression levels, using complementary expression quantitative trait locus (eQTL) mapping and allele-specific expression (ASE) approaches. We found evidence for abundant functional regulatory variation in the Amboseli baboons, and a surprising amount of power to detect these variants even with a modest sample size. We also found that functional variants are depleted in highly conserved genes, consistent with constraint on gene expression patterns. However, among genes with eQTL, we did not find strong support for a relationship between effect size and minor allele frequency. Such a relationship would be consistent with pervasive negative selection on gene expression phenotypes (i.e., selection against variants that produce large perturbations in gene expression levels) and has been suggested by work in humans (Battle et al., 2014). Finally, we used our data set to provide the first estimates of the heritability of gene expression levels in wild primates, including the relative contributions of cis-acting and trans-acting genetic variation.

Results

Functional regulatory variation is common in the Amboseli baboons

We obtained blood samples from 63 individually recognized adult baboons in the Amboseli population (Figure 1—figure supplement 1). From these samples, we produced a total of 1.89 billion RNA-seq reads (mean of 30.0 ± 4.5 s.d. million reads per individual, with 8.6 ± 1.8 s.d. million reads uniquely mapped to exons: Supplementary file 1A). On average, 67.2% of reads mapped to the most recent release of the baboon genome (Panu2.0), 69.2% of which could be assigned to a unique location. We used the set of uniquely mapped reads to estimate gene-wise gene expression levels for NCBI-annotated baboon RefSeq genes. After subsequent read processing and normalization steps (‘Materials and methods’, Figure 1—figure supplement 1 and Figure 1—figure supplement 2), we considered variation in gene expression levels for 10,409 genes expressed in whole blood (i.e., all genes for which we could test for cis-acting genetic effects on gene expression).

Figure 1—figure supplement 1.

Detailed workflow for gene expression level estimation.

DOI: http://dx.doi.org/10.7554/eLife.04729.004

Figure 1—figure supplement 2.

Elimination of GC bias via quantile normalization.

Each plot shows gene GC content (x-axis) vs the log of the ratio of the individual's RPKM for that gene to mean RPKM across all individuals. Data for three individuals are shown in pairs (A and B, C and D, E and F) for prior to (left) and after (right) quantile normalization.

DOI: http://dx.doi.org/10.7554/eLife.04729.005

We also used the RNA-seq reads to identify segregating genetic variants in the Amboseli population. We considered only high confidence sites that were variable within the Amboseli population (‘Materials and methods’; Figure 1—figure supplement 3). As expected (Piskol et al., 2013), these sites were highly enriched in annotated gene bodies (Figure 1; Figure 1—figure supplement 4). Based on parallel analyses applied to human RNA-seq data, we estimated approximately 97% of these sites to be true positives, and a median correlation between true genotypes and inferred genotypes of 98.7% (‘Materials and methods’; Figure 1—figure supplements 5–6). To identify putative expression quantitative trait loci (eQTL), we focused on variants that passed quality control filters, within 200 kb of the gene of interest. Such variants represent likely cis-acting eQTL, which are more readily identifiable in small sample sizes than trans-eQTL. To identify cases of allele-specific expression, which provides independent but complementary evidence for functional cis-regulatory variation, we focused on genes for which multiple heterozygotes were identified for variants in the exonic regions of expressed genes. We also required a minimum total read depth at exonic heterozygous sites of 300 reads (which should provide high power to detect modest ASE: Fontanillas et al., 2010), resulting in a total set of 2280 genes tested for ASE.

Figure 1—figure supplement 3.

Detailed workflow for SNP genotyping.

DOI: http://dx.doi.org/10.7554/eLife.04729.006

Figure 1.

Baboon eQTLs are enriched in and near genes.

The locations of all SNPs tested in the eQTL analysis are shown in gold relative to the 5′ most gene transcription start site (TSS) and the 3′ most gene transcription end site (TES) for all 10,409 genes. SNPs detected as eQTL are overplotted in blue, and are enriched, relative to all SNPs tested, near transcription start sites, transcription end sites, and within gene bodies. Gray shaded rectangle denotes the region bounded by the TSS and TES, with gene lengths divided into 20 bins for visibility (because the gene body is thus artificially enlarged, SNP density within genes cannot be directly compared with SNP density outside of genes). Note that SNPs that fall outside of one focal gene may fall within the boundaries of other genes. Inset: distribution of all SNPs tested relative to the location of genes, highlighting the concentration of SNPs in genes (the peak at the center of the plot). See Figure 1—figure supplements 1–14 for additional details on workflow, variant calling validation, location of all analyzed SNPs relative to genes, agreement between eQTL and ASE detection, and effects of local structure.

DOI: http://dx.doi.org/10.7554/eLife.04729.003

DOI: http://dx.doi.org/10.7554/eLife.04729.004

DOI: http://dx.doi.org/10.7554/eLife.04729.005

DOI: http://dx.doi.org/10.7554/eLife.04729.006

The locations of all SNPs tested in the eQTL analysis are shown in gold relative to the 5′ most gene transcription start site (TSS) and the 3′ most gene transcription end site (TES) for all 10,409 genes. The location of all SNPs tested in association with eQTL genes is overplotted in blue. Gray shaded rectangle denotes the region bounded by the TSS and TES, with gene lengths divided into 20 bins for visibility.

DOI: http://dx.doi.org/10.7554/eLife.04729.007

(A) Distribution of correlations between SNPs called using RNA-seq data and SNPs called independently by HapMap3 (n = 9919 variants). (B) Estimated homozygosity levels for n = 69 YRI individuals at the same set of sites; outliers (denoted with red stars) reflect those individuals with the lowest correlation between RNA-seq-based genotypes and HapMap3 genotypes. The four starred outliers in (B) include the three lowest accuracy individuals in the boxplots in (A).

DOI: http://dx.doi.org/10.7554/eLife.04729.008

PCA projection of genotype data from the RNA-seq-based pipeline and the HapMap3 data place individual samples very close together. (A) and (B) show the same data, but (B) zooms in on the central cluster for better visibility.

DOI: http://dx.doi.org/10.7554/eLife.04729.009

(A) Venn diagram depicting the overlap between genes with significant eQTL and ASE, among genes tested in both cases (note that the number of genes with eQTL is smaller in this figure than in the overall data set because we consider only the set of genes that were testable for both eQTL and ASE, n = 2280 instead of n = 10,409). Genes with significant eQTL are more likely to have significantly detectable ASE and vice-versa (n = 2280; p < 10−25). (B) eQTL SNPs in exonic regions that could also be tested for ASE reveal correlated effect sizes (n = 123; p < 10−20). (C) Similarly, ASE SNPs exhibit effect sizes that are correlated with evidence for eQTL at the same sites (n = 510; p < 10−45).

DOI: http://dx.doi.org/10.7554/eLife.04729.010

(A) Detection of ASE is favored for genes with higher expression levels (p = 3.99 × 10−209), (B) whereas detection of eQTL is favored for genes with greater cis-regulatory SNP density (p = 1.05 × 10−73).

DOI: http://dx.doi.org/10.7554/eLife.04729.011

Boxplot differences between eQTL identified in the YRI data set using chip-based genotype data vs RNA-seq-based genotype data for (A) gene expression levels in RPKM (Wilcoxon test p = 6.53 × 10−9); (B) conservation levels measured by average phyloP per gene (p = 0.707); (C) conservation levels measured using Homologene conservation scores (p = 0.600); and (D) magnitude of the eQTL effect size (p = 0.137).

DOI: http://dx.doi.org/10.7554/eLife.04729.012

(A) Difference in the magnitude of ASE estimated for pairs of tested sites (i.e., absolute difference of the absolute values of z-scores), by distance between sites. (B) Difference in the magnitude of ASE estimated for pairs of tested sites for genes with significant ASE only, where one site in the pair is the site with the best ASE support for the gene. In both plots, distance categories reflect the range from the previous category to the labeled max value.

DOI: http://dx.doi.org/10.7554/eLife.04729.013

The locations of all eQTL SNPs (n = 1787) identified in the main eQTL analysis are shown in gold relative to the 5′ most gene transcription start site (TSS) and the 3′ most gene transcription end site (TES). eQTL SNPs detected in a parallel analysis controlling for local structure (n = 1583) are overplotted in blue. Gray shaded rectangle denotes the region bounded by the TSS and TES, with gene lengths divided into 20 bins for visibility. Note that SNPs that fall outside of one focal gene may fall within the boundaries of other genes. Inset: Quantile–quantile plot of eQTL locations in models that do and do not control for local structure (Kolmogorov-Smirnov test, p = 0.577).

DOI: http://dx.doi.org/10.7554/eLife.04729.014

DOI: http://dx.doi.org/10.7554/eLife.04729.015

Mean coverage by genotype class for (A) all SNPs tested in the baboon eQTL analysis (n = 64,432), and (B) SNPs identified as eQTL (n = 1693). QQ plot of mean coverage in homozygotes for the reference allele vs homozygotes for the alternate allele for (C) all SNPs and (D) SNPs identified as eQTL. The magnitude of increased coverage in reference allele homozygotes indicates the degree of systematic reference allele mapping bias (dashed line shows the expectation for no mapping bias). Reference allele homozygotes tend to have higher coverage, on average, than alternate allele homozygotes (K-S test: p < 2.2 × 10−16 for all SNPs; p = 3.9 × 10−5 for eQTL SNPs), suggesting some degree of mapping bias; however the effect is actually smaller for eQTL SNPs than for all SNPs (K-S D = 0.167 for all SNPs; K-S D = 0.084 for eQTL SNPs).

DOI: http://dx.doi.org/10.7554/eLife.04729.016

DOI: http://dx.doi.org/10.7554/eLife.04729.017

Figure 1—figure supplement 4.

Location of analyzed SNPs relative to genes.

The locations of all SNPs tested in the eQTL analysis are shown in gold relative to the 5′ most gene transcription start site (TSS) and the 3′ most gene transcription end site (TES) for all 10,409 genes. The location of all SNPs tested in association with eQTL genes is overplotted in blue. Gray shaded rectangle denotes the region bounded by the TSS and TES, with gene lengths divided into 20 bins for visibility.

DOI: http://dx.doi.org/10.7554/eLife.04729.007

Baboon eQTLs are enriched in and near genes.

Detailed workflow for gene expression level estimation.

DOI: http://dx.doi.org/10.7554/eLife.04729.004

Elimination of GC bias via quantile normalization.

Detailed workflow for SNP genotyping.

DOI: http://dx.doi.org/10.7554/eLife.04729.006

Location of analyzed SNPs relative to genes.

The locations of all SNPs tested in the eQTL analysis are shown in gold relative to the 5′ most gene transcription start site (TSS) and the 3′ most gene transcription end site (TES) for all 10,409 genes. The location of all SNPs tested in association with eQTL genes is overplotted in blue. Gray shaded rectangle denotes the region bounded by the TSS and TES, with gene lengths divided into 20 bins for visibility. DOI: http://dx.doi.org/10.7554/eLife.04729.007

Accuracy of genotype calls for SNPs independently typed in HapMap3.

PCA projection of YRI samples using the RNA-seq-based pipeline vs independently typed SNPs.

Agreement between eQTL and ASE approaches for identifying functional variants.

Power to detect ASE vs eQTL.

Characteristics of YRI eQTL identified in the RNA-seq vs conventional pipelines.

Differences in the magnitude of ASE vs distance between sites.

Location of eQTL SNPs relative to genes with and without controlling for local structure.

Number of eQTL identified by PCs removed from the gene expression data set.

DOI: http://dx.doi.org/10.7554/eLife.04729.015

Coverage by genotype call.

Detection of ASE is not dependent on number of heterozygotes, conditional on total read depth.

SNPs within the set tested for ASE (n = 8145) were divided into deciles based on total read depth. The evidence for a relationship (−log10 of the p-value from a Wilcoxon test) between number of heterozygous individuals at each site and detection of significant ASE is shown on the y-axis for each decile. Dashed line shows a nominal significance threshold of p = 0.01. Blue numbers above each point show the number of sites that fall within the decile; purple numbers below each point show the maximum total read depth for that decile (minimum total read depth is the maximum depth for the previous decile, or 300 for the lowest decile). DOI: http://dx.doi.org/10.7554/eLife.04729.017 Both analyses converged to reveal extensive segregating genetic variation affecting gene expression levels in the Amboseli population. At a 10% false discovery rate, we identified eQTL for 1787 (17.2%) of the genes we analyzed, and evidence for ASE for 510 (23.4%) of tested genes. Consistent with reports in humans (e.g., Veyrieras et al., 2008; Pickrell et al., 2010a), eQTL were strongly enriched near gene transcription start sites and in gene bodies (Figure 1; controlling for the background distribution of sites tested, which were also enriched in and around genes). Within gene bodies, eQTL were particularly likely to be detected near transcription end sites; this potentially reflects enrichment in 3′ untranslated regions, which are poorly annotated in baboon. Also as expected, genes with eQTL were more likely to exhibit significant ASE and vice-versa (hypergeometric test: p < 10−25; Figure 1—figure supplement 7). The magnitude and direction of ASE and eQTL were significantly correlated when an eQTL SNP could also be assessed for ASE (n = 123 genes; r = 0.719, p < 10−20, Figure 1—figure supplement 7), and when ASE SNPs were assessed as eQTL (n = 510 genes; r = 0.575, p < 10−45, Figure 1—figure supplement 7). Detection of ASE was most strongly favored for highly expressed genes (i.e., higher RPKM: Wilcoxon test: p < 10−208; Figure 1—figure supplement 8), whereas detection of eQTL was most strongly favored for genes with high local SNP density (p < 10−72; Figure 1—figure supplement 8).

Figure 1—figure supplement 7.

Agreement between eQTL and ASE approaches for identifying functional variants.

DOI: http://dx.doi.org/10.7554/eLife.04729.010

Figure 1—figure supplement 8.

Power to detect ASE vs eQTL.

DOI: http://dx.doi.org/10.7554/eLife.04729.011

Increased power to detect eQTL in baboons relative to humans

The number and effect sizes of the eQTL we detected indicate that our power to detect eQTL in the Amboseli population was surprisingly high, especially given that our genotyping data set was limited only to those sites represented in RNA-seq data (i.e., primarily within transcribed regions of moderately to highly expressed genes). Further, while thousands of cis-eQTL have been mapped in single human populations, doing so has generally required sample sizes several fold larger than ours (Lappalainen et al., 2013; Battle et al., 2014). To provide a more informative estimate of the difference in power to detect eQTL in baboons relative to humans, we applied the same mapping, data processing, variant calling, and eQTL modeling pipeline to a similarly sized RNA-seq data set on 69 Yoruba (YRI) HapMap samples, in which samples were sequenced to a similar depth (Pickrell et al., 2010a). Using our approach for estimating and modeling the gene expression data, but obtaining the genotype data from an independent array platform, we could identify 700 genes with significant eQTL in the YRI data set at a 10% FDR. Approximately half (51%) could be recovered if we only focused on SNPs in transcribed regions. This number (n = 357) therefore reflects the likely theoretical limit of detection for performing eQTL mapping in which SNPs are called based on RNA-seq data. Indeed, when eQTL mapping for the YRI was conducted using genotype data obtained from RNA-seq reads (i.e., the same pipeline used for the baboons), we identified 290 genes with eQTL (41.4% of those identified using independently collected genotype data). eQTL identified in the RNA-seq pipeline do not differ from those identified only in the conventional pipeline in either effect size or in surrounding sequence conservation, but do tend to fall in more highly expressed genes (Wilcoxon test on RPKM values: p = 6.53 × 10−9; Figure 1—figure supplement 9), suggesting that sequencing coverage considerations reduce the number of identifiable eQTL below the theoretical maximum. The RNA-seq-based pipeline therefore reduces the number of genes with detectable eQTL by 50–60%, suggesting that if genotyping array data had been available for the baboons, we might have identified eQTL for ∼3500–4000 genes, comparable to results from human data sets with more than 350 samples (Lappalainen et al., 2013). To better understand the reasons behind this difference, we investigated three possible explanations.

Figure 1—figure supplement 9.

Characteristics of YRI eQTL identified in the RNA-seq vs conventional pipelines.

DOI: http://dx.doi.org/10.7554/eLife.04729.012

Shifts in the minor allele frequency spectrum

We observed that the minor allele frequency (MAF) spectrum of variants called in the baboon data set included proportionally more intermediate frequency variants and proportionally fewer low frequency variants than in the human data set (Figure 2A, inset). To investigate the degree to which this shift conferred greater power to detect eQTL in the baboons, we simulated eQTL for 10% of the genes in the study by randomly choosing a SNP near each of these genes.

Figure 2.

Power to detect eQTL in the Amboseli baboons compared to the HapMap YRI population.

(A) Simulated eQTL data sets demonstrate that the baboon data set has greater power to detect eQTL (at a 10% FDR threshold) when eQTL are simulated based on effect size (solid lines and triangles) but not when eQTL are simulated based on proportion of variance in gene expression levels explained (PVE: dashed lines and circles). This result likely stems from differences in the minor allele frequency (MAF) spectrum between baboons and YRI (inset), which favors eQTL mapping in the baboons; simulations based on effect size are sensitive to MAF, but simulations based on PVE are not. (B) Masking the simulated eQTL SNP demonstrates that the baboon data set has greater power to detect eQTL due to both increased cis-regulatory SNP density and more extended LD (inset). Subsampling the SNP density in the baboon data set to the level of the YRI data set reduces the difference in power but does not remove it completely. In B, all results are shown for PVE-based simulations to exclude the effects of the MAF. See Figure 4—figure supplement 1 for power simulations for masked SNPs based on effect size.

DOI: http://dx.doi.org/10.7554/eLife.04729.018

DOI: http://dx.doi.org/10.7554/eLife.04729.019

Power to detect eQTL in the Amboseli baboons compared to the HapMap YRI population.

Figure 4—figure supplement 1.

PVE explained by demographic and early environmental variables.

QQ plots of PVE explained by a variable of interest vs PVE explained by that variable with permuted data, for (A) age; and (B) maternal social connectedness to males (SCI-M). Bottom panels show the difference between evidence for significant PVE by sex for (C) genes on autosomes vs (D) genes on the X chromosome (bottom right).

DOI: http://dx.doi.org/10.7554/eLife.04729.023

DOI: http://dx.doi.org/10.7554/eLife.04729.018

Relationship between power to detect eQTL and simulated effect size, when the true eQTL is masked.

Purple line shows the baboon data; pink line shows the baboon data with SNP density subsampled to match the YRI; orange line shows the YRI data. Masking the simulated eQTL SNP demonstrates that the baboon data set has greater power to detect eQTL due to both increased cis-regulatory SNP density and more extended LD. Subsampling the SNP density in baboon to the level of the YRI data set reduces the difference in power but does not remove it completely. DOI: http://dx.doi.org/10.7554/eLife.04729.019 We did so in two ways. First, we simulated the effect size of the eQTL, with possible effect sizes ranging from 0.25 to 2.5, in intervals of 0.25 (effect sizes are relative to a standard normal distribution). The power to detect an eQTL of a given effect size is contingent on the relative representation of different genotype classes in a population, and hence MAF (larger MAFs produce a more balanced set of alternative genotypes, and thus more power). Second, we simulated the proportion of variance in gene expression levels (PVE) explained by the eQTL, with possible PVE values ranging from 5% to 50%, in intervals of 5%. In this case, power to detect an eQTL does not depend on MAF because simulating the PVE directly integrates across the combined impact of effect size and MAF (a simulated high PVE eQTL with low MAF implies a large effect size variant). Thus, the impact of the MAF spectrum on the power to detect eQTL is reflected in the differences in power between the baboon data set and the YRI data set in the effect size-based vs the PVE-based simulations. In all cases, we calculated power as the proportion of genes with simulated eQTL recovered at a 10% FDR. In PVE-based simulations, power to detect simulated eQTL was greater in the YRI data set (Figure 2A, dashed orange line vs dashed purple line), although this advantage disappeared when the YRI data set was subsampled to the same size as the baboon data set (Figure 2A, dashed gray line). However, the baboon data set provided more power to detect eQTL than the YRI data set (whether subsampled or not) when simulations were based on effect size, where power scales with MAF (Figure 2A, solid lines). Based on these differences, we estimate that the power to identify an eQTL of effect size equal to the mean estimated beta in baboons (0.96), is increased in the Amboseli baboons by approximately 1.34-fold (Figure 2A, solid purple line vs solid orange line) as a function of differences in the MAF spectrum alone.

Differences in genetic diversity and linkage disequilibrium

Because our RNA-seq-based approach does not identify variants outside of transcribed regions, causal SNPs were probably often not typed. To quantify the power to detect eQTL under this scenario, we again simulated eQTL among genes in the baboon and YRI data sets, but masked the causal sites. Doing so revealed much greater power to identify eQTL in baboons than in humans, across all values of simulated PVE or effect size (Figure 2B; Figure 2—figure supplement 1). One possible explanation for this observation stems from increased genetic diversity in the baboons compared to the YRI. Indeed, in baboons we tested an average of 45.4 (±57.0 s.d.) genetic variants for each gene, whereas applying the same pipeline in YRI yielded an average of 20.3 (±21.4 s.d.) testable variants per gene. An alternative explanation relates to patterns of LD, which we estimate to decay somewhat more slowly in the baboons (Figure 2B, inset). Higher SNP density in baboons increases the likelihood that, when a causal SNP is not typed, a nearby SNP will be available that tags it. Longer range LD suggests that a given SNP could also tag distant causal variants more effectively.

Figure 2—figure supplement 1.

Relationship between power to detect eQTL and simulated effect size, when the true eQTL is masked.

DOI: http://dx.doi.org/10.7554/eLife.04729.019

To assess the contributions of SNP density and LD, we refined our simulations by first thinning the SNP density in the baboons to match SNP density in the YRI, and again masking the simulated causal eQTL. As expected, reducing genetic diversity in the baboons reduced the power to detect genes with a true eQTL (Figure 2B, purple dashed line vs pink dashed line). However, it did not completely account for the difference between the human population and the baboon population, suggesting that LD patterns probably contribute to higher eQTL mapping power in baboons as well as SNP density. Specifically, for an eQTL that explains 28% of the variance in gene expression levels (the mean PVE detected in baboons for genes with significant eQTL), we estimate that SNP density and LD effects increase power by 1.21-fold (Figure 2B, purple dashed line vs pink dashed line) and 1.43-fold (Figure 2B, pink dashed line vs orange dashed line), respectively, when causal SNPs are not typed. Together, our simulations suggest that the MAF spectrum, genetic diversity, and LD patterns increase the number of genes with detectable eQTL in baboons vs the YRI by 2.35-fold overall (1.34× from the MAF, 1.21× from SNP density effects, and 1.43× from LD effects). Further, considering that the effect size estimates in baboons tended to be larger than in the YRI (mean of 0.96 in baboons vs mean of 0.80 in YRI), the actual fold increase estimated from simulations is approximately 6-fold (Figure 4—figure supplement 1: ratio of purple vs orange lines at these effect sizes). This estimate is remarkably consistent with empirical results from our comparison of the real baboon and YRI data, in which we identified 6.16-fold the number of eQTL in the baboons. One possibility is that this difference arises from a history of known admixture in Amboseli between the dominant yellow baboon population and immigrant anubis baboons (Papio anubis: Alberts and Altmann, 2001; Tung et al., 2008). Thus, it might reflect the difference between an admixed population and an unadmixed population rather than a difference between species. However, this explanation seems unlikely because evidence for ASE does not extend further from tested genes in baboons compared to YRI (Figure 1—figure supplement 10), and because adding controls for local (chromosome-specific) structure when testing for eQTL still results in a large excess of eQTL detected in the baboon data set (∼7× higher than in YRI: ‘Materials and methods’ and Figure 1—figure supplement 11)

Figure 1—figure supplement 10.

Differences in the magnitude of ASE vs distance between sites.

DOI: http://dx.doi.org/10.7554/eLife.04729.013

Figure 1—figure supplement 11.

Location of eQTL SNPs relative to genes with and without controlling for local structure.

DOI: http://dx.doi.org/10.7554/eLife.04729.014

Mixed evidence for natural selection on gene expression levels

Interestingly, we found that genes harboring eQTL in baboons were also more likely to have detectable eQTL in the YRI (hypergeometric test, p = 2.39 × 10−7). Given the sample size limitations of the data sets we considered, this overlap suggests that large effect eQTL tend to be nonrandomly concentrated in specific gene orthologues. This pattern could arise if the regulation of some genes has been selectively constrained over long periods of evolutionary time, whereas others have been more permissible to genetic perturbation. Indeed, we found that the mean per-gene phyloP score calculated based on a 46-way primate comparison was significantly reduced (reflecting less conservation) for genes with detectable eQTL in both species, and greatest for genes in which eQTL were not detected in either case (p < 10−53; Figure 3A). We obtained similar results using phyloP scores based on a 100-way vertebrate comparison (p < 10−21; Figure 3—figure supplement 1).

Figure 3.

Mixed evidence for negative selection on variants affecting gene expression level.

(A) Genes that harbor detectable eQTL in baboons, the YRI, or both are more likely to be conserved across long stretches of evolutionary time, based on mean phyloP scores in a 46-way primate genome comparison (n = 7268; p < 10−53). (B) These genes are also more likely to be lineage-specific, based on Homologene annotations (n = 7065; p = 1.78 × 10−8). (C) Although we detect a strong negative correlation between eQTL effect size and eQTL minor allele frequency, in support of pervasive selection against alleles with large effects on gene expression levels, this correlation also appears when simulating constant eQTL effect sizes, suggesting winner's curse effects. See Figure 3—figure supplement 1 for phyloP results based on a 100-way vertebrate genome comparison.

DOI: http://dx.doi.org/10.7554/eLife.04729.020

Genes with eQTL in both data set or one data set are less conserved across vertebrates than genes for which no eQTL were detected (n = 7,268, p < 10−19).

DOI: http://dx.doi.org/10.7554/eLife.04729.021

Figure 3—figure supplement 1.

Correlation between eQTL detection and mean phyloP scores based on 100-way vertebrate comparison.

Genes with eQTL in both data set or one data set are less conserved across vertebrates than genes for which no eQTL were detected (n = 7,268, p < 10−19).

DOI: http://dx.doi.org/10.7554/eLife.04729.021

Mixed evidence for negative selection on variants affecting gene expression level.

Correlation between eQTL detection and mean phyloP scores based on 100-way vertebrate comparison.

Genes with eQTL in both data set or one data set are less conserved across vertebrates than genes for which no eQTL were detected (n = 7,268, p < 10−19). DOI: http://dx.doi.org/10.7554/eLife.04729.021 eQTL were more likely to be identified for genes with higher genetic diversity (Figure 1—figure supplement 8), which may account for the relationship between phyloP score and eQTL across species: highly conserved genes are less likely to contain many variable sites. More conserved genes also tend to have slightly lower average minor allele frequencies (p = 0.002), which might reduce the power to detect eQTL (although the effect size is small: r2 = 0.001). However, genes with eQTL in both species were also less likely to have orthologues in deeply diverged species, based on conservation in Homologene (β = −0.036, p = 1.78 × 10−8; Figure 3B). Genetic diversity within the baboons is very weakly correlated with Homologene conservation (r2 = 0.004) and uncorrelated with average minor allele frequency (p = 0.38). Thus, sequence-level conservation scores and depth of homology across species combine to suggest that eQTL—or at least those with relatively large effect sizes—are least likely to be detected for strongly conserved loci, and most likely to be detected for lineage-specific, rapidly evolving genes. Consistent with this idea, genes involved in basic cellular metabolic processes were under-enriched among the set of genes with eQTL in both species, and enriched among the set of genes for which no eQTL were detected in either species (Supplementary file 1B–C). The set of genes with eQTL in either or both species, on the other hand, were enriched for loci involved in antigen processing, catalytic activity, and interaction with the extracellular environment (e.g., receptors, membrane-associated proteins). Widespread selective constraint on gene expression levels has been suggested in previous eQTL analyses in humans, with evidence supplied by a strong negative correlation between minor allele frequency and eQTL effect size (Battle et al., 2014). This pattern could arise if selection acts against large genetic perturbations, such that variants of large effect would be present only at low frequencies. Consistent with this idea, plotting eQTL effect size vs MAF in the baboons results in a very strong, highly significant negative correlation (r = −0.723, p < 10−280; Figure 3C), with no large effect eQTL detected at higher MAFs. However, such a relationship could also be a consequence of the so-called winner's curse (in which sampling variance leads to upwardly biased effect size estimates: Zöllner and Pritchard, 2007) because the degree of bias in effect size estimation is itself negatively correlated with MAF. Indeed, when we simulated sets of eQTL with constant small effect sizes (β = 0.75, close to the mean effect size detected for SNPs with MAF ≥0.4), we found that the relationship between estimated effect size and MAF among detected eQTL almost perfectly recapitulated the observed negative correlation. Hence, the correlation between estimated eQTL effect size and MAF in the baboons does not provide strong support for widespread negative selection on gene expression phenotypes within species. We note, however, that our sample size of individuals is much smaller than that used for a similar analysis in humans (Battle et al., 2014: n = 922 individuals), and larger sample sizes should attenuate winner's curse effects.

Genetic and environmental contributions to gene expression variation in wild baboons

Finally, we took advantage of our data set to generate the first estimates of genetic, demographic (age and sex), and environmental contributions to gene expression variation in wild nonhuman primates (Supplementary file 1D). While our limited sample size leads to high variance around estimates for any individual gene, the median estimates across genes should be unbiased (Zhou et al., 2013), so we concentrated on these overarching patterns. We focused specifically on three social environmental variables of known importance in this population, all of which have been extensively investigated as models for human social environments. These were: (i) early life social status, which predicts growth and maturation rates (Altmann and Alberts, 2005; Charpentier et al., 2008); (ii) maternal social connectedness to other females, which predicts both adult lifespan and the survival of a female's infants (Silk et al., 2003, 2009, 2010; Archie et al., 2014); and (iii) maternal social connectedness to males, based on recent evidence that heterosexual relationships have strong effects on survival as well (Archie et al., 2014). Overall, we found that genetic effects on gene expression levels tended to be far more pervasive than demographic and environmental effects. Specifically, the median additive genetic PVE was 28.4%, similar to, or slightly greater than, estimates from human populations (Monks et al., 2004; McRae et al., 2007; Emilsson et al., 2008; Price et al., 2011; Wright et al., 2014). We applied a Bayesian sparse linear mixed model (BSLMM: Zhou et al., 2013) to further partition this additive genetic PVE into two components: a component attributable to cis-SNPs (here, all SNPs within 200 kb of a gene) and a component attributable to trans-SNPs (all other sites in the genome). Again similar to humans (Price et al., 2011; Wright et al., 2014), we found that more of the additive genetic PVE is explained by the trans component (median PVE = 23.8%) than the cis component (median PVE = 2.9%) (Figure 4). Unsurprisingly, we estimated a larger cis-acting component for genes in which functional cis-regulatory variation was detected in our previous analysis (median PVE = 10.2% among eQTL genes and median PVE = 5.0% among ASE genes).

Figure 4.

Genetic contributions to variance in gene expression levels in wild baboons.

Proportion of variance in gene expression levels estimated for all genes, genes without detectable eQTL, and genes with detectable eQTL. Additive genetic effects on gene expression variation, especially cis-acting effects, are larger for eQTL genes than for other genes. See Figure 4—figure supplements 1–3 for related results on percent variance explained by genetic, environmental, and demographic variables and results using an alternative set of SNPs for estimating p.

DOI: http://dx.doi.org/10.7554/eLife.04729.022

DOI: http://dx.doi.org/10.7554/eLife.04729.023

DOI: http://dx.doi.org/10.7554/eLife.04729.024

DOI: http://dx.doi.org/10.7554/eLife.04729.025

Genetic contributions to variance in gene expression levels in wild baboons.

PVE explained by demographic and early environmental variables.

Distribution of PVE explained by additive genetic variance, age, sex, and maternal social connectedness to males across all genes.

DOI: http://dx.doi.org/10.7554/eLife.04729.024

Genetic contributions to variance in gene expression levels, with ptrans based on SNPs on other chromosomes only.

DOI: http://dx.doi.org/10.7554/eLife.04729.025 In contrast to the substantial genetic effects we detected, the median PVE explained by age and sex were 1.89% and 0.82%, respectively (Figure 4—figure supplements 1–2). The distribution of PVE explained by age was significantly greater than expected by chance (Kolmogorov–Smirnov test on binned PVEs, in comparison to permuted data: p < 10−11), whereas that explained by sex was not (p = 0.100); large sex effects tended to be constrained to a small set of genes on the X chromosome (Figure 4—figure supplement 1). Of the early environmental variables we investigated, only maternal social connectedness to males explained more variance in gene expression levels than expected by chance (p = 4.19 × 10−3), with a median PVE of 1.9%. Notably, while social connectedness to males (i.e., heterosexual bonds) and social connectedness to females (i.e., same-sex bonds) are both known predictors of longevity in the Amboseli baboons, previous analyses suggest that their effects are largely independent (Archie et al., 2014). Our result extends this observation to the early life effects of maternal social connectedness on variance in gene expression levels. Taken together, our data suggest that while almost all genes are influenced by genetic variation, the effects of demographic and environmental parameters are generally modest for any single aspect of the environment. However, in at least some cases, we find evidence that early environmental effects on gene expression levels appear to persist across the life course, as has previously been demonstrated in laboratory settings and in response to severe early adversity in humans (e.g., Weaver et al., 2004; Miller et al., 2009; Cole et al., 2012).

Discussion

Much of what we know about genetic contributions to variation in gene expression levels in primates (and vertebrates more generally) come from the extensive body of research on humans. However, increasing evidence indicates that humans are demographically unusual: compared to other primates, humans exhibit low levels of neutral genetic diversity and a low long-term effective population size (Chen and Li, 2001; Hernandez et al., 2007; Perry et al., 2012). Further, humans are distinguished from other primates by recent explosive population growth (Keinan and Clark, 2012; Tennessen et al., 2012). While late Pleistocene population expansion has been suggested for some nonhuman primates, including chimpanzees and Chinese-origin rhesus macaques (Hernandez et al., 2007; Wegmann and Excoffier, 2010), none have undergone the extreme levels of population increase that characterized humans. Indeed, evidence from microsatellite data suggests that the long-term effective population size of baboons actually may have contracted during this period (Storz et al., 2002). These differences are not simply of historical interest, but also important for understanding the genetic architecture of traits measured in the present day. Differences in demographic history not only affect overall levels of genetic variation and the minor allele frequency spectrum, but also the mean effect size of sites that contribute to phenotypic variation (Lohmueller, 2014). Interestingly, demographic history does not impact overall trait heritability (Lohmueller, 2014; Simons et al., 2014), perhaps explaining why we estimated mean additive genetic PVEs for gene expression levels in baboons that are similar to those estimated for humans. However, demographic history can influence the power to detect individual genetic contributions to phenotypic variation. Large-scale population expansion of the type that occurred in human history appears to reduce power to identify genotype-phenotype correlations for fitness-related traits (Lohmueller, 2014). This observation may account, in part, for our ability to identify many more functional regulatory variants in the baboons than we expected based on previous studies in humans. However, while our analysis extends previous observations that large effect eQTL are nonrandomly distributed, we found mixed evidence for widespread negative selection on gene expression levels. Specifically, within the baboons alone, we found that the negative relationship between eQTL effect size and minor allele frequency was explicable based on winner's curse effects alone. Thus, increased power to identify functional regulatory variants in the baboons is probably not due to pervasive associations between gene expression levels and fitness. In contrast, stronger evidence for selection on gene expression patterns stems from our cross species comparisons. In particular, we observed that genes with eQTL in baboons significantly overlapped with genes with eQTL in humans, and that these genes as a class also tended to be less constrained at the sequence-level (consistent with observations for analyses of cis-eQTL in humans alone: Popadin et al., 2014). This result suggests that genes vary in their tolerance of functional regulatory genetic variation, and, intriguingly, that gene-specific robustness to genetic perturbation may be a conserved property across species. Because no comparable data are yet available for other large mammal populations, including for other baboons, it is unclear whether our results are typical or instead a consequence of the Amboseli population's own unique history. In particular, the population has experienced recent admixture between yellow baboons, the dominant taxon, and closely related anubis baboons (P. anubis) (Alberts and Altmann, 2001; Tung et al., 2008). Admixture, which appears to be relatively common in natural populations (Mallet, 2005), can have important consequences for genetic diversity and LD patterns. While it appears to have had a modest impact on the relative ability to map gene expression phenotypes in baboons vs the YRI data set, comparison to a non-admixed baboon population could help resolve this question further. More generally, our results encourage further investigation of the relationship between demography and trait genetic architecture in other populations, as has been suggested for humans (Lohmueller, 2014) but could also be profitably extended to nonhuman model systems. Such comparisons would provide an empirical basis for testing predicted relationships between demographic history and the power to identify genotype-phenotype associations. From an applied perspective, they could also help identify animal models that favor more highly powered association mapping studies, a strategy that has already been heavily exploited in domestic dogs (Karlsson et al., 2007; Karlsson and Lindblad-Toh, 2008) and suggested for rhesus macaques (Hernandez et al., 2007). While the same sites will probably rarely be associated with the same traits across species, this strategy could help identify molecular mechanisms that are conserved across humans and animal models (e.g., Lamason et al., 2005). Comparisons that use matched sample types will be particularly informative: our study compared eQTL detection in whole blood (from the baboons) with eQTL from lymphoblastoid cell lines (YRI), which exhibit highly correlated, but not identical, patterns of overall gene expression (Spearman's rho = 0.645, p < 1 × 10−16)—these differences could affect rates of eQTL detection as well. Finally, our data—the first profile of genome-wide gene expression levels in a wild primate population—serve as a useful proof of principle of the ability to concurrently generate genome-wide gene expression phenotype and genotype data, and to relate them to each other using eQTL and ASE approaches. Intensively studied natural primate populations—some of which have been studied continuously for 30 or more years—have emerged as important phenotypic models for human behavior, health, and aging. The approach we used here provides a way to leverage these models for complementary genetic studies as well, especially if eQTL prove to be strongly enriched for sites associated with other traits, as in humans (Nicolae et al., 2010). Although preliminary, our results highlight the increasing feasibility of integrating functional genomic data with phenotypic data on known individuals in the wild. For example, our data set revealed a number of genes in which variation in gene expression levels could be mapped to an identifiable eQTL, validated using an ASE approach, and also linked to early life environmental variation. Such cases suggest the potential for future investigations of the molecular basis of persistent environmental effects, including whether genetic and environmental effects act additively or interact.

Materials and methods

Study subjects and blood sample collection

Study subjects were 63 individually recognized adult members (26 females and 37 males) of the Amboseli baboon population. All study subjects were recognized on sight by observers based on unique physical characteristics. To obtain blood samples for RNA-seq analysis, each baboon was anesthetized with a Telazol-loaded dart using a handheld blowpipe. Study subjects were darted opportunistically between 2009 and 2011, avoiding females with dependent infants and pregnant females beyond the first trimester of pregnancy (female reproductive status is closely monitored in this population, and conception dates can be estimated with a high degree of accuracy). Following anesthetization, animals were quickly transferred to a processing site distant from the rest of the group. Blood samples for RNA-seq analysis were collected by drawing 2.5 ml of whole blood into PaxGene Vacutainer tubes (Qiagen, Valencia, CA), which contain a lysis buffer that stabilizes RNA for downstream use. Following sample collection, study subjects were allowed to regain consciousness in a covered holding cage until fully recovered from the effects of the anesthetic. They were then released within view of their social group; all subjects promptly rejoined their respective groups upon release, without incident. Blood samples were stored at approximately 20°C overnight at the field site. Samples were then shipped to Nairobi the next day for storage at −20°C until transport to the United States and subsequent RNA extraction.

Gene expression profiling using RNA-seq

For each RNA sample (one per individual), we constructed an RNA-seq library suitable for measuring whole genome gene expression using Dynal bead poly-A mRNA purification and a standard Illumina RNA-seq prep protocol. Each library was randomly assigned to one lane of an Illumina Genome Analyzer II instrument and sequenced to a mean depth of 30 million 76-base pair reads (±4.5 million reads s.d., Supplementary file 1A). The resulting reads were mapped to the baboon genome (Panu2.0) using the efficient short-read aligner bwa 0.5.9 (Li and Durbin, 2009), with a seed length of 25 bases, a maximum edit distance of two mismatches in the seed, a read trimming quality score threshold of 20, and the default maximum edit distance (4% after trimming). To recover reads that spanned putative exon–exon junctions, and therefore could not be mapped directly to the genome, we used the program jfinder on reads that did not initially map (Pickrell et al., 2010b). Finally, we filtered the resulting mapped reads data for low quality reads (quality score <10) and for reads that did not map to a unique position in the genome. To assign reads to genes, we used the RefSeq exon annotations for Panu 2.0 (ref_Panu_2.0_top_level.gff3, downloaded September 6, 2012). We considered the total read counts for each gene and individual as the sum of the number of reads for that individual that overlapped the union of all exon base pairs assigned to a given gene. In downstream analyses, we considered only highly expressed genes that had non-zero counts in more than 10% individuals, and that had mean read counts greater than or equal to 10 (excluding the gene for beta-globin). We then performed quantile normalization across samples followed by quantile normalization for each gene individually, resulting in estimates of gene expression levels for each gene that were distributed following a standard normal distribution. This procedure effectively removed GC bias in gene expression level estimates (Figure 1—figure supplement 2). For eQTL mapping, ASE analysis, and PVE estimation for sex and age we used all 63 individuals. For PVE estimation for maternal rank and social connectedness, missing data meant that we conducted our analysis on n = 52 and n = 47 individuals, respectively.

Variant identification and genotype calls

To identify genetic variants in the baboon data set, we used the Genome Analysis Toolkit (v. 1.2.6; McKenna et al., 2010; DePristo et al., 2011). Because no validated reference set of known genetic variants are available for baboon, we performed an iterative bootstrapping procedure for base quality score recalibration. Specifically, we performed an initial round of base quality score recalibration and identified a set of variants using GATK's UnifiedGenotyper and VariantFiltration walker. From this call set, we constructed a set of high confidence variants with quality score ≥100 that passed all filters for variant confidence (variants failed if QD < 2.0), mapping quality (variants failed if MQ < 35.0), strand bias (variants failed if FS > 60.0), haplotype score (variants failed if HaplotypeScore >13.0), mapping quality (variants failed if MQRankSum < −12.5) and read position bias (variants failed if ReadPosRankSum < −8.0). We used this high confidence set as the set of ‘known sites’ in a second round of base quality score recalibration, repeating this procedure until the number of variants identified in consecutive rounds of recalibration stabilized. In the final call set, we removed all sites (i) that were monomorphic in the Amboseli samples; (ii) for which genotype data were missing for more than 12 individuals (19%) in the data set; (iii) that deviated from Hardy–Weinberg equilibrium; and (iv) that failed the above quality control filters. We further filtered the data set to contain only sites with a minimum quality score of 100 that were located within 200 kb of a gene of interest, and that were sequenced at a mean coverage ≥5× across all samples. We validated our quality control and filtering steps by performing the same procedure on an RNA-seq data set from the HapMap Yoruba population (see below). These steps resulted in a set of 64,432 single nucleotide polymorphisms carried forward into downstream analysis (30,938 for the YRI). For eQTL mapping analysis, missing genotypes in this final set were imputed using BEAGLE (Browning and Browning, 2009). To estimate genome-wide LD, we followed the approach of Eberle et al. (2006), which uses allele frequency-matched SNPs to calculate pair-wise LD. Specifically, we selected SNPs with MAFs greater than 10% and divided them into four subgroups (MAF between 10%–20%; MAF between 20%–30%; MAF between 30%–40%; and MAF between 40%–50%). We then calculated pair-wise r for all SNP pairs within 100 kb in each subgroup using VCFtools (Danecek et al., 2011) and combined values from all four subgroups.

Estimating accuracy of SNP genotypes using human RNA-seq data

To assess the accuracy of the RNA-seq-based genotyping calls we performed in the baboons, we investigated a similarly sized data set of RNA-seq reads from a human population (Pickrell et al., 2010a). Because this data set focused on samples from the HapMap consortium (n = 69 members of the Yoruba population from Ibadan, Nigeria), we were able to compare genotypes called using the RNA-seq pipeline to independently collected genotype data from HapMap Phase 3 (r27) (International HapMap Consortium, 2010). To do so, we focused on 9919 variants that were genotyped in both data sets. We then calculated the correlation between genotypes called in the RNA-seq-based pipeline and genotypes from HapMap, for each individual (Figure 1—figure supplement 5A). We also found that low accuracy was correlated with the level of apparent homozygosity in the genotype data (Figure 1—figure supplement 5B). In the baboon data, we had no individuals with unusually low homozygosity, but six individuals with unusually high homozygosity (>80% of genotype calls). These outliers were missing a median of 10.6% of data in the unimputed genotype data set, whereas all other individuals were missing a median of 0.6% data. However, removing these six individuals from our analysis resulted in very similar results as using the full data set: 87.6% of eQTL genes (n = 1566) identified when using all individuals were also identified with this subset.

Figure 1—figure supplement 5.

Accuracy of genotype calls for SNPs independently typed in HapMap3.

DOI: http://dx.doi.org/10.7554/eLife.04729.008

Importantly, the available data from humans also support accurate variant discovery. Of the 30,938 sites that we identified from the RNA-seq data and that passed all of our filters, only 3.1% (967) did not have an assigned rsID in dbSNP release 138. These sites were likely enriched for false positives, as the transition/transversion ratio for this set was 1.42, vs 2.80 for the set of 30,938 sites as a whole.

eQTL mapping

To identify cis-acting eQTLs in the baboon data set, we used the linear mixed model approach implemented in the program GEMMA (Zhou and Stephens, 2012). This model provides a computationally efficient method for eQTL mapping while explicitly accounting for genetic non-independence within the sample; in our case, some individuals in the data set are related (although overall relatedness was low: the median kinship coefficient across all pairs was 0.015; mean = 0.024 ± 0.033 s.d.). For each gene, we considered all variants within 200 kb of the gene as candidate eQTLs. For each variant, we fitted the following linear mixed model:and tested the null hypothesis H0: β = 0 vs the alternative H1: β ≠ 0. Here, y is the n by 1 vector of gene expression levels for the n individuals in the sample. Gene expression values were first corrected for hidden factors that could act as sources of global structure (e.g., batch effects or ancestry- or environment-related trans effects) by regressing out the first 10 principal components of the gene expression data. Consistent with previous results (e.g., Pickrell et al., 2010a), this procedure greatly improves our ability to detect eQTL (Figure 1—figure supplement 12). In the model, μ is the intercept; x is the n by 1 vector of genotypes for the variant of interest; and β is the variant's effect size. The n by 1 vector of u is a random effects term to control for individual relatedness and other sources of population structure, where the n by n matrix K = XXT/p provides estimates of pairwise relatedness derived from the complete 63 × 64,432 genotype data set X. Residual errors are represented by ε, an n by 1 vector, and MVN denotes the multivariate normal distribution.

Figure 1—figure supplement 12.

Number of eQTL identified by PCs removed from the gene expression data set.

DOI: http://dx.doi.org/10.7554/eLife.04729.015

We took the variant with the best evidence (i.e., lowest p-value) for association with gene expression levels for each gene, and then calculated corrected gene-wise q-values (with a 10% false discovery rate threshold) via comparison to the same values obtained from permuted data (similar to Barreiro et al., 2012; Pickrell et al., 2010a).

Possible confounds associated with eQTL mapping using RNA-seq data

We evaluated the potential for eQTL mapping based on RNA-seq data to introduce four possible confounds. First, for genes with large effect cis-eQTLs, reads from heterozygotes at eQTL-linked sites might be biased towards the allele associated with higher gene expression levels. If so, heterozygotes might be mistakenly genotyped as homozygotes for the high expressing allele, resulting in an underrepresentation of heterozygous genotypes relative to neutral expectations. To control for this possibility, we eliminated sites that violated Hardy-Weinberg expectations (n = 2386) from our analyses. We note, however, that this scenario would not introduce false positives. Instead, it would lead to more conservative detection of additive eQTL effects, with the direction of an estimated eQTL effect still consistent with the true effect. Second, SNP calling might be biased towards the reference allele. If so, more reads would be required to support a genotype call of homozygote alternate than a genotype call of homozygote reference. This bias would result in higher apparent expression levels for alternate allele homozygotes and lower expression levels for reference allele homozygotes, which could create false positive eQTLs. However, we observe no evidence for this scenario in our data set. For all tested SNPs (n = 64,432) and for eQTL SNPs only (n = 1693), alternate allele homozygotes tend to have slightly lower coverage than reference allele homozygotes, and heterozygotes tend to have the highest coverage (because more reads are required to support inference of heterozygosity) (Figure 1—figure supplement 13). Thus, coverage and genotype do not covary additively, and this potential confound is unlikely to produce false positive eQTLs.

Figure 1—figure supplement 13.

Coverage by genotype call.

DOI: http://dx.doi.org/10.7554/eLife.04729.016

Third, read mapping might be biased towards the reference allele, such that reads carrying the alternate allele are less likely to map because they contain more mismatches to the reference genome. This possibility is consistent with our observation that alternate allele homozygotes tend to have slightly less coverage than reference allele homozygotes (Figure 1—figure supplement 13). While this difference in coverage is significant (Kolmogorov-Smirnov test: p < 2.2 × 10−16 for all SNPs; p = 3.9 × 10−5 for eQTL SNPs), the magnitude of the effect itself is modest (Figure 1—figure supplement 13), probably because we allowed reads to map with up to three mismatches: Wittkopp and colleagues have shown that reference allele mapping bias is largely obviated by allowing reads to map with more mismatches (Stevenson et al., 2013). Further, systematic calling of false positive eQTLs due to biased read mapping would predict a bias towards negative effect sizes (i.e., eQTL effects suggesting that the alternate allele is associated with lower expression levels). Our data are not consistent with such a pattern: 47% of eQTL betas are negative, whereas 53% are positive. Reference allele mapping biases are, however, more likely to affect ASE analysis, producing a pattern of greater expression in the reference allele. Indeed, we do observe a bias towards negative betas in the ASE analysis (67.2% of n = 510 genes), although the overall magnitude and direction of ASE data agree well with eQTL evidence. Fourth, lower mean coverage in homozygotes of either type relative to heterozygotes could induce false positive eQTLs in which the major allele was associated with lower gene expression levels. To test this possibility, we recoded eQTL effects to reflect the effect of the major allele instead of the effect of the alternate allele (i.e., a genotype of 0 = homozygous minor and a genotype of 2 = homozygous major). We observed a modest excess of eQTL for which the major allele was associated with lower gene expression levels (56%, binomial test p = 1.15 × 10−7). This bias did not differ depending on whether the major allele was the reference allele or the alternate allele (Fisher's Exact Test, p = 0.28), supporting minimal read mapping biases in our data. Instead, it appears to be primarily driven by SNPs with low minor allele frequencies (proportion of negative betas for the lowest quartile of MAFs = 62.8%, p = 7.49 × 10−8; highest quartile of MAFs = 48.6%, p = 0.602). At these sites, eQTL inference relies primarily on two genotype classes (the major allele homozygotes and heterozygotes) rather than three genotype classes. Because heterozygotes tend to have slightly higher coverage than homozygotes of both classes, spurious relationships between genotype and gene expression levels are much less likely to be observed when both types of homozygotes are well represented (i.e., MAFs are larger). Along with the high genotype accuracy rates estimated from the Yoruba data, our analyses thus indicate that the set of eQTL we identified are largely robust to RNA-seq-specific confounds. The eQTL identified in YRI in the conventional pipeline vs the RNA-seq pipeline offer a further source of comparison. We find that eQTL identified through the RNA-seq pipeline tend to be associated with more highly expressed genes (providing greater power to call genotypes: Wilcoxon test p = 6.53−9), but otherwise do not differ in sequence conservation (phyloP scores: p = 0.707; Homologene scores: p = 0.603) or in estimated effect size (p = 0.137) (Figure 1—figure supplement 9). Further, effect size magnitude is highly correlated across pipelines when eQTL are discovered in both pipelines (r = 0.874, p < 10−57). When eQTL were discovered only in the RNA-seq pipeline (n = 104), they tended to be high on the ranked list of eQTL evidence in the conventional pipeline as well (median rank of 1395, where the top 700 were significant and 10,615 genes were tested), suggesting that many of them did not pass the threshold for eQTL detection in that analysis. Thus, the most salient source of error stems from low MAF sites, which are also the cases most vulnerable to sampling error and winner's curse effects more generally (Figure 3)—a problem that is not confined to RNA-seq-based eQTL mapping. Taken together, these analyses argue that, as a general rule, eQTL associated with lower MAF SNPs should be treated with increased caution.

ASE detection

To identify ASE, we focused on SNPs within gene exons with Phred-scaled quality scores greater than 10. We further required that these sites have more than five reads in more than two individuals and more than 300 total reads across all heterozygous individuals. This threshold is based on the observation that the power to detect ASE is dependent on sequencing read coverage at heterozygous sites (Fontanillas et al., 2010). Indeed, in our data set, power to detect ASE appeared to scale primarily with total read coverage rather than number of heterozygous individuals. Sites with more reads tended to have more heterozygotes (r = 0.266, p < 10−100); however, when sites were partitioned by total read depth (in deciles), sites with significant ASE were not more likely to harbor more heterozygotes in any decile (Wilcoxon test comparing number of heterozygotes in significant sites vs background; Figure 1—figure supplement 14).

Figure 1—figure supplement 14.

Detection of ASE is not dependent on number of heterozygotes, conditional on total read depth.

DOI: http://dx.doi.org/10.7554/eLife.04729.017

After these filtering steps, we retained 8154 SNPs associated with 2280 genes for ASE analysis. For ASE analysis, we did not take into account possible recombination between exonic SNPs and the (unknown) cis-regulatory variants whose effects they capture, as we did not have detailed data on recombination rates across the baboon genome. However, recombination between exonic SNPs and the true causal regulatory SNPs would decrease our power to detect ASE. For each variant, we considered a beta-binomial distribution (following Pickrell et al., 2010a) to model the number of reads from the (+) haplotype (denoted as xi+) or the number of reads from the (−) haplotype (denoted as xi−), conditional on the number of total reads (denoted as yi = xi+ + xi−), for each individual i, or We tested the null hypothesis H0: α = β vs the alternative H1: α ≠ β using a likelihood ratio test. For both the null model and the alternative model, beta distribution parameters (α and β) were estimated via a maximum likelihood approach, using the R function optim. Again, we took the variant with the lowest p-value for each gene, and then calculated corrected gene-wise q-values (using a 10% false discovery rate threshold) via comparison to the same values obtained from an empirical null distribution. To construct the empirical null distribution, we performed the same analysis after substituting the xi+ value for each variant of interest, for each heterozygous individual, with a randomly selected xi+ value from a heterozygous site elsewhere in the genome (contingent on that site having the same number of total reads, yi).

Power simulations

To assess the relative power of eQTL mapping in baboons vs the YRI data set, we randomly selected 10% of the genes in each data set to harbor eQTL. For each of these simulated eQTL genes, we then randomly chose a SNP among all the cis-SNPs tested (i.e., all variable sites that passed quality control filters and fell within 200 kb of a gene of interest) and assigned it as a causal eQTL. The impact of the eQTL was simulated using either effect size, in which we simulated a constant effect size between 0.25 and 2.5 (in intervals of 0.25) or PVE, in which we chose an effect size that explained a specific proportion of variance in gene expression levels (from 5% to 50%, in intervals of 5%). We then simulated gene expression levels by adding the effect of the simulated cis-eQTL SNP to residual errors drawn from a standard normal distribution. To calculate the FDR, we also simulated a set of genes with no eQTL. For each combination of effect sizes and population (baboon or YRI), and for each simulation scenario (e.g., with the causal SNP masked or unmasked, with SNP density thinned in the baboons, or using PVE vs a constant effect size), we performed 10 replicates. For each replicate, we calculated the power to detect eQTL as the proportion of simulated eQTL genes recovered at a 10% empirical FDR.

Testing the contribution of admixture to eQTL detection

To investigate whether admixture might drive our power to detect eQTL in the baboon data set, we performed three analyses. First, we asked whether evidence for ASE remained similar across longer distances (i.e., between sites separated by more base pairs) in the baboons vs in the YRI. Such a pattern might be expected if long-distance, admixture-driven LD explained our other observations. However, the pattern of ASE similarity (the magnitude of the difference between ASE estimates) by distance between sites was highly congruent between the YRI and baboon data sets (Figure 1—figure supplement 10). Second, we investigated whether adding a control for local structure (i.e., population structure in cis to a gene of interest, and based only on variants located on the same chromosome) asymmetrically reduced evidence for eQTL in the baboon data set relative to the YRI data set. To do so, we regressed out the top two PCs for variants on the same chromosome as the gene of interest from the gene expression data prior to fitting mixed effects models. We found that this approach modestly reduced the number of eQTL discoveries in the baboon data set (n = 1583 from n = 1787, an 11.4% difference). However, this number was still 5.4× larger than the number of eQTL detectable in the YRI, and when we applied the same local structure control to the YRI data, a comparable drop in the number of discoverable eQTL also occurred (n = 216 from n = 290, resulting in a ∼7× fold increase in eQTL in baboons vs YRI). Third, we compared the spatial distribution of eQTL in baboon between the models with and without local structure controls. We reasoned that if admixture drove most of the signal in the data set, controlling for local structure should shift the location of discovered eQTL closer to the gene of interest, where the strongest cis effects are generally identified. However, the locations of eQTL were very similar under both models (Kolmogorov-Smirnov test, p = 0.577; Figure 1—figure supplement 11).

Evidence for patterns consistent with natural selection on gene expression levels

We investigated the relationship between conservation level and the presence of detectable eQTL in the Amboseli baboons or the YRI using phyloP conservation scores (Pollard et al., 2010) and Homologene conservation of orthology across species. For the former, we extracted the per-site phyloP score from the 46-way primate comparison or 100-way vertebrate comparison on the UCSC Genome Browser for each base contained within the annotated exons (including untranslated regions) used for mapping RNA-seq reads in the YRI. We then calculated the average phyloP score across all exons associated with a given gene. We obtained Homologene scores from the CANDID database (Hutz et al., 2008). In both cases, we used linear models to test for a relationship between conservation level and three categories of genes: those with no detectable eQTL in either the baboons or YRI; those with a detectable eQTL in one of the two species; and those with a detectable eQTL in both species. To investigate whether the correlation between minor allele frequency and eQTL effect size could be a result of winner's curse effects, we extracted the results from our simulations in which the causal variant was masked and the true effect size was fixed at a small value (beta = 0.75). We then calculated the correlation between the estimated effect size () from these simulations against minor allele frequency, for detected eQTL only.

Estimation of genetic contributions to gene expression

We used the Bayesian sparse linear mixed model (BSLMM) approach implemented in the GEMMA software package (Zhou and Stephens, 2012) to estimate the genetic contribution to gene expression variation. Specifically, for each gene, we fit the following model:where y is the n by 1 vector of gene expression levels for n individuals; μ is the intercept; xcis is an n by pcis matrix of genotypes for pcis cis-SNPs and βcis are the corresponding effect sizes; xtrans is an n by ptrans matrix of genotypes for ptrans trans-SNPs and βtrans are the corresponding effect sizes; and ε is an n by 1 vector of i.i.d. residual errors. We used different priors for cis-acting effects and trans-acting effects to capture different properties for the two components. Specifically, the spike-slab prior on the cis effects βcis captures our prior belief that only a small proportion of local SNPs has cis effects and these effects are relatively large. The normal prior on the trans effects captures our prior knowledge that trans-acting SNPs tend to be relatively difficult to find and have relatively small effects. In addition, because pcis is small and ptrans approximately equals p, the number of total SNPs, we used p instead of ptrans to facilitate computation (i.e., ptrans was based on all genotyped sites used in our analyses, n = 64,432). Results are qualitatively similar if ptrans is calculated based on sites that must act in trans (i.e., sites located on a different chromosome than the chromosome containing the gene of interest: Figure 4—figure supplement 3). We used Markov chain Monte Carlo (MCMC) to fit the model with 1000 burn-in and 10,000 sampling steps. We obtained posterior samples of βcis and βtrans to calculate the PVE attributed by each of the two components, as well as the total additive genetic PVE contributed by both components.

Figure 4—figure supplement 3.

Genetic contributions to variance in gene expression levels, with ptrans based on SNPs on other chromosomes only.

DOI: http://dx.doi.org/10.7554/eLife.04729.025

To calculate PVE values for demographic and environmental predictors, we again used the linear mixed model approach implemented in GEMMA to control for additive genetic effects. Sex was known from direct observation of the study subjects. Ages were known to within a few days' error for 52 of the 63 individuals in the data set; six animals had birth dates estimated to be accurate within 1 year, four animals had birth dates estimated to be accurate within 2 years, and one had a birth date estimated to be less accurate than 2 years. Early social status was measured using the proportional dominance rank of the individual's mother, at the time of that individual's conception. Dominance ranks are assigned monthly using ad libitum observations of dyadic agonistic (aggressive or competitive) encounters within social groups (Hausfater, 1974; Alberts et al., 2003). Maternal social connectedness values were defined as the social connectedness of the individual's mother, in the year of that female's life during which the focal individual was born. Social connectedness is calculated on a yearly basis as the frequency with which a female was involved in affiliative interactions, relative to the median for all females in the population at the same time and controlling for observer effort (see Runcie et al., 2013; Archie et al., 2014). Social connectedness is measured for females, but can focus on either female–female relationships (SCI-F) or a female's relationship with adult males (SCI-M), which have independent effects on longevity in this population (Archie et al., 2014). For SCI-F, affiliative interactions included both grooming interactions and close spatial proximity to other females. For SCI-M, only grooming interactions were used. For each gene, we fit the following model:where x is the n by 1 vector of values for the demographic or environmental predictor of interest and β is its coefficient. The n by 1 vector of u is a random effects term with K = XXT/p controlling for additive genetic effects. We calculated the PVE estimate as var(x)/var(y), where var denotes the sample variance. eLife posts the editorial decision letter and author response on a selection of the published articles (subject to the approval of the authors). An edited version of the letter sent to the authors after peer review is shown, indicating the substantive concerns or comments; minor concerns are not usually shown. Reviewers have the opportunity to discuss the decision before the letter is sent (see review process). Similarly, the author response typically shows only responses to the major concerns raised by the reviewers. Thank you for sending your work entitled “The Genetic Architecture of Gene Expression Levels in Wild Baboons” for consideration at eLife. Your article has been favorably evaluated by Aviv Regev (Senior editor), a Reviewing editor, and two reviewers, one of whom, Stephen Montgomery, has agreed reveal his identity. The Reviewing editor and the reviewers discussed their comments before we reached this decision, and the Reviewing editor has assembled the following comments to help you prepare a revised submission. Please find below a summary of the comments of the reviewers that we request that you address in your revised manuscript: Figure 1 and 3 should not be main figures. The pipeline should be moved to the supplement. Also the ASE and eQTL effect sizes is a general technical issue and does not relate specifically to the study of the genetics of gene expression in baboons. The text around these sections is long and generally distracting. There is a comment regarding the lack of an association between effect size and minor allele frequency compared to Battle et al., 2014. However, Battle et al. had 922 individuals compared to 66 individuals here (among several technical differences). We are not sure that these studies can be easily compared to provide a definitive statement regarding negative selection in baboons here. A caveat regarding study differences might suffice. The authors bring up the possibility of admixture in the baboon population. This is slightly concerning as it could increase the number of hets and has the potential to increase the length of haplotypes (both observations of the study). It further could explain many of the observations of the study. Three possible suggestions to address this are to test whether: (i) one sees an excess of trans-associations on the same chromosome compared to across chromosomes for cis-eQTL in baboons versus YRI, (ii) apply a surrogate variable or hidden factor correction to eQTL analysis on a single chromosome or (iii), our reviewers’ favorite, test if ASE is correlated over a longer distance (more independent genes) than in humans—in this case, since there is no biological basis for why the locations of causal variants should be further away from the genes they regulate between baboons and humans, the decay of the correlation of ASE measured in proximal genes should be similar. We realize the authors have a model to control for individual relatedness and population structure, but this is derived from the entire genetic data set and does not address local patterns of admixture. The gene expression data was quantile normalized. Why was a hidden factor correction not applied? Typically, these types of corrections dramatically improve eQTL discovery. Our concern is if there is some structure to the data that is both present and correlated in genotype and gene expression space, the number discoveries will be artificially inflated. For ASE analyses: (i) the authors assume no recombination, this is not stated, (ii) how is beta in theta∼beta (alpha, beta) estimated, and (iii) detection of ASE correlates with expression level (Figure 3D), this is not a surprise, but given their model we are concerned whether this estimate is more extreme, because ASE has different variances in effect size when it is estimated from a few individuals for very highly expressed genes (2 het individuals with 150 reads = 300 total) compared to lots of estimates from intermediately expressed genes (10 het individuals with 30 reads = 300 total). For robustness, the authors should show whether detection of ASE in their study is independent of the number of input individuals once a testable site has been selected using their criterion. The authors should discuss the possibility that the negative correlation between conservation and probability of eQTL in a gene in baboons at least may be driven by the technical issue that only coding SNPs were tested and therefore conserved genes will tend to have low MAF and therefore low power. The authors indicate a large component of expression variability is in trans. Is trans defined as on other chromosomes? In particular, the authors should clarify what goes into the ptrans matrix. should not be main figures. The pipeline should be moved to the supplement. Also the ASE and eQTL effect sizes is a general technical issue and does not relate specifically to the study of the genetics of gene expression in baboons. The text around these sections is long and generally distracting. We have removed the original Figure 1 from the manuscript, as it provided an overview version of the more detailed pipelines provided in the original Figure 1–figure supplements 1-6. These supplements are now attached to the original Figure 2 (now Figure 1), along with other figures that were previously included as Figure 1’s supplements. The plots in former Figure 3 have also been changed to Figure 1 supplements and removed from the main text. In addition, we have streamlined our discussion of the power to detect ASE versus eQTL in the main text, largely removing this entire section. There is a comment regarding the lack of an association between effect size and minor allele frequency compared to . However, Battle et al. had 922 individuals compared to 66 individuals here (among several technical differences). We are not sure that these studies can be easily compared to provide a definitive statement regarding negative selection in baboons here. A caveat regarding study differences might suffice. We agree and have added such a caveat to this section of the Results. The authors bring up the possibility of admixture in the baboon population. This is slightly concerning as it could increase the number of hets and has the potential to increase the length of haplotypes (both observations of the study). It further could explain many of the observations of the study. Three possible suggestions to address this are to test whether: (i) one sees an excess of trans-associations on the same chromosome compared to across chromosomes for cis-eQTL in baboons versus YRI, (ii) apply a surrogate variable or hidden factor correction to eQTL analysis on a single chromosome or (iii), our reviewers’ favorite, test if ASE is correlated over a longer distance (more independent genes) than in humans—in this case, since there is no biological basis for why the locations of causal variants should be further away from the genes they regulate between baboons and humans, the decay of the correlation of ASE measured in proximal genes should be similar. We realize the authors have a model to control for individual relatedness and population structure, but this is derived from the entire genetic data set and does not address local patterns of admixture. Thanks for these helpful suggestions. We agree that the evolutionary history of this population could have influenced our findings. In the revised manuscript, we investigate this possibility in substantially greater detail (eleventh paragraph in the Results section and under the heading “Testing the contribution of admixture to eQTL detection”, in a new Materials and methods section). Overall, we found little evidence that admixture drove our results. These conclusions are based on three sets of findings: 1) Distance between sites tested for ASE predicts the magnitude of the difference between ASE estimates, as expected; however, this relationship does not differ between baboons and YRI (new Figure 1–figure supplement 10). This analysis, motivated by the reviewers’ third suggestion, suggests that ASE is not correlated over a longer distance in baboons than in humans (note that we could not directly test how correlations between ASE estimates decay with distance because our ASE estimates are site-specific, not individual-specific). 2) Controlling for local structure in addition to global ancestry (suggestion 2) modestly reduces the number of eQTL discoveries in the baboon data set, but not more so than the same procedure in YRI. When controlling for local structure using the top two principal components for variants on the same chromosome, the number of detectable eQTL drops from 1787 to 1583. However, this number is still 5.4-fold larger than the number detected in the YRI. A drop also occurs when running a parallel model in YRI (from 290 to 216), so that the number of eQTL detected in baboons is ∼7-fold larger when controls for local structure are used in both. Thus, more extensive local structure does not appear to explain the increased power in baboons. 3) In support of this idea, the spatial distribution of baboon eQTLs relative to genes is almost identical between the models with and without local structure controls (new Figure 1–figure supplement 11). If admixture drove most of the signal in the data set, we would expect to observe greater enrichment of eQTL within or near genes when adding controls for local structure; we do not. The gene expression data was quantile normalized. Why was a hidden factor correction not applied? Typically, these types of corrections dramatically improve eQTL discovery. Our concern is if there is some structure to the data that is both present and correlated in genotype and gene expression space, the number discoveries will be artificially inflated. This comment reflects our mistake in writing the original version of our Methods. We indeed corrected for hidden factors by regressing out the first 10 principal components of the overall gene expression data. As the reviewers note, this process greatly improved our ability to detect eQTL and minimized the possibility that the eQTL we detected reflect global structure in gene expression space (we did the same for the YRI data, so our methods remained comparable). We explain our procedure in the revised Methods (tenth paragraph) and have also added a figure supplement (new Figure 1–figure supplement 12) showing the relationship between the number of PCs we removed and eQTL detection. For ASE analyses: (i) the authors assume no recombination, this is not stated, (ii) how is beta in theta∼beta (alpha, beta) estimated, and (iii) detection of ASE correlates with expression level (), this is not a surprise, but given their model we are concerned whether this estimate is more extreme, because ASE has different variances in effect size when it is estimated from a few individuals for very highly expressed genes (2 het individuals with 150 reads = 300 total) compared to lots of estimates from intermediately expressed genes (10 het individuals with 30 reads = 300 total). For robustness, the authors should show whether detection of ASE in their study is independent of the number of input individuals once a testable site has been selected using their criterion. With regards to (i), we did not make any explicit assumptions about recombination in the ASE analysis. Recombination between the exonic SNPs we used to test for ASE and true causal regulatory SNPs would decrease our power to detect ASE; we make this point clear in the revision (in the subsection “ASE detection” of the Materials and methods). For (ii), we used a maximum likelihood approach to estimate both the alpha and beta parameters in the beta component of the beta-binomial model; we have clarified this point in the same subsection of the Materials and methods of the revised manuscript. To address point (iii), we have now tested whether ASE is more likely to be detected for sites with more heterozygous individuals, conditional on total read depth (overall, number of hets and total read depth are correlated: r = 0.266, p < 10-100). We find no evidence for such an effect across the ten deciles of total read depth values we tested (please see the aforementioned subsection and Figure 1–figure supplement 14). The authors should discuss the possibility that the negative correlation between conservation and probability of eQTL in a gene in baboons at least may be driven by the technical issue that only coding SNPs were tested and therefore conserved genes will tend to have low MAF and therefore low power. Thanks for pointing out this possible interpretation. We have now calculated the correlation between levels of conservation and average minor allele frequency for each gene, for SNPs used in our analysis. The correlation between a gene’s average phyloP score (from the 46-way primate comparison) and the average MAF of all SNPs tested in association with that gene is nominally significant (p = 0.002) but explains a very small fraction of the variance (r = -0.037). Thus, while more conserved genes do tend to have lower average MAFs (among SNPs tested), this relationship is weak, probably because we filter out sites with very low MAFs and because many sites occur in non-protein coding regions of the transcript or in the transcribed regions of other genes. Further, we observe no significant correlation between a gene’s Homologene score and average minor allele frequency among SNPs tested (p = 0.38). These analyses suggest that the relationship between conservation and eQTL discovery is probably not driven by a relationship between conservation and MAF (at least within the data set we ultimately analyzed). We have discussed this point in the revised manuscript (under the heading “Mixed evidence for nature selectin of gene expression levels”, in the Results section). We also cite a recent paper by Popadin et al. (third paragraph of the Discussion) that also suggests that fewer cis-eQTL are found in older genes, similar to our findings for the Homologene analysis. The authors indicate a large component of expression variability is in trans. Is trans defined as on other chromosomes? In particular, the authors should clarify what goes into the p matrix. Because the number of sites that are included in ptrans is almost equal to p (the total number of sites in the genome), we calculated the ptrans matrix based on all SNPs typed. However, in the revision we also have run ptrans based on all other chromosomes except the chromosome containing the focal gene. These results are almost identical to the results obtained when using all SNPs; we include them as a new supplemental figure (please see the subsection “Estimation of genetic contributions to gene expression”, in the Materials and methods, and new Figure 4–figure supplement 3).

78 in total

1. Evolution of gene expression in the Drosophila melanogaster subgroup.

Authors: Scott A Rifkin; Junhyong Kim; Kevin P White
Journal: Nat Genet Date: 2003-01-27 Impact factor: 38.330

2. Key considerations for measuring allelic expression on a genomic scale using high-throughput sequencing.

Authors: Pierre Fontanillas; Christian R Landry; Patricia J Wittkopp; Carsten Russ; Jonathan D Gruber; Chad Nusbaum; Daniel L Hartl
Journal: Mol Ecol Date: 2010-03 Impact factor: 6.185

3. Hybridization as an invasion of the genome.

Authors: James Mallet
Journal: Trends Ecol Evol Date: 2005-05 Impact factor: 17.712

Review 4. The evolutionary significance of cis-regulatory mutations.

Authors: Gregory A Wray
Journal: Nat Rev Genet Date: 2007-03 Impact factor: 53.242

5. Social affiliation matters: both same-sex and opposite-sex relationships predict survival in wild female baboons.

Authors: Elizabeth A Archie; Jenny Tung; Michael Clark; Jeanne Altmann; Susan C Alberts
Journal: Proc Biol Sci Date: 2014-09-10 Impact factor: 5.349

6. Sex-specific and lineage-specific alternative splicing in primates.

Authors: Ran Blekhman; John C Marioni; Paul Zumbo; Matthew Stephens; Yoav Gilad
Journal: Genome Res Date: 2009-12-15 Impact factor: 9.043

7. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS.

Authors: Dan L Nicolae; Eric Gamazon; Wei Zhang; Shiwei Duan; M Eileen Dolan; Nancy J Cox
Journal: PLoS Genet Date: 2010-04-01 Impact factor: 5.917

8. Single-tissue and cross-tissue heritability of gene expression via identity-by-descent in related or unrelated individuals.

Authors: Alkes L Price; Agnar Helgason; Gudmar Thorleifsson; Steven A McCarroll; Augustine Kong; Kari Stefansson
Journal: PLoS Genet Date: 2011-02-24 Impact factor: 5.917

9. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals.

Authors: Alexis Battle; Sara Mostafavi; Xiaowei Zhu; James B Potash; Myrna M Weissman; Courtney McCormick; Christian D Haudenschild; Kenneth B Beckman; Jianxin Shi; Rui Mei; Alexander E Urban; Stephen B Montgomery; Douglas F Levinson; Daphne Koller
Journal: Genome Res Date: 2013-10-03 Impact factor: 9.043

10. The impact of population demography and selection on the genetic architecture of complex traits.

Authors: Kirk E Lohmueller
Journal: PLoS Genet Date: 2014-05-29 Impact factor: 5.917

43 in total

1. Association mapping reveals the role of purifying selection in the maintenance of genomic variation in gene expression.

Authors: Emily B Josephs; Young Wha Lee; John R Stinchcombe; Stephen I Wright
Journal: Proc Natl Acad Sci U S A Date: 2015-11-24 Impact factor: 11.205

2. Getting to the heart of cardiovascular evolution in humans.

Authors: Alex Pollen; Bryan J Pavlovic
Journal: Elife Date: 2019-05-28 Impact factor: 8.140

3. High-resolution mapping of cis-regulatory variation in budding yeast.

Authors: Ryosuke Kita; Sandeep Venkataram; Yiqi Zhou; Hunter B Fraser
Journal: Proc Natl Acad Sci U S A Date: 2017-11-28 Impact factor: 11.205

4. Gene expression shifts in yellow-bellied marmots prior to natal dispersal.

Authors: Tiffany C Armenta; Steve W Cole; Daniel H Geschwind; Daniel T Blumstein; Robert K Wayne
Journal: Behav Ecol Date: 2018-12-17 Impact factor: 2.671

5. Heritability estimation and differential analysis of count data with generalized linear mixed models in genomic sequencing studies.

Authors: Shiquan Sun; Jiaqiang Zhu; Sahar Mozaffari; Carole Ober; Mengjie Chen; Xiang Zhou
Journal: Bioinformatics Date: 2019-02-01 Impact factor: 6.937

Introduction

Results

Functional regulatory variation is common in the Amboseli baboons

Baboon eQTLs are enriched in and near genes.

Detailed workflow for gene expression level estimation.

Elimination of GC bias via quantile normalization.

Detailed workflow for SNP genotyping.

Location of analyzed SNPs relative to genes.

Accuracy of genotype calls for SNPs independently typed in HapMap3.

PCA projection of YRI samples using the RNA-seq-based pipeline vs independently typed SNPs.

Agreement between eQTL and ASE approaches for identifying functional variants.

Power to detect ASE vs eQTL.

Characteristics of YRI eQTL identified in the RNA-seq vs conventional pipelines.

Differences in the magnitude of ASE vs distance between sites.

Location of eQTL SNPs relative to genes with and without controlling for local structure.

Number of eQTL identified by PCs removed from the gene expression data set.

Coverage by genotype call.

Detection of ASE is not dependent on number of heterozygotes, conditional on total read depth.

Increased power to detect eQTL in baboons relative to humans

Shifts in the minor allele frequency spectrum

Power to detect eQTL in the Amboseli baboons compared to the HapMap YRI population.

Relationship between power to detect eQTL and simulated effect size, when the true eQTL is masked.

Differences in genetic diversity and linkage disequilibrium

Mixed evidence for natural selection on gene expression levels

Mixed evidence for negative selection on variants affecting gene expression level.

Correlation between eQTL detection and mean phyloP scores based on 100-way vertebrate comparison.

Genetic and environmental contributions to gene expression variation in wild baboons

Genetic contributions to variance in gene expression levels in wild baboons.

PVE explained by demographic and early environmental variables.

Distribution of PVE explained by additive genetic variance, age, sex, and maternal social connectedness to males across all genes.

Genetic contributions to variance in gene expression levels, with ptrans based on SNPs on other chromosomes only.

Discussion

Materials and methods

Study subjects and blood sample collection

Gene expression profiling using RNA-seq

Variant identification and genotype calls

Estimating accuracy of SNP genotypes using human RNA-seq data

eQTL mapping

Possible confounds associated with eQTL mapping using RNA-seq data

ASE detection

Power simulations

Testing the contribution of admixture to eQTL detection

Evidence for patterns consistent with natural selection on gene expression levels

Estimation of genetic contributions to gene expression

Review 4. The evolutionary significance of cis-regulatory mutations.

Review 10. Functional genomic insights into the environmental determinants of mammalian fitness.