Literature DB >> 27376237

Genome-wide association study of behavioral, physiological and gene expression traits in outbred CFW mice.

Clarissa C Parker^1,2,3, Shyam Gopalakrishnan^1,4, Peter Carbonetto^1,5, Natalia M Gonzales¹, Emily Leung¹, Yeonhee J Park¹, Emmanuel Aryee¹, Joe Davis¹, David A Blizard⁶, Cheryl L Ackert-Bicknell^7,8, Arimantas Lionikas⁹, Jonathan K Pritchard^10,11,12, Abraham A Palmer^1,13,14,15.

Abstract

Although mice are the most widely used mammalian model organism, genetic studies have suffered from limited mapping resolution due to extensive linkage disequilibrium (LD) that is characteristic of crosses among inbred strains. Carworth Farms White (CFW) mice are a commercially available outbred mouse population that exhibit rapid LD decay in comparison to other available mouse populations. We performed a genome-wide association study (GWAS) of behavioral, physiological and gene expression phenotypes using 1,200 male CFW mice. We used genotyping by sequencing (GBS) to obtain genotypes at 92,734 SNPs. We also measured gene expression using RNA sequencing in three brain regions. Our study identified numerous behavioral, physiological and expression quantitative trait loci (QTLs). We integrated the behavioral QTL and eQTL results to implicate specific genes, including Azi2 in sensitivity to methamphetamine and Zmynd11 in anxiety-like behavior. The combination of CFW mice, GBS and RNA sequencing constitutes a powerful approach to GWAS in mice.

Entities: Chemical Disease Gene Mutation Species

Mesh：

Substances：
Genetic Markers

Year: 2016 PMID： 27376237 PMCID： PMC4963286 DOI： 10.1038/ng.3609

Source DB: PubMed Journal: Nat Genet ISSN： 1061-4036 Impact factor: 38.330

Introduction

In the last decade, genome-wide association studies (GWAS) have demonstrated that common alleles influence susceptibility to virtually all common diseases[1-3]. The success of GWAS in elucidating the genetic determinants of disease in humans is due in part to the large number of recombinations among unrelated individuals, which permits high-resolution mapping across the genome. One important conclusion from those studies is that most causal loci appear to be due to regulatory rather than coding polymorphisms[4]. Mice offer a powerful tool for elucidating the genetic architecture of complex traits: environmental factors can be held constant or systematically varied; genome editing permits experimental testing of identified genotype-phenotype relationships; most mouse genes have a human homolog, allowing rapid translation to humans; and relevant tissues can be obtained under highly controlled conditions and used to identify gene expression quantitative trait loci (eQTLs). However, the mouse populations used in most prior studies lacked sufficient recombination to narrow the implicated loci to a tractable size and thus generally failed to identify specific genes[5,6]. In this study, we mapped QTLs and eQTLs using Carworth Farms White (CFW) mice, which are a commercially available outbred population[7]. While CFW mice were not developed for genetic research, they have several attractive properties. CFW mice were derived from a small number of founders and have been subsequently maintained as an outbred population for more than 100 generations, thus degrading linkage disequilibrium (LD) between nearby alleles[8-10]. Although CFW mice have longer range LD compared to most human populations, they have less LD than other commercially available laboratory mice[9], and therefore should provide fine-scale mapping resolution. Compared to humans, the more extensive LD in CFW mice means that fewer markers are needed to perform GWAS and correspondingly lower levels of significance are required because fewer independent hypotheses are tested. We used genotyping-by-sequencing (GBS) to overcome another barrier to GWAS in mice, which is the high cost and limited coverage of extant SNP genotyping arrays. Finally, based on the importance of regulatory variation suggested by human GWAS[4,11], we identified eQTLs that co-mapped with behavioral QTLs in an effort to identify the most likely causal genes.

Results

We phenotyped 1,200 male CFW mice for conditioned fear, anxiety-like behavior, methamphetamine sensitivity, prepulse inhibition, fasting glucose, body weight, tail length, testis weight, the weight of five hindlimb muscles, bone mineral density, bone morphology and gene expression in prefrontal cortex, hippocampus and striatum (Figure 1, Supplementary Figures 1–4, Online Methods, Supplementary Note).

Figure 1

Components of study

Each of the 4 panels illustrates a component of the study: (A) Behavioral testing and measurement of physiological traits; (B) Genotyping-by-sequencing (GBS); (C) Measurement of gene expression in brain tissues using RNA-Seq; (D) QTL mapping for physiological and behavioral traits, and for gene expression.

Genotyping

Existing mouse SNP genotyping technologies, such as the Mouse Universal Genotyping Array (MUGA), MegaMUGA[12], the more recent GigaMUGA[13] and the Mouse Diversity Array (MDA)[14] were not designed to capture common genetic variation in the CFW population. Furthermore, we sought to reduce the cost of genotyping, which has been a barrier to GWAS in mice. Therefore, we adapted GBS, which was originally developed in maize[15], for use in mice. We used GBS to genotype 1,024 CFW mice, and identified 92,734 autosomal bi-allelic SNPs after filtering, 79,284 (86%) of which were present in dbSNP (v137). The remaining 13,450 SNPs (14%) represent “novel” SNPs that had not been previously reported. The distribution of GBS SNPs on autosomes is shown in Figure 2A. The nonuniform distribution of SNPs is likely due to differences in the numbers of polymorphic markers among all laboratory mice (Figure 2A, Supplementary Figure 5) and regions that are identical-by-descent among CFW mice. The non-uniform distribution of polymorphic SNPs appears to be a characteristic of CFW mice since polymorphic SNPs identified by the MegaMUGA array showed a similar pattern (Figure 2A; r = 0.43 on log-scale).

Figure 2

Genetic characteristics of CFW mouse population

(A) Density of GBS SNPs on autosomal chromosomes; (B) Mean LD (r2) decay rates estimated using frequency-matched SNPs[55], with MAF > 20%, in a 34th generation AIL derived from LG/J and SM/J strains[43,46], heterogeneous stock (HS) mice bred for > 50 generations[49], the Hybrid Mouse Diversity Panel (HMDP)[83], a panel of 30 inbred lab strains[14,52], Diversity Outbred mice[12], and CFW mice; (C) Treemix analysis summarizing genetic relationship between CFW mice and inbred strains in Wellcome Trust sequencing panel.

To assess the quality of GBS genotypes, we estimated the genotyping error rate in two ways. First, we compared GBS SNPs against those that were also present on the MegaMUGA array among 24 CFW mice that were genotyped using both platforms. This comparison yielded an overall discordance rate of 3%. We obtained a second estimate of the error rate of 1.6% by comparing genotypes in pairs of haplotypes that were identical-by-descent. Based on these results, we concluded that GBS provided a larger number of polymorphic SNPs than were found using MegaMUGA.

Genetic Architecture of the CFW Population

Comparing LD in different populations is useful for gauging mapping resolution[16]. Figure 2B shows that LD (r) decays rapidly in CFW mice compared to other populations, consistent with previous findings based on a much smaller number of SNPs[9,17], supporting their suitability for high-resolution mapping. Importantly, the majority of the SNPs we identified in CFW mice segregate among domesticus-derived laboratory strains (Figure 2C). Unlike the Collaborative Cross (CC) and the Diversity Outbred (DO), few of the SNPs found in CFW are derived from the castaneous and musculus subspecies[18-20]. When compared to a panel of inbred mice, CFW are most genetically similar to FVB/NJ (Figure 2C). Next, we considered the distribution of minor allele frequencies (MAF) of SNPs genotyped in the CFW mice (Supplementary Figure 6). The majority of SNPs (73%) had relatively high allele frequencies (MAF > 0.05). This profile is consistent with the reported history of CFW mice; namely, a severe bottleneck at the inception of the CFW population, followed by expansion to create an outbred population with a modest effective population size[9]. The mean MAF of novel SNPs was lower than for previously reported SNPs, consistent with the hypothesis that some of these novel SNPs are unique to the CFW population. Although we requested only one mouse from each litter, we were concerned that individuals in our study might have close familial relationships because they were sampled from a finite breeding population; however, we did not detect widespread population structure or cryptic relatedness in the CFW mice (Supplementary Figures 7–11).

SNP Heritability

Supplementary Table 1 shows the SNP heritability[21,22], which is the proportion of variance in the trait explained by available SNP genotypes. SNP heritability estimates ranged from 9–60%, with a mean of 28%. The mean SNP heritability for physiological traits was slightly higher (32%) compared to behavioral traits (27%).

GWAS

We mapped QTLs for 66 behavioral and physiological phenotypes (Supplementary Tables 1, 2; Figure 3A). We used GEMMA to fit a linear mixed model (LMM) and quantify support for an association at each SNP. We also used a simpler linear model that did not correct for population structure and observed that it produced broadly similar results (see Supplementary Figure 12). However, we have presented the results from the LMM-based analysis because it may reduce subtle inflation of the test statistics due to close relationships or fine-scale population structure. We calculated a threshold via permutation, which is a standard approach for QTL mapping in mice that controls for the type I error rate[23,24] (Supplementary Figure 13). This approach identified numerous QTLs for physiological and behavioral traits (Figure 3A, Supplementary Figures 14–18) that exceeded 2 × 10−6 (p < 0.1). Supplementary Table 2 contains more detailed information about all the most significant physiological and behavioral QTLs.

Figure 3

QTLs for physiological and behavioral traits

(A) Minimum p-values for association across all tested behavioral and physiological phenotypes (see Supplementary Table 1 and 2 for details). (B) Genome-wide scan for testis weight and (C) Pre-pulse inhibition in response to +12 dB pre-pulse. (D) Association signal for testis weight near the QTL on chromosome on 13. (E) Association signal for pre-pulse inhibition near the QTL on chromosome 7. Dotted red lines indicate thresholds (p<0.1) estimated via permutation tests.

For testis weight we found a strong association with rs6279141 on chromosome 13 (Figure 3B, p-value: 4.51 × 10−18) that accounted for 7.5% of variation in that trait. The implicated region contained few genes (Figure 3D), one of which was Inhba, a gene that has been shown to affect testis morphogenesis, testicular cell proliferation and testis weight in mice[25-27], and is therefore a promising candidate gene. The strongest association for soleus muscle weight mapped to rs30535702 on chromosome 13 and explained 2.8% of trait variance (p-value: 8.33 × 10−8). One of the genes in this interval, Fst, is known to influence muscle mass[28,29] and is a strong candidate to explain this association. We identified several examples of pleiotropy. For example, two independently measured muscle weights, tibialis anterior (TA) and extensor digitorum longus (EDL), were both associated with rs27338905 on chromosome 2, in each case accounting for 2.3% of the variation. Tp53inp2 is near the peak marker and is abundantly expressed in skeletal muscle[30], where it functions as a negative regulator of muscle mass[31]. Likewise, the weight of three muscles (gastrocnemius, EDL and soleus) mapped to the proximal end of chromosome 13; in each case, the minor allele was associated with increased muscle weight. Finally, on chromosome 12 we identified pleiotropic effects on tibia length and EDL weight. Unexpectedly, we found that CFW mice appear predisposed toward abnormally high bone mineral density (BMD). This is a characteristic of CFW mice that does not appear to be shared with commonly used inbred laboratory strains (Supplementary Figure 3). This “abnormal BMD” phenotype was strongly associated with rs33583459 on chromosome 5 and rs29477109 on chromosome 11 (p-values: 1.57 × 10−9, 1.12 × 10−14, respectively). The locus on chromosome 5 contains a large number of genes, including Abcf2 and Slc4a2. The human ortholog, ABCF2, has been associated with BMD in the largest GWAS of BMD completed to date[32], and is highly expressed in osteoblasts[33]. Slc4a2 plays a critical role in osteoclasts[34] and homozygous deletion of Slc4a2 is associated with the osteopetrosis-like phenotype “Marble Bone Disease” in Red Angus cattle[35]. Thus, both Abcf2 and Slc4a2 are viable candidates for this region. The association on chromosome 11 contains the gene Col1a1. In humans, Osteogenesis Imperfecta Type I can be caused by a null allele of COL1A1 and results in gracile bones with decreased strength[36,37]. COL1A1 is also associated with other bone size phenotypes[38], making Col1a1 a likely causal gene for this locus. Finally, we identified several associations for behavioral traits, including methamphetamine sensitivity on chromosome 6 at rs22397909 (p-value: 9.03 × 10−7) and chromosome 9 at rs46497021 (p-value: 1.58 × 10−6); these associations account for 2.6% and 2.1% of the phenotype variance, respectively (Figure 4). We also identified an association for anxiety-like behavior with rs238465220 on chromosome 13 (p-value:7.31 × 10−8) that explained 3% of the variance. For prepulse inhibition (12 db), we identified associations with rs264716939 on chromosome 7 (Figure 3C) and rs230308064 on chromosome 13 (p-values: 1.18 × 10−6 and 2.17 × 10−6, respectively). There were many genes in the ~3Mb region on chromosome 7 that were associated with PPI (Figure 3E), making it difficult to identify the causal gene(s). Candidate genes for the associations with behavioral traits are discussed below.

Figure 4

Overview of eQTL mapping

(A) Color of each pixel in the matrix depicts the lowest p-value among all eQTLs using a 10 Mb × 10 Mb window. (B) Overlap of genes with eQTLs in the three brain tissues detected using the traditional cis-eQTL mapping method (not ASE). The permutation-based p-value threshold for each eQTL is 0.05. (C) Genome-wide scan for total locomotor activity on day 3 of the methamphetamine sensitivity tests. (D) Association signal for total locomotor activity in the QTL region on chromosome 9. (E) Association signal for expression of Azi2 in the striatum, in the same region as panel D. Dotted red lines indicate thresholds (p < 0.1) estimated via permutation tests.

eQTLs

In an effort to identify causal genes within our behavioral QTLs, we mapped eQTLs for three brain regions that are critical for the behaviors that we studied. We performed RNA-Seq on messenger RNA (mRNA) from three brain regions: hippocampus (n = 79), striatum (n = 55) and prefrontal cortex (n = 54). In a cis-eQTL scan that was limited to the region flanking the gene being interrogated (Supplementary Figures 19, 20), we identified a total of 6,045 associations for 4,174 genes (Figure 4A, Supplementary Figure 21, Supplementary Table 3) at a permutation-derived significance threshold of p < 0.05 (this threshold reflects a per-gene, per-brain region). For 534 of those genes we identified a cis-eQTL in all three tissues. For an additional 803 genes, we identified a cis-eQTL in two of the three tissues (Figure 4B). The RNA-Seq data were generated from a set of partially overlapping individuals; therefore, we did not perform a joint analysis of the three brain tissues[39]. In addition, we searched for cis-eQTLs by examining allele-specific expression (ASE), which measures relative expression of the two possible RNA alleles derived from a heterozygous SNP[40,41]. We identified 655 genes with ASE in at least one of the three tissues. Of these, 380 (58%) were found only using ASE, and 275 (42%) were also identified in the conventional cis-eQTL scan, suggesting that there was more overlap than would be expected by chance. Overlap was likely limited by several factors, including type I errors in the ASE and type II errors in both the ASE and conventional cis-eQTL mapping. We also mapped eQTLs genome-wide for each gene in an effort to detect trans-eQTLs. We identified 2,278 trans-eQTLs that were significant (p < 0.05 permutation-based threshold) after testing 43,414 transcripts across the three brain regions. We expected almost that many tests to be positive under the null hypothesis. Consistent with this, a quantile-quantile (QQ) plot of these results suggested that only a small number of these results were true positives (Supplementary Figure 22). As expected, most true positive results appear to be from the hippocampus, which had the largest sample size (n = 79).

Integration of behavioral QTLs with eQTLs

Based on evidence from human GWAS, we anticipated that heritable gene expression polymorphisms (eQTLs) would be responsible for most of the observed behavioral associations. Therefore, we tried to identify eQTLs that co-mapped with behavioral QTLs, under the assumption that the eQTL might be the molecular cause of the behavioral QTL. For example, we observed an association between methamphetamine sensitivity and rs46497021 on chromosome 9 (p-value: 1.6 × 10−6; Figure 4C, Supplementary Figure 18). The implicated region was small (<1 Mb) and contained only 2 genes: Cmc1 and Azi2 (Figure 4D). We identified cis-eQTLs for both genes in the striatum, which is the tissue that is most relevant for methamphetamine sensitivity. However, rs46497021 was most strongly correlated with Azi2 expression (p-value: 1.2 × 10−8; Figure 4E). In addition, the pattern of SNPs associated with methamphetamine sensitivity and Azi2 expression showed obvious overlap. Therefore, while both Cmc1 and Azi2 are credible positional candidates, the eQTL data suggest that Azi2 is most likely to be the causative gene. Neither gene has been previously implicated in dopaminergic/striatal processes, suggesting that this observation may offer novel insights into the biology of this drug abuse-relevant trait. Additionally, we identified an association between anxiety-like behavior and rs238465220 on chromosome 13 (p-value: 7.3 × 10−8, Supplementary Figure 17). The implicated region spanned ~1.5 Mb and contained 4 genes: Chrm3, Larp4b, Dip2c, and Zmynd11. Among those genes, rs238465220 was also associated with expression of Zmynd11 in the hippocampus, suggesting that this locus may influence anxiety-like behavior through regulation of Zmynd11 expression (Supplementary Figure 23). Zmydn11 has not been previously implicated in anxiety; however, copy number variants in ZMYND11 were recently shown to be associated with autistic tendencies and aggressive behaviors in humans[42]. These examples illustrate the utility of combining GWAS with eQTL data to identify the molecular mechanism by which a chromosomal region influences a complex trait.

Discussion

We performed a GWAS in a commercially available outbred mouse population, which identified numerous physiological, behavioral, and expression QTLs. In several cases the implicated loci were smaller than 1 Mb and contained just a handful of genes that included an obvious candidate. In addition, we used the eQTL results to further parse among the genes in the intervals that were implicated in the behavioral traits. The goal of using CFW mice was to enhance our mapping resolution. CFW have shorter-range LD than other commercially available populations[9]. Using GBS genotypes, we estimated LD in CFW mice and compared it to other mapping populations (Figure 2B). The 34th generation of the LG/J×SM/J advanced intercross line (LG×SM-AIL) that we have used in prior studies[43-46] showed more extensive LD compared to CFW. Various outbred heterogeneous stocks (HS), typically made up of 8 inbred strains, have also been used in prior mapping efforts[47-49]. We examined one HS[49], and found that it also had longer range LD compared to CFW. The Hybrid Mouse Diversity Panel (HMDP)[50,51], which is a collection of approximately 100 inbred mouse strains that has been used for QTL mapping, also showed greater LD compared to CFW, as did a smaller panel of 30 inbred strains[52]. The DO[12,18,19,53] exhibited LD decay that was almost as degraded as CFW. Populations like the AIL and HS (including the DO) are expected to show decreased LD in the future due to the accumulation of additional recombinations (for example, the LG×SM-AIL is now at generation 62). MF-1 is another commercially available outbred population that has been used to map QTLs[50,54], but we were unable to obtain the data needed to estimate LD decay in this population. Comparing LD patterns in different populations is a common method for estimating mapping resolution[16], however additional factors including the allele frequency distribution[55], population structure[56], error rates and the number, effect size and frequency of causal variants all influence power and mapping resolution. Despite these limitations, our comparison of LD (Figure 2B) and our mapping results (Figures 3–4, Supplementary Table 2, Supplementary Figures 14–18, 21–23) suggest that CFW mice are an attractive option for fine-mapping studies. Another important parameter for GWAS studies is allele frequency, since power to identify associations increases with greater MAF. Laboratory mouse populations have higher average MAF compared to humans or wild populations[17]. We found that 73% of SNPs genotyped in this study had MAF > 0.05, although our SNP filtering steps may have underestimated the number of rare SNPs. Populations produced by crossing inbred strains, such as F2 crosses, recombinant inbred (RI) lines, AILs and HS typically have even more desirable MAF distributions[43]. Because the ascertainment of SNPs included in genotyping platforms directly influences the estimated MAF distribution, we did not attempt to use publicly available data to compare MAFs in commonly used mapping populations. We found that CFW mice lacked genetic variability in certain regions; for example, Chromosome 16 had a low density of polymorphic markers as measured using both GBS and MegaMUGA (Figure 2A) and no significant QTLs (Figure 3A). This is an example of a previously described tendency for laboratory mouse populations to harbor regions that are identical-by-descent[43,57]. Several other advantages of CFW mice include their commercial availability, their low cost, and the ability to acquire non-siblings upon request. We also found that the CFW mice were easy to handle, and their uniform coat color simplified automated scoring of certain behavioral traits. One barrier to more widespread adoption of GWAS in mice has been the lack of universal and economical SNP genotyping platforms. One innovative aspect of this paper is the use of GBS to overcome this obstacle. GBS is a reduced-representation sequencing approach in which a small fraction of the genome is sequenced at moderate depth in order to obtain genotypes at a subset of markers. While GBS shares some characteristics with low-coverage whole-genome sequencing[58-60], GBS yields high coverage for a subset of the genome, thus acquiring information about fewer SNPs but with greater confidence. Our GBS methods included a custom-designed library preparation protocol (which reduced per-sample costs), and used the standard software toolkits GATK[61] and IMPUTE2[62]. An advantage of GBS was that it did not require pre-selection of polymorphic SNPs. We chose conservative criteria for SNP calling, which yielded 92,734 SNPs, of which 14% were newly discovered and possibly unique to CFW. These 92,734 SNPs provided extensive coverage of the genome (Figure 2A) and allowed for fine-mapping (Figures 3C–D, 4D–E and Supplementary Figures 17, 23). The number of markers obtained using GBS can be titrated by varying the restriction enzymes used, the fragment sizes selected and the degree of sample multiplexing. GBS involves imputation to correct errors and to populate missing genotypes, requiring more expertise than analysis of SNP genotyping arrays. Compared to conventional array-based SNP genotyping, GBS had a higher error rate, which is expected to modestly decrease power, but should not produce false positive QTLs since the errors will not be correlated with the traits. We are currently improving genotype imputation methods for populations in which the founder haplotypes are known, such as AILs and HS populations [12,45,46,63,64]. Because the monetary advantage of GBS over array-based genotyping will continue to improve as sequencing prices decrease, we anticipate that GBS and other sequencing-based approaches will supplant array-based methods in the coming years. The majority of human GWAS findings implicate regulatory rather than coding differences[4,11]. The identified haplotypes frequently contain several genes. It is now widely appreciated that even when an association can be localized to a single gene, that gene may not be the cause of the association[65], meaning that proximity to the peak SNP is not sufficient to identify the causal gene. eQTLs can provide the crucial link between a region implicated by GWAS and the biological processes that underlie that association. Therefore, a major goal of our study was to integrate behavioral QTL and eQTL data. We used RNA-Seq to examine gene expression in three brain regions that are known to be important for the behavioral traits that we studied. Although Azi2 was not an obvious candidate for the behavioral QTL for methamphetamine sensitivity, our data showing the co-mapping of an eQTL for Azi2 expression in the striatum provide an additional layer of evidence. Similarly, Zmynd11 has not been previously implicated in anxiety-like behavior, but the eQTL for Zmynd11 expression in the hippocampus suggests that it is the most promising of the four genes within the behavioral QTL. These examples demonstrate the power of integrating fine-mapping of behavioral QTLs and eQTLs and extend on multiple prior mouse studies that have used similar approaches in conjunction with F2 crosses[66], RIs[67-69], selected lines[70], HS[71], outbred MF-1 mice[50] and the HMDP[51,72,73]. RNA-Seq offers a number of advantages relative to array-based gene expression measurements[74-80]. In particular, we were able to map cis- and trans-eQTLs using a traditional mapping approach, and simultaneously map cis-eQTLs by quantifying ASE. Since only a fraction of genes can be studied using ASE, we did not anticipate complete overlap between genes identified using these two approaches. Using ASE we identified 655 cis-eQTLs of which 42% were also identified as cis-eQTLs using conventional mapping. We found that physiological traits typically had slightly higher heritabilities than behavioral traits (Supplementary Table 1). We also found the effect sizes of individual associations tended to be higher for physiological traits (Supplementary Table 2), consistent with findings from another recent study in rats[81]. However, it was not always true that traits with the highest heritabilities also showed the largest effect sizes for individual associations. Because the effect size of individual QTL alleles is of paramount importance for assessing power at a given sample size, and because this parameter is never known in advance, it is not possible to provide general guidelines about the sample size needed for future studies. Based on our results, we suggest that a sample size of 1,000 or more CFW mice should be used for most traits, though traits like testis weight and abnormal BMD would have yielded significant results with just a few hundred mice. While our use of the CFW was intended to increase mapping precision, there is a direct tradeoff between mapping precision and statistical power[6], therefore sample sizes required for studies using CFW will necessarily be larger than for F2, recombinant inbred, or other traditional mapping populations that offer less precision. Our data do not directly address the reasons that the effect sizes we observed are so much larger than the effect sizes observed in most human GWAS. We can speculate that the unique population history of laboratory mice (domestication, selection and repeated population bottlenecks) have increased the frequency of alleles that may have been rare in ancestral wild mouse populations. It is also true that, unlike many traits studied in human GWAS, the traits we are examining are not disease traits (and thus may not influence fitness), and therefore may not have been influenced by natural selection even among ancestral wild mouse populations from which laboratory populations were originally derived. Furthermore, laboratory mice are drawn from a much more uniform environment, potentially diminishing gene-by-environment interactions that may reduce effect sizes in human GWAS. Finally, because LD in the CFW mice is more extensive as compared to humans, we are effectively testing fewer hypotheses and therefore applied a lower (permutation-derived) significance threshold. We have shown that use of CFW mice in conjunction with GBS and RNA-Seq provides a powerful and efficient means for identifying genetic associations, and for nominating candidate genes within the associated regions. Compared to other outbred mouse populations, CFW mice showed rapid decay of LD (Figure 2B), were less expensive, and primarily allowed examination of domesticus derived alleles (Figure 2C). Compared to human GWAS, this approach provided dramatically reduced costs, the ability to examine phenotypes that include experimental manipulations that would be impractical or unethical in humans, the ability to obtain tissue samples for expression analysis, and the ability to exert exquisite control over environmental variables. Identified genes can be manipulated in future studies via genome engineering[82]. Thus, our approach can be used to rapidly generate specific and testable hypotheses for a wide array of complex traits. More broadly, our results demonstrate methods and principles that apply to a variety of other model systems.

Online Methods

Animal Models

We phenotyped 1,200 male Carworth Farms White (CFW) mice (Mus musculus) that were obtained from the Charles River Laboratories facility in Portage, Michigan, USA (CRL; strain code: CRL:CFW(SW); facility code: P08). We performed a power analysis using the program Quanto (hydra.usc.edu/gxe). This indicated that 1200 mice would provide 80% power to detect QTLs that accounted for ~3% of total trait variance with p < 5 x 10−7. Since our study was completed, the Portage colony has been relocated to Kingston, New York (new code K92). It has been reported that ancestors of the CFW mice were obtained from a large colony of Swiss mice in 1926, and maintained by Dr. Webster at the Rockefeller Institute. A single pair of highly inbred albino mice were later acquired by Carworth Farms and used to initiate an outbred mouse stock. Several mice from this colony were later acquired in 1974 by CRL and were subsequently maintained as an outbred population[8-10]. Every two weeks, 48 male CFW mice were shipped from CRL in Portage, MI to our laboratory in Chicago, IL. We requested that CRL send only one mouse from each litter to avoid obtaining siblings, since close relatives reduce power to map QTLs, and complicate analysis. The average age of the mice upon arrival in our labs was 35 days (ranging from 34 to 46 days), and their average weight was 25.5 g (ranging from 13.4 g to 38.7 g). Mice were housed 4 per cage and given ~15 days to adapt to their new environment (Supplementary Figure 1). Standard lab chow and water were available ad libitum, except during the behavioral procedures and prior to testing for fasting glucose. Mice were maintained on a standard 12:12h light-dark cycle (lights on at 06:30). All phenotyping occurred during the light phase between 08:00 and 16:00 hours, over the period of August 2011 to December 2012. All procedures were approved by the University of Chicago Institutional Animal Care and Use Committee (IACUC) in accordance with National Institute of Health guidelines for the care and use of laboratory animals.

Phenotyping

The order of phenotyping was identical for each mouse, and is shown schematically in Supplementary Figure 1. One day after arrival, mice were fasted for four hours prior to measurement of blood glucose levels. Fourteen days later, we assessed their response to a novel environment and to administration of 1.5 mg/kg of methamphetamine in a 3-day paradigm[64]. Twelve days later, we tested mice for conditioned fear[46]. Nine days after that, we tested mice for prepulse inhibition[44] (PPI). Finally, after 15 days we weighed and sacrificed the mice. Immediately after sacrifice, we weighed testis, and collected one leg for measurement of muscle phenotypes, and collected the other leg for measurement of bone-phenotypes. We also measured tail length at this time (Supplementary Note).

RNA-Seq

After sacrifice, we collected brain tissue from a subset of mice as a source of mRNA from the hippocampus (n = 79), striatum (n = 55) and frontal cortex (n = 54). We used RNA-Seq[84,85] to quantify gene transcript abundance in these brain tissues. Library preparation was performed with the TruSeq RNA Sample Kit (Illumina). Samples were multiplexed 5-per lane and sequenced on an Illumina HiSeq 2000 sequencer, using single-end 100-bp reads. We processed the RNA-Seq short reads using the Tuxedo software suite[79]: (1) first, we aligned the short reads to the reference genome assembly (NCBI release 38, mm10) with bowtie2[86]; (2) next, we used tophat2[79] to align the short reads to known splice junctions; (3) finally, we used cufflinks[87] to calculate, for each gene, a gene-level measure of expression based on the mapped reads. This measure is reported in reads per kilobase per million reads mapped (RPKM). This measure does not depend on length of coding sequences or sequencing depth of each sample (so mapping eQTLs will not be biased by these factors). We focused on this gene-level measurement for subsequent investigation, including eQTL mapping and assessment of allele-specific expression (ASE). See Supplementary Note for further details.

Genotyping-by-sequencing (GBS)

Genotyping-by-sequencing (GBS) is a reduced-representation genotyping method for obtaining genotyping information by sequencing only regions that are proximal to a restriction enzyme cut site[15]. Our protocol was adapted from the procedures previously described[88]. GBS libraries were prepared by digesting genomic DNA with a restriction enzyme, PstI, and annealing oligonucleotide adapters to the resulting overhangs. Samples were multiplexed 12-per lane, and sequenced on an Illumina HiSeq 2000 sequencer using single-end 100-bp reads. We obtained an average of 4.8M reads per sample. By focusing the sequencing effort on the Pstl restriction sites, we obtained high coverage (~15x, Supplementary Figure 24) at a subset of genomic loci, although those reads were very non-uniformly distributed. We aligned the 100-bp single-end reads to Mouse Reference Assembly 38 from the NCBI database (mm10) using bwa[89]. We used GATK[61,90] to discover variants and to obtain genotype probabilities. For the Variant Quality Score Recalibration (VQSR) step, we calibrated variant discovery against (1) whole-genome sequencing (WGS) data that we ascertained from a small set of CFW mice, (2) SNPs and indels from the Wellcome Trust Sanger Mouse Genome project[91], and SNPs available in dbSNP release 137. We used IMPUTE2[62] to improve low-confidence genotypes, or genotypes that were not called in individual mice. Supplementary Table 4 and the Supplementary Note detail our efforts to estimate the error rate of GBS in this study. The Supplementary Note also contains a description of a small number of SNPs that were discarded because a large proportion of the genotypes were imputed with low certainty. Finally, the Supplementary Note details our identification of 110 DNA samples that appeared to be mislabeled and were therefore excluded from our study (Supplementary Figures 25–28).

Treemix analysis

We estimated phylogenetic relationships between the CFW mice and different lab strains sequenced as part of the Wellcome Trust mouse genome sequencing project using treemix[92]. We used the genotypes for the lab strains sequenced by the Wellcome Trust to obtain the locations of SNPs that were identified in the CFW mice using our GBS pipeline. We excluded the mus spretus strain from the Wellcome Trust data, since this strain was included as an outgroup. Since the lab strains are all inbred, we assumed that the allele frequency was 1 or 0. We represented each strain by only a single individual. We used a subset of 100 CFW mice to compute the allele frequencies from the genotype likelihoods of GBS SNPs in our sample. Treemix was used to fit a maximum-likelihood tree to all the lab strains and CFW samples.

QTL mapping for behavioral and physiological traits

We performed a GWAS for the behavioral and physiological phenotypes using all SNPs with MAF > 2% and good imputation quality (defined as 95% of the samples having a maximum probability genotype greater than 0.5). Although our analyses did not suggest the presence of close relatives or population structure, we used the linear-mixed model implemented in the program GEMMA[93]. GEMMA is similar to a standard linear regression, in which the quantitative trait (Y) is modeled as a linear combination of the genotype (X) and the covariates (Z), except that it includes an additional “random” or “polygenic” effect capturing the covariance structure in the phenotype that is attributed to genome-wide genetic sharing: The notation in this expression is defined as follows: y is the ith phenotype sample; z is ith sample of covariate k, in which k ranges from 1 to the number of covariates included in the regression (m); α is the coefficient corresponding to covariate k; x is the genotype of sample i at SNP j; β is the coefficient corresponding to SNP j; u is the polygenic effect for the ith sample; ε is the residual error; and μ is the intercept. The genotype, x, is represented as the expected allele count, in which 0 represents a homozygous major allele, and 2 represents a homozygous minor allele, and β is the additive effect of the expected allele count on the phenotype. The residuals ε are assumed to be i.i.d. normal with zero mean and covariance σ2, whereas the polygenic effect u = (u is a random vector drawn from the multivariate normal distribution with mean zero and n × n covariance matrix σ2 λK, where n is the number of samples. We estimated the relatedness matrix, K, from the genotype data. We specified the covariance matrix using the realized relationship matrix K = XX/p, where p is the number of SNPs, and X is the n × p genotype matrix with entries x. This formulation was derived from a polygenic model of the phenotype in which all SNPs helped explain variance in the phenotype, and the contributions of individual SNPs were i.i.d. normal[94-96]. The inclusion of a genetic marker in both the fixed and random terms can deflate the test statistic for this marker, leading to a loss of power to detect a QTL; this problem has been termed “proximal contamination”[95]. To avoid proximal contamination, we computed 19 different K matrices, each one excluding one of the 19 autosomes. To scan markers on a given chromosome, we used the version of K that did not include that chromosome. We have previously proposed this leave-one-chromosome-out (LOCO) approach as a simple solution for avoiding the problem of proximal contamination[23]. We used a permutation-based approach to calculate the genome-wide significance threshold for p-values calculated in GEMMA. We estimated the distribution of p-values under the null hypothesis by mapping QTLs in 1,000 randomly permuted data sets, then taking the threshold to be the 100(1 − α)th percentile of this distribution, with α= 0.1. Although this permutation test is technically only valid under the assumption that the samples are exchangeable[97], we have previously suggested that ‘naive’ permutations are generally sufficient[23]. Furthermore, given our observation that population structure is subtle, we expect that this simulation provides a good approximation to the null (Supplementary Note).

Heritability estimates

Instead of computing a point estimate for h2, which is the usual approach (e.g. using the REML estimate[98]), we evaluated the likelihood over a regular grid of values for h2, which allowed us to directly quantify uncertainty in h2 under the reasonable assumption of a uniform prior for the proportion of variance explained[96]. We estimated the SNP heritability, h, of our phenotypes[21]. Because the GBS SNPs did not completely tag all casual variants (and because we excluded the sex chromosomes), our estimates of h2 underestimated a trait’s true narrow-sense heritability. To estimate h, we assumed that all genetic markers made some small contribution to variation in the trait, and that these contributions were normally distributed with the same variance[96,98,99]. Under this polygenic model, the covariance of the phenotype measurements was Cov(Y, …,Y = σ2H, where , I is the n × n identity matrix, K is the n × n realized relatedness matrix, σa2 is the variance of the additive genetic effects, and σ2 is the variance of the residuals. Under this formulation, σa2 represents the relative contribution of the additive genetic variance, and we can use this parameter to provide an estimate for h2: where s is the mean sample variances of all the available SNPs, or the mean of the diagonal entries of K assuming that the columns of X are centered so that each of the columns have a mean of zero. See Supplementary Note for further details.

Expression QTL (eQTL) mapping

We used the RPKM measurements from RNA-Seq and the GBS genotype data to map eQTLs. We performed an eQTL scan separately in each brain tissue (hippocampus, striatum, prefrontal cortex). First, we discarded genes with low levels of expression (RPKM < 1), and genes that showed no variability in expression. For the remaining genes, we quantile-normalized the expression data. To account for unknown confounders, we removed linear effects of the first few principal components (PCs) calculated from the K x N gene expression matrix (20 PCs for hippocampus, 10 PCs for striatum, 20 PCs for prefrontal cortex)[41]. After removing linear effects of the PCs, we again quantile-normalized the expression data. We then used an LMM as implemented in GEMMA to scan for cis-eQTLs, as described above for the behavioral and physiological phenotypes. To define cis-eQTLs, we only considered SNPs within 1 Mb of the gene’s transcribed region (preliminary analyses indicated that 1 MB captured most of the significant signals; Supplementary Figures 19–20). We used a permutation-based approach to calculate significance thresholds for p-values in the cis-eQTL mapping. We used 1,000 permutations of the expression values to compute a separate significance threshold for each gene, using only the SNPs that were included in the cis-eQTL scan. In addition to cis-eQTL scans, we also performed genome-wide trans-eQTLs scans for all the genes. The genome-wide scans were performed using the same LMM that was used for cis-eQTL analyses, except that all the SNPs outside a 2 Mb region around the gene were included in the trans-eQTL analysis. The significance threshold for trans-eQTLs was computed using permutations of 1,000 randomly selected genes in each tissue; this approach is permissible because all expression traits were quantile-normalized (Supplementary Note).

Allele-specific expression (ASE)

Finally, we performed an analysis of allele-specific expression (ASE) to identify genes that had ASE QTLs. This analysis was performed independently from the mapping of cis-eQTLs described above. We identified variants that had at least 10 samples with high-confidence heterozygote genotype calls. For genes that contained at least one such variant, we compared the relative expression of the two alleles across these heterozygote samples. To account for overdispersion, we used a beta-binomial model to fit the counts of the two alleles for each sample. We then used a likelihood-ratio test to test for significant deviation of the observed data from the expectation of equal counts from both alleles (Supplementary Figure 29 and Supplementary Note).

99 in total

1. SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples.

Authors: Si Quang Le; Richard Durbin
Journal: Genome Res Date: 2010-10-27 Impact factor: 9.043

2. Increased accuracy of artificial selection by using the realized relationship matrix.

Authors: B J Hayes; P M Visscher; M E Goddard
Journal: Genet Res (Camb) Date: 2009-02 Impact factor: 1.588

3. Estimation of SNP heritability from dense genotype data.

Authors: S Hong Lee; Jian Yang; Guo-Bo Chen; Stephan Ripke; Eli A Stahl; Christina M Hultman; Pamela Sklar; Peter M Visscher; Patrick F Sullivan; Michael E Goddard; Naomi R Wray
Journal: Am J Hum Genet Date: 2013-12-05 Impact factor: 11.025

Review 4. The role of regulatory variation in complex traits and disease.

Authors: Frank W Albert; Leonid Kruglyak
Journal: Nat Rev Genet Date: 2015-02-24 Impact factor: 53.242

5. The so-called Swiss mouse.

Authors: C J Lynch
Journal: Lab Anim Care Date: 1969-04

Review 6. Dissecting quantitative traits in mice.

Authors: Richard Mott; Jonathan Flint
Journal: Annu Rev Genomics Hum Genet Date: 2013-07-03 Impact factor: 8.929

7. A deletion mutation in bovine SLC4A2 is associated with osteopetrosis in Red Angus cattle.

Authors: Stacey N Meyers; Tara G McDaneld; Shannon L Swist; Brandy M Marron; David J Steffen; Donal O'Toole; Jeffrey R O'Connell; Jonathan E Beever; Tad S Sonstegard; Timothy P L Smith
Journal: BMC Genomics Date: 2010-05-27 Impact factor: 3.969

Review 8. RNA-Seq: a revolutionary tool for transcriptomics.

Authors: Zhong Wang; Mark Gerstein; Michael Snyder
Journal: Nat Rev Genet Date: 2009-01 Impact factor: 53.242

9. Quantitative trait locus mapping methods for diversity outbred mice.

Authors: Daniel M Gatti; Karen L Svenson; Andrey Shabalin; Long-Yang Wu; William Valdar; Petr Simecek; Neal Goodwin; Riyan Cheng; Daniel Pomp; Abraham Palmer; Elissa J Chesler; Karl W Broman; Gary A Churchill
Journal: G3 (Bethesda) Date: 2014-09-18 Impact factor: 3.154

10. Refining analyses of copy number variation identifies specific genes associated with developmental delay.

Authors: Bradley P Coe; Kali Witherspoon; Jill A Rosenfeld; Bregje W M van Bon; Anneke T Vulto-van Silfhout; Paolo Bosco; Kathryn L Friend; Carl Baker; Serafino Buono; Lisenka E L M Vissers; Janneke H Schuurs-Hoeijmakers; Alex Hoischen; Rolph Pfundt; Nik Krumm; Gemma L Carvill; Deana Li; David Amaral; Natasha Brown; Paul J Lockhart; Ingrid E Scheffer; Antonino Alberti; Marie Shaw; Rosa Pettinato; Raymond Tervo; Nicole de Leeuw; Margot R F Reijnders; Beth S Torchia; Hilde Peeters; Brian J O'Roak; Marco Fichera; Jayne Y Hehir-Kwa; Jay Shendure; Heather C Mefford; Eric Haan; Jozef Gécz; Bert B A de Vries; Corrado Romano; Evan E Eichler
Journal: Nat Genet Date: 2014-09-14 Impact factor: 38.330

34 in total

1. Genome-wide Associations Reveal Human-Mouse Genetic Convergence and Modifiers of Myogenesis, CPNE1 and STC2.

Authors: Ana I Hernandez Cordero; Natalia M Gonzales; Clarissa C Parker; Greta Sokolof; David J Vandenbergh; Riyan Cheng; Mark Abney; Andrew Sko; Alex Douglas; Abraham A Palmer; Jennifer S Gregory; Arimantas Lionikas
Journal: Am J Hum Genet Date: 2019-11-21 Impact factor: 11.025

2. Novel biomarkers to assess in utero effects of maternal opioid use: First steps toward understanding short- and long-term neurodevelopmental sequelae.

Authors: Laura Goetzl; Tara Thompson-Felix; Nune Darbinian; Nana Merabova; Salim Merali; Carmen Merali; Kathryne Sanserino; Tamara Tatevosian; Bruno Fant; Mathieu E Wimmer
Journal: Genes Brain Behav Date: 2019-06-11 Impact factor: 3.449

Review 3. Mouse Systems Genetics as a Prelude to Precision Medicine.

Authors: Hao Li; Johan Auwerx
Journal: Trends Genet Date: 2020-02-06 Impact factor: 11.639

4. Genome-Wide Association Study in 3,173 Outbred Rats Identifies Multiple Loci for Body Weight, Adiposity, and Fasting Glucose.

Authors: Apurva S Chitre; Oksana Polesskaya; Katie Holl; Jianjun Gao; Riyan Cheng; Hannah Bimschleger; Angel Garcia Martinez; Tony George; Alexander F Gileta; Wenyan Han; Aidan Horvath; Alesa Hughson; Keita Ishiwari; Christopher P King; Alexander Lamparelli; Cassandra L Versaggi; Connor Martin; Celine L St Pierre; Jordan A Tripi; Tengfei Wang; Hao Chen; Shelly B Flagel; Paul Meyer; Jerry Richards; Terry E Robinson; Abraham A Palmer; Leah C Solberg Woods
Journal: Obesity (Silver Spring) Date: 2020-08-29 Impact factor: 5.002

5. Heritable natural variation of light/dark preference in an outbred zebrafish population.

Authors: Amelia Dahlén; Mahendra Wagle; Mahdi Zarei; Su Guo
Journal: J Neurogenet Date: 2019-09-22 Impact factor: 1.250

6. Genetic Basis of Aerobically Supported Voluntary Exercise: Results from a Selection Experiment with House Mice.

Authors: David A Hillis; Liran Yadgary; George M Weinstock; Fernando Pardo-Manuel de Villena; Daniel Pomp; Alexandra S Fowler; Shizhong Xu; Frank Chan; Theodore Garland
Journal: Genetics Date: 2020-09-25 Impact factor: 4.562

7. Genome-wide association for testis weight in the diversity outbred mouse population.

Authors: Joshua T Yuan; Daniel M Gatti; Vivek M Philip; Steven Kasparek; Andrew M Kreuzman; Benjamin Mansky; Kayvon Sharif; Dominik Taterra; Walter M Taylor; Mary Thomas; Jeremy O Ward; Andrew Holmes; Elissa J Chesler; Clarissa C Parker
Journal: Mamm Genome Date: 2018-04-24 Impact factor: 2.957

Review 8. Post-genomic behavioral genetics: From revolution to routine.

Authors: D G Ashbrook; M K Mulligan; R W Williams
Journal: Genes Brain Behav Date: 2017-12-21 Impact factor: 3.449

9. ^AZI23'UTR Is a New SLC6A3 Downregulator Associated with an Epistatic Protection Against Substance Use Disorders.

Authors: Kefu Liu; Jinlong Yu; Juan Zhao; Yanhong Zhou; Nian Xiong; Jie Xu; Tao Wang; Richard L Bell; Hong Qing; Zhicheng Lin
Journal: Mol Neurobiol Date: 2017-10-05 Impact factor: 5.590

10. Integrated analysis of population genomics, transcriptomics and virulence provides novel insights into Streptococcus pyogenes pathogenesis.

Authors: Priyanka Kachroo; Jesus M Eraso; Stephen B Beres; Randall J Olsen; Luchang Zhu; Waleed Nasser; Paul E Bernard; Concepcion C Cantu; Matthew Ojeda Saavedra; María José Arredondo; Benjamin Strope; Hackwon Do; Muthiah Kumaraswami; Jaana Vuopio; Kirsi Gröndahl-Yli-Hannuksela; Karl G Kristinsson; Magnus Gottfredsson; Maiju Pesonen; Johan Pensar; Emily R Davenport; Andrew G Clark; Jukka Corander; Dominique A Caugant; Shahin Gaini; Marita Debess Magnussen; Samantha L Kubiak; Hoang A T Nguyen; S Wesley Long; Adeline R Porter; Frank R DeLeo; James M Musser
Journal: Nat Genet Date: 2019-02-18 Impact factor: 38.330