Content and Performance of the MiniMUGA Genotyping Array: A New Tool To Improve Rigor and Reproducibility in Mouse Research.

John Sebastian Sigmon¹, Matthew W Blanchard^2,3, Ralph S Baric⁴, Timothy A Bell², Jennifer Brennan³, Gudrun A Brockmann⁵, A Wesley Burks⁶, J Mauro Calabrese^7,8, Kathleen M Caron⁹, Richard E Cheney⁹, Dominic Ciavatta², Frank Conlon¹⁰, David B Darr⁸, James Faber⁹, Craig Franklin¹¹, Timothy R Gershon¹², Lisa Gralinski⁴, Bin Gu⁹, Christiann H Gaines², Robert S Hagan¹³, Ernest G Heimsath^8,9, Mark T Heise², Pablo Hock², Folami Ideraabdullah^2,8,14, J Charles Jennette¹⁵, Tal Kafri^16,17, Anwica Kashfeen¹, Mike Kulis⁶, Vivek Kumar¹⁸, Colton Linnertz², Alessandra Livraghi-Butrico¹⁹, K C Kent Lloyd^20,21,22, Cathleen Lutz¹⁸, Rachel M Lynch^2,8, Terry Magnuson^2,3,8, Glenn K Matsushima^16,23, Rachel McMullan², Darla R Miller^2,8, Karen L Mohlke², Sheryl S Moy^24,25, Caroline E Y Murphy², Maya Najarian¹, Lori O'Brien⁹, Abraham A Palmer²⁶, Benjamin D Philpot^9,19, Scott H Randell⁹, Laura Reinholdt¹⁸, Yuyu Ren²⁶, Steve Rockwood¹⁸, Allison R Rogala^15,27, Avani Saraswatula², Christopher M Sassetti²⁸, Jonathan C Schisler⁷, Sarah A Schoenrock², Ginger D Shaw², John R Shorter², Clare M Smith²⁸, Celine L St Pierre²⁶, Lisa M Tarantino^2,29, David W Threadgill^26,30, William Valdar², Barbara J Vilen¹⁶, Keegan Wardwell¹⁸, Jason K Whitmire², Lucy Williams², Mark J Zylka⁹, Martin T Ferris³¹, Leonard McMillan¹, Fernando Pardo Manuel de Villena^2,3,8.

Abstract

The laboratory mouse is the most widely used animal model for biomedical research, due in part to its well-annotated genome, wealth of genetic resources, and the ability to precisely manipulate its genome. Despite the importance of genetics for mouse research, genetic quality control (QC) is not standardized, in part due to the lack of cost-effective, informative, and robust platforms. Genotyping arrays are standard tools for mouse research and remain an attractive alternative even in the era of high-throughput whole-genome sequencing. Here, we describe the content and performance of a new iteration of the Mouse Universal Genotyping Array (MUGA), MiniMUGA, an array-based genetic QC platform with over 11,000 probes. In addition to robust discrimination between most classical and wild-derived laboratory strains, MiniMUGA was designed to contain features not available in other platforms: (1) chromosomal sex determination, (2) discrimination between substrains from multiple commercial vendors, (3) diagnostic SNPs for popular laboratory strains, (4) detection of constructs used in genetically engineered mice, and (5) an easy-to-interpret report summarizing these results. In-depth annotation of all probes should facilitate custom analyses by individual researchers. To determine the performance of MiniMUGA, we genotyped 6899 samples from a wide variety of genetic backgrounds. The performance of MiniMUGA compares favorably with three previous iterations of the MUGA family of arrays, both in discrimination capabilities and robustness. We have generated publicly available consensus genotypes for 241 inbred strains including classical, wild-derived, and recombinant inbred lines. Here, we also report the detection of a substantial number of XO and XXY individuals across a variety of sample types, new markers that expand the utility of reduced complexity crosses to genetic backgrounds other than C57BL/6, and the robust detection of 17 genetic constructs. 
We provide preliminary evidence that the array can be used to identify both partial sex chromosome duplication and mosaicism, and that diagnostic SNPs can be used to determine how long inbred mice have been bred independently from the relevant main stock. We conclude that MiniMUGA is a valuable platform for genetic QC, and an important new tool to increase the rigor and reproducibility of mouse research.

Keywords: chromosomal sex; diagnostic SNPs; genetic QC; genetic background; genetic constructs; substrains

Year: 2020 PMID: 33067325 PMCID: PMC7768238 DOI: 10.1534/genetics.120.303596

Source DB: PubMed Journal: Genetics ISSN: 0016-6731 Impact factor: 4.402

The laboratory mouse is among the most popular and extensively used models for biomedical research. For example, in 2018 the word “mouse” appeared in the abstract of over 82,000 scientific manuscripts available in PubMed. The laboratory mouse is such an attractive model due to the existence of hundreds of inbred strains and outbred lines designed to address specific questions, as well as the ability to edit the mouse genome; originally by homologous recombination and now with more efficient and simple techniques such as clustered regularly interspaced short palindromic repeats (Lanigan ). The centrality of genetics in mouse-enabled research begs the question of how genetic quality control (QC) is performed in these experiments. At a minimum, genetic QC should provide reliable information about the sex, genetic background, and presence of genetic constructs in a given sample in a robust and cost-effective manner. We have a long track record of developing genotyping arrays for the laboratory mouse, from the Mouse Diversity Array (MDA, Yang ) to the previous versions of the Mouse Universal Genotyping Array (MUGA) (Morgan and Welsh 2015). These tools were originally designed for the genetic characterization of two popular genetic reference populations, the Collaborative Cross (CC) and the Diversity Outbred, and then used in experiments involving other laboratory strains as well as wild mice (Yang ; Collaborative Cross Consortium 2012; Carbonetto ; Arends ; Didion ; Rosshart ; Shorter ; Srivastava ; Veale ). In the following paragraphs, we discuss each of the main components of genetic QC as defined above. Sex is widely recognized as a key biological variable. Standard chromosomal sex determination in the MDA and MUGA relied on detection of the Y chromosome. This approach has limitations, most notably the inability to identify sex chromosome aneuploidies. 
In mammals, including humans and mice, sex chromosome abnormalities are relatively frequent (Searle and Jones 2002; Cheng ), and thus the ability to detect them will substantially improve genetic QC. The ability to discriminate between genetic backgrounds is critical for genetic QC, and all previous platforms were able to accomplish this goal to varying degrees. Array-based discrimination depends on the number of markers, their spatial distribution, and the ascertainment bias of those markers. While the MDA and several previous iterations of the MUGA had tens to hundreds of thousands of markers, the selection of those markers depended heavily on whole-genome sequencing (WGS) data from <20 inbred strains (Yang , 2009, 2011; Keane ; Morgan ). Thus, these platforms provided very fine-grained discrimination for some strains and coarse or happenstance discrimination for many others. An extreme example of the latter is the very poor discrimination between substrains. Mouse genetics is built on the phenotypic differences observed between strains and, recently, we have come to appreciate that closely related substrains can be phenotypically divergent due to variants accumulated by genetic drift (Kumar ; Treger ). Drift within an inbred strain can also lead to phenotypic divergence. The presence of a genetic construct is designed to make carriers phenotypically different from noncarriers (Lanigan ). Thus, the ability to detect genetic constructs will enhance genetic QC. Currently, detection of constructs is achieved by custom PCR designed for each construct in a given mouse stock. Because this process is costly and time-consuming, most researchers only test for the desired construct(s) used in their experiment. Many mouse stocks are the product of breeding mice with different constructs (e.g., flox-flanked knockouts and Cre recombinase are simultaneously present in many stocks). 
This cross-breeding and the common process of sharing genetically modified stocks between groups can lead to the accumulation of genetic constructs. A genetic QC platform that tests for multiple commonly used constructs will therefore be highly desirable. The MDA and the first two iterations of the MUGA arrays were not designed to detect any constructs (Yang ; Morgan ). Efforts to extend the use of the MUGA to detect genetic constructs were met with limited success in GigaMUGA (Morgan ). We conclude that current genotyping tools are suboptimal for construct detection. For microarray-based construct detection, the most valuable assays are those that can detect the most popular constructs independent of site of insertion and the genetic background of the sample. In addition to its value as a genetic QC tool, a well-designed genotyping array can also be a valuable tool for experimental research. Two such areas of research are sex chromosome biology and genetic mapping using reduced-complexity crosses (RCC; Kumar ; Bryant ). RCC are predicated on the idea that if a heritable phenotype is variable between a pair of closely related laboratory substrains, then QTL mapping combined with a complete catalog of the few thousand variants that differ among these substrains can lead to the rapid identification of the candidate causal variants (Kumar ; Babbs ). The exact number of variants between a pair of substrains, or within a set of substrains, varies substantially but is several orders of magnitude fewer than between classical strains (Mortazavi ; M. T. Ferris, unpublished data). This addresses one of the major limitations of standard mouse crosses, namely the cost in time and resources to move from QTL to quantitative trait variants (Scalzo and Yokoyama 2008). 
The RCC concept has been successfully demonstrated in crosses between C57BL/6J and C57BL/6NJ (Kumar ; Babbs ), but existing genotyping platforms do not support extension to other crosses because of the lack of sufficient informative markers to support robust QTL mapping. Identification of such markers requires WGS from all parental substrains used in the RCC. By definition, most of these markers will be diagnostic for one substrain, thus improving genetic background identification. To address these limitations, we created a fourth iteration of the MUGA family of arrays that we call MiniMUGA. The central considerations for the design were to reduce genotyping costs, robustly determine chromosomal sex, provide broad discrimination between most inbred strains and substrains, and reliably detect the presence of popular genetic constructs. We also incorporated diagnostic variants for multiple substrains to expand RCC to crosses between those substrains. MiniMUGA fulfills all our criteria and facilitates simple, uniform, and cost-effective standard genetic QC, as well as serving the mouse community at large by providing a new tool for genetic studies.

Materials and Methods

Reference samples

To test the performance of the MiniMUGA array, we genotyped 6899 DNA samples from a wide range of genetic backgrounds, ages, and tissues (Supplemental Material, Table S1). These samples include examples of inbred strains, F1 hybrids, experimental crosses, and cell lines (Table 1). The array content was designed in two phases, resulting in preliminary and production versions of the array. We genotyped 5604 samples in the preliminary version of the array containing 10,171 markers. We genotyped 1295 samples in the production version of the array. The production version of the array includes 954 additional markers selected to increase coverage of diagnostic SNPs for selected substrains (905 markers targeting 39 substrains) and additional constructs (49 markers targeting seven constructs). Samples were genotyped to determine the marker performance and information content, and to develop the multiple pipelines discussed throughout this paper. Overall, 6300 samples were genotyped once and 225 samples were genotyped two or more times, resulting in a total of 6525 unique samples genotyped.

Table 1

Sample set

Content	Chromosomal sex	Inbred	F1	CC	Cross	Unclassified	Cell lines	Total
Preliminary	XX	138	131	305	1383	817	87	2861
	XY	265	41	181	1236	907	74	2704
	XO	0	1	3	11	8	9	32
	XXY	0	1	1	2	3	0	7
Subtotal								5604
Production	XX	41	59	40	580	21	4	745
	XY	153	13	7	248	112	10	543
	XO	0	1	0	2	0	0	3
	XXY	0	0	0	4	0	0	4
Subtotal								1295
Total		597	247	537	3466	1868	184	6899

The table provides the number of samples genotyped in the preliminary and production versions of the array, classified according to their chromosomal sex and type. CC, collaborative cross; Cross, experimental back- and intercrosses; unclassified, samples provided by the coauthors that may be of any type.

Table S1 provides comprehensive information about each of these samples including name, type, whether it was genotyped in the preliminary or production version of the array, whether it was used in the array calibration process, and whether the sample was used to determine consensus genotypes or thresholds for chromosomal sex determination. Table S1 also lists chromosomal sex, basic QC metrics, and values used to determine the presence of 17 constructs. A complete description of the information provided in Table S1 is available in the table legend. DNA stocks for inbred strains were purchased from The Jackson Laboratory over a decade ago, or provided by the authors. DNA from most other samples was prepared from tail clips or spleens using the DNeasy Blood & Tissue Kit (catalog no. 69506; QIAGEN, Valencia, CA). Approximately 1.5 μg of genomic DNA per sample was shipped to Neogen Inc. (Lincoln, NE) for array hybridization and genotype calling.

Microarray platform and genotype calling

MiniMUGA is implemented on the Illumina Infinium XT platform (Illumina, Inc., San Diego, CA). Invariable oligonucleotide probes 50 bp in length are conjugated to silica beads that are then addressed to wells on a chip. Sample DNA is hybridized to the oligonucleotide probes and a single-base-pair templated extension reaction is performed with fluorescently labeled nucleotides (Steemers ). The relative signal intensity from alternate fluorophores at the target nucleotide is processed into a discrete genotype call (AA, AB, or BB) using the Illumina GenomeStudio genotyping software (Illumina). Although the two-color Infinium readout is optimized for genotyping biallelic SNPs, both the total and relative signal intensity can also be informative for copy-number variation and construct detection. For each marker in the preliminary array, we optimized the default clustering algorithm with a training data set of 2698 high-quality samples representing a wide variety of genetic backgrounds. Similarly, the production content markers were calibrated using 1295 samples (Table S1).

Probe design

Of the 11,125 markers present in the production version of the array, 10,819 (97.2%) are probes designed for biallelic SNPs and the remaining 306 markers (2.8%) are probes designed to test for the presence of genetic constructs (Table S2). Nucleotides are labeled such that only one silica bead is required to genotype most SNPs, except the cases of A/T and C/G SNPs, which require two beads. To maximize information content, target SNPs were biased toward single-bead SNPs (mostly transitions). There are 10,721 single-bead assays and 404 two-bead assays. All construct probes are single-bead assays. The transition:transversion ratio in SNPs (excluding constructs) is 3:1.
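The bead rule described above can be illustrated with a minimal Python sketch. It assumes only what the text states (A/T and C/G SNPs need two beads, every other biallelic SNP needs one); the function names `beads_required` and `is_transition` are hypothetical and are not part of any Illumina software.

```python
# Allele pairs that, per the text, require two beads (A/T and C/G SNPs).
TWO_BEAD_PAIRS = {frozenset("AT"), frozenset("CG")}
# Transitions are purine<->purine (A/G) or pyrimidine<->pyrimidine (C/T).
TRANSITION_PAIRS = {frozenset("AG"), frozenset("CT")}


def _allele_pair(ref, alt):
    """Return the unordered allele pair, validating the input."""
    pair = frozenset((ref.upper(), alt.upper()))
    if len(pair) != 2 or not pair <= set("ACGT"):
        raise ValueError(f"not a biallelic SNP: {ref}/{alt}")
    return pair


def beads_required(ref, alt):
    """Number of beads needed to genotype a biallelic SNP (hypothetical helper)."""
    return 2 if _allele_pair(ref, alt) in TWO_BEAD_PAIRS else 1


def is_transition(ref, alt):
    """True for A/G and C/T SNPs, False for transversions."""
    return _allele_pair(ref, alt) in TRANSITION_PAIRS
```

Under this rule every transition is a single-bead assay, which is consistent with the stated bias toward transitions when single-bead SNPs are preferred.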

Probe annotation

Probe design and performance of individual assays were used to annotate the array. Table S2 contains the following information: (1) marker name, (2) chromosome, (3) position, (4) strand, (5–6) sequences for one- and two-bead probes, (7–8) reference and alternate alleles at the SNP, (9) tier, (10) reference SNP ID# (rsID), (11) diagnostic information, (12) uniqueness, (13) X chromosome markers used to determine the presence and number of X chromosomes, (14) Y chromosome markers used to determine the presence of a Y chromosome, and (15) markers added in the production version. A complete description of the information provided in Table S2 is available in the table legend.

Chromosomal sex determination

We selected a set of 2348 control samples (1108 males and 1240 females) with known X and Y chromosome number, as determined through standard anatomical sexing and/or known reproductive status. In the case of mice with known and well-defined genetics (inbreds and F1s), this was further confirmed by homozygous or heterozygous status at chromosome X markers. For each sample, we first normalized the intensity values at each X and Y chromosome marker by dividing the intensity (r) by the median intensity at all of the autosomal markers in that sample. These autosome-normalized intensity values are used in all subsequent sex-determination calculations. The next step of chromosomal sex determination was to identify sex-linked markers that provide an estimate of sex chromosome number consistent with the anatomical sex and that have low between-sample noise. We identified 269 X and 72 Y sex-informative markers as those for which the ranges of median normalized intensity, as defined by their standard deviations, do not overlap between male and female controls (Figure S1). The identity of these markers is provided in Table S2. Next, we established chromosomal sex intensity threshold values. For each sample, we plotted the medians of the normalized intensity values at the X-informative markers on the x-axis and the medians of the normalized intensity values at the Y-informative markers on the y-axis (Figure 1). Based on this plot we identified two clusters, one containing the control males and one containing the control females (XY and XX, respectively). These two clusters contain >99% of the samples. Two additional clusters represent XO and XXY aneuploids, and are located at the predicted X and Y areas based on chromosomally normal males and females. Some XO and XXY cases are confirmed through genetic analysis (see the Results). We defined chromosomal sex (XX, XY, XO, and XXY) thresholds as the midpoint between the relevant clusters. 
There is a single Y threshold value (0.3) separating samples with or without a Y chromosome. We identified two independent X threshold values (0.77 and 0.69) depending on whether the sample has a Y chromosome or not (Figure 1). These threshold values were used to classify the chromosomal sex of experimental samples into four groups: XX, XY, XO, or XXY.
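The normalization and threshold steps described above can be sketched as follows. This is an illustrative Python sketch, not the authors' pipeline. The function names are hypothetical, and the text does not state which of the two X thresholds (0.77 and 0.69) applies to samples with a Y chromosome and which to samples without one; the assignment below simply follows the order of the sentence and should be treated as an assumption.

```python
from statistics import median


def normalized_median(marker_intensities, autosomal_intensities):
    """Median of sex-informative marker intensities after dividing each
    by the sample's median autosomal intensity."""
    autosomal_median = median(autosomal_intensities)
    return median(r / autosomal_median for r in marker_intensities)


def classify_chromosomal_sex(x_median, y_median,
                             y_threshold=0.3,
                             x_threshold_with_y=0.77,      # ASSUMPTION: see lead-in
                             x_threshold_without_y=0.69):  # ASSUMPTION: see lead-in
    """Assign XX, XY, XO, or XXY from autosome-normalized median intensities."""
    has_y = y_median > y_threshold
    if has_y:
        # With a Y chromosome, a high X signal suggests two X chromosomes.
        return "XXY" if x_median > x_threshold_with_y else "XY"
    # Without a Y chromosome, a low X signal suggests a single X.
    return "XX" if x_median > x_threshold_without_y else "XO"
```

A caller would compute `x_median` from the 269 X-informative markers and `y_median` from the 72 Y-informative markers of one sample, then pass both to `classify_chromosomal_sex`. Samples near a threshold, or with a high pd_stat, would still need manual review.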

Figure 1

Chromosomal sex determination in 6899 samples. Each circle and cross represent one genotyped sample. The x-axis value is the autosome-normalized median sample intensity at 269 sex-informative X chromosome markers, and the y-axis value is the autosome-normalized median sample intensity at 72 sex-informative Y chromosome markers. The dot color denotes the assigned chromosomal sex: XX, red; XY, blue; XO, green; and XXY, purple. Potential mosaic samples are shown in gray and known errors in yellow. Samples with pd_stat lower than the threshold are shown as circles and samples with high pd_stat are shown as crosses.

Generation of consensus genotypes

The impetus for creating consensus genotypes for inbred strains in MiniMUGA is to provide a set of reference genotype calls for widely used strains. When possible, we included both sexes and multiple biological and technical replicates of a given inbred strain to smooth over any errors in genotyping results, identify problematic markers, and provide a more robust set of reference calls for comparison. For each of 241 inbred and recombinant inbred strains (Table S3), we genotyped between 1 and 19 replicates (average 3.2 per strain). Most inbred strains (179) were genotyped more than once. For 53 strains (mostly recombinant inbred lines from the BXD panel) we did not genotype a male, and thus Y chromosome genotypes are not provided for those strains. Over one-half of the strains (146) were genotyped only in the preliminary version of the array, so genotypes at markers added to the production version of the array are missing in those strains. See Table S1 for details. We generated consensus genotype calls at all 10,819 autosomal, X, pseudoautosomal region (note that this region may vary among strains) (Morgan ), and Y chromosome markers (biallelic SNPs). For each strain, at each marker, we recorded the genotype calls in all of the constituent samples and determined the consistency among these calls. For strains with more than one sample, if all calls were consistent, the consensus genotype is shown in upper case (A, T, C, G, H, or N). We define partially consistent calls as those with a mix of one or more calls of a single nucleotide (A, T, C, or G), and one or more H and/or N calls. Partially consistent calls are shown in lower case, as are calls for strains with a single constituent sample. Inconsistent calls are those for which two distinct nucleotide calls are observed. Unless noted, inconsistent genotypes within a strain are shown as N in the consensus. 
Partially diagnostic SNPs (see below) are always inconsistent (both allele calls are present in a single substrain), and the consensus call is the diagnostic allele shown in lower case. For CC strains, inconsistent consensus genotypes are shown as H, as these strains are known to be not fully inbred (Srivastava ; Shorter ). For mitochondria and Y chromosome markers, consensus calls follow the same rules except H calls are treated as N. Table S4 provides a list of rules for generating all possible consensus calls. Table S5 provides a listing of the consensus genotypes.
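The consensus rules above can be illustrated with a short Python sketch. It covers only the rules spelled out in this section; the full rule set is in Table S4, which is not reproduced here. The function name and the `is_cc` and `haploid` flags are hypothetical. Partially diagnostic SNPs are not handled, and the treatment of a mix containing only H and N calls is an assumption.

```python
NUCLEOTIDES = {"A", "C", "G", "T"}


def consensus_call(calls, is_cc=False, haploid=False):
    """Consensus genotype for one strain at one marker.

    calls   : list of per-sample calls, each one of A, C, G, T, H, N
    is_cc   : True for Collaborative Cross strains (inconsistent -> H)
    haploid : True for mitochondrial and Y markers (H is treated as N)
    """
    if haploid:
        calls = ["N" if c == "H" else c for c in calls]
    if len(calls) == 1:
        return calls[0].lower()          # single sample: lower case
    distinct = set(calls)
    if len(distinct) == 1:
        return calls[0]                  # fully consistent: upper case
    nucleotides = distinct & NUCLEOTIDES
    if len(nucleotides) == 1:
        return next(iter(nucleotides)).lower()   # partially consistent
    if len(nucleotides) >= 2:
        return "H" if is_cc else "N"     # inconsistent
    return "N"  # ASSUMPTION: only H and N observed; not described in the text
```

For example, three replicates called A, H, and N give the lower-case consensus `a`, flagging the call as less certain than an upper-case `A`.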

Informative SNPs between closely related substrains

To increase the specificity of MiniMUGA as a tool for discriminating between closely related inbred strains, we used public data from several other studies providing genotype or WGS information (Yang ; Keane ; Adams ; Morgan ). Most importantly, we included SNPs that are segregating between substrains. These SNPs were identified by WGS of 33 substrains (Table 2). These genome sequences will be made available as part of an upcoming publication (M. T. Ferris, R. S. Baric, M. T. Heise, C. M. Sassetti, and F. Pardo-Manuel de Villena, unpublished results). Finally, we included 339 variants discriminating between substrains of C57BL/10 (provided by A. A. Palmer, Y. Ren, and C. L. St. Pierre). The preliminary version of the array included 5171 probes present in GigaMUGA (Morgan ). These were included to cover the genome uniformly in classical and wild-derived inbred strains, and a few were also informative between substrains.

Table 2

Sequenced inbred mouse strains used to select the content of the genotyping array.

Background	Strain group	Diagnostic type	Full	Partial	Reference
129P2/OlaHsd	129P	Substrain	25	0	Keane et al. (2011); Doran et al. (2016)
129P3/J	129P	Substrain	54	0	M. T. Ferris et al., unpublished results
129S1/SvImJ	129S	Substrain	82	13	Keane et al. (2011); Doran et al. (2016)
129S2/SvHsd	129S	Substrain	7	1	M. T. Ferris et al., unpublished results
129S2/SvPasOrlRj	129S	Substrain	36	0	M. T. Ferris et al., unpublished results
129S4/SvJaeJ	129S	Substrain	45	0	M. T. Ferris et al., unpublished results
129S5/SvEvBrd	129S	Substrain	12	0	Keane et al. (2011); Doran et al. (2016)
129S6/SvEvTac	129S	Substrain	41	0	M. T. Ferris et al., unpublished results
129T2/SvEmsJ	129T	Substrain	38	0	M. T. Ferris et al., unpublished results
129X1/SvJ	129X	Substrain	39	0	M. T. Ferris et al., unpublished results
A/J	A	Substrain	58	7	Keane et al. (2011); Doran et al. (2016)
A/JCr	A	Substrain	53	0	M. T. Ferris et al., unpublished results
A/JOlaHsd	A	Substrain	38	0	M. T. Ferris et al., unpublished results
BALB/cAnNCrl	BALB/c	Substrain	36	2	M. T. Ferris et al., unpublished results
BALB/cAnNHsd	BALB/c	Substrain	109	4	M. T. Ferris et al., unpublished results
BALB/cByJ	BALB/c	Substrain	3	4	M. T. Ferris et al., unpublished results
BALB/cByJRj	BALB/c	Substrain	19	0	M. T. Ferris et al., unpublished results
BALB/cJ	BALB/c	Substrain	103	3	Keane et al. (2011); Doran et al. (2016)
BALB/cJBomTac	BALB/c	Substrain	47	0	M. T. Ferris et al., unpublished results
C3H/HeJ	C3H/He	Substrain	166	2	Keane et al. (2011); Doran et al. (2016)
C3H/HeNCrl	C3H/He	Substrain	39	0	M. T. Ferris et al., unpublished results
C3H/HeNHsd	C3H/He	Substrain	39	1	M. T. Ferris et al., unpublished results
C3H/HeNRj	C3H/He	Substrain	42	0	M. T. Ferris et al., unpublished results
C3H/HeNTac	C3H/He	Substrain	45	14	M. T. Ferris et al., unpublished results
C57BL/6J	C57BL/6	Substrain	136	20	Sarsani et al. (2019)
C57BL/6JBomTac	C57BL/6	Substrain	41	2	M. T. Ferris et al., unpublished results
C57BL/6JOlaHsd	C57BL/6	Substrain	43	0	M. T. Ferris et al., unpublished results
C57BL/6NJ	C57BL/6	Substrain	37	7	Keane et al. (2011); Doran et al. (2016)
C57BL/6NRj	C57BL/6	Substrain	20	0	M. T. Ferris et al., unpublished results
B6N-Tyr<c-Brd>/BrdCrCrl	C57BL/6	Substrain	21	10	M. T. Ferris et al., unpublished results
DBA/1J	DBA/1	Substrain	70	0	Keane et al. (2011); Doran et al. (2016)
DBA/1LacJ	DBA/1	Substrain	77	2	M. T. Ferris et al., unpublished results
DBA/1OlaHsd	DBA/1	Substrain	32	0	M. T. Ferris et al., unpublished results
DBA/2J	DBA/2	Substrain	112	0	Keane et al. (2011); Doran et al. (2016)
DBA/2JOlaHsd	DBA/2	Substrain	39	0	M. T. Ferris et al., unpublished results
DBA/2JRj	DBA/2	Substrain	30	0	M. T. Ferris et al., unpublished results
DBA/2NCrl	DBA/2	Substrain	85	14	M. T. Ferris et al., unpublished results
DBA/2NTac	DBA/2	Substrain	36	10	M. T. Ferris et al., unpublished results
FVB/NCrl	FVB	Substrain	47	0	M. T. Ferris et al., unpublished results
FVB/NHsd	FVB	Substrain	39	1	M. T. Ferris et al., unpublished results
FVB/NJ	FVB	Substrain	72	7	Keane et al. (2011); Doran et al. (2016)
FVB/NRj	FVB	Substrain	47	0	M. T. Ferris et al., unpublished results
FVB/NTac	FVB	Substrain	37	0	M. T. Ferris et al., unpublished results
NOD/MrkTac	NOD	Substrain	33	0	M. T. Ferris et al., unpublished results
NOD/ShiLtJ	NOD	Substrain	51	3	Keane et al. (2011); Doran et al. (2016)
Subtotal			2281	127
129S	129S	Strain group	17	0
A	A	Strain group	57	0
BALB/c	BALB/c	Strain group	125	0
C3H/He	C3H/He	Strain group	45	0
C57BL/10	C57BL/10	Strain group	291	0	Mortazavi et al. 2020
C57BL/6	C57BL/6	Strain group	19	0
DBA/1	DBA/1	Strain group	5	0
DBA/2	DBA/2	Strain group	62	0
FVB/N	FVB/N	Strain group	2	0
NZO	NZO	Strain group	12	0	Keane et al. (2011); Doran et al. (2016)
Subtotal			635	0
Total			2916	127

The table provides the strain name and group, the number and type of both fully and partially diagnostic SNPs, and the source of the whole-genome sequencing data.

Probes for genetically engineered constructs

We designed and included 306 probes targeting commonly used genetic constructs (257 in the preliminary phase and 49 in the production phase; Table S6). We identified conserved 51-mers that fulfill the following three conditions: (1) they are present in construct sequences available from either Addgene or GenBank, (2) the 5′ 50-mers did not have matches in the mouse genome, and (3) the last base pair of the 51-mer is an A in the forward orientation or a T in the reverse orientation. The alternative allele is either a C in the forward orientation or a G in the reverse orientation. This alternative allele is a requirement for Illumina array design and does not represent a true SNP. Genotype calls at construct probes are not relevant. As these probes are only useful in the context of intensity, and not in genotype calls, we developed a different pipeline to classify these probes into tiers (Morgan ; applied in the Results to genomic SNPs). Because of the probe design, we only assessed probes for their ability to have consistent low raw normalized intensity in the x-axis in samples with nonmanipulated genomes (585 samples from inbred strains and 250 F1 hybrids, referred to as negative construct controls in Table S1) and at least some samples with high raw normalized intensity in our experimental data set. We eliminated 79 probes from the analysis because they failed this step (probes with purple labels in Figure S2A). We further eliminated 50 probes from the analysis because the range of variation in our experimental samples was not sufficiently distinguishable from the negative controls, or we observed a unimodal intensity distribution in the experimental samples (probes with blue labels in Figure S2A). Our experience with GigaMUGA suggested that a single construct probe was insufficient to robustly classify samples with the presence or absence of a construct (Morgan ). 
Therefore, we eliminated 14 probes from the analysis because of low correlation of intensities across our experimental sample set (probes with red labels in Figure S2, A and B). Figure S2B shows the clusters based on the correlation of probe intensities. Finally, we confirmed that clustered probes were targeting the same or related constructs based on the Basic Local Alignment Search Tool (Boratyn ). These alignments are provided in Figure S3. In total, these 163 probes mapped to 17 biologically distinct constructs (see Table 3). For each of these constructs, we identified conservative threshold values for the presence and absence based on the sum intensity of the probes assigned to that construct. We used the distribution of values to identify breaks and set the thresholds such that we minimized the number of samples misclassified as positive or negative. Positive controls (when available) were used to validate our classification schema.
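The per-construct classification described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name, the "ambiguous" label for values between the two thresholds, and the numeric thresholds in the usage note are all hypothetical. The actual per-construct thresholds were derived from the distribution of summed intensities and are not given in this section.

```python
def construct_status(probe_intensities, absent_below, present_above):
    """Classify one construct in one sample from the summed probe intensity.

    probe_intensities : normalized intensities of the probes assigned
                        to the construct
    absent_below      : conservative upper bound for "absent"
    present_above     : conservative lower bound for "present"
    """
    if absent_below > present_above:
        raise ValueError("absent_below must not exceed present_above")
    total = sum(probe_intensities)
    if total >= present_above:
        return "present"
    if total <= absent_below:
        return "absent"
    # ASSUMPTION: values between the two thresholds are left unclassified.
    return "ambiguous"
```

With hypothetical thresholds of 0.5 and 2.0, `construct_status([1.0, 1.2, 0.9], 0.5, 2.0)` returns `"present"`. Summing across several correlated probes reflects the observation that a single construct probe was not sufficient for robust classification.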

Table 3

Validated constructs

Name	Abbreviation	Number of probes	Number of distinct probes
“Greenish” Fluorescent Protein (EGFP, EYFP, and ECFP)	g_FP	19	19
SV40 large T antigen	SV40	18	18
Cre recombinase	Cre	16	12
Tetracycline repressor protein	tTA	14	14
Diphtheria toxin	DTA	11	11
Human CMV enhancer version b	hCMV_b	10	7
Luciferase and firefly luciferase	Luc	10	10
Chloramphenicol acetyltransferase	chloR	9	9
Bovine growth hormone poly A signal sequence	bpA	8	4
iCre recombinase	iCre	8	8
Reverse improved tetracycline-controlled transactivator	rtTA	8	4
CRISPR associated protein 9	cas9	7	7
Blasticidin resistance	BlastR	6	4
Internal Ribosome Entry Site	IRES	6	6
hCMV enhancer version a	hCMV_a	5	4
“Reddish” fluorescent protein (tdTomato, mCherry)	r_FP	6	6
Herpesvirus TK promoter	hTK_pr	2	2
Total		163	145

The table lists the name, abbreviation shown in the report and the number of total and distinct probes for 17 constructs validated in the data set reported here. EGFP, enhanced green fluorescent protein; EYFP, enhanced yellow fluorescent protein; ECFP, enhanced cyan fluorescent protein; CMV, cytomegalovirus; hCMV, human cytomegalovirus; SV40, simian virus 40; TK, thymidine kinase.

Additional sample quality metrics

Most quality metrics for genotyping arrays are based on genotype calls. However, intensity-based analyses, such as chromosomal sex determination, assume quasi-normal distribution of marker intensities in a given sample (Figure S4). In our data set, some samples had significantly skewed and idiosyncratic intensity distributions. Among these samples, there were several erroneously identified as sex chromosome aneuploids. To identify samples with abnormal intensity distributions, we used 200 random samples with no chromosomal abnormalities and confirmed that, in aggregate, they have quasi-normal intensity distribution (reference distribution) at the autosomal markers of the preliminary array content. We then computed a power divergence statistic (pd_stat; equivalent to Pearson’s chi square goodness of fit statistic) for each sample, comparing its autosomal intensity distribution to that of the reference distribution. Figure S5 shows the distribution of pd_stat values in our entire data set. We selected a pd_stat value of 3230 as the threshold, and in samples with higher values the reported chromosomal sex could be incorrect. The pd_stat should be carefully scrutinized for samples with reported sex chromosome aneuploidy. The threshold also ensures that in samples from species other than Mus musculus, chromosomal sex determination is treated with skepticism. To determine whether a high pd_stat had an effect on the accuracy of genotyping calls we selected four pairs of different F1 mice [(A/JxCAST/EiJ)F1_M15765; (CAST/EiJxA/J)F1_F002; (CAST/EiJxNZO/HlLtJ)F1_F0019; (CAST/EiJxNZO/HlLtJ)F1_F022; (NZO/HlLtJxNOD/ShiLtJ)F1_F0042; (NZO/HlLtJxNOD/ShiLtJ)F1_F0042; (PWK/PhJxNZO/HlLtJ)F1_F0019 and (PWK/PhJxNZO/HlLtJ)F1_M0001] that cover a variety of pd_stat contrasts (high/low, medium/medium, and low/low). 
For each pair, we first determined the pairwise consistency of the genotype calls and then compared these genotypes to predicted calls for the consensus reference parental inbred strains. Pairwise comparison consistencies in the autosomes (excluding N calls) vary between 99.5 and 100%. Similarly, the consistency with predicted genotypes is very high (99.5–100%). We conclude that the pd_stat is independent of genotype call quality.
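
The goodness-of-fit comparison described above can be sketched as follows. This is a minimal illustration and not the pipeline used for the array: the binning scheme (`bin_edges`), the function name, and the handling of empty reference bins are assumptions, and only the threshold of 3230 comes from the text.

```python
from bisect import bisect_right

PD_STAT_THRESHOLD = 3230  # threshold reported in the text


def pd_stat(sample_intensities, reference_intensities, bin_edges):
    """Pearson chi-square statistic of a sample histogram against a reference.

    bin_edges is a hypothetical, sorted list of cut points; the text does not
    specify how intensities were binned.
    """
    n_bins = len(bin_edges) + 1

    def histogram(values):
        counts = [0] * n_bins
        for value in values:
            counts[bisect_right(bin_edges, value)] += 1
        return counts

    observed = histogram(sample_intensities)
    reference = histogram(reference_intensities)
    n_obs, n_ref = sum(observed), sum(reference)

    stat = 0.0
    for obs, ref in zip(observed, reference):
        # Expected count: reference proportion scaled to the sample size.
        expected = n_obs * ref / n_ref
        if expected > 0:  # assumption: bins empty in the reference are skipped
            stat += (obs - expected) ** 2 / expected
    return stat


def sex_call_needs_scrutiny(stat):
    """Flag samples whose reported chromosomal sex could be incorrect."""
    return stat > PD_STAT_THRESHOLD
```

In this sketch a sample whose autosomal intensity histogram matches the reference yields a statistic of zero, and the statistic grows as the two distributions diverge.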

Data availability

Genotype calls and hybridization intensity data (both raw and processed) for 6899 samples, and consensus genotypes for 241 inbred strains are available for download at the University of North Carolina at Chapel Hill (UNC) Dataverse. These data are posted at https://dataverse.unc.edu/dataverse/MiniMUGA. Supplemental material available at figshare: https://doi.org/10.25386/genetics.11971941.

Results

Sample set, reproducibility, and array annotation

The 599 technical replicates (Table S1, see Materials and Methods) were used to calculate the reproducibility of the genotype calls. Overall, 99.6 ± 0.4% of SNP genotype calls were consistent between technical replicates (range 95.9–100%). The consistency rate was similar for replicates run in the same and different versions of the array. Samples with lower consistency rates included samples from more distant species and subspecies (SPRET/EiJ, SFM, SMZ, MSM/MsJ and JF1/Ms), lower-quality samples, and cell lines. Inconsistency was typically driven by a small minority of markers and by “no calls” in one or a few of the technical replicates. We annotated the array based on probe design and the performance of individual assays (Table S2). Probes were classified into four tiers based on the presence of reference, alternate, and H calls in our sample set. Probes in tiers 1 and 2 (all three genotype calls present and two genotype calls present, respectively) were used in most of the genotype-based analyses. For tier 3 probes, only one genotype call was present, while for tier 4 all samples had N calls. For construct markers, this tier classification was not relevant.
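
The tier definitions above can be expressed as a short function. This is a sketch under stated assumptions: the call encoding (`"ref"`, `"alt"`, `"H"`, `"N"`) and the function name are hypothetical and do not reflect the format of the annotation in Table S2.

```python
VALID_CALLS = {"ref", "alt", "H", "N"}  # hypothetical encoding of genotype calls


def classify_tier(calls):
    """Assign a probe to tier 1-4 from its calls across all samples.

    Tier 1: reference, alternate, and H calls all present.
    Tier 2: two of the three genotype calls present.
    Tier 3: only one genotype call present.
    Tier 4: every sample has an N call.
    """
    calls = list(calls)
    unknown = set(calls) - VALID_CALLS
    if unknown:
        raise ValueError(f"unexpected call labels: {sorted(unknown)}")
    distinct = {call for call in calls if call != "N"}
    return 4 - len(distinct)


# Toy example: a probe with reference and H calls only falls in tier 2.
example_tier = classify_tier(["ref", "H", "N", "ref"])
```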

Improved chromosomal sex determination reveals sex chromosome aneuploidy due to strain-dependent paternal nondisjunction

Typically, genetic determination of sex of a mouse sample has relied on detecting the presence of a Y chromosome. This approach does not estimate X chromosome dosage and thus lacks the ability to identify samples with common types of sex chromosome aneuploidies. MiniMUGA uses probe intensity to discriminate between normal chromosomal sexes (XX and XY) and two types of sex chromosome aneuploidies, XO and XXY (Table S1). Our methodology (Materials and Methods) provides a robust framework to discriminate between at least four types of chromosomal sex (Figure 1). Our set of 6899 samples was composed of 3507 unique females (no Y chromosome present) and 3018 unique males (Y chromosome present). We initially identified 54 samples as potential XO and XXY. However, in eight XO females the pattern of heterozygosity and recombination in the X chromosome (Table S7) demonstrated that these were, in fact, normal XX females with abnormal intensity distributions and pd_stat values above the threshold (see Materials and Methods). Once these eight samples were removed, 46 samples that had sex chromosome aneuploidies remained. To determine the rate of aneuploidy we only considered unique samples (not replicates). This resulted in 45 aneuploid samples among 6525 total unique samples, an overall 0.7% rate. This rate was driven by a highly significant excess (7X, χ2 = 62.9384; P < 0.00001) of sex chromosome aneuploids among the cell lines. Notably, all these aneuploids were XO. Among live mice there were 36 unique aneuploids (a rate of 0.55%). This rate was similar to but higher than that previously reported in both mice and in humans (Searle and Jones 2002; Chesler ; Le Gall ). In this data set, unique XO females were observed at significantly higher frequency than unique XXY males (P = 0.02; 25 XO females and 11 XXY males) (Table 1). For 22 of the 45 unique samples with sex chromosome aneuploidies, the parents were known and had informative markers in the X chromosome. 
This information allowed us to potentially determine the parental origin of the missing (in XO) or the extra (in XXY) X chromosome based on the parental haplotype inherited and the presence of recombination in the X chromosome (Figure 2 and Table S7). Overall, the parental origin can be determined unambiguously in 21 of these samples, and in all but one sample (95%) the aneuploidy is due to sex chromosome nondisjunction in the paternal germ line (Figure 2). Note that this applies to both XO and XXY samples. Given the paternal origin of most sex chromosome aneuploidies, we investigated whether the type of sire had an effect on this phenomenon. We observed a significantly (P < 0.00001) higher rate of aneuploids in the progeny of (CC029/Unc × CC030/GeniUnc) F1 hybrid males compared with all other types of sires. Out of 180 progeny of this cross, 5% of genotyped samples were aneuploids and both XO and XXY were observed (three XO and six XXY mice, respectively). There was also some evidence of a higher rate of sex chromosome aneuploids in progeny of sires with CC011/Unc background (five XO females, Table S7). We conclude that sex chromosome aneuploidy is relatively common in laboratory mice, originates predominantly in the paternal germ line, and depends on the sire genotype. In some backgrounds, aneuploidy rate is an order of magnitude higher than in the general population.
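
The counts reported above can be checked with a few lines of arithmetic. The text does not state which test produced P = 0.02 for the XO versus XXY comparison; the sketch below assumes a chi-square goodness-of-fit test against an equal split (1 degree of freedom), which gives a value close to the one reported. The function name is illustrative.

```python
import math


def chi_square_equal_split(count_a, count_b):
    """Chi-square goodness-of-fit test of two counts against a 1:1 expectation."""
    expected = (count_a + count_b) / 2
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# 25 unique XO females versus 11 unique XXY males among live mice.
chi2, p_value = chi_square_equal_split(25, 11)

# Overall aneuploidy rate: 45 aneuploid samples among 6525 unique samples.
overall_rate_percent = 100 * 45 / 6525
```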

Figure 2

Sex chromosome aneuploidy is due to paternal nondisjunction. The figure shows the parental sex chromosome and mitochondrial complement of the dam and sire for two types of crosses. Only the sex chromosomes and the mitochondria are shown. The X chromosomes are shown as long acrocentric, the Y chromosomes as shorter submetacentric, and the mitochondria as circles. The figure also shows the inferred parental origin of the sex chromosome aneuploidy and the actual number of cases observed in our data set. The sex chromosome configurations of the standard types of sex chromosome aneuploidy in the progeny of each type of cross are shown with the inferred parental origin of the X chromosomes.

Detection of Y chromosome mosaicism

There were eight samples (two classified as XX, three as XXY, and three as XO) with abnormal chromosome Y intensities (either too low or too high) and with low number of chromosome Y genotype calls [between 6 and 56 calls compared with 62.99 (mean) ± 1.3 (SD) across all males]. These eight samples are standard laboratory mice and are shown in gray in Figure 1. The intensity and genotypes strongly suggest the presence of a Y chromosome that can be explained either by mosaicism or an incomplete Y chromosome. We performed several additional analyses to discriminate between these two explanations. As a test case, we selected the tail-derived sample TL9348 (Tables S1 and S8) because it was expected to be an F1 hybrid male derived from a C57BL/6J and 129X1/SvJ outcross (Figure 3A), and chromosomal sex was not questionable based on its pd_stat (137). Based on chromosome intensity, this sample was classified as an XXY male with low chromosome Y intensity (Figure 3B). Inspection of the genotype calls on chromosome X revealed a significant excess of N calls compared with the autosomes (P < 0.00001, Table S8). Furthermore, the H calls on the X chromosome in this sample only occurred at SNPs where C57BL/6J and 129X1/SvJ have different alleles, but these H calls occur only at a fraction of expected sites. These H calls are present over the entire X chromosome. The fact that sample TL9348 has approximately one half of the Y chromosome intensity of XY or XXY males, and that there is evidence of heterozygosity on the X chromosome, suggests that the mosaicism is due to the loss of both the Y chromosome and one of the two X chromosomes in a fraction of cells. To test this hypothesis, we plotted the intensity of informative X chromosome markers for three types of controls—C57BL/6J, 129X1/SvJ, and F1 hybrid females—derived from those two inbred strains, as well as for the suspected mosaic sample TL9348 (Figure 3C). 
For sample TL9348, the vast majority of the informative markers were clustered between the C57BL/6J and the F1 hybrid genotypes (Figure 3C). This pattern explains the observed mix of N calls, heterozygous calls, and C57BL/6J calls in sample TL9348 and confirms its mosaic nature. It further demonstrates that the X chromosome lost is the one from 129X1/SvJ. Based on the positions of the intensities for the Y and X chromosome markers in Figure 3, A and B, respectively, we concluded that slightly fewer cells were XXY than XO (Figure 3, C and D). Considered together, these results indicate that the embryo started as an XXY due to paternal nondisjunction of the sex chromosomes and that mosaicism occurred in early development, a common observation in embryo mosaicism in humans (Johnson ; Fragouli ; McCoy 2017).

Figure 3

Complex sex chromosome aneuploidy and mosaicism in an F1 male. (A) The panel shows the chromosomal sex and mitochondria complement of the parents and F1 individual. Blue denotes C57BL/6J and red denotes 129X1/SvJ. (B) This panel is a reprint of Figure 1 and was used to classify the F1 male, shown as a yellow circle, as an XXY based on the x- and y-axis intensities (two X chromosomes and a Y chromosome present). This panel also provides evidence of mosaicism for the presence and absence of the Y chromosome (based on the low Y chromosome intensity). (C) This panel provides evidence of mosaicism for the X chromosome and identifies the paternal origin (129X1/SvJ) of the chromosome lost in some cells. The plot presents the intensities of the two alternate alleles for 173 X chromosome markers that are informative between the two parents. Four individuals are shown: a C57BL/6J female in blue, a 129X1/SvJ male in red, a (C57BL/6Jx129X1/SvJ)F1 female in gray, and the F1 male case in yellow. The shapes denote the type of call made by the Illumina software: circles are homozygous A, T, C, or G calls; triangles are H calls; and squares are N calls. (D) This panel shows the proposed sex chromosome complement of the two types of cells present in this F1 male case. This solution explains the observations from the previous three panels.

Among the remaining seven potential mosaics, one was a cell line and thus mosaicism of the sex chromosomes was not unexpected. For the other six samples we performed an analysis similar to the one described above. In all cases, the Y chromosome calls were consistent with those expected from their sires. This is consistent with Y chromosome mosaicism and not with sample contamination. However, only the two samples with 50 or more genotype calls on the Y chromosome have strong support for such a conclusion. In the Discussion we expand this analysis and provide some guidance for users of the array.

Strain-specific chromosome Y duplications

Among XY males there was a distinct cluster of 64 male samples with higher normalized median Y chromosome intensity (Figure 1). These samples include five inbred C3H/HeJ, two F1 hybrid males with a C3H/HeJ chromosome Y (Figure 4A), and 52 males derived from a C3H/HeJ by C3H/HeNTac F2 intercross. The plot of the normalized Y chromosome intensities in these males and 81 additional males with Y chromosomes derived from other C3H/He substrains (Figure 4A) revealed a clear separation between males carrying a Y chromosome from C3H/HeJ and males carrying C3H/HeNCrl, C3H/HeNHsd, C3H/HeNRj, C3H/HeNTac, and C3H/HeOuJ Y chromosomes. Males with the high-intensity Y chromosome also include two transgenic strains from The Jackson Laboratory: B6C3-Tg(APPswe, PSEN1dE9)85Dbo/Mmjax and B6;C3-Tg(Prnp-SNCA*A53T)83Vle/J. Both strains were developed and/or maintained on a B6C3H background (JAX Stocks 34829-JAX and 004479, respectively).

Figure 4

Segmental chromosome Y duplications in laboratory strains. (A) Normalized median Y chromosome intensity in selected samples with C3H/He, DBA/1, and C57BL/6 Y chromosomes. Within the C3H/He group, samples with a C3H/HeJ Y chromosome are shown in orange while samples with any other C3H/He Y chromosome are shown in blue. For DBA/1, there are multiple technical replicates of a single sample with abnormally high intensity shown in orange. For C57BL/6, there is only one sample with abnormally high intensity. The shape of the point reflects the type of mouse. (B) Range of normalized intensity distributions located at 63 SNPs on the short arm and the beginning of the long arm of Y chromosome in the C3H/He samples shown in (A). The range of intensities (mean ± SD) in samples with a C3H/HeJ Y chromosome are shown in orange while samples with any other types of C3H/He Y chromosomes are shown in blue. At the top of the panel, the potential duplication is shown in red, transition regions with uncertain copy number are shown in pink, and normal copy numbers are shown in black. The bottom of the panel shows the location of the MiniMUGA markers and genes. MUGA, Mouse Universal Genotyping Array.

To determine the origin of the higher median intensity in males with a C3H/HeJ Y chromosome, we plotted the normalized intensities at 59 MiniMUGA markers located on the short arm of chromosome Y and the four most proximal markers on the long arm of that chromosome (Figure 4B). Inspection of this figure indicates that 54 consecutive markers have distinctly higher intensities in C3H/HeJ males and are flanked by markers with intensities that are indistinguishable from those of males with other C3H/He Y chromosomes. These markers define a 2.9-Mb region located on the short arm of the Y chromosome containing seven known genes (Eif2s3y, Uty, Ddx3y, Usp9y, Zfy2, Sry, and Rbmy) and 12 gene models (Figure 4B). 
We conclude that C3H/He substrain differences are due to an intrachromosomal duplication that arose and was fixed in the C3H/HeJ lineage after the isolation of that substrain in 1952 (Akeson ). There are five additional non-C3H/He samples with high normalized median chromosome Y intensity: four technical replicates from a single DBA/1OlaHsd male and a single Axl congenic mouse on a C57BL/6 background (Figure 4A). Each case represents an independent (different haplotype and different boundaries, Figure S6) and very recent duplication of the Y chromosome. These duplications were segregating within a closed colony. Given that we identified three independent large segmental duplications in the short arm of the Y chromosome among 3018 unique males (Table 1), we obtain a crude mutation rate estimate of 1/1000. This mutation rate is slightly lower than that of some segmental duplications in the mouse (Egan ), higher than the mutation rate in microsatellites (Dallas 1992), and consistent with high levels of structural variation in the short arm of the Y chromosome in wild mice (Morgan and Pardo-Manuel de Villena 2017).

An effective tool for genetic QC in laboratory inbred strains

To determine the performance of MiniMUGA among inbred strains we genotyped 779 samples representing 241 inbred strains including 86 classical inbred strains, 34 wild-derived inbred strains, 49 BXD recombinant inbred lines, and 72 CC strains (Table S3). We created consensus genotypes for each inbred strain using both biological and technical replicates (see the Materials and Methods). The use of replicates strengthens genetic analyses as they provide a simple but robust method to determine the performance of each SNP in each strain (see the Discussion), as well as determining the dates when diagnostic alleles arise and potentially became fixed (see Diagnostic SNPs as a tool for genetic QC and strain dating). We note that for the CC strains, which are incompletely inbred (Srivastava ; Shorter ), our consensus calls should be treated with caution and viewed as preliminary. This is particularly true given that they were based on a small number of individuals sampled from the UNC Systems Genetics Core Facility colony (Morgan , http://csbio.unc.edu/CCstatus/index.py) between 2016 and 2017 (Shorter ). Future sampling of a wider range of individuals from CC strains throughout the history of the CC colony will result in more accurate consensus genotypes for these strains. Using the consensus genotypes we determined the number of informative markers for pairwise combinations of all inbred strains (excluding BXD and CC). Figure 5 summarizes the results for 83 classical inbred strains. Over 90% of comparisons have ≥1280 informative autosomal markers and all but 0.52% of pairwise comparisons have >40 informative autosomal markers (2.1 markers per autosome). These statistics are exceptional given the small number of markers in the array, and considering the number of diagnostic markers included and a substantial number of construct markers. Although our focus is on classical inbred strains, we extended the analysis to include 37 wild-derived strains. 
For all 2924 combinations of classical and wild-derived strains, the informativeness is high (mean = 3224, minimum = 1649, and maximum = 3827, data not shown). In marked contrast, combinations between wild-derived strains have a much wider range of informative SNPs (from 93 to 3410, data not shown) due to a fraction of combinations with few to a moderate number of informative SNPs. The pairs of strains with the lowest number of informative SNPs include pairs of strains from taxa other than M. musculus (for example SPRET/EiJ, SMZ, and XBS) and pairs of strains that are known to have close phylogenetic relationships (TIRANO/EiJ and ZALENDE/EiJ; and PWD and PWK/PhJ) (Yang ). We conclude that MiniMUGA improves cost-effective genotyping for dozens of standard laboratory strains and experimental crosses derived from them.
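
The pairwise informativeness described above can be sketched as a count of markers at which two consensus genotypes carry different homozygous base calls. This is an illustration only: the data layout (one string of calls per strain, in the same marker order), the strain labels, and the function names are assumptions, and the restriction to tier 1 and 2 markers is taken to have been applied beforehand.

```python
from itertools import combinations

HOMOZYGOUS_CALLS = {"A", "C", "G", "T"}  # H and N calls are not counted


def informative_count(genotypes_a, genotypes_b):
    """Number of markers where both strains are homozygous and differ."""
    return sum(
        1
        for call_a, call_b in zip(genotypes_a, genotypes_b)
        if call_a in HOMOZYGOUS_CALLS
        and call_b in HOMOZYGOUS_CALLS
        and call_a != call_b
    )


def pairwise_informative(consensus):
    """Map each unordered strain pair to its number of informative markers."""
    return {
        (strain_1, strain_2): informative_count(
            consensus[strain_1], consensus[strain_2]
        )
        for strain_1, strain_2 in combinations(sorted(consensus), 2)
    }


# Toy consensus genotypes for three hypothetical strains at five markers.
toy_consensus = {
    "strain_1": "ACGTN",
    "strain_2": "ACCTH",
    "strain_3": "TCGAA",
}
toy_counts = pairwise_informative(toy_consensus)
```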

Figure 5

Number of informative SNP calls in pairwise comparisons among classical inbred strains. Strains are ordered by similarity and colors represent the range of number of informative SNPs based on the consensus genotypes. Only homozygous base calls, at tier 1 and 2 markers, on the autosomes, X, and pseudoautosomal region are included.

Mitochondria

MiniMUGA has 88 markers that track the mitochondrial genome, 82 of which segregate in our set of 241 inbred strains. Based on these 82 markers, the inbred strains can be classified into 22 different haplogroups, 19 of which discriminate between M. musculus strains (Figure 6A). Fifteen haplogroups represent M. m. domesticus (groups 1 to 15 in Figure 6A), two haplogroups represent M. m. musculus (16 and 17), and two represent M. m. castaneus (18 and 19). Three haplogroups represent different species such as M. spretus and M. macedonicus.

Figure 6

Haplotype diversities. Haplotype diversities of the mitochondria (A) and chromosome Y (B). The trees are built based on the variation present in MiniMUGA and may not represent the real phylogenetic relationships. Colors denote the subspecies-specific origin of the haplotype in question: shades of blue represent M. m. domesticus haplotypes; shades of red represent M. m. musculus haplotypes; and shades of green represent M. m. castaneus haplotypes. The arrow in panel (A) identifies a M. spretus strain with a M. m. domesticus mitochondria haplotype. MUGA, Mouse Universal Genotyping Array.

In M. musculus, nine haplogroups are present in multiple inbred strains while 10 are found in a single inbred strain. The most common haplogroup is present in 158 inbred strains (including 49 BXD and 26 CC strains). This haplogroup is found in many classical inbred strains including C57BL/6J, BALB/cJ, A/J, C3H/HeJ, DBA/1J, DBA/2J, and FVB/NJ. Unique haplogroups represent an interesting mix of wild-derived strains (LEWES/EiJ, CALB/Rk, WMP/Pas, SF/CamEiJ, TIRANO/EiJ, ZALENDE/EiJ, and CIM) and DBA/2 substrains (DBA/2JOlaHsd and DBA/2NCrl). CC strains fall into six common haplogroups, one shared by three CC founders (A/J, C57BL/6J, and NOD/ShiLtJ) and five haplogroups present in a single CC founder: PWK/PhJ, 129S1/SvImJ, CAST/EiJ, NZO/HlLtJ, and WSB/EiJ. Interestingly, SMZ, a wild-derived inbred strain of M. spretus origin, has a mitochondrial haplogroup that unambiguously clusters with M. m. domesticus (Figure 6A), demonstrating a case of interspecific introgression (Didion and Pardo-Manuel de Villena 2013).

Chromosome Y

MiniMUGA has 75 markers that track the Y chromosome, 57 of which segregate in our set of 189 inbred strains with at least one male genotyped. Based on these 57 markers, the inbred strains can be classified into 18 different haplogroups, 16 of which are M. musculus (Figure 6B). Only four haplogroups represent M. m. domesticus, two haplogroups represent M. m. castaneus, and 11 represent M. m. musculus. M. spretus and M. macedonicus are represented by a single haplogroup each. In M. musculus, all but one haplogroup (CIM) are present in multiple inbred strains. No single haplogroup dominates in our collection of inbred strains (the most common is present in 38 inbred strains). Interestingly, C57BL/6 substrains fall into three distinct haplogroups. The ancestral haplogroup is found in C57BL/6ByJ, C57BL/6NCrl, C57BL/6NHsd, C57BL/6NJ, C57BL/6NRj, and B6N-Tyr<c-Brd>/BrdCrCrl. This haplogroup is present in other classical inbred strains such as BALB/c, C57BL/10, C57BLKS/J, C57L/J, and C58/J. The second haplogroup is present in C57BL/6JBomTac, C57BL/6JEiJ, and C57BL/6JOlaHsd. Finally, C57BL/6J has its own private derived haplogroup shared with 10 CC strains. Each one of the eight founder strains of the CC (A/J, C57BL/6J, 129S1/SvImJ, NOD/ShiLtJ, NZO/HlLtJ, CAST/EiJ, PWK/PhJ, and WSB/EiJ) has its own distinct haplogroup.

Diagnostic SNPs as a tool for genetic QC and sample dating

We define SNPs as diagnostic when the minor allele is present only in a single substrain or in a set of closely related substrains. The identification of these SNPs for inclusion in the array is based on WGS of 12 publicly available strains (Keane ; Adams ), 33 substrains sequenced by us, and SNP data for the C57BL/10 strain group (Table 2). Almost 30% of the SNPs (3045) in MiniMUGA are diagnostic. Although diagnostic SNPs have low information content (i.e., most samples in a large set of genetically diverse mice will be homozygous for the major allele) they fulfill two critical objectives. First, they increase the specificity of the MiniMUGA array to identify the genetic background present in a sample. Second, they are essential to extend the power of genetic mapping in reduced complexity crosses (RCC) beyond the C57BL/6J-C57BL/6NJ paradigm (Kumar ; Treger ). The 3045 diagnostic SNPs can be divided into two classes based on whether they are diagnostic for a specific substrain (e.g., BALB/cJBomTac or C3H/HeJ) or for two or more substrains within a strain group (e.g., BALB/c or C3H/He). There are 2408 SNPs that are diagnostic for one of 45 substrains and 637 SNPs diagnostic for one of 10 strain groups (Table 2). A second classification divides diagnostic SNPs into fully diagnostic (2910) and partially diagnostic SNPs (129). The difference between these two classes is based on whether the diagnostic allele was fixed or was still segregating in the samples used to determine the consensus genotypes of 45 classical inbred strains. All diagnostic SNPs originated as partially diagnostic SNPs and they highlight the often-overlooked fact that mutations arise in all stocks and some become fixed despite the best efforts to reduce their frequency and impact (Sarsani ). Note that the classification of a SNP as partially diagnostic depends on the samples used for the consensus calls. 
It is theoretically possible to date when diagnostic SNPs arose and whether and when they became fixed in the main stock of a substrain. This requires genotyping of cohorts of mice separated from the main stock at known dates. The confidence of the inferences will depend on the size of those cohorts. Those cohorts can be historical samples or extant inbred strains derived at known dates from one or more substrains, such as panels of recombinant inbred lines (RILs), congenics, and consomics. We have two such panels in our sample set: the BXD and the CC RILs. In the former, we determined whether diagnostic alleles for C57BL/6J and DBA/2J were present in 49 BXD RILs. These RILs were generated in three different epochs: 22 of the genotyped BXD lines belong to epoch I (Taylor ), four belong to epoch II (Taylor ), and 23 belong to epoch III (Peirce ). In the CC population we determined whether diagnostic SNPs for C57BL/6J, A/J, 129S1/SvImJ, and NOD/ShiLtJ were present in 72 CC RILs. CC strains were generated in two waves and at three independent sites from inbred mice originally obtained from The Jackson Laboratory in 2004 and 2007 (Collaborative Cross Consortium 2012). For each SNP, we determined in which relevant cohort the diagnostic allele was first observed, and if and when it became fixed. This analysis depended on the number of cohorts relevant for a given substrain and the number of samples per cohort. Note that the analysis for a given substrain may integrate multiple cohorts from different populations as long as the year of origin is known. For the five substrains analyzed here, there is considerable variation in the number of cohorts and samples (Table 4). Note that only diagnostic SNPs included in the preliminary phase were used in this analysis because most CC and BXD samples were only genotyped with that version of the array. 
Also note that the number of independent samples used to establish the consensus is critical to gauge the strength of support for date of fixation of diagnostic alleles in that cohort (Tables S1, S4, and S5). Finally, in the analyses involving the consensus cohorts, we excluded 33 samples because they represent DNA acquired from The Jackson Laboratory >10 years ago.

Table 4

Dating the origin and fixation of diagnostic SNPs in five mouse inbred strains

Substrain	Cohort	Year	Number of samples	Range of alleles sampled	Diagnostic alleles absent	Diagnostic alleles segregating	Diagnostic alleles fixed
C57BL/6J	BXD E1	1971	22	11	156	0	0
	BXD E2	1996	4	0–4	84	72	0
	BXD E3	2001–2009	24 (23)	11.5	50	31	75
	CC	2004–2007	483 (72)	4–18	8	30	118
	Consensus^a	2010–2016	15 (1)	15	0	20	136
DBA/2J	BXD E1	1971	22	11	105	7	0
	BXD E2	1996	4	0–4	37	62	13
	BXD E3	2001–2009	24 (23)	11.5	24	75	13
	Consensus^a	2010–2016	3 (1)	3	0	0	112
A/J	CC	2004–2007	483 (72)	2–22	2	11	47
A/J	Consensus^a	2010–2016	10 (1)	10	0	5	55
129S1/SvImJ	CC	2004–2007	483 (72)	3–42	1	6	81
129S1/SvImJ	Consensus^a	2010–2016	10 (1)	10	0	4	84
NOD/ShiLtJ	CC	2004–2007	483 (72)	4–43	1	2	34
NOD/ShiLtJ	Consensus^a	2010–2016	8 (1)	8	0	1	36

In this analysis, we excluded samples purchased from The Jackson Laboratory (the sample names include the suffix jaxDNA) over a decade ago in the consensus cohorts. Details are provided in the text. CC, Collaborative Cross; BXD, recombinant inbred BXD panel

This table lists the name of the substrain, the cohorts used for dating the diagnostic SNPs, the approximate year(s) when these cohorts were derived from the main stock, the number of samples genotyped, and the range of alleles sampled. When the number of samples does not match the number of strains, the number of strains is shown in parentheses. Diagnostic alleles are classified as absent, segregating, and fixed for each substrain and cohort, and the table provides the total number in each category.

C57BL/6J is the only parental strain shared by the BXD and CC panels; it is also the most popular inbred strain among experimental biologists and the basis for the mouse reference genome. Therefore, we selected C57BL/6J as an example of the procedure and utility of dating diagnostic alleles. The 156 C57BL/6J diagnostic markers were classified based on the earliest observation and apparent fixation in Table 5. Notably, 141 SNPs were distributed in 8 of the 15 possible birth/fixation pairwise configurations (Table 5). The remaining 15 SNPs were segregating in the most recent cohort, with one half of them segregating since 2004. These SNPs probably represent variants present in the original pair used in the genetic integrity program at The Jackson Laboratory (Sarsani ). The dates of origin and fixation for C57BL/6J, A/J, 129S1/SvImJ, NOD/ShiLtJ, and DBA/2J diagnostic SNPs are provided in Table S2.

Table 5

Full dating of diagnostic alleles for the C57BL/6J substrain

		Apparent fixation: BXD E1	Apparent fixation: BXD E2	Apparent fixation: BXD E3	Apparent fixation: CC	Apparent fixation: Consensus^a	Not fixed
Earliest observation	BXD E1	0	0	0	0	0	0
	BXD E2	NA	0	67	1	4	0
	BXD E3	NA	NA	8	18	8	0
	CC	NA	NA	NA	24	4	14
	Consensus^a	NA	NA	NA	NA	2	6

In this analysis, we excluded samples purchased from The Jackson Laboratory (the sample names include the suffix jaxDNA) over a decade ago in the consensus cohorts. Details are provided in the text.

The table classifies 156 diagnostic SNPs into one of 20 categories based on the earliest observation (origin) and apparent date of fixation based on whether the diagnostic allele is observed in BXD and CC strains with the C57BL/6J haplotype at each locus. Temporally impossible cells are shown as NA. BXD, Recombinant inbred BXD panel; CC, collaborative cross.

The birth and fixation of diagnostic alleles can be used to determine the origin and breeding history of a given sample of the appropriate background, and thus estimate the expected level of drift (see Discussion). The presence of segregating variants for C57BL/6J, A/J, 129S1/SvImJ, and NOD/ShiLtJ at the initiation of the CC project will result in CC strains that share identical haplotypes but may be functionally different due to one of those variants, as has been observed for gene deletion in the BXD panel (Anderson ; Mulligan ). To test whether it is possible to use the diagnostic SNPs with known dates of origin and fixation (Table S2) to determine the breeding history of a given sample or stock, we selected the 156 C57BL/6J diagnostic SNPs as a test case. The key step in this analysis is to identify all SNPs in the sample that have the ancestral allele at fixed diagnostic SNPs. These SNPs identify genomic regions in that sample that have not been refreshed since fixation of the SNPs. Conversely, SNPs with the derived (diagnostic) allele identify regions in the sample that have been in contact with the main stock since the date of origin of that derived allele. Figure 7 shows the result of this analysis in three samples with different patterns. Figure 7A shows a knockout mouse from a line created prior to epoch III of the BXD panel and bred independently from the C57BL/6J stock since at least 2004. 
The former conclusion is based on the fact that we detect the ancestral allele at 21 SNPs that were fixed in the C57BL/6J stock prior to epoch III (Table 5). The latter is based on the observation of ancestral alleles at 36 SNPs that were fixed by 2004 (Table 5) and that these markers are distributed across 14 chromosomes. Figure 7B shows a transgenic mouse from a line created prior to the initiation of the CC (2004) and bred independently from the C57BL/6J stock since then. Both conclusions are based on the fact that there are zero ancestral alleles at any of 75 diagnostic SNPs fixed by epoch III (Table 5), the detection of the ancestral allele at 18 SNPs that were fixed prior to the CC (Table 5), and that these markers are distributed across 13 chromosomes. Finally, Figure 7C shows a wild-type C57BL/6J mouse derived from the JAX colony after 2004. This conclusion is based on the lack of ancestral alleles at any of 124 fixed diagnostic SNPs and the presence of a derived allele at three SNPs that arose after the CC (Table 5). Notably, these conclusions are consistent with the expectations of the contributors of these samples.
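
The dating logic described above can be sketched as two lookups over the diagnostic SNPs of a sample. This is a simplified illustration and not the procedure behind Figure 7: the tuple layout, the allele labels, and the function name are assumptions, heterozygous calls are ignored, and the cohort order follows the columns of Table 5.

```python
COHORT_ORDER = ["BXD E1", "BXD E2", "BXD E3", "CC", "Consensus"]
COHORT_RANK = {cohort: rank for rank, cohort in enumerate(COHORT_ORDER)}


def date_sample(snps):
    """Summarize the breeding history implied by diagnostic SNPs in one sample.

    snps: iterable of (earliest_observation, apparent_fixation, sample_allele),
    where apparent_fixation is None for SNPs that are not fixed and
    sample_allele is "ancestral" or "derived" (hypothetical encoding).

    Returns a pair:
    - the earliest fixation cohort at which the sample still carries an
      ancestral allele (some regions have not been refreshed since then);
    - the latest origin cohort at which the sample carries a derived allele
      (the sample was in contact with the main stock after that origin).
    """
    stale = [
        fixation
        for _, fixation, allele in snps
        if allele == "ancestral" and fixation is not None
    ]
    fresh = [origin for origin, _, allele in snps if allele == "derived"]
    not_refreshed_since = min(stale, key=COHORT_RANK.get) if stale else None
    in_contact_after = max(fresh, key=COHORT_RANK.get) if fresh else None
    return not_refreshed_since, in_contact_after


# Toy sample: ancestral alleles at SNPs fixed by BXD E3 and by the CC, and a
# derived allele at a SNP first observed in BXD E2.
toy_history = date_sample(
    [
        ("BXD E2", "BXD E3", "ancestral"),
        ("BXD E3", "CC", "ancestral"),
        ("BXD E2", "BXD E3", "derived"),
    ]
)
```

In this toy case the sample carries regions that predate fixation in epoch III while also showing contact with the main stock after epoch II, a pattern analogous to the one described for Figure 7A.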

Figure 7

Sample dating and breeding history of mice with C57BL/6J background. Red bars denote the ancestral allele for diagnostic SNPs fixed by E3 in the BXD panel. Pink bars denote ancestral alleles for diagnostic SNPs fixed by the start of the CC. Light blue bars denote diagnostic alleles at diagnostic SNPs fixed by E3. Blue bars denote diagnostic alleles at diagnostic SNPs fixed by the start of the CC. Gray bars denote ancestral alleles at post-CC diagnostic SNPs. Black bars denote diagnostic alleles at post-CC diagnostic SNPs. Split bars denote heterozygosity. (A) Inbred Baff male in C57BL/6J background. (B) Inbred transgenic and IFNgR1 female in C57BL/6J background. (C) Inbred C57BL/6J male. The diagnostic allele is always the derived allele, and the nondiagnostic allele is always the ancestral allele. CC, collaborative cross; BXD, Recombinant inbred BXD panel; E, epoch.

Expansion of reduced complexity crosses to a large number of substrains

We define RCC as crosses between substrains derived from a single inbred strain that differ only at mutations that arose after they were isolated and bred independently from a common stock. We tested the ability of MiniMUGA to efficiently cover the genome in 78 different RCC between substrains for which we had consensus genotypes and WGS, and for which live mice were available from commercial vendors (see Table 2). We focused our analysis on this group given that WGS of both substrains is required for rapid identification of causative variant(s) (Kumar ; Treger ). We used the distance to the nearest informative marker to estimate how well MiniMUGA covers the genome in a given RCC. Figure 8 and Table S9 summarize these data, and demonstrate that for 62 RCC (82%) all of the genome is covered by a linked marker and in 14 RCC (18%) between 95 and 99.5% of the genome is covered by a linked marker. Only in two RCC (3%) is there a significant fraction of the genome that is not covered by a linked marker. These two crosses are B6N-Tyr<c-Brd>/BrdCrCrl by C57BL/6JOlaHsd and BALB/cByJ by BALB/cByJRj, with 8 and 14% of the genome not covered, respectively. An alternative test is the number of RCC for which 95% of the genome is covered by informative markers at 20-cM (56 RCC or 72%) and 40-cM (72 RCC or 92%) intervals. We conclude that MiniMUGA provides a cost-effective tool to extend RCC to substrains from the 129P, 129S, A, BALB/c, C57BL/6, C3H, DBA/1, DBA/2, FVB, and NOD strain groups.
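As a rough illustration of a coverage measure based on distance to the nearest informative marker, the sketch below computes the fraction of one chromosome that lies within a chosen linkage distance of at least one informative marker. The function name, the single-chromosome input, and the interpretation of "covered" as "within `max_dist_cm` of a marker" are assumptions for this example; the paper's exact definition is given in its Materials and Methods.

```python
def covered_fraction(marker_cm, chrom_len_cm, max_dist_cm):
    """Fraction of a chromosome within max_dist_cm of an informative marker.

    marker_cm: genetic positions (cM) of informative markers on one chromosome.
    chrom_len_cm: genetic length of the chromosome in cM.
    """
    if chrom_len_cm <= 0:
        raise ValueError("chromosome length must be positive")
    covered = 0.0
    reach = 0.0  # right edge of the region already counted
    for pos in sorted(marker_cm):
        # Clip each marker's window to the chromosome and to what is already counted.
        start = max(pos - max_dist_cm, reach, 0.0)
        end = min(pos + max_dist_cm, chrom_len_cm)
        if end > start:
            covered += end - start
            reach = end
    return covered / chrom_len_cm


# Toy example: two markers near the ends of a 100-cM chromosome.
sparse = covered_fraction([10, 90], chrom_len_cm=100, max_dist_cm=20)
# Toy example: evenly spaced markers leave no gap at the same distance.
dense = covered_fraction([10, 30, 50, 70, 90], chrom_len_cm=100, max_dist_cm=20)
```

Genome-wide coverage for a cross would then be the length-weighted average of this value across chromosomes.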

Figure 8

Percent of the genome covered by MiniMUGA in RCC. Each of the 78 RCC is shown as a circle in ascending order. The order is independent for each of the six analyses. Coverage was based on the linkage distance to the nearest informative marker in a given RCC. MUGA, Mouse Universal Genotyping Array; RCC, reduced-complexity crosses.

Robust detection of common genetic constructs

Given the broad usage of genetic editing technologies, a key design criterion of MiniMUGA was the ability to detect frequently used genetic constructs. Utilizing our pipeline (see Materials and Methods), we positively identified samples containing 17 construct types (Figure 9). Importantly, for eight of these constructs, our sample set also included positive controls. These positive controls showed robust detection of their relevant constructs. We detected further positive samples in our set for these eight constructs, as well as nine additional constructs without positive controls. The latter set of samples belonged to sample classes where constructs were plausible (e.g., not wild-derived or CC samples), and there was high concordance for intensities among the multiple probes of a single construct (Figure S2B intensity correlation).

Figure 9

Detection of genetic constructs validated in MiniMUGA. For each construct, samples are shown as dots and classified as negative controls (left), experimental (center), and positive controls (right). The dot color denotes whether the sample is determined to be negative (blue), positive (red), or questionable (gray) for the respective construct. For each construct, the gray horizontal lines represent data-driven ad hoc thresholds discriminating between presence and absence. Note that the y-axis scale is different for each construct. MUGA, Mouse Universal Genotyping Array; g_FP, ‘greenish’ fluorescent protein; SV40, SV40 large T antigen; Cre, Cre recombinase; tTA, tetracycline repressor protein; DTA, Diphtheria toxin; hCMV_b, Human CMV enhancer version b; Luc, Luciferase and firefly luciferase; chloR, Chloramphenicol acetyltransferase; bpA, Bovine growth hormone poly A signal sequence; iCre, iCre recombinase; rtTA, Reverse improved tetracycline-controlled transactivator; cas9, CRISPR associated protein 9; BlastR, Blasticidin resistance; IRES, Internal Ribosome Entry Site; hCMV_a, hCMV enhancer version a; r_FP, ‘reddish’ fluorescent protein; hTK_pr, Herpesvirus TK promoter.

Across these 17 constructs, we observed that our ability to discriminate between negative and positive samples was strongly correlated with the number of independent probes for that construct (Figure 9 and Figure S2B). As signal intensity is constrained by the dynamic range, our ability to definitively call the presence of constructs with few probes is more uncertain. This uncertainty is especially relevant where a construct within a sample is genetically divergent from the sequences used to design a given probe or probe set. Given our ability to positively identify construct classes with as few as two probes, it is likely that our pipeline will flag samples even for constructs that have sequences divergent from our designed sequences, or that belong to a more distantly related construct type.
However, users are highly encouraged to consult the probe sequences (Figure S3) when they expect a given sample to contain a construct but do not see support in the array itself. Conversely, if a construct with many independent probes is determined to be present, that call is more reliable, even if a sample is not expected to contain that construct. We additionally observed that for some constructs, there was between-sample variation in the overall intensity of the signal associated with a given construct [see internal ribosome entry site (IRES), Figure 9]. For IRES, the between-sample variation was likely due to a higher copy number of the construct in five individual samples, because intensity was consistently higher across all probes (Figure S2A). Copy-number variation in transgene insertions is a common phenomenon (see “Development” documentation on JAX Stock 034860, McCray ), and such copy-number variation segregating within a given colony/line can lead to noise and a lack of reproducibility in given experiments. Alternatively, between-sample variation might be explained by between-probe variation for a construct, as discussed above. Individuals are encouraged to examine the summed intensity levels for positive constructs for their strains/samples, to confirm that within a relevant sample group, these levels are roughly equal.
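The check suggested above can be sketched as follows. This is only an illustration under stated assumptions: the two-threshold scheme mirrors the negative, questionable, and positive classes shown in Figure 9, but the threshold values, function names, and the relative-spread tolerance are hypothetical and are not taken from the authors' pipeline.

```python
def call_construct(probe_intensities, neg_max, pos_min):
    """Classify one sample for one construct from its summed probe intensity.

    neg_max and pos_min are construct-specific thresholds (hypothetical here);
    sums between them are reported as questionable.
    """
    total = sum(probe_intensities)
    if total >= pos_min:
        return "positive", total
    if total <= neg_max:
        return "negative", total
    return "questionable", total


def roughly_equal(summed_intensities, tolerance=0.25):
    """True if all sums lie within `tolerance` (relative) of the group mean."""
    mean = sum(summed_intensities) / len(summed_intensities)
    return all(abs(x - mean) <= tolerance * mean for x in summed_intensities)


# Toy intensities for three samples of one line (not data from the paper).
samples = [[2, 3, 3], [2, 2, 3], [5, 6, 6]]
calls = [call_construct(s, neg_max=2, pos_min=6) for s in samples]
sums = [total for _, total in calls]
consistent = roughly_equal(sums)
```

In this toy case all three samples are called positive, but `consistent` is `False` because the third sample has a much higher summed intensity, the kind of pattern the text attributes to possible copy-number differences.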

An easy-to-interpret report summarizes the genetic QC for every sample

The MiniMUGA Background Analysis Report (Figure 10) aims to provide users with essential sample information derived from the genotyping array for every sample genotyped. The report is designed to provide overall sample QC, as well as genetic background information for classical inbred mouse strains, and congenic and transgenic mice. For samples outside of this scope, the report may be incomplete and/or provide misleading conclusions. Details of the thresholds and algorithms for each section of the report are provided in the Materials and Methods section.

Figure 10

Background Analysis Report for the sample named MMRRC_UNC_F38673, from the line named B6.Cg-Cdkn2a/Mmnc. The genotype of this sample is of excellent quality. It is a close to inbred female that is a congenic with C57BL/6J as a primary background, and with multiple regions of a 129S secondary background. This sample is positive for the luciferase and firefly luciferase construct, and negative for 16 other constructs. g_FP, ‘greenish’ fluorescent protein; SV40, SV40 large T antigen; Cre, Cre recombinase; tTA, tetracycline repressor protein; DTA, Diphtheria toxin; hCMV_b, Human CMV enhancer version b; Luc, Luciferase and firefly luciferase; chloR, Chloramphenicol acetyltransferase; bpA, Bovine growth hormone poly A signal sequence; iCre, iCre recombinase; rtTA, Reverse improved tetracycline-controlled transactivator; cas9, CRISPR associated protein 9; BlastR, Blasticidin resistance; IRES, Internal Ribosome Entry Site; hCMV_a, hCMV enhancer version a; r_FP, ‘reddish’ fluorescent protein; hTK_pr, Herpesvirus TK promoter; PAR, pseudoautosomal region.

In addition to chromosomal sex and the presence of constructs, the report provides a quantitative and qualitative score for genotyping quality. Based on the number of N calls per sample in our sample set, we classified samples into one of four categories: samples with excellent quality (0–91 N calls, 96.8% of samples), samples with good quality (between 92 and 234 N calls, 2% of samples), samples with questionable quality (between 235 and 446 N calls, 0.9% of samples), and samples with poor quality (>446 N calls, 0.3% of samples). Only tier 1 and 2 markers were used in this analysis (see Materials and Methods). Regarding inbreeding status, the report assigns every sample to one of three categories: inbred (<61 H calls), close to inbred (between 61 and 280 H calls), and outbred (>280 H calls).
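The quality and inbreeding categories above translate directly into two small lookup functions. The cutoffs are the ones stated in the text; the function names are hypothetical, and treating exactly 447 N calls as poor quality is an assumption of this sketch.

```python
def classify_quality(n_calls):
    """Map the number of N calls (tier 1 and 2 markers) to a quality category."""
    if n_calls <= 91:
        return "excellent"
    if n_calls <= 234:
        return "good"
    if n_calls <= 446:
        return "questionable"
    return "poor"  # assumption: 447 or more N calls


def classify_inbreeding(h_calls):
    """Map the number of autosomal H calls to an inbreeding category."""
    if h_calls < 61:
        return "inbred"
    if h_calls <= 280:
        return "close to inbred"
    return "outbred"


# Example: a sample with 40 N calls and 150 H calls.
example = (classify_quality(40), classify_inbreeding(150))
```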
These thresholds are based on the number of H calls observed in the autosomes of 172 samples of classical inbred strains and predicted heterozygosity in 3655 in silico F1 hybrid mice (Figure S7). For genetic background detection, the report provides two complementary analyses. The first infers the primary and secondary backgrounds of samples that pass genotype quality and inbreeding thresholds based on the totality of their genotypes (excluding the Y chromosome, mitochondria markers, and construct probes). The second returns the genetic backgrounds detected in a sample based on the presence of the diagnostic allele at diagnostic SNPs (see section on Diagnostic SNPs as a tool for genetic QC and strain dating). For the primary background analysis, the sample’s genotype is compared to a set of 120 classical and wild-derived inbred reference strains (Table S3) to identify the strain that best explains the sample genotypes. If multiple substrains from the same strain group have been detected via diagnostic alleles, or if there is an overrepresentation of a particular diagnostic strain in the unexplained markers, the algorithm generates a composite strain consensus that incorporates all substrains in that strain group and uses it in the primary background analysis. The strain or combination of substrains that best matches the sample is called the primary background for the sample. The report provides the number of homozygous calls that are consistent or inconsistent with the primary background, as well as the number of heterozygous calls in the sample. The primary background is always returned for samples in which the primary background explains at least 99.8% of the sample genotype calls. Once the primary background is identified, the algorithm tests whether ≥75% of the markers inconsistent with the primary strain background and heterozygous markers are spatially clustered. 
If they are not (<75% of markers spatially clustered) the algorithm will not try to identify a secondary background. If ≥75% of the unexplained markers are clustered, all reference strains that equally explain the unexplained calls are identified as potential secondary background(s). If the combination of primary and secondary backgrounds explains ≥99.8% of the calls, the primary and secondary backgrounds are reported. If this combination explains <99.8%, then no genetic background is returned. For samples where a primary and secondary background is reported, the algorithm determines whether the remaining unexplained markers are spatially clustered. If they are, the summary states that clustering of unexplained markers may indicate the presence of an additional genetic background. The limitations of this greedy approach to identification of the primary and secondary backgrounds are addressed in the Discussion section. Note that this report is generated programmatically using an available set of reference inbred strains (Table S3). If the reported results are inconsistent with expectations, users need to consider further analyses before reaching a final conclusion. All estimates and claims in the report are heavily dependent on the quality of the sample and genotyping results. Less than excellent genotyping quality may increase the likelihood of an incorrect background determination. Genotyping noise can lead to incorrect reporting and may be particularly misleading in samples from standard commercial inbred strains. Fully inbred strains routinely have a small percentage of spurious H calls. These do not represent true heterozygosity (see consensus of inbred strains).
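The reporting rules for primary and secondary backgrounds can be summarized as a short decision function. This is a simplified reading of the description above, not the authors' implementation: it takes precomputed fractions as inputs (how they are computed is outside the sketch), the function and parameter names are hypothetical, and the handling of a primary background that explains less than 99.8% of calls with no clustering is an assumption.

```python
MIN_EXPLAINED = 0.998  # fraction of calls that the reported background(s) must explain
MIN_CLUSTERED = 0.75   # fraction of unexplained markers that must be spatially clustered


def report_backgrounds(primary_frac, clustered_frac, combined_frac=None):
    """Decide what the report returns for one sample.

    primary_frac: fraction of calls explained by the best-matching strain.
    clustered_frac: fraction of unexplained and heterozygous markers that cluster.
    combined_frac: fraction explained by primary plus secondary, if a secondary
        candidate was found (None otherwise).
    """
    secondary_tested = clustered_frac >= MIN_CLUSTERED
    if (secondary_tested and combined_frac is not None
            and combined_frac >= MIN_EXPLAINED):
        return "primary and secondary"
    if primary_frac >= MIN_EXPLAINED:
        return "primary only"
    return "no background returned"


# Toy congenic: the primary explains 97% of calls, leftovers cluster, and a donor fits.
congenic = report_backgrounds(0.97, 0.90, combined_frac=0.999)
# Toy noisy sample: unexplained markers are scattered, so no secondary is sought.
noisy = report_backgrounds(0.99, 0.30)
```

Because the search picks the best single strain first and only then looks for a secondary strain, it is greedy in the sense discussed later in the text.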

Discussion

MiniMUGA as a tool for QC

Among the many new capabilities of the MiniMUGA array compared with its predecessors is the Background Analysis Report provided with each genotyped sample. Although expert users can, and undoubtedly will, refine existing analysis pipelines and develop new ones, all users benefit from a common baseline developed after the analysis of thousands of samples. The size, annotation, and variety of our sample set provide a firm foundation for our conclusions. We urge users to pay particular attention to genotype quality, reported heterozygosity, and unexpected conclusions (i.e., sex, backgrounds, and constructs detected). Genotype quality depends on the sample quality, quantity, and purity, and on the actual genotyping process. Poor genotype quality can also be the byproduct of off-target variants in the probes used for genotyping and, thus, wild mouse samples and mice from related taxa are expected to have lower apparent quality (Didion ). Samples with poor quality will not be run through the full report pipeline. Samples with questionable quality may lead to incorrect conclusions. For samples of any quality, the total number of N calls should be carefully considered if unexpected results are reported. It is also important to consider a sample’s probe intensity distribution value, as represented by pd_stat, when evaluating the credibility of its reported chromosomal sex. The reported chromosomal sex of samples with high pd_stat (>3230) should be questioned (see Figure S5). Reported heterozygosity is sensitive to genotyping quality. A lower-quality sample will typically include more spurious heterozygous calls than an excellent-quality sample of the same strain. This leads to an incorrect estimate of the level of inbreeding in a given sample and can be particularly misleading in a fully inbred mouse of a single background.
The thresholds used to classify samples as inbred, close to inbred, and outbred are somewhat arbitrary and reflect the biases in SNP selection (overrepresentation of diagnostic SNPs for selected substrains) and the highly variable range of diversity observed in F1 mice. We used the observed number of H calls in known inbred samples and the predicted number of H calls among a large and varied set of potential F1 hybrids to set our thresholds, but users should consider the level of heterozygosity expected in a specific experiment (Figure S7). For example, mice generated in RCC between related substrains may have a very small number of H calls and thus will be misclassified as more inbred than they really are. The report combines sample quality and heterozygosity in a single figure for quick visual inspection (Figure 10). Note that the x- and y-axes are compressed in the high-value range to ensure that all samples, even those with very poor quality and/or high heterozygosity, are shown. The precise location of a sample in the plot should help customers contextualize their sample’s quality and inbreeding when evaluating their results. For users genotyping a large number of samples in a given batch (for example, several 96-well-plates), we found it useful to include a plate-specific control at an unambiguous location (we used the B3 well). Ideally, these controls should have known genotypes, excellent quality, and be easy to distinguish from all other samples in the batch. Plating errors or unaccounted transpositions occurring during the genotyping process are rare but problematic. Adding one sample per plate is a reasonable price to pay to quickly identify these issues. We anticipate that most users will use the Background Analysis Report to determine the genetic background(s) present in a sample as well as their respective contributions. 
The identification of the correct primary and secondary background is completely dependent on the preexisting set of reference strains (Table S3). If a genotyped sample is derived from a strain that is not part of this reference set, the reported results may be misleading or completely incorrect. Users should consult the list of reference backgrounds (Table S3). We expect the number of reference backgrounds to increase over time, reducing the frequency and impact of this problem. However, the current background detection pipeline is not appropriate for RILs such as the BXD and CC populations. By their very nature, these RILs have mosaic genomes derived from two or more inbred strains included in our panel, and thus the background analysis will detect more than two inbred backgrounds for CC strains, or declare one of the parental strains as primary or secondary background for the BXD strains. Users interested in confirming or determining the identity of RILs can use our consensus genotypes to do so. An important caveat of the current primary and secondary background analysis is that the approach is greedy, and all variants except those with H and N calls in the consensus are considered. Because only a fraction of the SNPs are informative between a given pair of strains (always less than one half, see Figure 5 and Figure S7), the algorithm always overestimates the contribution of the primary background and underestimates the contribution of the secondary background (Figure 10). In congenic strains, the true contribution of the strain identified as the secondary background is approximately 2.6 times higher than shown in the report (Figure S9). This appears to be true independent of the strains identified as primary and secondary backgrounds and their proportions. If the exact contribution of either background is critical for the research question, the user should reanalyze the data using only SNPs that discriminate between the two backgrounds. 
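The suggested reanalysis can be sketched as follows. This is a minimal example under stated assumptions, not the authors' code: genotypes are represented as dictionaries of single-letter calls, the function name is hypothetical, and heterozygous sample calls are counted as half secondary background. Only markers at which the two reference strains carry different homozygous calls are used.

```python
HOMOZYGOUS = set("ACGT")


def secondary_contribution(sample, primary, secondary):
    """Estimate the secondary-background fraction from informative SNPs only.

    Each argument maps marker name -> call ("A", "C", "G", "T", "H", or "N").
    Returns (fraction, n_used); fraction is None if no marker is usable.
    """
    score = 0.0
    used = 0
    for marker, call in sample.items():
        p = primary.get(marker)
        s = secondary.get(marker)
        # Keep only markers where both references are homozygous and differ.
        if p not in HOMOZYGOUS or s not in HOMOZYGOUS or p == s:
            continue
        if call == s:
            score += 1.0
        elif call == "H":
            score += 0.5  # one allele from each background
        elif call != p:
            continue  # N call or unexpected allele: skip the marker
        used += 1
    return (score / used if used else None), used


# Toy genotypes (not data from the paper); m1 is uninformative.
primary = {"m1": "A", "m2": "C", "m3": "G", "m4": "T"}
secondary = {"m1": "A", "m2": "T", "m3": "A", "m4": "C"}
sample = {"m1": "A", "m2": "C", "m3": "A", "m4": "H"}
fraction, n_used = secondary_contribution(sample, primary, secondary)
```

Restricting the denominator to informative markers avoids crediting the primary background with markers at which the two strains cannot be told apart.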
A second caveat is that the current pipeline does not include the mitochondria and Y chromosomes. This shortcoming will be addressed in a future update of the Background Analysis Report. A final caveat is that in most cases where more than two inbred strains are needed to explain the genotypes of a sample, the report does not identify any of them. In our experience, when three or more backgrounds are present, a greedy search is not effective and often leads to incorrect results. Therefore, if the user has prior knowledge of at least some of the backgrounds involved, they should conduct an iterative hierarchical search, which will typically yield the correct solution, although care needs to be taken at each step. The private variants that underlie the RCC concept are the diagnostic variants used in background determination and sample dating. Diagnostic SNPs have little information content but high specificity. The presence of diagnostic alleles in a sample is strong evidence that the specific substrain (or a closely related substrain absent from our set) contributed to the genetic background of that sample. However, because only a small fraction of diagnostic SNPs have been observed in all three genotypes across multiple samples, their performance is not well established, in particular for heterozygous calls. To avoid errors, we required diagnostic alleles at three different SNPs in a given sample before a genetic background was declared in the Background Analysis Report. All diagnostic SNPs began their history as partially diagnostic (segregating in an inbred strain or substrain population). The examples for dating stocks shown in the Results section were fairly simple, but more complex and more interesting patterns are plentiful in our data set. For example, four samples from a congenic inbred stock showed evidence of both an old stock and new refreshing of the genome in recent years (Figure S10).
Specifically, the presence of ancestral alleles at many diagnostic SNPs fixed prior to epoch III and the start of the CC speaks of a mouse line generated and bred independently for many years. On the other hand, heterozygosity at some of these markers as well as the presence of diagnostic alleles that are still segregating (Table 5) indicates that this line was refreshed by backcrossing to C57BL/6J in recent years. Both conclusions are consistent with our expectations as this stock was imported by Mark Heise at UNC in 2014 and backcrossed once or a few times to JAX mice before being maintained by brother–sister mating. In addition to improving the genetic QC, we believe that this type of analysis may provide researchers with critical information to guide both experimental design and data analysis. Also important is the ability to estimate the amount of drift that will occur and thus the amount of genetic variation present in that line but absent in the main stock. We expect that widespread use of MiniMUGA and the continued and rapid annotation of diagnostic SNPs (Table 4), not only for C57BL/6J but for all inbred substrains, offers an opportunity to significantly improve the rigor and reproducibility of mouse research. Mouse cell lines can be subject to the same genetic QC as mice. We have shown the ability to detect sex chromosome aneuploidy in cell lines (Figure 1). Diagnostic SNPs can be used to date cell lines in similar fashion to live mice with the added simplicity that cell lines are less susceptible to drift. Finally, the Background Analysis Report pipeline can be used to effectively identify the origin of cell lines. Examples are provided in Figure S8. The importance of genetic QC in cell lines will grow in the future given the increased emphasis on cell-based research. We have previously reported that the number of N calls on other genotyping platforms is higher for cell lines than for biopsies (Didion ). 
The evidence of such phenomena in our data set is inconclusive. Genetic constructs have been a staple of genome editing technologies since the 1980s. In addition to desired genetic modifications, constructs will often include a variety of other necessary features (e.g., selection markers and constitutive promoters). The array can be used to validate the presence of constructs expected to be present and/or to identify the presence of unexpected constructs. Our construct probe design was focused on targeting conserved features of various genetic engineering and/or in vitro constructs commonly used in mammalian genetics. We can split these conserved probe sets into two main classes: those for which we were able to detect positive samples in our data set, and those for which we were not able to detect any consistently positive samples. Given the between-probe variation in a given sample, interested users can examine the individual probe intensities to refine the analysis (e.g., the cyan, green, and yellow fluorescent protein probe sets). Finally, we designed probes for 14 constructs (123 probes) for which we were unable to call presence or absence in our pipeline. This may be due to lack of positive samples in our data set, not enough probes with positive signal for a given construct, or probes that failed. If a user knows that a construct is present in their data set, they are encouraged to recreate our pipeline for calling presence or absence of the relevant construct.

MiniMUGA as a tool for discovery

MiniMUGA was designed to support the research mission of geneticists, but the range of applications will depend on the ingenuity of its users. In the Results sections, we explored three areas in which MiniMUGA has the potential to enhance existing resources and tools. The first of these areas was sex chromosome biology. MiniMUGA is able to robustly determine four sex chromosome configurations (Figure 1) and thus facilitate estimation of the incidence and prevalence of sex chromosome aneuploidy in the mouse. The variation of aneuploidy rates depending on the sire background provides a promising avenue to study the genetics of sex chromosome missegregation. In addition, identification of aneuploid mice can become routine in experimental cohorts and crosses. This is also important in colony management, as XO and XXY mice are likely to be infertile or have reduced fertility (Heard and Turner 2011). This type of analysis can also identify sex chromosome mosaicism (Johnson ; Fragouli ; McCoy 2017) and large structural variants involving the sex chromosomes. In the Results section, we have shown that mosaics are outliers with respect to the four defined clusters observed in the intensity-based chromosome sex determination plot (Figure 1). Specifically, they have abnormal Y chromosome intensities. These mosaics may also have an abnormally high ratio of N calls in the X chromosome compared with the autosomes and chromosome X marker intensity distributions biased toward one parent (Figure 3). This last analysis is only possible in the presence of heterozygosity on the X chromosome. Further evidence of the value of the MiniMUGA array for the characterization of the sex chromosomes is the identification of a 6-Mb de novo duplication of the distal chromosome X (Figure S11) in a single F2 male. The size of this duplication is not large enough to affect chromosomal sex determination and its discovery was due to the presence of 10 heterozygous calls clustered on distal X. 
These heterozygous calls occur at informative markers between the two parental CC strains of the F2 cross and are embedded in a region of 26 consecutive markers with higher-than-expected intensity (Figure S11). Interestingly, the parental CC strains (CC029/Unc and CC030/GeniUnc) are the same for which a 10× increase in sex chromosome aneuploidy is observed. We concluded that this F2 male had a well-defined duplication of the distal X chromosome. These vignettes provide a potential blueprint that can be extended to other chromosomes and structural variants. It also highlights the importance of having a large set of well-defined genotyped controls, against which to compare a given sample. A second area of potential research is the expansion of the RCC paradigm beyond the narrow confines of C57BL/6 substrains (Kumar ; Babbs ; Treger ). A successful RCC requires complete knowledge of the sequence variants shared and private to the set of substrains that will be used in the mapping experiments. These private variants are needed to infer causation but also in the initial step of genetic mapping. We acknowledge that the development of MiniMUGA was made possible by the efforts of the community to sequence an increasing number of inbred strains. The expansion of RCC to 129S, A, BALB/c, C57BL/6, C3H, DBA/1, DBA/2, FVB, and NOD substrains should increase the total number of accessible private mutations by at least one order of magnitude as compared with RCC involving C57BL/6N and C57BL/6J. Therefore, we should expect a similar increase in the number of causative genetic variants. We note that even as substrains continue to accumulate private variants in an unpredictable manner, MiniMUGA will retain its value for genetic mapping, but future WGS will be required to identify those variants. Genotyping arrays are a powerful, standardized platform with which to characterize the genomic composition of sets of samples. Here, we have described a new mouse genotyping array, MiniMUGA. 
We have illustrated how the design and performance of MiniMUGA provides a more robust platform for genetic QC (at the sex-chromosome, mouse substrain identity, and genetic construct levels) relative to our previously designed arrays. We have also illustrated examples of how this array can be used for new genetic discovery, including sex chromosome abnormalities, genomic duplications, and also in the expansion of genetic mapping approaches. This array and these associated data highlight the utility of genetic QC for more robust and reproducible science in the mouse, and they are already being widely used by the research community (Smith ; Gu ; Yu ).

52 in total

1. Whole-genome genotyping with the single-base extension assay.

Authors: Frank J Steemers; Weihua Chang; Grace Lee; David L Barker; Richard Shen; Kevin L Gunderson
Journal: Nat Methods Date: 2006-01 Impact factor: 28.547

Review 2. Function of the sex chromosomes in mammalian fertility.

Authors: Edith Heard; James Turner
Journal: Cold Spring Harb Perspect Biol Date: 2011-10-01 Impact factor: 10.005

Review 3. Mosaicism in Preimplantation Human Embryos: When Chromosomal Abnormalities Are the Norm.

Authors: Rajiv C McCoy
Journal: Trends Genet Date: 2017-04-28 Impact factor: 11.639

4. Sequence and Structural Diversity of Mouse Y Chromosomes.

Authors: Andrew P Morgan; Fernando Pardo-Manuel de Villena
Journal: Mol Biol Evol Date: 2017-12-01 Impact factor: 16.240

5. C57BL/6N mutation in cytoplasmic FMRP interacting protein 2 regulates cocaine response.

Authors: Vivek Kumar; Kyungin Kim; Chryshanthi Joseph; Saïd Kourrich; Seung-Hee Yoo; Hung Chung Huang; Martha H Vitaterna; Fernando Pardo-Manuel de Villena; Gary Churchill; Antonello Bonci; Joseph S Takahashi
Journal: Science Date: 2013-12-20 Impact factor: 47.728

6. Complex control of GABA(A) receptor subunit mRNA expression: variation, covariation, and genetic regulation.

Authors: Megan K Mulligan; Xusheng Wang; Adrienne L Adler; Khyobeni Mozhui; Lu Lu; Robert W Williams
Journal: PLoS One Date: 2012-04-10 Impact factor: 3.240

7. SNP array profiling of mouse cell lines identifies their strains of origin and reveals cross-contamination and widespread aneuploidy.

Authors: John P Didion; Ryan J Buus; Zohreh Naghashfar; David W Threadgill; Herbert C Morse; Fernando Pardo-Manuel de Villena
Journal: BMC Genomics Date: 2014-10-03 Impact factor: 3.969

8. The Lupus Susceptibility Locus Sgp3 Encodes the Suppressor of Endogenous Retrovirus Expression SNERV.

Authors: Rebecca S Treger; Scott D Pope; Yong Kong; Maria Tokuyama; Manabu Taura; Akiko Iwasaki
Journal: Immunity Date: 2019-01-29 Impact factor: 31.745

9. A new set of BXD recombinant inbred lines from advanced intercross populations in mice.

Authors: Jeremy L Peirce; Lu Lu; Jing Gu; Lee M Silver; Robert W Williams
Journal: BMC Genet Date: 2004-04-29 Impact factor: 2.797

10. R2d2 Drives Selfish Sweeps in the House Mouse.

Authors: John P Didion; Andrew P Morgan; Liran Yadgary; Timothy A Bell; Rachel C McMullan; Lydia Ortiz de Solorzano; Janice Britton-Davidian; Carol J Bult; Karl J Campbell; Riccardo Castiglia; Yung-Hao Ching; Amanda J Chunco; James J Crowley; Elissa J Chesler; Daniel W Förster; John E French; Sofia I Gabriel; Daniel M Gatti; Theodore Garland; Eva B Giagia-Athanasopoulou; Mabel D Giménez; Sofia A Grize; İslam Gündüz; Andrew Holmes; Heidi C Hauffe; Jeremy S Herman; James M Holt; Kunjie Hua; Wesley J Jolley; Anna K Lindholm; María J López-Fuster; George Mitsainas; Maria da Luz Mathias; Leonard McMillan; Maria da Graça Morgado Ramalhinho; Barbara Rehermann; Stephan P Rosshart; Jeremy B Searle; Meng-Shin Shiao; Emanuela Solano; Karen L Svenson; Patricia Thomas-Laemont; David W Threadgill; Jacint Ventura; George M Weinstock; Daniel Pomp; Gary A Churchill; Fernando Pardo-Manuel de Villena
Journal: Mol Biol Evol Date: 2016-02-15 Impact factor: 16.240

16 in total

1. Genic and chromosomal components of Prdm9-driven hybrid male sterility in mice (Mus musculus).

Authors: Barbora Valiskova; Sona Gregorova; Diana Lustyk; Petr Šimeček; Petr Jansa; Jiří Forejt
Journal: Genetics Date: 2022-08-30 Impact factor: 4.402

2. Zhx2 Is a Candidate Gene Underlying Oxymorphone Metabolite Brain Concentration Associated with State-Dependent Oxycodone Reward.

Authors: Jacob A Beierle; Emily J Yao; Stanley I Goldstein; William B Lynch; Julia L Scotellaro; Anyaa A Shah; Katherine D Sena; Alyssa L Wong; Colton L Linnertz; Olga Averin; David E Moody; Christopher A Reilly; Gary Peltz; Andrew Emili; Martin T Ferris; Camron D Bryant
Journal: J Pharmacol Exp Ther Date: 2022-06-10 Impact factor: 4.402

3. QTL-mapping in the obese Berlin Fat Mouse identifies additional candidate genes for obesity and fatty liver disease.

Authors: Manuel Delpero; Danny Arends; Aimée Freiberg; Gudrun A Brockmann; Deike Hesse
Journal: Sci Rep Date: 2022-06-21 Impact factor: 4.996

4. Patterns and mechanisms of sex ratio distortion in the Collaborative Cross mouse mapping population.

Authors: Brett A Haines; Francesca Barradale; Beth L Dumont
Journal: Genetics Date: 2021-11-05 Impact factor: 4.402

5. Cardiac proteomics reveals sex chromosome-dependent differences between males and females that arise prior to gonad formation.

Authors: Wei Shi; Xinlei Sheng; Kerry M Dorr; Josiah E Hutton; James I Emerson; Haley A Davies; Tia D Andrade; Lauren K Wasson; Todd M Greco; Yutaka Hashimoto; Joel D Federspiel; Zachary L Robbe; Xuqi Chen; Arthur P Arnold; Ileana M Cristea; Frank L Conlon
Journal: Dev Cell Date: 2021-10-15 Impact factor: 13.417

6. Collaborative Cross mice reveal extreme epilepsy phenotypes and genetic loci for seizure susceptibility.

Authors: Bin Gu; John R Shorter; Lucy H Williams; Timothy A Bell; Pablo Hock; Katherine A Dalton; Yiyun Pan; Darla R Miller; Ginger D Shaw; Benjamin D Philpot; Fernando Pardo-Manuel de Villena
Journal: Epilepsia Date: 2020-08-27 Impact factor: 5.864

7. Host-pathogen genetic interactions underlie tuberculosis susceptibility in genetically diverse mice.

Authors: Clare M Smith; Richard E Baker; Megan K Proulx; Bibhuti B Mishra; Jarukit E Long; Sae Woong Park; Ha-Na Lee; Michael C Kiritsy; Michelle M Bellerose; Andrew J Olive; Kenan C Murphy; Kadamba Papavinasasundaram; Frederick J Boehm; Charlotte J Reames; Rachel K Meade; Brea K Hampton; Colton L Linnertz; Ginger D Shaw; Pablo Hock; Timothy A Bell; Sabine Ehrt; Dirk Schnappinger; Fernando Pardo-Manuel de Villena; Martin T Ferris; Thomas R Ioerger; Christopher M Sassetti
Journal: Elife Date: 2022-02-03 Impact factor: 8.140

8. Bayesian modeling of skewed X inactivation in genetically diverse mice identifies a novel Xce allele associated with copy number changes.

Authors: Kathie Y Sun; Daniel Oreper; Sarah A Schoenrock; Rachel McMullan; Paola Giusti-Rodríguez; Vasyl Zhabotynsky; Darla R Miller; Lisa M Tarantino; Fernando Pardo-Manuel de Villena; William Valdar
Journal: Genetics Date: 2021-05-17 Impact factor: 4.562

9. Gabra2 is a genetic modifier of Scn8a encephalopathy in the mouse.

Authors: Wenxi Yu; Sophie F Hill; James G Xenakis; Fernando Pardo-Manuel de Villena; Jacy L Wagnon; Miriam H Meisler
Journal: Epilepsia Date: 2020-11-02 Impact factor: 5.864

10. A new mouse SNP genotyping assay for speed congenics: combining flexibility, affordability, and power.

Authors: Kimberly R Andrews; Samuel S Hunter; Brandi K Torrevillas; Nora Céspedes; Sarah M Garrison; Jessica Strickland; Delaney Wagers; Gretchen Hansten; Daniel D New; Matthew W Fagnan; Shirley Luckhart
Journal: BMC Genomics Date: 2021-05-24 Impact factor: 3.969