Anuj Srivastava, Andrew P Morgan, Maya L Najarian, Vishal Kumar Sarsani, J Sebastian Sigmon, John R Shorter, Anwica Kashfeen, Rachel C McMullan, Lucy H Williams, Paola Giusti-Rodríguez, Martin T Ferris, Patrick Sullivan, Pablo Hock, Darla R Miller, Timothy A Bell, Leonard McMillan, Gary A Churchill, Fernando Pardo-Manuel de Villena.
Abstract
The Collaborative Cross (CC) is a multiparent panel of recombinant inbred (RI) mouse strains derived from eight founder laboratory strains. RI panels are popular because of their long-term genetic stability, which enhances reproducibility and integration of data collected across time and conditions. Characterization of their genomes can be a community effort, reducing the burden on individual users. Here we present the genomes of the CC strains using two complementary approaches as a resource to improve power and interpretation of genetic experiments. Our study also provides a cautionary tale regarding the limitations imposed by such basic biological processes as mutation and selection. A distinct advantage of inbred panels is that genotyping only needs to be performed on the panel, not on each individual mouse. The initial CC genome data were haplotype reconstructions based on dense genotyping of the most recent common ancestors (MRCAs) of each strain followed by imputation from the genome sequence of the corresponding founder inbred strain. The MRCA resource captured segregating regions in strains that were not fully inbred, but it had limited resolution in the transition regions between founder haplotypes, and there was uncertainty about founder assignment in regions of limited diversity. Here we report the whole genome sequence of 69 CC strains generated by paired-end short reads at 30× coverage of a single male per strain. Sequencing leads to a substantial improvement in the fine structure and completeness of the genomes of the CC. Both MRCAs and sequenced samples show a significant reduction in the genome-wide haplotype frequencies from two wild-derived strains, CAST/EiJ and PWK/PhJ. In addition, analysis of the evolution of the patterns of heterozygosity indicates that selection against three wild-derived founder strains played a significant role in shaping the genomes of the CC. 
The sequencing resource provides the first description of tens of thousands of new genetic variants introduced by mutation and drift in the CC genomes. We estimate that new SNP mutations are accumulating in each CC strain at a rate of 2.4 ± 0.4 per gigabase per generation. The fixation of new mutations by genetic drift has introduced thousands of new variants into the CC strains. The majority of these mutations are novel compared to currently sequenced laboratory stocks and wild mice, and some are predicted to alter gene function. Approximately one-third of the CC inbred strains have acquired large deletions (>10 kb), many of which overlap known coding genes and functional elements. The sequence of these mice is a critical resource to CC users, increases threefold the number of mouse inbred strain genomes available publicly, and provides insight into the effect of mutation and drift on common resources.
Keywords: MPP; drift; genetic variants; multiparental populations; selection; whole genome sequence
Year: 2017 PMID: 28592495 PMCID: PMC5499171 DOI: 10.1534/genetics.116.198838
Source DB: PubMed Journal: Genetics ISSN: 0016-6731 Impact factor: 4.562
Figure 1. The CC genomes. In all figures we use the following colors and letter codes to represent the eight founder strains of the CC: A/J, yellow (A); C57BL/6J, gray (B); 129S1/SvImJ, pink (C); NOD/ShiLtJ, dark blue (D); NZO/HlLtJ, light blue (E); CAST/EiJ, green (F); PWK/PhJ, red (G); and WSB/EiJ, purple (H). (A) Haplotype mosaic for the sequenced representative of the CC001/Unc strain. (B) Number of haplotype blocks identified in the MRCA and sequenced samples. (C) Distribution of haplotype block size in MRCAs and sequenced samples in log scale. (D) Founder contribution to the genomes of CC strains with all eight founders. Autosomes are shown in the left panel and chromosome X in the right. Within a panel and founder strain the left boxes represent MRCAs and the right, the sequenced sample. (E) Founder contribution to the genomes of CC strains with missing founders. Founder contribution to chromosome X. Autosomes are shown in the left panel and chromosome X in the right. Within a panel and founder strain the left boxes represent MRCAs and the right, the sequenced sample.
Figure 2. Sequencing improves haplotype assignment in recombination intervals. (A) Haplotype reconstruction for chromosome 5 from MRCAs of CC044/GeniUnc. The focal recombination event is indicated by a gray box. (B) Zoomed-in view of recombination interval, showing the flanking informative markers from the MegaMUGA genotyping array. Haplotype assignment in the MRCAs is uncertain over 11.7 kb. (C) Alleles in the sequenced CC044/GeniUnc male shared with PWK/PhJ (top track) or C57BL/6J (bottom track); inferred recombination interval is indicated by a gray box. (D) Genotypes at informative SNPs between PWK/PhJ and C57BL/6J reduce the recombination interval to 298 bp, between rs32922813 and rs32922811.
Founder contribution to the genomes of the CC strains
| Population | Generation | Chr | A/J | C57BL/6J | 129S1/SvImJ | NOD/ShiLtJ | NZO/HlLtJ | CAST/EiJ | PWK/PhJ | WSB/EiJ |
|---|---|---|---|---|---|---|---|---|---|---|
| All | MRCAs | Aut | 12.84 | 14.54 | 14.18 | 13.75 | 14.93 | 8.55 | 7.41 | 13.80 |
| With eight founders only | MRCAs | Aut | 12.41 | 13.59 | 13.93 | 14.21 | 14.48 | 9.58 | 8.63 | 13.16 |
| All | Sequenced | Aut | 12.60 | 14.61 | 14.34 | 13.87 | 14.83 | 8.53 | 7.45 | 13.78 |
| With eight founders only | Sequenced | Aut | 12.26 | 13.49 | 14.23 | 14.49 | 14.27 | 9.45 | 8.63 | 13.18 |
| All | Sequenced | X | 10.75 | 16.58 | 19.73 | 19.81 | 12.44 | 5.06 | 4.30 | 11.33 |
| With eight founders only | Sequenced | X | 10.63 | 16.81 | 22.84 | 18.99 | 12.35 | 4.58 | 4.85 | 8.95 |
| With eight founders only | Sequenced | Y | 6 | 8 | 7 | 11 | 8 | 8 | 5 | 3 |
| With missing founders | Sequenced | Y | 1 | 2 | 3 | 1 | 5 | 0 | 1 | 0 |
| With eight founders only | Sequenced | Mit | 19* | 15 | 7 | 19* | 6 | 2 | 2 | 5 |
| With missing founders | Sequenced | Mit | 7* | 0 | 2 | 7* | 2 | 1 | 0 | 1 |
For the autosomes and the X chromosome, the table shows the percentage contribution of each founder. For chromosome Y and mitochondria, the table shows the number of CC strains in each haplogroup. *, the mitochondria of A/J and NOD/ShiLtJ cannot be distinguished by sequencing, so the total number of CC strains sharing these haplotypes is shown in both columns.
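The depletion of the CAST/EiJ and PWK/PhJ haplotypes can be read off the table by dividing each percentage by an equal-share expectation. The sketch below does this for the "All / Sequenced / Aut" row; the 12.5% expectation (one-eighth per founder) is an assumption made here for illustration, and the names `FOUNDERS`, `SEQUENCED_ALL_AUT`, and `relative_contribution` are hypothetical.

```python
# Autosomal founder contributions (%) for all sequenced CC strains,
# copied from the "All / Sequenced / Aut" row of the table above.
FOUNDERS = ["A/J", "C57BL/6J", "129S1/SvImJ", "NOD/ShiLtJ",
            "NZO/HlLtJ", "CAST/EiJ", "PWK/PhJ", "WSB/EiJ"]
SEQUENCED_ALL_AUT = [12.60, 14.61, 14.34, 13.87, 14.83, 8.53, 7.45, 13.78]

# Assumption (not stated in the table): each of the eight founders is
# expected to contribute an equal share, i.e. 12.5% of the genome.
EXPECTED = 100.0 / len(FOUNDERS)


def relative_contribution(percentages, expected=EXPECTED):
    """Return observed/expected contribution for each founder."""
    return {f: p / expected for f, p in zip(FOUNDERS, percentages)}


ratios = relative_contribution(SEQUENCED_ALL_AUT)
# Print founders from most depleted to most enriched.
for founder, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{founder:12s} {ratio:.2f}")
```

Values below 1 indicate a founder that is under-represented relative to the equal-share assumption.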
Figure 3. Biased contribution of the CC founders to the residual heterozygosity present in the MRCAs and sequenced samples. The x-axis shows log ratio of observed to expected proportion of the genome in each of 28 possible heterozygous states (y-axis) across 56 CC strains with all eight founder haplotypes present. Heterozygous states are divided into two classes: those involving classical inbred strains only (top) or those involving at least one wild-derived strain (bottom). Black dotted line gives expected value of the statistic (zero), and gray dashed lines show median value in each panel.
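The 28 heterozygous states in this figure are the unordered pairs of the eight founders. A minimal sketch of that enumeration and of the two classes follows; the grouping into five classical and three wild-derived strains (CAST/EiJ, PWK/PhJ, WSB/EiJ) is spelled out here as an assumption of the sketch, and the variable names are hypothetical.

```python
from itertools import combinations

# Assumed grouping of the eight CC founders for this sketch.
CLASSICAL = ["A/J", "C57BL/6J", "129S1/SvImJ", "NOD/ShiLtJ", "NZO/HlLtJ"]
WILD_DERIVED = ["CAST/EiJ", "PWK/PhJ", "WSB/EiJ"]

# Every unordered pair of distinct founders is one heterozygous state.
pairs = list(combinations(CLASSICAL + WILD_DERIVED, 2))

# Class 1: both haplotypes come from classical inbred strains.
classical_only = [p for p in pairs if all(s in CLASSICAL for s in p)]
# Class 2: at least one haplotype comes from a wild-derived strain.
with_wild = [p for p in pairs if any(s in WILD_DERIVED for s in p)]

print(len(pairs), len(classical_only), len(with_wild))
```

Under this grouping, 10 of the 28 states involve classical strains only and 18 involve at least one wild-derived strain.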
Figure 4. Haplotype frequencies on chromosomes 2, 12, and X in MRCAs and sequenced samples. The analysis is restricted to the 56 CC strains with all eight founder strains present.
Nucleotide substitution rates among the HQHom and private variants
| Variant set | REF | ALT: A | ALT: C | ALT: G | ALT: T |
|---|---|---|---|---|---|
| HQHom | A | - | 1401759 (4.0%) | 5073707 (14.6%) | 1455186 (4.2%) |
| HQHom | C | 1589874 (4.6%) | - | 1072433 (3.1%) | 6151991 (17.7%) |
| HQHom | G | 6155882 (17.7%) | 1072703 (3.1%) | - | 1591576 (4.6%) |
| HQHom | T | 1460442 (4.2%) | 5076553 (14.6%) | 1401807 (4.0%) | - |
| Private | A | - | 561 (3.8%) | 1161 (7.8%) | 720 (4.8%) |
| Private | C | 1183 (7.9%) | - | 519 (3.5%) | 3323 (22.3%) |
| Private | G | 3275 (22.0%) | 502 (3.4%) | - | 1179 (7.9%) |
| Private | T | 763 (5.1%) | 1162 (7.8%) | 569 (3.8%) | - |
A tabulation of reference (REF) and alternative (ALT) alleles at SNP variant sites for high-quality homozygous (HQHom) variant calls and for SNP variants that occur uniquely in one CC strain (private). The pattern of substitutions shows a high proportion of C-to-T and G-to-A substitutions and a transition–transversion ratio of ∼2.
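The transition-to-transversion ratio can be recomputed from the counts in the table. The sketch below is one way to do it, with transitions taken to be A↔G and C↔T and every other substitution counted as a transversion; the dictionary and function names are hypothetical.

```python
# (REF, ALT) -> count, copied from the table above.
HQHOM = {
    ("A", "C"): 1401759, ("A", "G"): 5073707, ("A", "T"): 1455186,
    ("C", "A"): 1589874, ("C", "G"): 1072433, ("C", "T"): 6151991,
    ("G", "A"): 6155882, ("G", "C"): 1072703, ("G", "T"): 1591576,
    ("T", "A"): 1460442, ("T", "C"): 5076553, ("T", "G"): 1401807,
}
PRIVATE = {
    ("A", "C"): 561, ("A", "G"): 1161, ("A", "T"): 720,
    ("C", "A"): 1183, ("C", "G"): 519, ("C", "T"): 3323,
    ("G", "A"): 3275, ("G", "C"): 502, ("G", "T"): 1179,
    ("T", "A"): 763, ("T", "C"): 1162, ("T", "G"): 569,
}

# Transitions are purine<->purine or pyrimidine<->pyrimidine changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(counts):
    """Return transitions divided by transversions for a count table."""
    ts = sum(n for sub, n in counts.items() if sub in TRANSITIONS)
    tv = sum(n for sub, n in counts.items() if sub not in TRANSITIONS)
    return ts / tv


print(f"HQHom Ts/Tv:   {ts_tv_ratio(HQHOM):.2f}")
print(f"Private Ts/Tv: {ts_tv_ratio(PRIVATE):.2f}")
```

For the HQHom counts this ratio comes out close to 2.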
Figure 5. Frequency of private variants in 69 CC strains. (A) The log10 frequency per gigabase of SNPs and indels by chromosome (text) and haplotype (color) reveals that wild-derived haplotypes have higher apparent rates of private variation. (B) The strain-specific frequency of SNPs on nonwild autosomal haplotypes was estimated by Poisson regression. The frequency per gigabase of private SNPs increases with the breeding generation of the sequenced animal. The slope of the regression line (2.4 SNPs per gigabase per generation) provides an estimate of the rate of accumulation of new SNPs in the CC strains. Strains are identified by the last two digits of the strain name, e.g., CC002/Unc is indicated as “02” in the figure.
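As a rough illustration of what this slope implies, the sketch below converts the estimated rate into an expected count of new SNPs over a given number of generations and span of sequence. It assumes strictly linear accumulation with no intercept, which is a simplification and not the fitted regression model; the function names and the example inputs (20 generations, 1 Gb) are hypothetical.

```python
RATE_PER_GB_PER_GEN = 2.4   # slope reported in the caption above
RATE_UNCERTAINTY = 0.4      # the "± 0.4" quoted in the abstract


def expected_new_snps(generations, span_gb, rate=RATE_PER_GB_PER_GEN):
    """Expected new SNPs, assuming linear accumulation from zero."""
    return rate * generations * span_gb


def expected_range(generations, span_gb):
    """Lower and upper expectation using rate - 0.4 and rate + 0.4."""
    low = expected_new_snps(generations, span_gb,
                            RATE_PER_GB_PER_GEN - RATE_UNCERTAINTY)
    high = expected_new_snps(generations, span_gb,
                             RATE_PER_GB_PER_GEN + RATE_UNCERTAINTY)
    return low, high


# Hypothetical inputs: 20 generations over 1 Gb of callable sequence.
print(expected_new_snps(20, 1.0))
print(expected_range(20, 1.0))
```

Because the estimate comes from nonwild autosomal haplotypes, applying it to other parts of the genome is a further assumption.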
Analysis of 23 selected de novo deletions
| Strain | Chr | Start | End | Size (kb) | Start (refined with BWT) | End (refined with BWT) | Size (bp, refined with BWT) | Haplotype | Genes | Regulatory Elements | Overlaps with known SV | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CC004 | 2 | 116,465,000 | 116,480,000 | 15 | ND | ND | ND | WSB/EiJ | None | None | Yes | Repeat |
| CC004 | 18 | 75,868,000 | 75,896,000 | 28 | 75,863,985 | 75,900,551 | 36,566 | C57BL/6J | | ENSMUSR00000378576, ENSMUSR00000378577, ENSMUSR00000378578 | Yes | Unique, splice site |
| CC006 | 2 | 36,341,000 | 36,366,000 | 25 | ND | ND | ND | PWK/PhJ | None | ENSMUSR00000152903 | Yes | Olfactory receptor cluster |
| CC007 | 13 | 53,304,000 | 53,318,000 | 14 | 53,299,892 | 53,322,383 | 22,491 | NOD/ShiLtJ | None | ENSMUSR00000074496, ENSMUSR00000339964, ENSMUSR00000074498, ENSMUSR00000339965, ENSMUSR00000074500 | No | Microhomology |
| CC008 | 16 | 59,216,000 | 59,261,000 | 45 | ND | ND | ND | PWK/PhJ | | ENSMUSR00000114105 | Yes | |
| CC011 | 18 | 70,254,000 | 70,257,000 | 3 | 70,249,785 | 70,260,636 | 10,851 | A/J | None | ENSMUSR00000378203 | No | Unique |
| CC013 | 6 | 124,529,000 | 124,569,000 | 40 | ND | ND | ND | PWK/PhJ to WSB/EiJ transition | | ENSMUSR00000235398, ENSMUSR00000439605, ENSMUSR00000235399, ENSMUSR00000439609, ENSMUSR00000439610 | Yes | Complex SD |
| CC025 | 12 | 11,617,000 | 11,621,000 | 4 | 11,613,044 | 11,625,412 | 12,368 | PWK/PhJ | None | None | Yes | |
| CC026 | 17 | 57,161,000 | 57,245,000 | 80 | 57,148,212 | 57,248,753 | 100,541 | C57BL/6J | | Many | Yes | Microhomology |
| CC030 | 8 | 43,624,000 | 43,626,000 | 2 | ND | ND | ND | PWK/PhJ | | None | Yes | |
| CC038 | 2 | 86,260,000 | 86,266,000 | 6 | ND | ND | ND | CAST/EiJ | None | None | Yes | Olfactory receptor cluster |
| CC043 | 4 | 90,613,000 | 90,621,000 | 8 | ND | ND | ND | NOD/ShiLtJ | None | None | No | |
| CC046 | 5 | 151,538,000 | 151,547,000 | 9 | ND | ND | ND | CAST/EiJ | | None | Yes | |
| CC055 | 3 | 132,902,000 | 132,939,000 | 37 | 132,897,360 | 132,944,510 | 47,150 | 129S1/SvImJ | | Many | No | Microhomology |
| CC056 | 8 | 55,080,000 | 55,094,000 | 14 | ND | ND | ND | PWK/PhJ | | None | Yes | |
| CC057 | 16 | 70,377,000 | 70,378,000 | 1 | 70,377,000 | 70,392,212 | 15,212 | C57BL/6J | | None | Yes | Proximal end overlaps with repeat |
| CC072 | 6 | 11,664,000 | 11,738,000 | 74 | ND | ND | ND | WSB/EiJ | None | ENSMUSR00000431623, ENSMUSR00000222992, ENSMUSR00000222993 | Yes | |
| CC072 | 15 | 40,549,000 | 40,551,000 | 2 | 40,544,598 | 40,555,614 | 11,016 | 129S1/SvImJ | None | None | No | Microsatellite |
| CC074 | 13 | 24,536,000 | 24,539,000 | 3 | 24,529,800 | 24,543,000 | 13,200 | PWK/PhJ | None | ENSMUSR00000337301, ENSMUSR00000070027 | Yes | Ends overlap with repeats. No exact breakpoints determined |
| CC074 | 13 | 27,113,000 | 27,123,000 | 10 | ND | ND | ND | PWK/PhJ | | None | Yes | |
| CC074 | 13 | 33,303,000 | 33,304,000 | 1 | ND | ND | ND | PWK/PhJ | None | None | Yes | |
| CC075 | 4 | 112,022,000 | 112,031,000 | 9 | ND | ND | ND | WSB/EiJ | | None | Yes | |
| CC075 | 9 | 27,720,000 | 27,731,000 | 11 | ND | ND | ND | CAST/EiJ | None | ENSMUSR00000462205 | No | |
The table provides the strain and the start and end positions from the initial discovery step and, when applicable, from the refinement step by BWT. It also indicates the founder haplotype and the gene and regulatory content as obtained from Ensembl (v87). Finally, the table indicates whether the deletion overlaps with a known structural variant (SV) and lists defining characteristics. ND, not determined.
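The refined sizes in the table can be checked against the refined coordinates. The sketch below does this for four of the refined deletions, assuming size is simply end minus start (an assumption of this sketch that matches the tabulated values); `REFINED` and `deletion_size` are hypothetical names.

```python
# (strain, chromosome, refined start, refined end, reported size in bp),
# copied from four rows of the table above.
REFINED = [
    ("CC004", "18", 75_863_985, 75_900_551, 36_566),
    ("CC007", "13", 53_299_892, 53_322_383, 22_491),
    ("CC026", "17", 57_148_212, 57_248_753, 100_541),
    ("CC055", "3", 132_897_360, 132_944_510, 47_150),
]


def deletion_size(start, end):
    """Deletion length in bp, assumed here to be end minus start."""
    return end - start


# Collect any rows whose reported size disagrees with the coordinates.
mismatches = [
    (strain, chrom)
    for strain, chrom, start, end, reported in REFINED
    if deletion_size(start, end) != reported
]

for strain, chrom, start, end, _ in REFINED:
    size = deletion_size(start, end)
    print(f"{strain} chr{chrom}: {size:,} bp ({size / 1000:.1f} kb)")
print("mismatches:", mismatches)
```

The same check can be extended to the remaining rows that have refined coordinates; rows marked ND have no refined breakpoints to test.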
Figure 6. Examples of large private deletions. (A) Deletion on a C57BL/6J haplotype on chromosome 17: 57 Mb in CC026/GeniUnc is not shared with CC040/Unc, which shares the underlying C57BL/6J haplotype. Top panel shows normalized coverage in whole-genome sequencing (in 1-kb bins) for CC040/Unc; lower panel shows normalized coverage in CC026/GeniUnc. The deletion spans exons (red) from four genes including complement factor gene C3. Assembled sequence spanning the deletion shows microhomology over 9 bp at the breakpoint. (B) Deletion on a 129S1/SvImJ haplotype on chromosome 3: 133 Mb in CC055/TauUnc is not shared with CC018/Unc, which shares the underlying 129S1/SvImJ haplotype. Organization follows that found in A. The deletion spans the middle exons (red) of Npnt, which encodes the integrin-binding protein nephronectin.
Figure 7. Mapping of unplaced sequences in the CC. (A) QTL scan demonstrating successful localization of GL456378, a contig not localized in the current mouse reference genome (mm10/GRCm38.p5), to distal chromosome 4. (B) Estimated copy number of GL456378 in founder strains. (C) Genomic distribution of 19 sequences localized using the CC. Gold, sequences previously assigned to a chromosome but not a specific position; black, sequences whose position was previously unknown. Dot indicates marker with maximum LOD score and line segment indicates 95% credible interval.