Vishal Kumar Sarsani, Narayanan Raghupathy, Ian T Fiddes, Joel Armstrong, Francoise Thibaud-Nissen, Oraya Zinder, Mohan Bolisetty, Kerstin Howe, Doug Hinerfeld, Xiaoan Ruan, Lucy Rowe, Mary Barter, Guruprasad Ananda, Benedict Paten, George M Weinstock, Gary A Churchill, Michael V Wiles, Valerie A Schneider, Anuj Srivastava, Laura G Reinholdt.
Abstract
Isogenic laboratory mouse strains enhance reproducibility because individual animals are genetically identical. For the most widely used isogenic strain, C57BL/6, there exists a wealth of genetic, phenotypic, and genomic data, including a high-quality reference genome (GRCm38.p6). Now 20 years after the first release of the mouse reference genome, C57BL/6J mice are at least 26 inbreeding generations removed from GRCm38 and the strain is now maintained with periodic reintroduction of cryorecovered mice derived from a single breeder pair, aptly named Adam and Eve. To provide an update to the mouse reference genome that more accurately represents the genome of today's C57BL/6J mice, we took advantage of long read, short read, and optical mapping technologies to generate a de novo assembly of the C57BL/6J Eve genome (B6Eve). Using these data, we have addressed recurring variants observed in previous mouse genomic studies. We have also identified structural variations, closed gaps in the mouse reference assembly, and revealed previously unannotated coding sequences. This B6Eve assembly explains discrepant observations that have been associated with GRCm38-based analyses, and will inform a reference genome that is more representative of the C57BL/6J mice that are in use today.
Keywords: C57BL/6J; Mus musculus domesticus; de novo genome assembly; laboratory mouse; long read sequencing; reference genomes; reproducibility
Year: 2019 PMID: 30996023 PMCID: PMC6553538 DOI: 10.1534/g3.119.400071
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Figure 1. Origin of the inbred strain C57BL/6J. Inbred laboratory mouse strains are maintained by brother × sister mating. The filial (F) generations of the mice that contributed to the reference assembly clone libraries, and of the B6Eve mouse, are shown. Cryopreserved embryo stock is represented by blue snowflakes at F226, 3 generations from Adam and Eve at F223. Generations subsequent to the cryopreservation event are written F226p###; e.g., F226p230 means that embryos cryopreserved at F226 were recovered and then underwent an additional 4 generations of inbreeding.
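The generation notation in the Figure 1 caption can be illustrated with a small parser. This is only a sketch of the naming rule as the caption describes it; the function name `parse_generation` and the returned field names are hypothetical and do not come from the paper.

```python
import re


def parse_generation(label):
    """Split a label such as 'F226p230' into its parts (illustrative only).

    'F226'     -> filial generation 226, no recovery from cryopreservation
    'F226p230' -> embryos cryopreserved at F226, now at generation 230
    """
    match = re.fullmatch(r"F(\d+)(?:p(\d+))?", label)
    if match is None:
        raise ValueError(f"unrecognized generation label: {label!r}")
    base = int(match.group(1))
    # Without a 'p' suffix the current generation is the filial generation.
    current = int(match.group(2)) if match.group(2) else base
    return {
        "base_generation": base,
        "current_generation": current,
        "generations_since_recovery": current - base,
    }


example = parse_generation("F226p230")
```

For the caption's own example, `example["generations_since_recovery"]` is 4, matching the "additional 4 generations" stated there.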
Figure 2. Schematic overview of the de novo assembly procedure for B6Eve. Details are described in Methods.
Table 1. Number of sequences, N50 size, and assembly length for the Bionano optical map, the PacBio de novo assembly, and the scaffolded assemblies
| Metric | Bionano Genomics optical map | PacBio | PacBio only Hybrid | Bionano optical only Hybrid | Final Assembly (LXEJ02000000) | Improvement relative to PacBio assembly |
|---|---|---|---|---|---|---|
| Number of sequences | 3,016 | 14,551 | 3,732 | 1,652 | 12,690 | |
| N50 size | 1.18 | 0.40 | 0.58 | 1.97 | 1.29 | 3.2 |
| Assembly length | 2,482.74 | 2,535.01 | 1,820.29 | 2,470.31 | 2,789.93 | 1.3 |
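The table above reports N50 as a contiguity measure. As a reminder of what that statistic means, here is a minimal sketch of the standard N50 definition: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembly length. The contig lengths are invented toy values, not data from the B6Eve assembly, and the paper's own tooling for this calculation is not described here.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running total to at least
        # half of the assembly length defines the N50.
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


# Toy contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_contigs)
```

With these toy values the total is 290, so the half-way point of 145 is crossed by the second-longest contig and `toy_n50` is 70.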
Table 2. RefSeq transcripts alignment table from NCBI
| | GRCm38 | B6Eve |
|---|---|---|
| | GCF_000001635.20 | na |
| | 36,009 | 36,009 |
| | 36,009 | 35,948 |
| | 7 | 16 |
| | 27 | 1,621 |
| | 57 | 1,644 |
| | 8 | 284 |
| | 1 | 44 |
| | 52 | Pre-correction: 8,566; Post-correction: 335 |
| | 57 | Pre-correction: 55; Post-correction: 32 |
Transcripts were aligned to the GRCm38 full assembly (GCF_000001635.20), which includes alternate loci scaffolds from a variety of mouse strains. Unless noted, counts shown in Table 2 reflect only transcript alignments to the GRCm38 primary assembly unit (GCF_000000055.19), which consists only of C57BL/6J sequences.
Frameshift counts are shown for alignments to the GRCm38 full assembly, including alternate loci scaffolds. Pre-correction: assembly prior to Quiver polishing and Pilon correction.
Figure 3. Ideogram of GRCm38 assembly annotated to highlight resolved gaps (vs. current reference), structural variants, and fixed variation using B6Eve data.
Table 3. Counts of structural variation classes detected in the comparison of B6Eve sequences to GRCm38 using PacBio and Illumina data
| Technology | Duplication | Deletion | Inversion | Insertion | Translocation |
|---|---|---|---|---|---|
| | 229 | 418 | 36 | 3,394 | 71 |
| | 289 | 221 | 111 | — | — |
| | 44 | 12 | 4 | — | — |
Table 4. Comparison of repeat classes in common unaligned sequences with GRCm38
| Repeat Class | Number of bp in common unaligned sequences (6,128,602 bp) | GRCm38 (number of bp in complete genome) excluding "alt loci" (2,559,396,830 bp excl. N/X-runs) |
|---|---|---|
| | 3,671,543 (59.91%) | 3,302,550 (0.13%) |
| | 300,944 (4.91%) | 488,443,086 (18.86%) |
| | 17,196 (0.28%) | 113,630,025 (4.31%) |
| | 32,901 (0.54%) | 111,079,403 (4.22%) |
| | 16,550 (0.27%) | 29,593,691 (1.12%) |
| | 413 (0.01%) | 16,625,654 (0.63%) |
| | 20,863 (0.34%) | 24,057,863 (0.91%) |
| | 1,080 (0.02%) | 8,303,004 (0.32%) |
| | 101,229 (1.65%) | 62,434,516 (2.37%) |
| | 502 (0.01%) | 4,546,701 (0.17%) |
| | 252,653 (4.12%) | 121,444,463 (4.61%) |
| | 307,262 (5.01%) | 289,477,645 (10.99%) |
| | 311,973 (5.09%) | 69,151,432 (2.63%) |
| | 25,984 (0.42%) | 9,687,916 (0.37%) |
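The percentages in the unaligned-sequence column of the repeat-class table above follow from dividing each base-pair count by the stated total of 6,128,602 bp. The sketch below reproduces that arithmetic for the first three rows and, as one possible way to read the comparison, takes the ratio of the two published percentages for the first row. The helper names `percent_share` and `fold_enrichment` are hypothetical, and the ratio is an illustrative summary rather than a statistic reported in the paper.

```python
def percent_share(bp, total_bp):
    """Percentage of total_bp covered by bp, rounded to two decimals."""
    return round(100 * bp / total_bp, 2)


def fold_enrichment(share_unaligned, share_genome):
    """Ratio of two percentage shares (illustrative summary only)."""
    return share_unaligned / share_genome


# Total length of the common unaligned sequences, from the table header.
UNALIGNED_TOTAL_BP = 6_128_602

# Base-pair counts from the first three rows of the unaligned column.
first_rows_bp = [3_671_543, 300_944, 17_196]
shares = [percent_share(bp, UNALIGNED_TOTAL_BP) for bp in first_rows_bp]

# Ratio for the first row, using the two percentages as published
# (59.91% of unaligned sequence vs. 0.13% genome-wide).
first_row_ratio = fold_enrichment(59.91, 0.13)
```

The computed `shares` are 59.91, 4.91 and 0.28, which agree with the percentages printed in those rows.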
Figure 4. The Mia3 locus from the perspective of both the B6Eve assembly (top) and the GRCm38 mouse reference (bottom). CAT annotation of B6Eve identified three isoforms with an IsoSeq-supported exon not found in the reference. The cactus alignments (blue bars) show that 43 bp of reference sequence do not align to B6Eve, and that 638 bp of B6Eve are not seen in the reference. These 638 bp contain the extra exon. This result is confirmed in the B6Eve IsoSeq GRCm38 alignment, which shows an insertion (white blocks between gray exon alignments).