Phylogenomic Insights into Mouse Evolution Using a Pseudoreference Approach.

Brice A J Sarver¹, Sara Keeble¹, Ted Cosart¹, Priscilla K Tucker², Matthew D Dean³, Jeffrey M Good¹.

Abstract

Comparative genomic studies are now possible across a broad range of evolutionary timescales, but the generation and analysis of genomic data across many different species still present a number of challenges. The most sophisticated genotyping and down-stream analytical frameworks are still predominantly based on comparisons to high-quality reference genomes. However, established genomic resources are often limited within a given group of species, necessitating comparisons to divergent reference genomes that could restrict or bias comparisons across a phylogenetic sample. Here, we develop a scalable pseudoreference approach to iteratively incorporate sample-specific variation into a genome reference and reduce the effects of systematic mapping bias in downstream analyses. To characterize this framework, we used targeted capture to sequence whole exomes (∼54 Mbp) in 12 lineages (ten species) of mice spanning the Mus radiation. We generated whole exome pseudoreferences for all species and show that this iterative reference-based approach improved basic genomic analyses that depend on mapping accuracy while preserving the associated annotations of the mouse reference genome. We then use these pseudoreferences to resolve evolutionary relationships among these lineages while accounting for phylogenetic discordance across the genome, contributing an important resource for comparative studies in the mouse system. We also describe patterns of genomic introgression among lineages and compare our results to previous studies. Our general approach can be applied to whole or partitioned genomic data and is easily portable to any system with sufficient genomic resources, providing a useful framework for phylogenomic studies in mice and other taxa.

Keywords: Mus musculus; bioinformatics; comparative genomics; introgression; mapping bias

Year: 2017 PMID: 28338821 PMCID: PMC5381554 DOI: 10.1093/gbe/evx034

Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416

Introduction

The efficient generation and analysis of comparative genome-wide data sets remains a key challenge in evolutionary biology. Massively parallel sequencing has made it relatively easy to generate whole-genome sequencing (WGS) data sets, enabling comparative genomic studies across a broad range of evolutionary timescales. However, the empirical and analytical resources required to generate comparative WGS data sets are still somewhat limiting in species groups with large, complex genomes. As a partial solution, various partitioning approaches are often used to generate comparative genome-wide data across broader sets of species (e.g., restriction site-associated DNA sequencing, targeted capture, transcriptomics, etc.; reviewed in Ekblom and Galindo 2011; Jones and Good 2016). These approaches overcome the extra costs associated with WGS, but analyzing such data across a diverse sample of species still presents a number of challenges. In particular, the most sophisticated analytical frameworks often rely upon approaches developed for WGS and the numerous benefits afforded by a high quality reference genome (e.g., efficient genotyping, physical location, and associated functional annotation; Li and Durbin 2009; Li et al. 2009; McKenna et al. 2010; Yandell and Ence 2012). Thus, as with WGS, the types of analyses that can be conducted using genome-wide partitioned data can be limited by the existence, quality, and completeness of a reference genome.

One common solution in species lacking genomic resources is to use an established reference from another species. Most genotyping approaches are reference-based at some level and thus depend on accurate sequence read mapping (DePristo et al. 2011), which decreases with increasing sequence divergence from the reference (Li et al. 2008). Although mapping algorithms allow for reference mismatches to account for some divergence, polymorphism, or sequencing error (Nielsen et al. 2011; Ruffalo et al. 2011; Liu et al. 2012), mapping to a divergent reference can generate a number of systematic biases that could compromise comparative evolutionary analyses. For example, sequences that show substantial divergence from a reference will map with lower quality and effectively hide corresponding sample-specific variation. Analyses relying on full sequence information, such as those often used in phylogenetics or molecular evolution, may be particularly sensitive to these issues because called genotypes may converge towards the reference, resulting in an overestimated similarity between subject and reference sequences in divergent regions. This phenomenon, generally referred to as reference (or mapping) bias, has been discussed most frequently with regard to its effect on detecting allele-specific expression in transcriptomic analyses (e.g., Satya et al. 2012; Stevenson et al. 2013; Panousis et al. 2014; Brandt et al. 2015), yet it impacts any comparative study where reads are mapped to a divergent reference. For example, reference bias could lead to the underestimation of rates of molecular evolution or the overestimation of phylogenetic discordance due to stochastic genealogical processes (i.e., incomplete lineage sorting) or hybridization.

One approach that has shown some promise in alleviating these concerns is the generation of “pseudogenomes,” or reference genomes that incorporate sample-specific variation (Holt et al. 2013; Huang et al. 2013, 2014). This allows annotation to be carried over from a reference while accounting for sequence divergence during the mapping stage. Here, we extend these previous works by developing a scalable pseudoreference approach to iteratively incorporate sample-specific variation into a reference and thereby reduce the effects of systematic mapping bias in downstream analyses.

The house mouse (Mus musculus) is an important model of mammalian biology and a compelling system in which to develop comparative genomic approaches and resources.
In addition to extensive genetic and developmental resources, the mouse was the second mammal to be sequenced (Chinwalla et al. 2002), and the mouse reference (C57BL/6, a mosaic lab strain primarily of M. musculus domesticus origin; Yang et al. 2011) remains second in quality only to the human genome. House mice have also emerged as a powerful system to address fundamental questions in genome evolution, population genetics, and speciation (e.g., Good et al. 2010; Halligan et al. 2010; Kousathanas et al. 2014; Turner et al. 2014; Phifer-Rixey and Nachman 2015; Larson et al. 2016). Although most evolutionary genomic studies in this group have focused on a few closely related species and subspecies (e.g., Keane et al. 2011; Yang et al. 2011), house mice are embedded within a radiation of ∼38 species that shared a common ancestor ∼7.5 Ma (Schenk et al. 2013). Several of these species already have developed inbred laboratory strains, providing a unique combination of genetic and genomic resources that could be leveraged to address a wide array of evolutionary questions in mammals. However, aspects of the Mus phylogeny remain unresolved, including uncertainty in the evolutionary relationships among some key lineages that are relatively closely related to house mice (e.g., M. spretus/spicilegus/macedonicus and M. caroli/cookii/cervicolor; Hammer and Silver 1993; Lundrigan et al. 2002; Chevret et al. 2005; Tucker et al. 2005; Bryja et al. 2014). In addition to uncertainty in overall species relationships, it is also unclear how much phylogenetic discordance there is across the house mouse genome due to incomplete lineage sorting or gene flow between species (e.g., Keane et al. 2011, Song et al. 2011). Resolving these outstanding issues is an important step in developing the mouse system for comparative evolutionary studies. In this study, we use targeted capture to generate whole exome data (54 Mb targeted, exons and flanking regions) across 10 species of mice (Mus). 
We use these data to evaluate the general performance of our pseudoreference approach in mitigating the effects of reference bias. We then use the pseudoreferences to resolve the phylogenetic relationships among these mouse species while assessing phylogenetic discordance at different genomic scales and the extent of introgression between some lineages. In addition to insights into the evolutionary history of these species, our study provides a foundation for future comparative studies in mice and a general framework for rapidly generating phylogenomic data sets in other groups of closely related species.

Materials and Methods

Exome Capture

Illumina sequencing libraries were generated using whole genomic DNA from ten species (Mus caroli, M. cervicolor, M. cookii, M. macedonicus, M. minutoides, M. musculus, M. pahari, M. platythrix, M. spicilegus, and M. spretus) including three wild-derived inbred strains of house mice (M. musculus domesticus: LEWES/EiJ, hereafter domLEWES; M. m. musculus: CZECHII/EiJ and PWK/PhJ, hereafter musCZECHII and musPWK) (supplementary material table S1, Supplementary Material online). Libraries were individually indexed following Meyer and Kircher (2010), pooled (Pool 1: M. caroli, M. cervicolor, M. cookii, M. minutoides, M. pahari, M. platythrix; Pool 2: musCZECHII, domLEWES, M. macedonicus, musPWK, M. spicilegus, M. spretus), enriched with two NimbleGen SeqCap EZ Mouse exome capture reactions (Fairfield et al. 2011), and 100 bp paired-end sequenced on an Illumina HiSeq 2000. This in-solution enrichment platform targets 54.3 Mbp of exonic regions within the mouse genome (NCBI37/mm9).

Quality Assessment and Iterative Mapping

Raw reads were cleaned using the expHTS pipeline (available from https://github.com/msettles/expHTS; last accessed February 28, 2017), which trims adapters and low-quality bases, merges overlapping reads, and removes identical reads (putative PCR duplicates). Initial capture performance statistics were calculated using CollectHsMetrics in Picard v2.5.0 (available from http://github.com/broadinstitute/picard; last accessed February 28, 2017). To mitigate reference bias, we employed an iterative mapping strategy to generate species-specific exomes embedded within the mouse reference genome (GRCm38). Cleaned reads were mapped to the reference genome using the MEM algorithm of BWA v0.7.15 (Li and Durbin 2009; Li 2013). Duplicate reads were identified postmapping using Picard v2.5.0. For multiply mapped reads, only the location with the best mapping quality was included in the analysis. Regions with insertions or deletions (indels) were identified and realigned, and single nucleotide variants (SNVs) were called using HaplotypeCaller within the Genome Analysis Toolkit (GATK) v3.6 (McKenna et al. 2010; DePristo et al. 2011). Resulting SNVs were filtered for a minimum quality of 30 and a minimum sequencing depth of at least five independent reads. These variants were injected back into the original reference using FastaAlternateReferenceMaker within the GATK. Additional processing of files, such as indexing, merging, and sorting, was accomplished using SAMtools v1.3.1 (Li et al. 2009) and Picard v2.5.0, as required. After each round, the modified reference was used as the starting point for additional iterations, starting with remapping of all reads and proceeding through variant calling. 
The early rounds of this iterative procedure should systematically introduce variants from the sample into the reference, increasing the number of sample reads that map and the number of variants that can be confidently called until the number of incorporated reads stabilizes across subsequent iterations. At this point, we inserted IUPAC ambiguity codes at putative heterozygous positions. It was initially unclear how many iterations of mapping and reference generation ought to be performed to remove reference bias in our study. Preliminary evaluation (data not shown) suggested more than three iterations of mapping and genotyping would be required to incorporate most variation into a pseudoreference. We examined this empirically by identifying the number of iterations (5) at which read incorporation and per-site sequence divergence plateaued in the most divergent species in our sample, M. pahari. We then used this as the number of iterations necessary to produce a stable pseudoreference across all species in our sample. As a final step, each position with insufficient data to confidently call a sample genotype was excluded; an additional round of variant calling was performed with the EMIT_ALL_SITES argument set, producing a VCF with calls at each position. All remaining ambiguous positions (genotype quality <30, read depth <10 or >60) were hard masked (i.e., replaced with an “N”) using GNU awk and bedtools v2.25 (Quinlan and Hall 2010). This produced a final consensus pseudoreference exome for each sample with the same coordinate system as the mouse reference. We also generated pseudoreferences without ambiguity codes for some downstream analyses. These are useful for bioinformatic analyses, including mapping and variant calling, which assume a haploid reference. 
All code necessary to replicate these procedures starting from cleaned reads is available as part of the pseudo-it project on GitHub (http://www.github.com/bricesarver/pseudo-it; last accessed February 28, 2017), and all pseudoreferences are available upon request.
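The two sequence-editing steps described above (injecting filtered SNVs into the reference, then hard masking low-confidence positions) can be illustrated with a minimal sketch. This is not the pseudo-it code and it does not call GATK or bedtools; the `Call` record and the functions `inject_snvs` and `hard_mask` are hypothetical names introduced here. The thresholds are those quoted in the text, and positions are assumed to be 0-based.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """Hypothetical per-site record (not a VCF parser)."""
    pos: int       # 0-based position in the reference sequence
    alt: str       # called alternative base, "" for a reference call
    qual: float    # variant or genotype quality
    depth: int     # number of reads supporting the call


def inject_snvs(reference, calls, min_qual=30, min_depth=5):
    """Return a copy of `reference` with SNVs that pass the filters substituted in."""
    seq = list(reference)
    for call in calls:
        if call.alt and call.qual >= min_qual and call.depth >= min_depth:
            seq[call.pos] = call.alt
    return "".join(seq)


def hard_mask(sequence, site_calls, min_gq=30, min_depth=10, max_depth=60):
    """Replace every position without a confident call by 'N'."""
    called = {call.pos: call for call in site_calls}
    seq = list(sequence)
    for pos in range(len(seq)):
        call = called.get(pos)
        if (call is None or call.qual < min_gq
                or call.depth < min_depth or call.depth > max_depth):
            seq[pos] = "N"
    return "".join(seq)


# One round of injection on a toy reference; only the first SNV passes both filters.
snvs = [Call(1, "T", 50, 8), Call(3, "A", 20, 8), Call(5, "G", 40, 3)]
updated = inject_snvs("ACGTACGT", snvs)
```

In an iterative run, `updated` would serve as the reference for the next round of mapping and variant calling, and `hard_mask` would be applied only once, to the final iteration.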

Phylogenetic Inference

We used a two-tiered approach to resolve the phylogenetic relationships in our sample. First, we estimated the overall phylogeny from a concatenated alignment of gene sequences using the brown rat (Rattus norvegicus) as an outgroup. For each targeted protein-coding gene, we extracted the longest protein-coding transcript sequence based on the UCSC genes track (retrieved through the UCSC Genome Browser) from each iterated pseudoreference and from the whole genome reference sequence for M. m. domesticus strain C57BL/6. For each species, exons were extracted and assembled into transcripts using custom code and the Biostrings package (Pagès et al. 2016) in R v3.1.3 (R Core Team 2015) and then combined into a multispecies alignment. We then used BioMart (Smedley et al. 2015) to identify one-to-one orthologous transcripts in R. norvegicus. Each set of transcripts was translation aligned using TranslatorX (Abascal et al. 2010) with the Muscle progressive alignment algorithm (Edgar 2004). Alignments without a length evenly divisible by three or possessing internal stop codons were discarded (5702 genes). With this filtered gene set, we then performed concatenated analyses by chromosome to simplify data processing and to verify internal consistency of analyses. All transcript sets from each chromosome were combined into a supermatrix using Phyutility v2.2.6 (Smith and Dunn 2008). A tree was estimated for each chromosome with the MPI version of RAxML v8.2.3 (Stamatakis et al. 2005; Stamatakis 2014) using a simultaneous maximum likelihood (ML) search and rapid bootstrapping run under the GTR + Γ model of sequence evolution (autoMRE option). Trees were visualized using FigTree v1.4.2 (http://tree.bio.ed.ac.uk/software/figtree; last accessed February 28, 2017). Among-chromosome topological discordance was assessed by rooting trees with rat and estimating pairwise Robinson-Foulds distances (Robinson and Foulds 1981) using the ape library (Paradis et al. 2004) in R. 
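To make the among-chromosome comparison concrete, the sketch below computes a Robinson-Foulds distance between two rooted trees written as nested tuples of tip names. It is an illustration only: the authors used the ape library in R, whereas this clade-based version is a simplified rooted variant, and the function names `clades` and `rf_distance` as well as the tip labels are hypothetical.

```python
def clades(tree):
    """Collect the non-trivial clades of a rooted tree given as nested tuples.

    Each clade is a frozenset of tip names. Single tips and the root clade
    are left out because they are shared by every tree on the same tips.
    """
    found = set()

    def tips(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)
        return members

    found.discard(tips(tree))
    return found


def rf_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two rooted trees."""
    return len(clades(tree_a) ^ clades(tree_b))


# Two toy topologies that disagree about which pair of tips is sister.
tree_1 = ((("A", "B"), "C"), "outgroup")
tree_2 = ((("A", "C"), "B"), "outgroup")
distance = rf_distance(tree_1, tree_2)
```

A distance of zero between every pair of per-chromosome trees corresponds to complete topological agreement among chromosomes.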
Second, we focused on finer-scale patterns of phylogenetic discordance. A phylogenetic tree assumes a series of bifurcating speciation events. However, the speciation process is not necessarily instantaneous and we expect some regions of the genome to show conflicting phylogenetic histories due to incomplete lineage sorting, hybridization, or undetected gene duplication. In phylogenetics, a distinction is made between the history of a locus (a “gene tree”) and the true relationship among lineages (a “species tree”; Maddison 1997). Several approaches have been developed to account for gene tree-species tree discordance under the multispecies coalescent (e.g., Edwards et al. 2007; Liu et al. 2009; Heled and Drummond 2010), yet many of these approaches are computationally intensive and thus less practical for genome-scale data sets. With these limitations in mind, we accounted for phylogenetic discordance in our data set using the computationally efficient species tree algorithm implemented in ASTRAL v4.10.11 (Mirarab et al. 2014; Mirarab and Warnow 2015; Sayyari and Mirarab 2016). Assuming sets of independent and accurately estimated gene trees, ASTRAL breaks each tree into its constituent quartets (i.e., four-taxa cases) and recovers a consistent estimate of the species tree. Resolution of individual targets or transcript genealogies may be limited in our study, given the low overall levels of coding divergence between our focal species. To increase local phylogenetic signals, we expanded our working data set to include 5'- or 3' untranslated regions (UTR) and all other regions targeted for capture. Though exome probes are usually contained within annotated exons, both the capture process itself and the iterative pseudoreference process allow for the discovery of variation in flanking regions. 
To incorporate this variation, we extended each target by 200 bp on both ends and merged regions that were up to 1 kbp apart, increasing the total data set from 54.3 to 163.4 Mbp. As above, we first used RAxML (GTR + Γ, 200 bootstrap replicates) to estimate an ML tree per chromosome by extracting extended targets using bedtools v2.25 and combining regions with AMAS (Borowiec 2016). For these data, no alignment is required because indel variation is not incorporated into the pseudoreference. We then repeated this procedure across autosomal windows of five different sizes (extended targets, 100 kbp, 500 kbp, 1 Mbp, and 5 Mbp), estimating ML phylogenies from each window using the fast hill-climbing algorithm in RAxML. Strong linkage disequilibrium typically extends 100 kbp or less within wild house mouse (M. musculus) populations (Laurie et al. 2007), suggesting that larger window sizes may combine regions with independent phylogenetic histories. Any window containing only missing data for at least one individual was discarded. For each window size, all trees were combined for species tree inference in ASTRAL. We also calculated among-locus phylogenetic discordance using the normalized quartet score, which quantifies the amount of quartet discordance relative to the species tree.
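The target-extension step can be sketched as a simple interval operation. The example below assumes half-open (start, end) coordinates on a single chromosome and assumes that the 1 kbp merging distance is measured between targets after they have been extended; the text does not specify either detail. The function name `extend_and_merge` is hypothetical, and the authors' own pipeline used bedtools rather than code like this.

```python
def extend_and_merge(targets, flank=200, max_gap=1000):
    """Extend each (start, end) target by `flank` bp and merge nearby targets.

    Targets whose extended coordinates lie within `max_gap` bp of each other
    (or overlap) are combined into a single region.
    """
    # Extend both ends, clamping the start at the beginning of the chromosome.
    extended = sorted((max(0, start - flank), end + flank) for start, end in targets)
    merged = []
    for start, end in extended:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


# The first two targets end up 700 bp apart after extension and are merged;
# the third is more than 1 kbp away and stays separate.
regions = extend_and_merge([(1000, 1200), (2300, 2500), (5000, 5100)])
```

Merging in this way turns clusters of neighboring exons into single, longer loci, which is one reason the working data set grows well beyond the sum of the original targets.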

Testing for Introgression

Motivated by recent studies that identified introgression between mouse lineages (e.g., Teeter et al. 2008; Keane et al. 2011; Yang et al. 2011; Staubach et al. 2012; Janoušek et al. 2015; Liu et al. 2015), we tested for signatures of introgression within and between taxa from the M. musculus group (here, M. m. musculus and M. m. domesticus), the M. spretus group (M. spretus, M. spicilegus, and M. macedonicus), and between M. cervicolor, M. cookii, and M. caroli. We used the D-statistic (i.e., Patterson’s D or the ABBA-BABA test) to characterize patterns among species (Green et al. 2010; Durand et al. 2011). Briefly, the D-statistic is a normalized difference of counts of two site patterns within a rooted four-taxa case: ABBA and BABA. ABBA counts indicate a sharing of alleles between the first taxon and a specified outgroup (A) and the second and third taxa (B), whereas the opposite is true for the BABA case. Significance was assessed using a chi-square test (see Pease and Hahn 2015), and 95% confidence intervals were estimated using a nonparametric bootstrap with 10,000 replicates. Additionally, when our sampling allowed, we estimated the minimum proportion of genomic admixture (f̂) following Durand et al. (2011). The D-statistic is relatively robust to genotyping error (Green et al. 2010; Durand et al. 2011), but could be sensitive to inherent differences in the source and quality of the exome data relative to the reference genome (domC57BL/6). Therefore, we limited our comparisons to sequenced exomes except when directly testing for differential introgression between M. m. musculus and the two available M. m. domesticus genotypes (domLEWES, domC57BL/6).
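As an illustration of the site-pattern counting described above, the sketch below tallies ABBA and BABA patterns from four aligned haploid sequences and computes D = (ABBA − BABA) / (ABBA + BABA). It is a minimal sketch, not the authors' implementation: the function names are hypothetical, the outgroup allele is assumed to be ancestral, and the chi-square test, bootstrap intervals, and admixture proportion are not included.

```python
def abba_baba_counts(p1, p2, p3, outgroup):
    """Count ABBA and BABA site patterns across four aligned sequences.

    The outgroup allele is treated as ancestral (A). Sites containing
    anything other than A, C, G, or T in any sequence are skipped.
    """
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if any(base not in "ACGT" for base in (a, b, c, o)):
            continue
        if a == o and b == c and b != o:
            abba += 1      # P2 and P3 share the derived allele
        elif b == o and a == c and a != o:
            baba += 1      # P1 and P3 share the derived allele
    return abba, baba


def d_statistic(abba, baba):
    """Normalized difference of ABBA and BABA counts."""
    total = abba + baba
    if total == 0:
        raise ValueError("no informative ABBA or BABA sites")
    return (abba - baba) / total


# Toy alignment with two ABBA sites and one BABA site.
counts = abba_baba_counts("AAGTT", "CAGGC", "CAGGT", "AAGTC")
d_value = d_statistic(*counts)
```

Under incomplete lineage sorting alone the two patterns are expected to be equally frequent, so D values that differ significantly from zero are taken as evidence of excess allele sharing between P3 and one of the other two ingroup taxa.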

Results

Efficient Targeted Recovery of Mus Whole Exomes

Multiplex exome capture was successful across all samples. Sequencing efforts produced an average of ∼22 million reads per sample with an average of 1.1% of targets showing no coverage. Given a combined target size of ∼2% of the genome, this represents targeted recovery of 53.8 Mbp of sequence data (table 1) including most annotated genic regions in the mouse genome. Approximately 75% of raw reads were unique, resulting in average target coverage of 30× across samples (range: 20.6–39.3×) with ∼80% of targeted bases sequenced to at least 10× coverage (table 1).

Table 1

Exome Sequencing Coverage across Species

Sample	Total Reads	Bases on-Target	Target Coverage	% Low Quality Bases	% Target Bases ≥ 10×
M. caroli (ref)	18,923,169	1,423,165,223	26.2	8.1	78.2
M. caroli (5)	—	1,435,937,326	26.4	6.9	78.7
M. cervicolor (ref)	29,711,989	2,119,138,428	39.0	8.0	87.0
M. cervicolor (5)	—	2,133,497,561	39.3	6.9	87.5
M. cookii (ref)	29,089,576	2,119,862,202	39.0	8.1	86.5
M. cookii (5)	—	2,134,097,222	39.3	7.1	87.0
M. macedonicus (ref)	17,428,555	1,233,646,735	22.7	6.1	79.4
M. macedonicus (5)	—	1,239,060,658	22.8	5.4	79.7
M. minutoides (ref)	23,340,200	1,703,586,451	31.3	7.8	77.0
M. minutoides (5)	—	1,733,871,618	31.9	6.2	78.2
M. pahari (ref)	17,033,748	1,247,810,134	23.0	9.5	68.7
M. pahari (5)	—	1,273,913,560	23.4	7.7	70.0
M. platythrix (ref)	22,058,259	1,734,397,262	31.9	9.4	78.6
M. platythrix (5)	—	1,756,287,920	32.3	8.2	79.5
M. spicilegus (ref)	14,814,946	1,120,139,640	20.6	8.2	73.2
M. spicilegus (5)	—	1,124,893,538	20.7	7.6	73.5
M. spretus (ref)	16,200,749	1,157,756,368	21.3	6.9	73.6
M. spretus (5)	—	1,163,128,639	21.4	6.2	73.9
M. m. domesticus LEWES (ref)	25,565,922	1,920,922,154	35.3	6.2	87.8
M. m. domesticus LEWES (5)	—	1,921,986,302	35.4	6.1	87.8
M. m. musculus CZECHII (ref)	24,773,619	1,946,919,016	35.8	7.2	86.5
M. m. musculus CZECHII (5)	—	1,949,472,117	35.9	6.9	86.6
M. m. musculus PWK (ref)	22,785,276	1,652,768,831	30.4	6.3	85.4
M. m. musculus PWK (5)	—	1,655,627,360	30.5	6.0	85.5

Note.—Exome sequencing coverage for each species when mapped to the mouse reference genome (ref) or a species-specific pseudoreference after five rounds of iterative mapping (5). Shown are the total reads per library after cleaning (Total Reads), the number of bases in regions targeted by the capture (Bases on-Target), average coverage per target (Target Coverage), the percentage of bases in reads mapped with a MAPQ of zero (% Low Quality Bases), and the percentage of bases in targeted regions with at least 10× coverage (% Target Bases ≥ 10×).

Evaluation of Iterative Pseudoreference Generation

To assess the performance of the iterative approach, we compared the same set of cleaned reads mapped to the mouse reference and to five-iteration pseudoreferences for each species (table 1). In all cases, mapping to a five-iteration pseudoreference resulted in minor increases in the coverage of targeted bases (e.g., 23.0–23.4× in M. pahari) and the percentage of targeted bases recovered at a given depth (e.g., +1.3% for targets with at least 10× coverage in M. pahari; table 1). In addition, reads were more confidently placed with each pseudoreference, resulting in an increase in usable bases and fewer reads discarded due to low mapping quality, as evidenced across iterations for the M. pahari exome (supplementary material table S2, Supplementary Material online).

In addition to modest increases in overall coverage, pseudoreference construction should also help mitigate systematic biases in standard descriptive statistics when mapping to a distantly related reference genome. To test this, we calculated the per-site divergence for targeted bases on Chromosome 1 (i.e., the number of homozygous alternative calls relative to the C57BL/6 reference divided by the total number of confidently genotyped sites) at each iteration for three samples (M. m. domesticus [domLEWES], M. spretus, and M. pahari) of increasing evolutionary distance from the reference. Divergence estimates were notably higher in all three species with a five-iteration pseudoreference than when mapping directly to the mouse genome (fig. 1). Increases in per-site divergence were lowest for M. m. domesticus (domLEWES, 0.19% vs. 0.22%; fig. 1), and highest for the most distantly related lineage in our study, M. pahari (3.34% vs. 4.24%). In all cases, the most dramatic change was observed after mapping to the first estimated pseudoreference (i.e., iteration 2) and appeared to reach an asymptote by the fourth iteration. However, the relative magnitude of change scaled with divergence (fig. 1), assuming that incremental increases reflect divergence estimates asymptotically approaching their true value. These results indicate that the number of iterations required to mitigate biases will be contingent on the divergence levels between sample(s) and reference(s) in a given study.
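The divergence measure and the bias comparison described above reduce to two small calculations, sketched here. The function names are hypothetical, and the under-estimation formula (difference from the five-iteration estimate, expressed as a percentage of that estimate) is an assumption about how such a bias would be computed rather than a statement of the authors' code. The inputs are the rounded percentages quoted in the text, so the results are approximate.

```python
def per_site_divergence(homozygous_alt_calls, confident_sites):
    """Homozygous alternative calls divided by confidently genotyped sites."""
    return homozygous_alt_calls / confident_sites


def percent_underestimation(divergence_at_iteration, divergence_final):
    """Bias of an earlier estimate relative to the final pseudoreference estimate."""
    return 100.0 * (divergence_final - divergence_at_iteration) / divergence_final


# Rounded per-site divergence values (%) quoted in the text:
# mouse reference versus five-iteration pseudoreference.
bias_pahari = percent_underestimation(3.34, 4.24)     # roughly 21%
bias_domlewes = percent_underestimation(0.19, 0.22)   # roughly 14%
```

With these rounded inputs, mapping directly to the mouse reference underestimates divergence by about a fifth in M. pahari and by a smaller fraction in domLEWES.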

Fig. 1. Reference bias and sequence divergence. (A) Per-site sequence divergence per iteration using confidently called positions on Chromosome 1 for M. m. domesticus (domLEWES), M. spretus, and M. pahari. (B) The bias in divergence estimates (% under-estimation) at each iteration relative to the per-site divergence of the sample’s five iteration pseudoreference using the same data. (C) Per-site divergence for M. pahari partitioned by protein-coding sequence (CDS), untranslated exonic regions (5'/3'-UTR) and flanking sequences.

The impact of pseudoreference construction on estimates of sequence divergence should also be apparent within a genome, across sites that vary in levels of functional constraint, for example. To test this, we classified all confidently called sites in M. pahari as belonging to protein-coding exon sequence, 5'- or 3'-UTRs, or flanking regions (introns or intergenic). We observe the same trends, with the most dramatic changes in per-site divergence detected in the less constrained flanking regions, followed by UTRs and protein coding domains (fig. 1). Finally, we looked at the number and quality of variants called for M. m. domesticus (domLEWES), M. spretus, and M. pahari using the mouse reference and a five-iteration exome pseudoreference (Chromosome 1). We used HaplotypeCaller in the GATK (with --emitRefConfidence BP_RESOLUTION) to return genotype calls at each position and applied common quality filters to each set (as above). Although the number of confidently called sites decreased with divergence from the mouse reference genome, the total number of confidently called sites relative to the first iteration increased (table 2). Intuitively, we would also expect that genotype qualities should tend to increase in the context of pseudoreferences.
Consistent with this, we observed a positive skew in genotype qualities for all three species at positions that were confidently called relative to the mouse reference genome and the final pseudoreference (fig. 2). However, we also observed many sites where the genotype quality decreased, frequently reflecting the loss of reads at a position due to being more confidently placed elsewhere after iteration. We also observe cases where sites called as homozygous reference or alternative relative to the mouse reference are called heterozygous (and vice versa) due to the placement of reads with alternate alleles at a given site.

Table 2

Confidently Called Genotypes on Chromosome 1 Using the Mouse Reference Genome and a Five-Iteration Pseudoreference

Species	Genotypes, Mouse Reference	Genotypes, Five-Iteration Pseudoreference	Δ Genotypes Called	% Increase
M. m. domesticus (domLEWES)	4,199,062	4,204,382	5,320	0.13
M. spretus	3,298,609	3,332,638	34,029	1.03
M. pahari	2,717,373	2,791,970	74,597	2.75

Note.—Confidently called genotypes on Chromosome 1 using the mouse reference genome and a five-iteration pseudoreference. M. m. domesticus is most closely related to the mouse reference genome (a mosaic lab strain primarily of M. m. domesticus origin), followed by M. spretus and M. pahari.

Fig. 2. Differences in genotype qualities at shared positions called using a five-iteration pseudoreference or the mouse reference genome. Positive values reflect higher genotype qualities in the five-iteration pseudoreference (shown in gray). Positions with no change in genotype qualities are excluded to improve visualization. In each case, the distributions are skewed positively, indicating a trend towards more confident genotype calls in the pseudoreference.

Resolving the Mus Phylogeny

We first estimated a phylogeny for each chromosome based on concatenation of protein-coding transcripts. After filtering, this data set consisted of 15,620 aligned transcripts (28.2 Mbp) with one-to-one orthologs in rat. RAxML produced the same fully resolved tree for all chromosomes with 100% bootstrap support for each bipartition (supplementary material fig. S1, Supplementary Material online). There was no topological discordance among chromosomes (Robinson-Foulds distances equal to zero). Additionally, there was no discordance among trees estimated using sets of transcripts without Rattus (26,624 transcripts with a total length of 43.7 Mbp, analysis not shown), and all trees were resolved with 100% bootstrap support. These analyses also confirmed that M. pahari is an outgroup relative to the other sequenced species based on the rooted phylogeny (supplementary material fig. S1, Supplementary Material online). We then repeated this procedure for an expanded data set including all targeted and flanking regions in mice (and excluding rats), and found the same general results of a fully resolved concatenated phylogeny with no discordance among chromosomes (fig. 3). Notably, these concatenated phylogenies resolve M. spretus/spicilegus/macedonicus and M. caroli/cookii/cervicolor as monophyletic groups with M. spretus and M. caroli placed as the basal lineages within each.

Fig. 3. Mus phylogeny, rooted on M. pahari, estimated using all extended targets from Chromosome 1. ML bootstrap support values are listed above branches. There was no discordance between this tree and trees estimated from other chromosomes. The inset provides the normalized quartet scores calculated with ASTRAL from local genealogies estimated at five genomic scales.

Using concatenation to resolve a phylogeny effectively averages over fine-scale discordance, which can inflate confidence in the overall tree and obscure important sources of incongruence (Hahn and Nakhleh 2016). Therefore, we also used a species-tree approach to quantify fine-scale topological discordance. To do this, we first estimated individual ML genealogies using all extended targets with data partitioned into five window sizes: 100,531 extended targets (mean alignment length = 1,622 bp; parsimony informative sites per target: mean = 26.8, median = 15.0); 13,628 100 kbp intervals (mean alignment length = 10,621 bp); 4,036 500 kbp intervals (37,322 bp); 2,665 1 Mbp intervals (66,743 bp); and 511 5 Mbp intervals (291,249 bp). We then used these trees to estimate species trees while accounting for among-locus topological discordance. We detected no appreciable discordance in the point-estimate of the species tree (rooted on M. pahari; fig. 3) when compared with the per-chromosome concatenated trees at the 100 kbp, 500 kbp, 1 Mbp, and 5 Mbp scales (fig. 4). However, quartet support for some branches did vary by window size, and there was discordance at the target-level scale (fig. 4). The lowest support was found for branches defining the M. pahari-platythrix-minutoides group at the base of the tree, suggesting some uncertainty in the placement of these deep nodes. Indeed, M. platythrix and M. pahari share a common ancestor in the species tree estimated using the extended targets, contrary to all other analyses. Only 36% of quartets support this clade, and the branch is extremely short.
We also observed some variation in support levels within other groups. For example, although the M. spretus-spicilegus-macedonicus clade itself was well supported across most analyses, only 43% of quartets support the species tree designation of this clade at the level of targets (fig. 4). Support steadily increased to 59% at the 100 kbp scale, 74% at the 500 kbp scale, 82% at the 1 Mbp scale, and 96% at the 5 Mbp scale. Thus, there is some fine-scale discordance in this group of interest, but the overall species tree generally shows more support than alternative phylogenies. Likewise, support for the M. caroli-cookii-cervicolor clade started at 60% at the target scale and reached 100% at the 5 Mbp scale. Normalized quartet scores suggest ∼80% of all quartets support the species tree at the target scale, and this increased to ∼99% at the 5 Mbp scale. Considering all analyses, the phylogeny for these taxa appears reasonably well resolved with relatively low levels of topological discordance, at least at the scales that can be reasonably evaluated with our exome data.
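To make the quartet-based support values above concrete, the sketch below computes a normalized quartet score as the fraction of four-taxon subtrees in a set of local genealogies that match the species tree. It is a toy illustration and not a reimplementation of ASTRAL. The nested-tuple tree encoding, the function names, and the example trees are assumptions, and every genealogy is assumed to contain all taxa.

```python
from itertools import combinations


def collect_clades(tree, out):
    """Return the leaf set under `tree`; append every internal clade to `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(collect_clades(child, out) for child in tree))
    out.append(leaves)
    return leaves


def quartet_topology(clades, quartet):
    """Return the two-versus-two split a tree induces on four taxa, or None if unresolved."""
    four = frozenset(quartet)
    for clade in clades:
        inside = clade & four
        if len(inside) == 2:  # this branch separates two of the taxa from the other two
            return frozenset([inside, four - inside])
    return None


def normalized_quartet_score(species_tree, gene_trees):
    """Fraction of resolved gene-tree quartets that agree with the species tree."""
    species_clades = []
    taxa = sorted(collect_clades(species_tree, species_clades))
    matched = total = 0
    for gene_tree in gene_trees:
        gene_clades = []
        collect_clades(gene_tree, gene_clades)
        for quartet in combinations(taxa, 4):
            topology = quartet_topology(gene_clades, quartet)
            if topology is None:
                continue  # unresolved quartets carry no information
            total += 1
            matched += topology == quartet_topology(species_clades, quartet)
    return matched / total


# Hypothetical toy data: one concordant and one discordant local genealogy.
species_tree = ((("A", "B"), "C"), ("D", "E"))
gene_trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
]
score = normalized_quartet_score(species_tree, gene_trees)
print(f"{score:.2f}")
```

In this toy case the discordant genealogy still agrees with the species tree for three of its five quartets, which shows how a high overall quartet score can coexist with lower support for an individual branch.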

Unrooted species tree estimates from ASTRAL across four different window sizes (5 Mbp not shown). Branches are annotated with their local quartet scores.

Introgression

We detected genotype asymmetries consistent with significant autosomal introgression between M. m. domesticus and M. m. musculus. We also detected some evidence for significant introgression between M. cookii and M. caroli. We did not detect autosomal introgression in other cases, including between lineages of the M. spretus group (M. macedonicus, M. spicilegus, and M. spretus) (fig. 5; supplementary material table S3, Supplementary Material online). Patterns of between-lineage allele sharing were variable among strains within M. m. domesticus and M. m. musculus, consistent with the notion of differential introgression due to recent gene flow (Yang et al. 2011). Our sampling allows us to estimate the minimum admixture proportion for a few of these instances. We estimated that ∼7% of the genomes of musPWK (6.6%) and domC57BL/6 (6.8%) descend from introgression between M. m. musculus and M. m. domesticus.

Introgression among taxa inferred from the ABBA-BABA test. The x axis identifies the three taxa examined for signatures of autosomal introgression using the D statistic where positive values reflect an excess of ABBA sites. CERV = M. cervicolor, COOK = M. cookii, CARO = M. caroli, MACE = M. macedonicus, SPICI = M. spicilegus, SPRET = M. spretus, DOM = M. m. domesticus (domLEWES), and MUS = M. m. musculus (musCZECHII) unless otherwise noted. CARO was used as the outgroup in all comparisons except for ((CERV, COOK) CARO), which used M. pahari (shown) or M. m. musculus (musCZECHII). Black circles and boldface taxa indicate significant deviations from zero (χ2 test; corrected P value < 0.01; see details in supplementary material table S3, Supplementary Material online).
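The D statistic referred to above can be illustrated with a minimal sketch that counts ABBA and BABA site patterns in four aligned haploid sequences and computes D = (ABBA − BABA) / (ABBA + BABA). This is a simplification under stated assumptions: the function name and toy sequences are hypothetical, the sketch works on single sequences rather than the genotype data analyzed in the study, and it omits the significance test.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Count ABBA/BABA patterns for the tree (((P1, P2), P3), outgroup) and return D.

    The outgroup allele is treated as ancestral (A) and the other allele as derived (B).
    Returns (n_abba, n_baba, d), with d set to None when no informative sites exist.
    """
    if not (len(p1) == len(p2) == len(p3) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    n_abba = n_baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        alleles = {a, b, c, o}
        # Keep only biallelic sites made of unambiguous bases.
        if len(alleles) != 2 or not alleles <= set("ACGT"):
            continue
        if a == o and b == c:
            n_abba += 1  # P2 and P3 share the derived allele
        elif b == o and a == c:
            n_baba += 1  # P1 and P3 share the derived allele
    informative = n_abba + n_baba
    d = (n_abba - n_baba) / informative if informative else None
    return n_abba, n_baba, d


# Hypothetical toy alignment: three ABBA sites, one BABA site, one invariant site.
p1 = "AAGAC"
p2 = "GGAAT"
p3 = "GGGAT"
outgroup = "AAAAC"
print(d_statistic(p1, p2, p3, outgroup))  # (3, 1, 0.5)
```

A positive value, as in this toy case, reflects an excess of ABBA sites, meaning that P2 shares more derived alleles with P3 than P1 does.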

Discussion

Genomic data sets are now commonplace in model and nonmodel systems. However, using a divergent reference genome to analyze genomic data sets can introduce reference biases that can affect biological inferences. To help address this outstanding issue, we developed a scalable pseudoreference approach to iteratively incorporate sample-specific variation into an established reference. Additionally, we describe the first targeted sequencing effort of complete exomes for approximately one-third of described Mus species diversity. Using these data, we resolve the phylogenetic relationships between these mouse species and describe patterns of introgression among lineages. Our analyses demonstrate that targeted exome sequencing is useful for both of these tasks and provides a proof-of-concept for similar analyses in other systems. More generally, our pseudoreference framework alleviates mapping biases that can lead to systematic underestimates in divergence and related statistics, providing a useful tool for comparative genomic analyses. Below, we discuss the general utility and limitations of our approach as well as the specific insights of our data to mouse evolution.

Exome Capture and Pseudoreference Construction

Ongoing work will continue to generate assembled and annotated reference genomes for many species of interest. However, high-quality reference genomes, which are critical to mapping reads generated from high throughput sequencing technologies, are still relatively scarce (Ellegren 2014). We were able to capture whole exomes across ten species spanning ∼7.5 Myr of divergence (Schenk et al. 2013). Given the strong and comparable performance across all species, we anticipate that this capture approach would be effective over deeper evolutionary timescales. In addition to basic phylogenetic insights, our approach could also be used to generate comparative genomic data for in-depth analyses of molecular evolution over moderate timescales or as a supplement to lower-coverage whole genome data. Other studies have shown that targeted capture can be used to recover exome data over a broad range of evolutionary timescales (Vallender 2011; Bi et al. 2012; Jin et al. 2012; Hedtke et al. 2013), though the integration of such data into a well-annotated reference genome had not been explored. Our transspecific capture and iterative pseudoreference approach leveraged the benefits of the mouse reference, including position and annotation information, while mitigating the confounding effects of reference bias. Even among closely related species, we demonstrated that reference bias can have a strong impact on estimation of basic parameters, such as genetic divergence (fig. 1) and genotype quality (fig. 2). These simple comparisons illustrate that while reference-based genotyping is sensitive to divergence, the iterative pseudoreference approach reduces these biases over moderate levels of sequence divergence. Pseudoreferences, therefore, should generally increase the quality of and confidence in downstream analyses through incorporating additional reads and placing them with greater confidence. 
Implementation of our approach is straightforward (with the pseudo-it package), requires the same set of resources as standard mapping and variant calling, and preserves the coordinate system of the original reference. An alternative approach would be to de novo assemble targeted regions within each species (e.g., Bi et al. 2012). Whereas mapping to contigs assembled de novo is not expected to introduce reference bias, assembly requires substantially more computational power and results in a new coordinate system that needs to be linked between species. Any hard-earned empirical or computational annotation afforded by a reference would also need to be reestablished. Several other approaches have been developed to combine sets of loci into workable references that can be used to call variants (e.g., PRGmatic; Hird et al. 2011), but are not iterative and cluster regions based on overall similarity. Recent studies using restriction-site associated DNA sequencing (RAD-seq) have shown that overall data quality is considerably higher when using a reference (Fountain et al. 2016; Shafer et al. 2016). Mapping of reads followed by de novo assembly would also be expected to reduce mapping bias and genotyping errors but consumes substantially more resources (Gan et al. 2011; Hunter et al. 2015). It is possible to obtain genomic coordinates of contig sets assembled de novo by aligning to a reference. However, such an approach is computationally demanding. For example, de Bruijn graph-based assembly would need to be performed under a range of k-mer values and clustered, and each assembly is both CPU and memory intensive. Furthermore, de novo assemblies from transcriptomic or capture data sets are often highly fragmented (e.g., Bi et al. 2012), leading to additional complications. We have shown that studies lacking species-specific references may benefit from an iterative approach, provided that a reference genome exists within a moderate evolutionary distance. 
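The core idea of the iterative approach can be sketched as follows: single-nucleotide differences are substituted into the reference, reads are remapped to the updated sequence, and the cycle repeats while the coordinate system stays fixed. This is a toy illustration and not the pseudo-it implementation. The function names, the `call_variants` callback (a stand-in for a full read mapping and genotyping step), and the iteration cap are all assumptions.

```python
def apply_snvs(reference, variants):
    """Substitute single-nucleotide variants into `reference` without changing its length.

    `variants` is a list of (zero_based_position, ref_allele, alt_allele) tuples.
    Indels are skipped so that every coordinate of the original reference is preserved.
    """
    sequence = list(reference)
    for position, ref_allele, alt_allele in variants:
        if len(ref_allele) != 1 or len(alt_allele) != 1:
            continue  # indel: ignored to keep the coordinate system intact
        if sequence[position] != ref_allele:
            raise ValueError(f"reference allele mismatch at position {position}")
        sequence[position] = alt_allele
    return "".join(sequence)


def build_pseudoreference(reference, call_variants, max_iterations=4):
    """Repeatedly call variants against the current sequence and inject them.

    `call_variants` is a user-supplied function (hypothetical here) that maps reads
    to the sequence it is given and returns newly detected variants.
    """
    current = reference
    for _ in range(max_iterations):
        variants = call_variants(current)
        if not variants:
            break  # nothing new to incorporate
        current = apply_snvs(current, variants)
    return current


def toy_caller(sample):
    """Toy stand-in for mapping and genotyping: reports one difference per round."""
    def call(current):
        differences = [
            (i, c, s) for i, (c, s) in enumerate(zip(current, sample)) if c != s
        ]
        return differences[:1]
    return call


reference = "AAAAAAAA"
sample = "AACAAGAA"
pseudoreference = build_pseudoreference(reference, toy_caller(sample))
print(pseudoreference, len(pseudoreference) == len(reference))
```

Because only substitutions are applied, any annotation indexed against the original reference remains valid for the pseudoreference, which is the property emphasized above.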
We also illustrated that the use of pseudoreferences can be combined with exome capture to resolve a species-level phylogeny (figs. 3 and 4) and to describe patterns of introgression (fig. 5). A resolved phylogeny is important for answering a variety of questions in evolutionary biology, including estimating speciation rates (e.g., Nee 2001), inferring rates of morphological evolution (e.g., Pennell and Harmon 2013), and characterizing patterns of molecular evolution (e.g., Zhang et al. 2005). In addition to exonic sequences, noncoding regions (e.g., introns and intergenic regions) may also be targeted for capture or recovered through anonymous partitioning approaches. Given that they tend to evolve more quickly than exons (fig. 1), these noncoding regions would aid in resolving relationships among closely related taxa, inferring rates of evolution in concert with the phylogeny, or investigating finer-scale patterns of phylogenetic discordance. Because reference mapping biases scale with divergence (fig. 1), iterative mapping is likely to be particularly useful when analyzing more rapidly evolving nongenic regions. Many of the questions listed above require that the tree be ultrametric (i.e., scaled relative to time), but it is still computationally intractable to estimate an ultrametric species tree with genome-scale data using Bayesian methods. To address this, others have recommended restricting analyses of whole genome data to the most informative regions or combining regions with similar underlying topologies (e.g., Jarvis et al. 2015; Mirarab et al. 2015). Given the need to subset WGS data, partitioned comparative data sets are obviously well suited for this general approach (though whole exome data would still likely need to be subsampled).
Using a reduced data set, it should be possible to fix the topology to the estimated species tree and use Bayesian approaches to estimate a substitution rate scaled relative to time (and scaled relative to absolute time if fossil calibrations are used). Fixing the tree eliminates one of the most computationally intensive parts of likelihood-based phylogenetic estimation, the recalculation of the likelihood after topological rearrangement, and would facilitate an analysis using many loci for a more accurate calculation of the substitution rate per unit time.

Our iterative approach is not without important limitations. For example, we did not take indel variation into account when iterating our pseudoreferences in order to maintain a consistent coordinate system across many species. Due to the deleterious effect of frameshift mutations, indels tend to be rare in protein coding regions and we chose to ignore them within our study. Others have incorporated indel information within pseudogenomes (Holt et al. 2013; Huang et al. 2013, 2014), though these studies were focused on pairwise contrasts between very closely related genomes and did not use iteration. Extending the pseudoreference approach to efficiently incorporate small-scale indels across a phylogenetic sample remains an important goal for future studies; however, this reference-based framework will always be limiting with respect to larger-scale structural variation (e.g., chromosomal translocations and inversions). Thus, the approach outlined here will be most useful for generating comparative evolutionary genomic data sets of orthologous loci that can be used for phylogenetic and population genomic inferences. The relevance of such reference-based comparative studies should continue to grow as high quality reference genomes become increasingly common across the tree of life.

Mus Phylogenomics

Previous studies of Mus systematics lacked several lineages included in this study or were uncertain with respect to the branching order within certain clades. In particular, the relationships between M. spretus/spicilegus/macedonicus and M. caroli/cookii/cervicolor have remained unclear (e.g., Lundrigan et al. 2002; Tucker et al. 2005; Tucker 2008). For example, there was conflicting evidence about the relationships among M. spretus, M. spicilegus, and M. macedonicus and the placement of each relative to the M. musculus species group. Our analyses resolved this group as monophyletic and resolved the phylogenetic relationships among all ten species (fig. 3). Though discordance among M. spretus, M. spicilegus, and M. macedonicus is appreciable at the scale of extended targets, a majority of quartets still support the species relationships inferred from all other data sets (fig. 4). Overall, the species phylogeny is relatively well supported even when accounting for among-locus phylogenetic discordance (fig. 4). This information is crucial for effectively designing genomic or functional genetic experiments in house mice that require comparisons to closely related species. Moreover, we note that the phylogenetic relationships that we recovered were robust across individual genealogies estimated at different local scales (fig. 4) and when considering targets from smaller subsets of the whole exome capture (e.g., by chromosome). However, in one case, using extended targets alone for species tree analysis transposed the relationships at the base of the tree on a short branch with low quartet support, presumably reflecting a lack of informative sites in the alignments. These patterns suggest that the same general phylogenetic conclusions would have been apparent using a much smaller set of targeted loci as long as enough phylogenetically informative sites are present to confidently resolve relationships.

We also detected introgression between some mouse lineages.
These results were not unexpected and are in strong agreement with other studies investigating whole-genome ancestry among mouse strains. Classic inbred strains derive from early breeding efforts of mouse fanciers (Beck et al. 2000), which included some crosses between species and subspecies (Ferris et al. 1982; Tucker et al. 1992; Ideraabdullah et al. 2004). The mosaic nature of classic inbred strains of mice is well known (Bonhomme et al. 1987; Yalcin et al. 2010; Didion and Pardo-Manuel De Villena 2013), but the extent of introgression has been the subject of some debate. Using a genome-wide SNP genotyping platform, Yang et al. (2011) estimated that M. m. domesticus strain C57BL/6 has a genome composed of ∼93% M. m. domesticus and ∼7% M. m. musculus. Our estimate of 6.8% is in close agreement with their inferences, indicating that levels of introgression within sequenced genic regions are similar to genome-wide patterns based on SNVs and that variable ascertainment schemes used to populate the Mouse Diversity Genotyping Array (Yang et al. 2009) do not appear to bias overall signatures of gene flow. Additionally, we detected a strong signature of M. m. domesticus introgression into the wild-derived M. m. musculus strain PWK/PhJ, consistent with Yang et al. (2011) (fig. 5, supplementary material table S3, Supplementary Material online). The context of introgression involving this and some other wild-derived strains remains unclear. For PWK/PhJ, this could reflect natural gene flow as this strain was derived from the Czech Republic near the natural hybrid zone between M. m. domesticus and M. m. musculus. However, it has also been suggested that the haplotype structures of introgressed regions in this and a few other wild-derived inbred strains are consistent with very recent gene flow, perhaps occurring in the laboratory subsequent to strain derivation (Yang et al. 2011). Liu et al. (2015) found 0.02–0.8% M. spretus ancestry within M. m. domesticus, and other studies have described natural introgression between these taxa (Orth et al. 2002; Keane et al. 2011; Song et al. 2011). We did not detect appreciable introgression between M. m. domesticus (domLEWES) and M. spretus, suggesting that the extent of introgression between these species is variable among individuals. As expected, we did not detect introgression between M. spicilegus or M. macedonicus and the currently allopatric M. spretus. However, we did detect introgression between M. cookii and M. caroli. These species are broadly distributed throughout Eastern and Southeastern Asia and can cooccur in the same localities along with M. cervicolor (Suzuki and Aplin 2012). Introgression between these lineages, therefore, is not unexpected. Interestingly, these findings further support the notion that association with humans may contribute to hybridization between Mus species. Evidence for natural introgression within Mus includes cases of secondary contact following human-associated range expansions of M. spretus and M. musculus and between various M. musculus lineages (Palomo et al. 2009; Jones et al. 2010; Gabriel et al. 2010; Bonhomme et al. 2011; Song et al. 2011; Suzuki et al. 2013). Additionally, while the historical relationships of M. cookii and M. caroli and humans are less clear, both species (along with M. cervicolor) primarily occur in rice fields and nearby areas (Suzuki and Aplin 2012). This suggests that contact between lineages, and subsequent hybridization, may have been facilitated by agricultural development. Collectively, our results suggest that exome capture approaches may provide a powerful tool to reliably investigate finer-scale patterns of introgression among Mus species.

Supplementary Material

Supplementary data are available at Genome Biology and Evolution online.

80 in total

Review 1. A beginner's guide to eukaryotic genome annotation.

Authors: Mark Yandell; Daniel Ence
Journal: Nat Rev Genet Date: 2012-04-18 Impact factor: 53.242

2. Estimating species phylogenies using coalescence times among sequences.

Authors: Liang Liu; Lili Yu; Dennis K Pearl; Scott V Edwards
Journal: Syst Biol Date: 2009-07-16 Impact factor: 15.683

3. Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level.

Authors: Jianzhi Zhang; Rasmus Nielsen; Ziheng Yang
Journal: Mol Biol Evol Date: 2005-08-17 Impact factor: 16.240

4. Allelic mapping bias in RNA-sequencing is not a major confounder in eQTL studies.

Authors: Nikolaos I Panousis; Maria Gutierrez-Arcelus; Emmanouil T Dermitzakis; Tuuli Lappalainen
Journal: Genome Biol Date: 2014-09-20 Impact factor: 13.583

5. Widespread over-expression of the X chromosome in sterile F₁hybrid mice.

Authors: Jeffrey M Good; Thomas Giger; Matthew D Dean; Michael W Nachman
Journal: PLoS Genet Date: 2010-09-30 Impact factor: 5.917

6. Phylogenomic analyses data of the avian phylogenomics project.

Authors: Erich D Jarvis; Siavash Mirarab; Andre J Aberer; Bo Li; Peter Houde; Cai Li; Simon Y W Ho; Brant C Faircloth; Benoit Nabholz; Jason T Howard; Alexander Suh; Claudia C Weber; Rute R da Fonseca; Alonzo Alfaro-Núñez; Nitish Narula; Liang Liu; Dave Burt; Hans Ellegren; Scott V Edwards; Alexandros Stamatakis; David P Mindell; Joel Cracraft; Edward L Braun; Tandy Warnow; Wang Jun; M Thomas Pius Gilbert; Guojie Zhang
Journal: Gigascience Date: 2015-02-12 Impact factor: 6.524

7. Multiple reference genomes and transcriptomes for Arabidopsis thaliana.

Authors: Xiangchao Gan; Oliver Stegle; Jonas Behr; Joshua G Steffen; Philipp Drewe; Katie L Hildebrand; Rune Lyngsoe; Sebastian J Schultheiss; Edward J Osborne; Vipin T Sreedharan; André Kahles; Regina Bohnert; Géraldine Jean; Paul Derwent; Paul Kersey; Eric J Belfield; Nicholas P Harberd; Eric Kemen; Christopher Toomajian; Paula X Kover; Richard M Clark; Gunnar Rätsch; Richard Mott
Journal: Nature Date: 2011-08-28 Impact factor: 49.962

8. AMAS: a fast tool for alignment manipulation and computing of summary statistics.

Authors: Marek L Borowiec
Journal: PeerJ Date: 2016-01-28 Impact factor: 2.984

9. Transcriptome-based exon capture enables highly cost-effective comparative genomic data collection at moderate evolutionary scales.

Authors: Ke Bi; Dan Vanderpool; Sonal Singhal; Tyler Linderoth; Craig Moritz; Jeffrey M Good
Journal: BMC Genomics Date: 2012-08-17 Impact factor: 3.969

10. An effort to use human-based exome capture methods to analyze chimpanzee and macaque exomes.

Authors: Xin Jin; Mingze He; Betsy Ferguson; Yuhuan Meng; Limei Ouyang; Jingjing Ren; Thomas Mailund; Fei Sun; Liangdan Sun; Juan Shen; Min Zhuo; Li Song; Jufang Wang; Fei Ling; Yuqi Zhu; Christina Hvilsom; Hans Siegismund; Xiaoming Liu; Zhuolin Gong; Fang Ji; Xinzhong Wang; Boqing Liu; Yu Zhang; Jianguo Hou; Jing Wang; Hua Zhao; Yanyi Wang; Xiaodong Fang; Guojie Zhang; Jian Wang; Xuejun Zhang; Mikkel H Schierup; Hongli Du; Jun Wang; Xiaoning Wang
Journal: PLoS One Date: 2012-07-27 Impact factor: 3.240
