Venkat Talla1, Faheema Kalsoom1, Daria Shipilina2, Irina Marova2, Niclas Backström3. 1. Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, 752 36, Sweden. 2. Department of Vertebrate Zoology, Lomonosov Moscow State University, 119991, Russia. 3. Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, 752 36, Sweden niclas.backstrom@ebc.uu.se.
Abstract
Identification of candidate genes for trait variation in diverging lineages and characterization of mechanistic underpinnings of genome differentiation are key steps toward understanding the processes underlying the formation of new species. Hybrid zones provide a valuable resource for such investigations, since they allow us to study how genomes evolve as species exchange genetic material and to associate particular genetic regions with phenotypic traits of interest. Here, we use whole-genome resequencing of both allopatric and hybridizing populations of the European (Phylloscopus collybita abietinus) and the Siberian chiffchaff (P. tristis)-two recently diverged species which differ in morphology, plumage, song, habitat, and migration-to quantify the regional variation in genome-wide genetic diversity and differentiation, and to identify candidate regions for trait variation. We find that the levels of diversity, differentiation, and divergence are highly heterogeneous, with significantly reduced global differentiation, and more pronounced differentiation peaks in sympatry than in allopatry. This pattern is consistent with regional differences in effective population size and recurrent background selection or selective sweeps reducing the genetic diversity in specific regions prior to lineage divergence, but the data also suggest that postdivergence selection has resulted in increased differentiation and fixed differences in specific regions. We find that hybridization and backcrossing is common in sympatry, and that phenotype is a poor predictor of the genomic composition of sympatric birds. The combination of a differentiation scan approach with identification of fixed differences pinpoint a handful of candidate regions that might be important for trait variation between the two species.
Identification of candidate genes for trait variation in diverging lineages and characterization of mechanistic underpinnings of genome differentiation are key steps toward understanding the processes underlying the formation of new species. Hybrid zones provide a valuable resource for such investigations, since they allow us to study how genomes evolve as species exchange genetic material and to associate particular genetic regions with phenotypic traits of interest. Here, we use whole-genome resequencing of both allopatric and hybridizing populations of the European (Phylloscopus collybita abietinus) and the Siberian chiffchaff (P. tristis)-two recently diverged species which differ in morphology, plumage, song, habitat, and migration-to quantify the regional variation in genome-wide genetic diversity and differentiation, and to identify candidate regions for trait variation. We find that the levels of diversity, differentiation, and divergence are highly heterogeneous, with significantly reduced global differentiation, and more pronounced differentiation peaks in sympatry than in allopatry. This pattern is consistent with regional differences in effective population size and recurrent background selection or selective sweeps reducing the genetic diversity in specific regions prior to lineage divergence, but the data also suggest that postdivergence selection has resulted in increased differentiation and fixed differences in specific regions. We find that hybridization and backcrossing is common in sympatry, and that phenotype is a poor predictor of the genomic composition of sympatric birds. The combination of a differentiation scan approach with identification of fixed differences pinpoint a handful of candidate regions that might be important for trait variation between the two species.
Characterization of the genetic basis of reproductive isolation is essential for understanding the speciation process, and identification of candidate genes that may underlie trait variation related to reproductive barriers is a key step in this process (Feder ; Westram and Ravinet 2017). Until recently, speciation genetic research revolved predominantly around identification of genes that lie behind intrinsic postzygotic incompatibilities between deeply divergent lineages in the laboratory (Wolf ; Coyne and Orr 2004; Noor and Feder 2006). However, recent developments in DNA sequencing technology and genome analysis have paved the way for detailed investigations of the genetic underpinnings of reproductive isolation also in species that previously have lacked genomic tools (Baird 2017). These technological advancements hold the promise of contributing to the discovery of genetic components underlying traits that contribute to species divergence in natural settings, both at early and late stages of the divergence process (Storz 2005; Wolf and Ellegren 2017; Seehausen ). When lineages diverge, they may evolve allele combinations that are coadapted and lineage specific (Wu and Ting 2004). As a consequence, genetic incompatibilities may arise in hybrids (Bateson 1909; Dobzhansky 1940; Muller 1942) when divergent lineages meet and interbreed (Coyne and Orr 2004; Turelli and Orr 2000). The effect may be seen in first generation hybrids, but should be particularly pronounced in later stage backcrosses (e.g., Qvarnström ), and can lead to reinforcement of the speciation progression via selection against interspecific pairings and backcrosses (e.g., Sætre ; Sætre and Sæther 2010). Hybrid zones, regions where distribution ranges of nascent species coincide and interspecific crosses occur, therefore provide an ultimate study system for identification of genetic elements underlying initialization and establishment of barriers to gene flow in natural settings (Jiggins ; Barton and Hewitt 1989; Abbott ; Payseur and Rieseberg 2016).Of late, there has been gradually increasing interest in understanding how adaptation to the local environment can drive the formation of reproductive barriers between diverging populations (Nosil and Feder 2012). This has been termed “ecological speciation,” and refers to a scenario where reproductive isolation evolves as a side-effect of adaptation to the environment (Schluter 2009). Indirect support for ecological speciation predominantly comes from experimental evolution investigations in captive populations of micro-organisms (e.g., Buckling ), but also from a handful of studies in free-ranging natural populations (e.g., Via 2009; Bernatchez ; Chung ; Gompert ; Egan ). At early stages of population divergence, it is expected that genetic loci governing traits that affect adaptation will be more differentiated than the genomic average (Nosil and Feder 2012; Seehausen ; Wolf and Ellegren 2017). Theoretically, this allows for using genome scans for diversifying selection in incipient species or differentiated populations to identify genes that might be, or have been, involved in the speciation process (Wolf ; Seehausen ; Wolf and Ellegren 2017). Importantly, selection may also reduce the interspecific recombination rate in regions that are in physical linkage with the selected locus (Via and West 2008). The size of such regions may vary, and depends primarily on the relative strength of selection and the local rate of recombination (Charlesworth ); in some cases, large genomic regions of chromosomes can show this pattern (Via and West 2008; Feder and Nosil 2010; Rogers and Bernatchez 2007); in others, a very restricted portion or a single gene may underlie both ecological divergence and reproductive isolation (Chung ; Joron ; Martin ). Hence, by applying a combination of high-throughput DNA sequencing techniques with population genomic methods to identify loci that deviate from the expectations from neutrality, it is now a realistic goal to perform genome-scans for genes involved in adaptation, and to try to identify a potential link to speciation also in nonmodel species (Storz 2005).Recent genome-scan approaches using population samples from a range of taxa at different levels of divergence have shown that a common pattern encountered is a mosaic of highly and lowly differentiated genomic regions (Ellegren ; Turner ; Martin ; Renaut ; Ferchaud and Hansen 2016; Jones ; Wolf and Ellegren 2017; Burri ; Vijay ; Harr 2006; Poelstra ). This is in accordance with a scenario where gene-flow has reduced the level of genetic differentiation in general, but where particular genomic regions are sheltered from intermixing of parental alleles, and highly differentiated regions (so called “speciation islands” or “differentiation islands”) were therefore initially interpreted to contain incompatibility loci. However, with the recognition that a heterogeneous differentiation landscape may evolve also without gene-flow (Cruickshank and Hahn 2014), elevated differentiation may for example reflect low ancestral genetic diversity, and/or be a side-effect of reduced regional effective population size as a consequence of lower than average recombination rate (Burri ; Noor and Bennett 2009), the interpretations of patterns of genomic differentiation have become more careful (Pennisi 2014; Ravinet ). Although natural selection is a key driver toward patterns of heterogeneous differentiation between diverging lineages with or without gene-flow (Cruickshank and Hahn 2014), linking patterns of genomic differentiation to reproductive isolation is challenging (Ravinet ). One way forward is to have robust and well-planned study designs with genome-wide data of populations at different stages of divergence, and/or including data from multiple geographical locations involving both allopatric and sympatric regions, including potential hybrid zones (Seehausen ). Detailed analysis of population genetic summary statistics in allopatry and in sympatry complemented by comparisons between empirical and theoretical distributions to detect significant outliers among statistics can assist in the understanding of the initiation and establishment of divergence process underlying observed patterns of differentiation, and can shed light on the progression of the speciation process (Seehausen ; Wolf and Ellegren 2017), with or without gene-flow (Feder , 2012).Here, we present a population genomic survey using both allopatric and sympatric samples of the common chiffchaff (Phylloscopus collybita abietinus) and Siberian chiffchaff (P. tristis) as a study system to quantify patterns of genetic diversity and differentiation across the genome. The chiffchaff superspecies complex is part of the Eurasian Old World leaf warbler group, and, historically, the complex has been divided in four distinct species: Canarian chiffchaff, Iberian chiffchaff, mountain chiffchaff, and common chiffchaff (Helbig ). The common chiffchaff (P. collybita) has been divided into five different subspecies, including subspecies abietinus (eastern European common chiffchaff) and tristis (Siberian chiffchaff). P. tristis has recently been given full species status (del Hoyo and Collar 2017), now being considered a sister species to P. collybita, and we therefore refer to common chiffchaff and Siberian chiffchaff as distinct species. Since our particular focus concerns the comparison of P. tristis to one of the European common chiffchaff subspecies (P. c. abietinus) only, and to accommodate consistency with recent previous analyses of this particular pair of taxa (Shipilina ), we will refer to the two species as abietinus and tristis from here on. The breeding range of abietinus covers the regions from north-eastern Europe to the Ural mountains, and the seasonal migration goes to wintering areas in eastern and central Africa (del Hoyo ; Stepanyan 1990). The Siberian chiffchaff (tristis) is distributed from the Ural mountains to eastern Siberia during breeding season, and migrates to south-central and south-eastern Asia (del Hoyo ). Previous analyses using both phenotypic, acoustic, and genetic data have shown that abietinus and tristis differ significantly in appearance, body size, vocalizations, habitat preference, and migration patterns (Komarova and Shipilina 2010; Marova and Alekseev 2008; Marova , 2013, 2017; Marova and Leonovich 1993; Martens and Meinche 1989; Ticehurst 1938; Van den Berg 2009; Svensson 1992), and that they co-occur in a zone of secondary contact in a region from the southern Ural mountains in the south to the White Sea in the north, where they commonly hybridize and back-cross (Marova ; Shipilina ). Abietinus is more yellowish-green in plumage color, has a slightly larger body size, sings a slower and more rhythmic song with wider frequency range, and prefers mixed forests, whereas tristis has a duller appearance with no yellow tones (except on under wing coverts), is smaller in size, sings a faster and more diverse song, and preferably inhabits boreal forests (Marova and Alekseev 2008; Marova , 2013, 2017). In the sympatric zone, distinct abietinus and tristis individuals appear, but a relatively large proportion of individuals are characterized by showing intermediate plumage color, singing a mixed song type, and having a genetic composition consisting of a mixture of diagnostic abietinus and tristis alleles (Marova ; Shipilina ; Lindholm 2008; Ticehurst 1938). This makes the abietinus and tristis species pair an ideal model system to characterize the patterns of genomic differentiation as a result of divergence in allopatry followed by recurrent hybridization and back-crossing in sympatry. Our aims with the study were to: i) use population whole genome resequencing data to get global (genome-wide) and regional (locus specific) estimates of diversity and differentiation in allopatry and sympatry, ii) to use a genome-scan approach to characterize the landscape of differentiation between typical abietinus and tristis morphotypes in allopatry and in sympatry to detect regions that appear impermeable to gene-flow, and iii) to identify candidate genes for local adaptation and reproductive isolation between the species.
Materials and Methods
Sampling information and DNA sequencing
Male individuals of both abietinus and tristis were collected in the field during breeding seasons 2007–2009 (for details, see Shipilina ). Ten allopatric and 10 sympatric birds were selected for population resequencing from each of the two species—in total 40 birds. All birds were captured using mist-netting with song traps, and blood samples were collected using a standard syringe to puncture the brachial vein. Blood was stored on filter paper and DNA was extracted using a standard phenol-chloroform protocol (Sambrook ). Illumina, paired-end, individually barcoded, libraries with 380-bp insert sizes were constructed, and sequenced in a multiplex set-up in two separate batches on an Illumina HiSeq2000 sequencer—in total, all 40 samples were run in multiplex on eight lanes, generating, on average, 3.7 Gb of raw sequence data for each individual (Supplemental Material, Table S1 in File S1). A single allopatric abietinus male (Pcol09) was selected for deep sequencing, and two different paired-end libraries (180 and 380 bp) were generated for this individual and sequenced on four lanes—this generated, in total, 100.2 Gb of raw sequence data (91.1 Gb after filtering out bases with Q-scores <30) for this individual (Table S1 in File S1). Libraries were prepared by the SNPandSEQ Technology Platform in Uppsala, and sequencing was performed in two separate batches, one batch using Illumina HiSeq2000 technology (100 bp read length) at the SNPandSEQ Technology Platform in Uppsala (four lanes with 10 samples per lane, data also used in Shipilina ), and a supplementary batch using the same technology at the National Genomics Infrastructure (NGI) center in Stockholm (again four lanes with 10 samples per lane). All sequence reads have been deposited in the European Nucleotide Archive (http://www.ebi.ac.uk/ena) under accession number: PRJEB21643 (Table S1 in File S1). The sample names Pcol01-Pcol704 follow the nomenclature established during the sampling stage, when abietinus and tristis were classified as subspecies of the common chiffchaff (P. collybita; i.e., P. c. abietinus and P. c. tristis), and we decided to keep the sample names for easier comparison of results from studies of the same system (Shipilina ).
Assembly and quality assessment
A reference-assisted nuclear genome assembly of the Pcol09 individual was generated using version fAlb15 of the collared flycatcher (Ficedula albicollis) assembly as a reference (Ellegren ; Kawakami ). First, paired-end sequence reads from Pcol09 (>100× sequence coverage in total for the two different insert size libraries) were aligned to the F. albicollis assembly using BWA version 0.7.8 (Li and Durbin 2010). The consensus sequence from all mapped reads with mapping quality ≥30, and a coverage between 5× and 50× per base was extracted from the resulting BAM file using the mpileup function in SAMtools version 1.3 (Li ). The vcfutils function in BCF-tools was used to convert the assembly VCF-file into a FASTQ-file, and seqtk (https://github.com/lh3/seqtk) was used for the conversion of the FASTQ-file format to a FASTA formatted final assembly. Ambiguous bases in the reference-assisted assembly were replaced by N if they represented more than two alleles, and diallelic ambiguities were assigned to either allele randomly using an inhouse developed python script (https://github.com/venta380/chiffchaff_project/). The reference assisted chiffchaff genome assembly covered 89% of the entire F. albicollis reference genome, and spanned, in total, 1.04 Gb. It should be noted that a reference assisted assembly is blind to structural changes and differences in synteny can affect the interpretation of genome-wide patterns of genetic variation. Although the divergence time of Ficedula and Phylloscopus is ∼40 MY (Claramunt and Cracraft 2015; Selvatti ), the avian karyotype is remarkably conserved (Ellegren 2009), and the reference-assisted chiffchaff assembly likely contains only a limited number of structural inconsistencies, with a minor effect on the main conclusions of this study. The quality of the chiffchaff assembly was evaluated using the QUAST (Quality Assessment Tool for Genome Assemblies) software version 3.2 (Gurevich ), and by browsing for presence of conserved vertebrate and eukaryote gene sets using BUSCO (Benchmarking Universal Single-Copy Orthologs) version 1.1b1 (Simao ). The quality assessment results obtained for the chiffchaff assembly were compared to BUSCO results for the F. albicollis reference assembly and for the crow (Corvus corone) genome assembly, to obtain comparative data for high-quality avian genome assemblies. In order to assure sufficient mapping efficiency of the population resequencing data, one random sample (Pcol41) from the resequenced sample set was mapped to the chiffchaff assembly using BWA version 0.7.8 (Li and Durbin 2010), and mapping coverage and depth were computed using SAM-tools version 1.3 (Li ) and BED-tools version 2.25.0 (Quinlan and Hall 2010). All assembly and quality assessment statistics and comparisons to reference bird species (F. albicollis and C. corone) are presented in Table S2 and Table S3 in File S1.Complete chiffchaff mtDNA genome assemblies were generated for each of the 40 samples using mapping with BWA (Li and Durbin 2010), with the previously available F. albicollis mitochondrial genome as template to identify mtDNA reads. Mapped reads were subsequently used in de novo assemblies of chiffchaff mtDNA genome assemblies using the iterative mapping and assembly tool MITObim (Hahn 2013). The procedure was iterated five times using the most recently generated chiffchaff instead of the F. albicollis mtDNA assembly as a template. The final chiffchaff mtDNA assemblies were ∼17,000 bp long, and we annotated them with the web server MITOS (Bernt ). A comparison between the F. albicollis and the chiffchaff mtDNA assembly for individual Pcol09 (the individual with deepest sequencing coverage also used for reference assisted genome assembly) revealed that gene order and gene sizes were extremely conserved between the two lineages (Figure S1 in File S1)—note that the mtDNA genome is circular and the presentation in linear format generates a random cut point so that start and stop positions are different in the two assemblies.
Population resequencing and variant calling
The chiffchaff assembly was prepared to be used as a reference for mapping by indexing to different formats compatible with the GATK (Genome Analysis Tool Kit) SNP calling algorithms using BWA version 0.7.8 (Li and Durbin 2010) and SAMtools version 1.3 (Li ). An assembly sequence database with scaffold ID, scaffold size, and genomic anchoring positions on the F. albicollis reference genome was created using Picard Tools version 1.127 (available at: https://github.com/broadinstitute/picard). All 40 resequenced samples were mapped individually to the chiffchaff reference assembly using BWA version 0.7.8 (Li and Durbin 2010). The resulting SAM (Sequence Alignment/Map) files were converted into BAM (Binary Alignment/Map) files and merging of BAM-files for different sequence sets for each respective individual was done using SAMtools version 1.3 (Li ). In order to prepare the obtained 40 BAM files for the variant calling step, different preprocessing steps of the GATK (Genome Analysis Tool Kit) best practices were performed (McKenna ); the PCR duplicates were marked and deleted with MarkDuplicates option in Picard tools version 1.127 (available at: https://github.com/broadinstitute/picard), and the sequences flanking indels were realigned using IndelRealigner in GATK (McKenna ). Then, all reads from the individual used for reference assisted genome assembly (Pcol09) were mapped back to the assembly using BWA (Li and Durbin 2010), and the HaplotypeCaller in GATK (McKenna ) was run to generate an initial set of variants. The resulting VCF (Variant Call Format) file obtained from this analysis was used together with the BaseRecalibrator utility in GATK (McKenna ) to call a reference set of variants for the recalibration of bases in all the population samples. The quality of the recalibrated BAM files was checked with QualiMap version 2.0.2 (Okonechnikov ). One sample (Pcol703, an allopatric tristis individual) was found to have considerably lower mapping quality than all other samples, and was removed from the downstream analysis in order to reduce biases in the results as a consequence of differences in mapping efficiency, coverage, and variant calling. Sequence polymorphisms were called for each sample individually using HaplotypeCaller, subsequently joint genotyped using GenotypeGVCFs, and filtered using standard settings for the VariantRecalibrator in GATK (McKenna ).
Assignment analysis and visualization of genetic relationships
Population structure:
A SNP-based principal component analysis (PCA) was conducted on genotypes that were covered with a minimum of seven reads per individual in all samples in each of the four populations. The PCA was performed with 220,097 SNPs using the R Bioconductor package SNPRelate (Zheng 2012), using a linkage disequilibrium threshold of r < 0.2 (https://github.com/venta380/chiffchaff_project/). A population structure analysis was conducted using the R Bioconductor package LEA (Landscape and Ecological Association Studies) (Frichot 2015) on the same set of 220,097 high-quality SNPs used for the PCA analysis. The LEA package uses a program called sNMF (Frichot 2014) to cluster multi-locus genotypes based on a set value of number of populations (K) provided to the program. The range of K-values were set to be from 1 to 8 to find out the optimal number of populations explaining the data using cross-entropy values (Frichot 2014, 2015).
Mitochondrial DNA:
The mitochondrial genome was assembled for each individual independently (see above), and all mtDNA genome sequences were aligned with MAFFT version 7 (Katoh and Standley 2013), generating an alignment of 17,240 bp. The alignment was visually inspected for potential errors and ends (representing parts of the hypervariable D-loop region) were discarded to avoid false signals in the analysis of relationships. In total, 16,998 bp were used for the subsequent analyses. A branch length scaled mtDNA genome phylogeny was generated using Neighbor-Joining (Saitou and Nei 1987). The distances (substitutions per site) were computed using the Maximum Composite Likelihood (Tamura ), and rate variation among sites was modeled using a gamma distribution with shape parameter 1 as implemented in MEGA7 (Kumar ). Ambiguous positions were removed for pairwise comparisons. The mtDNA genome alignment was used to infer intrapopulation levels of nucleotide diversity by averaging over all sequence pairs in each respective population, and omitting missing data in pairwise comparisons using the Maximum Composite Likelihood method (Tamura ) as implemented in MEGA7 (Kumar ). We also estimated average mtDNA genome sequence differentiation across population pairs, again averaging over cross-population sequence pairs using the Maximum Composite Likelihood method (Tamura ) as implemented in MEGA7 (Kumar ), and ignoring missing data in pairwise comparisons.
Genome resequencing analysis
Repeat filtering:
Before any additional analyses, interspersed and tandem repeats in the chiffchaff assembly were hard masked using RepeatMasker version 4.0.6 (Smit ), using previously available F. albicollis repeat annotation libraries (https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Ficedula_albicollis/101/) as reference.
Genotype filtering:
The sequence coverage depth at each genotype was calculated using the geno-depth option in Vcf-tools version 0.1.14 (Danecek ). To avoid considerable biases in allele frequency estimates in downstream analysis, only SNPs that were present in at least seven individuals in each respective population [allopatric abietinus (n = 10), allopatric tristis (n = 9), sympatric abietinus (n = 10), and sympatric tristis (n = 10)], with a minimum sequence coverage of five reads (i.e., each base had to be covered by at least five different reads), were retained for subsequent analysis. Genotype calls were translated to specific polymorphism classes for both the allopatric and sympatric species comparison; fixed differences between the two species, private polymorphisms in abietinus, private polymorphisms in tristis, and shared SNPs (polymorphic in both species).
Population genetic summary statistics:
The ANGSD (Analysis of Next Generation Sequencing Data) package version 0.902 (Korneliussen ) was used to calculate population genetic summary statistics within each of the populations [pair-wise nucleotide diversity (Ɵ), Watterson’s theta (Ɵ) and Tajima’s D] and genetic differentiation (F) between allopatric and sympatric species pairs. As the ancestral state for segregating sites was not known, the folded allele frequency likelihoods and allele frequency spectra were computed using GATK’s genotype likelihood model as implemented in ANGSD (Korneliussen ). We used inhouse developed python scripts (https://github.com/venta380/chiffchaff_project/) to calculate absolute divergence (D) between allopatric and sympatric species pairs, and independent estimates of Ɵ for each respective population based on the allele frequencies estimated using ANGSD. The rationale behind this was that the inhouse calculated D estimates, and the Ɵ estimates from ANGSD were not calculated using identical settings, and to compare intraspecific polymorphism levels with absolute divergence we needed to apply the same settings for these summary statistics. All summary statistics were averaged over nonoverlapping windows of 10 kb across the genome. The levels of linkage disequilibrium (LD, estimated with the correlation between alleles at different loci: r) for each respective population were calculated for SNP pairs <100 kb apart using Plink version 1.90 (Purcell ). To avoid inflated LD estimates due to low allele frequencies, we only included SNPs with minor allele frequency (MAF) > 0.2. The population specific average LD per SNP pair in each 10-kb window was calculated using custom python scripts (https://github.com/venta380/chiffchaff_project/), with the filtered LD output from each population specific analysis from Plink (Purcell ).To assess the genomic distribution of the most highly differentiated regions between abietinus and tristis, and identify differentiation outliers, we Z-transformed the F estimates (F) for each comparison by standardizing with the genomic average and variance across windows according to the following formula:The estimates were calculated for the autosomes and the Z-chromosome separately since the effective population size (N) for the Z-chromosome was reduced as compared to the autosomes, and, hence, the expected baseline levels differ between these chromosome classes. Autosomal windows with F values higher than the 99th percentile in the distributions were considered outliers (≥4 SD higher than the global mean) for the allopatric and the sympatric comparison, respectively. In order to evaluate if the overall sequence coverage affected the F distribution (hypothetically the level of differentiation may be biased upwards as a result of unusual low coverage), per-site genome coverage was calculated using the coverageBed utility in the BEDTools version 2.25.0 (Quinlan and Hall 2010), and the correlation between the coverage and F was checked. We found only weak and inconsistent correlations between sequence coverage and F (range of Pearson’s r = −0.007 to 0.025 for the four different populations). To visually inspect the landscape of diversity and differentiation in the allopatric and the sympatric species comparisons, all population genetic summary statistics were plotted against physical positions along chromosomes using the ggplot2 library in R (https://cran.r-project.org/). The most extreme differentiation regions in each comparison (F > 99th percentile in the empirical distribution) were further investigated by extracting the coordinates for clusters (≥10 outlier windows present in a 0.5-Mb genomic region) of the windows with the highest level of relative differentiation, and intersect these with annotation data from the F. albicollis genome (http://www.ensembl.org/Ficedula_albicollis/Info/Index: 2017-06-30). The rationale behind this strategy was to identify coding genes to investigate if any particular functional category of genes was over-represented in the most highly differentiated regions. We used a binomial distribution to estimate the probability of this level of clustering or higher, since the sample size (50 windows per block) was much smaller than the total population (103,800 windows in total; 1038 outlier windows). The probability of 10 outlier windows clustering in a 500 kb (50 windows) region was estimated at 7.133 × 10−11. Importantly, relaxing the significance level did not affect the results quantitatively, i.e., very few additional blocks were identified if the limits were set to 5 (p-value ≈ 0.0001) or even three outliers (p-value ≈ 0.01) within a 50-window block (data not shown). Enrichment analysis of specific gene ontology (GO) terms for genes located in chromosome regions with higher than expected density of relative differentiation outliers was performed using GOstat with a custom reference database based on GO terms from the F. albicollis gene set as available in Ensembl (http://www.ensembl.org/Ficedula_albicollis/Info/Index: 2017-06-30). The identities and potential functions of the identified genes in highly differentiated regions were inferred using orthology information as available in the Biomart tool in Ensembl (http://www.ensembl.org/biomart/martview/c02baa1e6506b4840e7e34ceac0a49da: 2017-06-30).
Functional analysis of fixed, shared, and private polymorphisms
To further assess the potential functional relevance of different types of polymorphisms, we used custom PyVCF 0.6.8 scripts (https://github.com/venta380/chiffchaff_project/) to sort previously called SNPs into four categories for the allopatric species comparison (as mentioned above, we found no fixed differences between sympatric abietinus and tristis): fixed differences between the two species, private polymorphisms in abietinus, private polymorphisms in tristis and shared SNPs (polymorphic in both species). Only SNPs covered by at least five reads in each individual and where information was present for at least nine individuals in each of the populations were retained for this analysis (n = 2,810,678) to minimize the bias toward fixed differences when representation in individuals is low. We identified the functional relevance of each SNP using the annotation software snpEFF version 4.3 (Cingolani ) with a manually built custom database based on genome annotation information from F. albicollis (http://www.ensembl.org/Ficedula_albicollis/Info/Index), which is the closest relative to chiffchaff with extensive genome annotation information. The annotations were grouped into classes “downstream region,” “upstream region,” “exon,” “intron,” “intergenic region,” and “other” (the latter including 3′UTR, 5′UTR, and splice sites). The relative frequencies of different polymorphisms were then compared across annotation classes for autosomes and the Z-chromosome separately to get information about the relative effects of selection and genetic drift in driving the patterns of differentiation, heterozygosity, allele sharing, and accumulation of fixed differences between abietinus and tristis.
Data availability
All sequence reads in this study have been deposited in the European Nucleotide Archive under project accession number PRJEB21643. Individual accession numbers for all individuals are available in Table S1 in File S1.
Results
Assembly statistics
The total length of the chiffchaff assembly was 1.04 Gb, similar to previously available avian genome assemblies (Zhang ). The longest scaffold was 157.6 Mb, and the scaffold N50 length was 70.4 Mb, identical to the fAlb15 assembly (Kawakami ), as expected, given that F. albicollis was used as reference for mapping. The number of Ns in the assembly was 19%, and the total GC content was 31% (AT = 50%). All statistics of the chiffchaff assembly are presented in Table S2 in File S1. The assessment of the chiffchaff assembly completeness based on conserved gene sets showed that 80.9% of vertebrate and 50.3% of eukaryotic genes were completely covered. The fractions of fragmented genes in the vertebrate and eukaryotic gene sets were 6.5 and 4.9%, respectively, and missing genes constituted 12.6 and 44.8% of the gene sets (Table S3 in File S1). For comparison, the high-quality F. albicollis and C. corone assemblies cover only slightly higher fractions of complete vertebrate (92.8 and 84.4%, respectively) and eukaryotic (50.3% in both species) genes, and have a slightly lower proportion of completely missing genes; 2.5 and 10.1% in the vertebrate gene set (Table S3 in File S1).
Global estimates of genetic diversity and differentiation
In total, we detected ∼7 million SNPs across all individuals from both species, of which 6,912,505 were scored in the allopatric population pairs and 6,843,490 in the sympatric population pairs (Table 1). We observed a significantly higher proportion of shared SNPs (sympatry = 48.0%, allopatry = 35.5%), and a lower proportion of fixed differences (sympatry = 0.0%, allopatry = 0.3%) between abietinus and tristis in sympatry than in allopatry (Chi-Square test: χ2 = 283,510, df = 3, p-value < 2.2 × 10−16; Table 1). The proportion of fixed differences on the Z-chromosome was significantly higher in the comparison of allopatric abietinus and tristis (5.4%), but there were no fixed differences on the Z-chromosome in the sympatric comparison (Chi-Square test: χ2 = 198,380, df = 3, p-value < 2.2 × 10−16; Table 1). Both the PCA- and the structure (sNMF) analysis clearly separated allopatric abietinus and tristis (Figure 1 and Figure 2) with individuals from the sympatric regions clustering at the intermediate space in the PCA analysis (Figure 1), and showing considerable admixture proportions in the sNMF analysis (Figure 2). In the PCA, allopatric abietinus showed more interindividual variation than allopatric tristis, and sympatric tristis tended to cluster closer with allopatric tristis, while sympatric abietinus were more scattered across the range between the allopatric groups (Figure 1). This pattern was consistent with the clustering analysis where allopatric abietinus and tristis form clearly separated groups (optimal K = 2), and all sympatric tristis had >75% of the assigned proportions shared with allopatric tristis while sympatric abietinus varied from being assigned with 100% to abietinus to >90% with tristis (Figure 2).
Table 1
Summary of numbers (#) and proportions (in %) of shared, private, and fixed polymorphisms across allopatric and sympatric abietinus (ab.) and tristis (tr.) for the autosomes (A), the Z-chromosome (Z), and all chromosomes jointly (All)
Region (Chrom)
# Shared
# Private ab.
# Private tr.
Fixed
# Total
Allopatry (A)
2,426,608 (36.0%)
2,687,834 (39.9%)
1,614,661 (24.0%)
l9,505 (0.1%)
6,738,608
Sympatry (A)
3,188,645 (48.0%)
1,884,193 (28.3%)
1,574,541 (23.7%)
0 (0.0%)
6,647,379
Allopatry (Z)
27,559 (15.8%)
83,921 (48.3%)
52,973 (30.5%)
9,444 (5.4%)
173,897
Sympatry (Z)
95,951 (48.9%)
57,463 (29.3%)
42,697 (21.8%)
0 (0.0%)
196,111
Allopatry (All)
2,454,167 (35.5%)
2,771,755 (40.1%)
1,667,634 (24.1%)
18,949 (0.3%)
6,912,505
Sympatry (All)
3,284,596 (48.0%)
1,947,656 (28.4%)
1,617,238 (23.6%)
0 (0.0%)
6,843,490
Figure 1
A visual illustration of genetic similarities across all chiffchaff individuals using a PCA as implemented in the R Bioconductor package SNPRelate (Zheng 2012) of 220,097 unlinked (pairwise LD as estimated by r < 0.2) high quality SNPs. Samples are color-coded according to origin of individuals (yellow, allopatric abietinus; blue, sympatric abietinus; green, sympatric tristis; brown, allopatric tristis). Axes show the eigenvalues from the PCA for the two components explaining most of the variation (PC1 = 4.6% and PC2 = 3.2%).
Figure 2
Summary of the analysis of sample clustering using the R Bioconductor package LEA with 220,097 high quality SNPs (Frichot 2014, 2015). (A) Yellow, abietinus-derived alleles; brown, tristis derived alleles. Each vertical bar represents one individual, and samples are sorted according to population origin with allopatric abietinus to the far left, sympatric populations in the middle, and allopatric tristis to the far right. (B) Summary of error (entropy) rates for K = 1–8 predefined populations. The largest rate change in entropy occurs from two to three populations indicating that K = 2 is the optimal number of clusters (Frichot 2014, 2015).
A visual illustration of genetic similarities across all chiffchaff individuals using a PCA as implemented in the R Bioconductor package SNPRelate (Zheng 2012) of 220,097 unlinked (pairwise LD as estimated by r < 0.2) high quality SNPs. Samples are color-coded according to origin of individuals (yellow, allopatric abietinus; blue, sympatric abietinus; green, sympatric tristis; brown, allopatric tristis). Axes show the eigenvalues from the PCA for the two components explaining most of the variation (PC1 = 4.6% and PC2 = 3.2%).Summary of the analysis of sample clustering using the R Bioconductor package LEA with 220,097 high quality SNPs (Frichot 2014, 2015). (A) Yellow, abietinus-derived alleles; brown, tristis derived alleles. Each vertical bar represents one individual, and samples are sorted according to population origin with allopatric abietinus to the far left, sympatric populations in the middle, and allopatric tristis to the far right. (B) Summary of error (entropy) rates for K = 1–8 predefined populations. The largest rate change in entropy occurs from two to three populations indicating that K = 2 is the optimal number of clusters (Frichot 2014, 2015).The mtDNA genome analyses revealed two distinct mtDNA clades predominantly reflecting differentiation between allopatric abietinus and tristis, but with considerable nesting of sympatric abietinus individuals within the tristis clade (Figure 3 and Figure S2 in File S1). Allopatric tristis clustered in two separate groups, birds from Southern Siberia being in a derived clade while birds from Central and Eastern Siberia and the Ural Mountains were scattered in basal lineages in the “tristis”-type mtDNA group, and birds from the Southern sympatric zone showed higher diversity in mtDNA haplotypes than birds from the Northern sympatric region (Figure S2 in File S1). The analyses of mtDNA genetic variation within and between populations showed that diversity was higher in abietinus than in tristis in allopatry (Ɵ = 5.4 ± 0.3 × 10−3
vs. 4.0 ± 0.3 × 10−3 for abietinus and tristis, respectively), and the pattern was even more pronounced in sympatry (Ɵ = 12.6 ± 0.5 × 10−3
vs. 3.7 ± 0.3 × 10−3 for abietinus and tristis, respectively). MtDNA genetic divergence was also higher between allopatric than between sympatric abietinus and tristis (Maximum Likelihood corrected numbers of substitutions per site = 0.0213 ± 0.0008 and 0.0117 ± 0.0004 in allopatry and sympatry, respectively, Table S4 in File S1). Overall, this picture a scenario where abietinus and tristis diverged in allopatry roughly 1–2 MYA (following the rationale behind a body-size corrected mtDNA genome divergence rate as proposed by Nabholz ), and subsequently came into secondary contact in the current sympatric region where hybridization and backcrossing resulted in higher levels of allele sharing between species, with higher rates of introgression from tristis into abietinus.
Figure 3
An unrooted phylogeny representing the entire mitochondrial genome for all 40 samples included in the study. The length of the scale bar represents 0.01 substitutions per site, and nodes are color-coded according to origin of populations (yellow, allopatric abietinus; blue, sympatric abietinus; green, sympatric tristis; brown, allopatric tristis).
An unrooted phylogeny representing the entire mitochondrial genome for all 40 samples included in the study. The length of the scale bar represents 0.01 substitutions per site, and nodes are color-coded according to origin of populations (yellow, allopatric abietinus; blue, sympatric abietinus; green, sympatric tristis; brown, allopatric tristis).There was no striking difference in nuclear genetic diversity across populations although abietinus showed slightly higher global genetic diversity than tristis both in allopatry and sympatry (average Ɵ in allopatric and sympatric abietinus = 0.0053 ± 0.0031 and 0.0051 ± 0.0029, respectively and 0.0047 ± 0.0031 and 0.0049 ± 0.0030 in allopatric and sympatric tristis, Table 2). As expected given a lower Nof sex-chromosomes than of autosomes, both species had lower genetic diversity on the Z-chromosome (average Ɵ = 0.0031 ± 0.0035 and 0.0025 ± 0.0030, for allopatric abietinus and tristis, respectively) as compared to the autosomes combined (Ɵ = 0.0054 ± 0.0030 and 0.0048 ± 0.0030) – the same pattern was observed in sympatry (Table 2). Since demographic history may affect allele frequency distributions and patterns of LD, we assessed if further understanding of the underlying processes behind allele sharing in sympatry could be gained by using standard population genetic summary statistics (Tajima’s D) and linkage information. In summary, the global estimates of Tajima’s D were generally negative, in line with a bottleneck + expansion model for all populations, but less pronounced in allopatric tristis (Table 2). There were only minor differences in global decay patterns of LD across populations (Figure S3 in File S1 and Table 2). The history of the two species provides a starting point for exploring patterns of genetic differentiation across the genome as a consequence of secondary contact and recurrent hybridization, and to investigate how gene-flow affects global and regional genomic differentiation and characterize regions that may be shielded from introgression.
Table 2
Summary of intrapopulation nuclear genomic pair-wise nucleotide diversity (Ɵ), Watterson’s theta (Ɵ), Tajima’s D (T), and LD (r, per base pair) as estimated in nonoverlapping 10 kb windows across the chiffchaff genome
Population
Ɵπ
ƟW
TD
r2
Autosomes
Allopatric abietinus
0.0054 ± 0.0030
0.0064 ± 0.0025
−0.68 ± 0.37
0.029 ± 0.049
Allopatric tristis
0.0048 ± 0.0030
0.0051 ± 0.0025
−0.28 ± 0.39
0.030 ± 0.053
Sympatric abietinus
0.0052 ± 0.0029
0.0060 ± 0.0024
−0.58 ± 0.36
0.028 ± 0.052
Sympatric tristis
0.0050 ± 0.0029
0.0056 ± 0.0024
−0.50 ± 0.39
0.029 ± 0.055
Z-chromosome
Allopatric abietinus
0.0031 ± 0.0035
0.0038 ± 0.0027
−0.85 ± 0.45
0.022 ± 0.062
Allopatric tristis
0.0025 ± 0.0030
0.0028 ± 0.0024
−0.59 ± 0.49
0.025 ± 0.080
Sympatric abietinus
0.0034 ± 0.0027
0.0039 ± 0.0022
−0.50 ± 0.41
0.023 ± 0.057
Sympatric tristis
0.0030 ± 0.0029
0.0035 ± 0.0023
−0.66 ± 0.37
0.022 ± 0.061
All summary statistics are global averages for autosomes and the Z-chromsome, respectively, with SD estimated from variance across windows.
All summary statistics are global averages for autosomes and the Z-chromsome, respectively, with SD estimated from variance across windows.The global level of genetic differentiation between species was significantly lower in the sympatric region (mean ± SD of F across windows = 0.099 ± 0.038) than in the allopatric regions (0.220 ± 0.064) (Wilcoxon’s test: W = 1.04 × 1010, p-value <2.2 × 10−16), and the level of genetic differentiation was significantly higher on the Z-chromosome than on the autosomes (only allopatric comparison shown: Wilcoxon’s test: W = 1.05 × 108, p-value < 2.2 × 10−16; Figure 4 and Table 3). An opposite pattern was observed for absolute divergence (D), with reduced divergence on the Z-chromosome as compared to autosomes [Wilcoxon’s test: W = 1.17 × 108, p-value < 2.2 × 10−16 (allopatry); W = 1.0 × 108, p-value < 2.2 × 10−16 (sympatry)], although the quantitative difference between allopatry and sympatry was less pronounced [global D = 0.0085 ± 0.0058 (allopatry) and 0.0081 ± 0.0056 (sympatry); Wilcoxon’s test: W = 5.76 × 109, p-value < 2.2 × 10−16], except for the Z-chromosome where D was considerably reduced in sympatry as compared to allopatry [Z-chromosome D = 0.0061 ± 0.0039 (allopatry) and 0.0055 ± 0.0034 (sympatry); Wilcoxon’s test: W = 5.1 × 109, p-value < 2.2 × 10−16; Table 3). The overall level of genetic differentiation was hence higher on the Z-chromosome than on the autosomes, both in allopatry and sympatry, but the opposite was observed for absolute divergence (Figure 4, Figure 5, and Figure 6).
Figure 4
Violin plots showing the distributions of genetic differentiation (F) on the Z-chromosome, autosomes and all chromosomes combined in comparisons of allopatric (gray) and sympatric (red) abietinus and tristis pairs. Horizontal solid lines within distribution curves indicate means, and broken lines the SD across all 10-kb windows in the genome.
Table 3
Summary of the levels of genetic differentiation (F) and absolute divergence (D) across allopatric and sympatric abietinus and tristis for the autosomes (A), the Z-chromosome (Z), and all chromosomes jointly (All)
Region (chrom)
FST
DXY
Allopatry (A)
0.214 ± 0.057
0.0086 ± 0.0059
Sympatry (A)
0.097 ± 0.036
0.0083 ± 0.0056
Allopatry (Z)
0.313 ± 0.093
0.0061 ± 0.0039
Sympatry (Z)
0.133 ± 0.051
0.0055 ± 0.0034
Allopatry (All)
0.220 ± 0.064
0.0085 ± 0.0058
Sympatry (All)
0.099 ± 0.038
0.0081 ± 0.0056
All summary statistics are averages from windows within each respective chromosome class with SD estimated from variance across windows.
Figure 5
Illustration of regional variation in fixed differences (allopatric comparison only, top panel, # Fix), genetic differentiation (second panel, F), absolute divergence (middle panel, D), nucleotide diversity (fourth panel, Ɵ) and Tajima’s D (bottom panel, TajD) represented by chromosome 12. Values on y-axes represent ranges of each respective parameter and colors indicate species comparison (gray, allopatry; red, sympatry for # Fix, F and D) or population (yellow, allopatric abietinus; brown, allopatric tristis; blue, sympatric abietinus; green, sympatric tristis for Ɵπ and TajD).
Figure 6
Illustration of the variation in genetic differentiation (F, top panel) and absolute genetic divergence (D, bottom panel) across the genome between allopatric (gray line) and sympatric (red line) abietinus and tristis. Chromosomes are ordered from chromosome 1 to chromosome Z, and every other chromosome is shaded in light gray.
Violin plots showing the distributions of genetic differentiation (F) on the Z-chromosome, autosomes and all chromosomes combined in comparisons of allopatric (gray) and sympatric (red) abietinus and tristis pairs. Horizontal solid lines within distribution curves indicate means, and broken lines the SD across all 10-kb windows in the genome.All summary statistics are averages from windows within each respective chromosome class with SD estimated from variance across windows.Illustration of regional variation in fixed differences (allopatric comparison only, top panel, # Fix), genetic differentiation (second panel, F), absolute divergence (middle panel, D), nucleotide diversity (fourth panel, Ɵ) and Tajima’s D (bottom panel, TajD) represented by chromosome 12. Values on y-axes represent ranges of each respective parameter and colors indicate species comparison (gray, allopatry; red, sympatry for # Fix, F and D) or population (yellow, allopatric abietinus; brown, allopatric tristis; blue, sympatric abietinus; green, sympatric tristis for Ɵπ and TajD).Illustration of the variation in genetic differentiation (F, top panel) and absolute genetic divergence (D, bottom panel) across the genome between allopatric (gray line) and sympatric (red line) abietinus and tristis. Chromosomes are ordered from chromosome 1 to chromosome Z, and every other chromosome is shaded in light gray.
Regional variation in genetic diversity and differentiation
Global estimates of nuclear genetic diversity showed highly heterogeneous patterns across the genome of both species, both in allopatric and sympatric populations (Figure 5, Figure 6, and Figure S4 in File S1). Similar to many previous studies of avian (and other) species pairs, the level of genetic differentiation varied extensively across the genome, with presence of distinct F “peaks” in both the allopatric and sympatric abietinus and tristis species comparisons (Figure 6 and Figure S4 in File S1). A striking difference between allopatry and sympatry was that the overall level of differentiation (but not absolute divergence) was lower, and the differentiation peaks were narrower and more pronounced in the sympatric comparison (Figure 6, Figure 7, and Figure S4 in File S1), resulting in significantly different F-distributions (Kolmogorov-Smirnov test: D = 0.88, p-value < 2.2 × 10−16). There was a significant negative correlation between the level of genetic differentiation and genetic diversity in both populations [Pearson’s r = −0.29 and −0.32 for abietinus and tristis; p-value < 2.2 × 10−16 in both tests (allopatry); r = −0.17 and −0.21; p-value < 2.2 × 10−16 in both tests (sympatry)], and also a negative correlation between differentiation and absolute divergence, both in allopatry (Pearson’s r = −0.225, t = −74.421, df = 103,770, p-value < 2.2 × 10−16) and in sympatry (Pearson’s r = −0.130, t = −30.823, df = 103,630, p-value < 2.2 × 10−16 ; Figure S5 in File S1). Tajima’s D was, in general, more negative in regions of high differentiation in comparison to the remaining parts of the genome (Pearson’s r = −0.070 to −0.166, p-values < 2.2 × 10−16), with the exception of sympatric abietinus that showed a positive correlation between Tajima’s D and the level of genetic differentiation (Pearson’s r = 0.162, p-value < 2.2 × 10−16; Figure 5, Figure S4, and Figure S6 in File S1). The analysis of the levels of genetic diversity and absolute divergence shows that the correlation is very strong, both in allopatry and in sympatry, with only little deviation from the linear relationship between Ɵ and D (Figure S7 in File S1). We observed a higher variance in absolute divergence in allopatry than in sympatry, and a more pronounced clustering of high differentiation windows at the lower left end in the graph (low genetic diversity and low absolute divergence) in autosomes than for the Z-chromosome (Figure S7 in File S1).
Figure 7
Estimates of relative differentiation (F) between abietinus and tristis for 10 kb windows across the genome. The upper 99th percentiles are represented by horizontal barred lines for the allopatric (top panel, gray) and sympatric (bottom panel, red) comparisons, respectively. Regions with significantly higher density of outlier windows are indicated with gray (allopatry) and red (sympatry) solid blocks.
Estimates of relative differentiation (F) between abietinus and tristis for 10 kb windows across the genome. The upper 99th percentiles are represented by horizontal barred lines for the allopatric (top panel, gray) and sympatric (bottom panel, red) comparisons, respectively. Regions with significantly higher density of outlier windows are indicated with gray (allopatry) and red (sympatry) solid blocks.Windows with F values higher than the 99th percentile in the distributions were highly clustered in specific regions for both the allopatric and the sympatric comparison (Figure 7). We tested for significant clustering by scanning the genome for regions where blocks of 50 windows (each window was 10 kb long) contained 10 or more F outlier windows (the probability of 10 or more outliers appearing in a 50-window block was estimated to 7.13 × 10−11 if outliers are assumed to be randomly scattered across the genome). In the allopatric comparison, 19 blocks representing 13 chromosomal regions (chromosomes 1, 1A, 2, 3, 4, 5, 7, 8, 9, 10, 12, and 28), were identified as significantly enriched for F outliers, and the analogous analysis in sympatry revealed 13 windows, representing 10 different regions located on chromosomes 1, 1A, 5, 10, 12, 13, 17, 20, and 21 (Figure 7 and Table S5 in File S1). The identified regions were highly overlapping between the two different comparisons, but, again, regions were narrower and more pronounced in sympatry (Figure 7). The chromosome regions with high density of F outliers contained 238 and 274 protein coding genes in the allopatric and the sympatric comparisons, respectively (Table S6 in File S1). The GO enrichment analysis revealed no over-representation of specific terms for genes in differentiation outliers in allopatric comparison (p-value threshold 0.01 after correcting for multiple testing), and only one term associated with regulation of cellular component organization in the sympatric comparison for two partly overlapping gene sets (Table S6 in File S1)
Protein coding genes in differentiation peaks
Global genomic differentiation was significantly lower in sympatry than in allopatry with more pronounced and narrower differentiation peaks, indicating recurrent gene-flow in large portions of the genome of the two species. The highly differentiated regions in sympatry may indicate that these regions have experienced restricted gene-flow, and therefore we wanted to identify potential targets of diversifying selection and/or incompatibility within these regions. Omitting the Z-chromosome, which, in general, showed elevated differentiation across the entire chromosome (Figure 6), and no relative differentiation outliers (Figure 7), we found 26 protein coding genes in the 10 kb windows with the highest level of genetic differentiation nested within blocks of high density of relative differentiation outliers (see above) in the comparison of sympatric abietinus and tristis (Table 4). Among the 26 genes, three were uncharacterized, and the remaining 23 had diverse functions, mainly related to ion transport, gene regulation (transcription factors, histone binding, and phosphorylation), muscle development, learning (ARF4), sensory perception of sound (OTOG), and pigmentation (HPS5) (Table 4).
Table 4
Protein coding genes identified in the 10-kb windows with the highest genetic differentiation (F) nested within the chromosome regions enriched for windows with high relative differentiation (F) between sympatric abietinus and tristis
Gene Name/Ensemble ID
Chromosome
Annotation/Function
EMSY (C11orf30)
1
Transcription factor
ATF7IP
1A
Transcription factor
PLBD1
1A
Phospholipase B domain
GUCY2C
1A
GMP biosynthetic process, phosphorylation
SLC9A3
1A
Sodium ion transport
TPPP
2
Microtubule bundle formation (brain)
BRD9
2
Nucleic acid/histone binding; gene regulation
COL15A1
2
Cell adhesion
ENSFALG00000013290
2
Novel transcript
ENSFALG00000006425
3
Signal transduction/response to stimulus
DZANK1
3
Mitophagy
ANO5
5
Chloride transport
SLC17A6
5
Transmembrane transport
LUZP2
5
Leucine zipper 2 protein
PLEKHA7
5
Epithelial cell-cell adhesion
ABCC8
5
Transcription factor
OTOG
5
Sensory perception of sound
HPS5
5
Pigmentation/organelle organization
SERGEF
5
Negative regulation of protein secretion
KCNC1
5
Potassium ion transport
MYOD1
5
Muscle tissue development
ENSFALT00000008878
12
Microtubule activity
PDE12
12
Phosphodiesterase 12
ARF4
12
Cell cilia function/signaling/rhodopsin
DENND6A
12
Regulation of cell-cell adhesion
SLMAP
12
Muscle contraction
Genes are ordered according to position in the genome. The function is inferred from GO terms (biological process and/or molecular function) as indicated in Ensembl (http://www.ensembl.org/Ficedula_albicollis/Gene/Ontologies).
Genes are ordered according to position in the genome. The function is inferred from GO terms (biological process and/or molecular function) as indicated in Ensembl (http://www.ensembl.org/Ficedula_albicollis/Gene/Ontologies).
Functional relevance of fixed, shared, and private polymorphisms
To obtain further information about the underlying forces generating a heterogeneous distribution of genetic diversity and differentiation, we assessed the functional relevance of different categories of SNPs. This analysis was focused on the allopatric comparison, only since we did not identify any fixed differences between sympatric abietinus and tristis, and we applied more stringent filtering (sequence coverage of at least 5× per individual for at least nine individuals in both allopatric abietinus and tristis) to minimize the risk of including erroneously called fixed differences in this analysis. Among all high-quality SNPs selected for this analysis (2,810,678 SNPs), we found that almost half of them (43.6%) were shared polymorphisms, while only 7609 SNPs (0.27%) represented fixed differences. We generated annotation information for all SNPs from all of the eight SNP sets (shared, private abietinus, private tristis, and fixed differences for the autosomes and the Z-chromosome, respectively) using the annotation available for the F. albicollis genome assembly (http://www.ensembl.org/Ficedula_albicollis/Info/Index). For autosomes, there was an excess in the proportion of fixed differences in genes, or chromosome regions in the vicinity of genes, while there was an underrepresentation of fixed differences in intergenic sequences, but this pattern was not present on the Z-chromosome (Figure 8). Since fixed differences are overrepresented in high differentiation regions (Figure 5 and Figure S6 in File S1), this indicates that patterns of elevated differentiation cannot entirely be explained by restricted recombination and less efficient selection, but also that diversifying selection has driven certain alleles to fixation between abietinus and tristis. We therefore took the analysis one step further by classifying all the fixed differences between the species in allopatry. The results showed that of the 7609 fixed differences identified, the majority were as expected located in introns, intergenic regions or noncoding regions adjacent to genes (n = 7458), but we also identified 134 exonic fixed differences corresponding to 116 synonymous and 18 nonsynonymous positions, and 17 fixed differences potentially affecting splice signals. The genes associated with the fixed differences at nonsynonymous positions or splice sites are all potential candidates for underlying species-specific characteristics of abietinus and tristis, respectively, and are presented in Table S7 in File S1. Two of the genes had more than one nonsynonymous substitution, and, based on functional studies in other organisms, these may be involved in the innate immune response (FYB) and lipid metabolism (PRUNE2). Another two genes had both nonsynonymous and splice altering differences, and these genes have functions related to gene expression regulation via histone methylation (KDM4C), microtubule activity, and intracellular transport (KIF24).
Figure 8
Bar plots showing the relative frequencies of shared (n = 1,224,635; green), private [yellow (n = 734,710) and brown (n = 843,724) for abietinus and tristis, respectively) and fixed (n = 7609; gray) polymorphisms between allopatric abietinus and tristis. SNPs on the autosomes are presented in the top panel, and Z-chromosome linked SNPs are presented in the bottom panel.
Bar plots showing the relative frequencies of shared (n = 1,224,635; green), private [yellow (n = 734,710) and brown (n = 843,724) for abietinus and tristis, respectively) and fixed (n = 7609; gray) polymorphisms between allopatric abietinus and tristis. SNPs on the autosomes are presented in the top panel, and Z-chromosome linked SNPs are presented in the bottom panel.
Discussion
Shipilina used nuclear SNP information from one of the sequencing batches and restriction data from a single mtDNA gene sequence (CYTB) in the same set of chiffchaff samples. They investigated population structure in abietinus and tristis in relation to phenotypic variation in allopatry and sympatry, and verified that the intermediate phenotypes and mixed singers observed in the area around the Ural mountains represented admixed individuals rather than a third taxon previously suggested to be present in the area (Shipilina ). Here, we went one step further by developing a reference-assisted genome assembly, and using the complete mtDNA genome and higher coverage nuclear genomic resequencing data to generate quantitative information about genome-wide patterns of regional variation in genetic polymorphism, divergence, and differentiation. The polymorphism data were used to investigate how hybridization in secondary contact affects patterns of genome differentiation and to identify regions under diversifying selection, potentially involved in reproductive barriers between the species. The chiffchaff assembly covered ∼81% of the genome (as estimated from comparison to the collared flycatcher, F. albicollis), and had sufficient quality to use as a backbone for mapping reads from resequencing efforts involving allopatric and sympatric populations of abietinus and tristis. The level of complete and partial coverage of conserved gene sets was, for example, similar to high-quality avian genome assemblies (F. albicollis and C. corone).Initially, we used the new polymorphism data to assess population structure and genomic variation within and between abietinus and tristis to have a thorough background for more detailed analyses of genomic regions that showed patterns of differentiation consistent with diversifying selection and/or genetic incompatibilities. Our estimates of genetic diversity, genetic differentiation, allele frequency distributions, LD, and population structure for both mtDNA and nuclear DNA supported the scenario suggested by Shipilina . The two species were most likely isolated in allopatry and accumulated novel variants and allele frequency changes, which was followed by secondary contact and interspecific gene-flow when distribution ranges expanded, resulting in reduced global differentiation (Shipilina ; Marova ). We observed only minor differences in the levels of genetic diversity across populations, with allopatric abietinus showing slightly higher proportion of private polymorphisms and a higher genetic diversity, but a higher proportion of low frequency variants (more negative Tajima’s D) than allopatric tristis, despite the smaller sampling range of abietinus than tristis (Shipilina ). This may reflect a more severe or longer bottleneck, and a less dramatic expansion in tristis during and after glacial cycles. In comparison to a well-studied avian study system with potentially similar demographic histories—the pied (Ficedula hypoleuca) and the collared flycatcher (F. albicollis), where average nucleotide diversity ranges from 3.2 × 10−3 to 4.0 × 10−3 (Burri )—the diversity levels were 10–20% higher in chiffchaff, suggesting higher long-term effective population sizes, potentially as a result of less dramatic contractions during glacial maxima. In sympatry, there was a reduction in the fraction of private polymorphisms, which was more pronounced in abietinus than in tristis, and the mtDNA divergence in intraspecific comparisons of allopatric and sympatric populations was higher in abietinus than in tristis. Furthermore, sympatric abietinus had a higher Tajima’s D, indicating a lower proportion of low frequency alleles. This suggests that sympatric abietinus-like birds also harbor tristis nuclear and mtDNA alleles, which is in line with previous analyses (Shipilina ). Taken together, this portrays a scenario where secondary contact and recurrent hybridization and back-crossing has generated a reduction in genome-wide differentiation in sympatry, and that the external appearance of an individual does not necessarily reflect genomic composition. Interestingly, this indicates that individual phenotype is not dependent on the effect of many genes scattered throughout the genome, but rather that relatively few genes may underlie the observed differences between the species. Having this background knowledge provides a good position from which to investigate regional variation in diversity, differentiation, and divergence to draw conclusions on specific processes underlying the observed patterns in allopatry and sympatry.There was a considerable reduction in genetic diversity (Ɵ) on the Z-chromosome as compared to the autosomes in both abietinus and tristis and the level of genetic differentiation (F) was on average higher on the Z-chromosome in both allopatry and sympatry. Interestingly, the level of absolute divergence (as estimated by D) was considerably reduced on the Z-chromosome as compared to the autosomes, similar to previous observations in the greenish warbler species complex (Irwin ). This indicates that lower postdivergence gene-flow has not been the main force affecting the increase in differentiation on the Z-chromosome, but rather that selection in the ancestral lineage has resulted in reduced diversity prior to the split between abietinus and tristis (Cruickshank and Hahn 2014; Nachman and Payseur 2012; Irwin ). The overall elevated level of differentiation on the Z-chromosome is a frequently observed pattern (Burri ; Ellegren ; Irwin ; Wolf and Ellegren 2017; Oyler-McCance ); however, several characteristics of the sex-chromosomes (Vicoso and Charlesworth 2006) contribute to difficulties in the interpretation of the exact mechanisms underlying the observed patterns on the Z-chromosome. First, the Z-chromosome does not recombine in females (except for the pseudoautosomal region, PAR), and, therefore, it is expected that the overall recombination rate will be reduced on the Z-chromosome as compared to the autosomes of similar size, which do recombine in both males and females (Irwin ; Qvarnström and Bailey 2009). This results in lower efficiency of selection in general (reducing N and increasing the effect of drift), and should also increase the effect of linked selection on the Z-chromosome. Second, it is plausible that the Z-chromosome harbors a gene set that is nonrepresentative for the genome as a whole, e.g., enriched for genes involved in local adaptation or reproductive isolation, as has been suggested for the X-chromosome in Drosophila (Presgraves 2008). We know, for example, from previous studies in Ficedula flycatchers, that the Z-chromosome plays a particularly important role for traits under sexual selection and species recognition (Ellegren ; Qvarnström and Bailey 2009; Qvarnström ; Sætre and Sæther 2010), and this could potentially be translated also to the situation in chiffchaff, since both song and plumage are important signals for mate choice and species recognition in Phylloscopus warblers (Lyu ; Arvidsson and Neergaard 1991), including abietinus and tristis (Marova ). Third, the effective population size (N) of the Z-chromosome is reduced as compared to autosomes. With an equal sex-ratio, N for the Z-chromosome is 3/4 (75%) of the N for autosomes, since females only have a single Z-chromosome (in birds, male is the homogametic sex and harbors two Z chromosomes, and female is the heterogametic sex with one Z and one W chromosome). In birds, males also generally have higher variance in reproductive output than females, further reducing the effective population size of the Z-chromosome and increasing the effect of genetic drift. Fourth, since the Z-chromosome is in hemizygous state in female birds, recessive mutations are expressed in females, and this can contribute to both faster evolution (Vicoso and Charlesworth 2009; Sackton ; Mank , 2010) and more pronounced antagonistic effects across sexes, leading to sex-biased expression patterns (Mank 2009; Ellegren and Parsch 2007). Finally, since the Z-chromosome spends 2/3 of the time (in an evolutionary timescale perspective) in the male germline, it is possible that mutation rates are higher on the Z-chromosome than on the autosomes due to male-biased mutation (Ellegren and Fridolfsson 1997; Hurst and Ellegren 1998; Wilson Sayres and Makova 2011). The observed patterns in chiffchaff suggest that the most plausible explanation is that genetic drift is a major force affecting allele frequency changes on the Z-chromosome, and that directional selection affects larger regions due to linkage, and these forces have increased differentiation across the entire chromosome. Mutation bias, however, probably has had only a minor effect (if any), since absolute divergence is considerably reduced as compared to the autosomes. In addition, selection has likely been a key force reducing overall diversity on the Z-chromosome prior to the split between abietinus and tristis, and followed by accumulation of allele frequency changes due to strong effects of drift as a consequence of combined effects of the specific features of the Z-chromosome listed above. Hence, there is no evidence in our allele frequency data for more restricted gene-flow on the Z-chromosome, which is also supported by the fact that no fixed differences were observed on the Z-chromosome when comparing abietinus and tristis in the sympatric zone.Similar to the observations when comparing the Z-chromosome to the autosomes, the highly differentiated regions across chiffchaff species in general showed reduced diversity and absolute divergence. There was also a strong correlation between absolute divergence and genetic diversity. Again, similar to the situation for the Z-chromosome, this may be translated to a scenario where linked selection and regional lowered effective population size has resulted in loss of genetic variation prior to the divergence of the two lineages (Cruickshank and Hahn 2014; Nachman and Payseur 2012). The levels of differentiation are expected to vary regionally as a consequence of stochastic allele frequency changes, also when the level of selection is equal across regions in the genome, especially if populations are structured (Baird 2017; Lohse 2017). However, we observed very distinct and narrow differentiation peaks, especially in sympatry, which indicates that additional factors affect the differentiation landscape between chiffchaff species. One plausible explanation is that the recombination rate varies considerably across regions in the chiffchaff genome, similar to what has been observed in other bird species (Kawakami ; Singhal ; Backström ). This leads to reduced effective population size in regions with low recombination rate, allowing for more rapid allele frequency shifts across species due to genetic drift in those regions (Burri ; Burri 2017a,b; Ellegren and Wolf 2017; Ravinet ; Cruickshank and Hahn 2014; Nachman and Payseur 2012; Noor and Bennett 2009). We do not have access to recombination data for the chiffchaff, but several recent studies indicate that recombination landscapes may be conserved across divergent avian lineages (Singhal ), leading to similar patterns of genetic diversity and differentiation across independent species pair comparisons (Burri ; Irwin ; Dutoit ; Van Doren ; Vijay ). In support of that, when visually comparing the location of differentiation islands identified in the chiffchaff, we see a considerable overlap with for example differentiation islands in Ficedula flycatchers (Burri ; Ellegren ). Hence, although we acknowledge that demographic history definitely affects differentiation patterns, our data suggest that the highly heterogenic landscape of genome differentiation between abietinus and tristis can be traced back to mechanistic underpinnings (e.g., recombination rate variation) determining the difference in relative importance of genetic drift and selection in different parts of the genome, and that a major proportion of the fixed differences we observe are the result of stochastic fixation of alternative alleles in the two species in allopatry. However, the observation that a larger proportion of autosomal fixed differences are present in exonic sequence or adjacent to genes indicates that at least a proportion of the fixed differences on the autosomes result from selection driving alternative alleles to fixation in abietinus and tristis, respectively. In addition, it is possible that stochastic allele frequency changes and random accumulation of fixed differences may lead to diverging phenotypes between species in allopatry, and it is therefore of interest to investigate the functional relevance of observed differences between species, even if it is apparent that the major fraction of these were not initially driven to fixation by adaptation to different environments (natural selection and ecological speciation) or by different preferences for secondary sexual traits (sexual selection). In the following paragraph, we discuss the detailed analyses of functions of genes in highly differentiated regions, and genes with alternative alleles fixed between the two species.The common and the Siberian chiffchaff differ in morphology, plumage color, song, and migration patterns—traits that most likely are of importance for species recognition and lineage specific adaptations (Marova ; Marova and Alekseev 2008; Marova and Leonovich 1993; Marova ; Shipilina ). Gene-flow between the species has been shown to lead to intermediate or mixed traits in hybrids and back-crosses (Marova ; Marova and Alekseev 2008; Shipilina ), and characterization of the genetic basis of these traits is of importance to understand species integrity and the evolutionary effects of introgression. We conducted several analyses to detect genes that might be of interest for understanding phenotypic differences between abietinus and tristis that may reflect species specific adaptations. First, we selected all genes that were located in genomic regions enriched for relative differentiation outliers, and analyzed if any functional categories were overrepresented in this class of genes. This analysis showed no significant enrichment in the allopatric comparison and a single term (regulation of cellular organization) in sympatry. Such an analysis is characteristically rough, and, although it may give some preliminary insights into gene categories associated with high differentiation, it is difficult to draw detailed conclusions regrading functional relevance (Vijay ). To get more detailed information about genes in highly differentiated regions, we also selected the windows with the highest level of differentiation (F) within each of the regions enriched for relative differentiation (F) outliers, and found that these top windows overlapped with 26 protein coding genes. The genes were mainly involved in functions related to ion transport, expression regulation (transcription factors, histone binding, and phosphorylation), and muscle activity/development, but also associated with processes directly relevant to phenotypic traits of interest in our study system; for example, sensory perception of sound (OTOG), and pigmentation (HPS5) (Table 2). Both OTOG and HPS5 are located in the peak of the differentiation island on chromosome 5. OTOG has been associated with sound perception and locomotion in mouse (Mus musculus domesticus) (Simmler ), and otolith function (hearing and balance organ development) in zebra fish (Danio rerio) (Stooke-Vaughan ), indicating a conserved function in hearing and balance organs across very divergent lineages. Song recognition is a key mechanism for mate choice in birds (Nowicki and Searcy 2005), and recent data indicate that allopatric abietinus and tristis only react to conspecific song (e.g., Marova ). It is therefore tempting to contemplate that genes involved in sensory perception of sound may play an important role for driving differences in song preference, and song learning processes, in vocally differentiated bird species like abietinus and tristis. HPS5 is involved in biogenesis of organelles, and has been associated with melanocyte formation and pigmentation in zebra fish (Daly ). Plumage pigmentation is also a key trait for mate recognition and speciation in birds (Price 2008), and, again, it is enticing to speculate around a potential role for this gene underlying specific plumage patterning in chiffchaff. As mentioned before, both OTOG and HPS5 are found in the peak of the differentiation island on chromosome 5. If selection has played any part in the high allele frequency difference between abietinus and tristis in this region, one cannot rule out that genetic hitchhiking has resulted in reduced nucleotide diversity both in the selected region and in linked genes. Therefore, verification experiments will be needed to get detailed information about the phenotypic effects of variation within these genes.Finally, we also analyzed all fixed differences observed between allopatric abietinus and tristis and selected the positions where the substitutions resulted in amino acid and splice signaling changes. Again, we found that several genes were related to metabolism and gene expression (transcription factors, RNA processing, and methylation), but there was a diverse array of other functions in this class of genes. Of particular interest were two genes that had multiple (n = 2) fixed differences resulting in nonsynonymous changes; FYB and PRUNE2. FYB has been shown to be involved in the innate immune response by stimulating interleukin-2 expression in T-cells in the mouse (Krause ), while PRUNE2 has been associated with many different functions (http://www.genecards.org/cgi-bin/carddisp.pl?gene=PRUNE2); the gene is, for example, associated with pyrophosphatase activity that could indicate involvement in lipid metabolism. Two additional genes contained both a nonsynonymous and a splice-altering fixed difference; KDM4C and KIF24. KDM4C regulates gene expression by targeting trimethylated histones in humans (Hillringhaus ), and KIF24 is a kinesin that is a microtubule remodelling ATPase likely involved in intracellular transport (Kobayashi ). From these analyses combined, we hypothesize that some of the genes located in highly differentiated regions, or containing fixed differences at nonsynonymous and splice-altering sites, might be relevant to the observed morphological, ecological, and behavioral differences between abietinus and tristis. However, detailed additional studies are obviously needed before any of these can be considered strong candidates for relevant trait differences. Further ahead still is the identification of potential links between candidate genes for trait variation and reproductive isolating mechanisms, and it is unlikely that standard genome scan approaches can be used to trace such links (Baird 2017; Buerkle 2017; Ellegren and Wolf 2017; Elmer 2017; Feder ; Jiggins and Martin 2017; Ravinet ; Wagner and Mandeville 2017; Westram and Ravinet 2017). We foresee that the next step will be more extensive sampling within the sympatric region, including birds with intermediate phenotypes and using detailed morphological and behavioral studies, combined with resequencing efforts of candidate regions, to find significant associations between specific alleles and phenotypic traits of interest.It should be noted that a reference-assisted assembly cannot detect structural changes between the target organism and the reference. A consequence might be that visual inspection of patterns of genetic variation might be affected. Since birds in general have conserved karyotypes and comparatively few large structural rearrangements (Ellegren 2009), this is likely not a major concern, especially when visual inspection is combined with analyses that do not depend on genomic location. As mentioned above, patterns of genetic diversity and differentiation in chiffchaff are remarkably consistent with observations in other avian taxa, but a detailed inspection of the genome-wide patterns reveals that a few rearrangements probably explain some observed abrupt changes in the patterns of genetic variation along chiffchaff chromosomes. This concerns inversions ranging from 4 to 25 Mb in chromosomes 3, 4, 5, 20, and Z (Figure S8 in File S1). The structural changes do not affect the qualitative interpretation of the results in this study since they constitute but a minor part of the genome, and do not overlap with any of the major differentiation islands observed.To conclude, this study contributes to the increasing body of knowledge about genome divergence processes under natural settings, and highlights the chiffchaff species complex as an important model system to study the genomics of speciation and adaptation in songbirds. In a general perspective, this work will help in the overall understanding of the speciation process, and, in particular, how secondary contact affects patterns of genome differentiation between diverging lineages.
Supplementary Material
Supplemental material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.117.300152/-/DC1.Click here for additional data file.
Authors: Simon H Martin; Kanchon K Dasmahapatra; Nicola J Nadeau; Camilo Salazar; James R Walters; Fraser Simpson; Mark Blaxter; Andrea Manica; James Mallet; Chris D Jiggins Journal: Genome Res Date: 2013-09-17 Impact factor: 9.043
Authors: Reto Burri; Alexander Nater; Takeshi Kawakami; Carina F Mugal; Pall I Olason; Linnea Smeds; Alexander Suh; Ludovic Dutoit; Stanislav Bureš; Laszlo Z Garamszegi; Silje Hogner; Juan Moreno; Anna Qvarnström; Milan Ružić; Stein-Are Sæther; Glenn-Peter Sætre; Janos Török; Hans Ellegren Journal: Genome Res Date: 2015-09-09 Impact factor: 9.043
Authors: Venkat Talla; Amanda A Pierce; Kandis L Adams; Tom J B de Man; Sumitha Nallu; Francis X Villablanca; Marcus R Kronforst; Jacobus C de Roode Journal: Mol Ecol Date: 2020-07-11 Impact factor: 6.185
Authors: Matthew J Christmas; Julia C Jones; Anna Olsson; Ola Wallerman; Ignas Bunikis; Marcin Kierczak; Valentina Peona; Kaitlyn M Whitley; Tuuli Larva; Alexander Suh; Nicole E Miller-Struttmann; Jennifer C Geib; Matthew T Webster Journal: Mol Biol Evol Date: 2021-07-29 Impact factor: 16.240