Kathleen A Dogantzis1, Tanushree Tiwari1, Ida M Conflitti1, Alivia Dey1, Harland M Patch2, Elliud M Muli3,4, Lionel Garnery5, Charles W Whitfield6, Eckart Stolle7, Abdulaziz S Alqarni8, Michael H Allsopp9, Amro Zayed1. 1. Department of Biology, York University, 4700 Keele Street, Toronto, M3J 1P3 Ontario, Canada. 2. Department of Entomology, The Pennsylvania State University, State College, PA, USA. 3. Department of Life Science, South Eastern Kenya University (SEKU), P.O. Box 170-90200, Kitui, Kenya. 4. 4International Center of Insect Physiology and Ecology (icipe), P. O. Box 30772-00100, Nairobi, Kenya. 5. Laboratoire Evolution Génome Comportement Ecologie (EGCE) UMR 9191, Gif sur-Yvette, France. 6. Department of Entomology, University of Illinois at Urbana-Champaign, Urbana, IL, USA. 7. LIB-Leibniz Institute for the Analysis of Biodiversity Change Museum Koenig, Center of Molecular Biodiversity Research Adenauerallee 160, 53113 Bonn, Germany. 8. Department of Plant Protection, College of Food and Agriculture Sciences, King Saud University, Riyadh, Saudi Arabia. 9. Plant Protection Research Institute, Agricultural Research Council, Stellenbosch, South Africa.
Abstract
The origin of the western honey bee Apis mellifera has been intensely debated. Addressing this knowledge gap is essential for understanding the evolution and genetics of one of the world’s most important pollinators. By analyzing 251 genomes from 18 native subspecies, we found support for an Asian origin of honey bees with at least three expansions leading to African and European lineages. The adaptive radiation of honey bees involved selection on a few genomic “hotspots.” We found 145 genes with independent signatures of selection across all bee lineages, and these genes were highly associated with worker traits. Our results indicate that a core set of genes associated with worker and colony traits facilitated the adaptive radiation of honey bees across their vast distribution.
The origin of the western honey bee Apis mellifera has been intensely debated. Addressing this knowledge gap is essential for understanding the evolution and genetics of one of the world’s most important pollinators. By analyzing 251 genomes from 18 native subspecies, we found support for an Asian origin of honey bees with at least three expansions leading to African and European lineages. The adaptive radiation of honey bees involved selection on a few genomic “hotspots.” We found 145 genes with independent signatures of selection across all bee lineages, and these genes were highly associated with worker traits. Our results indicate that a core set of genes associated with worker and colony traits facilitated the adaptive radiation of honey bees across their vast distribution.
The genus Apis is composed of 12 extant species that form three distinct groups: giant honey bees, dwarf honey bees, and cavity-nesting honey bees (–). All but one of the extant Apis species are endemic to Asia. The exception, Apis mellifera, is native to Europe, Africa, and Western Asia. Given the wide geographic spread of the species, A. mellifera has diversified into several subspecies (, ), of which there are approximately 10 subspecies in Africa, 9 in Asia, and potentially as many as 13 subspecies in Europe (). Each subspecies can be genetically and morphologically classified into at least five distinct evolutionary lineages: the M lineage of Eurasia, the C lineage of Europe, the O and Y lineages of Western Asia, and the A lineage of Africa (, ). Although it is reasonably accepted that the genus emerged in Asia, the ancestral origin and adaptive radiation of contemporary A. mellifera lineages and subspecies remain unresolved.Early fossil records from the Oligocene (34 to 23 Ma ago) place ancestral Apis within Europe, followed by a migration of the genus during the Late Oligocene or during the Miocene (23 to 5.5 Ma ago) (, ). It has been hypothesized that ancestral Apis migrated from Europe into Asia, where it diversified into the three modern lineages of Apis, including A. mellifera of the cavity-nesting bees (, ). Alternatively, it has also been proposed that ancestral Apis remained widespread throughout Europe and Asia, where near the end of the Miocene, Apis colonized Africa via the Iberian Peninsula leading to the origin of A. mellifera, while the remaining extent Apis species descended from ancestors in Asia (). These different hypotheses about the biogeography and diversification of the Apis genus are important for understanding the two competing hypotheses regarding the origin of A. mellifera in Asia (–) or Africa (, ). The expansion from Asia is predicted to have occurred via two northwestern routes into Europe, one consisting of the M lineage and another consisting of the C and O lineages, and a colonization route extending into Africa (A lineage) (). However, it has also been proposed that the route into Africa could have acted as a western expansion of the M lineage into Europe (). Comparatively, the species expansion from Africa is predicted to have occurred via two or three expansion routes including the colonization of the M lineage via the Iberian Peninsula, and then the C and O lineages through Northeast Africa and Western Asia ().Resolving the ancestral origin and evolutionary expansion of A. mellifera will enhance our ability to identify derived and ancestral genetic mutations. This is especially relevant for tracing the evolution of derived phenotypes and for discerning how locally adapted subspecies may contribute to the fitness and diversity of managed colonies (). Recent genomic studies of A. mellifera have shown that with the addition of new subspecies and enhanced datasets (, ), estimates of evolutionary origin can change. As such, the increased representation of samples from Africa and Western Asia—two historically undersampled regions (, )—may be the key to resolving the out-of-Africa and out-of-Asia debate.Here, we used an extensive population genomic dataset composed of 251 individuals and 18 putatively identified subspecies from Europe (N = 4), Africa (N = 8), and Asia (N = 6) to elucidate the evolutionary and adaptive origins of A. mellifera. These samples were collected throughout the native distribution of A. mellifera, with a concentrated effort on filling population and subspecies gaps within Africa and Western Asia. In this study, we aimed to evaluate the population structure of subspecies and determine their lineage classification, define evolutionary relationships using phylogenetic reconstruction, and use biogeography to estimate the most likely ancestral range of the species. Last, we assessed patterns of selection among lineages to identify and categorize the genomic regions associated with the adaptative radiation of the species.
RESULTS
Sequencing and variant detection
We curated a genomic dataset of 251 individual A. mellifera samples representing 18 putative subspecies, of which 14 representative groups were retained (Supplementary Text and data S1). The dataset is composed of several previously published samples (–) and 160 newly sequenced individuals that were collected across temporally and spatially diverse ranges to broaden the representation of populations. The average coverage depth for newly sequenced samples was 66.1 ± 17.7x. After filtering raw variants, we retained a working dataset of 11.8 million single-nucleotide polymorphisms (SNPs).
Population structure and admixture
Using several population structure analyses, we investigated patterns of clustering and admixture among honey bee samples (Fig. 1B and figs. S1 to S3). The cross-validation of the ADMIXTURE analyses revealed the optimal number of genetic clusters to be eight (K = 8). We confirmed the presence of previously identified honey bee evolutionary lineages in Africa (A lineage), Asia (Y and O lineages), Europe (C lineage), and Eurasia (M lineage) (Fig. 1, A and B). Two newly sequenced subspecies formed unique genetic clusters warranting classification as distinct lineages: A. m. lamarckii of Egypt (L lineage) and A. m. unicolor of Madagascar (U lineage) (Fig. 1, A and B). At K = 8, A. m. intermissa (North Africa), a highly admixture subspecies (27%), is identified as an independent genetic cluster (fig. S1). However, this cluster is not consistent with other K values (fig. S1) and likely does not represent a true lineage, but rather an artifact of high genetic admixture. As such, seven genetically distinct groups more accurately represent the number of biologically relevant lineages.
Fig. 1.
Population structure and phylogenetic reconstruction of A. mellifera.
(A) Map of the native distribution of the seven genetically distinct lineages. (B) Patterns of ancestry and population structure identified with ADMIXTURE when K = 7. Vertical bars represent individual bees, and colored segments represent the proportion of ancestry to the different clusters. (C) Evolutionary relationships among A. mellifera samples reconstructed with a neighbor-joining tree using SNPs located genome-wide. Asterisks represent node support of 100%. (D) Evolutionary relationships among A. mellifera samples constructed with a neighbor-joining tree using SNPs located within protein-coding regions. Asterisks represent node support of 100%. Node support and maximum likelihood phylogenetic trees can be found in the Supplementary Materials.
Population structure and phylogenetic reconstruction of A. mellifera.
(A) Map of the native distribution of the seven genetically distinct lineages. (B) Patterns of ancestry and population structure identified with ADMIXTURE when K = 7. Vertical bars represent individual bees, and colored segments represent the proportion of ancestry to the different clusters. (C) Evolutionary relationships among A. mellifera samples reconstructed with a neighbor-joining tree using SNPs located genome-wide. Asterisks represent node support of 100%. (D) Evolutionary relationships among A. mellifera samples constructed with a neighbor-joining tree using SNPs located within protein-coding regions. Asterisks represent node support of 100%. Node support and maximum likelihood phylogenetic trees can be found in the Supplementary Materials.We detected additional patterns of admixture among subspecies, notably within A. m. syriaca, which is composed primarily of O lineage ancestry (76.8%) and is admixed with the A (12.6%), Y (4.4%), and L (4.4%) lineages (fig. S1). As noted in previous studies, A. m. syriaca is located within a contact zone between Africa (A and L) and Asia (O and Y) (Fig. 1A), which is likely the contributor to high levels of hybridization (, ). Introgression of the C lineage into the L and M lineages was detected (fig. S1). These admixture patterns are expected given the close geographic proximity of these lineages, and M and C lineage admixture has been documented extensively (–). Last, we detect varying levels of admixture in samples from Kyrgyzstan (A. m. pomonella), with some samples displaying high levels of C lineage ancestry (fig. S1) likely from imported European colonies used for commercial beekeeping.
Phylogenetic and biogeographic reconstruction
We constructed several phylogenetic trees using three different combinations of SNPs to determine the evolutionary relationships among A. mellifera samples (Fig. 1, C and D, and figs. S4 to S7). Our analyses resolved two topologies that differed slightly with respect to the placement of the Y lineage (Fig. 1, C and D). Subspecies are consistently clustered into previously recognized lineages, and two definitive clades are defined by the separation of the M, C, and O lineages from the L, A, and U lineages. Divergence dating based on nuclear coding sequences suggests that A. mellifera lineages may have begun to diverge as early as c. 6 Ma ago (fig. S8).To predict the most likely ancestral range of the species and major clades, we applied a biogeographic reconstruction to both resolved topologies (table S1). The ancestral range for the most recent common ancestor of the species was predicted to be in Asia with 64.5 to 71.4% probability, while probabilities for an African or European ancestral range were much lower (<6%) (Fig. 2, fig. S9, and table S2). This finding complements a recent independent study that predicted the ancestral range for cavity-nesting bees to be in Southeast Asia (). The ancestral range of the most recent common ancestor of the M, C, and O clades was predicted to be in Asia with a 70% probability, while the ancestral range of the L, A, and U clades varied (70% Asia or 100% Africa) depending on the topology (Fig. 2, fig. S9, and table S2). Microgeographic classification of subspecies likewise places the ancestral range of the species in Western Asia (fig. S10). The use of an outgroup in biogeographic reconstructions is recommended to prevent the inference of wide ancestral ranges (). We used Apis cerana as an outgroup for this analysis, but choosing a different cavity-nesting bee would not have changed the biogeographic reconstruction, as the ancestor of all cavity-nesting bees is predicted to be in Asia ().
Fig. 2.
Ancestral biogeographic range reconstruction of A. mellifera using two resolved topologies.
The current geographic range of subspecies is indicated at branch tips by letters A (Asia), F (Africa), and E (Europe). Colored bars to the right of the trees indicate the lineage association of the subspecies. Pie charts at nodes indicate the marginal maximum likelihood probabilities for the estimated ancestral range. The ancestral range is predicted to be in Asia, with an estimated probability of 64 to 73%. (A) represents the topology reconstructed using SNPs located throughout the genome, while (B) represents the topology reconstructed with SNPs located in protein coding regions. Node probabilities and the biogeographic reconstruction of the Apis genus can be found in the Supplementary Materials.
Ancestral biogeographic range reconstruction of A. mellifera using two resolved topologies.
The current geographic range of subspecies is indicated at branch tips by letters A (Asia), F (Africa), and E (Europe). Colored bars to the right of the trees indicate the lineage association of the subspecies. Pie charts at nodes indicate the marginal maximum likelihood probabilities for the estimated ancestral range. The ancestral range is predicted to be in Asia, with an estimated probability of 64 to 73%. (A) represents the topology reconstructed using SNPs located throughout the genome, while (B) represents the topology reconstructed with SNPs located in protein coding regions. Node probabilities and the biogeographic reconstruction of the Apis genus can be found in the Supplementary Materials.
Contemporary patterns of diversity and demography
Recent demographic events, notably the last glacial period where temperate populations were constrained and the A lineage expanded to its population maxima (), have likely shaped patterns of genetic diversity and effective population size among contemporary populations. For instance, genetic diversity is highest among the A lineage (π = 3.54 × 10−3, θ = 1.01 × 10−2), relative to European (average π =1.48 × 10−3, θ = 1.84 × 10−3) and Asian (average π =1.84 × 10−3, θ = 1.83 × 10−3) lineages (table S3). Likewise, estimates of Ne were considerably larger for the A lineage (~640,000), relative to European or Asian lineages (~116,000) (table S3), as previously documented (). The U lineage of Madagascar had relatively high levels of diversity (π =2.33 × 10−3, θ =1.70 × 10−3) and effective population size (~107,000) relative to European and some Asian subspecies but considerably less than its parent lineage (A) of mainland Africa (table S7). In addition, we find that linkage disequilibrium (LD) is lowest and decays the quickest among A lineage samples, consistent with high estimates of Ne. Comparatively, LD was high among European and the U lineage, consistent with low Ne and historical population bottlenecks (figs. S11 and S12).Measures of pairwise FST between lineages were high (0.528 ± 0.149) (table S5), while estimates between subspecies within the same lineage were low (0.163 ± 0.073) (table S6). Outgroup f3 statistics, which are less sensitive to lineage-specific drift (), were used to quantify the genetic distance between lineages relative to an outgroup (A. cerana). The analysis identified high f3 between the O and C lineages, affirming a longer shared evolutionary history (fig. S13). Pairwise f3 values between the A, L, and U lineages were also high, suggesting a close evolutionary relationship between African lineages (fig. S13). Last, we observed high f3 values between the M and C lineages (fig. S13), despite having high genetic differentiation (FST = 0.66) (table S5), suggesting a more recent common ancestor, but rapid divergence between the lineages. Overall, the relationships identified by f3 statistics are congruent with the evolutionary relationships suggested by the phylogenetic tree and structure analyses.
Patterns of selection across the genome
We studied the adaptive radiation of honey bee lineages by identifying patterns of positive selection inferred from pairwise estimates of outlier genetic differentiation (FST) at SNP loci (table S7 and data S2). Here, we focused our analyses on lineage-specific outliers defined as mutations that show extreme values of FST (highest 5%) in all pairwise comparisons involving a focal lineage. We excluded samples with high levels of recent admixture and separately analyzed European and Asian M lineage subspecies given that they likely experience disparate selective pressures (). In addition, the A lineage was excluded from analyses because of a relatively low number of lineage-specific outlier markers (table S7 and Supplementary Text). While we and others (, ) have found a substantive number of outlier mutations when comparing the A group to any other honey bee lineage, there were very few loci with outlier FST in all six pairwise comparisons involving the A lineage. This may be the result of demographic effects, shared evolutionary relationships, or local adaptation among the A lineage subspecies (Supplementary Text).Outlier SNPs were enriched within protein coding (χ2, P < 6.43 × 10−11) and putative promoter regions (χ2, P < 3.12 × 10−2) of most lineages (table S8). Comparatively, introns were deficient of outlier SNPs, which was significant among four lineages (χ2, P < 2.79 × 10−2) (table S8). Although each outlier SNP is distinct to a lineage, we found that their distribution was concentrated among a relatively small set of genomic hotspots as evidenced by a significant overlap of genes with outlier SNPs between pairwise lineage comparisons (781 ± 263 genes; P < 3.57 × 10−50) (Fig. 3 and table S9). In addition, 145 genes contained at least one outlier SNP across all honey bee lineages. We used gene and SNP resampling simulations to confirm that the overlap of genes and the distribution of outlier SNPs were not due to chance or gene length. Gene resampling indicated that the observed overlap between pairwise lineages was, on average, 140% greater relative to the simulations (table S10). Likewise, randomly resampling outlier SNPs within genes, which corrects for the possibility that larger genes are more likely to have outlier loci because they tend to have more SNPs, indicated that the average number of lineages that overlapped across genes was significantly lower (29.6%; Mann-Whitney test, P < 7.17 × 10−48), relative to the simulations. This suggests that outliers in our dataset are concentrated in a smaller set of genes than expected by chance.
Fig. 3.
Proportion of genes that overlap among lineages and across the A. mellifera genome.
(A) to (G) illustrate the proportion of genes that are either unique to a lineage or share signs of selection among one to six other lineages. (H) illustrates the proportion of genes across the genome that have outlier SNPs among no lineages to all seven lineages.
Proportion of genes that overlap among lineages and across the A. mellifera genome.
(A) to (G) illustrate the proportion of genes that are either unique to a lineage or share signs of selection among one to six other lineages. (H) illustrates the proportion of genes across the genome that have outlier SNPs among no lineages to all seven lineages.
Genes associated with the adaptive radiation of A. mellifera lineages
Loci underlying adaptive divergence were enriched for Gene Ontology (GO) terms (data S3) related to morphogenesis and development of tissues and organs, including wing development, sensory organs, eye, muscle, and appendages, as well as development during the larvae and pupae stages. Notably, gene GB48653 was found to be under selection among all lineages and is orthologous to homothorax (FBgn0001235) in Drosophila melanogaster, which is important for antennal development, appendage patterning, and cell division of the eye field (). We also found enrichment of GO terms related to neuron development and receptor and signaling activity. There was also evidence for enrichment of genes related to learning and memory, as well as behavior, including olfactory, aggression, and mating. Gene GB42603 (NLG3) was found to be under selection among all lineages, and it is posited that changes in gene regulation may affect memory and learning tasks (). In addition, among the 145 genes under selection across all lineages, there were several genes that have been found to be associated with colony behavior traits including colony defense (), immunity (–), and the production of honey and royal jelly (data S2) (). Intriguingly, we find that several genes overlap among colony traits. For example, three genes (GB54493, GB51389, and GB40915) overlapped between Varroa response and colony defense. Among genes associated with royal jelly production, two (GB52279 and GB43012) were found to overlap with colony defense and Varroa infection, respectively. Last, GB42671, which was associated with honey production, was also associated with Varroa infection.To further understand the phenotypic context of local adaptation of western honey bees, we evaluated the association between genes with outlier SNPs and queen and worker castes. We used published datasets to determine differences in the expression of genes in larvae () and proteins in adults (, ) to define genes associated with queen and worker traits (i.e., queen-biased versus worker-biased expression). We found, relative to expected values, that genes associated with local adaptation of honey bee lineages were significantly elevated in the worker caste (χ2, P < 7.16 × 10−4) but often significantly underrepresented in the queen caste (χ2, P < 3.99 × 10−7 larvae; not statistically significant among adults) (Fig. 4A). Likewise, the proportion of genes with outlier SNPs was significantly higher in worker-biased genes, relative to queen-biased genes (χ2, P < 3.11 × 10−2) (Fig. 4B). Last, genes (N = 145) with independent signs of adaptive evolution across all lineages were overwhelmingly more likely to be worker biased (N = 64) than queen biased (N = 0) (Fisher’s exact test, P = 1.35 × 10−13).
Fig. 4.
Association of genes with outlier loci among the worker and queen caste.
(A) Percent change in the observed number of genes with outlier SNPs among queen- and worker-biased genes for larvae and adults, relative to expected values. For example, a negative change suggests an underrepresentation of genes, a positive change represents an enrichment of genes, and no change suggests no difference from expected values. Asterisks represent the degree of significance of the change between observed and expected values (*P < 0.05, **P < 0.01, and ***P < 0.001), while N.S. is not statistically significant. (B) Proportion of genes with outlier SNPs among worker caste– and queen caste–biased genes for larvae and adults for each lineage. Asterisks represent the degree of significant difference between the proportions (*P < 0.05, **P < 0.01, and ***P < 0.001).
Association of genes with outlier loci among the worker and queen caste.
(A) Percent change in the observed number of genes with outlier SNPs among queen- and worker-biased genes for larvae and adults, relative to expected values. For example, a negative change suggests an underrepresentation of genes, a positive change represents an enrichment of genes, and no change suggests no difference from expected values. Asterisks represent the degree of significance of the change between observed and expected values (*P < 0.05, **P < 0.01, and ***P < 0.001), while N.S. is not statistically significant. (B) Proportion of genes with outlier SNPs among worker caste– and queen caste–biased genes for larvae and adults for each lineage. Asterisks represent the degree of significant difference between the proportions (*P < 0.05, **P < 0.01, and ***P < 0.001).
DISCUSSION
Deciphering the ancestral origin of contemporary A. mellifera lineages is a major unsolved question with implications for understanding the evolution of this model eusocial species. There are currently two hypotheses that place the origin of A. mellifera in either Africa (, ) or Asia (–). Our analysis supports the hypothesis of an Asian origin of A. mellifera. A. mellifera likely diverged from other cavity-nesting bees in Southeast Asia () and colonized its current distribution from Western Asia. We find that some of the phylogenetic reconstructions emphasize an ancestral divide between West Asian lineages (Y and O), which are resolved in separate clades. Phylogenies based on protein-coding regions resolve the Y lineage as the most basal branch. Both topologies indicate that the ancestor of contemporary A. mellifera lineages was most likely in Asia. These findings are more congruent with the hypothesis that all extent Apis species descended from a common ancestor in Asia, rather than A. mellifera originating independently in Africa.Once diverged from other cavity-nesting bees, our biogeographic reconstruction provides several hypotheses for how A. mellifera expanded to its current distribution. The M lineage, which forms a distinct evolutionary branch, likely colonized Europe via an independent northern route. Although previous studies hypothesized that the M lineage expanded from Africa (, , ), which was supported by shared genetic similarity with the A lineage, these patterns are likely the result of recent nuclear (, , ) and mitochondrial (–) introgression among geographically proximate populations. The C lineage colonized Southern Europe, which may have once been the southern limit of the M lineage, after splitting from a shared common ancestor with the O lineage in Western Asia. Last, colonization of Africa potentially occurred via two dispersal events from Asia. The L lineage forms its own genetically distinct nuclear cluster and shares mitochondrial origins with some populations from desert Africa (, ) and Western Asia (), notably the Y lineage. In contrast, the A lineage, which comprises the remainder of Africa, has distinct nuclear and mitochondrial () variants and is ancestral to the U lineage.The adaptive radiation of A. mellifera lineages is marked by repeated selection among several genomic hotpots. Notably, there is a significant overlap of genes with outlier loci among pairwise lineages, but also shared among all lineages. Repeated selection among genes has been shown to be common among taxa that descend from a common ancestor and are then exposed to similar environments (). However, recent studies with A. cerana have also uncovered patterns of gene reuse, which may be linked to radiation among diverse habitats (). In our study, we find that genomic hotspots are prevalent among genes related to development, morphogenesis, and behavior. This pattern of selection is consistent with the extensive morphological and behavioral adaptations that have occurred among the species, especially between tropically and temperately adapted bees ().Last, we find that genes with outlier loci are disproportionately related to the worker caste in the form of worker-biased genes and worker-related phenotypes. Evidence for selection among the worker caste has been demonstrated previously () and is hypothesized to be related to the eusocial nature of honey bee colonies (, ). Honey bee colonies are composed of several thousand workers who contribute to important colony tasks such a brood rearing and resource provisioning. Although workers do not lay eggs, natural selection may indirectly select for worker phenotypes to optimize colony fitness (). This is relevant given the diverse colony adaptations that have arisen in response to environmental variables, including traits directly related to colony defense (), immunity (–), and the production of honey and royal jelly (). We also find signs of pleiotropy between worker phenotypes, indicating that not only is repeated selection among a common set of genes prevalent across A. mellifera lineages, but also the same genes are modulating fitness by influencing several different phenotypes.In conclusion, we have presented compelling evidence that A. mellifera emerged in Asia with the remainder of extant honey bees but then expanded into its current distribution via Western Asia. This expansion event is marked by at least three independent colonization routes that gave rise to seven genetically distinct lineages. Modern populations of A. mellifera maintain high genetic diversity, which has allowed the species to adapt to diverse environments through repeated selection among a common set of genes. These genes are often related to worker phenotypes, supporting that the worker caste is key to the adaptative radiation of the species.
METHODS
Data processing
Methods for DNA extraction, genome alignment, and SNP detection are described in detail in the Supplementary Materials. In brief, we generated a dataset of 251 individual A. mellifera samples, of which 160 samples are newly sequenced, with the remaining samples downloaded from the Sequence Read Archive (SRA) in addition to 15 A. cerana genomes (data S1) (). Sequence reads were trimmed of adapters, and low-quality bases use Trimmomatic v0.36 (). Trimmed reads were aligned to the honey bee reference genome () using NextGenMap aligner v0.4.12 (), and duplicate reads were marked with Picard v2.1.0 (https://broadinstitute.github.io/picard/). SNPs were identified and filtered using GATK v3.7 (, ).
Population structure
The program ADMIXTURE v1.3.0 () was used to estimate ancestry proportions and population structure among the 251 A. mellifera samples. To reduce the effects of uninformative and low-frequency variants (), 1 million variants were selected among a pool of SNPs pruned for biallelic loci with a minor allele frequency (MAF) >0.05. To account for LD, SNPs were further pruned (38,493) for a minimum distance of 5000 base pairs (bp)—a distance where LD typically decays to background level in the honey bee genome (). Both analyses were run with predicted K values of 1 to 18 and used the 10× cross-validation procedure to estimate the optimal number of ancestral groups (K). A principal components analysis was generated to examine the genetic relatedness and clustering patterns among A. mellifera samples using the SNPRelate () package in R () with all available SNP markers. Last, we performed a hierarchal structure analysis using identity by state with the SNPRelate () package in R () to qualitatively determine lineage assignment using the snpgdsCutTree function.
Phylogenetic reconstruction
We produced several phylogenetic trees using an SNP dataset pruned of ambiguous loci, as implemented by RaxML v8.2.12 (), and loci with low coverage (<0.8) in A. cerana. Trees were constructed using three different datasets: (i) SNPs located genome-wide (2,126,091), (ii) SNPs within coding regions (276,602), and (iii) randomly selected SNPs located among intrageneric and intergenic regions (276,602). Neighbor-joining trees were constructed with all three SNP sets using allele-sharing distance with Adegenet () and Ape () in R (). Confidence levels for bipartitions in the neighbor-joining tree were calculated using 100 bootstrap replicates as implemented in Ape (). Maximum likelihood trees were constructed using SNP sets 2 and 3 using the program RaxML v8.2.12 (). Trees were constructed with the gamma model of rate heterogeneity (ASC_GTRGAMMA) with the Lewis ascertainment bias correction. A 100 rapid bootstrap analysis and search for the best-scoring tree were performed in a single program run. Last, the program TreeMix v1.13 () was used to produce maximum likelihood trees using dataset 1. TreeMix infers population splits using genome-wide allele frequency data at the population level. The program assumes biallelic loci with no missing data; thus, missing genotypes were imputed using Beagle v5.0 (), and only biallelic loci were retained (1,884,783 genomic SNPs). The analysis was performed with samples grouped into their respective lineages and previously determined subspecies grouping (data S1 and Supplementary Text).
Divergence time estimation
Divergence times were estimated with PAML 4.9 () using both resolved phylogenetic topologies. We used the putative coding regions of nonoverlapping genes, with high (>0.9) sequence coverage among the outgroup (A. cerana), concatenated into one supergene. For ease of phylogenetic reconstruction, we did not include the A. m. monticola cluster, which is clearly established within the A lineage. We also did not include A. m. pomonella or A. m. syriaca due to high levels of admixture. First, the substitution rate was estimated using BASEML in PAML 4.9 (). We used the REV [general-time-reversible (GTR)] model with the strict molecular clock and calibrated the divergence time between A. mellifera and A. cerana at 7.5 Ma ago (, , ) using a time unit of 100 Ma ago (@0.075). The calculated substitution rate per unit of time was used to calculate the rgene_gamma variable using the shape α and scale parameter β equations as per the PAML manual (). Divergence times were estimated following the two-step approximate likelihood calculation with the MCMCtree package in PAML 4.9 (). We used the REV (GTR) model with independent clock rates, and root age was bound between >0.06<0.09 (, , ) using a time unit of 100 Ma ago. The process was run for 10,000 samples, sampling every 10 iterations, after a burn-in of 50,000, for a total of 150,000 iterations.
Ancestral biogeography reconstruction
To infer the biogeographic history of A. mellifera, we estimated the most probable model of geographic range expansion on the divergence time tree of both topologies using the R package BioGeoBEARS (, ). BioGeoBEARS uses three different models of geographic range evolution: Extinction Cladogenesis (DEC) (, ), a likelihood version of dispersal-vicariance analysis (DIVA) (), and a likelihood version of BayArea (BAYAREA) (). In addition, BioGeoBEARS can incorporate a jump dispersal or founder event speciation into the model, generating three additional models DIVA+J, DEC+J, and BAYAREA+J. We defined three biogeographic areas based on the current A. mellifera distribution: Europe (E), Africa (F), and Asia (A). We tested all six biogeographic models provided by BioGeoBEARS and used the Akaike information criterion and the log of the likelihood scores (LnL) to compare models and determine the best fit to the phylogeny.
Genetic diversity, genetic differentiation, and demography
We calculated several diversity and demographic statistics among lineage and subspecies groupings (data S1). Nucleotide diversity (π) was calculated in 500-bp sliding windows with a 250-bp step size using VCFtools v0.1.17 (). Segregating sites (S) were calculated by counting the number of polymorphic loci, and singletons (Ssingletons) were calculated by counting the number of sites with only one copy of an allele. To estimate theta (θ), we used the equation = S/an, where S is the number of segregating sites, and an is the harmonic number of n − 1, where n is the number of chromosomes. To obtain the per–base pair estimate of θ, was divided by the total number loci that had sufficient coverage (≥0.8) across the entire genome. To estimate the effective population size (N), we used Watterson’s theta estimator () θ = 3Nμ (3 is used because A. mellifera is haplodiploid), where μ is the mutation rate. N was calculated using two estimates of mutation rate 5.27 × 10−9 () and 3 × 10−9 (). We calculated LD as a measure of the squared correlation coefficient between variants (r2). LD was measured within 5000-bp windows using SNP variants that had ≥0.8 coverage and an MAF >0.05 using VCFtools (). The pairwise FST matrix was calculated using Weir and Cockerham’s weighted FST statistic with VCFtools v 0.1.17 (). Last, the program ADMIXTOOLS () was used to calculate outgroup f3 statistics (), which can be used to quantify the genetic distance between populations relative to an outgroup, A. cerana, where higher values imply longer shared evolutionary time or greater share genetic drift.
Detecting and annotating loci under selection
We identified patterns of positive selection by means of outlier differentiation using pairwise measures of Weir and Cockerham’s weighted FST statistic with VCFtools v0.1.17 (). The genome-wide distribution of FST was measured between each pairwise lineage, and loci consistently within the top 0.95 quantile across each pairwise distribution were considered unique measures of genome outliers. This analysis was performed on markers that had an MAF >0.05 in at least one lineage and had ≥0.8 coverage across all lineages; we used 3,183,349 SNPs for this analysis. We exclude highly admixed samples (data S1) and divided the European and Asian M lineage subspecies into separate populations as they are likely experiencing disparate selective pressures (). We used the program SNPeff v4.3t () to annotate SNPs at the gene and functional category levels, including exons, introns, and promoter regions, which were defined as the sequence 1000 bp upstream of the start codon of a gene () and excluded regions that overlapped with neighboring genes. In addition, SNPeff v4.3t was used to predict mutation effects on genes, such as amino acid changes. Last, GO enrichment was conducted with DAVID v6.8 () using D. melanogaster orthologs (). GO functional annotation clusters with an enrichment score ≥1.3 and GO terms with P < 0.05 after Benjamini-Hochberg correction were of interest.
Resampling simulations
To ensure the overlap of outliers among genes was not due to chance, we used gene and SNP resampling to simulate the overlap of genes across lineages. Gene resampling was achieved by randomly selecting, from the background set of genes (12,916), the corresponding number of genes associated with outlier loci within each lineage (data S2 and table S9). We then made pairwise comparisons between lineages to calculate the number of genes that overlapped between the randomly resampled lists. Simulations were repeated for 1000 iterations to generate a null distribution. We then carried out an additional analysis that takes into account gene size. Large genes may be expected to have more outlier SNPs per lineage, thus leading to greater overlap among lineages. In our dataset, we observed a total of 36,678 outlier SNPs in genes across the seven lineages studied (table S9). We simulated the null distribution of overlap (i.e., genes with different outliers in more than one lineage) by randomly generating 36,678 unique outlier SNPs, corresponding to the same number of outliers per lineage as observed in our dataset (table S9), across an equivalent coding genome as studied here (i.e., same number of genes with identical sizes as predicted in the honey bee genome) (data S2). In our simulations, the probability of observing an outlier locus within a gene scales linearly with the gene’s size. After each simulation, we computed the average number of lineages with different outlier SNPs in the same gene and the number of genes with unique outlier SNPs in all seven lineages. We ran this simulation for 100 iterations and then compared the null distribution of these two parameters to our observed data. There were clear significant differences between the null distribution of the 100 iterations and our observed data that additional iterations were not necessary.
Differential gene expression
We identified caste differentially expressed genes between 96-hour-old queen and workers (). Reads were downloaded from the SRA (BioProject PRJNA260604) and trimmed of adapters and low-quantity bases (<20) using Trimmomatic v0.36 (). Trimmed reads were then aligned to the honey bee reference genome () using multisample two-pass mapping with STAR v2.7 (). Using the aligned RNA sequencing data, a matrix of unnormalized read counts was constructed for annotated gene regions using featureCounts in Subread v2.0.1 (). Last, DESeq2 () in R () was used to identify differentially expressed genes. We used the matrix constructed with featureCounts as the countData and set the condition to caste phenotypes (queen or worker). We analyzed 10,000 genes and determined them to be differentially expressed between castes if they passed the following thresholds: fold change of ≥1.5, false discovery rate <0.05 after applied SE, and gene-level read counts ≥10 per individual in the up-regulated caste. In addition, we used a protein atlas, which examined protein expression across 26 tissues in queen and worker honey bees (). Harpur et al. () had previously generated mutually exclusive queen- and worker-biased proteins from this resource (1582 proteins), which we used for our analysis.
Authors: Maanasa Raghavan; Matthias Steinrücken; Kelley Harris; Stephan Schiffels; Simon Rasmussen; Michael DeGiorgio; Anders Albrechtsen; Cristina Valdiosera; María C Ávila-Arcos; Anna-Sapfo Malaspinas; Anders Eriksson; Ida Moltke; Mait Metspalu; Julian R Homburger; Jeff Wall; Omar E Cornejo; J Víctor Moreno-Mayar; Thorfinn S Korneliussen; Tracey Pierre; Morten Rasmussen; Paula F Campos; Peter de Barros Damgaard; Morten E Allentoft; John Lindo; Ene Metspalu; Ricardo Rodríguez-Varela; Josefina Mansilla; Celeste Henrickson; Andaine Seguin-Orlando; Helena Malmström; Thomas Stafford; Suyash S Shringarpure; Andrés Moreno-Estrada; Monika Karmin; Kristiina Tambets; Anders Bergström; Yali Xue; Vera Warmuth; Andrew D Friend; Joy Singarayer; Paul Valdes; Francois Balloux; Ilán Leboreiro; Jose Luis Vera; Hector Rangel-Villalobos; Davide Pettener; Donata Luiselli; Loren G Davis; Evelyne Heyer; Christoph P E Zollikofer; Marcia S Ponce de León; Colin I Smith; Vaughan Grimes; Kelly-Anne Pike; Michael Deal; Benjamin T Fuller; Bernardo Arriaza; Vivien Standen; Maria F Luz; Francois Ricaut; Niede Guidon; Ludmila Osipova; Mikhail I Voevoda; Olga L Posukh; Oleg Balanovsky; Maria Lavryashina; Yuri Bogunov; Elza Khusnutdinova; Marina Gubina; Elena Balanovska; Sardana Fedorova; Sergey Litvinov; Boris Malyarchuk; Miroslava Derenko; M J Mosher; David Archer; Jerome Cybulski; Barbara Petzelt; Joycelynn Mitchell; Rosita Worl; Paul J Norman; Peter Parham; Brian M Kemp; Toomas Kivisild; Chris Tyler-Smith; Manjinder S Sandhu; Michael Crawford; Richard Villems; David Glenn Smith; Michael R Waters; Ted Goebel; John R Johnson; Ripan S Malhi; Mattias Jakobsson; David J Meltzer; Andrea Manica; Richard Durbin; Carlos D Bustamante; Yun S Song; Rasmus Nielsen; Eske Willerslev Journal: Science Date: 2015-07-21 Impact factor: 47.728
Authors: Zachary L Fuller; Elina L Niño; Harland M Patch; Oscar C Bedoya-Reina; Tracey Baumgarten; Elliud Muli; Fiona Mumoki; Aakrosh Ratan; John McGraw; Maryann Frazier; Daniel Masiga; Stephen Schuster; Christina M Grozinger; Webb Miller Journal: BMC Genomics Date: 2015-07-10 Impact factor: 3.969
Authors: Queenie W T Chan; Man Yi Chan; Michelle Logan; Yuan Fang; Heather Higo; Leonard J Foster Journal: Genome Res Date: 2013-07-22 Impact factor: 9.043