Shuai Zhan1, Wei Zhang2, Kristjan Niitepõld3, Jeremy Hsu4, Juan Fernández Haeger5, Myron P Zalucki6, Sonia Altizer7, Jacobus C de Roode8, Steven M Reppert9, Marcus R Kronforst2. 1. 1] Key Laboratory of Insect Developmental and Evolutionary Biology, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200032, China [2] Department of Ecology &Evolution, University of Chicago, Chicago, Illinois 60637, USA [3] Department of Neurobiology, University of Massachusetts Medical School, Worcester, Massachusetts 01605, USA. 2. Department of Ecology &Evolution, University of Chicago, Chicago, Illinois 60637, USA. 3. 1] Department of Biology, Stanford University, Stanford, California 94305, USA [2] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland. 4. Department of Biology, Stanford University, Stanford, California 94305, USA. 5. Departamento de Botánica, Ecología y Fisiología Vegetal, Universidad de Córdoba, 14071 Córdoba, Spain. 6. School of Biological Sciences, The University of Queensland, Brisbane, Queensland 4072, Australia. 7. Odum School of Ecology, University of Georgia, Athens, Georgia 30602, USA. 8. Department of Biology, Emory University, Atlanta, Georgia 30322, USA. 9. Department of Neurobiology, University of Massachusetts Medical School, Worcester, Massachusetts 01605, USA.
Abstract
The monarch butterfly, Danaus plexippus, is famous for its spectacular annual migration across North America, recent worldwide dispersal, and orange warning colouration. Despite decades of study and broad public interest, we know little about the genetic basis of these hallmark traits. Here we uncover the history of the monarch's evolutionary origin and global dispersal, characterize the genes and pathways associated with migratory behaviour, and identify the discrete genetic basis of warning colouration by sequencing 101 Danaus genomes from around the globe. The results rewrite our understanding of this classic system, showing that D. plexippus was ancestrally migratory and dispersed out of North America to occupy its broad distribution. We find the strongest signatures of selection associated with migration centre on flight muscle function, resulting in greater flight efficiency among migratory monarchs, and that variation in monarch warning colouration is controlled by a single myosin gene not previously implicated in insect pigmentation.
The monarch butterfly, Danaus plexippus, is famous for its spectacular annual migration across North America, recent worldwide dispersal, and orange warning colouration. Despite decades of study and broad public interest, we know little about the genetic basis of these hallmark traits. Here we uncover the history of the monarch's evolutionary origin and global dispersal, characterize the genes and pathways associated with migratory behaviour, and identify the discrete genetic basis of warning colouration by sequencing 101 Danaus genomes from around the globe. The results rewrite our understanding of this classic system, showing that D. plexippus was ancestrally migratory and dispersed out of North America to occupy its broad distribution. We find the strongest signatures of selection associated with migration centre on flight muscle function, resulting in greater flight efficiency among migratory monarchs, and that variation in monarch warning colouration is controlled by a single myosin gene not previously implicated in insect pigmentation.
Every year millions of monarch butterflies fly from the northern United States and southern Canada to overwinter in Mexico. Amazingly, during this portion of the annual migration, individual butterflies emerge as adults in the north, fly thousands of kilometers south, overwinter for months in reproductive diapause, and finally begin mating and flying north in the spring. Recolonization of northern latitudes takes place over the course of three to four subsequent generations, after which it is late summer again and the process repeats itself. Although most North American monarchs overwinter in Mexico, those that live west of the Rocky Mountains generally overwinter along the California coast[1,2]. Monarch migration has been studied extensively over the past few decades, with particular emphasis on tracking migration routes and overwintering sites[1,3-7], as well as characterizing the navigational mechanisms that guide this complex behavior[8-13]. A largely unappreciated aspect of this system is that not all monarchs migrate. In fact, the geographic distribution of D. plexippus extends far beyond North America and it does not migrate across most of its range. For instance, while the monarch undergoes extensive, long-distance migration across North America, it also exists throughout Central America, South America, and the Caribbean, where it does not migrate[14-16]. Furthermore, the monarch has recently dispersed to many locations around the globe where it does not exhibit the same long-distance migration found in North America[17,18].The molecular genetic mechanisms that contribute to migration are likely to be complex, spanning pathways related to navigation and circadian rhythms, environmental sensing, energy production and metabolism, thermal tolerance, reproduction and longevity, neuromuscular development, and phenotypic plasticity[11,19,20]. Because D. plexippus exists as both migratory and non-migratory populations, we sought to use comparative population genomics to characterize the genetic basis of migration by identifying genome-wide targets of divergent natural selection associated with shifts in migratory behavior. To do this, we sequenced the genomes of 89 butterflies - 80 D. plexippus and nine samples from four additional Danaus species - from across their worldwide distribution (Fig. 1a, b) and analyzed them using the monarch reference genome sequence[20].
Figure 1
Global dispersal of the monarch butterfly
a, Monarch butterfly sampling locations. b, Inferred phylogeny among Danaus species based on maximum likelihood analysis of 3,714 single-copy genes. c, Neighbor-joining phylogeny of all D. plexippus individuals, based on genome-wide SNP data. d, NJ consensus tree based on 1,000 bootstrap replicates. e, PCA plots based on the first two principal components; inset shows separation between North America and south Florida. f, Genetic structure and individual ancestry; colors in each column represent ancestry proportion over range of population sizes K = 2-11.
To guide our analysis, we first characterized the evolutionary origin of D. plexippus and its history of dispersal. The genus Danaus is broadly tropical and non-migratory[14] which suggests non-migratory populations of D. plexippus in South and Central America may represent the ancestral source of the North American population[21], similar to the “southern home theory” for the origin of seasonal migration in birds[22]. More recently, monarchs dispersed across the Pacific, throughout Oceania and Australia, and they also dispersed across the Atlantic to Europe and Africa[17]. It is unknown whether the Pacific and Atlantic dispersal events were independent, but both are thought to derive from North American migratory monarchs[17,23]. There is also an enigmatic non-migratory population in south Florida[15], which may result from overwintering migratory butterflies that fail to return north.
Evolutionary and demographic history
Our analysis, based on genome-wide SNP variation (approximately 32 million SNPs), revealed a history quite distinct from our a priori expectations. For instance, we recovered North American populations as the most basal lineages, with Central/South America, Pacific, and Atlantic populations each forming an independent, derived lineage (Fig. 1 c, d; Extended Data Fig. 1). We also found that all geographic sampling locations were genetically distinct, except those in North America, providing evidence of worldwide population structure but gene flow across North America (Fig. 1e, f). An exception was the non-migratory population in south Florida, which was distinct from other North American populations. We also found evidence for very recent dispersal between the North American migratory population and adjacent non-migratory populations (Supplementary Information). Our results suggest the monarch originated in North America from a migratory ancestor, a scenario consistent with the observation that all monarch populations, as well as D. erippus, share reproductive traits and behaviors which may have evolved in the context of mass migration[24]. We speculate the monarch originated in the southern USA or northern Mexico, where it originally undertook a shorter-distance annual migration. Three subsequent, independent dispersal events led to the monarch's current broad distribution. Towards the south, monarchs expanded from Belize to Costa Rica and into South America, as well as offshore, from south Florida to Bermuda and Puerto Rico. Westwards, they expanded into Hawaii and then to Samoa and Fiji before ending up in New Caledonia, Australia and New Zealand. Across the Atlantic, monarchs established first in Portugal and then moved to Spain and Morocco.
Extended Data Figure 1
Relationships among monarch populations inferred using the maximum likelihood method implemented in Treemix
Note, this is a fully resolved, bifurcating tree. The very short basal branches indicate little genetic drift in North American populations, not unresolved basal relationships. Colors correspond to those in Fig. 1. Treemix also inferred five migration events among populations: from Puerto Rico to Aruba, from Puerto Rico to Costa Rica, from New Caledonia to Fiji, from Belize or Costa Rica to Portugal, and from Belize to Puerto Rico.
Our dispersal scenario was further supported by the observation that non-North American populations had elevated linkage-disequilibrium (Extended Data Fig. 2a) and minor allele frequencies (Extended Data Fig. 2b), indicative of founder effects, and heterozygosity declined along each putative dispersal route (Extended Data Fig. 2c), as expected of step-wise dispersal. Similar step-wise dispersal is reflected in microsatellite markers[25]. The directionality index Ψ for range expansions[26] also supported North America as the monarch's ancestral origin (Extended Data Table 1). We estimated historical population sizes and divergence times using PSMC[27] (Extended Data Fig. 2d) and ∂a∂i[28] (Extended Data Fig. 3) and found a concordant history among all monarch populations, but distinct from D. erippus, for most of the last 1 My. Approximately 20 ky ago, at the end of the last glacial maximum, the North American population began to grow, presumably fueled by increasing availability of milkweed host plants throughout the Midwestern USA and expanding monarch migration. More recently, Atlantic and Pacific populations split from North America, at which time both experienced dramatic declines in population size. Historical records suggest Atlantic and Pacific dispersal events occurred in the 1800's[17], but our results suggest an earlier timeframe (Supplementary Information).
Extended Data Figure 2
Demographic history of the monarch butterfly
a, Patterns of linkage-disequilibrium decay across the genome in different geographic populations. b, Genome-wide distribution of minor allele frequencies. c, Heterozygosity across populations, estimated as the ratio of heterozygous SNPs to homozygous SNPs/individual. d, Demographic history inferred using PSMC. This analysis includes representative individuals of high sequencing depth for each geographic location. The period of the last glacial maximum (LGM; ~20 ky ago) is shaded in gray.
Extended Data Table 1
Inferring the monarch range expansion.
S1
S2
FST
Ψ
North America
south Florida
0.0472
−0.4219
Caribbean
0.0960
−0.8639
Central America
0.0473
−0.4765
South America
0.1263
−1.0493
Pacific
0.0929
−1.4988
Atlantic
0.1054
−1.0268
A positive Ψ indicates that population S1 is farther away from the origin of the expansion than S2 while a negative value indicates S2 is farther from the origin of the range expansion. North America (S1) includes the United States and Mexico but excludes south Florida; Caribbean includes Bermuda and Puerto Rico; Central America includes Belize and Costa Rica; South America includes Aruba and Ecuador; Pacific includes Hawaii, Samoa, Fiji, New Caledonia, Australia and New Zealand; Atlantic includes Portugal, Spain and Morocco.
Extended Data Figure 3
∂a∂i analysis parameter estimates
a, schematic of demographic scenario modeled in ∂a∂i labeled with parameters being estimated. Nu, effective population size (individuals); m, migration rate (individuals/year); T, time (years). b, inferred parameter estimates. c, 1D model-data comparison considering North America population only. In the left panel, the model is plotted in red and the data in blue. In the right panel, the residuals between model and data are plotted. d, 2D comparison for joint estimation of North America and dispersal populations (Central/South America, Pacific, Atlantic). The left two panels are marginal spectra for data and the maximum-likelihood model, respectively. The right two panels show the residuals.
Natural selection associated with migration
We used a modified version of the population-branch statistic (PBS)[29] to identify regions of the genome strongly differentiating North American monarchs from all three transitions to non-migratory behavior (Fig. 2a). By further limiting this search to regions of low sequence diversity within North America, we isolated 5.14 Mb (2.1% of the genome), encompassing 536 genes, significantly associated (Z-test, P < 0.01) with migration. This set was enriched for genes related to morphogenesis, neurogenesis, and extracellular matrix/basement membrane. Derived-allele frequency was elevated among monarchs in these migration-associated genomic regions (Fig. 2a), further suggesting a history of natural selection. Derived alleles were similarly enriched in D. erippus, consistent with the observation that this species also displays migratory behavior[14,17], and suggesting the common ancestor of D. plexippus and D. erippus was migratory as well.
Figure 2
A selective sweep associated with migration
a, Distribution of PBS and polymorphism in North America (πNOR), calculated in 5 kb sliding windows. Migration-associated genomic regions were identified as the points above the dashed line (P < 0.01) and to the left of the vertical green dashed line (lower quartile). Circled points consist of a single 21 kb region. b, Population genetic statistics were plotted across DPSCF300190 in 5 kb sliding windows. c, Gene models and SNP allele: white represents homozygous for the reference allele; red, homozygous for alternative allele; yellow, heterozygous; gray, masked site.
We were surprised to find that among the approximately 5 Mb associated with migration, a single 21 kb genomic segment stood out as an extreme outlier (Fig. 2a). This region showed multiple signatures of divergent selection (Fig. 2b) as well as an enrichment of shared alleles (ABBA vs. BABA sites[30]) among Atlantic, Pacific, and Central/South American dispersal events, indicating a shared haplotype among all non-migratory populations that was highly divergent from the haplotype in North America. This signature of haplotype sharing among non-migratory populations is likely due to recurrent selection on ancestral variation, as opposed to gene flow, because the non-migratory haplotype was present at low frequency in our North American samples (e.g., sample 203 from New Jersey was heterozygous).The 21 kb outlier region contained three genes, the F-box protein FBXO45, an uncharacterized transmembrane protein, and collagen type IV, subunit α-1 (Fig. 2c). By comparing these three genes we found evidence for divergent selection on collagen IV α-1 (Fig. 3). Collagen IV α-1 showed striking divergence between haplotypes found in migratory and non-migratory populations (Fig. 3a, b), apparently resulting from an ancient origin of the non-migratory haplotype (Fig. 3c). Collagen IV is a central component of basement membranes and essential for muscle morphogenesis and function[31]. Mutations in the α-1 subunit result in severe myopathy in Drosophila[32] and myopathy-related disease in humans[33]. Migratory and non-migratory haplotypes differed by 51 nucleotide substitutions in the coding sequence of collagen IV α-1, resulting in 15 amino acid substitutions. A subsection of the gene showed particularly high diversity within D. plexippus, as well as divergence between D. plexippus and other species, centered on the single amino acid substitution with evidence of positive selection, R1573Q (Fig. 3d). Collagen type IV is a heterotrimer composed of two α-1 chains and one α-2 chain, which bind at shared triple helix domains. Danaus plexippus collagen IV α-1 contains five triple helix domains and the R1573Q substitution occurred directly in the middle of one of these, suggesting a functional role related to trimerization. Interestingly, we found collagen IV subunit α-2 in a nearby genomic window, with reduced but still highly significant signatures of selection (Extended Data Table 1), providing additional evidence of selection on the interacting members of collagen IV. Furthermore, other genomic intervals strongly associated with migration overlapped with portions of another well-characterized flight muscle gene, kettin[34] (Extended Data Table 2).
Figure 3
Divergent selection on collagen IV α-1
a, Collagen IV α-1 shows elevated sequence divergence (Dxy) and differentiation (FST) between migratory and non-migratory monarchs (mean ± s.e.m), an excess of polymorphism (HKA test), and b, haplotype divergence. c, A maximum-likelihood tree shows the non-migratory haplotype predates species-level divergence within Danaus while the migratory haplotype is similar to D. erippus. d, A subsection of high polymorphism and divergence in collagen IV α-1 coincides with an amino acid experiencing positive selection, including a R1573Q substitution on the migratory haplotype. e, expression of collagen IV α-1 and α-2 differ between migratory and non-migratory populations in flight muscle tissue. f, flight metabolic rates differ more than resting metabolic rates between migratory and non-migratory populations (mean ± s.e.m.).
– RNA-binding protein– Pre-mRNA-splicing factor Cwf15– 2 universal hypothetical protein
DPSCF300551
22.5-30.3
0.060
– butterfly hypothetical protein
DPSCF300134
135.5-172.5
0.059
– insect hypothetical protein
DPSCF300001
4697.5-4709
0.053
DPSCF300190
213-236.5
0.052
– Collagen alpha-2(IV)– thioredoxin family Trp26– Selenoprotein T– Tetratricopeptide-like helical
DPSCF300001
4622.5-4628.5
0.052
– kettin protein
DPSCF300134
109.5-134.5
0.052
– potassium ion transport protein
DPSCF300074
526.5-532
0.052
DPSCF300005
137-152
0.049
– WD40 transcription factor
DPSCF300255
244-257.8
0.048
DPSCF300014
176-202.5
0.047
– Golgin-80 (RabGTPase binding)– Phosphatidylserine synthase– WD40 protein– kismet DNA binding protein
DPSCF300083
396-431
0.047
– Zinc finger DNA-binding domain
DPSCF300001
4725–4751
0.046
– forkhead protein transcription factor– Cyclic ion channel subunit
DPSCF300005
223.5-264
0.045
– Flotillin-1 (insulin-signaling pathway)
We hypothesized that the signatures of divergence associated with these essential muscle genes reflected selection for different flight muscle function between migratory and non-migratory butterflies. Consistent with this, we found divergent expression of collagen IV α-1 and α-2, but not other linked genes, between butterflies from migratory and non-migratory populations in adult thoracic muscle tissue (Fig. 3e). Surprisingly, collagen IV subunit α-1 and α-2 were down-regulated in migratory butterflies, leading us to hypothesize that natural selection may be acting on aspects of flight efficiency with migratory populations tuned to the distinct demands of long-distance flight. This scenario is supported by evidence of distinct wing shape and size, body mass and kinematic wing loading between migratory and non-migratory populations[15], but we tested it by measuring flight metabolic rates. We found active flight to be exceptionally demanding energetically, utilizing 25 times more energy than resting. Migrating monarchs are known to offset this to some extent by gliding for periods of time on tail winds[35]. Consistent with our hypothesis that flight muscle changes have resulted in more efficient energy consumption in migratory populations, we found that flight metabolic rates were lower in butterflies from a migratory population (Massachusetts), compared to one non-migratory population (south Florida) (Fig. 3f). This increase in metabolic efficiency appears to be a result of flight muscle performance because the difference between migratory and non-migratory populations was minimal when not in flight (Fig. 3f). Furthermore, we found little sequence or gene expression divergence associated with glycolytic enzymes (Supplementary Information), which could also influence metabolic rates[36]. It is interesting that while previous work has found a link between flight metabolism and dispersal ability in other butterfly species[36-38], it has always been via glycolysis and generally in the form of higher metabolic rates yielding greater dispersal. In contrast, the extreme distances required of monarch migration appear to have generated natural selection for reduced flight metabolism, which has been mediated by alternate mechanisms. Parallel shifts to the same non-migratory collagen IV α-1 haplotype in independent dispersal events suggest that an elevated flight metabolic rate may be beneficial in the absence of long-distance migration.
The genetic basis of wing pigmentation
Another aspect of monarch biology that has attracted attention is their bold warning coloration. The monarch butterfly, like Danaus species generally[14], is characterized by bright orange wing coloration that warns predators of their toxicity[39]. This coloration also serves to facilitate Müllerian mimicry between D. plexippus and the Viceroy butterfly, Limenitis archippus[40]. It is not well-appreciated that the monarch is polymorphic for wing coloration (Fig. 4a). On Oahu, Hawaii, the white ‘nivosus’ morph has been documented since the mid 1890's[41]. Previous breeding experiments have shown that wing coloration segregates as a single, autosomal locus with the white nivosus allele recessive to wild-type[42]. The nivosus wing coloration and inheritance pattern has led to speculation that the mutation likely disrupts the production of orange pigment[17]. Red, orange, and brown pigments on nymphalid butterfly wings are frequently ommochrome pigments[43] so we hypothesized that the nivosus mutation would be found somewhere in the ommochrome biosynthesis pathway.
Figure 4
The genetic basis of warning coloration
a, While generally bright orange, the nivosus morph lacks orange pigmentation. b, A comparison of 12 Hawaiian monarch genome sequences (5 wild-type, 5 nivosus, and 2 F1 hybrids) reveals perfect SNP associations in one gene, the myosin gene DPOGS206617. c, Comparison of DNA sequence divergence (Dxy) between D. plexippus and D. chrysippus shows strong purifying selection in exon 2, coinciding with SNP associations in modern samples, crosses, and field collections from the 1980's. SNP position 785 is associated in 17/20 samples from the 1980's (P = 0.02, one-tailed Fisher's exact test).
Our population genomic data provided a means of characterizing the monarch color switch locus at a molecular level. To do this, we sequenced 12 additional Hawaiian monarch genomes, five white individuals and seven wild-type orange monarchs, all reared at the same time, in the same location. Three of these wild-type monarchs were known relatives of white monarchs, two were F1 offspring and one was an F2 offspring of a white parent. By scanning SNP genotypes for segregation patterns consistent with the Mendelian genetics (Fig. 4b), cross-referencing genotypes across the entire set of 101 genomes, and further testing co-segregation in crosses (Fig. 4c), we found that markers in one gene, the myosin gene DPOGS206617, were strongly associated with wing color. Interestingly, this gene is homologous to myosin 5a, which is responsible for the ‘dilute’ mouse coat color mutant[44], a phenotype resulting from reduced melanization due to impaired myosin transport of melanosomes[45]. This suggests that an alteration in pigment transport, and not pigment production, underlies the nivosus form. We speculate that this gene, while not previously implicated in insect pigmentation, may be an important source of color pattern variation across the butterfly subfamily Danainae because genera closely related to Danaus are dominated by white-winged species, and other Danaus species display similar orange vs. white variation[14].
Discussion
We have leveraged fantastic natural variation and extensive genome sequencing to characterize the monarch butterfly's evolutionary origin and history of dispersal, genome-wide signatures of divergent selection associated with migratory behavior, and the discrete genetic basis of warning coloration. Our results yielded unexpected answers in all three aspects. Not only did we re-polarize the ancestral migratory character state and geographic origin for the monarch, but we also found evidence for recurrent, divergent selection on flight muscle function during shifts in migratory behavior, likely mediated by their role in influencing flight efficiency. Surprisingly, as monarchs have reverted to a non-migratory state, which is an ancestral state that predates their own species and that of their common ancestor with D. erippus, in the case of collagen IV α-1 at least, they appear to have used old genetic variation to do so. Furthermore, wing color variation is mediated by a gene with no prior known role in insect pigmentation but with an analogous effect in vertebrates, but one that influences a different pigment in a distinct morphological structure.Unfortunately, the monarch migration is currently experiencing a devastating decline and there is fear the phenomenon may disappear entirely. Recent monitoring shows an alarming downward trend in monarch numbers from eastern North America, with 2013 marking the lowest number of overwintering monarchs in recorded history[46]. This decline has been driven by multiple factors, including deforestation, drought, and a precipitous drop-off in the number of milkweed host plants across North America[46]. Our results emphasize the importance of ongoing conservation efforts to preserve the migration and extend the extraordinary evolutionary history of this iconic butterfly.
Methods
Sampling and sequencing
We sampled a total of 101 butterflies, including 92 D. plexippus and nine butterflies from other Danaus species (Supplementary Table 1). The North America population of monarchs undergoes a yearly migration from the United States and southern Canada to central Mexico (east of the Rocky Mountains) and coastal California (west of the Rocky Mountains). Our sampling sites for the migratory group covered several stopover points and the overwintering grounds for both eastern and western populations (Fig. 1a, green points). For comparison, we sampled residential, non-migratory populations from three major geographic regions. In the south, non-migratory populations were sampled from south Florida (around the city of Miami), throughout the Caribbean, and Central and South America (Fig. 1a, red points). Out of the Americas, we also sampled monarch populations across the Pacific (Fig. 1a, blue points), including Hawaii and Oceania, and populations across the Atlantic (Fig. 1a, purple points), spanning Iberia and North Africa.It is important to note that D. plexippus is known to move seasonally outside of North America. For instance, seasonal movement through mountain passes has been recorded in Costa Rica[47] and seasonal movement between inland and coastal locations is well-known in Australia[48]. While this behavior is properly referred to as migration, it is very different from the migration of North American monarchs and these phenomena are routinely distinguished in the literature. First, these phenomena differ by orders of magnitude in their scope, with millions to a billion individuals moving thousands of kilometers in North America and hundreds to thousands individuals moving tens to hundreds of kilometers elsewhere. Second, outside of North America overwintering biology, such as sexual diapause, is highly labile or not known to exist. Similarly, other Danaus species, such as D. chrysippus are well-known to move seasonally[49] but the scale is relatively small in comparison to D. plexippus in North America. There is, however, evidence that D. erippus migrates in South America in a way that mirrors D. plexippus migration in North America[17]. Critically, our genetic data support these expectations and distinctions by showing that particular genetic signatures distinguish taxa (populations/species) with differing migratory behavior based on the literature. In our study, we refer to D. plexippus populations from North America (excluding south Florida) as ‘migratory’ and D. plexippus populations from other geographic locations as ‘non-migratory’ to capture this distinction in their biology.All samples were sequenced on the Illumina sequencing platform (HiSeq 2000). Paired-end libraries were prepared using an Illumina paired-end library kit. We combined between 4-8 samples in a sequencing lane (2×100 bp) to generate approximately 10X and 20X raw coverage for D. plexippus and outgroup species, respectively. A total of 384.6 Gb of paired-end sequence data were generated (Supplementary Table 2).
Alignment, SNP calling, and genotyping
Before mapping, all reads were processed for quality control and filtered using Seqtk (https://github.com/lh3/seqtk). 366.9 Gb high-quality read pairs were kept and mapped to the latest version monarch assembly (Supplementary Table 2). Based on the result of our preliminary test, we chose Stampy v1.0.21[50] as our main mapping software, but we also applied other mapping methods as independent quality controls (Supplementary Table 3). A randomly selected subset of reads was mapped in advance to estimate the appropriate parameters for insert size and substitution rate for each library. Mapping results were subsequently processed by sorting, indel realignment, duplicate marking, and low quality filtering using functions in Picard v1.8 (http://picard.sourceforge.net) and GATK2[51]. Sequencing coverage and depth for each sample were calculated using the “DepthOfCoverage” module of GATK2.Since we had a wide range of sequencing coverage among samples (Supplementary Table 2), we carried out SNP calling and genotyping at two separate stages to balance the power between low and high coverage samples. We first discovered variants on a population-scale using a variety of independent pipelines, which both introduced different alignment inputs and covered most popular SNP calling algorithms (Supplementary Table 3). By comparing methods, we determined a core set of SNPs according to the sensitivity and specificity of each pipeline, as well as using combinations of pipelines (Supplementary Table 4 and Supplementary Table 5). Using the consensus set of SNPs, we went back to each sample to estimate the corresponding genotype likelihoods from all alignment sources. Based on the comparison, genotypes called from the stampy alignment showed the overall minimum difference with other independent methods (Supplementary Table 6). We further filtered out variants from regions with abnormal sequencing coverage and constructed a core SNP matrix. In this final dataset, 99.1% of the genotypes were independently supported by additional evidence, suggesting a reliable input for the subsequent population genetic analysis. Also, unlike results obtained using a single SNP scoring pipeline, this method substantially reduced the correlation between the identified number of SNPs and sequencing depth.
Outgroup species
Since divergence between D. plexippus and other Danaus species was high, genotypes within highly divergent regions were likely to be miscalled or filtered out. We therefore performed de novo assemblies for outgroup species for analyses that were sensitive to the quality of consensus sequence. Contigs were assembled for each outgroup sample, a minimum length of 300 bp were kept and processed with a redundancy filter step as described previously[20].We chose the highest quality assemblies (Chry_AUS_113_M, Eres_FL_27M, Erip_BRA_16005_F, and Gili_FL_28_F) to infer the species phylogeny based on 7,251 single-copy universal orthologs that were identified previously among lepidopteran genomes (D. plexippus[52], H. melpomene[53], and B. mori[54]). The OGS2.0 gene models of D. plexippus were used for homology search by TBLASTN. The high-scoring pairs with E<10-5 were then processed by genblastA v1.0.4[55] and gene structures were determined by GeneWise v2.2.0[56]. 3,714 proteins that were recovered with >= 50% coverage in all four outgroup species were kept, conserved blocks were extracted using Gblocks v0.91b[57], and were concatenated to seven super genes with 908,188 amino acids. The species phylogenetic tree was calculated using PhyML v3[58] with the JTT model and 100 replicates of bootstrap analyses. Our species tree resulted in clear separation among all sequenced Danaus species, including putative sister-species, D. plexippus and D. erippus.
Population structure
We used all bi-allelic and high quality SNPs to infer phylogeography and population structure for D. plexippus. For phylogeny, pairwise genetic distances were calculated among all samples as described previously[59] and a tree was subsequently generated (Fig. 1c) using the neighbor-joining (NJ) method implemented in PHYLIP v3.695[60]. A second frequency tree was also generated (Fig. 1d) based on 1000 bootstrap replicates using the consense module of PHYLIP.We also inferred a population-level phylogeny using the maximum likelihood approach implemented in Treemix[61]. This method was designed specifically to infer patterns of population splitting events from genome-wide allele frequency data. For this analysis, we excluded samples that appeared to have recently dispersed among our geographic locations and we filtered out singleton SNP sites (MAF < 0.05). All subsequent data analysis was performed with Treemix v1.11 using parameters “-global” to generate the ML tree (Extended Data Fig. 1).Population genetic structure and individual ancestry proportions (admixture) were inferred using FRAPPE v1.1[62]. We increased the pre-defined genetic clusters from K=2 to K=11 and ran analysis with 10,000 maximum iterations. We also performed principal component analysis (PCA) using the package EIGENSOFT v5.0[63]. A Tracey-Widom test was used to determine the significance level of the eigenvectors.Based on the complete NJ tree, FRAPPE results, and PCA clustering, it was immediately clear that three D. plexippus samples had recently dispersed among geographic locations. Sample Plex_BLZ_4_M was collected in Belize but clustered with North American samples, Plex_FLs_MIA16_M was collected in south Florida but clustered with migratory North American samples, and Plex_MA_HI032_M was collected in Massachusetts but clustered with Bermuda. Since these are all putative exchanges among geographically proximate locations, we suspect they represent real dispersal events as opposed to sample mix-ups. It is also worth noting that Plex_FLs_MIA16_M is the only sample we included from south Florida that was collected during the winter which suggests that North American migratory monarchs end up in south Florida where they are in contact with the genetically distinct non-migratory population[64]. This may also explain why this sample, and no other south Florida samples, emerged as a recent dispersal event. A small number of additional North American samples were population outliers in either FRAPPE or PCA suggesting potential admixture or contamination. We removed these samples, those with low sequence coverage, and the three recent dispersers from subsequent analyses (Supplementary Table 7). We note however that we performed a second analysis including all samples and found that we were still able to clearly identify all regions of the genome strongly associated with migration (below).
Demographic analysis
We compared patterns of linkage disequilibrium (LD) and minor allele frequency (MAF) among populations. To estimate linkage disequilibrium, we calculated r2 using Haploview v4.2[65] with parameters “-maxdistance 160 -dprime -minGeno 0.6 -minMAF 0.1 -hwcutoff 0.001”. Global patterns of LD and MAF were compared for the four main clusters of monarchs (Extended Data Fig. 2a and 2b). We found high LD and MAF in Atlantic and Pacific populations, consistent with inbreeding and population bottlenecks. Central/South America was intermediate between these populations and North America. We also estimated heterozygosity for each sample, calculated as the ratio of heterozygous to homozygous variants for each sample (Extended Data Fig. 2c).We used the directionality index Ψ[26] to test the occurrence of a range expansion and to infer the origin of the range expansion (Extended Data Table 1). For this analysis, we used biallelic SNPs showing consensus genotypes across all outgroup individuals, and we defined the ancestral allele as that consensus outgroup allele. After excluding sites where one or both of the two focal populations (S1 and S2) was fixed for the ancestral allele, we calculated a 2D site derived allele frequency spectrum between populations as described previously[26].We inferred demographic history for D. plexippus using the Pairwise Sequentially Markovian Coalescence (PSMC) model[27]. To ensure the quality of consensus sequence, we only used representative samples of high sequencing depth for each geographic region. Processed alignments of individuals were transformed to the whole-genome diploid consensus sequence using SAMTOOLS[66]. Bases of low sequencing depth (a third of the average depth) or high depth (twice of the average) were masked. We then used “fq2psmcfa” to transform the consensus sequence into a fasta-like format where the i-th character in the output sequence indicates whether there is at least one heterozygote in the bin of 20 bp. Parameters were set as follows: “-p 4+5*3+13*2+5*3+4 -r 2”. The monarch generation time (g) was set as an estimate of 0.3 years. We used a standard mutation rate (μ) of 8.4 × 10−9, from Drosophila[67]. Note, if we use a lower estimate of the mutation rate[68], all results from the PSMC and ∂a∂i analyses (below) remain qualitatively the same but inferred divergence times are older and effective population sizes are larger.We also inferred the demographic history of the major geographic regions using diffusion approximation for demographic inference (∂a∂i[28]), which employs SNP frequency data for populations rather than recombination events within individual genomes. For this analysis, we analyzed the North American population alone and then considered only pairwise comparisons between North America and each of the major dispersal populations (South/Central America, Atlantic, Pacific). In an attempt to avoid selected sites, we only used SNPs from intergenic regions on autosomal scaffolds. We calculated folded frequency spectra since there is no trinucleotide substitution matrix that can be used for statistical correction. As suggested, we specified simple models first and fit the model by increasing complexity gradually. A likelihood ratio test was used to optimize model selection, with the best model pictured in Extended Data Fig. 3a, although we note it is hard to rule out more complex demographic scenarios. Scaled parameters from the most likely model were transformed using the same g and μ as above. We also performed nonparametric bootstrapping (100 times) to determine the variance of each parameter (Extended Data Fig. 3).Historical records suggest Atlantic and Pacific dispersal events of the monarch butterfly occurred in the 1800's[17,69]. These records are largely based on sightings from early European explorers who do not note the monarch in locations at certain times and then others who note the monarch in abundance only years later. Our demographic analyses based on genome sequence data are inconsistent with this timing. For instance, our PSMC analyses suggest Pacific and Atlantic dispersal events may have occurred as early as 2-3 ky ago. Because this timing was unexpected, we performed a follow-up analysis using ∂a∂i[28] and this also yielded split times of 2-3 ky ago between North America and the Atlantic and Pacific populations (Extended Data Fig. 3). Interestingly, the ∂a∂i analysis further indicated recent bottleneck recovery in the past 200-500 years, perhaps pointing to trans-oceanic dispersal events that were initially seeded thousands of years ago but which spread widely only within the last 200 years. This may provide some link between the genetic data and the historical records. If our demographic inference based on the genomic data is correct, where has the monarch been for all this time? The most common monarch host plants are recent introductions in the Pacific but there are native host plants in Southeast Asia. In addition, there is apparently a long history of the monarch butterfly in New Zealand. For instance, the indigenous Māori people of New Zealand believe the monarch is native to New Zealand, and unlike other Pacific locations, they have a traditional name for the monarch butterfly[69]. On the Atlantic side, it is possible the monarch has co-occurred with congener D. chrysippus in North Africa and on the Canary Islands, using the same host plants. We stress that the ancient Atlantic and Pacific dispersal scenarios we outline here are speculative, but plausible, and they would be in line with our genomic results.
Identification of migration-associated genomic regions
We applied a sliding window approach (5 kb windows sliding in 500 bp steps) to identify genomic regions associated with migration. Several statistical features were considered and compared. Based on the evolutionary scenario, we employed a modified population branch statistic (PBS) approach, which originally showed power to detect incomplete selective sweeps over short divergence times[29], a scenario that is highly relevant here.Our approach was to search for genomic regions separating North America from all three independent, losses of migration (South/Central America, Atlantic, Pacific). Based on the original PBS algorithm, we specifically modified the formula as PBS = (TN-C + TN-P + TN-A - TC-P - TC-A) / 3, where TA-B is the log transformed FST between population A and B (N, North America; C, Central/South America, P, Pacific; A, Atlantic). We further restricted this to windows in the lowest quartile distribution of pairwise nucleotide polymorphism (π[70]) within North America. At a significance of P < 0.01 (Z test), we identified a total of 5.14 Mb (2.1%) of the genome, including 536 predicted genes, which were associated with migration (Supplementary Table 8). If we did not restrict our gene set by low π in North America, our list of migration-associated genes included an additional 154 genes (Supplementary Table 8). We calculated a variety of other statistics, including Tajima's D[71], LD, difference in the number of ABBA vs. BABA[72] sites, and derived allele frequency (DAF) for each sliding window. To estimate DAF, we inferred ancestral alleles using the consensus sequence of the outgroup taxa. We found that DAF was notably enriched in our migration-associated genomic regions, relative to the rest of the genome, and this was true in both D. plexippus and D. erippus.
Annotation
We annotated genes in migration-associated genomic regions using the monarch OGS2.0 gene models and related information from MonarchBase[52] (Extended Data Table 2, Supplementary Table 8). We additionally annotated the genes within functional categories based on the corresponding Drosophila melanogaster orthologs using DAVID online platform v6.7[73]. Functional enrichments are presented in Supplementary Table 9 and Supplementary Table 10. We also specifically examined PBS and gene expression in glycolysis enzymes (Supplementary Table 11).
Targeted gene analysis
We performed a targeted population genetic analysis of three adjacent genes on scaffold DPSCF300190; FBX045, DPOGS206536, and collagen IV α-1. For each gene, CDS was extracted for samples from the population resequencing data and average pairwise sequence divergence and FST were estimated between migratory and non-migratory populations using Arlequin v3.5[74]. We used the program Network v4.612 (Fluxus Engineering) to generate gene networks and MEGA v6[75] to infer a maximum-likelihood gene tree for collagen IV α-1 under the GTR+I+G model. We used DnaSP v5.10[76] to compare sequence polymorphism among the three genes using the HKA test[77]. We also performed sliding window analyses of sequence polymorphism (π) and between species (plexippus, chrysippus) divergence (Dxy) along collagen IV α-1 CDS using DnaSP. Codon-based models of adaptive protein evolution were implemented using the Datamonkey webserver (datamonkey.org). Specifically, we ran FUBAR[78], MEME[79], and FEL[80] tests on an alignment of lepidopteran collagen IV α-1 sequences. The average dN/dS across collagen IV α-1 was 0.16 but all tests were suggestive of positive selection on the R1573Q substitution while other sites were found to be under negative or no selection (FUBAR: dN = 10.5, dS = 1, Normalized dN/dS = 9.5, Bayes factor = 87; MEME: dN(β+) = 110 (P [β = β+] = 0.46), dS = 0.47, P = 0.08; FEL: dN = 55, dS = 10−6, Normalized dN/dS = 83, P = 0.10).
RNA-seq
We extracted total RNA from the adult thoracic muscle tissue of six D. plexippus samples, one male and one female from Massachusetts, Hawaii, and Costa Rica. The samples from Hawaii and Costa Rica were from non-migratory populations and the samples from Massachusetts were from a non-migratory summer generation of the migratory North American population. RNA-seq libraries were prepared using an Illumina TruSeq protocol, individually indexed and sequenced on one lane of HiSeq 2000. After QC processing of raw data, differential gene expression of target genes was analyzed using TopHat v2.0.7[81] and CuffLinks v2.1.1[82].
Respirometry
We measured resting and flight metabolic rates using flow-through respirometry[36]. The material consisted of 60, 2-day old females, originating from Massachusetts (n=19), south Florida (n=19), and Costa Rica (n=22). The samples from south Florida and Costa Rica were from non-migratory populations and the samples from Massachusetts were from a non-migratory summer generation of the migratory North American population. Individuals were placed in a 2-L cylindrical, transparent respirometry chamber and covered with a dark cloth. We pumped dried, CO2-free air through the chamber and used a mass flow meter and controller (Sierra Instruments, Monterey, CA, USA) to regulate the STP-corrected flow rate of 1.8 L min−1. We used a regularly calibrated differential, infrared CO2 analyzer (Li-Cor 7000, Li-Cor, Lincoln, NE, USA) to measure the CO2 emission rate. We converted the signals from analog to digital using a Sable Systems Universal Interface (UI-2) and collected the data using ExpeData (Sable Systems, Reno, NV, USA). After the individual had rested motionless in the darkened chamber for ca. 15 min and the CO2 emission curve had reached a stable baseline, we started recording resting metabolic rate (RMR). We recorded a minimum of 10 min of stable, undisturbed RMR. The measurements took place in a temperature controlled room. We used a NTC thermistor probe (Sable Systems) to continuously measure the temperature inside the chamber. The mean temperature across the RMR measurements was 32.2°C (s.e.m. 0.07).After the RMR recording, we removed the dark cloth and exposed the butterfly to bright light (two 25 W UV/visible light bulbs and one 26 fluorescent light bulb). We allowed the butterfly to adjust to the light for 30 sec after which we began to stimulate it to fly by sharply shaking the chamber after which the individual took off. Once it alighted, we shook it up in the air again. The shaking continued for 10 min with the aim of producing continuous flight. After 10 min, we stopped the stimulation and covered the chamber with the dark cloth. We allowed the CO2 curve to return to the baseline. We performed an instantaneous or Z-correction[83] on all metabolic rate data to compensate for delayed washout of CO2 in the respirometry chamber. To characterize flight performance, we focused on total flight metabolic rate (FMR), which consists of the total volume of CO2 produced during the 10-min flight experiment.For analysis, we compared the three populations using ANCOVA. Time of the day, temperature, and body size (pupal mass) were added as covariates to the models. There were significant differences in RMR among the three populations (F2,51 = 6.11, P = 0.004). Pupal mass had a positive effect on RMR (F1,51 = 14.98, P = 0.0003). There was a marginally significant quadratic time effect, suggesting that RMR peaked in early afternoon (Time: F1,51 = 4.09, P = 0.048; Time[2]: F1,51 = 4.03, P = 0.0499). The measurement temperature main effect was non-significant (F1,51 = 1.08, P = 0.30) but there was a significant interaction between population and temperature (F1,51 = 6.30, P = 0.004), suggesting a positive relationship between temperature and RMR in Florida and Costa Rica, but a negative relationship in Massachusetts. A Tukey's HSD test revealed no significant RMR differences in pairwise comparisons among populations (P > 0.05 in all pairwise comparisons). FMR differed among populations (F2,56 = 6.43, P = 0.003). Pupal mass had a positive effect on FMR (F1,56 = 9.38, P = 0.0034). A Tukey's HSD test showed that mass-independent FMR was different between Massachusetts and south Florida (P = 0.0025) but not between Massachusetts and Costa Rica (P = 0.5837).
Association mapping of color gene
We extracted genomic DNA from 12 individuals (Supplementary Table 1) and constructed Illumina paired-end libraries using the Illumina TruSeq protocol. 12 libraries were indexed and pooled into 2 lanes and sequenced using Illumina Hiseq2000. Only high quality reads that passed QC step were used for downstream analyses. Genome resequencing data were aligned to D. plexippus reference genome sequence using Bowtie2 v2.1.0[84] with parameter –very-sensitive-local and then were re-ordered and sorted by Picard v1.84 (http://picard.sourceforge.net). Realigner Target Creator and Indel Realigner in GATK v2.1 were used to realign indels and Unified Genotyper was used to call genotypes across 12 individuals using following parameters: heterozygosity 0.01, stand_call_conf 50, stand_emit_conf 10, dcov 250. 10,034,303 SNPs and 1,434,642 indels supported by more than 10 individuals and with good quality (Q>30) were kept for further analysis. Association tests were performed using PLINK v1.07[85] and variants with P < 0.005 (Fisher's exact test) were selected. Candidate loci were checked using customized scripts and those with strong linkage patterns (containing >10 associated variants) within gene regions and satisfied the known genotypes were picked up for PCR verification (Supplementary Table 12). We also repeated this analysis after excluding 4 of the 12 samples with lower genome sequence coverage and the results were identical, yielding the highest genome-wide SNP associations in the myosin gene DPOGS206617.We further tested potentially associated polymorphisms in families and field-collected samples provided by John Stimson[41,42]. We extracted genomic DNA from 58 historic specimens (Supplementary Table 13) and performed whole genome amplification using Genome Plex Complete Whole Genome Amplification Kit (Sigma). We designed primers to span polymorphisms on five potentially associated scaffolds (Supplementary Table 12). Amplification efficiency was low because of the age of these specimens. However, after Sanger sequencing all positive PCR products and subsequent analysis of genotypes, amplified region 1013900-1013989 yielded a very strongly associated SNP, position 785 in the myosin gene DPOGS206617 (Supplementary Table 12 and Supplementary Table 14). This section of DPOGS206617 is very likely to house the causative variation responsible for monarch color variation because it contains SNPs perfectly associated with color in our full-genome sequence data, a SNP very strongly associated in our targeted analysis of historical specimens, a signature of long-term purifying selection over evolutionary time, and notably, the white-associated alleles in this region are derived alleles that are found in no other monarch sample among our 101 sequencing panel.
Relationships among monarch populations inferred using the maximum likelihood method implemented in Treemix
Note, this is a fully resolved, bifurcating tree. The very short basal branches indicate little genetic drift in North American populations, not unresolved basal relationships. Colors correspond to those in Fig. 1. Treemix also inferred five migration events among populations: from Puerto Rico to Aruba, from Puerto Rico to Costa Rica, from New Caledonia to Fiji, from Belize or Costa Rica to Portugal, and from Belize to Puerto Rico.
Demographic history of the monarch butterfly
a, Patterns of linkage-disequilibrium decay across the genome in different geographic populations. b, Genome-wide distribution of minor allele frequencies. c, Heterozygosity across populations, estimated as the ratio of heterozygous SNPs to homozygous SNPs/individual. d, Demographic history inferred using PSMC. This analysis includes representative individuals of high sequencing depth for each geographic location. The period of the last glacial maximum (LGM; ~20 ky ago) is shaded in gray.
∂a∂i analysis parameter estimates
a, schematic of demographic scenario modeled in ∂a∂i labeled with parameters being estimated. Nu, effective population size (individuals); m, migration rate (individuals/year); T, time (years). b, inferred parameter estimates. c, 1D model-data comparison considering North America population only. In the left panel, the model is plotted in red and the data in blue. In the right panel, the residuals between model and data are plotted. d, 2D comparison for joint estimation of North America and dispersal populations (Central/South America, Pacific, Atlantic). The left two panels are marginal spectra for data and the maximum-likelihood model, respectively. The right two panels show the residuals.Inferring the monarch range expansion.A positive Ψ indicates that population S1 is farther away from the origin of the expansion than S2 while a negative value indicates S2 is farther from the origin of the range expansion. North America (S1) includes the United States and Mexico but excludes south Florida; Caribbean includes Bermuda and Puerto Rico; Central America includes Belize and Costa Rica; South America includes Aruba and Ecuador; Pacific includes Hawaii, Samoa, Fiji, New Caledonia, Australia and New Zealand; Atlantic includes Portugal, Spain and Morocco.Top 20 migration-associated genomic regions.
Authors: Ben Murrell; Sasha Moola; Amandla Mabona; Thomas Weighill; Daniel Sheward; Sergei L Kosakovsky Pond; Konrad Scheffler Journal: Mol Biol Evol Date: 2013-02-18 Impact factor: 16.240
Authors: Ildikó Kelemen-Valkony; Márton Kiss; Judit Csiha; András Kiss; Urs Bircher; János Szidonya; Péter Maróy; Gábor Juhász; Orbán Komonyi; Katalin Csiszár; Mátyás Mink Journal: Matrix Biol Date: 2011-10-18 Impact factor: 11.583
Authors: Kristjan Niitepõld; Alan D Smith; Juliet L Osborne; Don R Reynolds; Norman L Carreck; Andrew P Martin; James H Marden; Otso Ovaskainen; Ilkka Hanski Journal: Ecology Date: 2009-08 Impact factor: 5.499
Authors: Samantha E Iiams; Aldrin B Lugena; Ying Zhang; Ashley N Hayden; Christine Merlin Journal: Proc Natl Acad Sci U S A Date: 2019-11-25 Impact factor: 11.205
Authors: David L Denlinger; Daniel A Hahn; Christine Merlin; Christina M Holzapfel; William E Bradshaw Journal: Philos Trans R Soc Lond B Biol Sci Date: 2017-11-19 Impact factor: 6.237
Authors: Simon H Martin; Kumar Saurabh Singh; Ian J Gordon; Kennedy Saitoti Omufwoko; Steve Collins; Ian A Warren; Hannah Munby; Oskar Brattström; Walther Traut; Dino J Martins; David A S Smith; Chris D Jiggins; Chris Bass; Richard H Ffrench-Constant Journal: PLoS Biol Date: 2020-02-27 Impact factor: 8.029