
Genetic Landscapes Reveal How Human Genetic Diversity Aligns with Geography.

Benjamin M Peter^1,2, Desislava Petkova^3, John Novembre^1,4.

Abstract

Geographic patterns in human genetic diversity carry footprints of population history and provide insights for genetic medicine and its application across human populations. Summarizing and visually representing these patterns of diversity has been a persistent goal for human geneticists, and has revealed that genetic differentiation is frequently correlated with geographic distance. However, most analytical methods to represent population structure do not incorporate geography directly, and it must be considered post hoc alongside a visual summary of the genetic structure. Here, we estimate "effective migration" surfaces to visualize how human genetic diversity is geographically structured. The results reveal local patterns of differentiation in detail and emphasize that while genetic similarity generally decays with geographic distance, the relationship is often subtly distorted. Overall, the visualizations provide a new perspective on genetics and geography in humans and insight to the geographic distribution of human genetic variation.


Keywords: geographic structure; geography; human genetics; isolation-by-distance; population genetics; population structure

Year: 2020 PMID: 31778174 PMCID: PMC7086171 DOI: 10.1093/molbev/msz280

Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240

In many regions of the world, human genetic diversity “mirrors” geography in the sense that genetic differentiation increases with geographic distance (“isolation by distance”; Ramachandran et al. 2005; Novembre et al. 2008; Wang et al. 2012; Bradburd and Ralph 2019; Battey et al. 2019). However, due to the complexities of geography and history, this relationship varies across the globe. Pioneering studies of classical blood group and allozyme loci (Barbujani and Sokal 1990; Cavalli-Sforza et al. 1994), mostly across Europe, found that some allele frequencies exhibit zones of elevated change that frequently align with each other. Later studies of large microsatellite marker panels (Rosenberg et al. 2002) observed broad geographic clustering, which led to a debate over whether human fine-scale genetic variation is better characterized by discrete clusters or continuous clines (Serre and Pääbo 2004; Rosenberg et al. 2005; Frantz et al. 2009; Perez et al. 2018). Since those early studies, methods in spatial or landscape genetics have matured, with new, powerful methods capable of modeling population structure while allowing for spatial heterogeneity (Guillot et al. 2009; Bradburd et al. 2016; Novembre and Peter 2016; Ringbauer et al. 2017; Bradburd et al. 2018; House and Hahn 2018; Ringbauer et al. 2018). One of these methods is EEMS (Estimated Effective Migration Surfaces; Petkova et al. 2016). EEMS uses a model based on local “effective migration” and “diversity” parameters. Importantly, it is a model-based visualization tool. The parameters of the model are not intended to be interpreted literally; they are simply tools to help visualize the relationship of genes to geography. Populations in areas of high effective migration are genetically more similar than other populations at the same geographic distance, and conversely, low effective migration rates imply that genetic differentiation increases rapidly with distance.
In turn, a map of inferred patterns of effective migration can provide a useful visualization of spatial genetic structure for large, complex samples. To date, the EEMS method has not been applied to human diversity data from very large, spatially extended samples. The method has the potential to produce useful summaries of human genetic variation that are more transparent and immediately interpretable than alternatives using methods such as principal components analysis. To explore this possibility, we have applied EEMS and PCA using single-nucleotide polymorphism (SNP) data combined from 27 different data sets comprising a total of 6,066 individuals from 419 locations across Eurasia and Africa (supplementary information, Supplementary Material online). We organize our applications in seven analysis panels: an overview Afro-Eurasian panel (AEA), four continental-scale panels, and two panels of Southern African KhoeSan and Bantu speakers. In all cases, the inferred EEMS surfaces are “rugged,” with numerous high and low effective migration features (figs. 1 and 2) that are strongly statistically supported when compared with a uniform-migration model (supplementary table 2, Supplementary Material online). The regions of depressed effective migration often align in long, connected stretches that are present in >95% of MCMC iterations. We refer to these features as “troughs” and annotate them with dashed lines (figs. 1 and 2, supplementary figs. 2, Supplementary Material online show these troughs in isolation, supplementary figs. 2, Supplementary Material online show the posterior variance on migration rates).

Regional patterns of genetic diversity. (a) Scale bar for relative effective migration rate. Posterior effective migration surfaces for (b) Western Eurasia (WEA) (e) Central/Eastern Eurasia (CEA) (g) Africa (AFR) (h) South East Asian (SEA) (k) Southern African KhoeSan (SAKS) (l) Southern African Bantu (SAB) analysis panels. In panel g, red circles indicate Nilo-Saharan speakers. Approximate location of troughs is shown with dashed lines (see supplementary fig. 4, Supplementary Material online). PCA plots: (c) WEA (d) Europeans in WEA (f) CEA (i) SEA (j) AFR (m) SAHG+SAB. Individuals are displayed as gray dots. Large dots reflect median PC position for a sample; with colors reflecting geography matched to the corresponding EEMS figure. In the EEMS plots, approximate sample locations are annotated. For exact locations, see annotated supplementary figure 4, Supplementary Material online and supplementary table S1, Supplementary Material online. Features discussed in the main text and Supplementary Material online are labeled. FST values per panel emphasize the low absolute levels of differentiation.

Large-scale patterns of population structure. (a) EEMS posterior mean effective migration surface for Afro-Eurasia (AEA) panel. Regions and features discussed in the main text are labeled. Approximate location of troughs is annotated with dashed lines (see supplementary fig. 2, Supplementary Material online). (b) PCA plot of AEA panel: Individuals are displayed as gray dots, colored dots reflect median of sample locations; with colors reflecting geography and matching with the EEMS plot. Locations displayed in the EEMS plot reflect the position of populations after alignment to grid vertices used in the model (see Materials and Methods). For exact locations, see annotated supplementary figure 2, Supplementary Material online and supplementary table S1, Supplementary Material online. The displayed value of FST emphasizes the low absolute level of differentiation in human SNP data.

In the broad overview Afro-Eurasia panel (fig. 1; n = 4,697 samples; 370 locales; FST = 0.071) we see that 18 out of 25 troughs visually align with plausible topographical obstacles to migration, such as deserts (Sahara; A1), seas (e.g., Mediterranean, Red, Black, Caspian, East China Seas; A2–8), marine straits (e.g., Mozambique Channel, Taiwan Strait; A9–10) and mountain ranges (Ural, Himalayas, Caucasus; A13, A11, middle of A3) or a combination thereof (e.g., the northeastern parts of A11, A12 roughly accord with the Tien Shan and the Tarim Basin, Altai and Gobi complex of mountains/desert, respectively). Many of these features, such as the Sahara desert (Cavalli-Sforza et al. 1994) or the Himalayas (Rosenberg et al. 2005; Bradburd et al. 2013) have been studied in great detail, as they are zones of not only genetic but also linguistic and ethnic differentiation. The remaining seven troughs (A19–A25) are found across Central Africa, Southern Africa, Scandinavia, and Siberia. In each of these regions, our sample consists of agricultural-based populations in relatively close proximity to traditionally hunter–gatherer or pastoralist populations. The island populations of the Andaman Islands and New Guinea show troughs nearly contiguously around them (southern part of A11, and A15) reflecting their histories of relative isolation (Reich et al. 2009; Pugach et al. 2013). The other main features emerging at this scale are several large regions that have mostly high effective migration (such as within the European continent, the Arabian Peninsula, and East Asia). Analyses on a finer geographic scale highlight subtler features (e.g., compare Europe in fig. 1 vs. fig. 2), and reveal that differentiation exists on local and continental scales (supplementary table 2, Supplementary Material online).
At these finer scales we continue to see troughs that align with landscape features, though increasingly we see troughs and corridors that coincide with contact zones of language groups and hypothesized areas of human migrations. For example, in Europe (fig. 2) we observe troughs roughly in contact zones between Germanic and Northern Slavic speakers (W12) and between Northern Slavic speakers and the linguistically complex Caucasus region (W8). These, as well as most of the other features in Europe (troughs through the Alps, Adriatic, between Italy and Sardinia, in Northern Scandinavia), closely align with older results from classical markers (Barbujani and Sokal 1990). The Eastern Eurasian panel (fig. 2) is largely consistent with the coarser-scale AEA panel. An exception is a corridor from Mongolia to the Caspian Sea (roughly E/W feature surrounded by E4–E7, E14, and E22), possibly reflecting genetic similarity over long distances brought about by the movements of Mongol and Turkic peoples, as the Kalmyk, Kazakhs, and Uygurs sampled in this corridor all have well-documented shared genetic ancestry with present-day populations of Southern Siberia and Mongolia (Yunusbayev et al. 2015). In Southeast Asia (fig. 2), troughs align with several straits in the Malay archipelago (S6–S8). On the other hand, we observe two major corridors, one from Taiwan/Luzon through Western Mindanao to Sulawesi, and one from Ternate through the Lower Sunda Islands (LSI) into Melanesia. These could be a reflection of the Austronesian expansion that started roughly 3,000 years ago (Duggan and Stoneking 2014). In Africa (fig. 2), a trough (A1) aligns with the Sahara desert and extends southeastward, roughly aligned with the language group boundaries between Niger-Congo and Afro-Asiatic language speakers (Campbell and Tishkoff 2008; supplementary fig. 7, Supplementary Material online).
The West African Afro-Asiatic-speaking Hausa and Mada, together with the admixed Fulani (Bryc et al. 2010), show low effective migration to coastal West African Bantu speakers (A8). In Central Africa, corridors connecting West Africa with East and Southern Africa may reflect the Bantu expansion, and the Biaka and Mbuti show low effective migration (A7) with surrounding Bantu and Nilo-Saharan populations. In both Central and Eastern Africa, Nilo-Saharan and Niger-Congo speakers overlap, resulting in low effective migration uncorrelated with language. Between Southern and Eastern Africans there is low effective migration through Mozambique and South-Western Tanzania (A4–A6). For a more detailed analysis, we constructed KhoeSan (SAKS, n = 109, 16 locales, FST = 0.025, fig. 2) and Bantu (SAB, n = 30, 11 locales, FST = 0.014; fig. 2) panels, which reveal very different spatial structuring. These results are broadly consistent with existing work on African population structure (Tishkoff et al. 2009; Bryc et al. 2010; Pickrell et al. 2012; Uren et al. 2016), and emphasize that African population structure appears largely determined by the Sahara desert, the Bantu and Arabic expansions, and the complex structure of hunter–gatherer groups specifically in South Africa. We also contrasted the EEMS results to those obtained with principal component analysis (PCA). Although PCA biplots typically reflect large-scale gradients of diversity in a panel, EEMS emphasizes local distortions, such as trough features that are often imperceptible in the PCA biplots (fig. 1; fig. 2; supplementary fig. 6, Supplementary Material online). This is due, in part, to geographical information allowing EEMS to discern subtle structure while controlling for the effects of uneven sampling (Petkova et al. 2016), whereas the objective function of PCA minimizes a Frobenius-norm reconstruction error and therefore emphasizes the largest pairwise genetic distances.
The maps we present provide compact summaries of the complex relationship of genes and geography in human populations. Most of the clearest features in these maps (e.g., the Alps, Sahara desert, Himalayas, W3, A1, E14; Nei and Roychoudhury 1993; Cavalli-Sforza et al. 1994; Bradburd et al. 2013) have been described previously and many represent regions where genetic, geographic, linguistic and ethnic differentiation all coincide. A subset of the trough features align with differences in subsistence strategies. Overall, the maps provided here support many previous inferences, typically made from more limited data sets, and provide an expanded demonstration of how human genetic diversity can reflect physical and cultural geography. In contrast to methods that identify short bursts of gene flow (“admixture”) between diverged populations (Patterson et al. 2012; Loh et al. 2013; Hellenthal et al. 2014), EEMS models local migration between nearby groups to represent heterogeneous isolation-by-distance patterns. This leads to a few limitations that must be considered in interpretation: First, spatially heterogeneous isolation-by-distance is a flexible model, but not necessarily flexible enough to capture the complexity of human histories. For instance, human groups often overlap spatially while maintaining differentiation or have undergone long-distance migration/admixture not included in our model. These latter cases can produce geographic “outliers” that are difficult for EEMS to model. A clear example is Madagascar in the large AEA panel, which in the PCA is shifted toward samples from S.E. Asia (fig. 1), presumably because of admixture from S.E. Asia to Madagascar (Kusuma et al. 2016). We found that running EEMS at high resolutions results in more interpretable plots as the surfaces can often accommodate modeling these samples within regions of relative isolation (e.g., A3 in the AFR panel models the differentiation of Madagascar from mainland samples, fig. 2).
Second, decisions regarding which samples to include will affect the outcome of any analysis. When there is a feature inferred in a region with few samples, the exact positioning of the inferred change on the map will be imprecise (e.g., W4 in fig. 2, presumably associated with the English Channel). The maps of posterior variance (supplementary figs. 2 and 4, Supplementary Material online) partly convey where there is uncertainty in positioning, but caution is still warranted as violations of the modeling assumptions will introduce further uncertainty. In other cases, the presence or absence of a particular group may impact the inference of corridors, sometimes depending on resolution. One example is the Kalmyk, a Mongolian people in Southern Russia. The Kalmyk are linked by a corridor to Mongolia (area surrounded by E22) in the CEA, but not the AEA panel; this corridor disappears in the CEA panel if the Kalmyks are excluded. Similarly, including the Eastern African Hadza and Sandawe (two language isolates) causes inference of a trough (eastern part of A1). This trough is broken up when we exclude these two samples. Another concern is that we merged data from studies whose sample inclusion criteria differ (e.g., four-grandparents from a single region vs. self-reported individual origin); however, based on exploratory analyses and the large spatial-scales treated here, we suspect these differences have minor effects on the overall landscapes inferred. Third, the scales of the effective migration rates need to be interpreted with care. In each of our analysis panels, the absolute levels of differentiation are consistently low across all populations. EEMS draws attention to where differentiation is slightly elevated or depressed relative to expectations from geographic distance. 
Low effective migration between a pair of populations does not imply a complete absence of migration nor large levels of absolute differentiation; conversely, high levels of effective migration do not imply present-day ongoing gene flow. The EEMS surface is best understood as a modeling construct to visualize a relationship between genes and geography that is nonuniform across space. In particular, migration features in the EEMS maps often align with known topography, past historical migrations, and/or linguistic/cultural distributions, but this is not an assessment of a causal connection. Formally testing the influence of specific features and environmental variables on migration rates remains an important future task that will require extending EEMS or using different frameworks (Hanks and Hooten 2013). Finally, it is worth reiterating that the maps inferred here represent a model of gene flow that predicts genetic diversity in humans sampled today; a fuller representation would represent genetic structure dynamically through time. This is especially relevant as ancient DNA data have recently suggested human population structure can be surprisingly dynamic (e.g., Lazaridis et al. 2014). We suspect that some of the corridors are revealing elevated genetic similarity that has arisen from major gene flow events (e.g., in the AEA analysis, the connectivity through the Pontic-Caspian Steppe may reflect the Bronze Age “Steppe” expansions inferred by Allentoft et al. 2015; Haak et al. 2015). Overall, our migration landscapes suggest an alternative perspective from the clusters versus clines paradigms for human structure (Rosenberg et al. 2002; Serre and Pääbo 2004; Rosenberg et al. 2005): By revealing both sharp and diffuse features that structure human genetic diversity, our results suggest that more continuous definitions of ancestry in human population genetics can complement principal component methods or models of discrete populations with admixture.
The results also help develop a more thorough geographic understanding of human genetic variation and its distribution. For instance, as rare variants are often geographically localized (Gibson 2012; Mathieson and McVean 2012), the maps presented here may be especially useful for predicting ancestries within which rare alleles (some of which will have medical relevance) might be contained. The maps also annotate features of present-day population structure that ancient DNA and historical/archaeological studies can aim to explain.

Materials and Methods

Merging Genetic Data

We obtained SNP genotype data from 27 different studies (supplementary table 1, Supplementary Material online). Processing was done using a reproducible snakemake pipeline (Köster and Rahmann 2012) available at http://github.com/NovembreLab/eems-merge, relying heavily on plink 1.9 (Chang et al. 2015) for handling genotypes. The sources differ in input format and preprocessing; in general, however, we performed the following steps: (1) remove all nonautosomal, non-SNP variants; (2) map SNPs to the forward strand of human reference genome b37 coordinates using chip manufacturer metadata files or SNP identifiers; and (3) remove strand-ambiguous A/T and G/C variants. The remaining SNPs were then merged using successive plink --bmerge commands into a single master data set with 9,003 individuals and 1.9 M SNPs but a total genotyping rate of only 20.6%. Forty-six SNPs were removed because different studies reported different alternative alleles. We used a relationship filter of 0.6 using the “--rel-cutoff 0.6” flag in plink to remove 667 closely related individuals or duplicates. After merging, each analysis panel had missingness rates ≤0.5% (AEA = 0.2%, WEA = 0.3%, CEA = 0.2%, SEA = 0.5%, AFR = 0.2%, SAHG = 0.1%). In all panels, all SNPs passed a one-sided HWE test (rejection threshold P < 10^−5), with the exception of SEA, where nine (out of 7,553 SNPs) failed and were excluded.
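The variant filters listed above can be illustrated with a minimal Python sketch. It is not the authors' pipeline (which relies on snakemake and plink); the function names and the tuple layout `(snp_id, chrom, allele1, allele2)` are hypothetical and chosen only for illustration.

```python
# Complementary base pairs on opposite DNA strands.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, whose two alleles are each other's complement."""
    return COMPLEMENT.get(allele1.upper()) == allele2.upper()


def filter_snps(snps):
    """Return ids of autosomal, single-base, non-strand-ambiguous variants.

    snps: iterable of (snp_id, chrom, allele1, allele2) tuples
    (a hypothetical layout, not a plink file format).
    """
    kept = []
    for snp_id, chrom, a1, a2 in snps:
        a1, a2 = a1.upper(), a2.upper()
        # Autosomes are chromosomes 1..22; X, Y, MT and similar are dropped.
        is_autosomal = str(chrom).isdigit() and 1 <= int(chrom) <= 22
        # A SNP here means two distinct single-base alleles.
        is_snp = a1 in COMPLEMENT and a2 in COMPLEMENT and a1 != a2
        if is_autosomal and is_snp and not is_strand_ambiguous(a1, a2):
            kept.append(snp_id)
    return kept
```

Strand-ambiguous variants are dropped because an A/T or C/G SNP reads the same on either strand, so a strand flip between studies cannot be detected from the alleles alone.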

Data Retrieval and Filtering

Human Origins Data Set

Sampling location information was obtained from table S9.4 of Lazaridis et al. (2014), and the data were shared by David Reich. We used the population information in the “vdata” subset of all ascertainment panels, except for the analysis where we assess ascertainment bias. The utility “convert” from “admixtools” (Patterson et al. 2012) was used to convert the data into plink format.

Estonian Biocentre Data

The data generated by the Estonian Biocentre (Behar et al. 2013; Cardona et al. 2014; Chaubey et al. 2011; Di Cristofaro et al. 2013; Fedorova et al. 2013; Kovacevic et al. 2014; Metspalu et al. 2011; Migliano et al. 2013; Pierron et al. 2014; Raghavan et al. 2014; Rasmussen et al. 2010, 2011; Skoglund et al. 2014; Yunusbayev et al. 2012, 2015) were provided in plink format by Mait Metspalu on October 30, 2015, along with location information where it was available. This data set contained 1,282,568 SNPs. Of those, 6,770 SNPs had nonunique ids and were removed.

HUGO Pan-Asian SNP Consortium

The data were downloaded on June 24, 2015 from www.biotec.or.th/PASNP (HUGO Pan-Asian SNP Consortium 2009). Location-metadata were obtained on the same day from the map on the same website, and individuals were matched to populations using the individual identifiers. All individuals with the same tag were assigned the median of all locations from that tag. The data were first lifted onto hg19 (with 5 out of 54,794 SNPs being removed), and then reformatted into binary plink format. Because of the small size of the chip used and the low overlap with the human origins array in particular, we only consider this data in the Southeast Asian panel.
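As a rough illustration of assigning each tag the median of its locations, the sketch below takes the median of latitude and longitude separately. Whether the authors used a coordinate-wise median is an assumption, and the record layout `(tag, latitude, longitude)` is hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_location_per_tag(records):
    """Map each tag to the coordinate-wise median of its locations.

    records: iterable of (tag, latitude, longitude) tuples (hypothetical layout).
    """
    by_tag = defaultdict(list)
    for tag, lat, lon in records:
        by_tag[tag].append((lat, lon))
    return {
        tag: (median(lat for lat, _ in pts), median(lon for _, lon in pts))
        for tag, pts in by_tag.items()
    }
```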

Uniform Global Sample

These data were downloaded on June 20, 2015 from http://jorde-lab.genetics.utah.edu/pub/affy6_xing2010/ (Xing et al. 2010). Sampling locations were provided by Jinchuan Xing. We used version 32 of the annotation file obtained on June 19, 2015 from affymetrix.com to map SNPs onto hg19, remove strand-ambiguous SNPs and to flip SNPs that were on the minus-strand.

POPRES Data

POPRES data were obtained under dbGaP study accession phs000145 to John Novembre. We used the data as processed in Novembre et al. (2008), retained only individuals for whom all grandparents were from the same country, and labeled the Swiss sample according to self-reported language (Nelson et al. 2008). We used version 32 of the annotation file obtained on June 19, 2015 from www.affymetrix.com (“Mapping250K_sp.na32.annot.csv” and “Mapping250K_Sty.na32.annot.csv”) to filter SNPs that did not map onto hg19 and we removed strand-ambiguous AT and GC polymorphisms.

African Data

Data from Bryc et al. (2009) and Hunter-Zinck et al. (2010) were obtained on April 19, 2017 from David Comas’ website under http://www.biologiaevolutiva.org/dcomas/?p=607. We used version 32 of the annotation file “GenomeWideSNP_6.na32.annot.csv” obtained on June 19, 2015 from affymetrix.com to map SNPs onto hg19, remove strand-ambiguous SNPs and to flip SNPs that were on the minus-strand.

Southeast Asian Data

The data were obtained on July 14, 2015 from Mark Stoneking in three different source files (Reich et al. 2011). After merging the three different source files, SNPs not mapping to hg19 using the annotation file “GenomeWideSNP_6.na32.annot.csv” were removed, as were AT and GC SNPs. Sampling locations were extracted from figure 1 of Reich et al. (2011).

Mediterranean Panel

Data were obtained on August 13, 2015 in binary plink format from http://drineas.org/Maritime_Route/RAW_DATA/PLINK_FILES/MARITIME_ROUTE.zip (Paschou et al. 2014). Sampling location information was obtained from supplementary table 3 in Paschou et al. (2014). SNPs not mapping to hg19 using the annotation file “GenomeWideSNP_6.na32.annot.csv” were removed, as were AT and GC SNPs.

Tibetan and Himalayan Data

Data from Bigham et al. (2010), Xu et al. (2011), and Jeong et al. (2017) were obtained from Choongwon Jeong and Anna Di Rienzo. We used the same filtering as in the Jeong et al. (2017) study, but only added the samples originating from these three studies with permission from the respective authors.

Combining Meta-Information

All sources with the exception of the Estonian Biocentre data provided (approximate) sampling coordinates. However, the level of accuracy varied between sources, with some providing specific ethnicities, some (such as POPRES) only providing country information and others just providing city- or state-level information. For POPRES-derived data, and most countries, we assigned individuals to the country’s centerpoint, with the exception of Sweden and Finland, which were assigned their capital. For the Estonian Biocentre data, sampling location data were highly heterogeneous. Samples that could not be confidently assigned to a region with an accuracy of 100 km were excluded. For populations with samples from multiple studies, the most accurate source location was used. For locations covered with different accuracy, only the most accurate samples were retained. For example, we dropped all Spanish individuals from POPRES (only country level data), as the Human Origins data provided higher resolution, with samples from eleven different regions in Spain. The resulting table is given as supplementary table S1, Supplementary Material online.

Language Data

To validate troughs correlating with presumed language barriers, we cross-referenced the genetic data with linguistic data from the Glottolog 3.2 database (Hammarström et al. 2018). To do so, we compared the correlation of pairwise genetic distance and geographic distances within and between pairs of language groups. As there was frequently no primary data recording the language of speakers, we proceeded as follows: For population identifiers that correspond to languages or ethnic groups with a clear majority language, we used that language. For samples with country-level information where the country has a clear majority language (e.g., Germany, Slovenia), that language was assigned (supplementary table S1, Supplementary Material online). Otherwise, if a sample was from a region with a clear majority language that is not obviously due to recent colonization, that language was assigned. All other samples were not assigned a language. For simplicity, we group Nilotic, Central Sudanic, and Mande languages into “Nilo-Saharan,” Khoe, Kxa, and Tuu speakers into “KhoeSan” and Armenic, Circassian, Kartvelian, and Nakh-Daghestanian into “Caucasus.” For all troughs, we test the hypothesis that they align with boundaries between linguistic groups by performing a partial Mantel test comparing genetic distances and language groups as a categorical variable, using the implementation in the R package “vegan” (Oksanen et al. 2007). We note that results need to be interpreted cautiously, as the Mantel test is generally poorly calibrated for spatially autocorrelated data (Guillot and Rousset 2013).

Samples Omitted from Model Fitting

Besides samples whose geographic origin we could not unambiguously assign (n = 74), we removed a small number of samples that would violate some assumptions of the EEMS model. In particular, we excluded all Jewish samples (n = 379), due to the complexity of the diaspora and subsequent local admixture (Behar et al. 2010), and Han Chinese in Taiwan and Singapore (n = 170), both of which are recent migrant populations in those locales. To avoid any possible distortion due to uneven sampling, we downsampled all single locales to at most 50 individuals, drawn independently for different panels. This resulted in a total of 6,066 individuals used in at least one panel (supplementary table S1, Supplementary Material online).
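The per-locale downsampling can be sketched as follows. This is an illustration only: the function name, the `(individual_id, locale)` layout, and the use of a seeded random generator are assumptions, not details taken from the authors' pipeline.

```python
import random
from collections import defaultdict


def downsample_locales(samples, max_per_locale=50, seed=0):
    """Keep at most max_per_locale individuals per locale, chosen at random.

    samples: iterable of (individual_id, locale) tuples (hypothetical layout).
    Using a different seed per panel gives independent draws across panels.
    """
    rng = random.Random(seed)
    by_locale = defaultdict(list)
    for individual, locale in samples:
        by_locale[locale].append(individual)

    kept = []
    for locale in sorted(by_locale):
        individuals = by_locale[locale]
        if len(individuals) > max_per_locale:
            # Sample without replacement from oversized locales.
            individuals = rng.sample(individuals, max_per_locale)
        kept.extend((individual, locale) for individual in individuals)
    return kept
```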

Visualization Pipeline

We developed a second pipeline using snakemake (Köster and Rahmann 2012) to perform all subsetting and demographic analyses, available under github.com/NovembreLab/eems-around-the-world. The pipeline allows for defining panels using a flexible set of features, including latitudinal and longitudinal boundaries, continent or country of samples, source study, as well as the addition and exclusion of particular samples or populations. Based on these subsets, different modules allow performing EEMS and PCA analyses, as well as generating all the figures, which were then annotated using the software Inkscape (http://inkscape.org; last accessed December 9, 2019). All configuration variables are stored in json and yaml config files. We perform EEMS and PCA for each panel independently. Structural variants are a potential confounding factor for genome-wide SNP-based analysis. In PCA, these variants may result in a number of neighboring SNPs in high LD having very high loadings, thus overemphasizing the effect of these variants. For this reason, it is advisable to remove regions containing SNPs that have extremely high loadings on some principal component. Thus, for each panel, we perform a preliminary PCA analysis using flashpca (Abraham and Inouye 2014). The loading-scores for each PC were normalized by dividing them by the standard deviations on each PC [outlier_score = L[i]/sd(L[i])], and then we removed a 200 kb window around any SNP for which |outlier_score| > 5. We also dropped individuals with >5% missingness, and SNPs with >1% missing data from each panel.
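The loading-based outlier filter can be sketched in plain Python. Two points are assumptions made for this illustration: the 200 kb window is taken to be centered on the outlier SNP (100 kb on each side), and `sd` is taken to be the sample standard deviation. The function name and input layout are hypothetical.

```python
from statistics import stdev


def snps_to_remove(loadings, chrom, pos, threshold=5.0, window_bp=200_000):
    """Return sorted indices of SNPs inside a window around any loading outlier.

    loadings: list of per-PC lists; loadings[k][j] is the loading of SNP j on PC k.
    chrom, pos: per-SNP chromosome labels and base-pair positions.
    """
    # Step 1: flag SNPs whose standardized loading exceeds the threshold on any PC.
    outliers = set()
    for pc in loadings:
        sd = stdev(pc)
        if sd == 0:
            continue  # a constant PC cannot contain outliers
        for j, value in enumerate(pc):
            if abs(value / sd) > threshold:
                outliers.add(j)

    # Step 2: mark every SNP on the same chromosome within half a window of an outlier.
    half = window_bp / 2
    remove = set()
    for i in outliers:
        for j in range(len(pos)):
            if chrom[j] == chrom[i] and abs(pos[j] - pos[i]) <= half:
                remove.add(j)
    return sorted(remove)
```

Removing a window rather than the single SNP matters because neighbors in high LD with the outlier carry much of the same signal.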

EEMS

To generate the map surfaces with EEMS (https://github.com/dipetkov/eems), we must choose a grid size and boundaries. Choosing a coarse grid results in faster computation, but only produces a map with broad-scale patterns. A finer grid, on the other hand, is able to reveal more details, but at a steep increase in computational cost and with an increased danger of introducing patterns that are harder to interpret. Grid density and sizes are given in supplementary table 1, Supplementary Material online, along with population level FST calculated using plink, and FST based on the mean migration rate inferred by EEMS and equilibrium stepping stone model theory (Slatkin 1991). We evaluated the impact of SNP ascertainment bias by running EEMS on the multiple, documented SNP ascertainment panels of the Human Origins data (Lazaridis et al. 2014). We found that while ascertainment bias has an effect on the heterozygosity surfaces that EEMS estimates, the migration surfaces remain relatively unaffected (supplementary fig. 1, Supplementary Material online). Therefore, we restrict our presentation to the migration surfaces. EEMS approximates a continuous region with a triangular grid, which has to be specified. We generated global geodesic graphs at three resolutions (approximate distance between demes of 120, 240, and 500 km, respectively) using dggrid v6.1 (Sahr et al. 2003) and intersected these graphs with the area representing each panel (figs. 1 and 2). For each panel, we performed four pilot runs of 2–8 million iterations each. The run with the highest likelihood was then used for a second set of four runs of 4–10 million iterations each, with the first 500,000 discarded as burn-in. The number of iterations was chosen such that the total computation time per single run was around 10 days. Every 20,000th iteration was sampled. All other (hyper-)parameters were kept at their default values (Petkova et al. 2016).
We compared EEMS to an isolation-by-distance model with a constant migration rate by refitting EEMS allowing only a single migration rate tile, but arbitrary diversity rate tiles using the otherwise same settings. The resulting log Bayes factors are given in supplementary table 2, Supplementary Material online.
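As a quick check on the run settings described above, the number of posterior samples retained per run follows from simple arithmetic. The sketch below assumes that the stated iteration counts include the burn-in and that exactly one sample is kept per 20,000 post-burn-in iterations; the function name is hypothetical, and EEMS itself may count thinned samples slightly differently.

```python
def retained_samples(n_iterations, burn_in=500_000, thin=20_000):
    """Number of MCMC samples kept after discarding burn-in and thinning.

    Assumes n_iterations includes the burn-in and that one sample is kept
    per `thin` post-burn-in iterations.
    """
    if n_iterations <= burn_in:
        return 0
    return (n_iterations - burn_in) // thin


# Shortest and longest runs in the second set (4 and 10 million iterations).
shortest = retained_samples(4_000_000)
longest = retained_samples(10_000_000)
```

Under these assumptions, each run in the second set contributes between 175 and 475 posterior samples.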

Evaluating Fit of EEMS and PCA to Genetic Distances

For EEMS, the posterior samples imply an expected distance matrix between populations. For PCA, the components and their loadings provide an approximation to the genetic distance matrix between individuals. We use the median PC values of individuals across two, ten, or 100 principal components to produce an expected genetic distance matrix between populations. For each method, the expected genetic distance matrices are compared with the observed matrices using a simple linear correlation computed between all pairwise distances.
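The PCA side of this comparison can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and data layout are hypothetical, and it assumes that the expected distance between two populations is the Euclidean distance between their per-component median PC coordinates, which the text does not specify (a squared or eigenvalue-weighted distance would be equally plausible readings).

```python
from itertools import combinations
from math import dist, sqrt
from statistics import median


def population_medians(pcs, labels, n_components):
    """Median PC coordinates per population over the first n_components PCs."""
    groups = {}
    for coords, label in zip(pcs, labels):
        groups.setdefault(label, []).append(coords[:n_components])
    return {pop: [median(col) for col in zip(*rows)]
            for pop, rows in groups.items()}


def expected_distances(medians):
    """Pairwise Euclidean distances between population medians (assumption)."""
    return {(a, b): dist(medians[a], medians[b])
            for a, b in combinations(sorted(medians), 2)}


def pearson(x, y):
    """Plain linear (Pearson) correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def fit_correlation(expected, observed):
    """Correlation between expected and observed distances over all pairs."""
    pairs = sorted(expected)
    return pearson([expected[p] for p in pairs], [observed[p] for p in pairs])
```

Here `pcs` holds one coordinate list per individual, `labels` gives each individual's population, and `observed` is a dictionary of observed genetic distances keyed by the same sorted population pairs as `expected`.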

63 in total

1. Shared and unique components of human population structure and genome-wide signals of positive selection in South Asia.

Authors: Mait Metspalu; Irene Gallego Romero; Bayazit Yunusbayev; Gyaneshwer Chaubey; Chandana Basu Mallick; Georgi Hudjashov; Mari Nelis; Reedik Mägi; Ene Metspalu; Maido Remm; Ramasamy Pitchappan; Lalji Singh; Kumarasamy Thangaraj; Richard Villems; Toomas Kivisild
Journal: Am J Hum Genet Date: 2011-12-09 Impact factor: 11.025

2. The genome-wide structure of the Jewish people.

Authors: Doron M Behar; Bayazit Yunusbayev; Mait Metspalu; Ene Metspalu; Saharon Rosset; Jüri Parik; Siiri Rootsi; Gyaneshwer Chaubey; Ildus Kutuev; Guennady Yudkovsky; Elza K Khusnutdinova; Oleg Balanovsky; Ornella Semino; Luisa Pereira; David Comas; David Gurwitz; Batsheva Bonne-Tamir; Tudor Parfitt; Michael F Hammer; Karl Skorecki; Richard Villems
Journal: Nature Date: 2010-06-09 Impact factor: 49.962

3. Estimating Barriers to Gene Flow from Distorted Isolation-by-Distance Patterns.

Authors: Harald Ringbauer; Alexander Kolesnikov; David L Field; Nicholas H Barton
Journal: Genetics Date: 2018-01-08 Impact factor: 4.562

4. Population genomics of Bronze Age Eurasia.

Authors: Morten E Allentoft; Martin Sikora; Karl-Göran Sjögren; Simon Rasmussen; Morten Rasmussen; Jesper Stenderup; Peter B Damgaard; Hannes Schroeder; Torbjörn Ahlström; Lasse Vinner; Anna-Sapfo Malaspinas; Ashot Margaryan; Tom Higham; David Chivall; Niels Lynnerup; Lise Harvig; Justyna Baron; Philippe Della Casa; Paweł Dąbrowski; Paul R Duffy; Alexander V Ebel; Andrey Epimakhov; Karin Frei; Mirosław Furmanek; Tomasz Gralak; Andrey Gromov; Stanisław Gronkiewicz; Gisela Grupe; Tamás Hajdu; Radosław Jarysz; Valeri Khartanovich; Alexandr Khokhlov; Viktória Kiss; Jan Kolář; Aivar Kriiska; Irena Lasak; Cristina Longhi; George McGlynn; Algimantas Merkevicius; Inga Merkyte; Mait Metspalu; Ruzan Mkrtchyan; Vyacheslav Moiseyev; László Paja; György Pálfi; Dalia Pokutta; Łukasz Pospieszny; T Douglas Price; Lehti Saag; Mikhail Sablin; Natalia Shishlina; Václav Smrčka; Vasilii I Soenov; Vajk Szeverényi; Gusztáv Tóth; Synaru V Trifanova; Liivi Varul; Magdolna Vicze; Levon Yepiskoposyan; Vladislav Zhitenev; Ludovic Orlando; Thomas Sicheritz-Pontén; Søren Brunak; Rasmus Nielsen; Kristian Kristiansen; Eske Willerslev
Journal: Nature Date: 2015-06-11 Impact factor: 49.962

5. Upper Palaeolithic Siberian genome reveals dual ancestry of Native Americans.

Authors: Maanasa Raghavan; Pontus Skoglund; Kelly E Graf; Mait Metspalu; Anders Albrechtsen; Ida Moltke; Simon Rasmussen; Thomas W Stafford; Ludovic Orlando; Ene Metspalu; Monika Karmin; Kristiina Tambets; Siiri Rootsi; Reedik Mägi; Paula F Campos; Elena Balanovska; Oleg Balanovsky; Elza Khusnutdinova; Sergey Litvinov; Ludmila P Osipova; Sardana A Fedorova; Mikhail I Voevoda; Michael DeGiorgio; Thomas Sicheritz-Ponten; Søren Brunak; Svetlana Demeshchenko; Toomas Kivisild; Richard Villems; Rasmus Nielsen; Mattias Jakobsson; Eske Willerslev
Journal: Nature Date: 2013-11-20 Impact factor: 49.962

6. Inferring admixture histories of human populations using linkage disequilibrium.

Authors: Po-Ru Loh; Mark Lipson; Nick Patterson; Priya Moorjani; Joseph K Pickrell; David Reich; Bonnie Berger
Journal: Genetics Date: 2013-02-14 Impact factor: 4.562

7. Second-generation PLINK: rising to the challenge of larger and richer datasets.

Authors: Christopher C Chang; Carson C Chow; Laurent Cam Tellier; Shashaank Vattikuti; Shaun M Purcell; James J Lee
Journal: Gigascience Date: 2015-02-25 Impact factor: 6.524

8. Reconstructing Indian population history.

Authors: David Reich; Kumarasamy Thangaraj; Nick Patterson; Alkes L Price; Lalji Singh
Journal: Nature Date: 2009-09-24 Impact factor: 49.962

9. A quantitative comparison of the similarity between genes and geography in worldwide human populations.

Authors: Chaolong Wang; Sebastian Zöllner; Noah A Rosenberg
Journal: PLoS Genet Date: 2012-08-23 Impact factor: 5.917

10. Visualizing spatial population structure with estimated effective migration surfaces.

Authors: Desislava Petkova; John Novembre; Matthew Stephens
Journal: Nat Genet Date: 2015-12-07 Impact factor: 38.330
