Literature DB >> 35561686

The genomic origins of the world's first farmers.

Nina Marchi1, Laura Winkelbach2, Ilektra Schulz3, Maxime Brami2, Zuzana Hofmanová4, Jens Blöcher2, Carlos S Reyna-Blanco3, Yoan Diekmann5, Alexandre Thiéry1, Adamandia Kapopoulou1, Vivian Link3, Valérie Piuz6, Susanne Kreutzer2, Sylwia M Figarska2, Elissavet Ganiatsou7, Albert Pukaj2, Travis J Struck8, Ryan N Gutenkunst8, Necmi Karul9, Fokke Gerritsen10, Joachim Pechtl11, Joris Peters12, Andrea Zeeb-Lanz13, Eva Lenneis14, Maria Teschler-Nicola15, Sevasti Triantaphyllou16, Sofija Stefanović17, Christina Papageorgopoulou7, Daniel Wegmann18, Joachim Burger19, Laurent Excoffier20.   

Abstract

The precise genetic origins of the first Neolithic farming populations in Europe and Southwest Asia, as well as the processes and the timing of their differentiation, remain largely unknown. Demogenomic modeling of high-quality ancient genomes reveals that the early farmers of Anatolia and Europe emerged from a multiphase mixing of a Southwest Asian population with a strongly bottlenecked western hunter-gatherer population after the last glacial maximum. Moreover, the ancestors of the first farmers of Europe and Anatolia went through a period of extreme genetic drift during their westward range expansion, contributing highly to their genetic distinctiveness. This modeling elucidates the demographic processes at the root of the Neolithic transition and leads to a spatial interpretation of the population history of Southwest Asia and Europe during the late Pleistocene and early Holocene.
Copyright © 2022 The Author(s). Published by Elsevier Inc. All rights reserved.

Entities:  

Keywords:  Neolithic transition; ancient genomics; demogenomic modeling; demographic inference; demographic processes; human evolution; population admixture; upper Palaeolithic

Mesh:

Substances:

Year:  2022        PMID: 35561686      PMCID: PMC9166250          DOI: 10.1016/j.cell.2022.04.008

Source DB:  PubMed          Journal:  Cell        ISSN: 0092-8674            Impact factor:   66.850


Introduction

Genetic analyses of skeletal remains from prehistoric sites have greatly enriched our knowledge of the changes that brought sedentism and food-production, along with new people, material culture and practices, to Europe approximately 8.6 kya (kiloyears ago) through processes often described in archaeology as the “Neolithic revolution” (Childe, 1936). These processes are thought to have reached a tipping point ∼11.7 kya in Southwest Asia, where plants and animals were first domesticated (Fuller et al., 2011; Peters et al., 1999). From this region, it is widely agreed that farming spread into Europe along two main routes, the “Mediterranean” route and the “Danubian” route (Shennan, 2018); it is the latter route that forms the focus of the research described here. Despite this well-developed archaeological narrative, the genetic origins of the world’s first farmers and the spatiotemporal scope of the processes involved remain elusive, in large part due to the lack of high-quality ancient genomes derived from the populations involved in the crucial initial phases of farming expansion. To date, palaeogenetic studies have established that European early farmers (EFs) were genetically distinct from contemporary Holocene hunter-gatherers (HGs) inhabiting the continent (Bramanti et al., 2009; Skoglund et al., 2012). Apparently, genetic exchange between EFs and HGs appear to have been limited in the early phases of the agricultural expansion, with more intense exchange taking place in the later stages (Lazaridis et al., 2014; Mathieson et al., 2015). Most farmers present in Continental Europe around 9 kya appear to have descended from populations inhabiting the Aegean basin and the Eastern Marmara region (Hofmanová et al., 2016), but their ultimate genetic and geographic origins are still a matter of debate. Early Neolithic (EN) farmers from the Aegean are clearly related to Central Anatolian farmers (Kılınç et al., 2017), but they also show affinities with Pre-Pottery Neolithic farmers of the Southern Levant (Lazaridis et al., 2016). This suggests a common origin of all these populations prior to the westward spread of agriculture (Kılınç et al., 2016), potentially in the Fertile Crescent area, an archaeologically significant region that contained parts of modern-day Iran, Iraq, Israel, Palestine, Jordan, Lebanon, Syria, and Turkey. However, research has also revealed that Aegean farmers are genetically distinct from early farming populations from the eastern wing of the Fertile Crescent, the Zagros region of Iran and northern Iraq, which may indicate parallel adoption of farming practices by genetically distant groups of HGs across Southwest Asia (Broushaki et al., 2016). Furthermore, there is some evidence of genetic continuity between Epipalaeolithic and Neolithic populations of Central Anatolia (Feldman et al., 2019), suggesting local transitions to agriculture without major gene flow. To make the picture even more complex, some Central Anatolian EFs also show genetic affinities to Caucasus HGs as represented by a 10th millennium before present (BP) genome from Kotias in Western Georgia (Jones et al., 2015; Kılınç et al., 2016; Skourtanioti et al., 2020). Caucasus HGs are themselves thought to be closely related to early Iranian Neolithic farmers (Lazaridis et al., 2016) as well as to later Pontic-Caspian steppe pastoralists (Lazaridis, 2018; Mathieson et al., 2018; Narasimhan et al., 2019). Despite efforts to understand the genomic history of early farming populations in Europe and Southwest Asia, we still lack a detailed historical scenario of population demography, divergence, and migration embedded in geographic and temporal frameworks. Only a few analyses have previously estimated divergence times between ancestors of Neolithic and HG groups (Broushaki et al., 2016; Jones et al., 2015); the models used were additionally simplistic. The palaeogenetic conclusions outlined above mainly derive from the interpretation of descriptive analyses and summary statistics (e.g., principal component and admixture analyses, f-statistics), usually computed on low coverage genomes and/or on ascertained sets of SNPs (Patterson et al., 2012). Such data are difficult to integrate into “demogenomic” (shorthand for demographic modeling using genomic information) analyses. The goal of this paper is to reconstruct the ancestry of Southwest Asian and European EFs and the processes that contributed to their differentiation from HGs. To do so, we produced 15 high-resolution genomes of early Holocene farmers and HGs distributed along a geographical and temporal transect reaching from Southwest Asia to the Rhine valley in West-Central Europe (Figure 1). DNA was extracted from skeletons recovered from some of the most important archaeological sites in early Holocene Europe and Anatolia, including the first farming villages in the Aegean basin (Barcın, Aktopraklık, Nea Nikomedeia); Mesolithic and Neolithic sites from the Iron Gates gorges and other areas of the Central Balkans (Lepenski Vir, Vlasac, Grad-Starčevo, Vinča-Belo Brdo); and the oldest cemeteries and mass grave or “massacre” sites of the Central European EN (Kleinhadersdorf, Asparn-Schletz, Essenbach-Ammerbreite, Dillingen-Steinheim, Herxheim) (Table 1; Figure 1).
Figure 1

Spatial and temporal distribution of the ancient genomes analyzed in this study

(A) Location of archaeological sites with newly sequenced genomes and additional genomes used for modeling: Neolithic (black) and Mesolithic or Palaeolithic (red); different chronological phases of Neolithic expansion (colored areas) and archaeological cultures (blue) along the Danubian route of Neolithization; geographical areas (purple).

(B) Chronological distribution of the 25 genomes analyzed in this study, with the 15 newly sequenced genomes in bold, and the previously published genomes in italics (details in Tables 1 and S3). We also list the cultural groups (EFs, early farmers; HGs, hunter-gatherers), the regions and the archaeological sites where ancient individuals were sampled. The chronological interval at 2 sigma (95.4% probability) is shown for each directly 14C-dated sample, except for Stuttgart and Ess7, for which approximate dates are given based on the archaeological context.

Table 1

Archaeological and genetic information for newly sequenced genomes

IndividualPeriod (culture)SiteCountryAge (cal. BP)Mean depth (X)Genetic sexHaplogroups mtDNAHaplogroups Y
VLASA7LMVlasacSerbia8,764–8,34015.21MU5a2aI2
VLASA32LMVlasacSerbia9,741–9,46812.65MU5a2aR1b1
AKT16ENAktopraklıkTurkey8,635–8,46012.25FK1a3
Bar25ENBarcınTurkey8,384–8,20512.65MN1a1a1G2a2b2a1
Nea3ENNea NikomedeiaGreece8,327–8,04011.57FK1a2c
Nea2ENNea NikomedeiaGreece8,173–8,02312.51FK1a
LEPE48TENLepenski VirSerbia8,012–7,86710.92MK1a1C1a2b
LEPE52E-MNLepenski VirSerbia7,931–7,69312.37MH3G2a2b2a1a1c
STAR1EN (Starčevo)Grad-StarčevoSerbia7,589–7,47610.55FT2e2
VC3-2EN (Starčevo)Vinča-Belo BrdoSerbia7,565–7,42611.22MHV-16311G2a2a1a3
Asp6EN (LBK)Asparn-SchletzAustria7,575–7,47412.11MU5a1c1G2a2b2a3
Klein7EN (LBK)KleinhadersdorfAustria7,244–7,00011.30FW1-119
Dil16EN (LBK)Dillingen-SteinheimGermany7,235–6,99810.60MJ1c6C1a2b
Ess7EN (LBK)Essenbach-AmmerbreiteGermany(7,050–6,900 BP)12.34MU5b2c1G2a2b2a1a1
HerxEN (LBK)HerxheimGermany7,164–6,99311.46FK1a4a1i

LM, late Mesolithic; EN, early Neolithic; TEN, transformational/early Neolithic; E-MN, early-middle Neolithic; LBK, Linearbandkeramik. Samples with genetic sex determined as XX and XY are noted as F and M, respectively.

The samples’ ages are based on 14C dating (95.4% probability), except Ess7, for which an approximate date is given based on the archaeological context.

Spatial and temporal distribution of the ancient genomes analyzed in this study (A) Location of archaeological sites with newly sequenced genomes and additional genomes used for modeling: Neolithic (black) and Mesolithic or Palaeolithic (red); different chronological phases of Neolithic expansion (colored areas) and archaeological cultures (blue) along the Danubian route of Neolithization; geographical areas (purple). (B) Chronological distribution of the 25 genomes analyzed in this study, with the 15 newly sequenced genomes in bold, and the previously published genomes in italics (details in Tables 1 and S3). We also list the cultural groups (EFs, early farmers; HGs, hunter-gatherers), the regions and the archaeological sites where ancient individuals were sampled. The chronological interval at 2 sigma (95.4% probability) is shown for each directly 14C-dated sample, except for Stuttgart and Ess7, for which approximate dates are given based on the archaeological context. Archaeological and genetic information for newly sequenced genomes LM, late Mesolithic; EN, early Neolithic; TEN, transformational/early Neolithic; E-MN, early-middle Neolithic; LBK, Linearbandkeramik. Samples with genetic sex determined as XX and XY are noted as F and M, respectively. The samples’ ages are based on 14C dating (95.4% probability), except Ess7, for which an approximate date is given based on the archaeological context.

Results

Genetic structure and affinities of ancient individuals

Multidimensional scaling (MDS) performed on the neutral portion of the genome (Figure 2A) of ancient individuals and modern reference populations reveals three clusters of ancient individuals: (1) European HGs, (2) western EFs, i.e., EFs from Europe and Anatolia, (3) an EF individual from Iran (WC1) and a Mesolithic HG from Caucasus (KK1). Consistent with previous analyses based on ascertained SNPs (Marcus et al., 2020; Skoglund et al., 2012), the western EFs show strongest affinities with modern Sardinians, with the exception of one English (CarsPas1) and two Northwest Anatolian EFs (Bar8 and AKT16), who are found to cluster with modern individuals from other parts of Southern Europe. In contrast, Palaeolithic and Mesolithic European HGs are genetically well differentiated from all modern Europeans. The Iranian EF and the Caucasus HG appear to be genetically close to modern populations from their sampling area, in keeping with some long-term genetic continuity in those regions. This observation is even more pronounced when performing a MDS analysis on the whole genome, including sites potentially affected by selection (Figure M1_5 from Methods S1). Generally, ancient individuals appear to be closer to modern ones, once the MDS is computed on the whole genome instead of just neutral sites. This could be explained by a slower evolution of genomic regions influenced by background selection (Pouyet et al., 2018). Another striking difference visible on the whole-genome MDS plot (Figure M1_5) is that western EFs are closer to some other Southern Europeans than to Sardinians.
Figure 2

Genetic structure and affinities of ancient individuals

(A) Multidimensional scaling (MDS) analysis performed on the neutrally evolving portion of ancient (n = 25; large filled circles and squares: early farmers; triangles: hunter-gatherers) and modern (n = 65, shown as small circles) genomes from Europe and SW Asia. Ellipses highlight two clusters of ancient individuals: the European hunter-gatherers (HGs) and the European and Anatolians early farmers (i.e., western EFs).

(B) Admixture analyses (K = 3) performed on 22 ancient genomes (three genomes with the lowest quality were discarded: Bichon, Bon002, and Bar8). Note that AKT16 in NW Anatolia is more admixed than a neighbor genome from the Barcın site (Bar25), in keeping with f-statistics analyses (see Figure S1), which has led us to consider them as originating from independent populations in our demographic modeling.

(C) Heterozygosity computed at neutral sites in ancient genomes (HGs in blue, EFs in green).

(D) Runs of homozygosity (ROHs) computed on imputed ancient whole genomes for intermediate ROHs (2–10 Mb, dark color) or long ROHs (>10 Mb, light color), indicative of background relatedness in small populations and close inbreeding, respectively.

See also Figure S1.

Genetic structure and affinities of ancient individuals (A) Multidimensional scaling (MDS) analysis performed on the neutrally evolving portion of ancient (n = 25; large filled circles and squares: early farmers; triangles: hunter-gatherers) and modern (n = 65, shown as small circles) genomes from Europe and SW Asia. Ellipses highlight two clusters of ancient individuals: the European hunter-gatherers (HGs) and the European and Anatolians early farmers (i.e., western EFs). (B) Admixture analyses (K = 3) performed on 22 ancient genomes (three genomes with the lowest quality were discarded: Bichon, Bon002, and Bar8). Note that AKT16 in NW Anatolia is more admixed than a neighbor genome from the Barcın site (Bar25), in keeping with f-statistics analyses (see Figure S1), which has led us to consider them as originating from independent populations in our demographic modeling.
Figure S1

Population grouping verified with f-statistics, related to Figure 2 and Methods S1

These analyses were performed on the 1240k dataset.

(A) We first tested if some individuals of a specific group had significantly more shared ancestry with individuals of a different group using f-statistics of the form D(Ind1 from a population, Ind2 from the same population; Test, Mbuti [outgroup]). We only found three significant absolute Z scores (>3.0, yellow). For Austria, Asp6 appears to share more ancestry with VLASA7 than Klein7. Since variation in European HG ancestry is expected in EF populations due to the ongoing process of admixture and since these samples did not show variation in their affinities to other EF samples, modeling them as a single population seems justified. For NWAnatolia, AKT16 was found to share significantly more ancestry with both Loschbour and Bichon, which we further investigated in (B).

(B) To shed more light on the variation in European HG ancestry among Anatolian and Greek samples, we calculated f-statistics of the form D(NGreece/SGreece/NWAnatolia/CAnatolia, CAnatolia; HG_west, Mbuti [outgroup]), where we use HG_west to denote both West 1 and West 2 European HGs. This test indicates whether the tested individual/population from NGreece, SGreece, NWAnatolia, or CAnatolia (left) shares more (orange, Z score > 0) or less (blue, Z score < 0) ancestry with the tested HG_west individual (bottom) than the tested individual from CAnatolia (top). Significant Z scores above 3.0 or below −3.0 are shown with more intense colors. Among the CAnatolian individuals, Pınarbaşı and Boncuklu_N appear to share excess drift with HG_west when compared with individuals from Greece and NWAnatolia, in contrast to Tepecik-Çiftlik_N. Among the individuals from NGreece, SGreece, and NWAnatolia, AKT16 appears closest to HG_west, and much closer than Bar25. In light of these results, we modeled AKT16 and Bar25 independently in the demographic inferences.

(C) Heterozygosity computed at neutral sites in ancient genomes (HGs in blue, EFs in green). (D) Runs of homozygosity (ROHs) computed on imputed ancient whole genomes for intermediate ROHs (2–10 Mb, dark color) or long ROHs (>10 Mb, light color), indicative of background relatedness in small populations and close inbreeding, respectively. See also Figure S1.

Demogenomic modeling assumptions

To progress our research, we first contrasted alternative models of population differentiation using high-resolution ancient genomes drawn from each of the three MDS clusters described above. Sampled individuals from each cluster were assumed to derive from populations that belong to a large structured metapopulation, consisting of interconnected but mostly unsampled populations. Such a model is described in the literature as the “continent-island” model (Excoffier, 2004, 2013) with sampled populations (the islands) receiving a single pulse of gene flow from the metapopulation (the continent) shortly before their sampling time. The three metapopulations were designated Western, Central, and Eastern to reflect the geographic distribution of the ancient genomes across Europe and SW Asia. These three metapopulations represent the pools of western European HGs, western EFs from Europe and Anatolia, and Iranian EFs, respectively. Additional ancient individuals were then gradually added to the initial model to infer their ancestry and relationships with other populations (see Methods S1—Demogenomic inference with fastsimcoal2). Thus, increasingly complex models could be built and tested, resulting in the demographic scenario shown in Figure 3.
Figure 3

Demographic scenario inferred from genomic modeling

This demographic history was obtained by compiling the best models of all tested scenarios (see Methods S1— Demogenomic inference with fastsimcoal2—Final model). Times of the events (y axis) and population ages (shown below their symbols) are indicated in ky BP. Under each population name, we indicate their sampled genomes, their associated inbreeding coefficients (Fis), and their diploid effective population sizes (Ne). Unfilled symbols indicate ancestral populations that we simulated after or before key events (split times or admixture events). The X symbols indicate bottlenecks that occurred on ancestral branches, modeled as a one-generation bottleneck through a population with its effective size shown in italics. Admixture proportions >10% from the Western metapopulation are indicated by blue arrows.

See also Figure S2 for effective populations sizes inferred by MSMC2.

Demographic scenario inferred from genomic modeling This demographic history was obtained by compiling the best models of all tested scenarios (see Methods S1— Demogenomic inference with fastsimcoal2—Final model). Times of the events (y axis) and population ages (shown below their symbols) are indicated in ky BP. Under each population name, we indicate their sampled genomes, their associated inbreeding coefficients (Fis), and their diploid effective population sizes (Ne). Unfilled symbols indicate ancestral populations that we simulated after or before key events (split times or admixture events). The X symbols indicate bottlenecks that occurred on ancestral branches, modeled as a one-generation bottleneck through a population with its effective size shown in italics. Admixture proportions >10% from the Western metapopulation are indicated by blue arrows. See also Figure S2 for effective populations sizes inferred by MSMC2.
Figure S2

MSMC2 effective population size estimates, related to Figure 3 and Methods S1

These were obtained for populations either with four haplotypes or two haplotypes when only one ancient individual was available for the population (shown in the legend with ∗, details in Table M1_3, population sizes estimated from single ancient and modern genomes are shown in Figure M1_12). We used a mutation rate of 1.25 × 10−8 per generation per site and a generation time of 29 years. The analysis suggests smaller effective population sizes in the most recent times for HGs compared with EFs.

All western EFs share a remote common ancestry with Caucasus HGs

In contrast with previous studies (Lazaridis et al., 2016), we find that Caucasus HGs (represented by KK1) and western EFs are all descended from a population ancestral to the Central metapopulation. This is in line with a recent genetic study showing that KK1 was more closely related to EFs than to western European HGs (Speidel et al., 2021). This ancestral Central metapopulation received about 14% (95% CI 8–26) of its gene pool from the Western metapopulation some 14.2 kya (95% CI 13.7–19.0, Figures 3 and M1_18). Ancestors of the Iranian Neolithic population (represented by WC1) were not affected by this initial admixture: they rather diverged from the Eastern metapopulation 13.6 kya (95% CI 11–24.6) after its split from the Central metapopulation ∼15.8 kya (95% CI 14.3–25.6). Even though Caucasus HGs show closer genetic affinities with EFs from Iran (Figures 2A and 2B), our analyses suggest that they share a common ancestry with all western EFs.

Ancestors ofwestern EFs admixed twice with western HGs

We find that the ancestors of western EFs received a second pulse of gene flow (15%, 95% CI 6–17) from the Western metapopulation ∼12.9 kya (95% CI 9.4–13.9), while Caucasus HGs did not (Figure M1_20B). Models that do not include this additional admixture have a lower likelihood and are therefore rejected (Figure M1_20A). Thus, the ancestors of western EFs are the product of repeated episodes of gene flow from the Western metapopulation. These populations have then diverged from Caucasus HGs due to an intense period of genetic drift between 12.9 and 9.1 kya (Figures 3 and 4). Indeed, we find that their effective population size was reduced to 620 individuals (95% CI 72–2,150) during this relatively long period of drift, which caused them to not only diverge genetically from their ancestral population but also from Caucasus and European HGs, and from Iranian EFs (Figure 4).
Figure 4

Evolutionary insights gained from the demographic scenario shown in Figure 3

(A) MDS analysis done on 12 populations used in the demogenomic analyses and on simulated ancestral populations (unfilled symbols) sampled at key moments of their history, as defined on Figure 3: (1) on the ancestral branch before the split between Western and Eastern metapopulations 25.6 kya; (2) on the Central metapopulation branch just before and after the admixture occurring 14.2 kya; (3) on the Western metapopulation branch just before this admixture; (4) on the Eastern metapopulation branch at the time of split of the Iranian population 13.6 kya; (5) at the top of the western EFs ancestors branch just after its admixture with the Western metapopulation (12.9 kya), and then every 25 generations until the split of the Aegean populations 9.3 kya. Arrows indicate the trajectory of the populations caused by important demographic events (i.e., admixture events, bottlenecks, episodes of drift).

(B) Admixture plot for K = 3 performed on sampled and ancestral populations.

See also Figure S3 for corresponding admixture plots done on observed and simulated individuals.

Evolutionary insights gained from the demographic scenario shown in Figure 3 (A) MDS analysis done on 12 populations used in the demogenomic analyses and on simulated ancestral populations (unfilled symbols) sampled at key moments of their history, as defined on Figure 3: (1) on the ancestral branch before the split between Western and Eastern metapopulations 25.6 kya; (2) on the Central metapopulation branch just before and after the admixture occurring 14.2 kya; (3) on the Western metapopulation branch just before this admixture; (4) on the Eastern metapopulation branch at the time of split of the Iranian population 13.6 kya; (5) at the top of the western EFs ancestors branch just after its admixture with the Western metapopulation (12.9 kya), and then every 25 generations until the split of the Aegean populations 9.3 kya. Arrows indicate the trajectory of the populations caused by important demographic events (i.e., admixture events, bottlenecks, episodes of drift). (B) Admixture plot for K = 3 performed on sampled and ancestral populations. See also Figure S3 for corresponding admixture plots done on observed and simulated individuals.
Figure S3

Comparison of observed and simulated Admixure plots, related to Figure 4 and Methods S1.

Admixture plot for K = 2 (left panel) and K = 3 (right panel) carried out on (A) observed data for 16 ancient genomes included into fastsimcoal2 demographic inferences (909,688 sites without missing data; Bichon, Bar8, and Bon002 were not included because of lower quality) and on (B) data simulated accordingly to our final model (shown in Figure 3), for the same subset of individuals as in (A).

Anatolian and Aegean farmers differentiation

Populations from Northwest Anatolia (the archaeological sites of Aktopraklık and Barcın) and Northern Greece (Nea Nikomedeia) appear to have diverged from one another at about the same time ∼9.1–9.3 kya (95% CI 9.1–12, Figures M1_20 and M1_22), potentially during the colonization of the wider Aegean area by EFs. In contrast, EFs from Central Anatolia (represented by a genome from Boncuklu) diverged at least 1,000 years earlier ∼10.5 kya (95% CI 10.5–11, Figure M1_24). That Anatolian and Aegean populations show varying amounts of recent gene flow from the Western metapopulation suggests different levels of interaction with surrounding HGs. Indeed, genomes from Northern Greece show a lower degree of further HG introgression (3%, 95% CI 1–11) than those of Boncuklu (10%, 95% CI 3–15), Barcın (12%, 95% CI 6–16), and especially Aktopraklık (17%, 95% CI 11–18) (Table S4). The high level of Western metapopulation admixture found in Aktopraklık, a site previously described as influenced by both Epipalaeolithic and Neolithic traditions (Özdoğan, 2011), is in line with the admixture analysis (Figure 2B) and the f-statistics that show larger genetic affinities of AKT16 with European HGs than other western EFs (Figure S1). Population grouping verified with f-statistics, related to Figure 2 and Methods S1 These analyses were performed on the 1240k dataset. (A) We first tested if some individuals of a specific group had significantly more shared ancestry with individuals of a different group using f-statistics of the form D(Ind1 from a population, Ind2 from the same population; Test, Mbuti [outgroup]). We only found three significant absolute Z scores (>3.0, yellow). For Austria, Asp6 appears to share more ancestry with VLASA7 than Klein7. Since variation in European HG ancestry is expected in EF populations due to the ongoing process of admixture and since these samples did not show variation in their affinities to other EF samples, modeling them as a single population seems justified. For NWAnatolia, AKT16 was found to share significantly more ancestry with both Loschbour and Bichon, which we further investigated in (B). (B) To shed more light on the variation in European HG ancestry among Anatolian and Greek samples, we calculated f-statistics of the form D(NGreece/SGreece/NWAnatolia/CAnatolia, CAnatolia; HG_west, Mbuti [outgroup]), where we use HG_west to denote both West 1 and West 2 European HGs. This test indicates whether the tested individual/population from NGreece, SGreece, NWAnatolia, or CAnatolia (left) shares more (orange, Z score > 0) or less (blue, Z score < 0) ancestry with the tested HG_west individual (bottom) than the tested individual from CAnatolia (top). Significant Z scores above 3.0 or below −3.0 are shown with more intense colors. Among the CAnatolian individuals, Pınarbaşı and Boncuklu_N appear to share excess drift with HG_west when compared with individuals from Greece and NWAnatolia, in contrast to Tepecik-Çiftlik_N. Among the individuals from NGreece, SGreece, and NWAnatolia, AKT16 appears closest to HG_west, and much closer than Bar25. In light of these results, we modeled AKT16 and Bar25 independently in the demographic inferences.

A stepwise, demic expansion of Neolithic farmers into Central Europe

To understand the progressive spread of EFs into Europe, we expanded the existing demographic model by including three EN populations from Serbia, Austria, and Germany (Figure M1_25). Even though the model is not spatially explicit, aspects of the EFs spread can be inferred from the spatial and temporal distribution of the archaeological sites. We find that a simple model with a strict stepwise migration of EFs originating in the wider Aegean region (Northwest Anatolia or Northern Greece) and extending to Serbia through the Balkans and along the so-called Danubian corridor, then to Austria, and eventually Germany, is better supported than a scenario allowing for long-distance migration from the Aegean directly to Austria (Figure M1_26A; Table S4). We also find that early farming communities incorporated a few HG individuals at each modeled stage of their dispersal along the Danubian corridor (2%–7%), compatible with previous estimates of 3%–9% (Hofmanová et al., 2016; Lipson et al., 2017; Nikitin et al., 2019; Figure M1_26B). Scenarios without Western metapopulation introgression into EF populations have a lower likelihood than scenarios with introgression (Figure M1_26A). Even though we modeled this introgression from the Western metapopulation that is closely related to the newly sequenced genomes from the Mesolithic site of Vlasac, we cannot exclude that it actually occurred from other European HG groups like those related to Loschbour and Bichon. Previous genetic studies have indeed suggested that different Mesolithic backgrounds could have introgressed into the EF gene pool in different regions of Europe (Lipson et al., 2017).

A last glacial maximum divergence between Eastern and Western metapopulations

Our model also provides important insights regarding the deep branching of pre-Neolithic populations. The divergence between the ancestors of the Western and Eastern metapopulations is estimated to date back to ∼25.6 kya (95% CI 17.3–31.3, Figure M1_18). This is much younger than the previously inferred divergence time between the ancestors of western European HGs and either Iranian EFs (46–77 kya; Broushaki et al., 2016) or European EFs (46 kya; Jones et al., 2015). However, these previous divergence times were obtained using relatively simple models without bottlenecks and assuming topologies with only recent or even no admixture at all. We have explored additional scenarios to evaluate the effects of these simplifications on metapopulation divergence times. As expected, a model without bottlenecks on the metapopulation branches leads to a much older divergence time of 39 kya between Eastern and Western metapopulations (Table S4), which is more in line with previous estimates, but this model is inherently less likely than the original one (Figure M1_18A). On the other hand, models without any admixture between the Western HG metapopulation and the ancestors of western EFs lead to a much younger divergence time between the Eastern and Western HG metapopulations (16 kya, Table S4), but the fit with the data is very poor (Figure M1_18A). Comparing two western European HGs from the sites of Bichon and Loschbour with our newly sequenced Mesolithic individuals from Vlasac further reveals that European HG populations had already split during the last glacial maximum (LGM) ∼22.8 kya (95% CI 16.7–24.7, Figure M1_28), with Bichon and Loschbour populations subsequently diverging from one another, approximately 1,000 years later.

The reduced diversity in European HGs is due to a massive LGM bottleneck

Genetic diversity as quantified by the heterozygosity at neutral sites is much lower in HGs than in EFs (Figure 2C), excepting Northwest Anatolian EFs in line with previous studies (Kılınç et al., 2016; Kousathanas et al., 2017). HG genomes furthermore show a generally larger proportion of intermediate runs of homozygosity (ROHs) (2–10 Mb ROHs, Figure 2D; Figure M1_7; Ringbauer et al., 2021) indicative of background relatedness within European HGs, usually attributed to small population size (Ceballos et al., 2021)—a small population size is also observed in MSMC2 analyses (Figure S2). MSMC2 effective population size estimates, related to Figure 3 and Methods S1 These were obtained for populations either with four haplotypes or two haplotypes when only one ancient individual was available for the population (shown in the legend with ∗, details in Table M1_3, population sizes estimated from single ancient and modern genomes are shown in Figure M1_12). We used a mutation rate of 1.25 × 10−8 per generation per site and a generation time of 29 years. The analysis suggests smaller effective population sizes in the most recent times for HGs compared with EFs. However, it is interesting to note that we estimate in our model HG effective population sizes to be larger than those of most EFs, particularly those from Anatolia (Bon002, AKT16, and Bar25), which show effective population sizes of a few hundred individuals only (Figure 3), consistent with their high proportion of intermediate length ROHs (Figure 2D). The lower diversity observed among European HGs thus appears to be rather due to a very strong LGM bottleneck (Figures 3 and M1_18 and see discussion) than to individuals living in small isolated groups.

The inferred model reproduces key features of genomic data

Genomic data simulated under the most complete demographic scenario (Figure 3) lead to population relationships (Figure 4A) that are very similar to those observed on the MDS plot performed on real data (Figure 2A). Three clusters are clearly visible, with European HGs all in proximity to each other, and markedly distinct from the western EFs; Caucasus HGs, and Iran EFs in contrast remain distinct. Furthermore, a simplified admixture graph (Patterson et al., 2012) fitted on data simulated under the model presented in Figure 3 leads to f-statistics consistent with those calculated on the real data (compare Figures M1_29 and M1_30). All population relationships inferred in our model of Figure 3 are thus confirmed by f-statistics. Finally, an admixture analysis performed on data simulated according to our model leads to admixture proportions that are very similar to those observed, as shown in Figures 4B and S3. In particular, the Caucasus HG shows a large yellow component shared with the Iranian EF, in line with their proximity on the MDS plot. The good match between observed and simulated data thus provides an a posteriori validation of our model-based approach.

Discussion

Evolutionary insights gained from explicit demographic modeling

Our sequencing of ancient genomes at >10x has not only tripled the number of high-quality whole genomes available for the early Holocene in Europe but also allowed us to perform genetic analyses on an unbiased set of markers minimally impacted by selection. Such neutral markers are ideally suited for reconstructing the population history of Europe and Southwest Asia from the Late Pleistocene to the Early Holocene. In addition to confirming long-held assumptions and interpretations, our modeling provides several key insights about the demographic processes that preceded the Neolithic transition and its expansion to the west.

A split of European HGs triggered by the LGM

Our model suggests that European HGs had already split into two subgroups (West 1 and West 2 in Figure 5C) ∼23 kya, after experiencing a very severe bottleneck during the LGM, responsible for their low level of genetic diversity (Figure 2C). In contrast to previous studies (Ceballos et al., 2021; Günther et al., 2018), HGs were found to have generally larger effective population sizes than contemporary EFs (Figure 3). Such relatively large effective population sizes can lead to slow population differentiation, which might explain why the different HG groups show close genetic affinities (Figures 2A and 4A) despite long divergence times and a wide geographic distribution. Large HG effective population sizes could be due to long-distance genetic exchanges between groups. Contrastingly, the inferred low effective population size of EFs (despite obvious large census sizes) suggests that the Neolithic transition was linked to a reduction in local EF effective population sizes, potentially due to “sedentarization” or commitment to place (Aimé et al., 2013) and restricted gene flow among small-scale farming communities, as observed at the aceramic Neolithic sites of Boncuklu and Aşıklı (Yaka et al., 2021).
Figure 5

A spatiotemporal interpretation of population differentiation in SW Asia and Europe based on our model and the geographic distribution of the genomes

For a Figure360 author presentation of this figure, see https://doi.org/10.1016/j.cell.2022.04.008.

Colored shaded areas indicate approximate putative distributions of populations at different time points. The letters (A) to (H) indicate the chronological order of events; see main text for a detailed description. Note that warmer periods (Bølling and Allerød interstadials, Holocene) correspond to population range expansions while colder periods (LGM, Older Dryas) are associated with contractions.

See also Figure S4 for additional f-statistics analyses supporting alternative connections between the Levant and the Aegean/Greece area.

A spatiotemporal interpretation of population differentiation in SW Asia and Europe based on our model and the geographic distribution of the genomes For a Figure360 author presentation of this figure, see https://doi.org/10.1016/j.cell.2022.04.008. Colored shaded areas indicate approximate putative distributions of populations at different time points. The letters (A) to (H) indicate the chronological order of events; see main text for a detailed description. Note that warmer periods (Bølling and Allerød interstadials, Holocene) correspond to population range expansions while colder periods (LGM, Older Dryas) are associated with contractions. See also Figure S4 for additional f-statistics analyses supporting alternative connections between the Levant and the Aegean/Greece area.
Figure S4

Patterns of population admixture revealed by f-statistics, related to Figure 5 and Methods S1.

Relationship of Anatolian and Greek Neolithics with the Levant using f-statistics on the 1240k dataset of the form D(CAnatolia/NWAnatolia/NGreece/SGreece/CEurope,Test; Levant,Mbuti [outgroup]). This test indicates whether Neolithic individuals from CAnatolia, NWAnatolia, Greece, or CEurope (left) share less (blue, Z score < 0) or more (orange, Z score > 0) ancestry with samples from the Levant, namely individuals from contemporary Israel associated with Natufian culture (Israel_Natufian), Pre-Pottery Neolithic B (Israel_PPNB), and Chalcolithic (Israel_C; all bottom) than the Test individuals/populations from Greece and CAnatolia. Significant Z scores above 3.0 or below −3.0 are shown with more intense colors. We find a strongly significant excess of shared drift between populations from the Levant and NGreece, SGreece, NWAnatolia, and CEurope when contrasted to Boncuklu. This signal was, however, not replicated when representing CAnatolia by Pınarbaşı. In contrast, the SGreece populations Diros_EN and Peloponnese_N appear to share excess drift with populations from the Levant, and in particular the Chalcolithic Israel_C, when contrasted to samples from NGreece, NWAnatolia, or CEurope.

Ancestors of western EFs are related to Caucasus HGs

We have shown that HGs from the Caucasus share a common ancestor with western EFs, as both show traces of an ancestral admixture event between the Western and Eastern metapopulations (Figure 3). Despite this historical relationship, our model still manages to predict the apparent affinity of the Caucasus HG genome with the Iranian EF (e.g. the yellow component shared by KK1 and WC1 found for real and simulated data in Figures 2B, 4B, and S3, as well as their proximities on the MDS plots of Figures 2A and 4A). This shows that observed patterns of genetic similarities are not sufficient to reconstruct actual ancestry relationships.

Specific demographic processes explain EF and HG differentiation from their ancestral population

Through our demogenomic modeling, we can demonstrate how specific demographic processes gave rise to the genetic divergence of past populations. We can indeed not only simulate the genomic diversity of sampled ancient individuals but also that of individuals drawn from ancestral populations at any time point and thus predict their relationships with sampled individuals (Figure 4A). For instance, we infer that the population ancestral to all our ancient sampled individuals was genetically close to Iranian EFs and Caucasus HGs. As also shown in Figure 4A, the ancestors of European HGs (the Western metapopulation) considerably diverged from the ancestral population due to a LGM bottleneck, explaining their outlier position on the top left of the MDS plot. The ancestors of western EFs (i.e., the Central metapopulation) were initially close to the ancestral population and to the Iranian EFs, but the two consecutive admixture events with the Western HG metapopulation put them closer to the European HGs on the MDS plot. Nonetheless, it was the >2,500 years of intense genetic drift that made them particularly distinct from all other groups, resulting in them occupying the upper right corner of the MDS plot in Figure 4A. We postulate that this rapid divergence process could be due to recurrent founder effects having occurred during their dispersal through Anatolia. Although western EFs were previously described as genetically intermediate between other SW Asian groups (Feldman et al., 2019), or as a mixture of Iranian and Southern Levant Neolithic populations with western HGs (Lazaridis et al., 2016), this initial admixture signal remains hidden to classical admixture analyses because it is progressively eroded by the genetic drift that occurred during the migration of these populations through Anatolia (Figure 4B). Whereas populations simulated shortly after the two main admixture events with the Western metapopulation appear as admixed (i.e., EFAncestors [12.9 kya] in Figure 4B), this signal progressively disappears through time with the emergence (and migration) of western EFs who more and more appear as having a completely unrelated gene pool (e.g., green component in the admixture analyses of Figure 4B). Their slightly more central position on the MDS plot (Figure 4A) in later generations is then due to admixture with the Western metapopulation and surrounding farmers modeled as the Central metapopulation.

A spatial interpretation of population differentiation

The timing and sequence of demographic events that emerge from our model suggest a scenario of population differentiation with a clear geographic and chronological resolution from the LGM to the early Holocene (Figure 5).

HG structuration and divergence induced by the LGM

We find that the divergence between the Eastern and Western HG metapopulations has been initiated by the LGM some 26 kya (Figures 5A and 5B), probably due to a deterioration of the habitat and a contraction into LGM refugia potentially located in milder regions (Figure 5B). The archaeological, environmental, and climatic records indeed suggest that large tracts of Eurasia were deserted at the height of the LGM, ∼26 kya, when ice sheets were at their maximum extent (Clark et al., 2009; French, 2021; Jöris et al., 2009), and both human and animal populations survived in glacial refugia located in more southern regions of Europe (Sommer et al., 2008). In our initial model, the LGM divergence is immediately followed by a bottleneck of very strong intensity in the population ancestral to European HGs. The modeled intensity I of this bottleneck depends on the bottleneck duration (t) and its size (Nbot) as I = t/(2Nbot). If the bottleneck had lasted 4,000 years (corresponding to 138 generations of 29 years), our estimated intensity I = 0.18 would correspond to an effective bottleneck population size of 383 individuals. This low number is in line with the archaeological record suggesting a 60% decline in census population size in the latter part of the Gravettian, 14C-dated to 29,000–25,000 cal. BP, with total population size in Europe as low as 700–1,550 individuals (Maier, 2017). We have explored an alternative demographic scenario where the bottleneck was decoupled from the divergence between the Eastern and Western HG metapopulations. In that case (Figure M1_17; Table S4), we find a slightly older metapopulation divergence (27 kya) occurring during the phase of decrease in northern summer solar insolation, i.e., between 32 and 26 kya (Clark et al., 2009), and a more recent bottleneck (23 kya) sitting in the middle of the LGM, confirming that the two events are related to this extreme cold phase. Our analyses further indicate that European HGs differentiated in two separate refugia by the end of the LGM 21.7 kya (Figures 3 and 5C), perhaps corresponding to what archaeologists traditionally identify as the areas of distribution of Solutrean and Epigravettian technocomplexes (Smith, 1966; Kozłowski and Kaczanowska, 2004).

Post-LGM range expansions and admixture

Following a period of range expansion after the LGM (Figure 5D), representatives of the Central HG metapopulation, who were likely descendants of the Epigravettian refugial population, mixed ∼14.2 kya with the population ancestral to both Caucasus HGs and western EFs (called East 1 in Figure 5E). Given the assumed geographical distribution of the preceding glacial refugia, this Bølling interstadial period admixture likely happened in a region encompassing Southeastern Anatolia and the Northern Levant or even in neighboring regions such as Central and Eastern Anatolia or the Turkish South coast.

Differentiation of Anatolian and Aegean EF groups

The demographic processes behind the differentiation between Central Anatolian and Aegean EF populations are more difficult to pinpoint and locate with precision. The inferred low population size of the ancestors of western EFs after their split from the Central metapopulation during the Older Dryas Figure 5F) could be due to a westward range expansion and associated recurrent founder effects during the Allerød interstadial. This period was characterized by relatively favorable climatic conditions, which may have allowed the ancestors of western EFs to further mix with Epipalaeolithic HGs in Anatolia (∼12.9 kya; Figure 5G). The fact that Central Anatolian EFs share the same admixture event and drift with Aegean EFs suggests that they were part of the same expansion wave, and that Central Anatolia was settled by EFs before they reached the Aegean area, potentially by a different route. However, a previous study (Feldman et al., 2019) has shown that Central Anatolian EFs and Epipalaeolithic HGs were genetically similar, which indicates that admixed groups existed in Central Anatolia prior to the Neolithic transition. Alternatively, admixed HG populations could have moved there from the Fertile Crescent, adopting fully developed farming practices at a later stage, which is in line with the observation that early aceramic sites such as Boncuklu and Aşıklı on the Anatolian Plateau show experiments in crop cultivation and caprine management ∼9.7 kya (Buitenhuis et al., 2018; Ergun et al., 2018). By contrast, the migration to NW Anatolia (Figure 5H) likely occurred at the time of the fully developed ceramic Neolithic characterized by the establishment of widespread mixed farming (Bogaard et al., 2017). Further support for such a demic diffusion scenario to NW Anatolia by a direct (coastal) route and to a lesser extent via the Konya plain region of Central Anatolia comes from f-statistics showing Southern Levant populations sharing more drift with Aegean Neolithic individuals than with Central Anatolian ones (Figure S4). This signal could be due to either (1) long-distance gene flow between Aegean and Levantine communities, as suggested by archaeologists based on similarities in material culture (Perlès, 2001; Horejs et al., 2015) (2) a higher level of Western metapopulation admixture in Central Anatolia as observed in Boncuklu (Figure S1B) and inferred in our model (Figure 3), or (3) an early migration of the Boncuklu ancestors from the Fertile Crescent to Central Anatolia, combined with some later gene flow between populations from the Fertile Crescent and the ancestors of Aegean EFs. However, f-statistics analyses reveal that EFs from the Aegean and NW Anatolia are rather heterogeneous in their levels of shared drift with several populations, including HGs from the Levant and EFs from Iran (Figure S4; Table S5), suggesting that the Neolithization of the Aegean was a more complex process.

Neolithic expansion occurred through a mixture of cultural and demic diffusion

Even though the initial spread of the Neolithic must have been through cultural diffusion in the Fertile Crescent among genetically well differentiated groups, our results indicate that expansion to Northwestern Anatolia, the Aegean Basin and the Danubian corridor occurred primarily through demic diffusion (Figure 5H). The initial spread of populations beyond the Fertile Crescent was certainly far from linear and associated with multiple genetic influences from the Levant, some of which were not linked to the emergence of a fully developed Neolithic economy. As soon as Neolithic lifeways reached Europe, the mode of dispersal of populations to Central Europe became more linear and can essentially be modeled as a stepping-stone migration (Figure M1_26).

Limitations of the study

Although our demogenomic model (Figure 3) can clarify observed population affinities, it has exposed temporal and geographical gaps, which can only be filled by producing additional genomic data of similarly high quality. In particular, high-quality genomes from pre-Neolithic and Neolithic populations of the Western Fertile Crescent, as well as reference genomes for Central Anatolia and HGs from Eastern Europe are needed to confirm our conclusions and render them more complete. Whereas the demographic scenarios we have explored are much more complex than those previously investigated (Broushaki et al., 2016; Jones et al., 2015), they are certainly still very schematic and reality must have been more complex. In particular, we have used simplified assumptions such as constant mutation rates and generation times, which may have actually changed over time (Coll Macià et al., 2021; Harris, 2015), or we have modeled genetic interactions between populations as direct and unique pulses of gene flow, whereas gene flow could have occurred over prolonged periods, or could have occurred indirectly through unsampled populations. However, since our model still captures and reproduces most observed genetic patterns (Figure 4), it seems robust enough to provide key insights into past demographic processes. Finally, while we have analyzed genetic evidence relating to European EFs along the Danubian corridor, the spread of EFs along the Mediterranean coast remains to be investigated to obtain a more complete picture of the Neolithic settlement of Europe.

Conclusions

Population modeling using the framework outlined here allowed us to extract key, unexpected, but complementary and far more detailed information on population affinities than could be concluded from summary statistics and multivariate analysis alone. In addition, this model provides a time frame for the differentiation of the major groups populating Southwest Asia and Europe from the LGM until the introduction of agriculture and highlights the crucial role of climatic changes as driver of population fragmentation and admixture events (Lahr and Foley, 1998). Although the world’s first farmers look genetically very different from European HGs, our simulation based on high-quality genomes shows that some European and Southwest Asian populations in fact shared a recent common history marked by repeated interactions since the end of the last ice age. Strong drift during their expansion through Anatolia contributed to making western EFs look more dissimilar than they actually were and somehow concealed their hybrid nature. In summary, the idea of a single cultural and genetic origin of all farmers in the Fertile Crescent, without significant initial contribution of European-like HGs, is no longer tenable in its current form.

STAR★Methods

Key resources table

Resource availability

Lead contact

Further information and requests should be directed to and will be fulfilled by the lead contact Laurent Excoffier (laurent.excoffier@iee.unibe.ch).

Materials availability

This study did not generate new unique reagents.

Experimental model and subject details

In this study, we present new whole-genome sequences for 15 ancient human individuals (see Table 1, Figure D1_1 from Data S1, and key resources table for an overview). Our sample set consists of two Mesolithic individuals from Vlasac (Serbia), two individuals from Transitional and Neolithic layers at Lepenski Vir (Serbia) and 11 early Neolithic individuals originating from Aktopraklık and Barcın in Turkey (one individual each), from Nea Nikomedeia in Greece (two individuals), from Vinča-Belo Brdo and Grad-Starčevo in Serbia (one individual each), from Kleinhadersdorf and Asparn-Schletz in Austria (one individual each) and Essenbach-Ammerbreite, Dillingen-Steinheim and Herxheim in Germany (one individual each). For detailed information on the archaeological background, please refer to the Data S1. Age estimated from anthropological analyses and genetic sex of the individuals are reported in Table 1 and Data S1. The ancient skeletal material used for this study was collected and analysed in collaboration with the anthropologists/archaeologists in charge of the archaeological sites, in accordance with the laws of the respective countries. Specifically, for the Greek samples, we had permission by the Greek Ministry of Culture and Sports for the sampling, DNA extraction, and radiocarbon dating according to Greek law for destructive sampling of archaeological material (Ν.3028/02).

Method details

Data generation

Sample preparation

All laboratory analyses were conducted in the dedicated ancient DNA facilities of the Palaeogenetics Group Mainz (Institute of Organismic and Molecular Evolution, Johannes Gutenberg University, Mainz), according to strict ancient DNA protocols to prevent contamination with modern DNA as well as cross contamination between samples as previously described (Bollongino et al., 2013; Bramanti et al., 2009; Scheu et al., 2015). These ancient DNA standards include decontamination of workspace, labware and samples, independent DNA extractions and the processing of blank controls during sample pulverisation, DNA extraction, library preparation and PCR reactions to monitor contaminations. All steps prior to PCR amplification (sample preparation, DNA extraction and library preparation) were carried out in the dedicated ancient DNA facilities of the Palaeogenetics Group, which are separated from post-PCR areas. The petrous bone samples were prepared as described in Hofmanová et al. (2016). In detail, after documentation, samples were sterilised under ultraviolet light (254 nm) from 2 sides for 45 minutes per side. In order to remove superficial contaminants, soil remains and the outer bone surface were removed using a sandblasting machine (P-G 400, Harnisch+Rieth) with Spezial-Edelkorund (EW60/250μ and 30B/50μ; Harnisch+Rieth). The densest, inner part of the petrous bone was then isolated and cut in cubes using a disk saw (Electer Emax IH-300, MAFRA). After irradiation with ultraviolet light (254 nm) from two sides, for 45 minutes per side, the densest bone cubes were pulverised using a mixer mill (MM200, Retsch). To control contamination, blank milling controls containing hydroxyapatite were processed in parallel.

Radiocarbon dating and stable isotope analysis

New radiocarbon dates and stable isotope values (carbon δ13C and nitrogen δ15Ν) were obtained for seven individuals (Table D1_2). The analyses were performed at the Curt-Engelhorn-Zentrum Archäometrie gGmbH (Mannheim, Germany). The collagen used for the analyses was extracted from fragments of the petrous bones used for palaeogenomic analysis. All dates were uniformly re-calibrated in OxCal 4.4.2 (Bronk Ramsey, 2009) using the IntCal20 calibration curve (Reimer et al., 2020). The isotope results were used as a basis for palaeodietary inference (Figure D1_11) and for freshwater reservoir effect correction in the case of the Vlasac sample (Table D1_2). Radiocarbon dates on human bones from archaeological sites in the Danube Gorges are known to be influenced by a large freshwater 14C reservoir effect, due to high intake of freshwater food, and need to be corrected before calibration (Cook et al., 2001). We did it following the method described by Cook et al. (2001), later amended by Bonsall et al. (2015): for human bones with , uncalibrated dates of the form were adjusted to where .

DNA extraction

DNA extraction followed the protocol by Yang et al. (1998) with the modifications described in MacHugh et al. (2000) and Gamba et al. (2014) as well as additional modifications described below. For each sample, 0.15-0.31 g of bone powder was used for extraction. Prior to extraction, a pre-lysis was performed by adding 1 mL of EDTA (0.5 M, pH8) to the bone powder and incubating at room temperature for 10 minutes. The solution was centrifuged at maximum speed to pellet the powder and the supernatant was removed. Lysis was performed on rocking shakers at 37°C for 24 hours (900 rpm) using 1 ml of extraction buffer containing EDTA (950 μl, 0.5 M, pH8), Tris-HCl (20 μl, 1 M, pH8), N-Lauroylsarcosine (17 μl, 5%) and Proteinase K (13 μl; 20 mg/ml). After 24 hours of incubation, the samples were centrifuged for 10 minutes at 10,000 rpm, the supernatant was removed, transferred into new tubes and stored in a fridge until further processing. A second lysis step was performed following the same procedure. Following lysis, the supernatants from the two lysis steps were merged on an Amicon Filter (Amicon Ultra-4 30 kDA, 15 ml) and centrifuged for 10 min at 2500 rpm. The DNA was then washed twice with 3 ml 1X Tris-EDTA, followed by centrifugation at 2500 rpm for 20 minutes and discarding of the flow-through in between. After washing, the extract was concentrated to 100 μl and subsequently purified with the MinElute PCR Purification Kit following the manufacturer’s instructions but incubating for 5 minutes during elution with 44 μl elution buffer (preheated to 65°C). Blank controls were processed during DNA extraction and incorporated into all further steps of the analyses.

Library preparation

Double-indexed libraries were prepared according to the protocol by Kircher et al. (2012) with slight modifications. Prior to library preparation the DNA extracts were treated with USER™ enzyme: 5 μl of USER™ enzyme was added to 16.25 μl of DNA extract (exception: 17μl of two merged extracts were used for library SL3_5 of sample AKT16, see Table S1) and the mixture was incubated for three hours at 37°C (Verdugo et al., 2019). The blunt-end repair step followed immediately and was performed using the NEBNext End Repair Module: the DNA extract was mixed with NEBNext End Repair Reaction Buffer (10X, 7 μl), NEBNext End Repair Enzyme Mix (3.5 μl) and nuclease-free water (38.25 μl; for a final reaction volume of 70 μl) and incubated for 15 minutes at 25°C followed by 5 minutes at 12°C. Deviating from this procedure, two libraries (SL2.12_MU and SL3.12_MU of sample VC3-2) were produced using 20 μl DNA extract and without USER™ enzyme treatment of the DNA extract (see Table S1). In the adapter ligation step hybridised adapters P5 and P7 (Meyer and Kircher, 2010) were used at a concentration of 0.75 μM. 3 μl of Fill-In product (total volume: 40 μl) were amplified using the AccuPrime™ Pfx SuperMix (20 μl) in one PCR parallel adding unique and sample-specific index combinations to the library molecules (final reaction volume: 25 μl; final primer concentration: 200 nM or 160 nM each). Double indexing followed Kircher et al. (2012), but using index sequences from the NexteraXT index Kit v2 (Illumina; barcode length 8 bp; ordered at IDT, Leuven, Belgium). The PCR was performed in 9-13 cycles; the PCR temperature profile followed the manufacturer’s recommendations but using an annealing temperature of 60°C, extending for 30 seconds during each cycle and performing a final elongation step for 5 minutes. For purification during library preparation the MinElute PCR Purification Kit was used, while amplified libraries were purified with the MSB® Spin PCRapace. Libraries were quantified by Qubit Fluorometric quantitation (dsDNA HS assay) and measurement on the Agilent 2100 Bioanalyser System (High Sensitivity DNA Analysis). Blank controls as well as positive controls (nonsense hybrids) of known concentration, were processed in every library step including PCR amplification to verify the success of the library preparation and to monitor contamination. Quantification of blank controls did not indicate significant contamination during any laboratory step (pulverisation, DNA extraction, library preparation and amplification).

Sample Screening

Using shallow shotgun sequencing on an Illumina MiSeq™ platform at StarSEQ GmbH (Mainz, Germany), all samples were screened for their endogenous DNA preservation. Libraries were equimolarly pooled, subsequently purified with Agencourt® AMPure® XP beads and sequenced in single-end runs with 50 bp read length. Demultiplexing was performed by the sequencing facility using the MiSeq Reporter and allowing one mismatch in the barcode. Blank controls from DNA extraction, library preparation and PCR-steps were sequenced alongside to estimate the fraction of potentially contaminating reads introduced in the lab (for details see Table S1).

Whole-genome sequencing

For deeper shotgun sequencing, 2-5 DNA extracts and 3-9 libraries of each sample were prepared (as detailed above). The libraries were amplified in up to 13 PCR parallels, each with an individual index combination, to increase the complexity of the libraries. Subsequently, PCR parallels were purified and quantified individually as described above. For sequencing, PCR parallels were pooled equimolarly according to their concentrations measured on Qubit® and purified with Agencourt® AMPure® XP beads. The samples were sequenced on an Illumina HiSeq3000 (SE, 100 cycles or PE, 150 cycles) at the Next Generation Sequencing Platform at the University of Berne, Switzerland. A summary and details of the laboratory work and the sequencing strategy for each sample can be found in Table S1.

Bioinformatics pipeline

The bioinformatic steps within this section were conducted with the following programs and versions unless specified otherwise: ANGSD - version 0.917 (Rasmussen et al., 2011) ATLAS - version 1.0, commit 6bd2482 (Link et al., 2017) ATLAS-Pipeline, commit 6df90e7 (bitbucket.org/wegmannlab/atlas-pipeline) BWA - Burrows-Wheeler Alignment Tool - version 0.7.15 (Li, 2013) ContamMix - version 1.0 (Fu et al., 2013) fastqc - version 0.11.5 (www.bioinformatics.babraham.ac.uk/projects/fastqc/) GATK - version 3.7 (DePristo et al., 2011) mafft - version 7.31 (Katoh et al., 2002) MIA (Mapping Iterative Assembler) - version 1.0 (https://github.com/mpieva/mapping-iterative-assembler) Picard-tools - version 2.9 (http://broadinstitute.github.io/picard/) SAMtools - version 1.3 (Li et al., 2009) seqtk - version 1.2 (https://github.com/lh3/seqtk) Snakemake - version 4.0 (Köster and Rahmann, 2012) Trim Galore! - version 0.4.3 (https://github.com/FelixKrueger/TrimGalore)

Raw data handling

We committed to process all 102 samples (15 ancient genomes from this study, 10 previously published ancient genomes from Brace et al., 2019; Broushaki et al., 2016; Gamba et al., 2014; Günther et al., 2018; Hofmanová et al., 2016; Jones et al., 2015; Kılınç et al., 2016; Lazaridis et al., 2016, and 77 modern SGDP genomes from Mallick et al., 2016; see Table D1_1 and Table S2 for the complete list) with the same bioinformatic pipeline, albeit minor changes due to variation in how the raw data were obtained. We therefore first describe the pipeline used for the 15 samples sequenced in this study, followed by a description of the differences in how previously published data were analysed: The sequencing quality was checked with fastqc. Raw reads were trimmed using Trim Galore with no quality filter and a length filter of 30 bp (-q0, --length 30, -a ‘AGATCGGAAGAGCACACGTCTGAACTCC’). For paired-end libraries, the additional option --retain unpaired was used and the second adapter sequence (-a2 'AGATCGGAAGAGCGTCGTGTAGGGAAAG') was provided. A second fastqc analysis was performed to verify trimming and the quality of the remaining sequences. Reads were then aligned to the 1000 Genomes project version of the human reference genome hs37d5 (1000 Genomes Project Consortium et al., 2015) using bwa -mem with options -t 8 -M. Reads with a mapping quality below 30 were filtered out with SAMtools -q30. Duplicate reads were marked but not removed using picard-tools MarkDuplicates with VALIDATION_STRINGENCY=SILENT and AS=TRUE. Unmapped reads were removed with SAMtools with option -F4. Informative read groups were added to keep track of individual PCR-parallels with picard-tools AddOrReplaceReadGroups. All necessary sorting and indexing in-between the steps was performed with SAMtools index and sort, library-files from the same samples were merged using SAMtools merge.

Local Realignment

Local indel-realignment is an important but computationally expensive step in our pipeline. In order to allow for parallelization, we proceeded with the following three steps: We identified potential target intervals using GATK RealignerTargetCreator, on a set of ten ancient and ten modern samples (Asp6, Bar25, Dil16, Ess7, Klein7, LEPE48, Nea2, Nea3, STAR1, VC3-2, VLASA32, VLASA7, French-1, Sardinian-1, Russian-1, Spanish-1, Georgian-1, Hungarian-1, Icelandic-1, Finnish-1, English-1, Greek-1, Mende-1, Estonian-1, Polish-1), while providing known indel sites identified by the 1000 Genomes project (see Key Resources table). We refer to the obtained target set as TargetsBase. Using ATLAS task=downsample, we created a set of reads by downsampling the following 5 ancient and 5 modern samples to a depth of 4X: Bar25, Klein7, STAR1, VLASA32, VLASA7, French-1, Georgian-1, Finnish-1, Greek-1, Polish-1. We refer to this set as the GuidanceSet. We run local indel-realignment for each sample individually as follows: we identified private target positions for the sample using GATK RealignerTargetCreator while providing the known indel sites from the 1000 Genomes project (as above). We then unified the identified targets with TargetsBase to obtain a sample-specific target set. Finally, we run GATK IndelRealigner on this sample together with the GuidanceSet on the sample-specific targets. We parallelized this step additionally by contig, merging the output with picard-tools.

Filter and Sample Statistics

To reduce spurious alignments, we used picard-tools FilterSamReads to filter out reads that contained soft-clipped bases previously identified with ATLAS task=assessSoftClipping or did not pass picard-tools ValidateSamFile. We further used SAMtools to keep only primary alignments and, for paired-end libraries, proper pairs. We determined read counts, sequencing depth, endogenous DNA-content and other statistics using ATLAS task=BAMDiagnostics and SAMtools flagstat. These statistics are given per library-parallel and for the merged data for all 15 ancient samples presented here in Table S1.

ATLAS Pipeline

We used commit 6df90e7 of the snakemake pipeline ATLAS-Pipeline, which is available at bitbucket.org/wegmannlab/atlas-pipeline and summarised below. Configfiles used are made available at https://github.com/CMPG/originsEarlyFarmers. Since Post-mortem-Damage (PMD) is most prevalent at read ends, the PMD pattern is different if a read spans the entire fragment or not. Following Kousathanas et al. (2017), we split single-end read groups by length using ATLAS task=splitRGbyLength with the option allowForLarger. We identified reads that span the entire fragment as those shorter than the maximum read length minus 5 bases to account for variation introduced by adapter-trimming. We then identified PMD patterns for each read group using ATLAS task=estimatePMD, providing the reference genome with fasta=hs37d5.fasta and limiting the analysis to the chromosomes 1-22, X and Y with option chr. PMD estimates are shown in Figure M1_1A. We performed base-quality recalibration as described in Kousathanas et al. (2017), using ATLAS task=recal on known ultraconserved sites obtained from https://ccg.epfl.ch/UCNEbase/data/download/ucnes/hg19_UCNE_coord.bed (Dimitrieva and Bucher, 2013). We assumed equal base frequencies (equalBaseFreq) and inferred recalibration parameters for each PCR-parallel (pmdFile) from all sites with a minimum depth of 2 (minDepth=2). Sequencing- and PCR-duplicates were not excluded from the estimation of recalibration parameters (keepDuplicates) to add additional power. A quality transformation by base quality recalibration was created with ATLAS task=qualityTransformation (Figure M1_2). For paired-end sequenced read-groups, the estimated recalibration parameters were directly corrected in the BAM file using ATLAS task=recalBAM, providing the recal-parameters (recal) and the reference (fasta=hs37d5.fasta). Paired-end reads with corrected base qualities were then physically merged with ATLAS task=mergeReads to avoid overlapping bases to be counted twice in downstream analysis. Reads that did not pass picard-tools ValidateSamFile were omitted. As a result of the physical merging, relative positions of bases within a read have changed, and we thus re-estimated PMD patterns as outlined above. We called Bayesian genotypes in the reference genome with ATLAS task=callNew and parameters method=Bayesian prior=theta fixedTheta=0.001 infoFields=DP equalBaseFreq. The reference genome, PMD patterns and recal-patterns (for single-end read groups) were provided with fasta, pmdFile and recal, respectively. The first and last 2 bp of each read were ignored (trim5=2, trim3=2). We also created recalibrated BAM files from single-end libraries, which we used in any subsequent analysis in which ATLAS was not involved (e.g. contamination estimation). The molecular sex was inferred using a script from (Skoglund et al., 2013) version 0.4, based on the ratio of reads aligning to the X and Y chromosomes (Table 1; Figure M1_1B).

Contamination estimation

Blank controls of extraction, library-preparation and index-PCR were analysed alongside the screening process. The concentration of potential contaminants was never higher than 0.81 ng/μl. A Bioanalyser analysis (HS DNA, Agilent Technologies, Waldbronn, Germany) and the screening results for extraction- and library-controls confirm the detected DNA to consist of primer- and adapter-dimers with a maximum of 55 aligning reads per blank control (out of a potential share of 200,000 reads; see Table S1). We used the ContamMix R-script estimate.R to estimate the amount of authentically mapping mitochondrial reads at 311 modern diagnostic marker positions. We run ContamMix with the option --baseq20 and the following two input-files that we generated from all reads in the recalibrated BAM file that mapped against the MT-Genome: i) An alignment of mitochondrial reads against their own consensus (-samFn). This consensus was constructed by iteratively mapping the reads with MIA (mia) and mapping the last iteration with MIA (mia) to obtain one FASTA-sequence. The MT-reads were mapped against their consensus using bwa aln and samse (v.0.7.17) and filtered for a minimum mapping quality of 30 with SAMtools -q30. ii) A multiple alignment of the consensus genome and a fasta-file containing the diagnostic marker positions against each other (--malnFn) obtained with mafft. For male individuals, contamination was additionally estimated using the ANGSD contamination script based on haploid X-chromosomal regions (X:5000000-154900000). We run ANGSD with the options -doCounts 1 and -iCounts 1, followed by the executable misc/contamination, providing the publicly available HapMap file HapMapChrx.gz. The contamination estimates obtained with both methods (ANGSD and ContamMix) are shown in Figure M1_1C and Table S1.

Reference samples

To minimise reference biases (Günther and Nettelblad, 2019), we analysed all previously published samples with the same pipeline as outlined above. However, some adaptations were necessary for some samples: Blank characters in read names (e.g. from ENA accession numbers) were removed to ensure the proper detection of optical duplicates. Quality scores of Bichon FASTQ files were converted from illumina 1.5 to illumina 1.9 with setqk seq -V 64. Stuttgart and Loschbour were received as unaligned and untrimmed demultiplexed raw data in BAM file format and transformed to unaligned FASTQ format using picard-tools SamToFastq. For Loschbour files, individual library information was not recoverable. Therefore, duplicates across all sequencing files were marked. Raw sequencing data was not obtainable for Bon002 and SF12. For Bon002, we downloaded the FASTQ files that were produced from BAM files by ENA. For SF12, we used published BAM files with already physically merged reads and converted them back to FASTQ files using picard-tools SamToFastq. For both, we split the FASTQ files by run identifiers to treat each run separately throughout the pipeline. Since those reads were already pre-processed, we omitted adapter trimming, removed orphaned reads and otherwise followed the pipeline for single-end samples without splitting read groups by length. For NE1, one of the FASTQ files was corrupted and only 25% of the contained reads could be recovered. As a result, we could only use 97% of the total number of reads generated for that sample (Gamba et al., 2014). We then demultiplexed the fixed FASTQ-files with a custom bash script, allowing one mismatch at the first position and one mismatch at any other position of the index. Since the modern samples from the SGDP dataset were already aligned to the desired reference genome, we abstained from remapping them and analysed them along the ancient samples as described above, starting with filtering with for MQ<30, marking optical and PCR duplicates and performing local realignment. For each ancient sample, all pipeline results and differences from the standard pipeline are detailed in Table S1.

Data filtering

Individual VCFs obtained after Bayesian genotype calling were first filtered according to their read depth (DP): for each individual we excluded sites that had DP < 8 and those that had a DP bigger or smaller than 2.5 s.d. away from the mean DP for each individual. To minimise effects of outliers, the mean DP was calculated using the R function optimize for the sites comprised +/-10 of the mode DP and a tolerance of 10-6 (see Table S2). We also excluded sites that had poor genotype quality (GQ < 30), and we only kept autosomal sites. Furthermore, heterozygous sites were considered as homozygous if they had a significant allelic imbalance (p-value ≤ 0.1) tested by Fisher’s exact test. After filtering, the modern individuals had on average 2,549,812,798 sites passing the filters against 1,980,736,974 for the ancients (ranging from 333,061,590 to 2,544,170,685). All VCF file processing was performed using bcftools v0.1.15 (Danecek et al., 2021). Applying these filters led to a substantial increase in the quality of our heterozygous sites, reducing the amount of sites with high allelic imbalances and decreasing the asymmetry between singleton and non-singleton sites to a minimum (Figures M1_3 and M1_4). To quantify this improvement explicitly, let us denote by the set of loci at which individual i was called heterozygous either for a singleton (only non-reference allele across all samples, s = 1) or not (s = 0). We then define the imbalance statisticswhere denotes the number of reference alleles observed in individual i at site l out of the total number of observed alleles at this site. To compare singleton (s = 1) against non-singleton (s = 0) positions, we further quantify . As shown in Figure M1_4, the chosen filters guarantee comparable imbalances at singleton and non-singleton sites, indicating similar quality of both categories. A striking outlier was sample CarsPas1, which was not included in any demographic analysis.

Dataset assembling

For further downstream analysis we assembled the following datasets: “Ancient”: The filtered individual VCFs from the 25 ancient individuals were merged into a single VCF polarised with the Chimpanzee reference genome (panTro4 from The Chimpanzee Sequencing Analysis Consortium, 2005). We only kept the biallelic polymorphic sites (6,986,216) for which individuals can have missing genotypes (pipeline made available in our GitHub repository: https://github.com/CMPG/originsEarlyFarmers). “102samples”: Same as Ancient but with all 102 ancient and modern samples, keeping 14,896,103 biallelic polymorphic sites. On average, the modern genomes show 4% of missing sites with a maximum of 6%, the ancients 29% ranging from 3 to 87%. “Neutral”: In order to avoid biases due to background selection (BGS) and biased gene conversion (BGC) when estimating population diversity and relationships (Matthey-Doret and Whitlock, 2019), we performed most of our genetic and demographic analyses on a “neutral” portion of the genome (Pouyet et al., 2018). We defined this portion as the restricted set of sites from the 102samples dataset that had the same reference allele in the chimpanzee and gorilla reference genomes, that were in regions with recombination rate >1 cM/Mb where BGS has little effect, outside of CpG islands, non CpG sites (i.e. when a C is followed by a G in the chimpanzee and gorilla genomes and has a T as alternative allele in our data; or a G preceded by a C in in the chimpanzee and gorilla genomes with an A as alternative allele) and with A↔T and G↔C mutations which are not affected by BGC or post-mortem damages. We kept 473,834 biallelic sites. We obtained the Chimpanzee and the Gorilla hg19 nucleotide states from the chained and netted alignments http://hgdownload.cse.ucsc.edu/goldenpath/hg19/vsPanTro4/axtNet/ and http://hgdownload.cse.ucsc.edu/goldenpath/hg19/vsGorGor5/axtNet/, respectively; the recombination map for the YRI population from the 1000 Genomes Project (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130507_omni_recombination_rates/); the CpG islands listing from the UCSC genome browser (CpG Islands cpgIslandExt Track). Filtering was performed in R, with VCF files read using SNPRelate package (Zheng et al., 2012) “Phased”: We did genotype imputation and phasing for each chromosome on the 102samples dataset that was polarised back with hs37d5 human reference genome. We used SHAPEIT4 v1.2 (Delaneau et al., 2019) with by-default parameters for sequencing data, the HapMap phase II b37 genetic map provided in SHAPEIT4 folders, and the Haplotype Reference Consortium (McCarthy et al., 2016) dataset (accession number EGAD00001002729 on the European Genome-phenome Archive) as reference panel. All the chromosomal phased-imputed VCFs were then concatenated into a single phased VCF with bcftools v1.9, which includes 2,795,127,079 sites in total. “1240k”: We compared the samples obtained in this study with previously published, target-enriched datasets available through Allen Ancient DNA Resource v42.4 (AADR) at https://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/index_v42.4.html. We use these population labels, but split Peleponnese_N into Diros_EN and Peleponnese_LN. To ensure our data is comparable to the pseudo-haploid calls, we generated majority calls for all our ancient individuals with ATLAS (task=majorityBase; commit 7cfc900) at the 1,233,013 SNP positions present in this reference dataset. We then combined the calls of the 25 ancient genomes with the AADR and refer to this dataset as 1240k in the following. “1240k_HOIll”: We extracted the non-functional HOIll subset of sites (Lazaridis et al., 2016) for the Ancient dataset. HOlll sites were also extracted for additional samples (Altai, Chimp, Vindija and Dinka) retrieved from the target-enriched dataset available through Allen Ancient DNA Resource v37.2 (AADR) at https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data.

Phenotype predictions

Pigmentation

Pigmentation phenotypes of hair, skin and eyes were predicted with the HIrisPlex-S webtool (Chaitanya et al., 2018; Walsh et al., 2013) for each of the newly sequenced samples and the previously published samples. In case of missing genotypes for any of the 41 SNPs used by HIrisPlex-S, the sites with low or no depth in the Ancient dataset were examined directly in the BAMs and the most abundant allele was chosen (Table S3). In BAMs, we only considered bases at least 3 bases away from either end of the read, with a base quality ≥ 25, and no C↔T and G↔A SNPs to avoid any effect of PMD on the prediction. To account for the uncertainty associated with low-coverage sites, two HIrisPlex-S input files were created for each individual: one in which the missing genotype consisted of the allele taken from the BAM and the non-effect allele and one where the genotype was comprised of the allele taken from the BAM and the effect allele. Running HIrisPlex-S twice for each sample resulted in ranges of probabilities for each phenotype (Table S3). A prediction was accepted without further explanation if in both runs the same phenotype showed a probability ≥ 0.7 (Chaitanya et al., 2018). If predictions differed between runs, the most parsimonious phenotype was chosen, following the approach of Walsh in Brace et al. (2019) (Figure M1_9).

Standing height

To assess genetic variation related to standing height, a classical highly heritable polygenic trait (McEvoy and Visscher, 2009), polygenic scores (PS) were computed based on a set of 670 SNPs (Chan et al., 2015) on the Ancient dataset (Figure M1_10). To account for missing data common even in high depth ancient genomes, we applied the generalised risk score approach described by Veeramah et al. (2018) on samples where enough SNPs were present to account for at least 75% of the scores effect size. This led to the exclusion of five ancient individuals (Bar8, Bon002, Bichon, CarsPas1, AKT16) out of the 25 included in this study.

Other phenotypes

Genotypes for SNPs associated with additional phenotypes of interest were inspected manually for each sample in the Ancient dataset or BAM files if necessary: rs4988235 a variant in the MCM6 gene associated with lactase-persistence in Eurasia; rs3827760 in the EDAR gene; rs17822931 in ABCC11; seven SNPs located in the FADS1/2 gene complex (listed in Table M1_2). A two-sided binomial test was used to compare counts of derived alleles in the ancient individuals with frequencies estimated in the CEU population from phase3 of the 1000 genomes project (1000 Genomes Project Consortium et al., 2015), chosen as a proxy for Central European modern populations (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/).

Quantification and statistical analysis

Population genetics analyses

Uniparental haplogroup determination

Mitochondrial haplogroups were determined from the BAM files for the 15 newly-sequenced genomes using phy-mer (Navarro-Gomez et al., 2015) with K-mer minimal number of occurrences = 10 (Table S3). For the samples genetically identified as men, Y-chromosomal haplogroups were determined using Yleaf (Ralf et al., 2018) with minimal base-quality of 20 (-q 20) and base-majority to determine an allele set to 90% (-b 90).

Neanderthal introgression

We quantified proportions of Vindija Neanderthal (Prüfer et al., 2017) introgression by computing f ratio statistics of the form as previously suggested (Petr et al., 2019) with qpF4ratio from the ADMIXTOOLS package (Patterson et al., 2012) on the 1240k_HOIll dataset.

Genomic heterozygosity

We estimated the level of genetic diversity of each individual as the proportion of heterozygous sites found in the neutrally evolving portion of the genome (Figure 2C; Table S2). Thus, for every genome, we divided the amount of heterozygous sites observed in the Neutral dataset by the expected number of neutral sites genotyped for this individual. This number is obtained by considering all the genotyped sites for the considered individual that i) have the same reference allele for both the chimpanzee and gorilla reference genomes, ii) are not CpG sites and out of CpG islands, iii) are in regions with recombination rate > 1 cM/Mb based on Pouyet et al. (2018). Finally, in order to consider only sites unaffected by BGC, we divided the number of these previously defined sites by three, as only one third of all mutations should be BGC free (i.e. A↔T and G↔C polymorphisms).

Runs of homozygosity (ROHs)

ROHs were identified on the Phased dataset in genomes of 90 European and SW Asian modern and ancient individuals imputed using IBDSeq v. r1206 (Browning and Browning, 2013) with default parameters but errormax = 0.005 and ibdlod = 2. We further processed the artificially long tracts spanning assembly gaps or centromeres (genomic locations were obtained from UCSC genome browser: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz) and split them into shorter tracks excluding the gap stretch, inspired by Sikora et al. (2017). As advised when genomes do not come from a homogeneous population (Browning and Browning, 2013), we focused on intermediate (2-10 Mb) and long (>10 Mb) ROHs similarly to Racimo et al. (2020).

Multi-Dimensional scaling (MDS)

Genetic relationships among individuals were estimated from pairwise average nucleotide divergence π (Nei and Li., 1979). For each pair of individuals X and Y, we identified the sites showing no missing data for the two individuals, and we computed their average nucleotide divergence π over these sites (Table S2). We considered whole-genome sites or neutral sites only (in this case, the number of differences was divided by the expected number of neutral sites as defined above in the heterozygosity paragraph). We then represented the relationships between modern and ancient genomes from Europe and SW Asia (for sites from the 102samples or Neutral data sets; Figure 2A; Figure M1_5A) or only the ancient genomes (for sites from the Ancient dataset; Figure M1_5B), using a classical multidimensional scaling (MDS) approach implemented in R (cmdscale function from stats package).

Uniform manifold approximation and projection

In order to represent potentially complex relationships between 90 individuals (85 modern and 25 ancient samples) from Western Eurasia, we performed a Uniform Manifold Approximation and Projection (UMAP) dimension reduction on their genotypes obtained at 2035 neutral polymorphic sites (from the Neutral dataset) that did not present missing data for any individual. The analysis was performed using the umap function from the umap R package (using the options n_neighbors=3, n_epochs=1000, min_dist=0.1, negative_sample_rate=5, n_components=2, and metric="euclidean") (Figure M1_6).

Admixture clustering analyses

Admixture coefficients of each ancient genome were estimated using the R package LEA (Frichot and François, 2015) with parameters K = 2 or 3, alpha = 10, and number of repetitions = 5. We used the function snmf to calculate the fit (entropy) of each run and we plotted the admixture coefficients from the run with the smallest entropy value. The groups were defined in an unsupervised manner. In order to maximise the number of genomic sites to be used, we excluded for these analyses the three lowest quality Neolithic genomes (Bichon, Bon002, Bar8) from the Ancient dataset (Figure 2B; Figure S3A). Comparison of observed and simulated Admixure plots, related to Figure 4 and Methods S1. Admixture plot for K = 2 (left panel) and K = 3 (right panel) carried out on (A) observed data for 16 ancient genomes included into fastsimcoal2 demographic inferences (909,688 sites without missing data; Bichon, Bar8, and Bon002 were not included because of lower quality) and on (B) data simulated accordingly to our final model (shown in Figure 3), for the same subset of individuals as in (A). Patterns of population admixture revealed by f-statistics, related to Figure 5 and Methods S1. Relationship of Anatolian and Greek Neolithics with the Levant using f-statistics on the 1240k dataset of the form D(CAnatolia/NWAnatolia/NGreece/SGreece/CEurope,Test; Levant,Mbuti [outgroup]). This test indicates whether Neolithic individuals from CAnatolia, NWAnatolia, Greece, or CEurope (left) share less (blue, Z score < 0) or more (orange, Z score > 0) ancestry with samples from the Levant, namely individuals from contemporary Israel associated with Natufian culture (Israel_Natufian), Pre-Pottery Neolithic B (Israel_PPNB), and Chalcolithic (Israel_C; all bottom) than the Test individuals/populations from Greece and CAnatolia. Significant Z scores above 3.0 or below −3.0 are shown with more intense colors. We find a strongly significant excess of shared drift between populations from the Levant and NGreece, SGreece, NWAnatolia, and CEurope when contrasted to Boncuklu. This signal was, however, not replicated when representing CAnatolia by Pınarbaşı. In contrast, the SGreece populations Diros_EN and Peloponnese_N appear to share excess drift with populations from the Levant, and in particular the Chalcolithic Israel_C, when contrasted to samples from NGreece, NWAnatolia, or CEurope.

f-statistics

We first used f-statistics to confirm that the assignment of samples to specific populations is not violated by the presence of shared drift with external samples. For this, we calculated all possible f-statistics on the 1240k dataset of the form of D(Individual 1 from the population tested, Individual 2 from the same population; Other samples, Outgroup) using ADMIXTOOLS (Patterson et al., 2012) and Mbuti as the outgroup. We used qpGraph from ADMIXTOOLS v7300 (Patterson et al., 2012) using 1,000 initial conditions to match specificities of the datasets. To validate the model inferred by fastsimcoal2, we aimed at comparing f-statistics predicted under the fastsimcoal2 model against those calculated from the data. For this, we first created a full admixture graph carefully matching the model in Figure 3. To validate this graph, we used data simulated with fastsimcoal2 under the model shown in Figure 3 (seeMethods S1 "- Demogenomic inference with fastsimcoal2 - Final Model" section), which we fitted with qpGraph 100 times with 1,000 initial conditions each. However, qpGraph failed to identify a suitable set of branch lengths and admixture proportion to explain the simulated data with all runs resulting in predicted f-statistics very different from those calculated on the simulated data, suggesting that the SFS captures more information of the full data than f-statistics do. Since the full graph could not be fitted, we next aimed at manually simplifying the graph. Specifically, we removed admixture edges, until we identified the simplest graph (in terms of admixture edges) for which no major differences between the f-statistics predicted under the model and those calculated from the simulated data were detected. This graph was then re-fitted with qpGraph on both a subset of the Ancient dataset (the sites from the EurEF panel shown in Table M1_4) as well as on the 1240k dataset described above. In both cases, qpGraph was run 100 times with 1,000 initial conditions each. To present the fitted graphs and quantify the reliability of the fitting, we calculated the median and 90% quantiles for all branch lengths and admixture odds ratios across the 20 best graphs as judged by the final score. A more detailed description can be found in Methods S1.

Joint Distribution of Fitness Effects

To create a representative Neolithic model population, we aggregated all Neolithic samples that were newly sequenced, plus the WC1 genome (N = 14) from the 102samples dataset. As well, to create a representative modern population, we used the SGDP populations in the 102samples dataset that are in rough proximity to the ancient populations (Polish, Bergamo, Czech, French, Hungarian, Greek, Albanian, Bulgarian, and Turkish). We annotated the synonymous and nonsynonymous SNPs using ANNOVAR (Wang et al., 2010) with hg19/GRCh37 (included with ANNOVAR) as the reference genome. Ancestral alleles were determined based on the Ensembl Compara 71 genome FASTA files. To account for missing data, we projected the joint allele frequency spectra downward to 16 Neolithic chromosomes and 28 modern chromosomes, to roughly maximise the number of segregating synonymous SNPs. The joint Distribution of Fitness Effects (DFE) analysis requires a simpler demographic model than our main inference (Figure 3), and simulations suggest that joint DFE analysis is robust to demographic model details (Huang et al., 2021). Based on prior results (Gravel et al., 2011), we used dadi (Gutenkunst et al., 2009) to fit a demographic model to the synonymous data in which the ancestors of the Neolithic population underwent a bottleneck to relative size n followed by exponential growth (Figure M1_11A). The ancestors of the modern samples diverged T time units after the bottleneck, following the same growth rate to reach final size n. T time units after this divergence, Neolithic chromosomes were sampled, T time units before present. We also included a parameter p to account for potential ancestral state misidentification (Ragsdale et al., 2016). To fit this model, we used grid points of [128, 138, 148], and the best-fit parameters were n = 0.113, n = 1.42, T = 0.106, T = 0.0054, T = 0.0232, p = 0.0634. We then fit a model including this demographic history plus a joint DFE to the nonsynonymous data (Huang et al., 2021). We modelled the DFE as a bivariate lognormal distribution. We assumed that the population-scaled mutation rate q for nonsynonymous mutations was 2.31 times that for synonymous mutations, and we again included a parameter for ancestral state misidentification. We assumed that selection coefficients were potentially different in the ancestors of the modern individuals (Figure M1_11E). The best fit parameters for the joint DFE were m = 3.29, s = 2.61, and r = 0.9968, with 0 misidentification inferred.

Demographic Analyses

MSMC2

We used MSMC2 (Schiffels and Wang, 2020) to infer past effective sizes of ancestral populations and their split times for all high-quality ancient and some representative modern individuals (Table M1_3) on the Phased dataset. To ensure high data quality especially high mappability and genotype quality, we followed (Schiffels and Wang, 2020) and used two masks: 1) per chromosome mappability masks (um75-hs37d5.bed.gz) for the human reference genome hs37d5 downloaded from https://github.com/wangke16/MSMC-IM/tree/master/masks; 2) sample-specific masks that we generated as suggested by Schiffels and Wang (2020). We then ran the generate_multihetsep.py script from MSMC-tools (https://github.com/stschiff/msmc-tools, commit 07bc8a9) to get single, multi-sample and paired population input files for MSMC2. Example of a command line for two, four or eight haplotypes: generate_multihetsep.py --chr 1 --mask ind1.chr1.mask.bed --mask mappabilityMaskperChr/chr1_m75-hs37d5.bed ind1.chr1.phased.vcf > ind1.chr1.multihetsep.txt generate_multihetsep.py --chr 1 --mask ind1.chr1.mask.bed --mask ind2.chr1.mask.bed --mask mappabilityMaskperChr/chr1_m75-hs37d5.bed ind1.chr1.phased.vcf ind2.chr1.phased.vcf > pop1.chr1.multihetsep.txt generate_multihetsep.py --chr 1 --mask ind1.chr1.mask.bed --mask ind2.chr1.mask.bed --mask ind3.chr1.mask.bed --mask ind4.chr1.mask.bed --mask mappabilityMaskperChr/chr1_m75-hs37d5.bed ind1.chr1.phased.vcf ind2.1.phased.vcf ind3.chr1.phased.vcf ind4.chr1.phased.vcf > pop1_pop2.chr1.multihetsep.txt We inferred past effective population size for each diploid individual separately (Figure M1_12) as well as using two samples per population if available (Figure S2), using command lines such as: msmc2 -t11 -s -o ind1.2haps.msmc2 ind1.chr msmc2 -t11 -I 0,1,2,3 -s -o pop1.4haps.msmc2 pop1.chr All MSMC2 results were scaled using a mutation rate of 1.25 × 10−8 per base pair per generation (Scally and Durbin, 2012; Schiffels and Wang, 2020), and a generation time of 29 years (Fenner, 2005). To estimate split times between population pairs, we used MSMC2 1) to estimate coalescent rates among the samples of the first population, 2) to estimate coalescent rates among the samples of the second population, and 3) to estimate coalescent rates across the two populations. Example command lines used for two populations pop1 and pop2 were: msmc2 -t11 -I 0,1,2,3 -s -o pop1.4haps.msmc2 pops.chr msmc2 -t11 -I 4,5,6,7 -s -o pop2.4haps.msmc2 pops.chr msmc2 -t11 -I 0-4,0-5,0-6,0-7,1-4,1-5,1-6,1-7,2-4,2-5,2-6,2-7,3-4,3-5,3-6,3-7 -s -o pop1-pop2.8haps.cross.msmc2 pops.chr We then used the MSMC2 script combineCrossCoal.py to create a single output file with all three rates: combineCrossCoal.py pop1-pop2. > pop1-pop2.combined.msmc2.final.txt For each population-pair, we then plotted the relative cross-coalescence rate (CCR), which is estimated by taking the ratio of the across-rate and the mean within-rate (Figure M1_13; Schiffels and Wang, 2020). The relative CCR indicates when two populations were a single population (values around 1) and when they were well separated into two isolated populations (values close to zero) (Schiffels and Wang, 2020). However, translating the relative CCR into estimates of split times is difficult for two reasons: First, if a population split was followed by migration, the relative CCR will remain high even after the split. Second, the uncertainty associated with the different coalescent rates translates into a gradual change in the relative CCR, even under a hard split. As a rough estimate, (Schiffels and Wang, 2020) recommends estimating split times as the time when the relative CCR hits 0.5, which we show in Figure M1_14 for all pairwise comparisons. To obtain confidence intervals around coalescence rate estimates, we generated 20 artificial genomes by block-bootstrapping MSMC2 input files in 5Mb blocks (Figures M1_15 and M1_16) as suggested in Schiffels and Wang (2020).

fastsimcoal2

Demographic inferences were carried out with fastsimcoal2.7 (Excoffier et al., 2013, 2021) on six different panels of newly sequenced ancient individuals, on the neutral SFS and with neutral mutation rate adjusted for each panel from the basal mutation (Table M1_4). Panels composition and properties: For each panel, we excluded from the Ancient dataset i) sites with any missing data in any individual of the panel using BEDOPS (Neph et al., 2012), ii) sites with different reference allele for the chimp and gorilla reference genomes, iii) CpG sites, iv) sites found in CpG islands or v) in genomic regions with recombination rate less than 1 cM/Mb. We thus kept a total of TX sites for the panel X, among which SX were polymorphic and MX monomorphic (the properties of each panel are found in Table M1_4). Second, in order to have neutral data best suited for demographic inference (Pouyet et al., 2018), we only kept Sneutral_X BGC-free A↔T and G↔C polymorphic sites, which represented ∼ 0.2 of the filtered polymorphic sites for any panel. Since we would have expected them to represent one third of all filtered sites if mutation rates had been equal for all mutation types, we computed the ratio , which represents the reduction in mutation rates that has occurred at those sites in panelX. We then estimated the number of neutrally evolving monomorphic sites in each panel as Mneutral_X = MX/3, since we expect that one third of all mutations should be BGC-free. Parameter inference via maximum likelihood: Parameter estimates were obtained by maximising the model likelihood over 50 independent runs of fastsimcoal, 50 expectation conditional maximisation (ECM) cycles per run and 500,000 coalescent simulations per estimation of the expected SFS (except for the Core models for which 200,000 simulations were run). The command line used for the estimation was of the type:where fsc is the fastsimcoal2.7 program (available on http://cmpg.unibe.ch/software/fastsimcoal2/) and xxx the generic name of the input files. The .est and .tpl input files are available in our GitHub repository: https://github.com/CMPG/originsEarlyFarmers. fsc -t xxx.tpl -n500000 -d -e xxx.est -M -L50 -q -C5 --multiSFS --logprecision 18 -c1 -B1 Likelihood comparison and model choice: When several models were tested for the same panel, we retained the model with the highest estimated likelihood over 50 runs, and recorded the maximum likelihood (ML) parameters of this model. In order to take into account the variance in the estimation of the likelihoods due the limited number of performed coalescent simulations (here 200,000 or 500,000) to estimate the expected SFS, we also compared the likelihoods of the models estimated on the basis of 10 million coalescent simulations done under the ML parameters. We repeated this procedure 100 times per model to check if the distributions of these likelihoods were overlapping and thus not distinguishable. We used the following command line to get these 100 likelihoods:Finally, we also computed the Akaike criterion (AIC) (Akaike, 1974) and the model relative likelihoods assuming site independence, even though our “neutral” sites may be linked on some chromosomes. fsc -i xxx_maxL.par -R100 -n10000000 -d -u -C5 --logprecision 18 -q Confidence intervals: Confidence intervals around ML parameter point estimates were obtained via a parametric bootstrap approach, for which we first generated 100 SFS covering T nucleotides using estimated ML parameters, with the command line:Then, for each of these bootstrapped SFS, we re-estimated the parameters of the model using 20 independent runs starting at the ML parameters value (option --initvalues in fastsimcoal2). We used 60 ECM cycles for each run and performed 500,000 simulations for estimating the expected SFS under a given set of parameter values and to estimate the model likelihood. The fastsimcoal2 command line used for the bootstrap was of the type:The limits of 95% confidence intervals were finally estimated by computing the 2.5% and 97.5% quantiles of the distribution of the 100 newly estimated ML parameter values. fsc -i xxx.par -n100 -j -d -s0 -x –I -q -u fsc -t xxx.tpl -n500000 -d -e xxx.est --initvalues xxx.pv -M -L60 -q -C5 --multiSFS --logprecision 18
REAGENT or RESOURCESOURCEIDENTIFIER
Biological samples

Ancient human bone materialThis studyAKT16; ERS10598167
Ancient human bone materialThis studyBar25; ERS10598177
Ancient human bone materialThis studyNea2; ERS10598179
Ancient human bone materialThis studyNea3; ERS10598178
Ancient human bone materialThis studyVLASA32; ERS10598181
Ancient human bone materialThis studyVLASA7; ERS10598180
Ancient human bone materialThis studyLEPE48; ERS10598171
Ancient human bone materialThis studyLEPE52; ERS10598168
Ancient human bone materialThis studySTAR1; ERS10598170
Ancient human bone materialThis studyVC3-2; ERS10598169
Ancient human bone materialThis studyAsp6;ERS10598172
Ancient human bone materialThis studyKlein7; ERS10598176
Ancient human bone materialThis studyEss7; ERS10598174
Ancient human bone materialThis studyDil16; ERS10598175
Ancient human bone materialThis studyHerx; ERS10598173

Chemicals, peptides, and recombinant proteins

AccuPrime™ Pfx SuperMixInvitrogenCat#12344040
Bst Polymerase, Large Fragment (8 U/μl)New England Biolabs GmbHCat#M0275S
dNTPs (each 10 mM)QIAGEN, Hilden, GermanyCat#201901
EDTA (0.5 M, pH 8.0)Ambion/Applied Biosystems, Life TechnologiesCat#AM9262
Lauroylsarcosine, Sodium SaltMerck Millipore, Merck KGaA, Darmstadt, GermanyCat#428010
NEBNext End Repair Enzyme MixNew England Biolabs GmbHCat#E6050L
NEBNext End Repair Reaction Buffer (10X)New England Biolabs GmbHCat#E6050L
Nuclease-free H2OLife TechnologiesCat#AM9932
PEG-4000Thermo ScientificCat#EL0011
Proteinase KRoche Diagnostics, Mannheim, GermanyCat#3115828001
T4 DNA Ligase (5 U/μl)Thermo ScientificCat#EL0011
T4 DNA Ligase Buffer (10X)Thermo ScientificCat#EL0011
ThermoPol Buffer (10X)New England Biolabs GmbHCat#M0275S
Tris-EDTASigma-AldrichCat#T9285
Tris-HCl (1M, pH 8.0)Life TechnologiesCat#15568025
USER™ enzymeNew England Biolabs GmbHCat#M5505L

Critical commercial assays

Agilent 2100 Expert Bioanalyzer System and High Sensitivity DNA Analysis KitAgilent TechnologiesCat#5067-4626 (kit)
Qubit Fluorometric quantitation and dsDNA HS Assay KitInvitrogenCat#Q32854 (kit);Cat#Q32856 (tubes)

Deposited data

Sequencing data and aligned BAMfilesThis studyENA: PRJEB50857
Code and input-files connected to this studyThis studyhttps://doi.org/10.5281/zenodo.6367517 for the release of https://github.com/CMPG/originsEarlyFarmers/
Ancient genome - Demultiplexed FASTQ files kindly provided by the authors (contact: Zuzana Hofmanová, co-author of this study)(Hofmanová et al., 2016)Bar8
Ancient genome - Demultiplexed FASTQ files(Jones et al., 2015)Bichon; ERR1078331 - ERR1078351
Ancient genome - FASTQ files (produced from BAMfiles by ENA)(Kılınç et al., 2016)Bon002; ERR1514027
Ancient genome - Demultiplexed FASTQ files kindly provided by the authors (contact: Kay Prüfer)(Lazaridis et al., 2014)Loschbour
Ancient genome - Raw sequencing FASTQ file kindly provided by the authors (contact: Lara Cassidy)(Gamba et al., 2014)NE1
Ancient genome - Aligned BAM file(Günther et al., 2018)SF12; ERR2060277
Ancient genome - Demultiplexed FASTQ files kindly provided by the authors (contact: Kay Prüfer)(Lazaridis et al., 2014)Stuttgart
Ancient genome - Demultiplexed FASTQ file kindly provided by the authors (contact: Yoan Diekmann, co-author of this study)(Brace et al., 2019)CarsPas1
Ancient genome - Demultiplexed FASTQ files kindly provided by the authors (contact: Jens Blöcher, co-author of this study)(Broushaki et al., 2016)WC1
Ancient genome - Demultiplexed FASTQ files(Jones et al., 2015)KK1; ERR1078321-ERR1078325
Modern genomes - Aligned BAMfiles from 77 modern individualsThe Simons Genome Diversity Project (SGDP)(Mallick et al., 2016)https://www.simonsfoundation.org/simons-genome-diversity-project/Individual identifiers, see Table S2.
1000 Genomes phase3 per chromosome VCFs(1000 Genomes Project Consortium et al., 2015)http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
Allen Ancient DNA Resource v42.4 and v37.2Reich lab public data releasehttps://reichdata.hms.harvard.edu/pub/datasets/amh_repo/curated_releases/index_v42.4.html;https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data
Assembly gaps and centromeresUCSC genome browserhttp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz
Chimpanzee hg19 nucleotide statesUCSC genome browserhttp://hgdownload.cse.ucsc.edu/goldenpath/hg19/vsPanTro4/axtNet/
Chimpanzee Reference Genome (panTro4)The Chimpanzee Sequencing and Analysis Consortium (2005)GenBank assembly accession: GCA_000001515.4
CpG islands listUCSC genome browser(CpG Islands (cpgIslandExt) Track)
Ensembl Compara 71 genome FASTA filesEnsembl Comparahttp://ftp.ensembl.org/pub/release-71/fasta/ancestral_alleles/homo_sapiens_ancestor_GRCh37_e71.tar.bz2
Gorilla hg19 nucleotide statesUCSC genome browserhttp://hgdownload.cse.ucsc.edu/goldenpath/hg19/vsGorGor5/axtNet/
Haplotype Reference Consortium dataset(McCarthy et al., 2016)accession number EGAD00001002729 on the European Genome-phenome Archive
HapMap file for chrX (ANGSD) HapMapChrx.gz(Rasmussen et al., 2011)http://www.popgen.dk/angsd/index.php/ANGSD
HapMap phase II b37 genetic mapN/Ahttps://github.com/odelaneau/shapeit4/tree/master/maps
Human reference sequence hs37d5(1000 Genomes Project Consortium et al., 2015)ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
Known InDel positions(1000 Genomes Project Consortium et al., 2015)ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/1000G_phase1.indels.b37.vcf.gz
Known InDel positions(1000 Genomes Project Consortium et al., 2015)ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz
Per chromosome mappability mask for human reference genome hs37d5The Simons Genome Diversity Project (SGDP)(Mallick et al., 2016)https://github.com/wangke16/MSMC-IM/blob/master/masks/um75-hs37d5.bed.gz
Recombination Map for YRI population(1000 Genomes Project Consortium et al., 2015)ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130507_omni_recombination_rates/
Ultraconserved sites for recalibration step(Dimitrieva and Bucher, 2013)https://ccg.epfl.ch/UCNEbase/data/download/ucnes/hg19_UCNE_coord.bed

Oligonucleotides

P5 and P7 adapters(Meyer and Kircher, 2010)IDT, Leuven, BelgiumN/A
P5 and P7 indexing primers with index sequences (8 bp) from the NexteraXT index Kit v2(Kircher et al., 2012) and Illumina, San Diego, California, United StatesIDT, Leuven, BelgiumN/A

Software and algorithms

ADMIXTOOLS package v7300(Patterson et al., 2012)https://github.com/DReichLab/AdmixTools
ANGSD - version 0.917(Rasmussen et al., 2011)http://www.popgen.dk/angsd/index.php/ANGSD
ANNOVAR(Wang et al., 2010)https://annovar.openbioinformatics.org
ATLAS - commits 6bd2482 & 7cfc900(Link et al., 2017)https://bitbucket.org/wegmannlab/atlas/
ATLAS-Pipeline, commit 6df90e7Wegmann lab, Ilektra Schulzbitbucket.org/wegmannlab/atlas-pipeline
bcftools versions: 1.9 and 0.1.15(Danecek et al., 2021)https://samtools.github.io/bcftools/howtos/index.html
bwa - Burrows-Wheeler Alignment Tool - versions 0.7.15 and 0.7.17(Li, 2013)bio-bwa.sourceforge.net
BEDOPS v2.4.40(Neph et al., 2012)https://bedops.readthedocs.io/en/latest/
Bedtools 2.25.0(Quinlan and Hall, 2010)https://bedtools.readthedocs.io/en/latest/
ContamMix - version 1.0(Fu et al., 2013)https://science.umd.edu/biology/plfj/
dadi(Gutenkunst et al., 2009)https://bitbucket.org/gutenkunstlab/dadi
fastsimcoal2.7(Excoffier et al., 2013, 2021)http://cmpg.unibe.ch/software/fastsimcoal2/
fastqc - version 0.11.5Babraham Bioinformaticswww.bioinformatics.babraham.ac.uk/projects/fastqc/
GATK - version 3.7(DePristo et al., 2011)https://gatk.broadinstitute.org
HIrisPlex-S webtool(Chaitanya et al., 2018; Walsh et al., 2013)https://hirisplex.erasmusmc.nl/
IBDSeq v. r1206(Browning and Browning, 2013)http://faculty.washington.edu/browning/ibdseq.html
LEA R package v.2.6.0(Frichot and François, 2015)https://bioconductor.org/packages/release/bioc/html/LEA.html
mafft - version 7.31(Katoh et al., 2002)https://mafft.cbrc.jp/alignment/software/linux.html
MIA (Mapping Iterative Assembler) - version 1.0MPI EVA Bioinformaticshttps://github.com/mpieva/mapping-iterative-assembler
MSMC2(Schiffels and Wang, 2020)https://github.com/stschiff/msmc2
MSMC-tools, commit 07bc8a9(Schiffels and Wang, 2020)https://github.com/stschiff/msmc-tools
phy-mer(Navarro-Gomez et al., 2015)https://github.com/MEEIBioinformaticsCenter/phy-mer/
Picard-tools - version 2.9Broad Institutehttp://broadinstitute.github.io/picard/
R - version 4.0, 3.7, and 3.6.1R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austriahttps://www.R-project.org/
SAMtools - version 1.3(Li et al., 2009)http://www.htslib.org/
Samtools 1.9(Danecek et al., 2021)https://github.com/samtools/samtools
seqtk - version 1.2N/Ahttps://github.com/lh3/seqtk
SHAPEIT4 v1.2(Delaneau et al., 2019)https://odelaneau.github.io/shapeit4/
Snakemake - version 4.0(Köster and Rahmann, 2012)https://snakemake.readthedocs.io/en/stable/
Trim Galore! - version 0.4.3Babraham Bioinformaticswww.bioinformatics.babraham.ac.uk/projects/trim_galore/
Yjasc_3752_ry_compute.py, version 0.4(Skoglund et al., 2013)https://ars.els-cdn.com/content/image/1-s2.0-S0305440313002495-mmc1.zip
Yleaf(Ralf et al., 2018)https://github.com/genid/Yleaf

Other

Agencourt® AMPure® XP beadsBeckmann CoulterCat#A63880
Amicon Ultra-4 Centrifugal Filter Units, 30kDaMerck Millipore, Darmstadt, GermanyCat#UFC803096
HydroxyapatiteSigma-AldrichCat#21223
MinElute PCR Purification KitQIAGEN, Hilden, GermanyCat#28006
MSB® Spin PCRapaceInvitek Molecular GmbH, Berlin, GermanyCat#1020220400
Spezial-Edelkorund (EW60/250μ)Harnisch+Rieth, Winterbach, GermanyCat#75250
Spezial-Edelkorund (30B/50μ)Harnisch+Rieth, Winterbach, GermanyCat#75308
  126 in total

1.  Strong association of the SNP rs17822931 with wet earwax and bromhidrosis in a Chinese family.

Authors:  Dandan Shang; Xue Zhang; Miao Sun; Yang Wei; Yaran Wen
Journal:  J Genet       Date:  2013       Impact factor: 1.166

2.  Ancient cattle genomics, origins, and rapid turnover in the Fertile Crescent.

Authors:  Marta Pereira Verdugo; Victoria E Mullin; Amelie Scheu; Valeria Mattiangeli; Kevin G Daly; Pierpaolo Maisano Delser; Andrew J Hare; Joachim Burger; Matthew J Collins; Ron Kehati; Paula Hesse; Deirdre Fulton; Eberhard W Sauer; Fatemeh A Mohaseb; Hossein Davoudi; Roya Khazaeli; Johanna Lhuillier; Claude Rapin; Saeed Ebrahimi; Mutalib Khasanov; S M Farhad Vahidi; David E MacHugh; Okan Ertuğrul; Chaido Koukouli-Chrysanthaki; Adamantios Sampson; George Kazantzis; Ioannis Kontopoulos; Jelena Bulatovic; Ivana Stojanović; Abdesalam Mikdad; Norbert Benecke; Jörg Linstädter; Mikhail Sablin; Robin Bendrey; Lionel Gourichon; Benjamin S Arbuckle; Marjan Mashkour; David Orton; Liora Kolska Horwitz; Matthew D Teasdale; Daniel G Bradley
Journal:  Science       Date:  2019-07-12       Impact factor: 47.728

3.  Evidence for recent, population-specific evolution of the human mutation rate.

Authors:  Kelley Harris
Journal:  Proc Natl Acad Sci U S A       Date:  2015-03-02       Impact factor: 11.205

Review 4.  Revising the human mutation rate: implications for understanding human evolution.

Authors:  Aylwyn Scally; Richard Durbin
Journal:  Nat Rev Genet       Date:  2012-09-11       Impact factor: 53.242

5.  Human genetic data reveal contrasting demographic patterns between sedentary and nomadic populations that predate the emergence of farming.

Authors:  Carla Aimé; Guillaume Laval; Etienne Patin; Paul Verdu; Laure Ségurel; Raphaëlle Chaix; Tatyana Hegay; Lluis Quintana-Murci; Evelyne Heyer; Frédéric Austerlitz
Journal:  Mol Biol Evol       Date:  2013-09-24       Impact factor: 16.240

6.  The formation of human populations in South and Central Asia.

Authors:  Vagheesh M Narasimhan; Nick Patterson; Priya Moorjani; Nadin Rohland; Rebecca Bernardos; Swapan Mallick; Iosif Lazaridis; Nathan Nakatsuka; Iñigo Olalde; Mark Lipson; Alexander M Kim; Luca M Olivieri; Alfredo Coppa; Massimo Vidale; James Mallory; Vyacheslav Moiseyev; Egor Kitov; Janet Monge; Nicole Adamski; Neel Alex; Nasreen Broomandkhoshbacht; Francesca Candilio; Kimberly Callan; Olivia Cheronet; Brendan J Culleton; Matthew Ferry; Daniel Fernandes; Suzanne Freilich; Beatriz Gamarra; Daniel Gaudio; Mateja Hajdinjak; Éadaoin Harney; Thomas K Harper; Denise Keating; Ann Marie Lawson; Matthew Mah; Kirsten Mandl; Megan Michel; Mario Novak; Jonas Oppenheimer; Niraj Rai; Kendra Sirak; Viviane Slon; Kristin Stewardson; Fatma Zalzala; Zhao Zhang; Gaziz Akhatov; Anatoly N Bagashev; Alessandra Bagnera; Bauryzhan Baitanayev; Julio Bendezu-Sarmiento; Arman A Bissembaev; Gian Luca Bonora; Temirlan T Chargynov; Tatiana Chikisheva; Petr K Dashkovskiy; Anatoly Derevianko; Miroslav Dobeš; Katerina Douka; Nadezhda Dubova; Meiram N Duisengali; Dmitry Enshin; Andrey Epimakhov; Alexey V Fribus; Dorian Fuller; Alexander Goryachev; Andrey Gromov; Sergey P Grushin; Bryan Hanks; Margaret Judd; Erlan Kazizov; Aleksander Khokhlov; Aleksander P Krygin; Elena Kupriyanova; Pavel Kuznetsov; Donata Luiselli; Farhod Maksudov; Aslan M Mamedov; Talgat B Mamirov; Christopher Meiklejohn; Deborah C Merrett; Roberto Micheli; Oleg Mochalov; Samariddin Mustafokulov; Ayushi Nayak; Davide Pettener; Richard Potts; Dmitry Razhev; Marina Rykun; Stefania Sarno; Tatyana M Savenkova; Kulyan Sikhymbaeva; Sergey M Slepchenko; Oroz A Soltobaev; Nadezhda Stepanova; Svetlana Svyatko; Kubatbek Tabaldiev; Maria Teschler-Nicola; Alexey A Tishkin; Vitaly V Tkachev; Sergey Vasilyev; Petr Velemínský; Dmitriy Voyakin; Antonina Yermolayeva; Muhammad Zahir; Valery S Zubkov; Alisa Zubova; Vasant S Shinde; Carles Lalueza-Fox; Matthias Meyer; David Anthony; Nicole Boivin; Kumarasamy Thangaraj; Douglas J Kennett; Michael Frachetti; Ron Pinhasi; David Reich
Journal:  Science       Date:  2019-09-06       Impact factor: 47.728

7.  Paleogenomic Evidence for Multi-generational Mixing between Neolithic Farmers and Mesolithic Hunter-Gatherers in the Lower Danube Basin.

Authors:  Gloria González-Fortes; Eppie R Jones; Emma Lightfoot; Clive Bonsall; Catalin Lazar; Aurora Grandal-d'Anglade; María Dolores Garralda; Labib Drak; Veronika Siska; Angela Simalcsik; Adina Boroneanţ; Juan Ramón Vidal Romaní; Marcos Vaqueiro Rodríguez; Pablo Arias; Ron Pinhasi; Andrea Manica; Michael Hofreiter
Journal:  Curr Biol       Date:  2017-05-25       Impact factor: 10.834

8.  UCNEbase--a database of ultraconserved non-coding elements and genomic regulatory blocks.

Authors:  Slavica Dimitrieva; Philipp Bucher
Journal:  Nucleic Acids Res       Date:  2012-11-27       Impact factor: 16.971

9.  fastsimcoal2: demographic inference under complex evolutionary scenarios.

Authors:  Laurent Excofffier; Nina Marchi; David Alexander Marques; Remi Matthey-Doret; Alexandre Gouy; Vitor C Sousa
Journal:  Bioinformatics       Date:  2021-06-23       Impact factor: 6.937

10.  Inferring Genome-Wide Correlations of Mutation Fitness Effects between Populations.

Authors:  Xin Huang; Alyssa Lyn Fortier; Alec J Coffman; Travis J Struck; Megan N Irby; Jennifer E James; José E León-Burguete; Aaron P Ragsdale; Ryan N Gutenkunst
Journal:  Mol Biol Evol       Date:  2021-09-27       Impact factor: 16.240

View more
  3 in total

1.  Ancient mitochondrial diversity reveals population homogeneity in Neolithic Greece and identifies population dynamics along the Danubian expansion axis.

Authors:  Nuno M Silva; Susanne Kreutzer; Angelos Souleles; Sevasti Triantaphyllou; Kostas Kotsakis; Dushka Urem-Kotsou; Paul Halstead; Nikos Efstratiou; Stavros Kotsos; Georgia Karamitrou-Mentessidi; Fotini Adaktylou; Areti Chondroyianni-Metoki; Maria Pappa; Christina Ziota; Adamantios Sampson; Anastasia Papathanasiou; Karen Vitelli; Tracey Cullen; Nina Kyparissi-Apostolika; Andrea Zeeb Lanz; Joris Peters; Jérémy Rio; Daniel Wegmann; Joachim Burger; Mathias Currat; Christina Papageorgopoulou
Journal:  Sci Rep       Date:  2022-08-05       Impact factor: 4.996

2.  The Genetic Echo of the Tarim Mummies in Modern Central Asians.

Authors:  Shan-Shan Dai; Xierzhatijiang Sulaiman; Jainagul Isakova; Wei-Fang Xu; Najmudinov Tojiddin Abdulloevich; Manilova Elena Afanasevna; Khudoidodov Behruz Ibrohimovich; Xi Chen; Wei-Kang Yang; Ming-Shan Wang; Quan-Kuan Shen; Xing-Yan Yang; Yong-Gang Yao; Almaz A Aldashev; Abdusattor Saidov; Wei Chen; Lu-Feng Cheng; Min-Sheng Peng; Ya-Ping Zhang
Journal:  Mol Biol Evol       Date:  2022-09-01       Impact factor: 8.800

3.  Phylogeographic analysis reveals an ancient East African origin of human herpes simplex virus 2 dispersal out-of-Africa.

Authors:  Sonia Burrel; David Boutolleau; Jennifer L Havens; Sébastien Calvignac-Spencer; Kevin Merkel; Joel O Wertheim
Journal:  Nat Commun       Date:  2022-09-17       Impact factor: 17.694

  3 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.