Literature DB >> 32697300

Y-Chromosome Variation in Southern African Khoe-San Populations Based on Whole-Genome Sequences.

Thijessen Naidoo^1,2,3,4, Jingzi Xu¹, Mário Vicente¹, Helena Malmström^1,5, Himla Soodyall^6,7,8, Mattias Jakobsson^1,3,5, Carina M Schlebusch^1,3,5.

Abstract

Although the human Y chromosome has effectively shown utility in uncovering facets of human evolution and population histories, the ascertainment bias present in early Y-chromosome variant data sets limited the accuracy of diversity and TMRCA estimates obtained from them. The advent of next-generation sequencing, however, has removed this bias and allowed for the discovery of thousands of new variants for use in improving the Y-chromosome phylogeny and computing estimates that are more accurate. Here, we describe the high-coverage sequencing of the whole Y chromosome in a data set of 19 male Khoe-San individuals in comparison with existing whole Y-chromosome sequence data. Due to the increased resolution, we potentially resolve the source of haplogroup B-P70 in the Khoe-San, and reconcile recently published haplogroup A-M51 data with the most recent version of the ISOGG Y-chromosome phylogeny. Our results also improve the positioning of tentatively placed new branches of the ISOGG Y-chromosome phylogeny. The distribution of major Y-chromosome haplogroups in the Khoe-San and other African groups coincide with the emerging picture of African demographic history; with E-M2 linked to the agriculturalist Bantu expansion, E-M35 linked to pastoralist eastern African migrations, B-M112 linked to earlier east-south gene flow, A-M14 linked to shared ancestry with central African rainforest hunter-gatherers, and A-M51 potentially unique to the Khoe-San.

Entities: CellLine Chemical Disease Gene Species

Keywords: Khoe-San; Y chromosome; haplogroups; next-generation sequencing; southern Africa

Mesh：

Year: 2020 PMID： 32697300 PMCID： PMC7375190 DOI： 10.1093/gbe/evaa098

Source DB: PubMed Journal: Genome Biol Evol ISSN： 1759-6653 Impact factor: 3.416

Introduction

The male-specific portion of the Y chromosome (MSY) has long been regarded as an effective tool in the study of human evolutionary history (Underhill and Kivisild 2007). It has proved useful mainly due to a lack of recombination along its length, making it the longest haplotypic block in the human genome (Scozzari et al. 2012); and its paternal mode of inheritance. The transmission of an intact haplotype from father to son, changing only through mutation, preserves a simpler record of its history and allows us to study the male contribution to the shaping of humanity. The mutations found on Y chromosomes sourced from numerous human populations have been used to generate Y-chromosome phylogenies (Underhill et al. 2000; Hammer et al. 2001; Y-Chromosome Consortium 2002; Karafet et al. 2008), with well-defined and geographically informative haplogroups, that is, groups of haplotypes that share common ancestors and so present as clades in a phylogeny. While the first consensus phylogeny with standardized nomenclature was published in 2002 (Y-Chromosome Consortium 2002) with the last full version published in 2008 (Karafet et al. 2008) and a minimal reference version published in 2014 (van Oven et al. 2014), the International Society of Genetic Genealogy (ISOGG) maintains a comprehensive online version (https://isogg.org/tree/index.html) that is updated every year. The utility of the Y-chromosome phylogeny notwithstanding, the ascertainment bias present when initially sourcing the mutations used to build it, resulted in limitations. The use of predefined sets of variants, limited the ability to generate unbiased estimates of global diversity and accurate estimates of the time to most recent common ancestor (TMRCA) (Poznik et al. 2013). As a solution, short tandem repeat polymorphisms (STRs) have been used for a long time, however, STRs had their own biases and issues (i.e., see Hallast et al. [2015] for a comparison between sequence-based and STR-based TMRCAs). The emergence of next-generation sequencing (NGS) technology, resulted in the discovery of thousands of unbiased variants (Cruciani et al. 2011; Francalacci et al. 2013; Poznik et al. 2013; Scozzari et al. 2014; Karmin et al. 2015; Barbieri et al. 2016), which allowed for substantially more accurate estimates of the age the Y-chromosome phylogeny and its haplogroups. Although studies usually attempted to balance the global representation of populations in their samples, there did appear to be an underrepresentation of African samples in the final results, with this being especially true of haplogroups A and B. Barbieri et al. (2016) addressed this discrepancy by sequencing a portion of the Y chromosome in 547 Khoe-San- and Bantu-speakers; resulting in much older estimates of the ages of haplogroups A and B and their subclades. Following high-coverage sequencing of the full Y chromosome in a data set of 19 male Khoe-San individuals and comparison with existing full Y-chromosome sequence data, the present study describes the distribution of haplogroups found, in the wider African context. The high level of resolution afforded by a sequencing analysis allowed us to potentially resolve the source of haplogroup B-P70 in the Khoe-San. The use of two slightly differing phylogenies when assigning our haplogroups, allowed for the reconciliation of the haplogroup A-M51 data from Barbieri et al. (2016) with the most recent version of the ISOGG Y-chromosome phylogeny.

Results

We performed high-coverage whole-genome sequencing of 25 Khoe-San individuals from five different populations (Schlebusch et al. 2020). In this study, we discuss the results of the Y-chromosome sequences of the 19 male individuals that were included in the study. After processing and filtering the sequence data (see Materials and Methods), we obtained 5,783 variants; with an average depth of 31×. Once merged with the comparative data from seven additional populations (Drmanac et al. 2010; Auton et al. 2015; Lachance et al. 2012), our total data set contained 7,878 variants from 8.8 Mb (8,800,463 bases) of Y-chromosome sequence in 48 individuals.

Khoe-San Haplogroups

The major haplogroups found in our sequenced individuals were A-M14 (A1b1a1), A-M51 (A1b1b2a), B-M112 (B2b), E-M2 (E1b1a1), and E-M35 (E1b1b1); and were thus strongly concordant with previous surveys of Y-chromosome variation in Khoe-San populations (Underhill et al. 2000; Naidoo et al. 2010; Batini et al. 2011; Barbieri et al. 2016). The additional resolution provided by sequencing, however, uncovered several new variants especially supporting branches in haplogroups A and B, and allowed for greater clarity regarding the relationships of some subclades and markers (see table 1 and supplementary table S1, Supplementary Material online, for haplogroup assignments and population information).

Table 1

Population Information and Y-Chromosome Haplogroups of Individuals Used in the Study

Ref.	Sample Code	Population	AMY-Tree Haplogroup	AMY-Tree Marker	ISOGG Haplogroup
1	KSP103	Ju\|’hoansi	A2a1b1	A-M114	A1b1a1a2a
1	KSP154	!Xun	A2a1b1	A-M114	A1b1a1a2a
1	KSP116	Ju\|’hoansi	A2a1b2	A-P262	A1b1a1a2b
1	KSP139	Nama	A3b1a	A-P71	A1b1b2a1a
1	KSP124	Nama	A3b1a	A-P71	A1b1b2a1a1
1	KSP140	Nama	A3b1a	A-P71	A1b1b2a1a1
1	KSP105	Ju\|’hoansi	A3b1c	A-V306	A1b1b2a1b
1	KSP106	Ju\|’hoansi	A3b1c	A-V306	A1b1b2a1b
1	KSP146	!Xun	A3b1b	A-V37	A1b1b2a2
1	KSP150	!Xun	A3b1b	A-V37	A1b1b2a2
4	PygmyBaka1	Baka	B2b1a1b	B-M8035	B2b1a1c2a1∼
4	PygmyBaka3	Baka	B2b1a1b	B-M8035	B2b1a1c2a2∼
4	Hadza2	Hadza	B2b1a2	B-M7583	B2b1a2a1∼
4	Hadza5	Hadza	B2b1a2	B-M7583	B2b1a2a2∼
4	Sandawe5	Sandawe	B2b1a2	B-M7583	B2b1a2b1∼
4	Hadza3	Hadza	B2b1a2	B-M7583	B2b1a2b2a∼
1	KSP111	Ju\|’hoansi	B2b1a2	B-M7583	B2b1a2b2b∼
1	KSP155	!Xun	B2b1a2	B-M7583	B2b1a2b2b∼
4	Sandawe4	Sandawe	B2b1b*	B-M7104	B2b1b∼
4	Hadza4	Hadza	B2b1b*	B-M7104	B2b1b1b∼
2	NA19429	Luhya	E1b1a1a1a	E-M58	E1b1a1a1a1a3a∼
4	Hadza1	Hadza	E1b1a1a1f1a1c*	E-P116	E1b1a1a1a1c1a1a3a1a1a2b∼
4	PygmyBakola1	Bakola	E1b1a1a1f1a1c*	E-P116	E1b1a1a1a1c1a1a3a1a1b∼
1	KSP063	Karretjie	E1b1a1a1f1a1d	E-CTS8030	E1b1a1a1a1c1a1a3a1c1
2	NA19443	Luhya	E1b1a1a1f1a1*	E-U174	E1b1a1a1a1c1a1a3a1c1a1a∼
2	NA19428	Luhya	E1b1a1a1f1a1d	E-CTS8030	E1b1a1a1a1c1a1a3a1d1b1a1∼
4	PygmyBaka2	Baka	E1b1a1a1f1a1d	E-CTS8030	E1b1a1a1a1c1a1a3a1g∼
2	NA18504	Yoruba	E1b1a1a1f1a1*	E-U174	E1b1a1a1a1c1a1a3c1b1a1∼
2	NA18507	Yoruba	E1b1a1a1f1a1*	E-U174	E1b1a1a1a1c1a1a3d3∼
3	NA19834	Afr. American	E1b1a1a1f1a1*	E-U174	E1b1a1a1a1c1a1a3d6a
1	KSP067	Karretjie	E1b1a1a1f1a1*	E-U174	E1b1a1a1a1c1a1a3e1∼
2	NA18501	Yoruba	E1b1a1a1g1*	E-U209	E1b1a1a1a2a1a1b∼
2	NA18498	Yoruba	E1b1a1a1g1*	E-U209	E1b1a1a1a2a1a2a1∼
1	KSP092	\|Gui and ‖Gana	E1b1a1a1g1a1	E-U181	E1b1a1a1a2a1a3b1a1
1	KSP096	\|Gui and ‖Gana	E1b1a1a1g1a1	E-U181	E1b1a1a1a2a1a3b1a1
2	NA18871	Yoruba	E1b1a1a1g1a*	E-U290	E1b1a1a1a2a1a3b1a10a∼
3	NA19703	Afr. American	E1b1a1a1g1a1	E-U181	E1b1a1a1a2a1a3b1a1b∼
1	KSP069	Karretjie	E1b1a1a1g1a2	E-Z1725	E1b1a1a1a2a1a3b1a2a2
2	NA19397	Luhya	E1b1a1a1g1a2	E-Z1725	E1b1a1a1a2a1a3b1a2a2a1a∼
3	NA19700	Afr. American	E1b1a1a1g1a*	E-U290	E1b1a1a1a2a1a3b1a5∼
4	Sandawe1	Sandawe	E1b1a1a1g1*	E-U209	E1b1a1a1a2a1a3b1d1c1a1b∼
4	PygmyBedzan1	Bedzan	E1b1a1a1g1*	E-U209	E1b1a1a1a2a1a3b1d1d∼
3	NA21732	Maasai	E1b1b1a3c	E-AM00003	E1b1b1a1b2a3
3	NA21737	Maasai	E1b1b1a3c	E-AM00003	E1b1b1a1b2a3
1	KSP137	Nama	E1b1b1d*	E-M293	E1b1b1b2b2a1
1	KSP152	!Xun	E1b1b1d*	E-M293	E1b1b1b2b2a1a1c∼
4	Sandawe2	Sandawe	E1b1b1d*	E-M293	E1b1b1b2b2a1a1c∼
4	Sandawe3	Sandawe	E1b1b1d*	E-M293	E1b1b1b2b2a1a1d∼

1, this study; 2, Auton et al. (2015); 3, Drmanac et al. (2010); 4, Lachance et al. (2012).

*designates a Y chromosome paragroup

Population Information and Y-Chromosome Haplogroups of Individuals Used in the Study 1, this study; 2, Auton et al. (2015); 3, Drmanac et al. (2010); 4, Lachance et al. (2012). *designates a Y chromosome paragroup

The Internal Structure of A-M51

Haplogroup A1b1b2a (A-M51) has often been found to be the most common haplogroup A subclade in southern Africa (Barbieri et al. 2016), though usually found primarily in the Khoe-San or in populations with significant levels of Khoe-San admixture. Three previously reported subclades (Scozzari et al. 2012; Barbieri et al. 2016), haplogroups A1b1b2a1a (A-P71), A1b1b2a2 (A-V37), and A1b1b2a1b (A-V306), were found in our data set; and the branching structure was further refined and reconciled with the ISOGG Y phylogeny (ISOGG 2019-2020) (fig. 1 and table 1). Haplogroup A1b1b2a2 is the first to branch off, as reported by Barbieri et al. (2016), whereas haplogroups A1b1b2a1a and A1b1b2a1b are united by marker M239. Within haplogroup A1b1b2a1a, all three individuals were also derived at marker P102, with one of them ancestral at marker P291. This is at odds with the current ISOGG phylogeny, which has P291 basal to P102. The three subclades segregated independently among the populations, with A1b1b2a2 found in two !Xun individuals, A1b1b2a1a in three Nama individuals and A1b1b2a1b in two Ju|’hoansi individuals.

Fig. 1.

Y-chromosome phylogeny of the 48 individuals in the data set. Populations and haplogroups are shown on the right of the tree. Refer to supplementary table S1, Supplementary Material online, for additional information for each branch (numbered) in the phylogeny, including numbers of variants per branch and the lists of variants defining each branch. Branch lengths in the figure are representative of topology, and do not reflect TMRCA estimates nor the number of variants per branch.

Gene Flow into Khoe-San Populations

Haplogroup E1b1b1b2b2a1 (E-M293) was found in two Khoe-San: one Nama and one !Xun. This E-M35 subclade was previously linked to the movement of pastoralist groups from eastern Africa to southern Africa ∼2,000 years ago (Henn et al. 2008; Bajic et al. 2018). Moreover, the Y-chromosome from the Nama individual belonged to the basal E1b1b1b2b2a1 clade, whereas the !Xun individual belonged to a subclade further derived for markers CTS2104, CTS2297, CTS2553, and Y17343, which were also found in two Sandawe individuals (fig. 1 and supplementary table S1, Supplementary Material online). The two haplogroup B2b (B-M112) chromosomes, found within one Ju|’hoansi and one !Xun, fell within a subclade of haplogroup B2b1a2b2∼ (B-M7592), defined by marker M7591. The closest related Y chromosome belonged to a sister clade within haplogroup B2b1a2b2∼ (B2b1a2b2a∼) and was from a Hadza individual (fig. 1 and supplementary table S1, Supplementary Material online). Although these Khoe-San B2b chromosomes are known to be derived at marker P8 (unpublished data), this marker has since been removed from recent versions of the ISOGG Y-chromosome phylogeny. The finding that marker M7591 is also derived and appears to be phylogenetically equivalent to markers P70 and P8, allows for the reintroduction of the lineage to the phylogeny. The E1b1a1 (E-M2) haplogroup occurs in individuals from many populations (fig. 1). Khoe-San Y chromosomes within haplogroup E1b1a1 are likely to have been introduced by surrounding Bantu-speaker populations. The rapid expansion of E1b1a1 across sub-Saharan Africa (de Filippo et al. 2011) makes it difficult to pinpoint its exact sources into the Khoe-San without taking into account historical data.

Branch Length Heterogeneity

Several earlier studies (Scozzari et al. 2014; Hallast et al. 2015; Barbieri et al. 2016) found evidence of branch length heterogeneity among Y-chromosome haplogroups, and provided possible reasons for its occurrence. We also noted significant differences in branch length heterogeneity among the major African haplogroups (supplementary tables S2 and S3, Supplementary Material online). A reduced mean branch length for haplogroup A, noted previously by Scozzari et al. (2014), was again apparent from our data. Although most major haplogroups differed significantly (with the exception of the E1b1a subclades), we found that haplogroup B did not appear to have as reduced a mean branch length, relative to haplogroup E, as found previously (Hallast et al. 2015; Barbieri et al. 2016). Within haplogroup E, E1b1b1 was found to have the highest mean branch length; though this may have been due to a lower sample size compared with haplogroup E1b1a.

Placement of Branches

Markers have been added to the Y-chromosome phylogeny at a very rapid pace in the last few years. As a result, the positions of many branches have not been finalized. Our results indicate the need to reposition a few of these tentatively placed branches, especially within haplogroup A. For instance, all three A1b1a1 chromosomes were also derived for several markers tentatively assigned to haplogroups A1b1a1a2b∼ and A1b1a1a2a1a∼ (∼ denotes unconfirmed placement on ISOGG phylogeny). Further, although the ISOGG phylogeny places Y chromosomes derived at marker P262 into a subclade of A1b1a1a2a, our results reveal that this is incorrect. Marker P262-derived Y chromosomes are better placed in a sister clade to haplogroup A1b1a1a2a. This branch is supported by over 60 additional variants that segregate with P262, and we have renamed it A1b1a1a2b (fig. 1 and supplementary table S1, Supplementary Material online). Again within haplogroup A, all A1b1b2a chromosomes were derived at markers tentatively assigned to haplogroups A1b1b2b∼ and A1b1b2b2∼. The placement of these markers is more likely to be basal to both A1b1b2a and A1b1b2b.

Discussion

In this study, we present the Y chromosomes of 19 Khoe-San individuals, following whole-genome sequencing (Schlebusch et al. 2020). We used AMY-tree version 2.0 (Van Geystelen et al. 2013) to identify the major haplogroups in the data set, based on their “Updated tree version 2.0” and known haplogroup-defining variants. Notably, some of the variants used to define haplogroups in this earlier phylogeny were not present in more recent versions of the ISOGG Y-chromosome phylogeny, but were still being used to explore variation in African populations (Barbieri et al. 2016). Once we identified the placement of our samples on the ISOGG Y-chromosome phylogeny, we were able to reposition these haplogroup-defining variants onto the ISOGG Y-chromosome phylogeny; specifically, in the case of haplogroup A1b1b2a chromosomes. As well, hundreds of new variants were discovered; most notably those found to populate the relatively sparse branches of haplogroups A1b1 and B2b. Their discovery in this study and in previous studies (Barbieri et al. 2016) should be taken into account and used to build out the branches of the oft-neglected haplogroups A and B. Even in this minimal data set of 19 individuals, the haplogroup distribution reflected previous findings in Khoe-San populations (Underhill et al. 2000; Naidoo et al. 2010; Batini et al. 2011; Barbieri et al. 2016). The associations of the Khoe-San with other populations in the same major haplogroups also gave some indication of historical interactions and shared ancestry. While haplogroups A1b1a1, A1b1b2a, B2b, and E1b1b1 have long been associated with Khoe-San populations (Underhill et al. 2000), at this point, it appears haplogroup A1b1b2a may be the only surviving haplogroup in the Khoe-San to have originated autochthonously. The presence of haplogroup A1b1a1 in the Khoe-San in this study and others (Batini et al. 2011) has been characterized primarily by the more terminal lineages of the haplogroup and lineages ancestral to these have been found in central Africa (Batini et al. 2011). The majority of haplogroup B2b lineages found so far in Khoe-San populations have fallen into the subclades B-P6 and B-P70 (Batini et al. 2011; Barbieri et al. 2016), whereas other B2b lineages have been found in central and eastern Africa. Previously, Batini et al. (2011) linked the presence of B-P70 in the Khoe-San to possible gene flow from central African Rainforest hunter-gatherer populations. This was based on the presence and diversity of haplogroup B-P7 (the ancestral lineage to B-P70) in the Rainforest hunter-gatherer populations and low levels of it elsewhere. Our findings, however, place the Khoe-San-specific lineage within a clade (B2b1a2b2∼) together with a Hadza lineage. The TMRCA of haplogroup B2b1a2b2∼ has been estimated to ∼21–26 ka (table 2). Given this relatively deep age, we cannot rule out the possibility that B2b1a2b2∼ was also present in central African Rainforest hunter-gatherer populations, and entered the Khoe-San population as postulated by Batini et al. (2011). However, our findings together with multiple lines of evidence pointing to gene flow from eastern Africa into southern Africa (Henn et al. 2008; Schlebusch et al. 2012, 2017; Breton et al. 2014; Macholdt et al. 2014; Pickrell et al. 2014; Skoglund et al. 2017; Bajic et al. 2018; Schlebusch and Jakobsson 2018) indicate that B2b1a2b2∼ in the Khoe-San may have come from eastern Africa. Notably, this B2b1a2b2∼ subclade has mainly been found in San populations such as the Ju|’hoansi and !Xun, and not in Khoekhoe herder populations such as the Nama (Schlebusch 2010). This is in contrast to the E1b1b1b2b2a1 haplogroup, which is found quite commonly in Nama individuals. As the E1b1b1b2b2a1 haplogroup is a marker of the movement of pastoralism from eastern Africa to southern Africa (Henn et al. 2008), this would indicate that the arrival of B2b1a2b2∼ in southern Africa was separate from the arrival of pastoralism. Evidence of a gradient of relatedness among eastern African and southern African hunter-gatherers was demonstrated by Skoglund et al. (2017); and while most of the eastern African genomic component present in southern African hunter-gatherers was attributed to the arrival of pastoralism, it appeared that the ancient southern African individuals who did not show the strong signal of eastern African admixture still shared more alleles with eastern Africans than with western Africans; possibly indicating some level of isolation-by-distance. Evidence of gene flow between the Hadza and the Ju|’hoansi, a population shown to have been affected minimally by the recent gene flow from eastern Africa (Schlebusch et al. 2017; Skoglund et al. 2017), has also been noted (Schlebusch et al. 2012). These factors may indicate that the presence of haplogroup B2b1a2b2∼ in southern Africa may predate the arrival of pastoralism.

Table 2

TMRCA Estimates of Y-Chromosome Haplogroups Found in the Study

	TMRCA Estimate (ka)				Barbieri et al. (2016)
Mutation rate (mutations/bp/year)	0.74×10⁻⁹		0.87×10⁻⁹		0.82×10^−9a
Haplogroup	Median	95% HPD	Median	95% HPD	Median
A1b	169,018	133,711–216,880	146,952	116,496–185,147	193,000
A1b1a1a2	12,507	7,829–18,812	10,745	7,119–15,375	33,000
A1b1b2a	47,966	33,936–66,832	40,981	29,632–54,679	64,000
A1b1b2a1	35,572	23,600–52,078	30,342	20,532–41,834
A1b1b2a1a	6,933	4,685–9,705	5,917	3,998–8,207
A1b1b2a1b	8,267	4,620–13,098	7,045	4,018–10,887
A1b1b2a2	1,669	763–2,996	1,410	655–2,415
B2b1	50,073	40,106–61,847	42,908	34,338–52,537	62,000
B2b1a	46,948	37,249–58,643	40,260	32,103–49,891
B2b1a1c2a	143	2–524	120	2–368
B2b1a2	34,798	26,491–44,885	29,731	22,676–37,405
B2b1a2a	734	245–1,502	622	217–1,240
B2b1a2b	28,981	20,801–38,357	24,766	18,283–31,989
B2b1a2b2	25,557	17,459–34,967	21,819	15,397–28,878
B2b1b	44,411	33,525–57,120	38,086	28,982–48,180
E1b1a1a1a1c1a1a3	14,332	11,338–18,158	12,253	10,133–14,921	22,000
E1b1a1a1a2a1a	9,358	7,418–11,582	8,078	6,512–9,769	15,000
E1b1b	30,063	19,813–42,157	25,468	17,560–35,183	21,000
E1b1b1b2b2a1	7,376	5,088–10,121	6,268	4,390–8,472

Mutation rate from Poznik et al. (2013).

TMRCA Estimates of Y-Chromosome Haplogroups Found in the Study Mutation rate from Poznik et al. (2013). The subclades of haplogroup A1b1b2a showed independent segregation in three Khoe-San populations. Several more individuals from these populations need to be screened to elucidate the distribution of these haplogroups in Khoe-San populations. Barbieri et al. (2016) obtained estimates of the TMRCA for the Y-chromosome phylogeny and the major African haplogroups with 547 individuals, using counts of mutations and BEAST analysis. With fewer individuals (48) and two mutation rates, 0.74×10−9 (Karmin et al. 2015) and 0.87×10−9 (Helgason et al. 2015), our TMRCA estimates for the A1b root (A2-T in Barbieri et al. [2016]) were lower (table 2); though still within the HPD intervals. Our TMRCA estimates of the major African haplogroups were also usually lower than the (Barbieri et al. 2016) estimates (likely due to the much lower sample size and lower haplogroup diversity), with the exception of haplogroup E1b1b1. This was reflected by haplogroup E1b1b1 also displaying the longest branch lengths in our data set (supplementary tables S2 and S3, Supplementary Material online). Although we corroborated the findings of other studies (Scozzari et al. 2014; Hallast et al. 2015; Barbieri et al. 2016) which also observed branch length heterogeneity, the patterns we observed differed slightly, with no clear sign of long E1b1a branches (Barbieri et al. 2016). This may have been due to differences in our sample sizes, or possibly due to the higher resolution of deep sequencing allowing us to uncover more variants from a larger proportion of the Y chromosome (8.8 Mb vs. ∼965 kb in Barbieri et al. 2016). The discovery of new markers on the Y chromosome and the expansion of the phylogeny have gathered pace since DNA sequencing using next-generation methods has become more commonly used to analyze this portion of the genome. The current phylogeny, however, is one with several new branches, together with some uncertainty regarding their precise placement. The screening of more samples for the already discovered markers would help in finalizing the placement of branches with unconfirmed placement.

Materials and Methods

Samples and Sequencing Pipeline

The present study is a subset of a larger study (Schlebusch et al. 2020) that sequenced high-coverage full genomes (average depth: 56×) from 25 Khoe-San individuals. The individuals were selected from a set of samples that were previously genotyped on an Illumina 2.5 M chip (Schlebusch et al. 2012). We selected five individuals from five different Khoe-San populations to represent southern, central, and northern Khoe-San groups for the full genome study, 19 of these individuals were male (3 Karretjie People; 4 Nama; 5 Ju|’hoansi; 2 |Gui and ‖Gana; and 5 !Xun). Selection criteria for the individuals for the full genome study, included low amounts of admixture with Bantu-speakers or Europeans, based on autosomal data. Although males were preferentially selected, we did not consider information on Y chromosomes and mitochondrial DNA before selecting the individuals. Thus, in terms of Y chromosomes, these individuals constitute random draws from the populations. DNA samples from individuals were collected with the subjects’ informed consent, and the project was approved by the Human Research Ethics Committee (Medical) at the University of the Witwatersrand, Johannesburg (Protocol Number: M1604104 and M180654), the Working Group of Indigenous Minorities in Southern Africa (WIMSA), and the South African San Council (SASC). The study was also approved by the Swedish ethical review authority, Reference Number: Dnr 2019-05174. DNA libraries of the Khoe-San samples were prepared with TrueSeq DNA Sample preparation kit v2 (Cat No. FC-121-2001/2002, Illumina Inc.). These were sequenced on an Illumina HiSeq Sequencing System (Illumina Inc., San Diego, CA) at the SciLifeLab SNP&SEQ Technology Platform in Uppsala. BAM files were generated by mapping the reads to the 1000 genomes phase 2 reference assembly (hs37d5) using BWA 0.6.2 (BWA-MEM algorithm) (Li and Durbin 2010), and further processed with GATK v.2.5.2 (McKenna et al. 2010), Picard v.1.92, and Samtools (Li et al. 2009). This involved duplicate marking in Picard, realignment around indels in GATK, calculating the MD flag with Samtools calmd and Base Quality Score Recalibration (BQSR) in GATK. We used the UnifiedGenotyper module of GATK to call Y-chromosome variants from BAM files of the samples according to GATK Best Practices recommendations (DePristo et al. 2011; Van der Auwera et al. 2013). The default SNP genotype likelihoods calculation mode was used, we specified the ploidy argument as 1 for the haploid Y-chromosome data, and we did not consider indels. Both a variant-sites call set and the complete sequence VCF file containing invariant sites were generated for downstream analysis. The QD, DP, MQ, and FS information were used to determine hard-filtering thresholds, which were set at no less than 5%; with minimum QD = 1.16, minimum DP = 120.0, minimum MQ = 10.0, and maximum FS = 70.0. Both raw variant and nonvariant sites were filtered, based on these thresholds, using GATK’s VariantFiltration and VCFtools (Danecek et al. 2011). For comparative data, we selected males from seven additional African (or of African descent) populations from the 1000 Genomes Project (Auton et al. 2015)—5 Yoruba and 4 Luhya, the Complete Genomics diversity panel (Drmanac et al. 2010)—3 African Americans and 2 Maasai, and from the Lachance et al. (2012) data set—5 Hadza, 5 Sandawe, 3 Baka, 1 Bakola, and 1 Bedzan. Variants were filtered using in-house scripts with the following parameters: a minimum depth (DP) of 6×, a minimum genotype quality (GQ) of 50, and excluding records that were marked “VQLOW.” As is recognized by many studies, a large portion of human Y chromosome is ill-suited for NGS (Poznik et al. 2013; Wei et al. 2013; Karmin et al. 2015); and so we applied an additional regional filter for all Y-chromosome sequences, in order to restrict further analysis to regions of Y chromosome from which we could obtain reliable sequence data. The filter was defined by Karmin et al. (2015) (filter a + b+d) on the basis of analyses of Illumina HiSeq data with human reference genome GRCh37. We merged the final variant call sets using the vcf-merge function from the VCFtools package (Danecek et al. 2011) and removed sites with >5% missingness.

Haplogroup Assignment and Branch Length Analysis

Haplogroup assignment was performed using AMY-tree v. 2.0 (Van Geystelen et al. 2013). Additionally, variants were assigned to Y-chromosome phylogeny branches based on allele sharing and clade formation among individuals. Only variants <5% missingness (a maximum of two individuals with missing data per variant) were used in subsequent steps. Reference-derived variants were identified and were either removed or correctly placed on the phylogeny. In order to root the phylogeny, we used A00 sequences (Mendez et al. 2013; Karmin et al. 2015) to confirm the status of known A1b-defining variants in our data set. Variants defining the A1b1 and BT branches were differentiated from each other by checking against the A00 outgroup sequences (see supplementary tables S1 and S4, Supplementary Material online). Branch-defining variants were then matched against the ISOGG 2019-2020 Y-chromosome phylogeny (http://www.isogg.org/tree/index.html) to assign the most recent haplogroup names to the branches; including the internal branches, to confirm phylogenetic congruency. Scripts are available on request to the authors. We also assessed whether branch length differed among the major haplogroups, by counting variants from the A1b root till the ends of the terminal branches. Haplogroups were compared using a Mann–Whitney U test.

Phylogenetic Analysis

We reconstructed a phylogenetic tree and dated the nodes using BEAST V1.8.2 (Drummond et al. 2012). The Y-chromosome variant-sites VCF file was converted to FASTA format using a script (vcf-tab-to-fasta; https://code.google.com/archive/p/vcf-tab-to-fasta/, last accessed May 28, 2020), before importing to BEAUti, a graphical user interface application included in BEAST package, to generate the BEAST input XML file. We chose the best-fitting substitution model by conducting test runs in jModelTest V2.7.1 (Darriba et al. 2012). The tree model was set to Coalescent: constant size with a piecewise-linear skyline model. We applied general time reversible substitution model for Y-chromosome data, using a log-normal relaxed clock with a mutation rate 0.74×10−9 mutations/bp/year (Karmin et al. 2015) and 0.87×10−9 mutations/bp/year (Helgason et al. 2015). To reduce the computational load, we used only variant sites for the Y-chromosome BEAST analysis. However, we incorporated the information of invariant sites by specifying the nucleotide composition in the BEAST input XML file (Karmin et al. 2015). To determine the nucleotide composition of invariant sites, we compared the variant-sites VCF with the all-sites VCF and counted the number of A, T, C, and G nucleotides for invariable sites. For each mutation rate, we performed two independent runs of 100 million MCMC iterations with a sampling in every 1,000 steps. The initial 10% of each run was discarded as burn-in. The output was inspected in Tracer v1.6, confirming that all EES values were >200 and the two runs were combined with LogCombiner (Rambaut et al. 2018). We annotated maximum clade credibility trees in TreeAnnotator (Drummond et al. 2012) and extracted the mean, median, and 95% HPD intervals of the node heights for dating. Click here for additional data file.

42 in total

1. Hierarchical patterns of global human Y-chromosome diversity.

Authors: M F Hammer; T M Karafet; A J Redd; H Jarjanazi; S Santachiara-Benerecetti; H Soodyall; S L Zegura
Journal: Mol Biol Evol Date: 2001-07 Impact factor: 16.240

2. Ancient west Eurasian ancestry in southern and eastern Africa.

Authors: Joseph K Pickrell; Nick Patterson; Po-Ru Loh; Mark Lipson; Bonnie Berger; Mark Stoneking; Brigitte Pakendorf; David Reich
Journal: Proc Natl Acad Sci U S A Date: 2014-02-03 Impact factor: 11.205

3. Y-chromosomal variation in sub-Saharan Africa: insights into the history of Niger-Congo groups.

Authors: Cesare de Filippo; Chiara Barbieri; Mark Whitten; Sununguko Wata Mpoloka; Ellen Drofn Gunnarsdóttir; Koen Bostoen; Terry Nyambe; Klaus Beyer; Henning Schreiber; Peter de Knijff; Donata Luiselli; Mark Stoneking; Brigitte Pakendorf
Journal: Mol Biol Evol Date: 2010-11-25 Impact factor: 16.240

4. Evolutionary history and adaptation from high-coverage whole-genome sequences of diverse African hunter-gatherers.

Authors: Joseph Lachance; Benjamin Vernot; Clara C Elbers; Bart Ferwerda; Alain Froment; Jean-Marie Bodo; Godfrey Lema; Wenqing Fu; Thomas B Nyambo; Timothy R Rebbeck; Kun Zhang; Joshua M Akey; Sarah A Tishkoff
Journal: Cell Date: 2012-07-26 Impact factor: 41.582

5. Molecular dissection of the basal clades in the human Y chromosome phylogenetic tree.

Authors: Rosaria Scozzari; Andrea Massaia; Eugenia D'Atanasio; Natalie M Myres; Ugo A Perego; Beniamino Trombetta; Fulvio Cruciani
Journal: PLoS One Date: 2012-11-07 Impact factor: 3.240

6. A framework for variation discovery and genotyping using next-generation DNA sequencing data.

Authors: Mark A DePristo; Eric Banks; Ryan Poplin; Kiran V Garimella; Jared R Maguire; Christopher Hartl; Anthony A Philippakis; Guillermo del Angel; Manuel A Rivas; Matt Hanna; Aaron McKenna; Tim J Fennell; Andrew M Kernytsky; Andrey Y Sivachenko; Kristian Cibulskis; Stacey B Gabriel; David Altshuler; Mark J Daly
Journal: Nat Genet Date: 2011-04-10 Impact factor: 38.330

7. Development of a single base extension method to resolve Y chromosome haplogroups in sub-Saharan African populations.

Authors: Thijessen Naidoo; Carina M Schlebusch; Heeran Makkan; Pareen Patel; Rajeshree Mahabeer; Johannes C Erasmus; Himla Soodyall
Journal: Investig Genet Date: 2010-09-01

8. The Y-chromosome tree bursts into leaf: 13,000 high-confidence SNPs covering the majority of known clades.

Authors: Pille Hallast; Chiara Batini; Daniel Zadik; Pierpaolo Maisano Delser; Jon H Wetton; Eduardo Arroyo-Pardo; Gianpiero L Cavalleri; Peter de Knijff; Giovanni Destro Bisol; Berit Myhre Dupuy; Heidi A Eriksen; Lynn B Jorde; Turi E King; Maarten H Larmuseau; Adolfo López de Munain; Ana M López-Parra; Aphrodite Loutradis; Jelena Milasin; Andrea Novelletto; Horolma Pamjav; Antti Sajantila; Werner Schempp; Matt Sears; Aslıhan Tolun; Chris Tyler-Smith; Anneleen Van Geystelen; Scott Watkins; Bruce Winney; Mark A Jobling
Journal: Mol Biol Evol Date: 2014-12-02 Impact factor: 16.240

9. A recent bottleneck of Y chromosome diversity coincides with a global change in culture.

Authors: Monika Karmin; Lauri Saag; Mário Vicente; Melissa A Wilson Sayres; Mari Järve; Ulvi Gerst Talas; Siiri Rootsi; Anne-Mai Ilumäe; Reedik Mägi; Mario Mitt; Luca Pagani; Tarmo Puurand; Zuzana Faltyskova; Florian Clemente; Alexia Cardona; Ene Metspalu; Hovhannes Sahakyan; Bayazit Yunusbayev; Georgi Hudjashov; Michael DeGiorgio; Eva-Liis Loogväli; Christina Eichstaedt; Mikk Eelmets; Gyaneshwer Chaubey; Kristiina Tambets; Sergei Litvinov; Maru Mormina; Yali Xue; Qasim Ayub; Grigor Zoraqi; Thorfinn Sand Korneliussen; Farida Akhatova; Joseph Lachance; Sarah Tishkoff; Kuvat Momynaliev; François-Xavier Ricaut; Pradiptajati Kusuma; Harilanto Razafindrazaka; Denis Pierron; Murray P Cox; Gazi Nurun Nahar Sultana; Rane Willerslev; Craig Muller; Michael Westaway; David Lambert; Vedrana Skaro; Lejla Kovačevic; Shahlo Turdikulova; Dilbar Dalimova; Rita Khusainova; Natalya Trofimova; Vita Akhmetova; Irina Khidiyatova; Daria V Lichman; Jainagul Isakova; Elvira Pocheshkhova; Zhaxylyk Sabitov; Nikolay A Barashkov; Pagbajabyn Nymadawa; Evelin Mihailov; Joseph Wee Tien Seng; Irina Evseeva; Andrea Bamberg Migliano; Syafiq Abdullah; George Andriadze; Dragan Primorac; Lubov Atramentova; Olga Utevska; Levon Yepiskoposyan; Damir Marjanovic; Alena Kushniarevich; Doron M Behar; Christian Gilissen; Lisenka Vissers; Joris A Veltman; Elena Balanovska; Miroslava Derenko; Boris Malyarchuk; Andres Metspalu; Sardana Fedorova; Anders Eriksson; Andrea Manica; Fernando L Mendez; Tatiana M Karafet; Krishna R Veeramah; Neil Bradman; Michael F Hammer; Ludmila P Osipova; Oleg Balanovsky; Elza K Khusnutdinova; Knut Johnsen; Maido Remm; Mark G Thomas; Chris Tyler-Smith; Peter A Underhill; Eske Willerslev; Rasmus Nielsen; Mait Metspalu; Richard Villems; Toomas Kivisild
Journal: Genome Res Date: 2015-03-13 Impact factor: 9.043

10. Refining the Y chromosome phylogeny with southern African sequences.

Authors: Chiara Barbieri; Alexander Hübner; Enrico Macholdt; Shengyu Ni; Sebastian Lippold; Roland Schröder; Sununguko Wata Mpoloka; Josephine Purps; Lutz Roewer; Mark Stoneking; Brigitte Pakendorf
Journal: Hum Genet Date: 2016-04-04 Impact factor: 4.132

3 in total

1. Khoe-San Genomes Reveal Unique Variation and Confirm the Deepest Population Divergence in Homo sapiens.

Authors: Carina M Schlebusch; Per Sjödin; Gwenna Breton; Torsten Günther; Thijessen Naidoo; Nina Hollfelder; Agnes E Sjöstrand; Jingzi Xu; Lucie M Gattepaille; Mário Vicente; Douglas G Scofield; Helena Malmström; Michael de Jongh; Marlize Lombard; Himla Soodyall; Mattias Jakobsson
Journal: Mol Biol Evol Date: 2020-10-01 Impact factor: 16.240

2. Weaving Mitochondrial DNA and Y-Chromosome Variation in the Panamanian Genetic Canvas.

Authors: Nicola Rambaldi Migliore; Giulia Colombo; Marco Rosario Capodiferro; Lucia Mazzocchi; Ana Maria Chero Osorio; Alessandro Raveane; Maribel Tribaldos; Ugo Alessandro Perego; Tomás Mendizábal; Alejandro García Montón; Gianluca Lombardo; Viola Grugni; Maria Garofalo; Luca Ferretti; Cristina Cereda; Stella Gagliardi; Richard Cooke; Nicole Smith-Guzmán; Anna Olivieri; Bethany Aram; Antonio Torroni; Jorge Motta; Ornella Semino; Alessandro Achilli
Journal: Genes (Basel) Date: 2021-11-29 Impact factor: 4.096

3. Male-biased migration from East Africa introduced pastoralism into southern Africa.

Authors: Mário Vicente; Imke Lankheet; Thembi Russell; Nina Hollfelder; Vinet Coetzee; Himla Soodyall; Michael De Jongh; Carina M Schlebusch
Journal: BMC Biol Date: 2021-12-07 Impact factor: 7.431

3 in total