Literature DB >> 36170230

GWAS in the southern African context.

Yolandi Swart¹, Gerald van Eeden¹, Caitlin Uren^1,2, Gian van der Spuy^1,3, Gerard Tromp^1,2,3, Marlo Möller^1,2.

Abstract

Researchers would generally adjust for the possible confounding effect of population structure by considering global ancestry proportions or top principle components. Alternatively, researchers would conduct admixture mapping to increase the power to detect variants with an ancestry effect. This is sufficient in simple admixture scenarios, however, populations from southern Africa can be complex multi-way admixed populations. Duan et al. (2018) first described local ancestry adjusted allelic (LAAA) analysis as a robust method for discovering association signals, while producing minimal false positive hits. Their simulation study, however, was limited to a two-way admixed population. Realizing that their findings might not translate to other admixture scenarios, we simulated a three- and five-way admixed population to compare the LAAA model to other models commonly used in genome-wide association studies (GWAS). We found that, given our admixture scenarios, the LAAA model identifies the most causal variants in most of the phenotypes we tested across both the three-way and five-way admixed populations. The LAAA model also produced a high number of false positive hits which was potentially caused by the ancestry effect size that we assumed. Considering the extent to which the various models tested differed in their results and considering that the source of a given association is unknown, we recommend that researchers use multiple GWAS models when analysing populations with complex ancestry.

Entities: Chemical

Mesh：

Year: 2022 PMID： 36170230 PMCID： PMC9518849 DOI： 10.1371/journal.pone.0264657

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.752

Introduction

Differential assortment of ancestral allele frequencies across a genome results from selection, mutation or genetic drift, when previously isolated populations interbreed [1, 2]. Taking advantage of these differences in allele frequencies can help identify population-specific disease risk alleles associated with disease phenotypes due to various ancestries being exposed to distinct environments and pathogens [3]. Admixed populations present unique opportunities to identify ancestry-specific disease risk alleles for various populations simultaneously. African ancestries harbour the most genetic diversity and contain more complex linkage disequilibrium (LD) blocks than other continental populations and the precise evolutionary events that shaped their genomes are mostly unknown [4, 5]. When the historical events are unknown, it is often difficult to know beforehand which effect (allele, ancestry or the interaction between them) has the most significant effect on the disease phenotype under study. Therefore, studying admixed African populations presents a rich opportunity for the discovery of ancestry-specific disease risk alleles, however, the complex genomic architecture of admixed African genomes warrants careful consideration of both global and local ancestry. A genome-wide association study (GWAS) is commonly used to discover associations between single nucleotide polymorphisms (SNPs) and a diverse spectrum of complex traits of interest [6]. Attributing associations at the SNP level, however, requires large sample sizes when the SNP being tested has a small effect on the trait of interest. GWAS is further complicated when individuals have admixture-induced linkage disequilibrium (admixture LD) blocks. Admixture mapping uses the admixture LD blocks inherited from a specific ancestral population to test for an association with the trait of interest [2]. Hence, the ancestry rather than the genotypes are traced in the association process [7] and the SNPs that affect the trait of interest can only be localised to their respective ancestral blocks (presented as an admixture peak on a Manhattan plot). There are also smaller LD blocks within admixture LD blocks that originated prior to admixture that are inherited from the parent populations that contributed to the admixture (ancestry LD). Admixture LD and ancestry LD cause population stratification in a GWAS which reduces the power to detect a significant association and could lead to increased false positive hits and false negative hits [8]. Therefore, at a given locus within an admixed population there will be groups of individuals with similar LD. If one or more of these groups is disproportionately represented within a given phenotype, a false association will exist between that group and that phenotype. Many approaches address this problem by including global ancestry and local ancestry in GWAS models [9-11]. Duan et al. [8] expanded on these approaches by additionally modelling the interaction between allelic effects and ancestry effects. They proposed a two-step approach, called the Local Ancestry Adjusted Allelic (LAAA) testing procedure, that first detects associations by jointly modelling allelic effects, ancestry effects and interaction effects and then determines the source of the association. Their approach proved robust and powerful, but it was only assessed in a two-way admixed population. Therefore, it remains to be determined if this model will be able to capture the correct association signals with minimal power loss and spurious associations (including both false positive hits and false negative hits) in complex multi-way admixed populations exhibiting genetic heterogeneity (i.e. ancestry patterns that differ between admixed populations formed by the same source populations). We believe that testing this will go a long way to identifying robust models that can be used in a wide array of demographic scenarios. We used the complex demographic history of the Nama and the South African Coloured (SAC) populations to simulate a three-way admixed population (as represented by the Nama) and a five-way admixed population (as represented by the SAC). The genotypes from these simulated populations were used to simulate three phenotypes, each with a different association signal. Therefore the aim of this study was to determine if the LAAA model will be able to robustly capture the true causal variants regardless of the source of the association or the complexity of the admixture compared to other frequently used GWAS models. We also compare other models commonly used in GWAS and we investigate the efficacy of each model when presented with true versus inferred ancestry.

Material and methods

Simulating genotypes

msprime [12] was used to jointly simulate the demographic history (Fig 1) of the Nama, South African Coloured (SAC), Han Chinese in Beijing, China (CHB), British From England and Scotland (GBR), Gujarati Indians in Houston, Texas, USA (GIH), Gumuz, Luhya in Webuye, Kenya (LWK) and Mende in Sierra Leone (MSL). The model describing the demographic history was defined in a Demes Specification (https://github.com/popsim-consortium/demes-python) that was then converted to a msprime demographic model using the demes python library. Demes Specifications are YAML files that contain information about populations, their properties (like effective population size changes) and the relationship of a population to other populations.

Fig 1

The demographic histories of the simulated populations (Nama and SAC).

The demographic histories of the simulated populations (Nama and SAC).

Thick solid lines with open arrowheads indicate an ancestor -> descendant relation, dashed lines indicate an admixture pulse and faint solid lines with closed arrowheads indicate continuous migration. The two admixture events for the Nama is indicated with the dashed lines with arrowheads (purple and light pink). The one admixture event for the SAC is indicated by the four solid lines with open arrowheads (green, dark pink, light pink and light brown). The demographic history of the Nama, an indigenous population of southern Africa [13], is complex and extends back to the emergence of modern humans. Furthermore, the Nama also integrated and received ancestral contributions from the indigenous Khoe-San group from southern Africa. The Khoe-San are reported to have the most divergent lineages compared to other living populations [14-18] and it is believed that they have largely remained isolated until ~2000 years ago [13, 14, 19]. The Nama is reported to have experienced 5–25% gene flow from eastern African caprid and cattle pastoralists ~2000 years ago [20]. These events were finally followed by recent admixture with European colonists ~250 years ago. The demographic model for the Nama (Fig 1) was based on the above mentioned events and a conference poster by Ragsdale et al. [21] where they inferred detailed parameterized demographic models for five present day populations (GBR, Gumuz, LWK, MSL and Nama) using joint allele frequency and LD statistics. The Khoe-San also contribute a significant ancestral component to a South African population that represents a highly admixed group from multiple ancestral populations (SAC) [22, 23]. Many non-African and African populations moved into southern Africa over the last ~600 years and integrated with the Khoe-San [22]. Uren et al. [13] found that the SAC has Khoe-San (32–43%), Bantu-speaking African (20–36%), European (21–28%) and East and South-East Asian (9–11%) ancestral contributions. The demographic model for the SAC (Fig 1) was based on the above ancestral proportions with a single admixture event, for simplicity, between the ancestral populations ~400 years ago. The demographic models were used to simulate DNA sequence evolution of 5000 Nama and 5000 SAC individuals as tree sequences (Fig 2). We assumed a generation time of 25 years, a mutation rate of 1.29e-08 and a recombination rate as specified by the Combined HapMap II recombination map. The CHB, GBR, GIH, LWK and MSL (1000 individuals of each) were simulated to act as reference populations for the Nama and the SAC. The Nama acted as a reference population for any Khoe-San ancestral contribution in the Nama and the SAC. Our aim with this simulation was not to create models that accurately represent the demographic history of the Nama or the SAC, but to simulate realistic genotype data for a three-way and five-way admixed population as represented by the Nama and the SAC. The tree sequence was then used to create VCFs using tskit [24]. The number of SNPs in the resultant VCFs was reduced to ~400 000 SNPs to reduce model runtime by randomly selecting variants. We also wrote a custom script that works its way up the tree sequence to find the ancestry at a given locus at a given time using the migration history of that locus, thereby enabling us to recover the true ancestry of an individual at a given locus.

Fig 2

Overview of the methods used to simulate genotypes using the software msprime and phenotypes using the software PhenotypeSimulator.

Simulating phenotypes

PhenotypeSimulator [25] was used for phenotype simulation, because it allows very fine control over individual effects. It entails a three-step process: (1) simulating phenotype components of interest, (2) scaling the components according to the amount of variance explained and (3) combining the phenotype components for each individual. The phenotype components are further divided into genetic effects and noise effects. Genetic effects are made up of genetic variant effects, population structure, local ancestry and the interaction between variant effects and local ancestry effects. Noise effects are made up of observational noise effects and confounding variable such as sex and age. Three kinds of phenotypes were simulated for the Nama (Table 1) and the SAC (Table 2)—phenotypes with an allelic only (AO) effect, phenotypes with an ancestry plus allelic (APA) effect and phenotypes with an APA and local ancestry adjusted allelic (LAAA) effect. Three replications, each with a different set of 10 causal SNPs of each kind of phenotype were simulated randomly and a replication was simulated with each ancestral population (three for the Nama and five for the SAC) as the source of the local ancestry effect. Therefore, in total 27 phenotypes were simulated for the Nama and 45 phenotypes we simulated for the SAC.

Table 1

Breakdown of the effect sizes that make up the phenotypes simulated for the Nama.

		Genetic effects				Noise effects
Phenotype	Ancestry of interest	Variant	Local ancestry	Interaction	Population structure	Observational	Covariates
AO	Nama	0.12	0.00	0.00	0.48	0.16	0.24
AO	GBR	0.12	0.00	0.00	0.48	0.16	0.24
AO	EP	0.12	0.00	0.00	0.48	0.16	0.24
APA	Nama	0.12	0.06	0.00	0.42	0.16	0.24
APA	GBR	0.12	0.06	0.00	0.42	0.16	0.24
APA	EP	0.12	0.06	0.00	0.42	0.16	0.24
LAAA	Nama	0.12	0.06	0.04	0.38	0.16	0.24
LAAA	GBR	0.12	0.06	0.04	0.38	0.16	0.24
LAAA	EP	0.12	0.06	0.04	0.38	0.16	0.24

Table 2

Breakdown of the effect sizes that make up the phenotypes simulated for the SAC.

		Genetic effects				Noise effects
Phenotype	Ancestry of interest	Variant	Local ancestry	Interaction	Population structure	Observational	Covariates
AO	Nama	0.12	0.00	0.00	0.48	0.16	0.24
AO	GBR	0.12	0.00	0.00	0.48	0.16	0.24
AO	MSL	0.12	0.00	0.00	0.48	0.16	0.24
AO	CHB	0.00	0.00	0.00	0.60	0.16	0.24
AO	GIH	0.00	0.00	0.00	0.60	0.16	0.24
APA	Nama	0.12	0.06	0.00	0.42	0.16	0.24
APA	GBR	0.12	0.06	0.00	0.42	0.16	0.24
APA	MSL	0.12	0.06	0.00	0.42	0.16	0.24
APA	CHB	0.12	0.06	0.00	0.42	0.16	0.24
APA	GIH	0.12	0.06	0.00	0.42	0.16	0.24
LAAA	Nama	0.12	0.06	0.04	0.38	0.16	0.24
LAAA	GBR	0.12	0.06	0.04	0.38	0.16	0.24
LAAA	MSL	0.12	0.06	0.04	0.38	0.16	0.24
LAAA	CHB	0.12	0.06	0.04	0.38	0.16	0.24
LAAA	GIH	0.12	0.06	0.04	0.38	0.16	0.24

For all phenotypes, genetic effects were assumed to account for 60% of the total variance. When genetic variant effects were present, they were assumed to account for 20% of the total genetic effect or 12% of the total effect, when local ancestry effects are present, they were assumed to account for 10% of the total genetic effect or 6% of the total effect and when the interactions between variant effects and local ancestry effects were present, they were assumed to account for 6.67% of the total genetic effect or 4% of the total effect. After these effects were accounted for, the proportion of the genetic effect that remains was attributed to population structure and kinship. Observational noise was assumed to be 40% of the total noise effect or 16% of the total effect and the non-genetic covariates, sex and age, was assumed to account for the remaining 60% of the noise effect or 24% of the total effect. These proportions are completely arbitrary and only the genetic variant effects are based on literature—complex diseases, such as type 2 diabetes and systemic lupus erythematosus, have been associated with 4–40 loci with a total heritability explained by variant effects of 1.5–50% [26].

Ancestry inference

The simulated Nama and SAC datasets were merged with their respective appropriate source populations using PLINK v2.0 (https://www.cog-genomics.org/plink/2.0/) [27] in order to infer both global and local ancestry. For the Nama, the GBR (n = 1000), LWK (n = 1000) and Nama (n = 1000) were included to represent their contributing ancestral source populations. For the SAC, the GBR (n = 1000), MSL (n = 1000), Nama (n = 1000), GIH (n = 1000) and CHB (n = 1000) were included to represent their contributing ancestral source populations. After merging of admixed and source ancestral populations, all minor alleles with a frequency < 0.05 were excluded from analysis. The final dataset after quality control and data filtering consisted of 387 959 autosomal variants and 5000 Nama and 5000 SAC, in addition to 3000 ancestral individuals for the Nama and 5000 ancestral individuals for the SAC. The software RFMix was used to infer global and local ancestry for both the Nama and SAC datasets [28]. RFMix is 30X faster than other local ancestry inference software and is accurate in multi-way admixture scenarios [28, 29]. Default parameters were used, except for the number of generations since admixture, which was set to 13 for the Nama and 16 for the SAC.

GWAS models

The following five regression models were tested simultaneously for each simulated phenotype for both the Nama and SAC: The standard model—only allelic effects are considered and no global or local ancestry effects are modeled. () is the continuous outcome (Y). 1 and 2 represent the covariates (age and gender) and α1 and α2 is the corresponding marginal effect. Xx represents the number of reference alleles at the locus under investigation and β is the corresponding marginal effect. 2. The global ancestry (GA) model—frequently used in GWAS to account for the possible confounding effect of population structure in the study population. represents the estimated global ancestry and αp is the corresponding marginal effect. 3. The local ancestry (LA) model—used in admixture mapping studies to localise potential associations to ancestral blocks. The model is sensitive to frequency disparities across ancestral populations. i represents the number of ancestry alleles (separate model for each contributing ancestry) at the locus under investigation and y is the corresponding marginal effect. 4. The ancestry plus allelic (APA) model—tests for both an allelic association and an association with an ancestry in the phenotype. Therefore jointly testing the associations modelled in model 1 + 2. 5. The local ancestry adjusted allelic (LAAA) model—an extension of the APA model that also includes an interaction term between the allele present at a specific locus and the ancestry. ix represents the number of ancestry-specific reference alleles at the locus under investigation and ɳ is the corresponding marginal effect. When global ancestry is included in a given model, the smallest ancestry proportion was excluded as a covariate to avoid complete separation of the data for both the Nama (GBR) and SAC (CHB). Thus, two ancestral components (Nama and LWK) for the Nama and four ancestral components (Nama, MSL, GBR, GIH) for the SAC were included as covariates in association testing, together with gender and age. A total of 5000 Nama and 5000 SAC and 387 959 autosomal variants were included in statistical analysis. The lm() function in R was used for linear regression association testing. The conventional significance threshold of 5 ✕ 10⁻⁸ for association testing was used. Dosage files were compiled at each locus for the allelic state (0, 1 or 2 copies of the major allele), for the ancestry (0, 1 or 2 copies of the ancestry of interest) and for the interaction between the allelic state and the ancestry (0, 1 or 2 copies of the major allele that is from the ancestry of interest).

Results

Local ancestry inference accuracy

We inferred the local ancestry for the simulated Nama and SAC populations, since the true local ancestry will not be available in a GWAS using real data. Therefore, we could also assess the impact of LAI accuracy on the GWAS models in different phenotypic scenarios. An overall 84.90% accuracy for the Nama (Table 3) and an overall 85.40% accuracy for the SAC was attained. For the ancestral components of the Nama we attained 76.07% for the LWK, 88.35% for the GBR and 84.90% for the Nama (Khoe-San). For the ancestral components of the SAC (Table 4) we attained 83.13% for the CHB, 89.24% for the GBR, 84.56% for the MSL, 84.73% for the Nama and 85.38% for the GIH.

Table 3

The ancestral proportions and the accuracy of the inferred local ancestry for the Nama.

	LWK	GBR	Nama	Total
Ancestral proportion	0.100	0.150	0.750	1.000
LAI accuracy	0.761	0.884	0.853	0.849

Table 4

The ancestral proportions and the accuracy of the inferred local ancestry for the SAC.

	CHB	GBR	MSL	Nama	GIH	Total
Ancestral proportion	0.100	0.175	0.250	0.350	0.125	1.000
LAI accuracy	0.831	0.892	0.846	0.847	0.854	0.854

Comparison of GWAS models applied to simulated GWAS data for the Nama and the SAC

We compared the true positive (the number of times when the source of association could be correctly identified) and the false positive hits (the number of times when the source of the association was incorrectly identified as a true association) from each GWAS model applied to all of the simulated phenotypes to get an indication of model success rate in the simulated Nama and SAC populations. Three iterations of each phenotype were simulated with different sets of causal SNPs, the GWAS models were run for each iteration and the number of true positive and false positive hits (excluding sites in LD with causal SNPs) were averaged to produce the results in Figs 3–6. Across all phenotypes in both populations, whether true (Figs 4 and 6) or inferred (Figs 3 and 5) ancestry was used, the LAAA model achieved the highest or tied for the highest true positive hit count. However, the AO phenotypes in the Nama (Figs 3 and 4) sometimes identified more true positive hits using the standard model, but this only for the AO phenotype. As expected, the standard model was tied with the LAAA model for the highest hit count in phenotypes with only an allelic effect (Figs 5 and 6) in the SAC. In phenotypes that have an ancestry effect in the Nama, either the APA model or the standard model achieved the second highest hit count after the LAAA model (Figs 3 and 4). However, in the SAC the APA model identified more true positive hits in all phenotypes (Figs 5 and 6). The LA model did not identify any true positive hits in many cases and produced a substantial number of false positive hits when there is an ancestry effect present (Figs 3–6). The GA model identified more true positives than the LA model, but fewer than the standard model. The standard model produced the lowest number of false positive hits regardless of the phenotype it was applied to, but when there is an ancestry effect present, the standard model never identified more than 6/10 causal variants whereas models such as LAAA sometimes identified 9/10.

Fig 3

Comparison of true positive hits (left) and false positive hits (right) for the Nama with inferred local ancestry used in GWAS models.

The average hits for three runs with different causal SNPs are shown. The simulated phenotypes are denoted as “phenotype-ancestral source of association”, e.g.,LAAA-LWK means the LAAA phenotype with the LWK ancestral component as the ancestral source of association. The various GWAS models used are indicated in different colours. Grey represents the APA model, yellow represents the GA model, blue represents the LA model, green represents the LAAA model and the orange represents the Standard model.

Fig 6

Comparison of true positive hits (left) and false positive hits (right) for the SAC with true ancestry used in GWAS models.

Fig 4

Comparison of true positive hits (left) and false positive hits (right) for the Nama with true local ancestry used in GWAS models.

Fig 5

Comparison of true positive hits (left) and false positive hits (right) for the SAC with inferred local ancestry used in GWAS models.

Comparison of true positive hits (left) and false positive hits (right) for the Nama with inferred local ancestry used in GWAS models.

Comparison of true positive hits (left) and false positive hits (right) for the Nama with true local ancestry used in GWAS models.

Comparison of true positive hits (left) and false positive hits (right) for the SAC with inferred local ancestry used in GWAS models.

Comparison of true positive hits (left) and false positive hits (right) for the SAC with true ancestry used in GWAS models.

The average hits for three runs with different causal SNPs are shown. The simulated phenotypes are denoted as “phenotype-ancestral source of association”, e.g.,LAAA-CHB means the LAAA phenotype with the CHB ancestral components as the ancestral source of association. The various GWAS models used are indicated in different colours. Grey represents the APA model, yellow represents the GA model, blue represents the LA model, green represents the LAAA model and the orange represents the Standard model. All models produced a low number of false positive hits for the AO phenotypes relative to the other phenotypes. Overall, models with a local ancestry component (APA, LA, LAAA) produced substantially more false positive hits than models that only account for allelic effects and global ancestry effects (Figs 3–6). There are more false positive hits across phenotypes and models when inferred ancestry was used for the Nama and the SAC instead of the true ancestry (Figs 3–6). For the SAC there are slightly more false positive hits for phenotypes with the CHB and GBR as the source of the ancestral effect when using the true ancestry (Fig 6) compared to the inferred ancestry (Fig 5). In every phenotype, except the AO phenotypes, the LWK had the lowest number of false positive hits in the Nama (Figs 3 and 4) and the CHB had the lowest number of false positive hits in the SAC when the inferred ancestry was used (Fig 6). This pattern holds true for the other inferred ancestries (except the GIH) in the SAC phenotypes (Fig 5). In the Nama, the use of inferred local ancestry in the APA, LA and GA models (Fig 3) also produced more false positive hits in the AO phenotypes than using the true ancestry in these models. Using inferred ancestry over the true ancestry noticeably decreased the number of true positive hits detected by the APA, LA and the GA, whereas the LAAA model is less affected by inaccurate inferred local ancestry.

Discussion

Generally, researchers adjust for the possible confounding effect of population structure by considering global ancestry proportions or top principal components in statistical models. Conversely, researchers would conduct admixture mapping, which involves inferring ancestry at each genomic loci along the genome of an admixed individual (local ancestry inference), to increase the power to detect variants with an ancestry effect [30, 31]. This is sufficient in simple admixture scenarios, such as two-way admixed populations (African-Americans) and three-way admixed populations (Latinos/Hispanics) [1, 32, 33]. Populations from southern Africa, however, can be complex multi-way admixed populations, with up to five contributing ancestries [34, 35]. Furthermore, the underlying genetic architecture is mostly unknown and can be complex due to various historical events in southern Africa [4]. Thus, conventional admixture mapping strategies might result in either increased false positive hits or false negative hits for complex multi-way admixed populations. Duan et al. [8] first described LAAA analysis as a robust method for discovering association signals, while producing few false positive hits, by comparing the LAAA model to other well-known models used for GWAS in admixed populations [8]. Their simulation study, however, was limited to a two-way admixed population. Realising that their findings might not translate to other admixture scenarios, we simulated a three- and five-way admixed population to compare the GA, LA, APA and LAAA models that they tested and added a conventional GWAS model that does not model any local ancestry effects (the standard model). Our main objective is was to discern if the LAAA model robustly captures the true causal variants in complex multi-way admixed populations without prior knowledge of admixture dynamics compared to other traditional GWAS models. The GA model is the most widely used model to control for population structure in GWAS, because it has been shown to improve GWAS results when the phenotype exhibits a differential allelic effect across ancestral populations (scenario AO). Therefore, adjusting for only global ancestry proportions or the top principle components should be sufficient to control for the confounding effect of population structure. In phenotypes where there is no ancestry effect, we expected that the GA model would improve upon the standard model, however, this was not the case. The LA model identified the least amount of true positive causal variants and sometimes none at all. This makes sense, since only local ancestry is considered in statistical analysis and the possible allelic effect is not accounted for. The LA model is more successful in phenotypes where the ancestral component that causes the ancestry effect is small, which is expected, since the effect is concentrated in fewer SNPs. We only observed this effect in the five-way admixed population and not in the three-way admixed population (Figs 4 and 6). This effect could be due to chance, since causal SNPs were drawn randomly from the dataset without any regard for the local ancestry of the SNP. Our simulations would have to be adjusted to ensure that all of the causal SNPs are found in or outside of the contributing ancestry’s ancestral LD blocks and many more iterations would have to be done to test this. Any success in identifying true causal variants with this method is negated by the large number of false positive hits that are produced. The number of false positive hits produced by this model are generally organised into admixture peaks where all the SNPs that share an ancestral LD block with the causal variant will have a strong association with the phenotype. This approach only indicates the ancestral LD block associated with the disease trait and, therefore, requires more information to identify which variant is responsible for the association perceived in the admixture peak. Furthermore, when the simulated local ancestry effects are sufficiently large, over inflation of test statistics could occur. The APA model is expected to perform better than the first two models when there is both an allelic and ancestry effect, because both of these effects are modelled at each genomic locus. When testing genetic markers that are proxies for a disease causal loci, the differential LD within admixed populations can result in heterogeneity of effect estimates by local ancestry [36]. Therefore, additional adjustments beyond global ancestry are required to increase power when the admixture-induced LD is in the opposite direction as the LD in the ancestral population [8, 36]. This agrees with what we observe from our results where the GA model identified fewer true positive hits in comparison to the APA and LAAA models when the simulated phenotypes displayed a local ancestry effect. The APA model is also more readily affected by LAI accuracy than the LAAA model. Hence, the APA model could be more prone to false negative hits and false positive hits. The APA model might also be insufficient for certain admixture scenarios, such as scenarios where admixture induced LD blocks mask the allelic effect of the causal SNP when the tagging SNP is located in a different ancestral LD block to the causal SNP. The LAAA model outperforms the other statistical models by either achieving the most true positive hits or achieving the second highest hit count. Most importantly, the LAAA is particularly robust in the presence of genetic effect heterogeneity. Interestingly, Duan et al. [8] and Liu et al. [36] both recommend to first identify admixture peaks and then to use the interaction term as a secondary follow-up test to elucidate the source of the association obtained in the admixture peak. However, our results demonstrate the possible true causal variants that might be missed when not including the interaction term from the beginning in complex multi-way admixed populations. Importantly, the LAAA model was robust against inaccurate local ancestry inference (0.849 for the Nama and 0.854 for the SAC) with only minor changes in the number of true positive hits between GWAS performed with inferred ancestry versus true ancestry. We assumed independence of ancestry-specific effects across chromosomes. Therefore, the effect of other ancestral LD blocks (besides the ancestral LD blocks in which the causal SNPs are found) on a given phenotype was not simulated. This is important to note when considering our results in terms of real data, since this omission can have an impact on elucidating the true causal SNPs. Future studies should consider including all five ancestries in a linear mixed model to account for cases where traits are affected by the interaction between multiple ancestral LD blocks and their interaction with the causal SNPs. Another unexpected observation in the results, is the large number of false positive hits for both the APA and LAAA models compared to the other three models (Standard, GA and LA). This could be due to the simulated local ancestry effect size being too large, causing associations of whole ancestral blocks with the trait of interest. This study was limited by only conducting three replications of the simulated genotypes and phenotypes. Ideally, we would perform a few thousand replications to allow us to calculate the power of each model to identify causal SNPs. However, we were limited by the computational resources required for such an experiment. Alternatively, we could have reduced the number of SNPs used in the analysis and we could have simulated a single population instead of two. The impact of changes of the mutation rate and recombination rate on the statistical models ability to capture true variants could also be tested in the future. The simplification of the SAC as one admixture event could have implications for the admixture-induced LD blocks, since colonisation and admixture that occurred in the Cape was complex, at different time points and still ongoing. Hence, the simulations should be expanded to multiple admixture events at different time points in the future. In conclusion, investigating admixed individuals can be advantageous, due to diverse patterns of LD across continental populations. Nevertheless, the differential allele frequencies and LD patterns that exist between the ancestries should be incorporated correctly for true causal variants to be identified. The findings of this study show that jointly testing an allele effect, ancestry effect, and allele effect heterogeneity across ancestry (LAAA) in a regression model guards against missing genuine associations from any of the three sources. However, it is evident from the large number of false positive hits detected by some models, that the ancestry effect size we assumed across our simulated phenotypes was too large. Therefore, we were unable to fully assess the suitability of the LAAA model in a complex multi-way admixture scenario. Considering the extent to which the various models we tested differed in their results and considering that the source of a given association is unknown, we recommend that researchers use multiple GWAS models when analysing populations with complex ancestry. Specifically when the population under study exhibits extensive genetic diversity and complicated LD patterns, which are often observed for southern African populations. 30 May 2022

PONE-D-22-03783

GWAS in the Southern African context

PLOS ONE Dear Dr. Swart, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 14 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Badri Padhukasahasram Academic Editor PLOS ONE Journal requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf. 2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section. 3. Thank you for stating in your Funding Statement: “This research was partially funded by the South African government through the South African Medical Research Council and the National Research Foundation. YS was supported by a Stellenbosch University Postgraduate Bursary.” Please provide an amended statement that declares *all* the funding or sources of support (whether external or internal to your organization) received during this study, as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now. Please also include the statement “There was no additional external funding received for this study.” in your updated Funding Statement. Please include your amended Funding Statement within your cover letter. We will change the online submission form on your behalf. 4. Thank you for stating the following financial disclosure: “This research was partially funded by the South African government through the South African Medical Research Council and the National Research Foundation. YS was supported by a Stellenbosch University Postgraduate Bursary.” Please state what role the funders took in the study. If the funders had no role, please state: ""The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."" If this statement is not correct you must amend it as needed. Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf. 5. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability. Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized. Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access. We will update your Data Availability statement to reflect the information you provide in your cover letter. 6. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For more information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. In your revised cover letter, please address the following prompts: a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent. b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. We will update your Data Availability statement on your behalf to reflect the information you provide. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice. Additional Editor Comments: Dear Dr. Swart, The reviews for your manuscript GWAS in the Southern African context are now available. Based on comments from 3 expert reviewers, I recommend that the manuscript be accepted provided authors make minor revisions outlined by all reviewers. In particular, changes to the text to make it clearer, explanation of definitions of true and false positives, better figure and figure legends can improve the manuscript. In terms of major comments, authors can consider a simplified 3-way and 5-way admixture scenarios to be simulated and tested to avoid additional confounding effects on the GWAS methods tested. Lastly, reviewers also request that the model and scripts used be made readily available to the readers. I look forward to a revised manuscript that addresses all these comments and concerns. Sincerely, Dr. Badri Padhukasahasram [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Partly Reviewer #2: Yes Reviewer #3: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The authors set out to validate a recently published approach to use admixture LD to improve GWAS in more complex admixture scenarios (3-way and 5-way). The authors use realistic effect sizes for genetic effects maximizing the applicability of their results to real GWAS situations. However, because of the complex demographic history of the source populations for the admixed communities used as models in this study, it is hard to disentangle effects from varying levels of a particular ancestry and those from a difference in diversity from a single ancestry. The paper has the potential to be a broadly useful resource to the genetic epidemiology community, and could be improved if a simplified 3-way and 5-way admixture scenario were simulated and tested to avoid additional confounding effects on the GWAS methods tested. However, that may be beyond the scope of what the authors hope to present here and I would not require it for accepting this manuscript. I do find several issues with the paper that I think need to be addressed. While GWAS models are well specified (lines 225-253), the conditions of the simulated datasets (both phenotypes and genotypes) are not. It is difficult to evaluate the paper's claims without being able to evaluate how the data were produced. 1. I would suggest the authors include links to the model files used and the custom script used for ancestry assignment (line 161) from the simulated data should be linked to or provided as supplementary data. 2. It is unclear if the authors simulated all the admixture events (arrows) represented in figure 1. 3. It is unclear to me how the genotypes/phenotypes are being linked for analysis? How were the 10 causal SNPs chosen? 4. The three types of phenotypes simulated need to be explained in more detail. The discussion (lines 388-390) seems to suggest that a phenotype caused by a local ancestry-specific SNP was not simulated and tested with the various methods. Would distributing causal SNPs randomly across ancestries not result in biased results of all methods tested that are looking for an ancestry-specific effect? Could this explain the high number of false positives seen in the APA and LAAA models compared to the Standard and GA? 5. The relationship alluded to in Lines 316-318 between the amount of ancestry and false positives should be more formalized, for example simulating datasets with 5-95% of a certain ancestry and plotting the change in false positives. This would be broadly informative beyond the study populations used here. 6. On Line 271, how was LAI accuracy inferred? Were inaccurate segments identified and removed or were they misassigned? 7. The results of this study showing that in many instances the Standard model performs as well or nearly as well as the APA or LAAA, while providing orders of magnitude fewer false positives. Identifying 2-3 fewer true hits but avoiding ~1000 false positives that would make interpreting GWAS results extremely difficult, so to support the use of only a Standard Model. This is quite counter-intuitive and requires more discussion. Reviewer #2: The author is not consistent with abbreviation and full name (under abstract GWAS is not describes) then later LAAA is used as abbreviation and as a full term. The aim and the rational of this study its not clearly stated. The Author is using a public data but it was not mentioned clearly under method section that, this is a public data and therefore there is no need for ethical clearance. I think there is a need for a paragraph of sample/data description. Figures have poor quality (they are blur). All Figure legends have incomplete description of what is happening in a figure. I think sub title will make it easier to follow. Reviewer #3: 1 This is an interesting and self-contained paper that will be very helpful to people analysing complex populations. The authors simulated 3-way and 5-way admixture populations that are broadly representative of two groups in South Africa and then with different phenotypes try to understand which models correctly detect SNPs associated with the phenotypes. There are several different modelling approaches and as seen in the paper how you do this has a significant timpact 2 The paper is written clearly and overall the methods are sound. 3 I have one major criticism 4 There is no definition of true positive and false positive -- see detailed comments below. Without knowing this, I am not sure I agree with the conclusions (or at least I can't independently say I do). If their definition of a false positive would include a SNP that is _not_ a SNP in the "input" of the simulation as being causal but is for example 20 base pairs away and has been detected because of LD effects then I think that needs considerable motivation and explanation to justify calling that a false positive. Anyway -- please be clear and define this. 5. The number of SNPs used is only 400k -- which is small -- does this impact the result? If we were using a dense array and/or imputed data would this help or make things more complicated 6. The limitations expressed on page 15 are significant There are different results for 3-way and 5-way admixture (and as the authors point out this differs from results by others for 2-way admixture). But is that a function of the n in n-way or is it more complex than that? I suppose I want a more definitive answer, which is not possible with the evidence givern. 7. As a point of principle I think the code that was developed should be made available via GitHub or the like. 8. Bibliography -- a very annyoing feature of the paper is that references are numbered in the text but there is no number in the bibliography!!!! 9 Other issues: l97-98: There is no reference for the claim that two-way admixed populations "generally" have simple demographic histories and originate from a single pulse admixture event. I suppose this depends what is meant by "simple" and "pulse", but I think either give some reference that supports this or weaken the claim. l117 : "From" -> from Comma after "Khoe-San" on l128 should deleted -- the subject of the sentence is complex and hence the temptation but the comma is wrong and it would be better to simplify the txt. l122 : See point above about availability -- shouldn't these specifications be made available publicly. Just from a selfish point of view you are more likely to get citations if you do and people use them. l 126-130 : I think the characterisation of Khoe-San as a group culturally related needs care as this is contested esp by members of the communities. This is not my area, but the term Khoe-san language is a term that is more convenient for people outside those communities rather than accurate (perhaps like speaking about the Danish-Japanese language gruop?) and I don't think you can say that pastoralism and hunter-gatherer lifestyles are close culturally. As a relatively recent phenomenon as a result of colonialism since the the late 18th C this may have become true in recent historical times but to extrapolate from one place may be risk over-generalising. l142: wouldn't South Asian be better than East Asian if you are using GIH as the proxy. l143: This modelling of SAC is as the authors recognise a simplification -- white settlement in the Cape only started ~370 years ago with ongoing admixture -- and probably at peak from 300-150 years ago. This is not intended as a criticism and I don't think this result weakens the results but I think the authors should address the consequences of this simplification. l167 Simulating phenotypes: A little more detail would be useful here -- I'd like to understand how this takes LD and both global and local ancestry into account. I didn't understand how the SNPs were chosen -- do you just choose one lead SNP in a region or do you simulate SNPs that though not causal are in LD with the "causal" SNP. This para ia bit opaque. Is different LD in different popualtions considered. l178 -- use dash not hyphen (l 201, 227, 238 and many other palces too) l 288-307 -- I don't think true positives etc should be hyphenated l299 -- semi-colon not comma at end l378 -- principle -> principal l383 -- amount -> number l395 -- I am intrigued by this. When we do a GWAS we often get hits caused by SNPs being in LD wth the causal (or at least a lead SNP). We don't see these as false hits though because we understand that they tag the causal SNP. Of course this depends on how bug the admixture block is -- I wonder whether your framework of true positives/false positives is correct, and I think this needs clarificaition at least. If you define a false positive as a SNP that is found in the simulated data output then I think it is too strong -- I think this related to my comment on line 167 above -- see my general comment. l436 -- no comma -- the subject is complex but it just makes it worse to put a comma in. Similar issue on line 456 l448 I don't think only more complex admixture scenarios. Also just understanding the variability you get that may depend on changes in admixture ratios, LD-structure, population difference and so on. l 462. This sentence needs a main verb. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

5 Aug 2022 Thank you for your review and comments on our research manuscript. Please find our point-by-point response to the Reviewers comments below. Corrections to the conform to the journal requirements have been made as suggested. This include: - Figures were regenerated and submitted to (PACE) digital diagnostic tool to ensure that figures meet PLOS requirements. - Figure legends were updated. - Sentence reconstruction where more clarity was required. - Bibliography was updated to showcase numbers as required Amendment to ‘Funding Statement’: Please including the following as our funding statement. There is no specific grant number attached to this, but the phrasing has been approved the South African Medical Research Council. This research was funded (partially or fully) by the South African government through the South African Medical Research Council and the National Research Foundation. GvE was supported by the DST-NRF Innovation Doctoral Scholarship. YS was supported by a Stellenbosch University Postgraduate Bursary. There was no additional external funding received for this study. The funders had no role in study design, data collection, and analysis, decision to publish, or preparation of the manuscript. Amendment tot Data availability statement: All genetic data used in this manuscript was simulated or retrieved from public repositories (http://www.internationalgenome.org). All scripts used to generate and conduct data analysis are deposited and publicly available on Github: https://github.com/YolandiSwart/GWAS_SA The following was done to improve the overall quality of the paper as recommended by reviewers: - Explanation of true and false positives o We added an explanation in lines 282-285, “We compared the true positives (the number of times the source of association could be correctly identified) and the false positives (the number of times the source of the association was incorrectly identified as a true association) from each GWAS model applied to all of the simulated phenotypes to get an indication of model success rate in the simulated Nama and SAC populations.” - Consider a simplified 3-way and 5-way admixture scenarios to be simulated and tested to avoid additional confounding effects on the GWAS methods tested. o Thank you for this comment. We agree with the reviewer that this would be interesting and we indeed considered including more simplified 2-way, 3-way and 5-way admixed populations. However, just with the factors we considered we already simulated 24 distinct phenotypes with three different causal sets for each and tested 5 GWAS models on each of those iterations. Thus, 360 GWAS was run in total. This was computationally intensive and we were restricted with computational resources. Thus, we focused on the most realistic scenario one would encounter if you would conduct GWAS analysis on South African populations. - The model and scripts made readily available to the readers o All scripts used in the manuscript were made publicly available via a Github link. (https://github.com/YolandiSwart/GWAS_SA) Responses to each reviewers comments: Reviewer #1: 1. I would suggest the authors include links to the model files used and the custom script used for ancestry assignment (line 161) from the simulated data should be linked to or provided as supplementary data. Thank you for this comment. All scripts are now publicly available via a github link. (https://github.com/YolandiSwart/GWAS_SA) 2. It is unclear if the authors simulated all the admixture events (arrows) represented in figure 1. Thank you for this comment. No, not all admixture events were simulated that are represented in figure 1. To clear up any confusion we added two additional sentences in lines 133-136. “The two admixture events for the Nama are indicated with the dashed lines with arrowheads (purple and light pink). Four admixture events were simulated at one time point for the SAC. These are indicated by the four solid lines with open arrowheads (green, dark pink, light pink and light brown).” 3. It is unclear to me how the genotypes/phenotypes are being linked for analysis? How were the 10 causal SNPs chosen? Thank you for this comment. The genotypes are linked with the phenotypes when conducting regression model analysis for each simulated phenotype to see if there is a significant association between the simulated phenotype and causal variant (represented as the continuous outcome (Y)). The choice of 10 causal SNPs are explained in lines 169-172. The 10 causal SNPs are randomly simulated, since one would not know before conducting a GWAS where the causal SNP would be located. “Three replications, each with a different set of 10 causal SNPs of each kind of phenotype were simulated randomly and a replication was simulated with each ancestral population (three for the Nama and five for the SAC) as the source of the local ancestry effect.” This is also confirmed in the discussion in lines 411-413, “This effect could be due to chance, since causal SNPs were drawn randomly from the dataset without any regard for the local ancestry of the SNP.” 4. The three types of phenotypes simulated need to be explained in more detail. Thank you for this comment. We have added a thorough explanation of the simulated three phenotypes in lines 167-173. Two tables (Table 1 for Nama and Table2 for the SAC) are included to showcase the breakdown of the effect sizes of the simulated phenotypes. The scripts on Github will also clarify the confusion. “Three phenotypes were simulated for the Nama (Table 1) and the SAC (Table 2) - phenotypes with an allelic only (AO) effect, phenotypes with an ancestry plus allelic (APA) effect and phenotypes with an APA and local ancestry adjusted allelic (LAAA) effect. Three replications, each with a different set of 10 causal SNPs of each kind of phenotype were simulated randomly and a replication was simulated with each ancestral population (three for the Nama and five for the SAC) as the source of the local ancestry effect. Therefore, in total 27 phenotypes were simulated for the Nama and 45 phenotypes were simulated for the SAC.” The discussion (lines 388-390) seems to suggest that a phenotype caused by a local ancestry-specific SNP was not simulated and tested with the various methods. Would distributing causal SNPs randomly across ancestries not result in biased results of all methods tested that are looking for an ancestry-specific effect? Could this explain the high number of false positives seen in the APA and LAAA models compared to the Standard and GA? Thank you for this comment. The goal of this manuscript was to test if the LAAA robustly capture the true causal variants regardless of the source of the association or the complexity of the admixture compared to other frequently used GWAS models. We intended to distribute causal SNPs across ancestries, because this is the most realistic scenario that we witness in admixed South African populations. Hence, the models used should be robust in any admixture scenario. We also included an explanation for the high number of false-positives in the discussion in lines 449-451, “Another unexpected observation in the results, is the large number of false positive hits for both the APA and LAAA models compared to the other three models (Standard, GA and LA). This could be due to the simulated local ancestry effect size being too large, causing associations of whole ancestral blocks with the trait of interest.” 5. The relationship alluded to in Lines 316-318 between the amount of ancestry and false positives should be more formalized, for example simulating datasets with 5-95% of a certain ancestry and plotting the change in false positives. This would be broadly informative beyond the study populations used here. Thank you for this comment. We agree that it would be an interesting analysis to do, but the level of heterogeneity in the simulated data already encompasses a wide range of admixture which is more representative of modern populations. This will become even more applicable in the future. 6. On Line 271, how was LAI accuracy inferred? Were inaccurate segments identified and removed or were they misassigned? Thank you for this comment. To test the LAI accuracy of a given run, the inferred ancestry was compared to the true ancestry per locus per individual. Segments were not removed as we wanted to assess the effects of misassigned tracks on the models (Figure 3-6). This is evident in the scripts added to the Github page (https://github.com/YolandiSwart/GWAS_SA/blob/main/Ancestry_inference/lai_accuracy_per_anc.R). 7. The results of this study showing that in many instances the Standard model performs as well or nearly as well as the APA or LAAA, while providing orders of magnitude fewer false positives. Identifying 2-3 fewer true hits but avoiding ~1000 false positives that would make interpreting GWAS results extremely difficult, so to support the use of only a Standard Model. This is quite counter-intuitive and requires more discussion. Thank you for this comment. In the discussion in lines 451-453 we explain that the amount of false positives seen in the APA or LAAA and not in the AO models, is most probably due to the ancestry effect size that we simulated and not due to the model itself. Reviewer #2: The author is not consistent with abbreviation and full name (under abstract GWAS is not describes) then later LAAA is used as abbreviation and as a full term. Thank for this comment. We apologise for this inconsistency. Abbreviations were corrected throughout the manuscript. The aim and the rational of this study it’s not clearly stated. Thank for this comment. The aim and rationale are now stated in line 92. “Therefore the aim of this study was to determine if the LAAA model will be able to robustly capture the true causal variants regardless of the source of the association or the complexity of the admixture. ” The Author is using a public data but it was not mentioned clearly under method section that, this is a public data and therefore there is no need for ethical clearance. Thank for this comment. All genetic data used in this manuscript was simulated or retrieved from public repositories (http://www.internationalgenome.org). All scripts used to generate and conduct data analysis are deposited and publicly available on Github: https://github.com/YolandiSwart/GWAS_SA I think there is a need for a paragraph of sample/data description. Thank for this comment. In lines 158-194 we explain the sample/data simulated for this study with two descriptive tables (Table 1 and Table 2), as well as included the scripts to generate this datasets on Github. https://github.com/YolandiSwart/GWAS_SA Figures have poor quality (they are blur). Thank for this comment. We re-uploaded figures to ensure good quality. All Figure legends have incomplete description of what is happening in a figure. Thank for this comment. We agree and expanded the descriptions of all figures throughout the manuscript. Fig 1. The demographic histories of the simulated populations (Nama and SAC). Thick solid lines with open arrowheads indicate an ancestor -> descendant relation, dashed lines indicate an admixture pulse and faint solid lines with closed arrowheads indicate continuous migration. The two admixture events for the Nama are indicated by the dashed lines with arrowheads (purple and light pink). Four admixture events were simulated at one time point for the SAC. These are indicated by the four solid lines with open arrowheads (green, dark pink, light pink and light brown). Fig 2. Overview of the methods used to simulate genotypes using the software msprime and phenotypes using the software PhenotypeSimulator. Fig 3. Comparison of true positives (left) and false positives (right) for the Nama with inferred local ancestry used in GWAS models. The average hits for three runs with different causal SNPs are shown. The simulated phenotypes are denoted as “phenotype-ancestral source of association”, e.g.,LAAA-LWK means the LAAA phenotype with the LWK ancestral component as the ancestral source of association. The various GWAS models used are indicated in different colours. Grey represents the APA model, yellow represents the GA model, blue represents the LA model, green represents the LAAA model and the orange represents the Standard model. Fig 4. Comparison of true positive hits (left) and false positive hits (right) for the Nama with true local ancestry used in GWAS models. The average hits for three runs with different causal SNPs are shown. The simulated phenotypes are denoted as “phenotype-ancestral source of association”, e.g.,LAAA-LWK means the LAAA phenotype with the LWK ancestral component as the ancestral source of association. The various GWAS models used are indicated in different colours. Grey represents the APA model, yellow represents the GA model, blue represents the LA model, green represents the LAAA model and the orange represents the Standard model. Fig 5. Comparison of true positive hits (left) and false positive hits (right) for the SAC with inferred local ancestry used in GWAS models. The average hits for three runs with different causal SNPs are shown. The simulated phenotypes are denoted as “phenotype-ancestral source of association”, e.g.,LAAA-CHB means the LAAA phenotype with the CHB ancestral component as the ancestral source of association. The various GWAS models used are indicated in different colours. Grey represents the APA model, yellow represents the GA model, blue represents the LA model, green represents the LAAA model and the orange represents the Standard model. Fig 6. Comparison of true positive hits (left) and false positive hits (right) for the SAC with inferred local ancestry used in GWAS models. The average hits for three runs with different causal SNPs are shown. The simulated phenotypes are denoted as “phenotype-ancestral source of association”, e.g.,LAAA-CHB means the LAAA phenotype with the CHB ancestral components as the ancestral source of association. The various GWAS models used are indicated in different colours. Grey represents the APA model, yellow represents the GA model, blue represents the LA model, green represents the LAAA model and the orange represents the Standard model. Reviewer #3: 1 This is an interesting and self-contained paper that will be very helpful to people analysing complex populations. The authors simulated 3-way and 5-way admixture populations that are broadly representative of two groups in South Africa and then with different phenotypes try to understand which models correctly detect SNPs associated with the phenotypes. There are several different modelling approaches and as seen in the paper how you do this has a significant timpact 2 The paper is written clearly and overall the methods are sound. 3 I have one major criticism 4 There is no definition of true positive and false positive -- see detailed comments below. Without knowing this, I am not sure I agree with the conclusions (or at least I can't independently say I do). If their definition of a false positive would include a SNP that is _not_ a SNP in the "input" of the simulation as being causal but is for example 20 base pairs away and has been detected because of LD effects then I think that needs considerable motivation and explanation to justify calling that a false positive. Anyway -- please be clear and define this. Thank you for this comment. We added defined true and false positives in lines 282-285, “We compared the true positives (the number of times the source of association could be correctly identified) and the false positives (the number of times the source of the association was incorrectly identified as a true association) from each GWAS model applied to all of the simulated phenotypes to get an indication of model success rate in the simulated Nama and SAC populations.” Indeed, we do not consider a SNP in LD with a true causative variant as a false positive, as pointed out in the discussion (line 399):, “Our simulations would have to be adjusted to ensure that all of the causal SNPs are found in or outside of the contributing ancestry’s ancestral LD blocks and many more iterations would have to be done to test this.” This is also further discussed in lines 438-442, “We assumed independence of ancestry-specific effects across chromosomes. Therefore, the effect of other ancestral LD blocks (besides the ancestral LD blocks in which the causal SNPs are found) on a given phenotype was not simulated. This is important to note when considering our results in terms of real data, since this omission can have an impact on elucidating the true causal SNPs.” 5. The number of SNPs used is only 400k -- which is small -- does this impact the result? If we were using a dense array and/or imputed data would this help or make things more complicated Thank you for this comment. Yes, increasing the number of SNPs would indeed make things more complicated and required more computational resources. We have previously investigated imputation accuracy and showed we that imputation in southern African populations are feasible and produces quality data (Schurz et al. 2019, doi: 10.3389/fgene.2019.00034). Due to the complexity and computational expense of the work presented in this manuscript, we chose not to use imputed data to introduce more statistical bias. Increasing the number of SNPs in our simulations would not have improved the accuracy, as we knew where the causal SNPs were. 6. The limitations expressed on page 15 are significant There are different results for 3-way and 5-way admixture (and as the authors point out this differs from results by others for 2-way admixture). But is that a function of the n in n-way or is it more complex than that? I suppose I want a more definitive answer, which is not possible with the evidence givern. Thank you for this comment. Each admixture scenario will be different due to historical events. Most African populations had integration from various continental ancestral populations and the exact and location of ancestry proportions are mostly unknown (genetic heterogeneity). Hence this is more complex than the n in n-way, since ancestry patterns may be different between admixed populations formed by the same source populations. 7. As a point of principle I think the code that was developed should be made available via GitHub or the like. Thank you for this comment. We agree with this statement and all scripts are now publicly available via a github link. https://github.com/YolandiSwart/GWAS_SA 8. Bibliography -- a very annyoing feature of the paper is that references are numbered in the text but there is no number in the bibliography!!!! Thank you for comment. We apologise and corrected the numbers in the bibliography. 9 Other issues: l97-98: There is no reference for the claim that two-way admixed populations "generally" have simple demographic histories and originate from a single pulse admixture event. I suppose this depends what is meant by "simple" and "pulse", but I think either give some reference that supports this or weaken the claim. Thank you for this comment. We have removed this statement. l117 : "From" -> from – This was corrected Comma after "Khoe-San" on l128 should deleted -- the subject of the sentence is complex and hence the temptation but the comma is wrong and it would be better to simplify the txt. Thank you for this comment. This sentence was restructured in lines 132-133 for easier reading. “Uren et al. (13) found that the SAC has Khoe-San (32-43%), Bantu-speaking African (20-36%) , European (21-28%) and East and South-East Asian (9-11%) ancestral contributions.” l122 : See point above about availability -- shouldn't these specifications be made available publicly. Just from a selfish point of view you are more likely to get citations if you do and people use them. Thank you for this comment. All scripts are now publicly available via a github link. https://github.com/YolandiSwart/GWAS_SA l 126-130 : I think the characterisation of Khoe-San as a group culturally related needs care as this is contested esp by members of the communities. This is not my area, but the term Khoe-san language is a term that is more convenient for people outside those communities rather than accurate (perhaps like speaking about the Danish-Japanese language gruop?) and I don't think you can say that pastoralism and hunter-gatherer lifestyles are close culturally. As a relatively recent phenomenon as a result of colonialism since the the late 18th C this may have become true in recent historical times but to extrapolate from one place may be risk over-generalising. Thank you for this comment. The term Khoe-San encompasses both Khoekhoe (pastoralist) and San (hunter-gatherer) populations. Khoe-San refers to the two population groups collectively, rather than referring to a language group. During the course of our research on these communities, we have seen that this term has largely been accepted by these communities but each have their own identity. We however agree that our previous statement was confusing and have rephrased the sentence for clarity in lines 110-114. “The demographic history of the Nama, an indigenous Khoe-San population of southern Africa (13), is complex and extends back to the emergence of modern humans. Similar to the SAC, the Nama also received ancestral contributions from a number of non-African populations. “ l142: wouldn't South Asian be better than East Asian if you are using GIH as the proxy. Thank you for this comment. We used GIH as proxy for South Asian in line 141. l143: This modelling of SAC is as the authors recognise a simplification -- white settlement in the Cape only started ~370 years ago with ongoing admixture -- and probably at peak from 300-150 years ago. This is not intended as a criticism and I don't think this result weakens the results but I think the authors should address the consequences of this simplification. Thank you for this comment. We agree with this statement and added an additional discussion in lines 455-459. “The simplification of the SAC as multiple admixture events simultaneously could have implications for the admixture-induced LD blocks, since colonisation and admixture that occurred in the Cape was complex, at different time points and still ongoing. Hence, the simulations should be expanded to multiple admixture events at different time points in the future.” l167 Simulating phenotypes: A little more detail would be useful here -- I'd like to understand how this takes LD and both global and local ancestry into account. I didn't understand how the SNPs were chosen -- do you just choose one lead SNP in a region or do you simulate SNPs that though not causal are in LD with the "causal" SNP. This para ia bit opaque. Is different LD in different popualtions considered. Thank you for this comment. LD was not simulated in the different populations. Line 288 specifies this: “positives and false positives (excluding sites in LD with causal SNPs) were averaged to”. The choice of 10 causal SNPs are explained in lines 167-173. The 10 causal SNPs are randomly simulated, since one would not know before conducting a GWAS where the causal SNP would be located. We have added a thorough explanation of the simulated three phenotypes in lines 167-173. Two tables (Table 1 for Nama and Table2 for the SAC) are included to showcase the breakdown of the effect sizes of the simulated phenotypes. The scripts on Github will also clarify the confusion. “Three phenotypes were simulated for the Nama (Table 1) and the SAC (Table 2) - phenotypes with an allelic only (AO) effect, phenotypes with an ancestry plus allelic (APA) effect and phenotypes with an APA and local ancestry adjusted allelic (LAAA) effect. Three replications, each with a different set of 10 causal SNPs of each kind of phenotype were simulated randomly and a replication was simulated with each ancestral population (three for the Nama and five for the SAC) as the source of the local ancestry effect. Therefore, in total 27 phenotypes were simulated for the Nama and 45 phenotypes were simulated for the SAC.” Both global and local ancestry are only inferred after simulation of genotypes. Although the ancestral populations contributing to the admixed population is simulated as stipulated in lines 138-151. “The demographic model for the Nama (Fig 1) was based on the above mentioned events and a conference poster by Ragsdale et al. [21] where they inferred detailed parameterized demographic models for five present day populations (GBR, Gumuz, LWK, MSL and Nama) using joint allele frequency and LD statistics. Khoe-San groups also contributed a significant ancestral component to many South African populations, including a highly admixed group from multiple ancestral populations (SAC) [22][23]. Many non-African and African populations moved into southern Africa over the last ~600 years and integrated with Khoe-San groups [22]. Uren et al. [13] found that the SAC has Khoe-San (32-43%), Bantu-speaking African (20-36%) , European (21-28%) and East and South Asian (9-11%) ancestral contributions. The demographic model for the SAC (Fig 1) was based on the above ancestral proportions with a single admixture event, for simplicity, between the ancestral populations ~400 years ago. We made the following corrections to the manuscript and apologise for these errors: l178 -- use dash not hyphen (l 201, 227, 238 and many other palces too) All dashes were changed to hyphens. l 288-307 -- I don't think true positives etc should be hyphenated We removed the hyphen in all instances l299 -- semi-colon not comma at end We have corrected this. l378 -- principle -> principal Corrected l383 -- amount -> number Corrected l436 -- no comma -- the subject is complex but it just makes it worse to put a comma in. Similar issue on line 456 Corrected l 462. This sentence needs a main verb. A main verb was added l395 -- I am intrigued by this. When we do a GWAS we often get hits caused by SNPs being in LD wth the causal (or at least a lead SNP). We don't see these as false hits though because we understand that they tag the causal SNP. Of course this depends on how bug the admixture block is -- I wonder whether your framework of true positives/false positives is correct, and I think this needs clarificaition at least. If you define a false positive as a SNP that is found in the simulated data output then I think it is too strong -- I think this related to my comment on line 167 above -- see my general comment. Thank you for this comment, yes we do not see these as false positives. We stated that we excluded sites in LD with causal SNPs in line 283. “positive and false positive hits (excluding sites in LD with causal SNPs) were averaged to”. We also expanded on the definition of false positive hits and true positive hits in our study context in lines 282-284. “We compared the true positive (the number of times when the source of association could be correctly identified) and the false positive hits (the number of times when the source of the association was incorrectly identified as a true association) from each GWAS model applied” This is also further discussed in lines 438-442, “We assumed independence of ancestry-specific effects across chromosomes. Therefore, the effect of other ancestral LD blocks (besides the ancestral LD blocks in which the causal SNPs are found) on a given phenotype was not simulated. This is important to note when considering our results in terms of real data, since this omission can have an impact on elucidating the true causal SNPs.” l448 I don't think only more complex admixture scenarios. Also just understanding the variability you get that may depend on changes in admixture ratios, LD-structure, population difference and so on. Thank you for this comment. We agree with this statement. We removed this statement in lines 455-458. We thank all reviewers for this opportunity to revise and improve our manuscript. Submitted filename: Response_to_reviewers (1).docx Click here for additional data file. 8 Aug 2022 GWAS in the Southern African context PONE-D-22-03783R1 Dear Dr. Swart, We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. Kind regards, Badri Padhukasahasram Academic Editor PLOS ONE Additional Editor Comments (optional): Reviewers' comments: 6 Sep 2022 PONE-D-22-03783R1 GWAS in the Southern African context Dear Dr. Swart: I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access. Kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Badri Padhukasahasram %CORR_ED_EDITOR_ROLE% PLOS ONE

35 in total

1. Fast and accurate inference of local ancestry in Latino populations.

Authors: Yael Baran; Bogdan Pasaniuc; Sriram Sankararaman; Dara G Torgerson; Christopher Gignoux; Celeste Eng; William Rodriguez-Cintron; Rocio Chapela; Jean G Ford; Pedro C Avila; Jose Rodriguez-Santana; Esteban Gonzàlez Burchard; Eran Halperin
Journal: Bioinformatics Date: 2012-04-11 Impact factor: 6.937

2. Y-chromosomal evidence of a pastoralist migration through Tanzania to southern Africa.

Authors: Brenna M Henn; Christopher Gignoux; Alice A Lin; Peter J Oefner; Peidong Shen; Rosaria Scozzari; Fulvio Cruciani; Sarah A Tishkoff; Joanna L Mountain; Peter A Underhill
Journal: Proc Natl Acad Sci U S A Date: 2008-08-04 Impact factor: 11.205

3. Genome-wide association study of ancestry-specific TB risk in the South African Coloured population.

Authors: Emile R Chimusa; Noah Zaitlen; Michelle Daya; Marlo Möller; Paul D van Helden; Nicola J Mulder; Alkes L Price; Eileen G Hoal
Journal: Hum Mol Genet Date: 2013-09-20 Impact factor: 6.150

4. Ancestry estimation and control of population stratification for sequence-based association studies.

Authors: Chaolong Wang; Xiaowei Zhan; Jennifer Bragg-Gresham; Hyun Min Kang; Dwight Stambolian; Emily Y Chew; Kari E Branham; John Heckenlively; Robert Fulton; Richard K Wilson; Elaine R Mardis; Xihong Lin; Anand Swaroop; Sebastian Zöllner; Gonçalo R Abecasis
Journal: Nat Genet Date: 2014-03-16 Impact factor: 38.330

5. A robust and powerful two-step testing procedure for local ancestry adjusted allelic association analysis in admixed populations.

Authors: Qing Duan; Zheng Xu; Laura M Raffield; Suhua Chang; Di Wu; Ethan M Lange; Alex P Reiner; Yun Li
Journal: Genet Epidemiol Date: 2017-12-10 Impact factor: 2.135

6. Genomic variation in seven Khoe-San groups reveals adaptation and complex African history.

Authors: Carina M Schlebusch; Pontus Skoglund; Per Sjödin; Lucie M Gattepaille; Dena Hernandez; Flora Jay; Sen Li; Michael De Jongh; Andrew Singleton; Michael G B Blum; Himla Soodyall; Mattias Jakobsson
Journal: Science Date: 2012-09-20 Impact factor: 47.728

Review 7. 10 Years of GWAS Discovery: Biology, Function, and Translation.

Authors: Peter M Visscher; Naomi R Wray; Qian Zhang; Pamela Sklar; Mark I McCarthy; Matthew A Brown; Jian Yang
Journal: Am J Hum Genet Date: 2017-07-06 Impact factor: 11.025

8. Comparing local ancestry inference models in populations of two- and three-way admixture.

Authors: Ryan Schubert; Angela Andaleon; Heather E Wheeler
Journal: PeerJ Date: 2020-10-02 Impact factor: 2.984

9. Genetic substructure and complex demographic history of South African Bantu speakers.

Authors: Dhriti Sengupta; Ananyo Choudhury; Cesar Fortes-Lima; Shaun Aron; Gavin Whitelaw; Koen Bostoen; Hilde Gunnink; Natalia Chousou-Polydouri; Peter Delius; Stephen Tollman; F Xavier Gómez-Olivé; Shane Norris; Felistas Mashinya; Marianne Alberts; Scott Hazelhurst; Carina M Schlebusch; Michèle Ramsay
Journal: Nat Commun Date: 2021-04-07 Impact factor: 14.919

10. Refining the Y chromosome phylogeny with southern African sequences.

Authors: Chiara Barbieri; Alexander Hübner; Enrico Macholdt; Shengyu Ni; Sebastian Lippold; Roland Schröder; Sununguko Wata Mpoloka; Josephine Purps; Lutz Roewer; Mark Stoneking; Brigitte Pakendorf
Journal: Hum Genet Date: 2016-04-04 Impact factor: 4.132