Using population-specific add-on polymorphisms to improve genotype imputation in underrepresented populations.

Zhi Ming Xu^1,2, Sina Rüeger^1,2, Michaela Zwyer^3,4, Daniela Brites^3,4, Hellen Hiza^3,4,5, Miriam Reinhard^3,4, Liliana Rutaihwa^3,4, Sonia Borrell^3,4, Faima Isihaka⁵, Hosiana Temba⁵, Thomas Maroa⁵, Rastard Naftari⁵, Jerry Hella⁵, Mohamed Sasamalo⁵, Klaus Reither^3,4, Damien Portevin^3,4, Sebastien Gagneux^3,4, Jacques Fellay^1,2,6.

Abstract

Genome-wide association studies rely on the statistical inference of untyped variants, called imputation, to increase the coverage of genotyping arrays. However, the results are often suboptimal in populations underrepresented in existing reference panels and array designs, since the selected single nucleotide polymorphisms (SNPs) may fail to capture population-specific haplotype structures, hence the full extent of common genetic variation. Here, we propose to sequence the full genomes of a small subset of an underrepresented study cohort to inform the selection of population-specific add-on tag SNPs and to generate an internal population-specific imputation reference panel, such that the remaining array-genotyped cohort could be more accurately imputed. Using a Tanzania-based cohort as a proof-of-concept, we demonstrate the validity of our approach by showing improvements in imputation accuracy after the addition of our designed add-on tags to the base H3Africa array.

Year: 2022 PMID: 35025869 PMCID: PMC8791479 DOI: 10.1371/journal.pcbi.1009628

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

Introduction

By mapping the associations between single-nucleotide polymorphisms (SNPs) and various phenotypes, genome-wide association studies (GWAS) have allowed us to gain unprecedented knowledge on the genetic basis of various human diseases and traits. An important prerequisite to conducting GWAS is the availability of a cost-effective yet accurate high-throughput genotyping method. Genotyping arrays have been used widely over the past 15 years, including in many studies facilitated by biobank resources such as the UK Biobank [1]. However, genotyping arrays rely on the imputation of a sparse set of tag SNPs (e.g. millions of SNPs) to achieve acceptable density genome-wide (e.g. tens of millions of variants). The quality of imputation is dependent on the suitability of the tag SNPs and the similarity of haplotype structure between the reference panel and the study population [2-5]. For study populations where a genetically similar reference panel or population-specific array content may not be available, whole-genome sequencing (WGS) offers an alternative to genotyping arrays. Previous studies have suggested that WGS may offer substantial gains in such a scenario, potentially pinpointing loci absent in GWAS conducted using genotyping arrays [6, 7]. However, due to the large sample sizes often required to gain sufficient statistical power in GWAS, the cost of WGS can still be prohibitive despite its recent decrease [8]. An alternative to WGS is the development of population-specific reference panels and genotyping arrays. For example, African-specific reference panels and genotyping arrays have been developed in recent years in an attempt to rectify the underrepresentation of African populations in genetic studies [9-11]. Notably, the Human Heredity and Health in Africa (H3Africa) consortium has developed the H3Africa genotyping array, which contains approximately 2.2 million tags to capture genetic variability observed in various African populations [12]. 
Furthermore, the African Genome Resources (AFGR) reference panel has been designed to capture the haplotype structure of various African populations to improve imputation accuracy. However, given that only a subset of African populations is represented in current publicly available reference panels, there remain African populations for which imputation is suboptimal. This is exacerbated by the fact that the level of genetic diversity is much higher among African populations than among non-African populations, driven by their long evolutionary history and lack of bottlenecks [13, 14]. Thus, to achieve imputation accuracy across all African populations similar to that in non-African populations (e.g. European or Asian), larger and more diverse imputation reference panels are needed to capture the full extent of variation [15].

For the remaining underrepresented populations, we propose the use of a combination of add-on tags and an internal population-specific reference panel as a cost-effective approach to improve genotype imputation. For a GWAS cohort, we propose to perform WGS in a small subset (e.g. 10% of the entire cohort) for two purposes. Firstly, array manufacturers often allow researchers to add customised content to existing array designs. For example, Illumina offers the flexibility to add 5000 or 20,000 probes to the commercially available H3Africa array. In our proposed approach, the WGS data would be leveraged to determine population-specific linkage disequilibrium (LD) structures, and thus enable the selection of add-on tags to improve genotype imputation. Secondly, the strategy of supplementing external reference panels with WGS samples from an internal study cohort has been employed by previous studies [16, 17].
Specifically, it has been shown that the addition of even a relatively small number of samples from the internal cohort leads to improved imputation accuracy, especially if the study population is genetically dissimilar from the populations captured by existing reference panels [18, 19]. In our proposed approach, the WGS data would be used to construct an internal population-specific reference panel that supplements existing publicly available reference panels. This would function jointly with the selected add-on tags to further improve genotype imputation.

Given that the population that an existing genotyping array is designed for (array population) is different from the population one would like to genotype (study population), we envision that the add-on tags selected by our proposed approach would improve imputation accuracy under the following scenarios. Firstly, the minor allele of a target variant could be rare in the array population but common in the study population, and thus not captured by the existing array content. The WGS data of the study population could be leveraged to select an add-on tag SNP that is in strong LD with the target variant (S1(A) Fig). The internal reference panel could then be leveraged to impute the target variant. Secondly, the minor allele of an existing tag could be common in the array population but rare in the study population. In such a scenario, the existing tag would be insufficient to tag the target variant, and an additional add-on tag specific to the study population would be required (S1(B) Fig). Finally, the haplotype structure may be different between the array population and the study population. An existing tag may be on the same haplotype block as the target variant in the array population, but on a different haplotype block in the study population. Thus, an add-on tag on the same haplotype block as the target variant in the study population could be added.
The internal reference panel could then be leveraged to impute the target variant (S1(C) Fig).

As a proof-of-concept example, we utilize 116 high-coverage WGS samples from participants of the TB-DAR cohort (tuberculosis patients recruited in a hospital in Dar es Salaam, Tanzania). Since the Tanzanian population has not been incorporated in existing reference panels and array designs, including the AFGR reference panel and the H3Africa genotyping array, this cohort provides an ideal basis to evaluate our approach. We first illustrate the necessity for add-on tags by calculating the genetic differentiation between our Tanzanian cohort and populations represented in the AFGR reference panel. Using the AFGR reference panel, we show that fewer sites could be successfully imputed for Tanzanian individuals compared to individuals from Sub-Saharan African populations that are represented in the reference panel. We proceed to leverage the WGS data to construct an internal reference panel that captures Tanzanian-specific haplotype structures, and to select add-on tags that target common variants in the Tanzanian population that are poorly imputed under the base H3Africa array. We then confirm the validity of our approach by evaluating the improvement in imputation accuracy enabled by the addition of add-on tags, and compare our approach to add-on tags selected at random and by the Tagger software. We use both the internal Tanzanian reference panel and the external AFGR reference panel for imputation, where for each site the genotype call is derived from the reference panel with the higher predicted imputation accuracy for that site. Finally, we present an alternative selection scheme for mitochondrial and Y chromosome variants. We show that mitochondrial haplogroup calling can be improved, while the coverage of the Y chromosome on the H3Africa array is mostly sufficient for accurate haplogroup calling. A summary of our proposed approach can be found in Fig 1.
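The per-site choice between the two reference panels can be illustrated with a minimal sketch. The function name, the dictionary inputs, the site identifiers, and the tie-breaking rule (prefer the internal panel on ties) are assumptions made for illustration; the text above only states that the panel with the higher predicted imputation accuracy is used at each site.

```python
def choose_panel_per_site(info_internal, info_external):
    """Pick, for every site, the panel with the higher predicted quality.

    Both arguments map a site identifier to a predicted imputation quality
    (for example an INFO score). Sites imputed by only one panel fall back
    to that panel. Ties go to the internal panel (an illustrative assumption).
    """
    chosen = {}
    for site in sorted(set(info_internal) | set(info_external)):
        q_int = info_internal.get(site)
        q_ext = info_external.get(site)
        if q_ext is None or (q_int is not None and q_int >= q_ext):
            chosen[site] = "internal"
        else:
            chosen[site] = "external"
    return chosen


# Hypothetical site identifiers and quality scores.
internal = {"chr1:100": 0.92, "chr1:200": 0.60}
external = {"chr1:100": 0.85, "chr1:200": 0.95, "chr1:300": 0.70}
panel_choice = choose_panel_per_site(internal, external)
```

In practice the genotype dosages from the chosen panel would be carried along with the label; only the decision rule is shown here.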

Fig 1

Schematic of our add-on tag SNP selection procedures, with steps illustrating.

Step 1) Constructing a Tanzanian reference panel. Identifying candidate target variants, which are derived from poorly imputed variants when the H3Africa array is imputed based on the Tanzanian and AFGR reference panels. Step 2) Selecting add-on tags that optimally tag candidate target variants based on population-specific LD structures, allele frequencies, and probe qualities. Step 3) Evaluating improvements in imputation performance after adding add-on tags to the base H3Africa array. Calculating imputation quality metrics, including INFO score and r² (correlation between imputed and sequencing-based genotypes). WGS, Whole-Genome Sequencing; AFGR, African Genome Resources; MAF, Minor Allele Frequency; MI, Mutual Information; LD, Linkage Disequilibrium.

Results

Performance of external reference panels

To choose the most suitable external reference panel for the TB-DAR cohort, we masked the TB-DAR WGS data to include only sites genotyped under the H3Africa array. We then imputed the masked data and compared the imputation performance of three publicly available reference panels (as of September 2019) that include individuals of African ancestry: 1) the AFGR (African Genome Resources) reference panel hosted on the Sanger Imputation Server, which consists of 4956 individuals in total, with ∼2000 individuals from Uganda, ∼100 individuals from other African populations, and the remainder from the African and non-African populations of the 1000 Genomes Project; 2) the CAAPA (Consortium on Asthma among African-ancestry Populations in the Americas) reference panel [20] hosted on the Michigan Imputation Server, which consists of 883 African American individuals; and 3) the HRC (Haplotype Reference Consortium) reference panel [21] hosted on the Sanger Imputation Server, which consists of 32,470 individuals in total, originating from the African and non-African populations of the 1000 Genomes Project and other sources, with predominantly individuals of European ancestry. Table 1 shows the imputation performance of the three reference panels, measured by mean imputation quality (INFO score or r²), mean correlation with ground truth (r²), and the fraction of variants observed in the WGS data that were successfully imputed (imputation quality > 0.8). We observed that the AFGR reference panel outperformed both the CAAPA and HRC reference panels across MAF bins in all three metrics, and thus chose to use AFGR as the external reference panel in this study.

Table 1

Imputation performance of publicly available reference panels when applied to the TB-DAR data based on the H3Africa array content.

Minor allele frequency (MAF) is based on the frequency observed in the TB-DAR cohort. Imputation quality (Subcolumn 1) is measured by either INFO score (AFGR and HRC; Sanger Imputation Server) or r² (CAAPA; Michigan Imputation Server). Correlation with ground truth (Subcolumn 2) measures the correlation between the imputed dosage and the ground truth WGS dosage using the squared Pearson correlation coefficient (r²). Percent of variants imputed (Subcolumn 3) represents the fraction of variants observed in the TB-DAR WGS data that were successfully imputed (imputation quality > 0.8).

MAF	AFGR: Quality (INFO)	AFGR: Ground Truth r²	AFGR: % Variants Imputed	CAAPA: Quality (r²)	CAAPA: Ground Truth r²	CAAPA: % Variants Imputed	HRC: Quality (INFO)	HRC: Ground Truth r²	HRC: % Variants Imputed
0.01–0.05	0.95	0.91	88.6	0.85	0.83	67.5	0.88	0.80	68.3
0.05–0.1	0.98	0.96	93.4	0.93	0.91	86.8	0.96	0.90	91.9
0.1–0.5	0.99	0.97	92.7	0.96	0.95	90.1	0.98	0.95	91.7
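Two of the metrics in Table 1 can be written down compactly. The sketch below is illustrative only: the function names and the toy inputs are assumptions, and the INFO score itself is taken as given from the imputation server rather than recomputed.

```python
from math import fsum


def squared_pearson(imputed_dosage, wgs_dosage):
    """Squared Pearson correlation (r²) between imputed and WGS dosages.

    Returns None when either vector has zero variance, since the
    correlation is undefined for a monomorphic site.
    """
    n = len(imputed_dosage)
    if n != len(wgs_dosage) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    mx = fsum(imputed_dosage) / n
    my = fsum(wgs_dosage) / n
    sxx = fsum((a - mx) ** 2 for a in imputed_dosage)
    syy = fsum((b - my) ** 2 for b in wgs_dosage)
    sxy = fsum((a - mx) * (b - my) for a, b in zip(imputed_dosage, wgs_dosage))
    if sxx == 0 or syy == 0:
        return None
    return (sxy * sxy) / (sxx * syy)


def percent_imputed(quality_by_site, wgs_sites, threshold=0.8):
    """Percent of WGS-observed sites whose imputation quality exceeds the threshold.

    Sites observed in WGS but absent from the imputation output count as
    not imputed, so the denominator is the number of WGS sites.
    """
    passed = sum(1 for s in wgs_sites if quality_by_site.get(s, 0.0) > threshold)
    return 100.0 * passed / len(wgs_sites)


# Toy data: one site imputed well, one poorly, two missing from the output.
toy_quality = {"a": 0.9, "b": 0.5}
toy_pct = percent_imputed(toy_quality, ["a", "b", "c", "d"])
```

Using the WGS site list as the denominator is what lets this metric penalise a panel that simply does not contain a variant, which a mean quality score computed over imputed sites alone would miss.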

Differentiation between the Tanzanian population and other African populations

Study participants of the TB-DAR WGS cohort originated from various ethnic groups within Tanzania (S1 Table). A majority of participants belonged to the Bantu-speaking ethnic groups (N = 108, 93.1%), with a small minority that belonged to the Nilotic (N = 1, 0.8%) and Cushitic (N = 3, 2.6%) speaking ethnic groups. Self-reported ethnic information was not available for four participants. To measure the population differentiation between the TB-DAR WGS cohort and populations represented in the AFGR reference panel, for each pair of populations we calculated the genome-wide pairwise fixation index (FST). The FST estimate is a metric that ranges from 0 to 1 and quantifies the degree of genetic differentiation between populations, with 0 indicating no differentiation and 1 indicating complete differentiation. Fig 2A illustrates the pairwise FST measures between the TB-DAR cohort and Sub-Saharan African populations of the 1000 Genomes (1KG) project, along with the sampling locations of populations represented in the AFGR reference panel. FST measures were only calculated for Sub-Saharan African 1KG populations that are also part of the AFGR reference panel, since WGS data for populations based in Uganda, Ethiopia, Namibia, and South Africa (Zulu) were not publicly available. In general, genetic differentiation was greater between populations that are further apart geographically. For example, TB-DAR displayed the least differentiation with the Bantu-speaking Luhya population (LWK) in neighbouring Kenya, but the most differentiation with West African populations such as the Gambian in the Western Division of Gambia (GWD) and the Mende in Sierra Leone (MSL). A similar pattern was observed among Sub-Saharan African populations of 1KG (Fig 2B). The least differentiation was observed between population pairs in the same geographic region (e.g. YRI and ESN: FST = 0.0008). The most differentiation was observed between East and West African populations (e.g. LWK and GWD: FST = 0.011).
However, since many diverse African populations were not included in the analysis, this is an underestimate of the full extent of genetic diversity within Africa. More comprehensive analyses have estimated a mean FST of 0.027 between East and West African populations [22]. In addition, the genetic principal components (PCs) shown in S2 Fig illustrate a similar pattern, where distances in PC space approximately scaled with geographic distances between the sampling locations of populations.
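As a rough illustration of how a genome-wide pairwise fixation index can be computed from allele frequencies, the sketch below uses the Hudson estimator in its ratio-of-averages form. This estimator is an assumption chosen for illustration; the text above does not state which estimator was used, and the function name and inputs are hypothetical.

```python
def hudson_fst(freqs_pop1, freqs_pop2, n_chrom1, n_chrom2):
    """Genome-wide Hudson estimate of the fixation index for two populations.

    freqs_pop1 / freqs_pop2: per-SNP allele frequencies of the same allele
    in each population, in the same SNP order.
    n_chrom1 / n_chrom2: number of sampled chromosomes in each population.
    The per-SNP numerators and denominators are summed before dividing
    (ratio of averages), which is more stable than averaging per-SNP ratios.
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        numerator += (
            (p1 - p2) ** 2
            - p1 * (1 - p1) / (n_chrom1 - 1)
            - p2 * (1 - p2) / (n_chrom2 - 1)
        )
        denominator += p1 * (1 - p2) + p2 * (1 - p1)
    if denominator == 0:
        return float("nan")
    return numerator / denominator


# Toy check: a fixed difference gives 1, identical frequencies give roughly 0.
fst_fixed = hudson_fst([1.0], [0.0], 100, 100)
fst_same = hudson_fst([0.5], [0.5], 100, 100)
```

Because of the sample-size correction terms, the estimate can be slightly negative when two samples have nearly identical allele frequencies; such values are usually read as no detectable differentiation.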

Fig 2

Genetic differentiation of African populations.

A) Sampling locations of the TB-DAR WGS cohort and populations within the AFGR reference panel, which includes the Sub-Saharan African populations of the 1000 Genomes (1KG) project. Line colors illustrate the degree of differentiation (FST) between TB-DAR and 1KG populations. B) Pairwise FST measures between 1KG populations and TB-DAR. 1000 Genomes Populations: GWD: Gambian in Western Divisions in the Gambia; MSL: Mende in Sierra Leone; YRI: Yoruba in Ibadan, Nigeria; ESN: Esan in Nigeria; LWK: Luhya in Webuye, Kenya. The map was created programmatically in R using the spData package [58], with the base layer based on public domain maps from Natural Earth (https://www.naturalearthdata.com/).

Given the observed genetic differentiation between the TB-DAR cohort and the populations represented in the AFGR reference panel, we next evaluated the baseline imputation performance of the TB-DAR cohort compared to Sub-Saharan African 1KG populations that are represented in the reference panel. S3 Fig illustrates that, compared to all 1KG populations, a lower number of variant sites were successfully imputed (INFO > 0.8) in the TB-DAR cohort across all minor allele frequency (MAF) thresholds. For example, for autosomal variants with a MAF of approximately 0.05, 95.8% of variants were successfully imputed in the TB-DAR cohort compared to a mean of 97.7% across the Sub-Saharan African 1KG populations. The difference was more pronounced for the X chromosome, with 89.4% of variants successfully imputed in the TB-DAR cohort compared to a mean of 94.3% across the Sub-Saharan African 1KG populations. These results quantify the genetic diversity of populations within Africa, and illustrate the differentiation of the TB-DAR cohort from Sub-Saharan African populations of the 1KG project that are represented in the AFGR reference panel. Driven by such differentiation, imputation performance was lower in the TB-DAR cohort when the AFGR reference panel was used. Thus, there is a clear need to supplement external reference panels with Tanzanian-specific haplotypes and to design population-specific add-ons for the TB-DAR cohort.

Selection of add-on tag SNPs and improvements in imputation accuracy

The selection of add-on tag SNPs was conducted under two different settings. Under a coverage-guaranteeing setting (Setting 1), we selected 1869 add-on tags within 337 prioritized TB-associated regions. In addition, under an efficiency-driven setting (Setting 2), we selected 2503 further add-on tags across the rest of the genome. S4 Fig shows the distribution of all selected tags across chromosomes. To confirm the validity of our approach, we used the TB-DAR WGS testing set to compare the imputation accuracy of target variants based on three array designs: 1) as a baseline, the H3Africa array without any add-on tags; 2) the H3Africa array with random add-on tags; and 3) the H3Africa array with population-specific add-on tags selected based on the proposed approach. Fig 3 shows the mean imputation quality of the target variants that our add-on tags were designed to tag, across different MAF percentile bins. Under both settings, the incorporation of add-on tags improved imputation accuracy compared to both the baseline H3Africa array and the H3Africa array with random add-on tags, as reflected by the increase in mean INFO score and r² (correlation with WGS ground truth) across all MAF bins. While the magnitude of the increase in mean imputation accuracy was similar for both settings, in general, target variants in prioritized regions were better imputed. This was as intended since, under Setting 1, even relatively well-imputed variants within each region would be tagged by add-on tags to guarantee coverage.

Fig 3

Improvement in imputation performance subsequent to the addition of add-on tags.

Mean INFO score and r² (between imputed and sequenced ground truth) of target variants designed to be tagged by add-on tags based on three array designs: 1) the H3Africa array without any add-on tags; 2) the H3Africa array with random add-on tags; 3) the H3Africa array with population-specific add-on tags selected based on the proposed approach. Facet grids illustrate results based on two tag SNP selection settings: coverage-guaranteeing within prioritized regions (Setting 1) and efficiency-driven in all other regions (Setting 2). Error bars represent the standard error (SE) of the mean imputation quality within each MAF bin.

An example region where our approach functioned as expected is shown in Fig 4. Our designed add-on tags led to improved imputation of target variants compared to both the baseline H3Africa array and the H3Africa array with add-on tags selected at random, as reflected by increases in both INFO score and r². Notably, add-on tags were mainly located in proximity to the previously poorly imputed target variants (left side of the region). This indicates that, as designed, only add-on tags in relatively strong LD with poorly imputed target variants were selected, as LD generally decreases with distance. With the addition of random tags that are approximately uniformly distributed across the entire region, the overall gain in imputation performance was smaller.

Fig 4

Improvement in imputation performance in an example region.

Example region on chromosome 10 where the incorporation of add-on tags led to an increase in imputation performance. Facet grids illustrate imputation performance of the H3Africa array without any add-on tags, with random add-on tags, and with add-on tags selected under the proposed approach. Dot colors represent the type of variant (existing H3Africa tags, add-on tags, or any other imputed variants).

Next, we compared the efficiency of add-on tags selected by the proposed approach to those selected at random and to those selected by the existing Tagger software [23]. We evaluated the number of successfully tagged variants, measured by the number of imputed variants with imputation improvements subsequent to the addition of add-on tags. Our proposed approach explicitly targets variants poorly imputed by either the AFGR reference panel or the internal population-specific reference panel. Conversely, Tagger does not rely directly on imputation accuracy to select target variants. Rather, the software infers un-tagged target variants based on existing tags, and subsequently selects tags that most efficiently capture these previously un-tagged variants. Table 2 summarises the performance of our proposed approach and Tagger, compared to random selection. Compared to random selection, the performance of add-on tags selected by the proposed approach was higher under both Setting 1 and Setting 2. However, compared to Tagger, the performance of add-on tags selected by the proposed approach was similar under both Setting 1 and Setting 2. Under Setting 1 (Coverage Guaranteeing), each add-on tag successfully tagged 22.5 variants, while each random and Tagger tag tagged 18.7 and 21.4 variants respectively. Under an INFO score threshold of 0.8 (commonly used in GWAS), this translates to an additional 3196 and 1290 imputed variants being incorporated under our proposed approach compared to random selection and Tagger respectively. Under Setting 2 (Efficiency Driven), each add-on tag successfully tagged 78.3 variants, while each random and Tagger tag tagged 65.8 and 78.1 variants respectively. Under an INFO score threshold of 0.8, this translates to an additional 8260 and 1852 imputed variants being incorporated under the proposed approach compared to random selection and Tagger respectively. For all methods, the number of successfully tagged variants per tag was higher in Setting 2 compared to Setting 1. This is expected since, under Setting 1, short haplotypes are tagged to guarantee coverage, which reduces the efficiency of each add-on tag.

Table 2

Performance of add-on tags, categorized based on settings and methods.

Setting	Method	Number of Probes	Mean Probe-ability Score (± SE)	INFO Improved: Per Probe (± SE)	INFO Improved: Per Tag (± SE)	INFO Improved: %AFGR	INFO Improved: %Tanz	Newly INFO > 0.8: Per Probe (± SE)	Newly INFO > 0.8: Per Tag (± SE)	Newly INFO > 0.8: %AFGR	Newly INFO > 0.8: %Tanz
Setting 1 (1869 tags)	Proposed Approach	2114	0.71 ± 0.006	19.9 ± 3.2	22.5 ± 3.2	33.1	66.9	3.5 ± 0.5	4.0 ± 0.5	65.8	34.2
Setting 1 (1869 tags)	Tagger	2186	0.75 ± 0.006	18.3 ± 2.9	21.4 ± 3.0	31.8	68.2	2.8 ± 0.3	3.3 ± 0.4	62.8	37.2
Setting 1 (1869 tags)	Random Tags	NA	NA	NA	18.7 ± 2.4	26.8	73.2	NA	2.3 ± 0.3	64.2	35.8
Setting 2 (2503 tags)	Proposed Approach	2688	0.87 ± 0.004	72.9 ± 2.7	78.3 ± 2.7	28.4	71.6	9.2 ± 0.6	9.9 ± 0.6	70.7	29.3
Setting 2 (2503 tags)	Tagger	2905	0.73 ± 0.005	67.3 ± 2.5	78.1 ± 2.7	27.6	72.4	7.9 ± 0.5	9.2 ± 0.5	67.1	32.9
Setting 2 (2503 tags)	Random Tags	NA	NA	NA	65.8 ± 2.5	26.9	73.1	NA	6.6 ± 0.5	75.4	24.6

Number of probes (Column 2) indicates the total number of Illumina probes that are required to genotype the add-on tags. The mean probe-ability score (Column 3) estimates the genotyping success rate for the selected add-on probes. The number of successfully tagged imputed variants is measured either as those with any improvement in INFO score (Column 4) or as those exceeding an INFO score of 0.8 when previously below it (Column 5). Per probe and per tag indicate the number of imputed variants with imputation improvements per add-on probe and per add-on tag respectively. Standard error (SE) represents the variability of the per tag and per probe metrics across different genomic regions. %AFGR and %Tanz indicate the proportion of imputed variants with better imputation accuracy based on the AFGR or internal Tanzanian reference panel respectively.

Another difference between our proposed approach and Tagger is that Tagger does not incorporate information on the number of Illumina probes required for each tag. This is specific to the Illumina platform, where two probes are required to determine the relative intensity of complementary alleles at a locus (A/T or G/C SNPs), while a single probe is sufficient for non-complementary alleles (e.g. A to C or G). Thus, add-on content selected by Tagger can potentially result in a more costly array composition when two add-on tags that have similar tagging efficiency require a different number of probes. This was especially pronounced under Setting 2, where more probes were selected by Tagger compared to the proposed approach with no significant overall gain in the number of successfully tagged variants. Table 2 shows that, compared with Tagger, the proposed method had a similar number of successfully tagged variants per tag (78.3 and 78.1 respectively), but a higher number of successfully tagged variants per Illumina probe (72.9 and 67.3 respectively).
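The probe-count rule described above, and its effect on per-probe efficiency, can be sketched as follows. The function names are hypothetical, and the conversion from per-tag to per-probe counts (per-tag value times number of tags, divided by number of probes) is an assumption that happens to reproduce the Setting 2 means reported in Table 2; it does not reproduce the standard errors.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def probes_required(allele1, allele2):
    """Number of probes needed for a biallelic SNP under the rule in the text.

    Complementary allele pairs (A/T or G/C) need two probes; any other
    pair of distinct nucleotides needs one.
    """
    a, b = allele1.upper(), allele2.upper()
    if a not in COMPLEMENT or b not in COMPLEMENT or a == b:
        raise ValueError("expected two distinct nucleotides from A, C, G, T")
    return 2 if COMPLEMENT[a] == b else 1


def per_probe_efficiency(tagged_per_tag, n_tags, n_probes):
    """Convert a mean per-tag count into a mean per-probe count (assumed formula)."""
    return tagged_per_tag * n_tags / n_probes


# Setting 2 values taken from Table 2.
proposed = round(per_probe_efficiency(78.3, 2503, 2688), 1)
tagger = round(per_probe_efficiency(78.1, 2503, 2905), 1)
```

With nearly identical per-tag efficiency, the difference between the two methods comes almost entirely from the denominator: Tagger's selection needs 2905 probes for the same 2503 tags, against 2688 for the proposed approach.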

Mitochondria and Y chromosome haplogroups

Since mitochondrial and Y chromosome haplogroups provide an efficient way to track human evolutionary history, we targeted haplogroup markers to improve the accuracy of haplogroup calling. The distributions of mitochondrial and Y chromosome haplogroups within the TB-DAR WGS cohort are shown in S5A and S5B Fig respectively. For the mitochondrial DNA, most individuals belonged to the L haplogroup. This was consistent with findings based on the 1000 Genomes project [24], where the L haplogroups were found to be the dominant haplogroups in African populations. For the Y chromosome, a majority of male individuals belonged to the E haplogroups, with a small minority belonging to the B, R, and other haplogroups. This was also consistent with the 1000 Genomes project [25], where the E haplogroups were found to be dominant in African populations. Likewise, in the Luhya population in neighbouring Kenya, a small minority belonged to the B haplogroup [25]. To ensure that our add-on content includes haplogroup markers that complement the existing content on the H3Africa array, we selected 103 and 31 haplogroup marker SNPs as add-ons for the mitochondria and Y chromosome respectively. For the mitochondria, we saw an average improvement in haplogroup calling of 22% compared to the H3Africa array. For the Y chromosome, due to the limited number of add-on tags and sufficient coverage by the H3Africa array, we did not observe any significant differences in haplogroup calling.

Discussion

The use of internal population-specific reference panels that supplement external reference panels has been shown to improve genotype imputation, especially in populations that are underrepresented in existing reference panels [18, 19]. Using a Tanzanian cohort as proof-of-concept, our work confirms the utility of internal reference panels. Additionally, we showed that the use of add-on tags jointly with an internal reference panel could further improve the imputation accuracy of common variants. With regards to application in GWAS, the cost-effectiveness of our approach lies in the balance between sensitivity and power. We envision that our method would be more cost-effective for larger cohorts, as only a small subset of the cohort would need to be whole-genome sequenced under a fixed cost to increase sensitivity for the entire cohort. However, for smaller cohorts, the gain from the increased power through genotyping more individuals using a commercially available array without any add-on content may be preferable to the sensitivity gained from our approach. Furthermore, array manufacturers may also be less willing to offer add-on customisation for small cohorts. For example, Illumina only offers the possibility of customisation of the H3Africa array for orders with more than 1152 samples. With the release of larger and more diverse reference panels, the gain from the proposed approach is expected to decrease for many populations. The proposed approach can also be time-consuming to implement, due to administrative and logistical factors involved in sequencing, selection of additional variants, and the production of custom arrays. Thus, those interested in applying the approach should first check whether their study population of interest is well-represented in existing array designs and reference panels, and then consider whether the potential sensitivity gained from the proposed approach may be worthwhile.
However, until reference panels and corresponding array designs are able to capture almost all of the human genetic diversity, we still believe that our approach fills an important gap for certain underrepresented populations. Our add-on tag selection procedure did not explicitly target population-specific variants, such as ancestry informative markers [26, 27]. Rather, any common variants in our study population that are poorly imputed under the existing base array content were targeted, and these variants could be either specific to the study population or not. Such a choice was driven by the aim of GWAS, which is to map any variant associated with the trait of interest. Nevertheless, the proposed approach did successfully leverage population specific haplotype structures to improve imputation as designed. This was reflected by the substantial fraction of the tagged imputed variants (66.9% and 71.6%, under Setting 1 and Setting 2 respectively) that were better imputed by the internal Tanzanian reference panel compared to the AFGR reference panel (Table 2). An add-on tag that most efficiently tags a target variant (in the strongest LD) may not necessarily be the optimal tag or even possible to be assayed on an array, as the genotyping error rate of the probe for the particular tag SNP may be high. In the case of Illumina platforms, the quality of the probe(s) that assay each tag SNP is predicted by a proprietary algorithm that outputs a “probe-ability” score publicly available to researchers. Thus, we were able to rectify such an issue by limiting our selection to add-on tags with probes that have high success rates (Illumina probe-ability score > 0.3), and weighed the trade-off between LD strength and probe quality equally when selecting the optimal add-ons. Nevertheless, a more complex weighting scheme may result in even better performance. 
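The probe-quality filter and equal weighting described above could look roughly like the sketch below. The additive score (0.5 times LD strength plus 0.5 times probe-ability), the function name, and the candidate identifiers are all assumptions for illustration; the text states only that candidates were restricted to a probe-ability score above 0.3 and that LD strength and probe quality were weighed equally.

```python
def select_tag(candidates, min_probeability=0.3, ld_weight=0.5):
    """Pick one add-on tag for a target variant from candidate SNPs.

    candidates: list of (snp_id, ld_with_target, probeability) tuples, with
    both scores on a 0 to 1 scale. Candidates at or below the probe-ability
    threshold are discarded. The remaining ones are ranked by a weighted
    sum of LD strength and probe quality (an assumed form of the trade-off).
    Returns the chosen SNP id, or None if no candidate passes the filter.
    """
    eligible = [c for c in candidates if c[2] > min_probeability]
    if not eligible:
        return None
    best = max(eligible, key=lambda c: ld_weight * c[1] + (1 - ld_weight) * c[2])
    return best[0]


# Hypothetical candidates: rsA has the strongest LD but a poor probe.
toy_candidates = [("rsA", 0.95, 0.20), ("rsB", 0.80, 0.90), ("rsC", 0.85, 0.60)]
chosen_tag = select_tag(toy_candidates)
```

Changing `ld_weight` away from 0.5 is one simple way to explore the more complex weighting schemes mentioned above.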
Furthermore, while the addition of 5000 probes in this study did not result in the saturation of any bead pools of the array, this may become problematic with a larger number of add-on probes. For such scenarios, our approach would have to be modified to balance the saturation of existing bead pools against the suitability of the probe and the tagging efficiency of the tag.

We introduced two settings for the selection of add-on tags, namely either coverage-guaranteeing (Setting 1) or efficiency-driven (Setting 2). For users of our approach, the number of regions assigned to each setting could be adjusted depending on the study. For example, if there exists strong prior knowledge with regard to genes implicated in or loci associated with the trait of interest, these regions could be assigned to Setting 1. Conversely, for traits lacking prior knowledge, a greater proportion of regions could be assigned to Setting 2, such that tag selection would be conducted in a more hypothesis-free manner.

A limitation of our approach is that only common variants in the TB-DAR cohort (MAF > 0.05) were targeted by the selected add-on tags. Such a choice was made due to the limited sample size of our WGS cohort, where for rare variants in the TB-DAR cohort there would be insufficient observations to estimate LD. Nevertheless, the imputation accuracy of low-frequency variants (for example, 0.01 < MAF < 0.05) which are in strong LD with the targeted variants could still increase if tested in a larger testing set.

Another limitation of our approach is that the selection of add-on tags within the MHC regions may be suboptimal. Our approach could be improved by utilizing more accurate variant calling in the MHC regions, for example through the incorporation of alternative contigs of the reference genome [28]. 
Rather than targeting all common variation in the region, tags could also be selected to tag HLA alleles [29], which could be inferred using HLA allele typing approaches based on the WGS data [30].

Finally, the proposed approach uses a single-marker tagging approach based on pairwise MI to identify add-on content. The use of multi-marker tagging approaches could improve the efficiency of the selected add-on tags [31, 32].

In conclusion, in order to improve imputation accuracy in populations underrepresented in existing reference panels and genotyping array designs, we propose a framework where a subset of a cohort is sequenced and the rest genotyped using an array supplemented with the selected add-on tag SNPs. Using a Tanzania-based cohort as a proof-of-concept, we demonstrated that under our approach, the WGS data could be leveraged to supplement existing reference panels and to select add-on tags, such that imputation accuracy is improved. Our approach is generalizable to any other population to improve genotype imputation, and thus provides a cost-effective solution to increase the power of GWAS in a diverse range of underrepresented populations and to further our understanding of human genetic diversity.

Materials and methods

Ethics statement

Ethical approval for the TB-DAR cohort was obtained from the Ethikkomission Nordwest- und Zentralschweiz (EKNZ UBE-15/42), the Institutional Review Board of the Ifakara Health Institute (IHI/IRB/EXT/No: 24–2020) and the Medical Research Coordinating Committee of the National Institute for Medical Research in Tanzania (NIMR/HQ/R.8c/Vol.I/1622). Written informed consent was obtained from every patient recruited into the TB-DAR cohort. This consent includes the use of the patient’s blood for human genomic analyses.

Study description

This study was based on a cohort of adult pulmonary tuberculosis (TB) patients from Dar es Salaam, Tanzania (TB-DAR). Participants were recruited at the Temeke Regional Hospital in Dar es Salaam. A total of 128 patients were randomly selected from the cohort for WGS, and the 116 samples that passed sequencing quality control were retained. The ethnicity of patients was self-reported.

Whole genome sequencing and quality control

WGS was performed at the Health2030 Genome Center in Geneva on the Illumina NovaSeq 6000 instrument (Illumina Inc, San Diego CA, USA), starting from 1 μg of whole blood genomic DNA and using Illumina TruSeq DNA PCR-Free reagents for library preparation and the 150 nt paired-end sequencing configuration. Average coverage was above 30× for 75 samples, between 10× and 30× for 40 samples, and approximately 8× for a single sample. Sequencing reads were aligned to the GRCh38 (GCA_000001405.15) reference genome using bwa [33] (Version 0.7.17), and duplicates were marked using Picard (Version 2.8.14, http://broadinstitute.github.io/picard/). Following the GATK best practices (Germline short variant discovery) [34], Base Quality Score Re-calibration (BQSR) was applied using the GATK package [35] (Version 4.0.9.0). Variants were called individually per sample and then jointly. A Variant Quality Score Re-calibration (VQSR) based filter was then applied, with a truth sensitivity threshold of 99.7 and an excess heterozygosity threshold of 54.69. Samples with a high genotype missingness rate (> 0.5) were excluded. To ensure that coordinates of the TB-DAR WGS data matched the GRCh37-based AFGR reference panel, a liftover was applied using Picard LiftoverVcf with the UCSC chain file (hg38ToHg19). Only variants that were successfully lifted over to the same chromosome were retained. Within the X and Y chromosomes, variants within the pseudoautosomal regions [36, 37] were excluded.
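The two post-liftover filters described above (same-chromosome retention and pseudoautosomal exclusion) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names are hypothetical, and the GRCh37 pseudoautosomal boundaries are commonly quoted values that are assumed here and should be checked against the definitions cited in the text [36, 37].

```python
# Assumed GRCh37 pseudoautosomal intervals (1-based, inclusive).
# These boundaries are an assumption for illustration; verify before use.
PAR_GRCH37 = {
    "X": [(60_001, 2_699_520), (154_931_044, 155_260_560)],
    "Y": [(10_001, 2_649_520), (59_034_050, 59_363_566)],
}


def normalise_chrom(name):
    """Strip an optional 'chr' prefix so that 'chrX' and 'X' compare equal."""
    return name[3:] if name.startswith("chr") else name


def keep_after_liftover(source_chrom, lifted_chrom, lifted_pos):
    """Return True if a lifted-over variant passes both filters."""
    src = normalise_chrom(source_chrom)
    dst = normalise_chrom(lifted_chrom)
    # Filter 1: the variant must land on the chromosome it started on.
    if src != dst:
        return False
    # Filter 2: drop variants inside a pseudoautosomal interval on X or Y.
    for start, end in PAR_GRCH37.get(dst, []):
        if start <= lifted_pos <= end:
            return False
    return True


# Hypothetical records: (GRCh38 chromosome, GRCh37 chromosome, GRCh37 position)
variants = [
    ("chr1", "1", 1_000_000),   # kept
    ("chr1", "2", 1_000_000),   # dropped: changed chromosome
    ("chrX", "X", 1_500_000),   # dropped: inside the assumed PAR1 interval
    ("chrX", "X", 50_000_000),  # kept
]
kept = [v for v in variants if keep_after_liftover(*v)]
```

In practice these filters would be applied to VCF records with a dedicated tool; the sketch only makes the two retention rules explicit.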

Comparison of external reference panels

To compare the performance of external reference panels, the entire WGS dataset was masked such that only sites present on the H3Africa array were retained. We measured imputation performance based on three metrics: the imputation quality (INFO score reported by AFGR/HRC and $r^2$ reported by CAAPA), correlation with ground truth measured by the squared Pearson correlation coefficient ($r^2$), and the fraction of variants observed in the WGS data that were successfully imputed (INFO or $r^2$ > 0.8).
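Two of these metrics are simple enough to illustrate directly. The sketch below computes the squared Pearson correlation between imputed and sequenced dosages for one variant, and the fraction of sequenced variants that exceed a quality threshold. The function names and the choice to count variants missing from the imputation output as not imputed are assumptions made for this example.

```python
def squared_pearson(imputed, truth):
    """Squared Pearson correlation between imputed and true dosages."""
    n = len(imputed)
    mean_x = sum(imputed) / n
    mean_y = sum(truth) / n
    sxx = sum((x - mean_x) ** 2 for x in imputed)
    syy = sum((y - mean_y) ** 2 for y in truth)
    if sxx == 0 or syy == 0:
        # Correlation is undefined when either vector is constant.
        return float("nan")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(imputed, truth))
    return (sxy * sxy) / (sxx * syy)


def fraction_well_imputed(quality, sequenced_variants, threshold=0.8):
    """Fraction of sequenced variants with imputation quality above threshold.

    Variants absent from `quality` are treated as not imputed (an assumption).
    """
    hits = sum(1 for v in sequenced_variants if quality.get(v, 0.0) > threshold)
    return hits / len(sequenced_variants)


# Hypothetical dosages for four individuals at one variant.
r2_example = squared_pearson([0.1, 0.9, 1.8, 0.2], [0, 1, 2, 0])

# Hypothetical quality scores; variants "c" and "d" were not imputed at all.
fraction_example = fraction_well_imputed({"a": 0.9, "b": 0.5}, ["a", "b", "c", "d"])
```

Using the sequenced variants as the denominator means that variants a panel cannot impute at all still count against it.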

Fixation index and genetic principal components

To conduct principal component analysis (PCA), only autosomal single-nucleotide variants that were genotyped in both 1000 Genomes and TB-DAR WGS cohorts were included. Long-range LD regions were identified separately across all super-populations and across only African populations using the snp_autoSVD function of the bigsnpr package [38] in R, and variants within long-range LD regions were excluded. Using PLINK (Version 1.9) [39], LD pruning [40] (plink --indep-pairwise 1000 50 0.05) was applied and principal components were derived based on the merged cohorts (TB-DAR and all 1000 Genomes super-populations or TB-DAR and all 1000 Genomes African populations). To measure differentiation between the TB-DAR WGS cohort and various 1000 Genomes African populations, the pairwise fixation index ($F_{ST}$) was calculated using the Hudson estimate implemented in the EIGENSOFT software package (version 8.0.0). Within the TB-DAR WGS cohort, relatedness between individuals was calculated using KING [41], and a random individual in each pair of first-degree relatives was excluded. Within the 1KG African populations, relatedness information was obtained from the 1KG project, and individuals labelled as the child, sibling, or grandparent of families or trios were excluded (281 individuals excluded). Only single-nucleotide variants on the autosomes that were genotyped and common (MAF > 0.05) in the merged cohort (TB-DAR and all 1000 Genomes African populations) were included.
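For readers unfamiliar with the Hudson estimate, the sketch below shows its commonly used ratio-of-averages form computed from per-variant allele frequencies. It is an illustration only and is not taken from EIGENSOFT; in particular, treating the sample sizes as numbers of sampled chromosomes is an assumption of this example.

```python
def hudson_fst(freq_pop1, freq_pop2, n_chrom1, n_chrom2):
    """Ratio-of-averages Hudson F_ST across variants.

    freq_pop1, freq_pop2: per-variant sample allele frequencies.
    n_chrom1, n_chrom2: number of sampled chromosomes in each population
    (assumed to be 2 x the number of individuals for autosomes).
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freq_pop1, freq_pop2):
        # Squared frequency difference, corrected for sampling variance.
        numerator += (
            (p1 - p2) ** 2
            - p1 * (1 - p1) / (n_chrom1 - 1)
            - p2 * (1 - p2) / (n_chrom2 - 1)
        )
        # Probability that two alleles drawn from different populations differ.
        denominator += p1 * (1 - p2) + p2 * (1 - p1)
    return numerator / denominator


# Hypothetical frequencies at three variants in two populations.
fst_example = hudson_fst([0.10, 0.50, 0.30], [0.15, 0.40, 0.35], 200, 180)
```

Summing numerators and denominators before dividing, rather than averaging per-variant ratios, keeps low-frequency variants from dominating the estimate.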

Selection of add-on tag SNPs

Our approach to selecting add-on tags can be divided into three main steps. In step 1, genotype imputation was performed. Poorly imputed variants were identified and served as the candidate target variants that our add-on tags would be designed to tag. In step 2, the optimal add-on tags were selected based on the population-specific LD structure and allele frequencies of the study cohort. In step 3, we evaluated the improvement in imputation performance when the selected add-on tags were incorporated onto the base H3Africa array.

Step 1: Genotype imputation and identification of candidate target variants

The TB-DAR WGS cohort was divided into a training set (3/4 of the data) and a testing set (1/4 of the data). To achieve optimal imputation accuracy, two reference panels (one internal and one external) were used to capture haplotype structures present in both the Tanzanian population and in other African populations. An internal Tanzanian reference panel based on the TB-DAR WGS training set samples was constructed using Minimac3 [42]. The African Genome Resources (AFGR) reference panel (https://www.apcdr.org/) hosted on the Sanger imputation service (https://imputation.sanger.ac.uk/) [43] was also utilized, where EAGLE2 [44] was used for phasing and the positional Burrows-Wheeler transform (PBWT) [45] was used for imputation. While Minimac3 and EAGLE2 output imputation accuracy metrics that are not strictly comparable ($r^2$ and INFO score), the two metrics have been shown to be highly correlated [46]. Thus, for each site the genotype call was derived based on a direct comparison of imputation accuracy between the two reference panels. The higher score was stored and is herein referred to as the INFO score. To identify poorly imputed variants expected under the H3Africa array content (Version 2, https://chipinfo.h3abionet.org/help), the TB-DAR WGS testing set was masked such that only sites present on the H3Africa array were retained. The masked data was imputed using both reference panels, and for each variant, imputation was based on the reference panel that yielded a better imputation score. Candidate target variants were designated as variants that are poorly imputed (INFO < 0.8) but common (MAF > 0.05) in the TB-DAR WGS cohort.
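The per-variant choice between the two panels and the resulting candidate target list can be sketched as below. The function names, variant identifiers and scores are hypothetical, and treating a variant that is missing from one panel as imputed by the other panel alone is an assumption of this example.

```python
def best_panel_score(internal_score, external_score):
    """For each variant, keep the panel with the higher imputation score.

    internal_score / external_score: dicts mapping variant ID to the quality
    metric reported by each panel (treated as comparable, as in the text).
    """
    merged = {}
    for variant in set(internal_score) | set(external_score):
        a = internal_score.get(variant)
        b = external_score.get(variant)
        if b is None or (a is not None and a >= b):
            merged[variant] = ("internal", a)
        else:
            merged[variant] = ("external", b)
    return merged


def candidate_targets(merged, maf, info_max=0.8, maf_min=0.05):
    """Variants that are poorly imputed (score < 0.8) but common (MAF > 0.05)."""
    return sorted(
        variant
        for variant, (_, score) in merged.items()
        if score < info_max and maf.get(variant, 0.0) > maf_min
    )


# Hypothetical scores from the internal and external panels.
internal = {"rsA": 0.95, "rsB": 0.60, "rsC": 0.40}
external = {"rsA": 0.90, "rsB": 0.75, "rsD": 0.30}
maf = {"rsA": 0.20, "rsB": 0.10, "rsC": 0.02, "rsD": 0.30}

merged = best_panel_score(internal, external)
targets = candidate_targets(merged, maf)
```

Here rsC is excluded despite its low score because it is rare in the hypothetical cohort, which mirrors the restriction to common variants.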

Step 2: Add-on tag SNP selection

For each region, the set of candidate target variants ($S_1$) was defined as variants that are poorly imputed but common in the TB-DAR cohort (MAF > 0.05). The set of candidate add-on tag SNPs ($S_2$) was defined as sequenced common single-nucleotide variants in the TB-DAR cohort, part of the AFGR reference panel or the TB-DAR reference panel, with genotype missingness below 0.5, and available as Illumina Infinium probes (probe-ability score > 0.3). The set of existing tags ($S_3$) was initialized as tags that are part of the H3Africa array. LD information between variants was calculated based on the TB-DAR WGS training set. We utilized mutual information (MI) as an LD metric (see S1 Appendix), consistent with the choice of a previous array design study for the Japanese population [47]. To select the optimal set of add-on tags, we followed the framework of a forward-selection based algorithm [47]. In summary, the algorithm selects tags that are in the strongest LD with the highest number of candidate target variants not captured by existing tags. S6 Fig illustrates an example of a single iteration of the add-on tag selection algorithm.

For a single iteration of the add-on tag SNP selection algorithm:

1. For a candidate target variant ($j$), the existing tag that is in strongest LD with it was identified. The MI score of the target variant ($s_j$) was defined as:

$$s_j = \max_{i \in S_3} I_{i,j}$$

where $I_{i,j}$ denotes the MI between variant $i$ and variant $j$. In S6(A) Fig, SNP4 and SNP5 are the target variants, and SNP3 was identified as the existing tag that is in strongest LD with both target variants.

2. For each pair of candidate add-on tag ($k$) and candidate target variant ($j$), the add-on tag’s efficiency was defined as the expected change in MI ($\delta_{k,j}$) resulting from the incorporation of the add-on tag:

$$\delta_{k,j} = \max\left(I_{k,j} - s_j,\ 0\right)$$

In S6(B) Fig, SNP1, SNP4, and SNP5 are the candidate tags. SNP4 and SNP5 are the target variants, and the change in MI against both variants ($\delta$) was calculated for each candidate tag.

3. The efficiency of a candidate add-on tag ($e_k$) against all candidate target variants was defined based on the sum of the changes in MI:

$$e_k = \frac{\sum_{j \in S_1} \delta_{k,j}}{N_k}$$

where $N_k$ denotes the number of probes required for candidate add-on tag $k$ (this is specific to the Illumina platform, where 2 probes are required for A/T or C/G SNPs due to complementarity and 1 for all others). In S6(B) Fig, SNP5 is the most efficient tag, with $e$ = 0.9, because it requires only a single probe and increases MI against SNP4 and SNP5 by 0.4 and 0.5 respectively.

4. The optimal add-on tag ($k^*$) was identified based on the overall rank of its efficiency and probe-ability scores:

$$k^* = \underset{k \in S_2}{\arg\min}\ \left(R^{e}_{k} + R^{p}_{k}\right)$$

where $R^{e}_{k}$ and $R^{p}_{k}$ denote the ranking of the efficiency score and probe-ability score respectively for the candidate add-on tag $k$. In S6(C) Fig, SNP5 is the optimal tag, since it achieves the lowest overall rank of 3 given a rank of 2 based on probe-ability (S6(D) Fig) and a rank of 1 based on efficiency (S6(B) Fig).

5. $k^*$ was added to the set of existing tags ($S_3$), and the above steps were repeated. This is illustrated in S6(E) Fig. The selection procedure was stopped when there were no candidate add-on tags remaining ($S_2$ becomes empty), or when the stopping criteria were met.
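A single iteration of the selection described above can be sketched in a few lines. This is an illustrative reading of the text rather than the authors' code: the MI values, probe counts and probe-ability scores are invented, the function names are hypothetical, and clipping negative changes in MI at zero is an assumption. The toy numbers were chosen so that SNP5 gains 0.4 and 0.5 against the two targets, as in the description of S6 Fig.

```python
def mi(table, a, b):
    """Symmetric lookup of pairwise MI; unknown pairs count as 0."""
    return table.get((a, b), table.get((b, a), 0.0))


def rank_descending(scores):
    """Rank 1 goes to the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {key: position + 1 for position, key in enumerate(ordered)}


def select_one_tag(table, targets, candidates, existing, probes, probeability):
    """Run one forward-selection iteration and return (best tag, efficiencies)."""
    # s_j: strongest MI between each target and any existing tag.
    s = {j: max((mi(table, i, j) for i in existing), default=0.0) for j in targets}
    # e_k: summed gain in MI over all targets, per probe required.
    e = {}
    for k in candidates:
        gain = sum(max(mi(table, k, j) - s[j], 0.0) for j in targets)
        e[k] = gain / probes[k]
    # Combine the efficiency rank and the probe-ability rank with equal weight.
    rank_e = rank_descending(e)
    rank_p = rank_descending({k: probeability[k] for k in candidates})
    best = min(candidates, key=lambda k: rank_e[k] + rank_p[k])
    return best, e


# Hypothetical MI values between pairs of SNPs.
mi_table = {
    ("SNP2", "SNP4"): 0.1, ("SNP3", "SNP4"): 0.3, ("SNP3", "SNP5"): 0.5,
    ("SNP1", "SNP4"): 0.5, ("SNP1", "SNP5"): 0.4, ("SNP4", "SNP5"): 0.7,
    ("SNP4", "SNP4"): 1.0, ("SNP5", "SNP5"): 1.0,
}
best, efficiency = select_one_tag(
    mi_table,
    targets=["SNP4", "SNP5"],
    candidates=["SNP1", "SNP4", "SNP5"],
    existing=["SNP2", "SNP3"],
    probes={"SNP1": 1, "SNP4": 2, "SNP5": 1},  # SNP4 assumed to be an A/T SNP
    probeability={"SNP1": 0.9, "SNP4": 0.5, "SNP5": 0.8},
)
```

Repeating the call after moving `best` into `existing` reproduces the forward-selection loop; dividing by the probe count is what lets a single-probe tag outrank a two-probe tag with the same total gain.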

Step 2: Region definitions and stopping criteria

To ensure the efficiency of add-on tag SNP selection while guaranteeing sufficient coverage in prioritized regions, a two-step tag SNP selection procedure with distinct region definitions and stopping criteria was established. Under Setting 1, regions spanning 5000 base pairs upstream and downstream of genes or variants associated with TB outcomes (reported by GWAS catalog [48], Open Targets [49], and other GWAS studies [50-52]) were considered. The killer cell immunoglobulin-like receptor (KIR) and human leukocyte antigen (HLA) gene regions were also considered. A region was subjected to add-on tag SNP selection if it contained a substantial number of poorly imputed common variants (MAF > 0.05 in the TB-DAR cohort), defined as more than 20% of variants with INFO < 0.8. Regions were also subjected to add-on tag selection if they contained an uneven spatial distribution of well-imputed variants, defined as the spread of poorly imputed variants (INFO < 0.8) being more than 1.25 times the spread of well-imputed variants (INFO ≥ 0.8). For example, the spread of well-imputed variants is defined as , where σ represents the standard deviation of the positions of variants under each criterion (imputed common variants with INFO > 0.8, all imputed common variants, and all sequenced common variants). To guarantee sufficient coverage, iterations of the forward-selection algorithm were run for each region independently until fewer than 0.5% of candidate target variants within the region showed δ improvements. The process was then repeated for each of the prioritized regions.

Under Setting 2, the selection of add-on tag variants was expanded to any region across the genome that contained poorly imputed common variants (MAF > 0.05 in the TB-DAR cohort). The regions were defined as either a haplotype block (plink --blocks) [39, 53] or a region spanning 5000 base pairs upstream and downstream of a candidate target variant, whichever was larger. 
To maximize the tagging efficiency of the selected add-on tags, a single iteration of the algorithm was run concurrently across all regions. The tag that scored the best across all regions was incorporated. The process was then repeated until the total number of budgeted add-on probes (N = 5000) had been exhausted.
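The region inclusion rule and the two stopping criteria can be made concrete with a short sketch. The function names are hypothetical, only the 20% criterion is shown because the spread formula is not reproduced above, and two simplifications are assumed: Setting 2 is modelled as consuming an already ranked list of tags (the text re-ranks after every iteration), and selection stops at the first tag that would exceed the probe budget.

```python
def region_needs_tags(info_scores, threshold=0.8, max_poor_fraction=0.2):
    """Setting 1 inclusion rule: more than 20% of common variants have INFO < 0.8."""
    if not info_scores:
        return False
    poor = sum(1 for score in info_scores if score < threshold)
    return poor / len(info_scores) > max_poor_fraction


def setting1_should_stop(delta_by_target, min_fraction=0.005):
    """Stop a region once fewer than 0.5% of target variants still improve."""
    if not delta_by_target:
        return True
    improved = sum(1 for delta in delta_by_target.values() if delta > 0)
    return improved / len(delta_by_target) < min_fraction


def spend_probe_budget(ranked_tags, probes, budget=5000):
    """Setting 2 sketch: take tags in ranked order until the budget is used up."""
    chosen, used = [], 0
    for tag in ranked_tags:
        if used + probes[tag] > budget:
            break  # assumed behaviour: do not skip ahead to cheaper tags
        chosen.append(tag)
        used += probes[tag]
    return chosen, used


# Hypothetical region in which one of four common variants is poorly imputed.
flagged = region_needs_tags([0.90, 0.90, 0.50, 0.95])

# Hypothetical ranked tags with their probe requirements and a tiny budget.
chosen, used = spend_probe_budget(["t1", "t2", "t3"], {"t1": 2, "t2": 2, "t3": 1}, budget=4)
```

Counting probes rather than tags matters because, as noted earlier, A/T and C/G SNPs consume two probes each on the Illumina platform.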

Step 3: Evaluation of imputation accuracy

The TB-DAR WGS testing set was utilized to measure improvements in imputation performance enabled by the add-on tags. For all target variants tagged by at least one add-on tag, imputation quality (INFO score) derived from the base H3Africa array was compared against imputation quality derived from the H3Africa array with the addition of add-on tags. In addition, to measure the accuracy of the imputed genotypes, squared Pearson correlation coefficients ($r^2$) were calculated between the imputed genotype dosages (0, 1 or 2) and the ground truth dosages based on the WGS data.

Selection of random add-on tag SNPs

A matching number of add-on tags was selected at random within each region under both Setting 1 and Setting 2. The criteria for an SNP to be considered as a candidate add-on tag were identical to those of the proposed approach, except that the candidate tag did not necessarily have to be available as a high-quality Illumina probe (probe-ability > 0.3). Add-on tags which were selected by the proposed approach were excluded. In a small number of regions under Setting 1, the number of remaining candidates was not sufficient to achieve a matching number. To avoid bias, a subset of add-on tags that were selected by the proposed approach were also selected at random to achieve a matching number.
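The matched random baseline for one region can be sketched as follows. The function name, the fixed seed and the tag identifiers are assumptions made for illustration; the logic follows the description above, drawing from candidates not chosen by the proposed approach and topping up from the proposed tags only when the pool is too small.

```python
import random


def random_matched_tags(candidates, proposed, n_needed, seed=0):
    """Draw a random baseline set of tags matching the proposed set in size."""
    rng = random.Random(seed)  # seeded so that the baseline is reproducible
    # Exclude tags that the proposed approach already selected.
    pool = sorted(set(candidates) - set(proposed))
    if len(pool) >= n_needed:
        return rng.sample(pool, n_needed)
    # Too few remaining candidates: top up with randomly chosen proposed tags.
    shortfall = n_needed - len(pool)
    return pool + rng.sample(sorted(proposed), shortfall)


# Region with enough remaining candidates.
baseline = random_matched_tags(["a", "b", "c", "d", "e"], ["a", "b"], n_needed=2)

# Region where only one non-proposed candidate remains.
topped_up = random_matched_tags(["a", "b", "c"], ["a", "b"], n_needed=2)
```

Sorting the pool before sampling keeps the draw independent of set iteration order, so the same seed always gives the same baseline.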

Selection of add-on tag SNPs using Tagger

The performance of our proposed approach was compared to Tagger [23], which is a publicly available software for tag SNP selection. Tagger was applied to each region under both Setting 1 and Setting 2 independently using the pairwise mode, and a matching number of add-on tags was selected based on the ranking of tags provided by Tagger. The criteria for an SNP to be considered as a candidate add-on tag were identical to those of our proposed approach. Since Tagger directly infers the set of un-tagged target variants based on existing array content rather than imputation performance, all variants common in the TB-DAR cohort (MAF > 0.05) were included as input for candidate target variants. In a small number of regions under Setting 1, the number of candidate tags proposed by Tagger was not sufficient to achieve a matching number. To avoid bias, a subset of add-on tags that were selected by the proposed approach were selected at random to achieve a matching number.

Y chromosome and mitochondrial haplogroups

The haplogroups of TB-DAR participants were called using HaploGrep2 [54] and yhaplo [55] for the mitochondria and the Y chromosome respectively. The Phylotree mitochondrial [56] and Y chromosome [57] phylogeny databases were used to identify marker SNPs. Marker SNPs for each main haplogroup that any TB-DAR participant was part of were included as add-on SNPs, if not already present on the H3Africa array. In addition, we added marker SNPs two branch points below the main haplogroup that any TB-DAR participant was part of.

Self-reported ethnic groups of study participants.

(XLSX)

Scenarios under which add-on tags could improve genotype imputation.

The array population represents the population that the existing genotyping array is designed for. The study population represents the population that one would like to genotype and for which the add-on tags are designed. A) A target variant with a minor allele that is rare in the array population but common in the study population. B) An existing tag with a minor allele that is common in the array population but rare in the study population. C) The haplotype structure is different between the array population and the study population. (TIF)

Genetic principal components (PCs) based on variants sequenced in both the TB-DAR WGS and 1000 Genomes cohorts.

The percentage of variance explained by each PC is indicated in brackets. A) All 1000 Genomes populations, grouped according to super-populations. B) 1000 Genomes African populations. (TIF)

Imputation performance of Sub-Saharan African populations of the 1000 Genomes project and the TB-DAR WGS cohort based on the AFGR reference panel.

A) Fraction of autosomal variant sites successfully imputed (INFO > 0.8). B) Fraction of X-chromosome variant sites successfully imputed (INFO > 0.8). (TIF)

Number of add-on tag SNPs on each chromosome and the mitochondria.

A) Add-on tag SNPs selected based on Setting 1 and Setting 2. B) Existing tag SNPs on the H3Africa array. (TIF)

Haplogroups of participants within the TB-DAR WGS cohort.

A) Mitochondria. B) Y chromosome (males only). (TIF)

Schematic illustrating a single iteration of the add-on tag SNP selection algorithm.

Edges between variants represent the strength of Linkage Disequilibrium (LD), measured by Mutual Information (MI). The optimal candidate tag SNP is selected based on the best overall rank, taking into account the efficiency (e), the number of probes required (N), and the quality of the probe (Illumina probe-ability). In subsequent iterations, the newly added tags are incorporated as existing tags, such that the change in MI (δ) includes the contribution of add-on tags. (TIF)

Detailed derivation of pairwise mutual information (MI).

(PDF)

5 Jul 2021 Dear Dr. Fellay, Thank you very much for submitting your manuscript "Using population-specific add-on polymorphisms to improve genotype imputation in underrepresented populations" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The majority of the reviewers concluded that a major revision would be appropriate. Therefore, in light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. When you are ready to resubmit, please upload the following: [1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. [2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file). Important additional instructions are given below your reviewer comments. Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. 
Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts. Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Alexander Schönhuth Guest Editor PLOS Computational Biology Ville Mustonen Deputy Editor PLOS Computational Biology *********************** Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: The sub-optimal performance of genotyping array in populations that were not represented in the dataset used for its design is well-known. To address this challenge the authors have proposed an approach that would use whole-genome sequencing of representative samples from such a cohort (especially for under-represented populations) for the identification of tag SNPs that are valuable for capturing genetic diversity/LD of the population but are absent in the current genotyping array. Using a computationally improved array they further show that the inclusion of these addon tag SNPs could lead to better imputation and thereby increase the genomic coverage of the dataset. So while this study addresses a key genomic problem I have the following major concerns: a. My first major concern is that while this approach is sound and effective in theory, the implementation is not straightforward and perhaps not even feasible in some scenarios. Firstly, the cost of sequencing ~100-150 high coverage whole genome sequences is still non-trivial, especially for many African and other under-representative populations. The same amount can be spent to genotype more individuals to increase power of the association study. 
Secondly, it might be extremely difficult to convince the array manufacturers to add on thousands of novel/continent-specific SNPs, especially if the cohort is small. Thirdly, not all SNPs work on all array genotyping platforms, so normally during array design, there is a bit of back and forth between what should go in and what is technically possible; this requires the active participation of both the research group and the manufacturer's bioinformatics team and is time and energy-consuming. Some platforms even need to experimentally validate the SNPs before adding them to the array, which could further delay this. Fourthly, oftentimes these arrays are saturated in terms of bead pools so adding new bead pools might not be easy. b. The improvement of imputation by addition of tag SNPs in the array is expected, so what assumes importance in this study is the extent of the improvement and also a demonstration that the specific method/algorithm that the authors use for SNP selection is leading to improvement beyond what would have been achieved by randomly adding common novel SNPs to the array. Also, comparisons to show that the method used has better (or at least similar) performance in comparison to currently available approaches for tag SNP selection are critical. I would recommend that the authors rewrite this paper focusing on their approach for tag SNP selection and comparing its performance to competing methods. Reviewer #2: Please see the attached comments Reviewer #3: The manuscript proposes an approach to increase imputation efficiency on SNP array data from populations poorly represented in public databases. The method consists of submitting a subset of the population of interest to WGS, and then using this information to select a more suitable set of population-specific SNPs to increase the quality of imputation. Their method is shown to be useful and can be conveniently applied mainly to data from isolated populations. 
There are some minor issues outlined below that I recommend to be addressed. 1- Throughout the text (for example in the Figure 1 legend) the authors refer to common and rare SNPs. A genetic polymorphism is a phenomenon related to the allele frequencies of a single genetic locus and cannot be rare or common when a single population is considered. They are referring to rare and common alleles; it would be better to correct this, making the text more precise and clear. In addition, when the alternative allele of a genetic marker is too rare, by definition it could not even be classified as an SNP. 2- The authors present the add-on SNPs count per chromosome obtained from Setting 1 and Setting 2. Why is the pattern of the distribution observed in the barplot of add-on SNPs so different from that observed for the tag SNPs of the H3Africa array? Wouldn't it be expected that longer chromosomes have more add-on SNPs than shorter ones? Besides, Fig. S2-A (Setting 1) shows that the count for chromosome 6 is much higher than the counts obtained from all other autosomal chromosomes. Why? It is known that chromosome 6 includes the MHC region, where genotype determination may present read mapping difficulties related to the short reads generated by high-throughput sequencing. Potential confounding factors for the reliability of MHC region genotyping are the extent of sequence-level and structural polymorphism, and the choice of reference sequence. Has any extra care been taken regarding the variant calling of this specific region? Could the MHC region be inflating the chromosome 6 count of Setting 1? Could this be a source of bias? 3- Was any missingness data cleaning performed across markers? Or only at individual level? It is recommended to remove markers with high levels of missing data. 4- Regarding the estimation of within population differentiation, the method used by the authors, although creative, seems to be biased and incorrect. 
For two main reasons: (1) The two top principal components of the PCA account for only a small fraction of the total variation (less than 2% for data shown in FigS1); (2) If the researchers have no information about the existence of population substructure this estimate makes no sense. Otherwise they would have to attribute each individual to the corresponding subpopulation and then estimate Weir & Cockerham pairwise Fst; and this procedure would be correct only if the population is subdivided in exactly two subpopulations. If the researchers have no clue about substructuring and want to check it, the proper way consists in estimating the original Wright's fixation index Fst = var(p)/[p(1-p)], which will consider populations with any number of subpopulations. Since the authors have no reason to believe that there is a substructure in any population, I recommend the exclusion of this analysis or repeating the estimate in a more appropriate way. Reviewer #4: The review is uploaded as an attachment. ********** Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified. Reviewer #1: Yes Reviewer #2: No: The genotyping and WGS data are not released, so the described framework here cannot be reproduced by an external researcher. 
Reviewer #3: Yes Reviewer #4: Yes ********** PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No Reviewer #4: Yes: Gizem Taş Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at . Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5. Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. 
Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Submitted filename: XuEtAl_PLoSCOMPBIO_2021.docx

Submitted filename: PLOS Review.pdf

2 Sep 2021

Submitted filename: reponse_to_reviewers.pdf

21 Oct 2021

Dear Dr. Fellay,

Thank you very much for submitting your manuscript "Using population-specific add-on polymorphisms to improve genotype imputation in underrepresented populations" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

In particular, one reviewer (Reviewer 2) still raised somewhat more serious concerns. We would appreciate it if you addressed these concerns as a final improvement of the paper. Once we see these addressed satisfactorily, we are inclined to accept the manuscript without further re-reviewing, and quickly proceed with accepting the paper. Please also address the comment on the lack of availability of the studied data.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available.
The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Alexander Schönhuth
Guest Editor
PLOS Computational Biology

Ville Mustonen
Deputy Editor
PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Dear Authors,

The Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors have addressed most of the main concerns. There is one aspect I would like to comment on. Some variables, such as admin/MTA, logistics, and time delays (probably on the order of 3-6 months, including shipping, sequencing, selection of additional variants based on WGS, and their addition to the array), are also major considerations but have not been accounted for in this study. I would suggest that the authors add a couple of lines in the discussion to clarify this.

Reviewer #2: I’d like to thank the author for addressing my concerns and implementing new analyses based on my previous review. I think the manuscript is now much clearer. I have some additional comments regarding the current revised manuscript and new analyses:

1. On line 45, the author mentions the strategy of supplementing external reference panels with WGS samples from an internal study cohort. This is perhaps possible in the particular scenario with the AFGR panel. However, it is increasingly infeasible given the sample size of external panels such as TOPMED. It is worth acknowledging this possibility at some point in the manuscript. In fact, what is lacking in the current manuscript is an actual comparison showing that AFGR, with or without the internal Tanzanian reference, outperforms the default approach of just using the TOPMED reference panel. It would be good to show that using the smaller AFGR panel is actually an improvement over TOPMED, and that the authors’ heuristic then improves further upon that.

2. The comparison to Tagger shows that the benefit appears to be modest, if any (Table 1). Can confidence intervals be placed on these percentages so that Tagger and the Proposed Approach can be compared more directly?

3. At times the writing becomes too colloquial; for example, the use of “array population” can be confusing, especially alongside the similar term “study population”. Perhaps the authors can define these terms up front in the text if they are used only for convenience.

Reviewer #3: The authors did excellent work in improving the clarity and precision of the terms used in the text and in addressing the discussion of methodological limitations related to genetic markers contained in the MHC region. With the exclusion of the procedure that divided the sample into two halves based on the top principal component, the population structure analysis is now correct. The authors were also careful in choosing a method that takes into account the differences in sample sizes.
In addition, the section comparing the proposed approach with the Tagger software has improved the manuscript substantially, especially given the care taken by the authors, who have detailed some important aspects of Illumina's genotyping platform that could increase the cost of the genotyping process (something that the presented method proposes to reduce). After a careful review of the manuscript, I congratulate the authors on the great improvement of the text. The article is of great value and I recommend its publication as is.

Reviewer #4: The review is uploaded as an attachment.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: None

Reviewer #2: No: No information is given on how to access the genomic data; thus, technically, the study cannot be replicated using the exact same data.

Reviewer #3: Yes

Reviewer #4: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review?
For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes: Renan B. Lemes

Reviewer #4: Yes: Gizem Taş

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References: Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references.
Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Submitted filename: PLOS Review - Revised Version .pdf

2 Nov 2021

Submitted filename: reponse_to_reviewers.pdf

10 Nov 2021

Dear Dr. Fellay,

We are pleased to inform you that your manuscript 'Using population-specific add-on polymorphisms to improve genotype imputation in underrepresented populations' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.
Best regards,

Alexander Schönhuth
Guest Editor
PLOS Computational Biology

Ville Mustonen
Deputy Editor
PLOS Computational Biology

***********************************************************

11 Jan 2022

PCOMPBIOL-D-21-00653R2

Using population-specific add-on polymorphisms to improve genotype imputation in underrepresented populations

Dear Dr Fellay,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Orsolya Voros

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
