
Imputation-Aware Tag SNP Selection To Improve Power for Large-Scale, Multi-ethnic Association Studies.

Genevieve L Wojcik¹, Christian Fuchsberger^2,3, Daniel Taliun², Ryan Welch², Alicia R Martin¹, Suyash Shringarpure¹, Christopher S Carlson⁴, Goncalo Abecasis², Hyun Min Kang², Michael Boehnke², Carlos D Bustamante^1,5, Christopher R Gignoux⁶, Eimear E Kenny^7,8,9,10.

Abstract

The emergence of very large cohorts in genomic research has facilitated a focus on genotype-imputation strategies to power rare variant association. These strategies have benefited from improvements in imputation methods and association tests; however, little attention has been paid to ways in which array design can increase rare variant association power. Therefore, we developed a novel framework to select tag SNPs using the reference panel of 26 populations from Phase 3 of the 1000 Genomes Project. We evaluate tag SNP performance via mean imputed r2 at untyped sites using leave-one-out internal validation and standard imputation methods, rather than pairwise linkage disequilibrium. Moving beyond pairwise metrics allows us to account for haplotype diversity across the genome to improve imputation accuracy and demonstrates population-specific biases from pairwise estimates. We also examine array design strategies that contrast multi-ethnic cohorts vs. single populations, and show a boost in performance for the former can be obtained by prioritizing tag SNPs that contribute information across multiple populations simultaneously. Using our framework, we demonstrate increased imputation accuracy for rare variants (frequency < 1%) by 0.5-3.1% for an array of one million sites and 0.7-7.1% for an array of 500,000 sites, depending on the population. Finally, we show how recent explosive growth in non-African populations means tag SNPs capture on average 30% fewer other variants than in African populations. The unified framework presented here will enable investigators to make informed decisions for the design of new arrays, and help empower the next phase of rare variant association for global health.


Keywords: Genomics; Imputation; Statistical Genetics; array design; tag SNPs


Year: 2018 PMID: 30131328 PMCID: PMC6169386 DOI: 10.1534/g3.118.200502

Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154

There is a growing recognition in genomic research of the need for very large-scale association studies, and genome-wide arrays are often the cost-efficient technology of choice. In this study we explore ways to improve array design for rare variant imputation, an underused means to increase power in association studies. We describe a pipeline in which an array is empirically evaluated based on genome-wide imputation accuracy, rather than pairwise linkage disequilibrium, to improve tagging and give real-world estimates of array performance. We explore the impact of patterns of demography on array performance, and discuss the trade-off between accurate rare variant imputation and trans-ethnic utility. This work provides a framework and insights that can guide the next generation of array development.

The vast majority of human genomic variation is rare (Nelson ), and an appreciable fraction of rare variants are likely to be functionally consequential. (Kircher ) The gold standard approach to assay rare variation (MAF < 1%) is via deep sequencing. So far, large-scale sequencing studies have had some, but limited, success for discovery of rare variant associations (Emond ; Lohmueller ; SIGMA Type 2 Diabetes Consortium ; UK10K Consortium ). There is a new appreciation that studies of hundreds of thousands or millions of individuals will be needed to drive well-powered discovery efforts. (Lindquist ; Kosmicki ) Currently, genome sequencing on this scale is prohibitively expensive and computationally burdensome. In contrast, genome-wide genotyping arrays are inexpensive, with far less bioinformatic overhead compared to sequencing. The past decade of genomic research has seen the development of myriad commercial high-throughput genotyping arrays. (Hoffmann ; Hoffmann ) While initially designed to capture common variants (International HapMap Consortium 2003), in recent years arrays have been leveraged to capture variation at the rare end of the frequency spectrum. 
One strategy is to ascertain rare variants directly on arrays, which is restricted to a very narrow subset of the rare variant spectrum due to array size limits. (Igartua ; Wessel ; McCarthy ) Another strategy is to leverage the haplotype structure determined by common variants on the array, which form a ‘scaffold’, for accurate inference of un-genotyped variation through multi-marker imputation into sequenced reference panels of whole genomes. The strategy of genotyping, followed by imputation, has the potential to recover rare untyped variants in very large cohorts of arrayed samples at no additional experimental cost. (Huang ; Michailidou ) Imputation increases the effective sample size, leading to increased statistical power. (Pritchard and Przeworski 2001) This model bridging genotyping and imputation has prompted efforts to build deep reference sequence databases and a renewed interest in methods for improving genome-wide scaffold design. (1000 Genomes Project Consortium ; UK10K Consortium ; McCarthy ). Genotype array scaffolds have historically been designed using algorithms that select tagging single nucleotide polymorphisms (tag SNPs) that are in linkage disequilibrium (LD) with a maximal number of other SNPs. Tag SNP algorithms are optimized to maximize this score, typically described as pairwise coverage. However, imputation tools increasingly incorporate sophisticated haplotype information to impute unobserved variants. (Howie ; Fuchsberger ; Browning and Browning 2016) Consequently, it is not clear that tag SNPs that maximize pairwise coverage will be the tag SNPs that provide, in aggregate, the best GWAS scaffold for accurate imputation. (de Bakker ) Further, most tag SNP selection algorithms use LD architecture in a single population (Weale ; Carlson ), while we know LD patterns can vary extensively between populations. 
(1000 Genomes Project Consortium ) Historically, many commercial arrays were designed by selecting tag SNPs from European populations, although arrays targeting some other populations have recently entered the market. (Hoffmann ; Hoffmann ) The number of SNPs tagged by a tag SNP can vary appreciably between populations due to demographic forces of migration, population expansion, and genetic drift. This may diminish GWAS scaffold performance in populations other than those in which the tag SNPs were selected, which in turn, can lead to reduced power for imputation-based association. This is a particularly pernicious problem in populations for which no targeted commercial array is available, in studies with multi-ethnic populations, and for accurate estimation of the transferability of genetic risk across populations. As association studies grow larger and increasingly diverse, there is a need to reassess design criteria for GWAS scaffolds and arrays. (Carlson ; Fuchsberger ) On the one hand, tag SNPs that tag lower frequency variants are likely to be on the lower end of the site frequency spectrum and, consequentially, more geospatially restricted. (Nelson ; Bustamante ; Gravel ; Mathieson and McVean 2014) On the other hand, as studies grow very large, cohort heterogeneity is likely to increase substantially. (Banda ; Marouli ) Given finite GWAS scaffold density, examining the trade-off between lowering the frequency threshold for accurate imputation and extending utility to multiple populations will become important. (Nelson ; Martin ) In this manuscript, we describe a framework for developing well-powered tag SNP selection leveraging thousands of whole genomes from diverse populations for balanced cross-population coverage. In our study, genomic coverage is evaluated based on genome-wide imputation accuracy as measured by mean imputed r2 at untyped sites, rather than pairwise linkage disequilibrium. 
Moving beyond pairwise metrics allows us to account for haplotype diversity across the genome and demonstrates population-specific biases from pairwise estimates. Assessing accuracy using leave-one-out cross-validation yields a real-world estimate of genomic coverage. We examine the effect of allele frequency, correlation thresholds, and population diversity on the selection of tag SNPs and on the landscape of tag-able variation. This work demonstrates that, while there may be limits given current reference panels, improving GWAS scaffold design is an underused means to increase power in association studies.

Materials and Methods

Genetic Data

The genetic data are from the 1000 Genomes Project (1000 Genomes) Phase 3 data release, version 2 (7/8/2014) containing whole genome sequences for 2,535 individuals from 26 global populations. (1000 Genomes Project Consortium ) Sequence data were in VCFv4.1 format, mapped to the forward strand, with variants annotated as reference or alternate alleles. Only biallelic SNPs were included in this analysis (77,224,748 SNPs total). A list of known cryptically related individuals was obtained from the 1000 Genomes FTP site, and one individual from each related pair was subsequently removed (n = 62). Individuals were assigned to their super populations according to the original 1000 Genomes assignments (EAS = East Asian, EUR = European, AFR = African, SAS = South Asian, AMR = Americas, comprising 503, 501, 495, 477, and 341 individuals, respectively). Two populations of admixed African ancestry (ASW and ACB) were removed from the African super population and formed a separate African American/Caribbean (AAC) super population (n = 156).

Tag SNP Selection

Allele frequency was estimated within each super population for each SNP using Plink v1.9. (Chang ) Linkage disequilibrium (LD) was also calculated within each super population using Plink v1.9, with settings for pairwise linkage of a minimum r2 of 0.2 within a maximum distance of 1 megabase (Mb). Tag SNP selection was performed per chromosome in the program TagIt (Weale ) (https://github.com/statgen/TagIt), with frequency and LD files for each super population as input. The TagIt algorithm analyzed each super population separately. After filtering based on the minor allele frequency (set as either 0.5%, 1% or 5%), TagIt annotates the tag SNP that has the highest number of LD pairs with r2 above a minimum threshold (set as either 0.2, 0.5, or 0.8). The selected tag SNP and all of its linked SNPs are masked, and TagIt finds the next tag SNP with the highest number of LD pairs. For each index tag SNP, the output for each super population included the number of sites in LD, as well as the number of unique sites that had not already been tagged by a previously chosen tag. The number of unique SNPs tagged across all populations per tag SNP was tallied in the final output.
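The greedy selection loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the TagIt implementation: the function name `greedy_tag_selection`, the in-memory inputs, the choice to apply the MAF filter to both members of each LD pair, and the toy SNP identifiers are all hypothetical, and TagIt itself reads frequency and LD files.

```python
from collections import defaultdict


def greedy_tag_selection(ld_pairs, maf, min_maf=0.01, min_r2=0.5):
    """Greedy tag SNP selection within one super population (sketch).

    ld_pairs: iterable of (snp_a, snp_b, r2) tuples.
    maf: dict mapping SNP id -> minor allele frequency.
    Returns a list of (tag, newly_tagged_sites) in selection order.
    """
    # Keep only pairs that pass the r2 threshold and whose SNPs both
    # pass the MAF filter (assumption: the filter applies to both).
    neighbors = defaultdict(set)
    for a, b, r2 in ld_pairs:
        if r2 >= min_r2 and maf.get(a, 0) >= min_maf and maf.get(b, 0) >= min_maf:
            neighbors[a].add(b)
            neighbors[b].add(a)

    masked = set()   # tags already chosen plus every site they tag
    selected = []
    while True:
        best, best_new = None, set()
        for snp in sorted(neighbors):        # sorted for deterministic ties
            if snp in masked:
                continue
            new = neighbors[snp] - masked    # sites not yet tagged
            if len(new) > len(best_new):
                best, best_new = snp, new
        if best is None:                     # nothing left to tag
            break
        selected.append((best, sorted(best_new)))
        masked.add(best)
        masked |= best_new
    return selected


# Toy data with hypothetical SNP ids.
toy_ld = [("rs1", "rs2", 0.9), ("rs1", "rs3", 0.6), ("rs2", "rs3", 0.7),
          ("rs4", "rs5", 0.8), ("rs5", "rs6", 0.3)]
toy_maf = {f"rs{i}": 0.1 for i in range(1, 7)}
toy_tags = greedy_tag_selection(toy_ld, toy_maf)
```

In this toy example the first pick is the SNP with the most LD partners, its partners are masked, and the loop then moves on to the remaining untagged block; the pair with r2 of 0.3 never contributes because it falls below the threshold.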

Cross-population tag SNP ranking and scoring

The naive approach ranked potential tags by the absolute number of unique SNPs that are tagged across all super populations. From this list, the top SNPs were selected for the appropriate allocation. To ensure performance of the tags across multiple populations, the cross-population prioritization schema first ranks tags by the number of populations in which they are informative, meaning they tag at least one site (Supplementary Figure 1). This ensures that the top ranked SNPs are not biased toward a super population with large LD blocks or high SNP density, in which one tag can contribute information about many other SNPs. Within each of these categories (all 6 super populations down to only 1 super population), the tags are ranked by the number of unique sites tagged across all six super populations, as was done in the naive approach. The appropriate allocation is selected from the top of this list, scaled to the size of the chromosome of interest.
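The two ranking schemes can be contrasted with a short sketch. It assumes, for illustration only, that each candidate tag comes with the set of sites it tags in each super population; the names `rank_tags` and `select_allocation`, the alphabetical tie-breaking, and the toy data are hypothetical and are not taken from the TagIt output format.

```python
def rank_tags(tagged_sites, cross_population=True):
    """Rank candidate tags (sketch).

    tagged_sites: dict tag -> dict super_population -> set of tagged sites.
    Naive ranking: by number of unique sites tagged across all populations.
    Cross-population ranking: first by the number of populations in which
    the tag is informative (tags at least one site), then by unique sites.
    """
    def n_informative(tag):
        return sum(1 for sites in tagged_sites[tag].values() if sites)

    def n_unique(tag):
        return len(set().union(*tagged_sites[tag].values()))

    if cross_population:
        key = lambda t: (-n_informative(t), -n_unique(t), t)
    else:
        key = lambda t: (-n_unique(t), t)
    return sorted(tagged_sites, key=key)


def select_allocation(ranked, allocation):
    """Take the top of the ranked list for a given per-chromosome allocation."""
    return ranked[:allocation]


# Toy data: tagA tags many sites in a single population only.
toy = {
    "tagA": {"AFR": {"s1", "s2", "s3", "s4", "s5"}, "EUR": set()},
    "tagB": {"AFR": {"s6"}, "EUR": {"s6", "s7"}},
    "tagC": {"AFR": {"s8", "s9"}, "EUR": {"s9", "s10"}},
}
naive_order = rank_tags(toy, cross_population=False)
cross_order = rank_tags(toy, cross_population=True)
```

With an allocation of two tags, the naive ordering keeps `tagA` (many sites, one population), whereas the cross-population ordering keeps the two tags that are informative in both populations.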

Metric of Performance

Coverage and imputation accuracy were assessed using all polymorphic biallelic sites within the 1000 Genomes Phase 3 data release, version 2. Sites were categorized into ten discrete minor allele frequency bins: (0.005-0.01], (0.01-0.02], (0.02-0.03], (0.03-0.04], (0.04-0.05], (0.05-0.1], (0.1-0.2], (0.2-0.3], (0.3-0.4], and (0.4-0.5]. The term “coverage” is used to denote the proportion of untyped sites that had at least one tag SNP with pairwise r2 greater than a certain threshold (0.2, 0.5, or 0.8). Imputation accuracy was determined through a leave-one-out internal validation approach with the 1000 Genomes Project Phase 3 data using a modified version of Minimac. (Fuchsberger ) For this approach, each individual within the 1000 Genomes data had the appropriate tag SNPs denoted as ‘genotyped’, with all other sites set as missing. These missing sites were then imputed using the rest of the 1000 Genomes panel as a reference. Correlation was calculated by comparing the estimated dosages from this imputation to the true genotypes from the original VCF files. While this internal validation approach may introduce overfitting of the data and an upward bias of imputation accuracy, we sought the relative imputation accuracy for different methods and do not see any bias altering the described trends.
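The two metrics defined above differ in what they measure, which a small sketch makes concrete. It assumes that imputation accuracy at a site is the squared Pearson correlation between imputed dosages and true genotypes, and that the frequency bins are ten contiguous half-open intervals from 0.005 to 0.5; the function names and the toy values are hypothetical, and none of this reproduces the modified Minimac pipeline.

```python
def dosage_r2(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages and true genotypes."""
    n = len(dosages)
    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n
    cov = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    if var_d == 0 or var_g == 0:
        return float("nan")          # undefined for monomorphic input
    return cov * cov / (var_d * var_g)


def pairwise_coverage(untyped_sites, best_tag_r2, threshold=0.5):
    """Proportion of untyped sites with at least one tag at r2 > threshold.

    best_tag_r2: dict site -> highest pairwise r2 with any tag SNP.
    """
    covered = sum(1 for s in untyped_sites if best_tag_r2.get(s, 0.0) > threshold)
    return covered / len(untyped_sites)


# Assumed bin edges: ten contiguous bins, each open on the left, closed on the right.
MAF_BIN_EDGES = [0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]


def maf_bin(maf):
    """Return the (low, high] bin containing maf, or None if out of range."""
    for lo, hi in zip(MAF_BIN_EDGES, MAF_BIN_EDGES[1:]):
        if lo < maf <= hi:
            return (lo, hi)
    return None


perfect = dosage_r2([0, 1, 2], [0, 1, 2])
coverage = pairwise_coverage(["a", "b", "c", "d"], {"a": 0.9, "b": 0.5, "c": 0.6})
```

Note that coverage depends on the chosen r2 threshold (site `b` at exactly 0.5 is not counted when the threshold is 0.5), while the dosage-based accuracy is computed per site and then averaged within each frequency bin.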

Ascertainment Bias Analyses

Population-specific tags were selected separately through TagIt for each super population with a genome-wide allocation of 500,000 sites. All tags had a minimum MAF of 1% and a minimum r2 threshold of 0.5. Each of the single-population ascertained tag lists was assessed for imputation accuracy in all six super populations, including its index population. Imputation accuracy was calculated as previously described and limited to chromosome 9.

Local Ancestry

Local ancestry was estimated using RFMix (Maples ) assuming three ancestral backgrounds: African, European, and Native American, and is described in detail in (Martin ). Tracts were dropped if smaller than 20 cM to improve accuracy in local ancestry estimation. Diploid ancestry with three ancestral backgrounds yielded six categories of variation. Imputation accuracy was then calculated separately per diploid tract category, with all other sections masked out. Results were aggregated across all chromosomes to calculate the genome-wide performance per diploid ancestry. Tracts were removed from analysis if the ancestral diplotype was found in fewer than 5 individuals. This included AFR-NAT and EUR-NAT within ACB, which each occurred in only 2 individuals, NAT-NAT diplotypes in ASW, which occurred in one individual, and AFR-AFR diplotypes in MXL, which occurred in 3 individuals.
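The count of six diploid categories follows from taking unordered pairs, with repetition, of the three ancestral backgrounds. A brief sketch, with hypothetical function names and one hypothetical count (the AFR-AFR value) added purely to show the filter:

```python
from itertools import combinations_with_replacement

ANCESTRIES = ["AFR", "EUR", "NAT"]


def diploid_categories(ancestries):
    """All unordered pairs of haploid ancestries, e.g. 'AFR-EUR'."""
    return ["-".join(pair) for pair in combinations_with_replacement(ancestries, 2)]


def keep_categories(individual_counts, min_individuals=5):
    """Drop diplotype categories observed in fewer than min_individuals people."""
    return {c: n for c, n in individual_counts.items() if n >= min_individuals}


categories = diploid_categories(ANCESTRIES)

# AFR-NAT and EUR-NAT counts for ACB are from the text; the AFR-AFR
# count is a made-up placeholder for illustration.
acb_counts = {"AFR-NAT": 2, "EUR-NAT": 2, "AFR-AFR": 90}
acb_kept = keep_categories(acb_counts)
```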

Cross-population patterns of linkage disequilibrium

To determine how many sites were in LD with tag SNPs across all 6 super populations, we selected one million SNPs for a GWAS scaffold using a minimum r2 of 0.5 and a minimum MAF of 0.01 on chromosome 9. We calculated the number of polymorphic sites (MAF > 0.5%) and the proportion of these sites that were in LD (r2 > 0.5 or r2 > 0.8) with at least one tag marker. To determine sharing of tags across multiple populations, we calculated the proportion of tag markers that were informative in other populations, conditional upon them being informative in the index population. The proportion of sites shared among multiple populations was calculated as the proportion of tag SNPs that performed in a certain number of populations (from 1 to 6 super populations) per super population.
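The sharing statistics described here reduce to set operations once each super population has a set of informative tags. The sketch below assumes that representation; the function names and toy tag identifiers are hypothetical.

```python
from collections import Counter


def conditional_sharing(informative, index_pop, other_pop):
    """Proportion of tags informative in other_pop, given informative in index_pop.

    informative: dict super_population -> set of informative tag ids.
    """
    index_tags = informative[index_pop]
    if not index_tags:
        return float("nan")
    return len(index_tags & informative[other_pop]) / len(index_tags)


def sharing_distribution(informative, index_pop):
    """For tags informative in index_pop, the proportion that are informative
    in exactly k super populations (k counts the index population itself)."""
    counts = Counter()
    for tag in informative[index_pop]:
        k = sum(1 for tags in informative.values() if tag in tags)
        counts[k] += 1
    n = len(informative[index_pop])
    return {k: c / n for k, c in sorted(counts.items())}


toy_informative = {
    "AFR": {"t1", "t2", "t3", "t4"},
    "EUR": {"t1", "t2"},
    "EAS": {"t1"},
}
afr_to_eur = conditional_sharing(toy_informative, "AFR", "EUR")
afr_dist = sharing_distribution(toy_informative, "AFR")
```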

Tagging Potential

Tag SNPs were selected with a minimum r2 of 0.5 and a minimum MAF of 0.01 on chromosome 9. The potential for tagging was determined assuming an infinite site scaffold, using all possible tags until every pairwise relationship with r2 above 0.5 was captured. The average number of sites captured per tag was calculated in each super population separately, using only the tags that were informative within that population. We also calculated these trends assuming a scaffold of one million sites, following the same procedures. The “dark sites” were calculated as sites in which there was no pairwise correlation with any other site with r2 > 0.2, determined separately for each super population.
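Both quantities in this section are simple summaries of the LD pair list and the selected tags. A minimal sketch, assuming tags are represented as (tag, tagged sites) pairs and LD as (site, site, r2) tuples; the function names and toy values are hypothetical.

```python
def mean_sites_per_tag(selected):
    """Average number of sites captured per informative tag.

    selected: list of (tag, tagged_sites); tags that capture no site in
    this population are not informative and are excluded.
    """
    sizes = [len(sites) for _, sites in selected if len(sites) > 0]
    if not sizes:
        return float("nan")
    return sum(sizes) / len(sizes)


def dark_sites(all_sites, ld_pairs, min_r2=0.2):
    """Sites with no pairwise correlation above min_r2 with any other site."""
    linked = set()
    for a, b, r2 in ld_pairs:
        if r2 > min_r2:
            linked.add(a)
            linked.add(b)
    return sorted(set(all_sites) - linked)


avg = mean_sites_per_tag([("t1", ["a", "b", "c"]), ("t2", ["d"]), ("t3", [])])
dark = dark_sites(["s1", "s2", "s3", "s4"],
                  [("s1", "s2", 0.9), ("s3", "s4", 0.1)])
```

Running either function separately per super population mirrors the per-population reporting described above.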

Data Availability

The input data from the 1000 Genomes Project, Phase 3, are publicly available at the following link: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/. The program TagIt is available on GitHub (https://github.com/statgen/TagIt), as is a tutorial on how to select tag SNPs as detailed in this manuscript (https://github.com/chrisgene/crosspoptagging). Supplemental material is available at Figshare: https://doi.org/10.25387/g3.6626762.

Results

Assessing population-specific imputation accuracy with standard GWAS scaffold design

First we designed an experiment to assess imputation accuracy performance comparing tag SNP selection from different populations. This experiment mimics the current design of many commercial arrays, in which tag SNPs were selected to capture primarily the variation in a single population or a closely related group of populations. We built a pipeline using the 26 population reference panel from Phase 3 of the 1000 Genomes Project and the TagIt algorithm (Taliun) for tag SNP selection. (Weale ) (Supplementary Table 1) Individuals were split into mutually exclusive “super populations.” These included the Admixed American (AMR), East Asian (EAS), European (EUR), and South Asian (SAS) populations as described in (1000 Genomes Project Consortium ). In addition, we divided the African super population into two groups: four populations from Africa (AFR) and two populations of African descent in the Americas (AAC) (see Methods). Initially, to mimic the design of many arrays, tag SNPs were selected from a single super population. We assumed a genome-wide allocation of 500,000 tag SNPs; however, analyses for a single population tagging strategy were only conducted on chromosome 9, with an allocation of 21,107 sites proportional to the physical distance of chromosome 9 compared to all chromosomes combined. Potential tags were required to have a minor allele frequency (MAF) ≥ 1% and be in pairwise LD with the tagged target site with an r2 ≥ 0.5. The current generation of phase-based imputation algorithms (BEAGLE, IMPUTE2, Minimac3) leverages local haplotype information and sequenced reference panels to improve accuracy of variant inference compared to tag SNP approaches. (Marchini ; Browning and Browning 2007; Howie ; Marchini and Howie 2010; Fuchsberger ; Browning and Browning 2016) Therefore, optimal array design depends not only on tag SNP selection, but also on empirical evaluation of imputation performance. 
For each of the population-specific GWAS scaffolds, imputation accuracy was assessed in all six super populations by MAF bins (common, MAF = 0.05-0.5; low frequency, MAF = 0.01-0.05; and rare, MAF < 0.01) by comparing the imputed dosages to the real genotypes through leave-one-out internal validation (see Methods). Consistently across all super populations, the population from which the tags were ascertained had the highest imputation accuracy in the common bin. (S1 Fig) Trends in imputation accuracy follow known patterns of demography. For example, if the tags were ascertained in European populations, imputation accuracy was best in Europeans (EUR), followed by out-of-Africa populations (AMR, SAS, EAS), and worst in African ancestry populations (AFR, AAC). (Figure 1) If the tags were ascertained in African populations, the inverse was observed. (S1 Fig) As expected, the same trend of reduced imputation accuracy in non-ascertained populations was exacerbated in the low frequency bin. Imputation of low frequency variants in East Asian populations (EAS) was consistently most challenging; even when tag SNPs were selected from EAS, accuracy of low frequency imputation was the same or better in other populations. This can be explained by evidence of a recent tight bottleneck followed by rapid population growth in EAS, resulting in a large proportion of rare variants that are difficult to tag due to lower LD, especially with a limited scaffold of 500,000 sites. (Gravel ) In contrast, the imputation performance of tag SNPs ascertained in AFR, AMR, and AAC populations is the same or better compared to the performance in out-of-Africa populations. This is likely due to increased allelic heterogeneity in African ancestry populations, which results in greater haplotypic diversity and a higher chance that a rare variant is well tagged by a haplotype for imputation. 
(1000 Genomes Project Consortium ) The higher imputation accuracy of AMR in the rare frequency bin (MAF 0.5–1%), independent of the ascertainment population, is likely due to longer haplotypes resulting from recent admixture, allowing the rare variation to be captured accurately given the limited allocation. (Gravel ) Importantly, in each case we observe a notable drop-off in performance across most of the frequency spectrum when examining imputation coverage in populations diverging from the one used for tag SNP selection. (S1 Fig)

Figure 1

Imputation Accuracy by super population of tags selected in European populations for a scaffold assuming 500,000 genome-wide variants. Tags were required to have a MAF ≥ 1% and r2 ≥ 0.5 with target sites. This trend is observed across all super populations (S1 Fig).

Comparing single vs. cross population tag SNP selection strategies

When developing a genotyping platform, it is useful to assess whether selected tag SNPs segregate in the population of interest and contribute to tagging by being in LD (high r2) with untagged sites. For example, using Illumina’s OmniExpress platform (Illumina) within the 1000 Genomes Project data, over 99.7% of the sites will be polymorphic (MAF > 0.5%) in the overall dataset. However, when we stratify by super population, each group has a differential loss due to monomorphic sites. AFR loses <1% of sites because of a MAF < 0.5%, whereas EUR and EAS lose 4.4% and 9.2% of variants, respectively. Reduction in tagging can result in loss of statistical power for downstream analysis. We quantify this as “informativeness”, or the ability of a tag SNP to both segregate in the population and provide LD information (r2 > 0.5 with at least one untagged site). Balancing representation of variation across all groups becomes very important in multi-ethnic studies. To explore different approaches for GWAS scaffold design we compared three strategies for selecting tag SNPs: single population tag SNP ascertainment, in which all tags are selected from a single population; a ‘naïve’ approach, in which all populations are combined and tags are selected based on composite statistics derived from this multi-population pool; and a ‘cross-population prioritization’ approach, in which tags are prioritized both by whether they are informative in multiple populations and by the number of unique sites targeted across all groups (see Methods and S2 Fig). We generated lists of tags per method assuming a total genome-wide allocation of 500,000 sites and minimum thresholds of r2 > 0.5 and minor allele frequency (MAF) ≥ 1%. Using these parameters, an exhaustive set of tag SNPs was selected using the naïve approach, with tags ranked by the absolute number of sites tagged across the 6 super populations, regardless of how many super populations had LD between tags and targets. 
We then re-ranked them using the cross-population prioritization approach (S2 Fig). To compare the three approaches, we tallied the number of informative tags per population for each method to investigate the added value of tags contributing information in multiple populations. (Figure 2) This was done for all 22 autosomes. As per the design, all the single-population tags were informative within the super population from which tag SNPs were selected. Comparing the naïve and cross-population approaches that selected tag SNPs across all populations, the cross-population prioritization approach increased the number of informative tag SNPs in all populations relative to the naïve approach. In the naïve approach, we observed that the majority of tag SNPs were selected from the AFR population, followed by AAC, due to African-descent populations having more polymorphic sites across the genome with lower linkage disequilibrium. (Henn ; 1000 Genomes Project Consortium ) In the cross-population prioritization approach, by contrast, variation specific to a single population is down-weighted, leading to more balanced representation between all 6 super populations. By leveraging cross-population information, the largest boost in the proportion of tag SNPs contributing linkage disequilibrium information compared to the naïve approach was observed in non-African descent populations (10.5%, 28.6%, 25.9%, and 28.7% in AMR, EAS, EUR and SAS, respectively). Even the African descent populations (AFR and AAC), which dominate the naïve approach, have a higher proportion of tags in linkage disequilibrium with target sites with the cross-population prioritization approach (a 2.2% and 1.0% boost for AAC and AFR, respectively).

Figure 2

Proportion of tags that are informative by population with the three methods. (Left, lightest) tags selected from only a single population, (Center) tags selected by pooling all populations agnostically, and (Right) tags selected with the cross-population prioritization approach. Tag SNPs were informative if they were in linkage disequilibrium (r2 > 0.5) with at least one untagged site.

To assess performance across the frequency spectrum we also stratified our accuracy estimates by super population-specific MAF into common, low frequency, and rare bins, as previously described. We observed that the cross-prioritization approach results in a larger proportion of tags being informative compared to both the single-population and naïve approaches for common tag SNPs (MAF > 0.05) in all super populations. This is likely because the cross-prioritization approach prioritizes potential tag SNPs that provide LD information across multiple populations, therefore prioritizing common variants tagging common variation. However, by limiting tag SNP selection to these common variants only, the proportion of tags that provide LD information for low frequency variants is decreased compared to the single population approach, which had the highest proportion of informative tag SNPs in the low and rare frequency bins in the target population. For example, when tags were ascertained using only AAC LD information, 19.5% of the 500,000 SNP scaffold was informative for rare variation (MAF < 1%) and 62.8% for common variation (MAF > 5%) within AAC populations. When the cross-population approach was used, ensuring the prioritization of common variation, the proportion of tag SNPs informative for rare variation dropped to 6% while the proportion informative for common variation jumped to 82.4%. This is consistent with low frequency and rare variants being population-specific, and therefore not tagged by cosmopolitan common variation present in multiple populations. 
A notable exception is that the naïve approach contributes the most LD information for rare variants in the AMR super population. This is consistent with our previous findings showing highest imputation accuracy in the rare variation within AMR, even when the population from which tag SNPs were ascertained was different. The AMR on average exhibit longer haplotype lengths from the recently admixed populations in the Americas. (Gravel ; 1000 Genomes Project Consortium ) Because of the long haplotype tract lengths, more limited haplotypic diversity, and the limited allocation of tag SNPs, a naïve approach emphasizing the absolute number of unique sites up-weights variation that is informative for at least one of the ancestral components present in these populations.

Cross population prioritization of tag SNPs increases imputation accuracy for all groups across the frequency spectrum compared to naïve approach

The goal of tag SNP selection is to inform the unmeasured haplotypes, and therefore the performance of tag SNPs must be evaluated in aggregate. One way to assess this is through imputation accuracy. Following the observation that cross-population prioritization selects a higher proportion of informative common tag SNPs for each population, even compared to the single population approach, we next assessed what impact this would have on imputation accuracy. We deployed the same leave-one-out internal cross validation approach as before using the 1000 Genomes Project populations (see Methods). We again assumed a genome-wide scaffold of 500,000 sites, and tags had to have a MAF > 1% and r2 > 0.5 with tagged sites. In the non-African, non-admixed continental populations (EAS, EUR, and SAS), imputation accuracy was highest across all population-specific minor allele frequency bins when ascertaining in the target population. (S3 Fig) For the two African descent groups (AAC and AFR), the cross-population prioritization approach had the highest imputation accuracy across all sites. When stratified by MAF bins, the increase in informative tag SNPs for common variants with the cross population approach yielded higher imputation accuracy for common variation in all super populations. As previously seen, the population-specific nature of low frequency and rare variants led to decreased imputation accuracy in non-African descent populations for both the cross-population and naïve approaches when compared to targeted single-population ascertainment. The cross-population prioritization approach had higher imputation accuracy than the naïve approach for all MAF bins. As scaffold size can dramatically affect imputation accuracy (Spencer ), we additionally examined allocations of 250,000, 500,000, 1,000,000, 1,500,000, and 2,000,000 genome-wide tags, which were all selected with r2 > 0.5 and MAF > 0.01. These allocations approximate the size range of many commercially available arrays. 
The cross-population prioritization scheme yielded higher imputation accuracy than the naïve method for all super populations across all minor allele frequency bins with tags selected. (Figure 3) The biggest improvement came with the smaller array sizes. The most marked improvement was found in EAS, which originally had the lowest imputation accuracy of the 6 super populations with the naïve approach. Within EAS groups, the cross-population approach increased imputation accuracy overall by 9.8% (from 67.3 to 77.1%) for a tag scaffold of 250,000 sites. For a scaffold of 500,000 sites, an overall improvement of 6.2% was observed (from 77.4 to 83.6%). Improvements were largely consistent with the increase of informative tag SNPs. (Figure 2) As with the naïve prioritization approach, SNPs were disproportionately informative within AFR and AAC, consistent with admixed ancestry reflected by reference panels. For the smaller sizes (250K), the greatest increase in performance incorporating cross-population information was found within common SNPs (MAF > 5%). However, the larger sized scaffolds (1-2 million) showed the most improvement within the low frequency bins (MAF < 5%).

Figure 3

Increased imputation accuracy with cross-population prioritization (solid line) vs. naïve approach (dashed line) for a minimum pairwise correlation threshold of r2 > 0.5 and MAF > 1% across different scaffold sizes. Imputation accuracy was calculated separately within minor allele frequency bins for each super population.

Imputation accuracy varies by local ancestry background in admixed individuals

We also assessed imputation accuracy stratified by local ancestry diplotype in the two admixed populations, the AAC and AMR, for a genome-wide allocation of 500,000 tag SNPs. First, using phased data, we inferred haploid tracts of African, European, and Native American local ancestry along the genomes of all individuals in the AMR and AAC populations (see Methods, (1000 Genomes Project Consortium ; Martin )). Then each variant was inferred to be on one of six ancestral diploid tracts: European-European (EUR-EUR), European-African (EUR-AFR), European-Native American (EUR-NAT), African-Native American (AFR-NAT), African-African (AFR-AFR) and Native American-Native American (NAT-NAT). In all local ancestry strata the cross-population prioritization yielded improved imputation accuracy when compared to the naïve approach. In the ASW population (Americans of African ancestry in South West US), performance was high overall, with all diploid tracts having imputation accuracies of 92.8–96.8% for all sites with minor allele frequency above 1%. (S4 Figure) The lowest imputation accuracy was found in AFR-AFR tracts, especially at the lower end of the frequency spectrum. The highest imputation accuracy was found in EUR-EUR tracts (94% overall for ASW). In AMR populations, by contrast, the NAT-NAT tracts had the lowest performance of all. An example can be seen in the MXL population (Mexican Ancestry from Los Angeles), where the highest imputation accuracy was found in the AFR-EUR tracts (overall imputation accuracy of 90.1% for all SNPs with MAF > 0.5%) and the lowest within NAT-NAT tracts (74.8% for all SNPs with MAF > 0.5%). (S4B Fig) These performances could be reflective of the relative availability of reference data relevant to these specific ancestral components.

Evaluating impact of r2 and MAF thresholds on tag SNP performance

Previous standards in scaffold design have considered minimum linkage disequilibrium (r2) and minor allele frequency (MAF) thresholds when prioritizing possible tag SNPs. However, the impact of these thresholds is often evaluated through pairwise coverage. We explored varying the minimum r2 threshold (0.2, 0.5, 0.8) and MAF (0.5%, 1%, 5%) to assess their impacts on imputation accuracy, as well as pairwise coverage, assuming a genome-wide allocation of one million tags. For common variants, a higher minimum r2 threshold (r2 > 0.8) resulted in slightly higher imputation accuracy. (Figure 4A) However, the sites in the low frequency and rare bins demonstrate population-specific accuracy only. (S5 Fig) For AFR, SAS, and EAS, a less stringent threshold of r2 > 0.2 had the worst imputation accuracy across all frequency bins. Low frequency and rare variation had higher imputation accuracy for an r2 threshold of 0.5 compared to 0.8. Within AAC, AMR, and EUR, the low frequency variation had improved imputation accuracy with the lowest r2 threshold of 0.2. However, the imputation accuracy within this low threshold was notably compromised for common variants. This indicates that low frequency variation is better captured by weak correlation structure, but at a cost to common variation in these populations. Analyses performed with r2 > 0.5 had the best balance of performance across all frequency bins with the highest overall imputation accuracy in all super populations except for EAS. (S2 Table) Overall, there were very small differences in imputation accuracy between the different r2 thresholds. There were much larger differences in coverage, including both coverage evaluated with minimum r2 (LD) of 0.5 and 0.8. (Figure 4A) Additionally, the best “performance” using pairwise coverage was highly dependent on the definition of coverage. Specifically, if pairwise coverage was calculated as the proportion of sites that are in LD with r2 > 0.5, then the best minimum r2 threshold in tag SNP selection will be 0.5. This holds true for r2 > 0.8 as well.
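The dependence on the coverage definition can be illustrated with a small Python sketch. The numbers below are invented for illustration only and are not results from the study; `pairwise_coverage` and the two lists are hypothetical names. Each list holds, for five target sites, the best pairwise r2 with any selected tag.

```python
def pairwise_coverage(best_r2_per_site, threshold):
    """Fraction of target sites whose best pairwise r2 with any tag exceeds threshold."""
    covered = sum(1 for r2 in best_r2_per_site if r2 > threshold)
    return covered / len(best_r2_per_site)


# Hypothetical tag set chosen with a minimum r2 of 0.5: many moderate links.
selected_at_05 = [0.55, 0.60, 0.52, 0.70, 0.30]
# Hypothetical tag set chosen with a minimum r2 of 0.8: fewer but stronger links.
selected_at_08 = [0.85, 0.90, 0.82, 0.40, 0.30]

for threshold in (0.5, 0.8):
    a = pairwise_coverage(selected_at_05, threshold)
    b = pairwise_coverage(selected_at_08, threshold)
    print(f"coverage at r2 > {threshold}: set_0.5={a:.2f}, set_0.8={b:.2f}")
```

In this toy case the set chosen at 0.5 wins when coverage is scored at r2 > 0.5, and the set chosen at 0.8 wins when coverage is scored at r2 > 0.8, so the ranking follows the scoring rule rather than any independent measure of performance.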

Figure 4

Influence of (A) minimum r2 threshold and (B) lower MAF threshold on imputation accuracy and coverage (r2 > 0.5 and r2 > 0.8) within populations from the Americas with an allocation of 1M sites.

The impact of minimum minor allele frequency threshold was negligible across variants with MAF > 5% for all non-African populations (S6 Fig). Within populations of African descent, limiting tags to variants with MAF > 5% resulted in increased imputation accuracy for all frequency bins, especially for common variants. Lowering the MAF to 0.5% reduced accuracy in African-descent populations across all frequency bins. For EUR, SAS, and AMR, tags with MAF > 1% had decreased accuracy for variants with MAF 0.5–1% compared to when tags are limited to MAF > 0.5%. (Figure 4B) The lowest limit of MAF (0.5%) showed increased accuracy for rare variation but at a slight cost to the accuracy for common sites (MAF > 5%). We concluded that the best balance for tag SNP selection across all populations among these was MAF > 1% within the population being tagged, as the imputation accuracy was best for MAF > 5% for half of the groups (AAC, AFR, EAS) and best for MAF > 0.5% for the other half (AMR, EUR, SAS). (S2 Table) However, the overall differences in imputation accuracy were minimal, with less than 1% between all lower MAF thresholds across all sites. Again, we observed large differences in pairwise coverage, despite negligible differences when performance is evaluated by imputation accuracy. (S6 Fig) This is particularly striking for African-descent populations (ASW and AFR), where there were large gains of pairwise coverage for MAF > 1%, compared to MAF > 0.5% and MAF > 5%. As previously described, African populations have shorter LD blocks and a greater absolute number of polymorphic variants compared to other populations. (1000 Genomes Project Consortium ) Therefore, pairwise coverage underestimates performance compared to imputation accuracy, as addressed below.

Tagging potential differs between populations

Efficient tag SNP selection is an opportunity to boost power in downstream analyses. In our study, African and out-of-Africa populations exhibited distinct genetic architectures, which resulted in different performance trends. Even when cross-population performance was prioritized, it did not guarantee equal representation of all population groups within the tag SNP set. To determine the contribution of each population, we focused on chromosome 9 (42,215 tags), equivalent to one million sites genome-wide, selected with our novel cross-population prioritization scheme. This tag SNP allocation resulted in including all tags that were informative in at least 3 to all 6 populations in the scaffold. Out of all tags for chromosome 9, 17.96% were informative in all 6 populations. (S3 Table) No tags were included that were informative in only one or two populations. Of tags that were informative in 5 out of the 6 super-populations, only 54% were in LD with any target sites within EAS populations, while 93% were informative in AAC populations. (Figure 5A) This trend is consistent with cross-population tags tending to be less informative in EAS populations compared to the other populations. When tags are informative in 3 out of 6 groups, only 18% were informative in EAS, while 75% were informative in AAC. Tags informative in only 2 of the 6 groups were likely informative in AAC and AFR, the African descent populations, while very few of them were informative for non-African descent groups, consistent with capturing differential LD patterns in African populations. (Henn ) When tags are stratified by MAF (0.5–1%, 1–5%, and >5%), these trends are exaggerated in the low frequency and rare MAF bins. (S7 Fig) As expected, the rare variation (0.5–1% MAF) was highly population-specific with no sites in this frequency bin being informative across all populations, or even 5 out of the 6 populations. (Gravel ) For low frequency variation (1–5%), tags were the least informative within EAS, with only 36% of the tags informative in 5 out of 6 populations.
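The stratified proportions reported above (for example, the share of five-population tags that are informative in EAS) can be computed from a simple mapping of each tag to the set of populations in which it is informative. The Python sketch below shows one way to do this; the tag identifiers and memberships are made up for illustration, and `share_informative_in` is a hypothetical helper, not part of the published framework.

```python
def share_informative_in(tags, n_populations, population):
    """Among tags informative in exactly n_populations groups,
    return the fraction that are informative in `population`
    (None if no tag has that count)."""
    subset = [pops for pops in tags.values() if len(pops) == n_populations]
    if not subset:
        return None
    return sum(1 for pops in subset if population in pops) / len(subset)


# Toy data: tag -> super populations in which it is informative.
toy_tags = {
    "tag_a": {"AAC", "AFR", "AMR", "EAS", "EUR", "SAS"},
    "tag_b": {"AAC", "AFR", "AMR", "EUR", "SAS"},
    "tag_c": {"AAC", "AFR", "AMR"},
    "tag_d": {"AAC", "AFR", "EAS", "EUR", "SAS"},
}

print(share_informative_in(toy_tags, 5, "EAS"))  # share of 5-population tags covering EAS
print(share_informative_in(toy_tags, 5, "AAC"))  # share of 5-population tags covering AAC
```

Repeating the call for each population and each count from 3 to 6 would give one line per index population, in the layout described for Figure 5A.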

Figure 5

Tag SNP informativeness across populations. (A) Proportion of sites informative (r2 > 0.5, MAF > 0.01, 1M site scaffold) across a number of populations, with lines corresponding to the index population. For example, for sites that are informative (r2 > 0.5 with any untyped SNP in genome) in five out of the six populations, only slightly more than half are informative in East Asian populations while greater than 90% are informative in African populations. (B) Proportion of sites shared across populations, conditional on index population. For example, for sites informative in African populations, less than half are informative in East Asian, European, and South Asian populations.

Conditional performance, or the ability of a tag which is informative in the index population also being informative in an additional population, was also examined and found to be consistent with known population histories. Of tags that are informative within AFR, 94% were informative within AAC, while only 38% were informative within EAS. (Figure 5B) However, among tags that were informative within EAS, 81% were informative within African populations. Once again, the stratified analyses show exaggerated trends for the low frequency and rare MAF bins. (S8 Fig) For the rare variation (0.5–1%), only a very small percentage (<10%) of tags are informative in other populations (AMR, EAS, EUR, SAS) if they were informative within African-descent populations (AFR and AAC). The high level of sharing between AFR and AAC is expected due to the high proportion of African ancestry within African-American and Afro-Caribbean populations. Of tags informative within EUR, 78% are also informative within AMR, largely due to the high proportion of European ancestry within some Hispanic/Latino populations. (Moreno-Estrada ; Gravel ; Moreno-Estrada ) The tags were also not equally informative in each population when it comes to the number of sites they tag with r2 > 0.5. For chromosome 9, it would take 81,416 tags to capture all possible tag-able variation with an r2 > 0.5 within AFR populations, while it would take only 28,473 tags within EAS populations to saturate coverage. However, each tag within the AFR populations captures on average 7.17 other sites, whereas for EAS populations, each tag captures on average 10.27 other SNPs. When restricting the design to a million tag SNP scaffold, each tag captures on average 16.16 other SNPs within EAS populations and 12.16 other SNPs in AFR populations. (Table 1) This reflects the different underlying genetic architecture of these different groups.

Table 1

Performance per tag SNP to capture all variation possible with r2 > 0.8 on chromosome 9, as well as within a one million site genome-wide scaffold allocation through cross-population prioritization

Population	All Possible Tags: Number of Tags	All Possible Tags: Sites Captured per Tag	One Million Tag Scaffold: Number of Tags	One Million Tag Scaffold: Sites Captured per Tag
AAC	74,255	8.04	36,336	12.97
AFR	81,416	7.17	34,548	12.16
AMR	43,065	9.40	28,691	12.80
EAS	28,473	10.27	16,457	16.16
EUR	35,027	9.48	22,111	13.63
SAS	37,644	9.28	23,480	13.33
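The AFR and EAS rows of Table 1 can be compared with a few lines of arithmetic. The Python sketch below copies the values from the table; the helper names `relative_deficit` and `implied_pairings` are illustrative only. Note that tags multiplied by sites per tag counts tag-site pairings, which may exceed the number of unique sites if a site is captured by more than one tag.

```python
# Values copied from the AFR and EAS rows of Table 1 (chromosome 9).
TABLE1 = {
    "AFR": {"all_tags": 81_416, "all_per_tag": 7.17,
            "scaffold_tags": 34_548, "scaffold_per_tag": 12.16},
    "EAS": {"all_tags": 28_473, "all_per_tag": 10.27,
            "scaffold_tags": 16_457, "scaffold_per_tag": 16.16},
}


def relative_deficit(low, high):
    """Fractional shortfall of `low` relative to `high`."""
    return 1.0 - low / high


def implied_pairings(n_tags, sites_per_tag):
    """Number of tags times mean sites captured per tag (tag-site pairings)."""
    return n_tags * sites_per_tag


afr, eas = TABLE1["AFR"], TABLE1["EAS"]
deficit_all = relative_deficit(afr["all_per_tag"], eas["all_per_tag"])
deficit_scaffold = relative_deficit(afr["scaffold_per_tag"], eas["scaffold_per_tag"])

print(f"AFR tags capture {deficit_all:.1%} fewer sites per tag than EAS (all possible tags)")
print(f"AFR tags capture {deficit_scaffold:.1%} fewer sites per tag than EAS (1M scaffold)")
print(f"AFR pairings: {implied_pairings(afr['all_tags'], afr['all_per_tag']):,.0f}")
print(f"EAS pairings: {implied_pairings(eas['all_tags'], eas['all_per_tag']):,.0f}")
```

With these inputs the per-tag shortfall in AFR is roughly 30% when all possible tags are used and roughly 25% within the one-million-tag scaffold, while the total number of pairings is still larger in AFR because far more tags are needed there.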

Limits of tagging and imputation

Not all of the human genome can be captured through pairwise tagging given existing reference panels. For each super population, we filtered for sites that were polymorphic (MAF > 0.5%) and had no pairwise correlation (r2 > 0.2) with any other site within one megabase. The number of these “lone sites” without any pairwise correlation was dependent upon population. AAC had the greatest number of lone sites, but that is likely due to the significantly decreased sample size compared to the other populations. (Table 2) The lowest number of lone sites was found within AMR. Although these sites have no notable pairwise correlation with any other site in the human genome, haplotypes may be informative and allow the recovery of information for imputation. We again assumed a one million genome-wide tag SNP scaffold allocation with minimum MAF of 1% and minimum r2 threshold of 0.5 and imputed to the entire 1000 Genomes reference panel. As expected, imputation accuracy and ability to recover information was population-specific. The imputation accuracy within AAC was an outlier when compared to other populations, with 80.72% of lone sites being imputed with at least the accuracy of r²_acc ≥ 0.5 and over 50% of sites being imputed with even higher accuracy (r²_acc ≥ 0.8). Many of these lone sites within AAC were captured with pairwise and haplotype LD within other populations, primarily AFR and to a lesser extent EUR. While there were likely insufficient allele counts for accurate correlation estimation within AAC due to the small sample size, this information could be recovered using a global reference panel. The number of unrecoverable “dark sites”, which had no pairwise correlation and were not recoverable with imputation using haplotype information, was the largest in EAS and is consistent with known demography and population history yielding an excess of highly rare variation compared to other populations. (Gravel )
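The lone-site filter described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: it uses the squared Pearson correlation of 0/1/2 genotype counts as a stand-in for the LD r2 (the estimator actually used is not given in this section), scans all pairs naively, and runs on invented toy data. The function names are hypothetical.

```python
def maf(genotypes):
    """Minor allele frequency from 0/1/2 allele counts."""
    p = sum(genotypes) / (2 * len(genotypes))
    return min(p, 1 - p)


def r2(x, y):
    """Squared Pearson correlation of two genotype vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def lone_sites(positions, genotypes, min_maf=0.005, max_r2=0.2, window=1_000_000):
    """Indices of polymorphic sites with no r2 > max_r2 partner within `window` bp."""
    lone = []
    for i, pos_i in enumerate(positions):
        if maf(genotypes[i]) <= min_maf:
            continue  # not polymorphic enough to be considered
        has_partner = any(
            j != i
            and abs(positions[j] - pos_i) <= window
            and r2(genotypes[i], genotypes[j]) > max_r2
            for j in range(len(positions))
        )
        if not has_partner:
            lone.append(i)
    return lone


# Toy data: sites 0 and 1 are perfectly correlated neighbours; site 2 is
# uncorrelated with both; site 3 matches site 2 but lies more than 1 Mb away.
toy_positions = [100, 200, 300, 2_000_000]
toy_genotypes = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [1, 0, 1, 1, 2, 1],
    [1, 0, 1, 1, 2, 1],
]
print(lone_sites(toy_positions, toy_genotypes))
```

Sites 2 and 3 are flagged even though they are identical to each other, because the distance window excludes the pair. A production pipeline would restrict comparisons to nearby sites rather than scanning all pairs.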

Table 2

Lone sites by super population and their imputation accuracy for a one million site scaffold

Population	Number of Individuals	Number of Lone Sites	Imputed with r²_acc ≥ 0.2	Imputed with r²_acc ≥ 0.5	Imputed with r²_acc ≥ 0.8	Number Unrecoverable, r²_acc < 0.2 (%)
AAC	156	7,509	90.79%	80.72%	51.72%	691 (9.2%)
AFR	495	4,497	63.29%	38.73%	7.03%	1,651 (36.7%)
AMR	341	2,701	48.98%	25.88%	3.78%	1,378 (51.02%)
EAS	503	4,947	44.37%	12.41%	2.14%	2,752 (55.63%)
EUR	501	3,881	51.07%	23.22%	3.74%	1,899 (48.93%)
SAS	477	4,293	51.01%	18.77%	2.26%	2,103 (48.99%)

Pairwise coverage vs. imputation accuracy

When evaluating the performance of a GWAS scaffold, there are numerous factors to take into consideration. These include the number of sites you have allocated to tag SNPs and what your priorities are for balanced representation. To a lesser extent, the benefits and pitfalls of prioritizing low-frequency variants must be weighed. However, we have demonstrated that the influence of these factors is highly dependent on how performance is measured. The notion of genomic “coverage” has historically been estimated using pairwise correlations, and therefore this term will be used to denote the proportion of polymorphic sites that are in pairwise LD (r2 threshold) with at least one tag SNP. We calculated coverage separately per super population at an r2 threshold of 0.5 and 0.8 within minor allele frequency bins identical to the imputation accuracy estimation analyses, assuming a genome-wide tag SNP set of 500,000 and 1,000,000. (Table 3) For a tag SNP set of one million sites, coverage was lowest in AFR with an overall average of 59.15% for all sites with MAF > 0.5% and r2 > 0.5. (S9 Fig) When the r2 threshold is raised to 0.8, the proportion of sites in linkage disequilibrium with at least one tag SNP lowers to 28%. (Figure 6) The highest coverage was found in populations from the Americas (AMR) and East Asia (EAS). For a lower r2 threshold of 0.5, 79.9% of AMR sites with MAF > 0.5% were covered. When using the higher r2 threshold of 0.8, East Asian populations had the highest coverage with 63.08% of sites in LD with at least one tag SNP. This difference is even more marked when looking at a smaller tag SNP set of 500,000 sites. (S10 Fig, S11 Fig) African populations now have an overall coverage of 33.17% with r2 > 0.5 and 14.10% with r2 > 0.8. East Asian populations have the highest coverage with 73.16% of sites covered with r2 > 0.5 and 55.09% with r2 > 0.8.
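Coverage within minor allele frequency bins, as defined above, can be sketched as follows. The Python example assumes each site has already been summarized as a pair of its MAF and its best pairwise r2 with any tag SNP. The bin edges follow the 0.5–1%, 1–5%, and >5% bins used in the text, but the handling of values exactly on a bin edge, the function name, and the toy values are assumptions made for illustration.

```python
# (label, lower bound exclusive, upper bound inclusive)
MAF_BINS = [("0.5-1%", 0.005, 0.01), ("1-5%", 0.01, 0.05), (">5%", 0.05, 0.5)]


def coverage_by_maf_bin(sites, r2_threshold):
    """sites: iterable of (maf, best_r2_with_any_tag).
    Returns {bin label: fraction of sites in the bin with best r2 > threshold},
    or None for an empty bin."""
    sites = list(sites)
    result = {}
    for label, lo, hi in MAF_BINS:
        in_bin = [best for site_maf, best in sites if lo < site_maf <= hi]
        if in_bin:
            result[label] = sum(1 for b in in_bin if b > r2_threshold) / len(in_bin)
        else:
            result[label] = None
    return result


# Invented example values, two sites per bin.
toy_sites = [
    (0.007, 0.90), (0.008, 0.30),
    (0.030, 0.60), (0.030, 0.85),
    (0.200, 0.95), (0.400, 0.82),
]
print(coverage_by_maf_bin(toy_sites, 0.5))
print(coverage_by_maf_bin(toy_sites, 0.8))
```

Raising the threshold from 0.5 to 0.8 lowers coverage only in the middle bin of this toy example, which shows how the same tag set can look quite different depending on the threshold and the frequency bin being reported.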

Table 3

Coverage of 1 million and 500,000 tag SNP set by super population for all polymorphic sites on chromosome 9 with MAF > 0.5%

Super population	Total Number of Polymorphic Sites	1,000,000 tags: Coverage (r² > 0.5)	1,000,000 tags: Coverage (r² > 0.8)	1,000,000 tags: Imputation Accuracy	500,000 tags: Coverage (r² > 0.5)	500,000 tags: Coverage (r² > 0.8)	500,000 tags: Imputation Accuracy
AAC	780896	63.64%	30.27%	90.59%	34.03%	14.07%	84.85%
AFR	777207	59.15%	28.05%	89.62%	33.17%	14.10%	83.32%
AMR	503804	79.90%	53.60%	92.77%	61.00%	37.02%	90.09%
EAS	367189	76.95%	63.08%	86.28%	73.16%	55.09%	84.16%
EUR	414184	78.77%	62.65%	91.02%	72.87%	52.86%	88.90%
SAS	455573	74.84%	56.97%	88.09%	67.28%	45.91%	85.46%
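The rows of Table 3 can be ranked by each metric to see whether coverage and imputation accuracy order the populations the same way. The Python sketch below copies the one-million-tag columns from the table; the dictionary keys and the `rank_by` helper are illustrative names.

```python
# Percentages copied from Table 3, scaffold of 1,000,000 tags.
TABLE3_1M = {
    "AAC": {"cov_05": 63.64, "cov_08": 30.27, "accuracy": 90.59},
    "AFR": {"cov_05": 59.15, "cov_08": 28.05, "accuracy": 89.62},
    "AMR": {"cov_05": 79.90, "cov_08": 53.60, "accuracy": 92.77},
    "EAS": {"cov_05": 76.95, "cov_08": 63.08, "accuracy": 86.28},
    "EUR": {"cov_05": 78.77, "cov_08": 62.65, "accuracy": 91.02},
    "SAS": {"cov_05": 74.84, "cov_08": 56.97, "accuracy": 88.09},
}


def rank_by(metric):
    """Populations ordered from lowest to highest on the given metric."""
    return sorted(TABLE3_1M, key=lambda pop: TABLE3_1M[pop][metric])


print("lowest to highest coverage (r2 > 0.5):", rank_by("cov_05"))
print("lowest to highest coverage (r2 > 0.8):", rank_by("cov_08"))
print("lowest to highest imputation accuracy:", rank_by("accuracy"))
```

The two orderings disagree: AFR has the lowest coverage at both thresholds, whereas EAS, which has the highest coverage at r2 > 0.8, has the lowest imputation accuracy.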

Figure 6

Coverage (dashed lines) vs. Imputation Accuracy (solid lines), assuming a genome-wide scaffold size of one million tags. Coverage is shown with an r2 > 0.8. While pairwise tagging values are low, particularly in African-descent populations, multi-marker imputation accuracy remains high across groups.

These trends are in striking contrast to those we observed in imputation accuracy. When comparing a tag SNP set of 1 million, pairwise LD coverage is the lowest in populations of African descent (59% with r2 > 0.5), yet imputation’s ability to recover un-typed sites is on average high and consistent with other populations (imputation accuracy of 89.62%) among SNPs with a minor allele frequency above 0.5%. This contrast is also found in East Asian populations, which had one of the highest proportions of polymorphic SNPs with r2 > 0.5 for coverage (76.95%), but the lowest imputation accuracy (86.28%). (Table 3) When sites are stratified by minor allele frequency bins, the differences in trends are even more striking. (Figure 6, S9 Fig) For example, within the lowest frequency bin (0.5–1%) for admixed populations of African descent, the coverage of sites for a set of 500,000 tag SNPs with r2 > 0.8 falls below 10%; however, the imputation accuracy remains relatively high at 77.82%. These trends are consistent and more dramatic when evaluated within a tag SNP set of 500,000 sites. (S10 Fig, S11 Fig) These observations reinforce the necessity of examining imputation accuracy, instead of pairwise coverage, when evaluating the performance of tag SNPs.
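As a rough illustration of the accuracy metric, the Python sketch below scores each site by the squared correlation between true genotypes and imputed dosages and averages across sites, in the spirit of the mean imputed r2 named in the abstract. The exact definition used in the study is given in its Methods, which are not reproduced here, so the function names, the treatment of constant vectors, and the toy data are all assumptions.

```python
def squared_correlation(truth, dosage):
    """Squared Pearson correlation between true genotypes (0/1/2) and imputed
    dosages; returns 0.0 when either vector has no variance."""
    n = len(truth)
    mt, md = sum(truth) / n, sum(dosage) / n
    stt = sum((t - mt) ** 2 for t in truth)
    sdd = sum((d - md) ** 2 for d in dosage)
    if stt == 0 or sdd == 0:
        return 0.0
    std = sum((t - mt) * (d - md) for t, d in zip(truth, dosage))
    return std * std / (stt * sdd)


def mean_imputation_r2(sites):
    """sites: list of (true_genotypes, imputed_dosages) pairs, one per untyped site."""
    scores = [squared_correlation(truth, dosage) for truth, dosage in sites]
    return sum(scores) / len(scores)


# Toy example: one site imputed perfectly, one imputed as an uninformative constant.
toy = [
    ([0, 1, 2, 1], [0.0, 1.0, 2.0, 1.0]),
    ([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]),
]
print(mean_imputation_r2(toy))
```

Unlike pairwise coverage, this score does not require any single tag to be strongly correlated with the target site, since the dosages can draw on several flanking markers at once.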

Discussion

As genomic researchers shift their focus to rare variant association in large and increasingly heterogeneous populations, it is important to design arrays with this ultimate goal in mind. There are currently two accepted methods of evaluating the performance of tag SNPs: pairwise LD “coverage” and imputation accuracy. Coverage has historically been used as a term to denote the proportion of polymorphic sites that are in linkage disequilibrium with at least one tag marker above a certain r2 threshold. (Barrett and Cardon 2006; Pe’er ; Li ; Bhangale ) Genotyping arrays are typically compared using this score averaged across the genome. However, as we and others have demonstrated, restricting performance assessment to this definition of pairwise coverage is limited by removing multi-marker information. (Nelson ; Martin ) Evaluating imputation accuracy, particularly via leave-one-out cross validation, is highly computationally intensive, but provides a better assessment of how well untyped variation can be recaptured and a more realistic depiction of array performance than pairwise coverage. Imputation accuracy is also a more useful statistic in a practical sense, especially with the development of deeper and more diverse reference panels, (Prüfer ; Gurdasani ; Sudlow ; 1000 Genomes Project Consortium ; McCarthy ) as performing GWAS with imputed variants is now the expectation. Emerging evidence suggests that rare variants (MAF < 1%) that are poorly tagged by an individual tag SNP will be accessible via imputation, due to added haplotype information, particularly as sample sizes move beyond the thousands into the tens or hundreds of thousands. (Nelson ; Fuchsberger ) Previous tagging strategies have predominantly focused on optimizing performance in a single population.
In prioritizing potential tags by their ability to provide linkage disequilibrium information across multiple populations, we were able to demonstrate that cross population tag SNP selection outperforms single population selection. This boost in imputation accuracy exists across all populations and frequency bins. We simulated tag SNP sets for a range of sizes (250,000-2 million), as well as for several minimum minor allele frequencies (0.5%, 1%, 5%) and minimum r2 thresholds (0.2, 0.5, 0.8). For investigators with limited real estate or budget for tag SNP selection, we found that the biggest improvement in imputation accuracy provided with our cross population approach was with the smaller array sizes (250,000) when compared to a naïve design or biased population ascertainment. As expected, the influence of MAF and r2 threshold was population-specific. For African-descent populations, including tag SNPs with a low threshold of r2 ≥ 0.2 resulted in lower imputation accuracy across all bins, while in other populations (EUR, AMR, SAS) tags at r2 ≥ 0.2 led to increased imputation accuracy for low frequency variants to the detriment of common variation. This is due to the lower LD patterns overall in African haplotypes, requiring denser coverage. The best balance was found with a moderate r2 threshold of ≥ 0.5 for those seeking to perform well across all populations. This compromise is also present in choosing the lower MAF threshold. Limiting tag SNP selection to common variants with MAF ≥ 5% produced the highest imputation accuracy across all frequency bins within African-descent populations. However, this threshold decreased imputation accuracy for low frequency and rare variants in all other populations. Therefore, the best balance is once again found in the moderate value of MAF ≥ 1%. Investigators will need to take their priorities into account when selecting the correct thresholds for their populations and if they have a specific target frequency bin. 
We chose to prioritize all populations equally to provide a design of broad global utility, which was adopted to construct the GWAS scaffold for Illumina Infinium Multi-Ethnic Global Arrays (Illumina) and Global Screening Arrays (Illumina). If a study is comprised of mostly one ancestral group, then the investigators should choose the appropriate thresholds tailored for their study. Consistent with demographic history, the potential to capture variation with a limited allocation is unequal between the different populations in the 1000 Genomes Project. The naïve tagging approach will bias tag SNP selection to be primarily informative within African-descent populations. The absolute number of polymorphic sites within African populations is much larger than other populations, and while LD tends to be lower than in other populations, the high number of potential tags and pairwise correlations overwhelms the other populations’ contributions without controlling for this unique pattern. By prioritizing potential tags that provide information across all populations, the population-level contributions are more balanced without detriment to the African-descent groups (Figure 4). The absolute number of rare variants (MAF < 1%) is larger in African populations, but the frequency spectrum is more skewed toward rare variants in populations with recent bottlenecks and exponential population expansion, such as in East Asians. Contrasting these two populations (AFR and EAS), East Asian populations require fewer sites to saturate coverage, with each potential tag being in LD with more sites. However, far more polymorphic sites across the genome cannot be captured with either pairwise linkage disequilibrium or through haplotype information with imputation accuracy within these populations due to a dearth of LD information. This is amplified by the lack of comprehensive reference panels for many populations, such as East and South Asia. 
As reference panels are expanded, more variation will be captured to inform tag SNP selection and imputation accuracy, and we expect imputation accuracy to improve for all populations and across the frequency spectrum. (Fuchsberger ) The power to identify relevant disease loci is inherently constrained by sample size and genome coverage. It is important to note that algorithmic development in both association testing and imputation methods has been a productive avenue of research since GWAS began, with new methods providing incremental improvements in statistical power. Here, we demonstrate a complementary strategy to improve statistical power by designing arrays optimized for imputation accuracy. Also, as cosmopolitan biobanks and large-scale multi-ethnic epidemiological studies become more commonplace, it will be important to have available platforms with built-in trans-ethnic utility. As global reference panels become deeper and more diverse, more variation will be available for array design. The unified framework presented here will enable investigators to make informed decisions in the development and selection of GWAS scaffolds for future large-scale multi-ethnic studies. This increased representation of multi-ethnic genetic variation will promote the investigation of the genetics of complex disease and the improvement of global health in the next phase of GWAS.

55 in total

1. Efficiency and power in genetic association studies.

Authors: Paul I W de Bakker; Roman Yelensky; Itsik Pe'er; Stacey B Gabriel; Mark J Daly; David Altshuler
Journal: Nat Genet Date: 2005-10-23 Impact factor: 38.330

2. Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations.

Authors: Alicia R Martin; Christopher R Gignoux; Raymond K Walters; Genevieve L Wojcik; Benjamin M Neale; Simon Gravel; Mark J Daly; Carlos D Bustamante; Eimear E Kenny
Journal: Am J Hum Genet Date: 2017-03-30 Impact factor: 11.025

Review 3. Linkage disequilibrium in humans: models and data.

Authors: J K Pritchard; M Przeworski
Journal: Am J Hum Genet Date: 2001-06-14 Impact factor: 11.025

4. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age.

Authors: Cathie Sudlow; John Gallacher; Naomi Allen; Valerie Beral; Paul Burton; John Danesh; Paul Downey; Paul Elliott; Jane Green; Martin Landray; Bette Liu; Paul Matthews; Giok Ong; Jill Pell; Alan Silman; Alan Young; Tim Sprosen; Tim Peakman; Rory Collins
Journal: PLoS Med Date: 2015-03-31 Impact factor: 11.069

5. The complete genome sequence of a Neanderthal from the Altai Mountains.

Authors: Kay Prüfer; Fernando Racimo; Nick Patterson; Flora Jay; Sriram Sankararaman; Susanna Sawyer; Anja Heinze; Gabriel Renaud; Peter H Sudmant; Cesare de Filippo; Heng Li; Swapan Mallick; Michael Dannemann; Qiaomei Fu; Martin Kircher; Martin Kuhlwilm; Michael Lachmann; Matthias Meyer; Matthias Ongyerth; Michael Siebauer; Christoph Theunert; Arti Tandon; Priya Moorjani; Joseph Pickrell; James C Mullikin; Samuel H Vohr; Richard E Green; Ines Hellmann; Philip L F Johnson; Hélène Blanche; Howard Cann; Jacob O Kitzman; Jay Shendure; Evan E Eichler; Ed S Lein; Trygve E Bakken; Liubov V Golovanova; Vladimir B Doronichev; Michael V Shunkov; Anatoli P Derevianko; Bence Viola; Montgomery Slatkin; David Reich; Janet Kelso; Svante Pääbo
Journal: Nature Date: 2013-12-18 Impact factor: 49.962

6. A global reference for human genetic variation.

Authors: Adam Auton; Lisa D Brooks; Richard M Durbin; Erik P Garrison; Hyun Min Kang; Jan O Korbel; Jonathan L Marchini; Shane McCarthy; Gil A McVean; Gonçalo R Abecasis
Journal: Nature Date: 2015-10-01 Impact factor: 49.962

7. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip.

Authors: Chris C A Spencer; Zhan Su; Peter Donnelly; Jonathan Marchini
Journal: PLoS Genet Date: 2009-05-15 Impact factor: 5.917

8. Reconstructing the population genetic history of the Caribbean.

Authors: Andrés Moreno-Estrada; Simon Gravel; Fouad Zakharia; Jacob L McCauley; Jake K Byrnes; Christopher R Gignoux; Patricia A Ortiz-Tello; Ricardo J Martínez; Dale J Hedges; Richard W Morris; Celeste Eng; Karla Sandoval; Suehelay Acevedo-Acevedo; Paul J Norman; Zulay Layrisse; Peter Parham; Juan Carlos Martínez-Cruzado; Esteban González Burchard; Michael L Cuccaro; Eden R Martin; Carlos D Bustamante
Journal: PLoS Genet Date: 2013-11-14 Impact factor: 5.917

9. Demography and the age of rare variants.

Authors: Iain Mathieson; Gil McVean
Journal: PLoS Genet Date: 2014-08-07 Impact factor: 5.917

10. The UK10K project identifies rare variants in health and disease.

Authors: Klaudia Walter; Josine L Min; Jie Huang; Lucy Crooks; Yasin Memari; Shane McCarthy; John R B Perry; ChangJiang Xu; Marta Futema; Daniel Lawson; Valentina Iotchkova; Stephan Schiffels; Audrey E Hendricks; Petr Danecek; Rui Li; James Floyd; Louise V Wain; Inês Barroso; Steve E Humphries; Matthew E Hurles; Eleftheria Zeggini; Jeffrey C Barrett; Vincent Plagnol; J Brent Richards; Celia M T Greenwood; Nicholas J Timpson; Richard Durbin; Nicole Soranzo
Journal: Nature Date: 2015-09-14 Impact factor: 49.962
