Andrés Jiménez-Kaufmann1, Amanda Y Chong2, Adrián Cortés2, Consuelo D Quinto-Cortés1, Selene L Fernandez-Valverde1, Leticia Ferreyra-Reyes3, Luis Pablo Cruz-Hervert3, Santiago G Medina-Muñoz1, Mashaal Sohail1,4, María J Palma-Martinez1, Gudalupe Delgado-Sánchez3, Norma Mongua-Rodríguez3, Alexander J Mentzer2, Adrian V S Hill2,5, Hortensia Moreno-Macías6,7, Alicia Huerta-Chagoya6, Carlos A Aguilar-Salinas8,9, Michael Torres1, Hie Lim Kim10,11,12, Namrata Kalsi10,11, Stephan C Schuster10,11,12, Teresa Tusié-Luna6,13, Diego Ortega Del-Vecchyo14, Lourdes García-García3, Andrés Moreno-Estrada1.
Abstract
Current Genome-Wide Association Studies (GWAS) rely on genotype imputation to increase statistical power, improve fine-mapping of association signals, and facilitate meta-analyses. Due to the complex demographic history of Latin America and the lack of balanced representation of Native American genomes in current imputation panels, locally relevant disease variants are likely to be missed, limiting the scope and impact of biomedical research in these populations. Better representation of diversity in genomic databases is therefore a scientific imperative. Here, we expand the 1,000 Genomes reference panel (1KGP) with 134 Native American genomes (1KGP + NAT) to assess imputation performance in Latin American individuals of mixed ancestry. Our panel increased the number of SNPs above the GWAS quality threshold, thus improving statistical power for association studies in the region. It also increased imputation accuracy, particularly for low-frequency variants segregating in Native American ancestry tracts. The improvement is subtle but consistent across countries and proportional to the number of genomes added from local source populations. To project the potential improvement with a higher number of reference genomes, we performed simulations and found that at least 3,000 Native American genomes are needed to equal the imputation performance of variants in European ancestry tracts. This reflects the concerning imbalance of diversity in current references and highlights the contribution of our work to reducing it while complementing efforts to improve global equity in genomic research.
Keywords: GWAS; Imputation; Latin Americans; Native American ancestry; reference panels; underrepresented populations
Year: 2022 PMID: 35046991 PMCID: PMC8762266 DOI: 10.3389/fgene.2021.719791
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Native American reference panel (NATS). (A) Geographical sampling locations of the NATS reference panel. Colors represent the four data sources: HGDP (61) (Bergström et al., 2020), SGDP (11) (Mallick et al., 2016), INMEGEN (12) (Romero-Hidalgo et al., 2017), and MX Biobank (50), totaling 134 genomes. (B) SNP proportions of the union of 1KGP and NATS (1KGP + NATS) by SNP sharing categories. We show the proportion of SNPs unique to 1KGP, SNPs unique to the NATS panel, and the intersection. (C) Unsupervised ADMIXTURE analysis at K = 3 of the NATS reference panel (far left, N = 134) together with 104 European (CEU), 113 African (YRI), and 347 admixed Latin American (AMR) samples from 1KGP. Genetic ancestry abbreviations: AFR, African; EUR, European; NAT, Native American.
SNPs above the standard quality threshold with each reference panel after imputing missing variants. We show, for each population, the average number of SNPs with MAF >= 0.01 and INFO >= 0.3 obtained with each of the two reference panels, together with the overall proportion of Native American ancestry of the population. The p-value was calculated with a two-tailed paired t-test. The average number of SNPs with MAF <0.01 and INFO >0.3 for both panels is shown in Supplementary Table S4.
| Population | SNPs above quality threshold (1KGP) | SNPs above quality threshold (1KGP + NATS) | Increase of SNPs using 1KGP + NATS | Average proportion of Nat. American ancestry |
|---|---|---|---|---|
| Peru (PEL) | 244,818 | 248,087 | 3,269 ( | 0.70 |
| Mexico (MXL) | 265,619 | 268,254 | 2,635 ( | 0.42 |
| Colombia (CLM) | 279,828 | 281,911 | 2,163 ( | 0.18 |
| Puerto Rico (PUR) | 291,035 | 292,734 | 1,699 ( | 0.06 |
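The quality threshold and the paired comparison described in the table caption can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the `ImputedSnp` record, its field names, and both function names are hypothetical, and the choice of which measurements are paired is an assumption, since the caption does not state it.

```python
import math
from statistics import mean, stdev
from typing import Iterable, NamedTuple, Sequence


class ImputedSnp(NamedTuple):
    """Hypothetical record for one imputed variant (field names are assumptions)."""
    snp_id: str
    maf: float   # minor allele frequency
    info: float  # imputation quality (INFO) score


def count_passing_snps(snps: Iterable[ImputedSnp],
                       min_maf: float = 0.01,
                       min_info: float = 0.3) -> int:
    """Count SNPs with MAF >= min_maf and INFO >= min_info (caption thresholds)."""
    return sum(1 for s in snps if s.maf >= min_maf and s.info >= min_info)


def paired_t_statistic(base: Sequence[float], expanded: Sequence[float]) -> float:
    """Paired t statistic for matched counts from two reference panels.

    Assumes base[i] and expanded[i] come from the same unit (for example the
    same individual or chromosome); the pairing scheme is an assumption.
    Raises ZeroDivisionError when all paired differences are identical.
    """
    if len(base) != len(expanded) or len(base) < 2:
        raise ValueError("need two equal-length sequences with at least 2 pairs")
    diffs = [e - b for b, e in zip(base, expanded)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
```

The two-tailed p-value would then come from a t distribution with n - 1 degrees of freedom, which this sketch does not compute.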
FIGURE 2. Imputation accuracy by local ancestry and population using both reference panels. (A) Imputation accuracy of the four AMR populations stratified by diploid local ancestry for the MEGA array using 1KGP as reference panel. (B) Imputation accuracy for the Native and European diploid ancestries using 1KGP and 1KGP + NATS as reference panel, focusing on rare alleles. Imputation accuracy was calculated as the squared Pearson correlation between imputed and real allele dosages.
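The accuracy metric named in this caption, the squared Pearson correlation between imputed and real allele dosages, can be written out for a single variant as below. This is a sketch under stated assumptions: the function name `dosage_r2` is hypothetical, dosages are taken to be per-individual values in the range 0 to 2, and returning NaN for a variant with no dosage variation is a choice made here, not something the source specifies.

```python
import math
from typing import Sequence


def dosage_r2(imputed: Sequence[float], truth: Sequence[float]) -> float:
    """Squared Pearson correlation between imputed and true allele dosages.

    Each sequence holds one dosage per individual for the same variant.
    Returns NaN when either sequence has zero variance, because the
    correlation is undefined in that case.
    """
    if len(imputed) != len(truth):
        raise ValueError("imputed and truth must have the same length")
    n = len(imputed)
    if n < 2:
        raise ValueError("need at least two individuals")

    mean_imp = sum(imputed) / n
    mean_tru = sum(truth) / n
    cov = sum((x - mean_imp) * (y - mean_tru) for x, y in zip(imputed, truth))
    var_imp = sum((x - mean_imp) ** 2 for x in imputed)
    var_tru = sum((y - mean_tru) ** 2 for y in truth)

    if var_imp == 0 or var_tru == 0:
        return math.nan
    # r^2 = cov^2 / (var_imp * var_tru); normalisation constants cancel.
    return (cov * cov) / (var_imp * var_tru)
```

Because the correlation is squared, a variant whose imputed dosages are perfectly anti-correlated with the truth would also score 1, so in practice allele coding must be aligned between the imputed and true genotypes before comparison.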
FIGURE 3. Predicted imputation accuracy according to demographic simulations. (A) Imputation accuracy in the diploid Native American (solid colored lines) and diploid European (thick dashed line) ancestries using different simulated reference panels of incremental sizes. Ref 0 stands for the base reference (as it has 0 additional reference genomes). Given the available demographic model (Browning et al., 2018), a simulated Asian population was used as a proxy for Native American ancestry for the purpose of reproducing a three-way admixture process with ancestry proportions of African, European, and Native American sources similar to those observed in admixed Latino populations (see Methods for details). (B) Increase in imputation accuracy from the base reference in the Native American diploid ancestry at increasing sizes of the reference panel by allele frequency [common (0.5–0.05), low (0.05–0.01), and rare (0.01–0.003)].
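The allele frequency classes used in panel (B) can be expressed as a small binning function. This is an illustrative sketch: the function name `frequency_class` is hypothetical, and treating each lower bound as inclusive is an assumption, since the caption gives only the ranges and not how boundary values are assigned.

```python
from typing import Optional


def frequency_class(maf: float) -> Optional[str]:
    """Assign a minor allele frequency to the classes named in the caption.

    common: 0.05 <= MAF <= 0.5
    low:    0.01 <= MAF < 0.05
    rare:   0.003 <= MAF < 0.01
    Returns None below 0.003, which falls outside the caption's ranges.
    Inclusive lower bounds are an assumption made for this sketch.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must lie in [0, 0.5]")
    if maf >= 0.05:
        return "common"
    if maf >= 0.01:
        return "low"
    if maf >= 0.003:
        return "rare"
    return None
```

Grouping per-variant accuracy values by this label, and subtracting the base reference (Ref 0) accuracy within each group, would give the kind of per-class increase that panel (B) reports.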