Zhi Ming Xu, Sina Rüeger, Michaela Zwyer, Daniela Brites, Hellen Hiza, Miriam Reinhard, Liliana Rutaihwa, Sonia Borrell, Faima Isihaka, Hosiana Temba, Thomas Maroa, Rastard Naftari, Jerry Hella, Mohamed Sasamalo, Klaus Reither, Damien Portevin, Sebastien Gagneux, Jacques Fellay.
Abstract
Genome-wide association studies rely on the statistical inference of untyped variants, called imputation, to increase the coverage of genotyping arrays. However, the results are often suboptimal in populations underrepresented in existing reference panels and array designs, since the selected single nucleotide polymorphisms (SNPs) may fail to capture population-specific haplotype structures and hence the full extent of common genetic variation. Here, we propose to sequence the full genomes of a small subset of an underrepresented study cohort to inform the selection of population-specific add-on tag SNPs and to generate an internal population-specific imputation reference panel, such that the remaining array-genotyped cohort could be more accurately imputed. Using a Tanzania-based cohort as a proof-of-concept, we demonstrate the validity of our approach by showing improvements in imputation accuracy after the addition of our designed add-on tags to the base H3Africa array.
Year: 2022 PMID: 35025869 PMCID: PMC8791479 DOI: 10.1371/journal.pcbi.1009628
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Schematic of our add-on tag SNP selection procedure, illustrating the following steps.
Step 1) Constructing a Tanzanian reference panel and identifying candidate target variants, which are derived from variants that are poorly imputed when the H3Africa array is imputed based on the Tanzanian and AFGR reference panels. Step 2) Selecting add-on tags that optimally tag candidate target variants based on population-specific LD structures, allele frequencies, and probe qualities. Step 3) Evaluating improvements in imputation performance after adding add-on tags to the base H3Africa array, by calculating imputation quality metrics including INFO score and r² (correlation between imputed and sequencing-based genotypes). WGS, Whole-Genome Sequencing; AFGR, African Genome Resource; MAF, Minor Allele Frequency; MI, Mutual Information; LD, Linkage Disequilibrium.
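As a rough illustration of Step 2, the sketch below greedily picks the candidate tag that covers the most still-untagged target variants. It is not the selection procedure used in the study: it assumes a plain pairwise r² threshold (`r2_min`, a hypothetical parameter) as the tagging criterion and ignores mutual information, allele frequencies, and probe quality. All function names and the toy genotypes are illustrative.

```python
def r2_between(g1, g2):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:
        return 0.0  # monomorphic variants cannot tag or be tagged
    return cov * cov / (v1 * v2)


def greedy_tags(genotypes, targets, candidates, r2_min=0.8):
    """Greedy set cover: repeatedly choose the candidate tagging the most
    uncovered targets. `genotypes` maps variant id -> list of 0/1/2 dosages.
    The r2_min threshold is an assumption made for this sketch."""
    covers = {
        c: {t for t in targets
            if r2_between(genotypes[c], genotypes[t]) >= r2_min}
        for c in candidates
    }
    uncovered = set(targets)
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda c: len(covers[c] & uncovered))
        gain = covers[best] & uncovered
        if not gain:
            break  # remaining targets cannot be tagged by any candidate
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Toy data: cA is in perfect LD with t1 and t2, cB with t3.
toy = {
    "cA": [0, 1, 2, 0, 1, 2],
    "cB": [0, 0, 0, 1, 1, 2],
    "t1": [0, 1, 2, 0, 1, 2],
    "t2": [0, 1, 2, 0, 1, 2],
    "t3": [0, 0, 0, 1, 1, 2],
}
tags, untagged = greedy_tags(toy, ["t1", "t2", "t3"], ["cA", "cB"])
print(tags, untagged)
```

A greedy cover is a common heuristic for tag selection; a fuller version would weight each candidate by its probe quality and restrict targets to the poorly imputed variants identified in Step 1.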
Imputation performance of publicly available reference panels when applied to the TB-DAR data based on the H3Africa array content.
Minor allele frequency (MAF) is based on the frequency observed in the TB-DAR cohort. Imputation quality (Subcolumn 1) is measured by either INFO score (AFGR and HRC; Sanger Imputation Server) or r² (CAAPA; Michigan Imputation Server). Correlation with ground truth (Subcolumn 2) measures the correlation between the imputed dosage and the ground truth WGS dosage using the squared Pearson correlation coefficient (r²). Percent of variants imputed (Subcolumn 3) represents the percentage of variants observed in the TB-DAR WGS data that were successfully imputed (Imputation Quality > 0.8).
| MAF | AFGR | | | CAAPA | | | HRC | | |
|---|---|---|---|---|---|---|---|---|---|
| | Quality (INFO) | Ground Truth | % Variants Imputed | Quality (r²) | Ground Truth | % Variants Imputed | Quality (INFO) | Ground Truth | % Variants Imputed |
| 0.01–0.05 | 0.95 | 0.91 | 88.6 | 0.85 | 0.83 | 67.5 | 0.88 | 0.80 | 68.3 |
| 0.05–0.1 | 0.98 | 0.96 | 93.4 | 0.93 | 0.91 | 86.8 | 0.96 | 0.90 | 91.9 |
| 0.1–0.5 | 0.99 | 0.97 | 92.7 | 0.96 | 0.95 | 90.1 | 0.98 | 0.95 | 91.7 |
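The three per-bin summaries in this table can be sketched as follows. This is a minimal illustration rather than the study's pipeline: the record layout (`maf`, `quality`, `imputed`, `truth`), the bin boundary handling (lower bound exclusive, upper bound inclusive), and the choice to average quality and r² over imputed variants only are all assumptions made here.

```python
def squared_pearson(x, y):
    """Squared Pearson correlation (r2) between two dosage vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined for a constant vector
    return sxy * sxy / (sxx * syy)


MAF_BINS = ((0.01, 0.05), (0.05, 0.1), (0.1, 0.5))


def summarize_by_maf(variants, bins=MAF_BINS, quality_min=0.8):
    """Per MAF bin: mean imputation quality, mean r2 against WGS ground
    truth, and percent of WGS variants imputed with quality > quality_min.
    A variant absent from the imputed output has quality=None."""
    summary = {}
    for lo, hi in bins:
        in_bin = [v for v in variants if lo < v["maf"] <= hi]
        if not in_bin:
            summary[(lo, hi)] = None
            continue
        imputed = [v for v in in_bin if v["quality"] is not None]
        passing = [v for v in imputed if v["quality"] > quality_min]
        r2s = [squared_pearson(v["imputed"], v["truth"]) for v in imputed]
        summary[(lo, hi)] = {
            "mean_quality": (sum(v["quality"] for v in imputed) / len(imputed)
                             if imputed else None),
            "mean_r2": sum(r2s) / len(r2s) if r2s else None,
            "pct_imputed": 100.0 * len(passing) / len(in_bin),
        }
    return summary


# Toy records (hypothetical values, not data from the study).
toy_variants = [
    {"maf": 0.03, "quality": 0.9, "imputed": [0.1, 0.9, 1.8, 0.0], "truth": [0, 1, 2, 0]},
    {"maf": 0.04, "quality": 0.5, "imputed": [0.5, 0.5, 0.6, 0.4], "truth": [0, 1, 0, 0]},
    {"maf": 0.20, "quality": None, "imputed": None, "truth": [0, 1, 1, 2]},
    {"maf": 0.30, "quality": 0.95, "imputed": [0, 1, 1, 2], "truth": [0, 1, 1, 2]},
]
summary = summarize_by_maf(toy_variants)
print(summary)
```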
Fig 2. Genetic differentiation of African populations.
A) Sampling locations of the TB-DAR WGS cohort and populations within the AFGR reference panel, which includes the Sub-Saharan African populations of the 1000 Genomes (1KG) project. Line colors illustrate the degree of differentiation (FST) between TB-DAR and 1KG populations. B) Pairwise FST measures between 1KG populations and TB-DAR. 1000 Genomes Populations: GWD, Gambian in Western Divisions in the Gambia; MSL, Mende in Sierra Leone; YRI, Yoruba in Ibadan, Nigeria; ESN, Esan in Nigeria; LWK, Luhya in Webuye, Kenya. The map was created programmatically in R using the spData package [58], with the base layer based on public domain maps from Natural Earth (https://www.naturalearthdata.com/).
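To make the differentiation measure concrete, the sketch below computes a pairwise fixation index from allele frequencies. The caption does not say which estimator was used; Hudson's estimator, combined across SNPs as a ratio of averages, is assumed here purely for illustration, and the function name and inputs are hypothetical.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson's FST estimator between two populations.

    p1, p2: per-SNP allele frequencies in population 1 and 2.
    n1, n2: number of sampled chromosomes (2 x individuals for diploids).
    Per-SNP numerators and denominators are summed before dividing
    (ratio of averages). The choice of estimator is an assumption.
    """
    num = 0.0
    den = 0.0
    for a, b in zip(p1, p2):
        num += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        den += a * (1 - b) + b * (1 - a)
    return num / den if den > 0 else float("nan")


# Fixed differences at every SNP give complete differentiation.
print(hudson_fst([1.0, 1.0], [0.0, 0.0], 100, 100))
# Identical frequencies give a value at or slightly below zero.
print(hudson_fst([0.5, 0.2], [0.5, 0.2], 100, 100))
```

Slightly negative values are expected when two samples are drawn from the same population, because the estimator subtracts a sampling-variance correction.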
Fig 3. Improvement in imputation performance after the addition of add-on tags.
Mean INFO score and r² (between imputed and sequenced ground-truth genotypes) of target variants designed to be tagged by add-on tags, based on three array designs: 1) the H3Africa array without any add-on tags; 2) the H3Africa array with random add-on tags; 3) the H3Africa array with population-specific add-on tags selected using the proposed approach. Facet grids illustrate results based on two tag SNP selection settings: coverage-guaranteeing within prioritized regions (Setting 1) and efficiency-driven in all other regions (Setting 2). Error bars represent the standard error (SE) of the mean imputation quality within each MAF bin.
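For readers unfamiliar with the INFO metric, the sketch below computes an IMPUTE-style INFO score from per-sample genotype probabilities, together with the mean and standard error used for error bars of this kind. It assumes the IMPUTE-style definition applies to the scores reported here, and the function names and toy inputs are illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def info_score(probs):
    """IMPUTE-style INFO score for one variant.

    probs: list of (P(g=0), P(g=1), P(g=2)) tuples, one per sample.
    Returns 1 minus the ratio of the observed posterior dosage variance
    to the variance expected under Hardy-Weinberg proportions at the
    estimated allele frequency.
    """
    n = len(probs)
    e = [p1 + 2 * p2 for _, p1, p2 in probs]  # expected dosage per sample
    f = [p1 + 4 * p2 for _, p1, p2 in probs]  # expected squared dosage
    theta = sum(e) / (2 * n)                  # estimated allele frequency
    if theta <= 0 or theta >= 1:
        return 1.0  # convention for monomorphic variants
    uncertainty = sum(fi - ei * ei for fi, ei in zip(f, e))
    return 1 - uncertainty / (2 * n * theta * (1 - theta))


def mean_and_se(values):
    """Mean and standard error of the mean (sample SD / sqrt(n))."""
    return mean(values), stdev(values) / sqrt(len(values))


# Certain genotype calls carry full information.
certain = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 0, 0)]
# Hardy-Weinberg priors at frequency 0.5 carry none.
uninformative = [(0.25, 0.5, 0.25)] * 4
print(info_score(certain), info_score(uninformative))
print(mean_and_se([0.8, 0.9, 1.0]))
```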
Fig 4. Improvement in imputation performance in an example region.
Example region on chromosome 10 where the incorporation of add-on tags led to an increase in imputation performance. Facet grids illustrate imputation performance of the H3Africa array without any add-on tags, with random add-on tags, and with add-on tags selected under the proposed approach. The color of each dot represents the type of variant (existing H3Africa tags, add-on tags, or any other imputed variants).
Performance of add-on tags, categorized based on settings and methods.
Number of probes (Column 2) indicates the total number of Illumina probes that are required to genotype the add-on tags. The mean probe-ability score (Column 3) estimates the genotyping success rate for the selected add-on probes. The number of successfully tagged imputed variants is measured either as those with any improvement in INFO score (Column 4) or as those exceeding an INFO score of 0.8 when previously below it (Column 5). Per probe and per tag indicate the number of imputed variants with imputation improvements per add-on probe and per add-on tag, respectively. Standard error (SE) represents the variability of the per-probe and per-tag metrics across different genomic regions. %AFGR and %Tanz indicate the proportion of imputed variants with better imputation accuracy based on the AFGR or the internal Tanzanian reference panel, respectively.
| Method | Number of Probes | Mean Probe-ability Score (± SE) | Targets with Improvement in INFO Score | | | | Additional Targets Exceeding INFO Score of 0.8 | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | | Per Probe (± SE) | Per Tag (± SE) | %AFGR | %Tanz | Per Probe (± SE) | Per Tag (± SE) | %AFGR | %Tanz |
| Setting 1 | | | | | | | | | | |
| Proposed Approach | 2114 | 0.71±0.006 | 19.9±3.2 | 22.5±3.2 | 33.1 | 66.9 | 3.5±0.5 | 4.0±0.5 | 65.8 | 34.2 |
| Tagger | 2186 | 0.75±0.006 | 18.3±2.9 | 21.4±3.0 | 31.8 | 68.2 | 2.8±0.3 | 3.3±0.4 | 62.8 | 37.2 |
| Random Tags | NA | NA | NA | 18.7±2.4 | 26.8 | 73.2 | NA | 2.3±0.3 | 64.2 | 35.8 |
| Setting 2 | | | | | | | | | | |
| Proposed Approach | 2688 | 0.87±0.004 | 72.9±2.7 | 78.3±2.7 | 28.4 | 71.6 | 9.2±0.6 | 9.9±0.6 | 70.7 | 29.3 |
| Tagger | 2905 | 0.73±0.005 | 67.3±2.5 | 78.1±2.7 | 27.6 | 72.4 | 7.9±0.5 | 9.2±0.5 | 67.1 | 32.9 |
| Random Tags | NA | NA | NA | 65.8±2.5 | 26.9 | 73.1 | NA | 6.6±0.5 | 75.4 | 24.6 |