Anne-Katrin Emde, Amanda Phipps-Green, Murray Cadzow, C Scott Gallagher, Tanya J Major, Marilyn E Merriman, Ruth K Topless, Riku Takei, Nicola Dalbeth, Rinki Murphy, Lisa K Stamp, Janak de Zoysa, Philip L Wilcox, Keolu Fox, Kaja A Wasik, Tony R Merriman, Stephane E Castel.
Abstract
BACKGROUND: Historically, geneticists have relied on genotyping arrays and imputation to study human genetic variation. However, an underrepresentation of diverse populations has resulted in arrays that poorly capture global genetic variation, and a lack of reference panels. This has contributed to deepening global health disparities. Whole genome sequencing (WGS) better captures genetic variation but remains prohibitively expensive. Thus, we explored WGS at "mid-pass" 1-7x coverage.
Year: 2021 PMID: 34719381 PMCID: PMC8559369 DOI: 10.1186/s12864-021-07949-9
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Optimized methods for genotyping from mid-pass whole genome sequencing. a) Outline of strategy for variant calling and imputation using a combination of 100 high- and 1410 mid-pass sequenced genomes. Recall, precision, and non-reference concordance rate (NCR) calculated for imputed genotypes derived from mid-pass sequencing of the genomes of 92 individuals as a function of: pre-imputation call-level filtering using the GQ metric (keeping variants with GQ > X, b-d), binned sequencing coverage (e-g), and coverage both with (MP + HP, dotted lines) and without (MP, solid lines) inclusion of high-pass sequencing from 100 individuals in joint-calling and imputation (h-j). For runs without high-pass included (MP), data for the 100 individuals were substituted with mid-pass data. Metrics were calculated using previously available genotype calls derived from 30x whole-genome sequencing as a truth set. All metrics plotted were calculated for SNVs only (for indels see Figs. S3, S5, and S6). For boxplots, bottom whisker: Q1 - 1.5*interquartile range (IQR), top whisker: Q3 + 1.5*IQR, box: IQR, center: median, and outliers are not plotted for ease of viewing.
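The three metrics named in this caption can be illustrated with a minimal Python sketch. The record does not give the authors' exact formulas, so the definitions below are assumptions: genotypes are encoded as alternate-allele counts (0, 1, 2) with no missing calls, recall and precision are computed on variant presence, and NCR is taken as the share of identical genotypes among sites where the truth set or the call is non-reference. The function name `genotype_metrics` is hypothetical.

```python
from typing import Dict, Sequence


def genotype_metrics(truth: Sequence[int], calls: Sequence[int]) -> Dict[str, float]:
    """Compare called genotypes against a truth set, site by site.

    Assumed encoding: 0 = homozygous reference, 1 = heterozygous,
    2 = homozygous alternate. Both sequences cover the same sites.
    """
    if len(truth) != len(calls):
        raise ValueError("truth and calls must cover the same sites")

    tp = fp = fn = 0          # counted on variant presence (genotype > 0)
    concordant = considered = 0

    for t, c in zip(truth, calls):
        t_var, c_var = t > 0, c > 0
        if t_var and c_var:
            tp += 1
        elif c_var:
            fp += 1
        elif t_var:
            fn += 1
        # NCR ignores sites where both genotypes are homozygous reference.
        if t_var or c_var:
            considered += 1
            if t == c:
                concordant += 1

    return {
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "ncr": concordant / considered if considered else float("nan"),
    }


example = genotype_metrics(truth=[0, 1, 2, 1, 0, 2], calls=[0, 1, 1, 0, 1, 2])
```

Excluding concordant homozygous-reference sites keeps NCR from being inflated by the large majority of positions at which an individual carries no variant.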
Fig. 2 Performance of mid-pass whole genome sequencing across self-reported ethnicities and compared to array genotyping. A) Principal component analysis of imputed genotype data from 1410 mid-pass and 100 high-pass sequenced Polynesian individuals’ genomes. Data points are colored by self-reported ethnicity (EURO, European, MACI, Cook Islands Māori, MANZ, Aotearoa New Zealand Māori, NIUE, Niuean, OTHR, other, PNMI, Mixed Ethnicity Polynesian, PUKA, Pukapukan, SAMO, Samoan, TONG, Tongan, listed in alphabetical order) with symbols corresponding to the broader regional division of Polynesia (East, West or NA, not applicable). B) Performance measured using recall, precision, and non-reference concordance rate (NCR) for mid-pass derived imputed genotype calls across self-reported ethnicities. Metrics were calculated for the genomes of 100 individuals sequenced as part of this study at both high- and mid-pass, using the high-pass genotype calls as a truth set. C) Performance as a function of cohort size for individuals with self-reported Aotearoa New Zealand Māori ethnicity. Individuals were selected such that the smaller cohorts have less European ancestry admixture (Fig. S8). D) Performance calculated from imputed genotypes for 84 individuals binned by sequencing coverage with corresponding array data for comparison and using previously available 30x whole-genome sequencing genotype calls as a truth set. For boxplots, bottom whisker: Q1 - 1.5*interquartile range (IQR), top whisker: Q3 + 1.5*IQR, box: IQR, center: median, and outliers are not plotted for ease of viewing.
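The boxplot convention stated in the captions (whiskers at Q1 - 1.5*IQR and Q3 + 1.5*IQR) can be written out as a short Python sketch. The quartile method is an assumption, since the record does not say which one the plotting software used; the function name `boxplot_bounds` is hypothetical.

```python
import statistics
from typing import Dict, Sequence


def boxplot_bounds(values: Sequence[float]) -> Dict[str, float]:
    """Return the box and whisker positions described in the captions.

    Quartiles use the 'inclusive' method of statistics.quantiles,
    which is an assumption; other quartile definitions give slightly
    different values on small samples.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "bottom_whisker": q1 - 1.5 * iqr,
        "top_whisker": q3 + 1.5 * iqr,
    }


bounds = boxplot_bounds([1, 2, 3, 4, 5])
```

Points beyond either whisker are the outliers that the captions say are omitted from the plots.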
Fig. 3 Functional annotation of putatively Polynesian enriched variants identified by mid-pass sequencing. Variants are characterized as being absent (orange) or rare (MAF < 1%, green) in 1000 Genomes Phase 3 and common (MAF > 5%) in the study dataset. Breakdown of variants as a function of type (SNV/INDEL, A), class (coding, regulatory, or other, B), and predicted effect (C). Indels located in high-confidence regions of the genome and all SNVs were included in the analysis. Variant counts (y-axis) have been log-transformed for ease of viewing.
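The selection rule in this caption (common in the study dataset, absent or rare in 1000 Genomes Phase 3) can be sketched in Python as follows. The thresholds come from the caption; the function name `enrichment_class`, the use of `None` to mean "not present in the reference panel", and the handling of values exactly at a threshold are assumptions made for illustration.

```python
from typing import Optional


def enrichment_class(study_maf: float, ref_maf: Optional[float]) -> Optional[str]:
    """Label a variant as putatively enriched, following the caption's thresholds.

    study_maf: minor allele frequency in the study dataset.
    ref_maf:   minor allele frequency in the reference panel, or None if the
               variant is not present there (an assumed convention).
    Returns "absent", "rare", or None when the variant does not qualify.
    """
    if study_maf <= 0.05:      # must be common (MAF > 5%) in the study dataset
        return None
    if ref_maf is None:        # not observed in the reference panel
        return "absent"
    if ref_maf < 0.01:         # observed, but rare (MAF < 1%)
        return "rare"
    return None


labels = [
    enrichment_class(0.12, None),   # common here, absent in the reference
    enrichment_class(0.08, 0.004),  # common here, rare in the reference
    enrichment_class(0.08, 0.20),   # common in both, not enriched
    enrichment_class(0.02, None),   # not common in the study dataset
]
```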