Elżbieta Kaja, Adrian Lejman, Dawid Sielski, Mateusz Sypniewski, Tomasz Gambin, Mateusz Dawidziuk, Tomasz Suchocki, Paweł Golik, Marzena Wojtaszewska, Magdalena Mroczek, Maria Stępień, Joanna Szyda, Karolina Lisiak-Teodorczyk, Filip Wolbach, Daria Kołodziejska, Katarzyna Ferdyn, Maciej Dąbrowski, Alicja Woźna, Marcin Żytkiewicz, Anna Bodora-Troińska, Waldemar Elikowski, Zbigniew J Król, Artur Zaczyński, Agnieszka Pawlak, Robert Gil, Waldemar Wierzba, Paula Dobosz, Katarzyna Zawadzka, Paweł Zawadzki, Paweł Sztromwasser.
Abstract
Although Slavic populations account for over 4.5% of world inhabitants, no centralised, open-source reference database of genetic variation of any Slavic population exists to date. Such data are crucial for clinical genetics, biomedical research, as well as archeological and historical studies. The Polish population, which is homogenous and sedentary in nature but has been influenced by many past migrations, is unique and could serve as a genetic reference for the Slavic nations. In this study, we analysed whole genomes of 1222 Poles to identify and genotype a wide spectrum of genomic variation, such as small and structural variants, runs of homozygosity, mitochondrial haplogroups, and de novo variants. Common variant analyses showed that the Polish cohort is highly homogenous and shares ancestry with other European populations. In rare variant analyses, we identified 32 autosomal-recessive genes with significantly different frequencies of pathogenic alleles in the Polish population as compared to the non-Finnish Europeans, including C2, TGM5, NUP93, C19orf12, and PROP1. The allele frequencies for small and structural variants, calculated for 1076 unrelated individuals, are released publicly as The Thousand Polish Genomes database, and will contribute to the worldwide genomic resources available to researchers and clinicians.
Keywords: Polish genomes; allele frequency; allelic distribution; genome; population genomics; variant; whole-genome sequencing
Year: 2022 PMID: 35562925 PMCID: PMC9104289 DOI: 10.3390/ijms23094532
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 6.208
Figure 1. Polish cohort characteristics. (A) Sample collection locations, by Polish voivodeship, for 675 individuals (both related and unrelated) enrolled in “The Thousand Polish Genomes Project”. Only the subset of samples with available location data is shown. (B) Age distribution of 1076 unrelated study participants.
Figure 2. Distribution of allele frequencies across different variant types. DEL: large deletions; DUP: duplications; indel: short insertions and deletions; INV: inversions; SNV: substitutions. Variant impact categories follow VEP classification: HIGH: disruptive impact on the protein; MODERATE: non-disruptive variant that might change protein effectiveness; LOW: harmless or unlikely to change protein behaviour; MODIFIER: non-coding sequence variant.
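The allele frequencies discussed here follow the usual definition: alternate allele count (AC) divided by the number of called alleles (AN). A minimal sketch for a single biallelic autosomal site is shown below. The function names, the genotype encoding, and the treatment of tier boundaries (whether exactly 0.1% or 0.5% falls in the middle tier) are illustrative assumptions, not details taken from the study.

```python
def allele_frequency(genotypes):
    """Return (AC, AN, AF) for one biallelic autosomal site.

    genotypes: iterable of 2-tuples, each allele coded 0 (reference),
    1 (alternate), or None (missing call). Hypothetical encoding.
    """
    called = [a for gt in genotypes for a in gt if a is not None]
    an = len(called)          # number of called alleles
    ac = sum(called)          # number of alternate alleles
    af = ac / an if an else float("nan")
    return ac, an, af


def af_tier(af):
    """Assign an AF to one of the three tiers (boundary handling assumed)."""
    if af > 0.005:
        return ">0.5%"
    if af >= 0.001:
        return "0.1-0.5%"
    return "<0.1%"


# A variant carried once (heterozygous) among 1076 diploid individuals.
genotypes = [(0, 1)] + [(0, 0)] * 1075
ac, an, af = allele_frequency(genotypes)
print(ac, an, af_tier(af))
```

With 1076 unrelated individuals there are at most 2152 called alleles per autosomal site, so a variant observed once or twice falls in the rarest tier, and three observations are enough to cross 0.1%.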
Summary of variant counts in three allele frequency tiers (>0.5%; 0.1–0.5%; <0.1%). Variant impact categories follow VEP classification: HIGH: disruptive impact on the protein; MODERATE: non-disruptive variant that might change protein effectiveness; LOW: harmless or unlikely to change protein behaviour; MODIFIER: non-coding sequence variant.
| VARIANT_CLASS | AF | IMPACT: HIGH | IMPACT: MODERATE | IMPACT: LOW | IMPACT: MODIFIER |
|---|---|---|---|---|---|
| deletion | >0.5% | 412 | 603 | 855 | 1,208,322 |
| insertion | >0.5% | 260 | 573 | 977 | 1,380,654 |
| SNV | >0.5% | 1109 | 35,717 | 41,402 | 10,877,171 |
| deletion | 0.1–0.5% | 392 | 492 | 316 | 433,985 |
| insertion | 0.1–0.5% | 197 | 345 | 376 | 529,654 |
| SNV | 0.1–0.5% | 852 | 23,682 | 18,675 | 4,375,036 |
| deletion | <0.1% | 2849 | 1988 | 1003 | 1,295,678 |
| insertion | <0.1% | 1382 | 1144 | 826 | 1,037,730 |
| SNV | <0.1% | 5432 | 119,843 | 80,467 | 17,817,903 |
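One way to read the table above is to ask what share of all variants in each allele frequency tier is classified as HIGH impact. The sketch below re-enters the counts from the table by hand and computes that share per tier. The dictionary layout and the names `COUNTS` and `high_impact_share` are illustrative assumptions and are not part of the study's pipeline.

```python
# Counts copied from the table: (HIGH, MODERATE, LOW, MODIFIER)
# keyed by (AF tier, variant class). Tier labels use an ASCII hyphen.
TIERS = [">0.5%", "0.1-0.5%", "<0.1%"]
COUNTS = {
    (">0.5%", "deletion"): (412, 603, 855, 1_208_322),
    (">0.5%", "insertion"): (260, 573, 977, 1_380_654),
    (">0.5%", "SNV"): (1109, 35_717, 41_402, 10_877_171),
    ("0.1-0.5%", "deletion"): (392, 492, 316, 433_985),
    ("0.1-0.5%", "insertion"): (197, 345, 376, 529_654),
    ("0.1-0.5%", "SNV"): (852, 23_682, 18_675, 4_375_036),
    ("<0.1%", "deletion"): (2849, 1988, 1003, 1_295_678),
    ("<0.1%", "insertion"): (1382, 1144, 826, 1_037_730),
    ("<0.1%", "SNV"): (5432, 119_843, 80_467, 17_817_903),
}


def high_impact_share(counts):
    """Fraction of variants per AF tier that are HIGH impact."""
    share = {}
    for tier in TIERS:
        rows = [v for (t, _), v in counts.items() if t == tier]
        high = sum(r[0] for r in rows)        # first column is HIGH
        total = sum(sum(r) for r in rows)     # all impact categories
        share[tier] = high / total
    return share


shares = high_impact_share(COUNTS)
for tier in TIERS:
    print(f"{tier:>9}: {shares[tier]:.4%}")
```

On these counts the HIGH-impact share grows from the most common tier to the rarest one, which matches the depletion of disruptive variants among common variation described in the caption of Figure 4.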
Figure 3. Clustering of samples among the global and European subpopulations based on the random forest and PCA predictions. Comparison of (A) the first and second principal components for continental populations, (B) the first and third principal components for continental populations, (C) the first and second principal components for the European subpopulations, and (D) the first and third principal components for the European subpopulations. Analysed populations: Utah residents (CEPH) with Northern and Western European ancestry (CEU), Finnish in Finland (FIN), British in England and Scotland (GBR), Iberian in Spain (IBS), and Toscani in Italy (TSI).
Figure 4. Comparison of exonic and non-exonic variant consequences across the allele frequency spectrum. As expected, variants with the largest impact on the encoded protein (e.g., start–stop, frameshift) were depleted among common variants and enriched among rare ones. Conversely, the relative abundance of non-exonic, UTR, and synonymous variants, which do not alter the amino acid sequence, increased with variant population frequency.
Figure 5. Cumulative allele frequencies of selected ClinVar variants in the 32 AR genes with significant (q-value < 0.05) differences in pathogenic variant burden between POL and gnomAD NFE. Note that for two genes with significant differences, the only variant observed in POL was removed from ClinVar after the release we used. These were HTT (rs754013273; RCV001335909.1) and COL18A1 (rs528991245; RCV001329610.1).
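The caption filters genes by q-value, that is, a p-value adjusted for multiple testing. The sketch below shows one common adjustment, the Benjamini-Hochberg procedure. This record does not state which method the authors used, so the choice of procedure is an assumption, and the gene labels and p-values are made-up placeholders rather than results from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    q = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        q[idx] = running_min
    return q


# Placeholder per-gene p-values (hypothetical, for illustration only).
toy_p = {"gene_A": 0.01, "gene_B": 0.04, "gene_C": 0.03, "gene_D": 0.005}
toy_q = dict(zip(toy_p, benjamini_hochberg(list(toy_p.values()))))
significant = [gene for gene, q in toy_q.items() if q < 0.05]
print(toy_q, significant)
```

In practice the same adjustment is available in established statistics libraries; the hand-written version is shown only to make the ranking and monotonicity steps explicit.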