Gihan Daw Elbait, Andreas Henschel, Guan K Tay, Habiba S Al Safar.
Abstract
The ethnic composition of a country's population contributes to the uniqueness of each national DNA sequencing project and, ideally, individual reference genomes are required to reduce the confounding nature of ethnic bias. This work presents a representative whole genome sequencing effort for an understudied population. Specifically, high-coverage consensus sequences from 120 whole genomes and 33 whole exomes were used to construct the first population-specific major allele reference genome for the United Arab Emirates (UAE). When local Emirati genomes were analyzed against this reference instead of the archetype hg19 reference, the number of variant calls was reduced by ∼19% (i.e., some 1 million fewer calls). In compiling the United Arab Emirates Reference Genome (UAERG), annotated sets of 23,038,090 short variants (novel: 1,790,171) and 137,713 structural variants (novel: 8,462) were identified, together with their allele frequencies (AFs) and distribution across the genome. Population-specific genetic characteristics, including loss-of-function variants, admixture, and ancestral haplogroup distribution, are reported here. We also detect a strong correlation between FST and admixture components in the UAE. This baseline study was conceived to establish a high-quality reference genome and a resource of genetic variation to enable the development of regional population-specific initiatives and thus inform the application of population studies and precision medicine in the UAE.
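The idea of a major allele reference mentioned in the abstract can be illustrated with a minimal sketch. It assumes single-nucleotide variants only, 0-based positions, and a simple rule that a reference base is replaced whenever the cohort alternate allele frequency exceeds 0.5; the `Snv` class, the function name, and the toy data are hypothetical and are not taken from the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snv:
    pos: int        # 0-based position in the reference string (assumed convention)
    ref: str        # single reference base
    alt: str        # single alternate base
    alt_af: float   # alternate allele frequency in the cohort


def major_allele_reference(reference: str, variants: list[Snv], threshold: float = 0.5) -> str:
    """Return a copy of `reference` where majority alternate alleles replace the reference base."""
    bases = list(reference)
    for v in variants:
        if bases[v.pos] != v.ref:
            raise ValueError(f"reference mismatch at position {v.pos}")
        if v.alt_af > threshold:
            # The alternate allele is the major allele in the cohort, so adopt it.
            bases[v.pos] = v.alt
    return "".join(bases)


# Toy data (hypothetical): two of the three variants have AF > 0.5.
toy_reference = "ACGTACGT"
toy_variants = [
    Snv(pos=1, ref="C", alt="T", alt_af=0.82),
    Snv(pos=4, ref="A", alt="G", alt_af=0.07),
    Snv(pos=6, ref="G", alt="A", alt_af=0.51),
]
major_reference = major_allele_reference(toy_reference, toy_variants)
print(major_reference)  # ATGTACAT
```

A sample carrying the majority allele at such a position no longer differs from the modified reference there, which is the intuition behind the lower call counts reported against the UAERG.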
Keywords: Arab genome; UAE reference genome; next generation sequencing; population genetics; population representative sampling; reference genome; structural variants
Year: 2021 PMID: 33968136 PMCID: PMC8102833 DOI: 10.3389/fgene.2021.660428
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Overview of the data processing workflow, resources, and tools used for the joint short and structural variant calling and reference genome construction from the 153 UAE samples.
FIGURE 2. Principal component analysis (PCA)/admixture plot of 1,000 UAE nationals and 1,043 samples from the Human Genome Diversity Project (HGDP). From the phylogenetic tree of 1,000 genotype arrays, 120 UAE samples (predominantly cyan with an outline) were selected for whole genome sequencing. The admixture of UAE samples is shown as pie charts, with sector coloring consistent with HGDP population colors. The zoomed-in views display the genetic diversity of the UAE population, with admixtures dominated by Middle Eastern, Central/South Asian (large zoomed inset), and Sub-Saharan African (small zoomed inset) components.
The average alignment and genome coverage for the 153 WGS/WES from mapping the raw reads of each sample to the hg19 reference genome.
| Sequencing Type | Number of Samples | Number of reads | Mapped reads | Mapped reads % | Coverage X | Median insert size | GC content | Mean mapping quality |
| WGS | 120 | 717,002,612 | 634,706,849 | 87.02 | 28.8 | 397 | 40.5 | 49 |
| WES | 33 | 48,165,294 | 47,989,908 | 99.6 | 44.9 | 174 | 47.9 | 57 |
Summary of the autosomal and sex chromosome variants called by genotyping the 153 UAE nationals, showing the numbers of known and novel variants based on their overlap with dbSNP (build 151).
| Chr | Variants (integrated 153 UAE genomes) | Known (number) | Known (%) | Novel (number) | Novel (%) | Major alt allele (number) | Major alt allele (%) | Major alt allele, novel (number) | Major alt allele, novel (%) |
| Total | | | | | | | | 5,371 | 0.3% |
| 1 | 1,753,942 | 1,614,772 | 92% | 139,170 | 8% | 159,936 | 9% | 443 | 0.3% |
| 2 | 1,893,949 | 1,744,347 | 92% | 149,602 | 8% | 169,361 | 9% | 401 | 0.2% |
| 3 | 1,605,197 | 1,482,800 | 92% | 122,397 | 8% | 141,490 | 9% | 342 | 0.2% |
| 4 | 1,612,929 | 1,495,903 | 93% | 117,026 | 7% | 152,874 | 9% | 323 | 0.2% |
| 5 | 1,441,391 | 1,332,308 | 92% | 109,083 | 8% | 123,409 | 9% | 261 | 0.2% |
| 6 | 1,381,217 | 1,279,665 | 93% | 101,552 | 7% | 118,450 | 9% | 286 | 0.2% |
| 7 | 1,292,726 | 1,192,113 | 92% | 100,613 | 8% | 111,779 | 9% | 364 | 0.3% |
| 8 | 1,269,138 | 1,173,246 | 92% | 95,892 | 8% | 107,221 | 8% | 218 | 0.2% |
| 9 | 954,499 | 881,164 | 92% | 73,335 | 8% | 83,312 | 9% | 179 | 0.2% |
| 10 | 1,116,689 | 1,033,171 | 93% | 83,518 | 7% | 104,361 | 9% | 248 | 0.2% |
| 11 | 1,090,783 | 1,005,373 | 92% | 85,410 | 8% | 109,961 | 10% | 252 | 0.2% |
| 12 | 1,081,267 | 998,193 | 92% | 83,074 | 8% | 99,506 | 9% | 253 | 0.3% |
| 13 | 805,263 | 745,938 | 93% | 59,325 | 7% | 84,011 | 10% | 151 | 0.2% |
| 14 | 737,568 | 681,590 | 92% | 55,978 | 8% | 65,662 | 9% | 185 | 0.3% |
| 15 | 642,562 | 593,597 | 92% | 48,965 | 8% | 59,752 | 9% | 126 | 0.2% |
| 16 | 723,987 | 670,246 | 93% | 53,741 | 7% | 60,943 | 8% | 124 | 0.2% |
| 17 | 624,153 | 573,102 | 92% | 51,051 | 8% | 53,909 | 9% | 184 | 0.3% |
| 18 | 627,449 | 581,894 | 93% | 45,555 | 7% | 60,977 | 10% | 127 | 0.2% |
| 19 | 520,807 | 479,141 | 92% | 41,666 | 8% | 42,096 | 8% | 139 | 0.3% |
| 20 | 507,037 | 469,093 | 93% | 37,944 | 7% | 41,101 | 8% | 74 | 0.2% |
| 21 | 312,621 | 290,169 | 93% | 22,452 | 7% | 31,427 | 10% | 75 | 0.2% |
| 22 | 312,495 | 289,261 | 93% | 23,234 | 7% | 24,838 | 8% | 83 | 0.3% |
| X | 715,067 | 631,707 | 88% | 83,360 | 12% | 60,389 | 8% | 468 | 0.8% |
| Y | 15,354 | 9,126 | 59% | 6,228 | 41% | 978 | 6% | 65 | 6.6% |
FIGURE 3. Allele frequency (AF) histogram of all filtered variants identified in the 153 samples. The histogram shows the number of variants against their AFs (in 5% intervals). The highest peak represents variants with rare alternate alleles (AF < 5%), while the lowest variant counts occur where the reference allele is rare or unobserved (alternate AF of 100%).
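The binning behind such a histogram can be sketched as follows. This is a minimal illustration, assuming half-open 5% bins with an alternate AF of exactly 100% folded into the last bin; the function name and the toy frequencies are hypothetical and do not reproduce the study data.

```python
def af_histogram(alt_afs, bin_width_pct=5):
    """Count variants per AF bin; bins are [0,5), [5,10), ..., [95,100] percent."""
    if 100 % bin_width_pct != 0:
        raise ValueError("bin width must divide 100")
    n_bins = 100 // bin_width_pct
    counts = [0] * n_bins
    for af in alt_afs:
        if not 0.0 <= af <= 1.0:
            raise ValueError(f"allele frequency out of range: {af}")
        # An AF of exactly 1.0 is folded into the last bin.
        # Values lying exactly on a bin edge may be affected by floating-point rounding.
        index = min(int(af * 100 // bin_width_pct), n_bins - 1)
        counts[index] += 1
    return counts


# Toy alternate allele frequencies (hypothetical values).
toy_afs = [0.01, 0.02, 0.04, 0.07, 0.33, 0.52, 0.98, 1.0]
counts = af_histogram(toy_afs)
print(counts[0], counts[-1])  # 3 2
```

In this toy input the first bin (AF < 5%) is the most populated, mirroring the shape described in the caption.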
Variant call reduction using the UAERG on the sequences from four UAE nationals.
| Chr | S015 hg19 | S015 UAERG | S015 Difference | S015 Difference % | S016 hg19 | S016 UAERG | S016 Difference | S016 Difference % | S017 hg19 | S017 UAERG | S017 Difference | S017 Difference % | S018 hg19 | S018 UAERG | S018 Difference | S018 Difference % |
| 1 | 402,961 | 323,321 | 79,640 | 19.76 | 405,401 | 321,405 | 83,996 | 20.72 | 392,219 | 308,402 | 83,817 | 21.37 | 402,864 | 320,065 | 82,799 | 20.55 |
| 2 | 412,000 | 324,350 | 87,650 | 21.27 | 403,261 | 318,001 | 85,260 | 21.14 | 399,028 | 313,695 | 85,333 | 21.39 | 412,509 | 329,532 | 82,977 | 20.12 |
| 3 | 333,239 | 260,852 | 72,387 | 21.72 | 343,642 | 277,707 | 65,935 | 19.19 | 325,838 | 259,316 | 66,522 | 20.42 | 346,193 | 280,774 | 65,419 | 18.90 |
| 4 | 358,300 | 281,472 | 76,828 | 21.44 | 364,340 | 287,789 | 76,551 | 21.01 | 354,895 | 278,450 | 76,445 | 21.54 | 379,594 | 306,839 | 72,755 | 19.17 |
| 5 | 295,294 | 241,912 | 53,382 | 18.08 | 300,598 | 242,585 | 58,013 | 19.30 | 293,180 | 236,954 | 56,226 | 19.18 | 292,577 | 232,489 | 60,088 | 20.54 |
| 6 | 286,974 | 232,909 | 54,065 | 18.84 | 300,306 | 242,939 | 57,367 | 19.10 | 280,917 | 225,918 | 54,999 | 19.58 | 289,538 | 233,502 | 56,036 | 19.35 |
| 7 | 299,330 | 246,200 | 53,130 | 17.75 | 302,922 | 249,647 | 53,275 | 17.59 | 289,748 | 236,851 | 52,897 | 18.26 | 305,122 | 253,825 | 51,297 | 16.81 |
| 8 | 257,291 | 204,661 | 52,630 | 20.46 | 258,232 | 206,143 | 52,089 | 20.17 | 252,234 | 200,180 | 52,054 | 20.64 | 252,235 | 202,257 | 49,978 | 19.81 |
| 9 | 237,561 | 198,002 | 39,559 | 16.65 | 235,528 | 193,645 | 41,883 | 17.78 | 228,786 | 187,198 | 41,588 | 18.18 | 243,259 | 202,374 | 40,885 | 16.81 |
| 10 | 251,860 | 200,937 | 50,923 | 20.22 | 261,363 | 211,786 | 49,577 | 18.97 | 245,953 | 192,623 | 53,330 | 21.68 | 257,948 | 202,587 | 55,361 | 21.46 |
| 11 | 241,812 | 186,658 | 55,154 | 22.81 | 249,891 | 192,248 | 57,643 | 23.07 | 238,592 | 185,276 | 53,316 | 22.35 | 246,142 | 186,443 | 59,699 | 24.25 |
| 12 | 231,836 | 180,522 | 51,314 | 22.13 | 241,123 | 193,887 | 47,236 | 19.59 | 225,556 | 175,733 | 49,823 | 22.09 | 233,523 | 183,316 | 50,207 | 21.50 |
| 13 | 179,637 | 132,873 | 46,764 | 26.03 | 189,984 | 145,993 | 43,991 | 23.16 | 182,911 | 136,660 | 46,251 | 25.29 | 183,926 | 140,351 | 43,575 | 23.69 |
| 14 | 158,917 | 127,299 | 31,618 | 19.90 | 169,846 | 135,970 | 33,876 | 19.95 | 163,253 | 129,121 | 34,132 | 20.91 | 168,611 | 135,653 | 32,958 | 19.55 |
| 15 | 148,298 | 117,662 | 30,636 | 20.66 | 159,405 | 130,375 | 29,030 | 18.21 | 149,869 | 118,381 | 31,488 | 21.01 | 149,962 | 119,421 | 30,541 | 20.37 |
| 16 | 166,647 | 137,621 | 29,026 | 17.42 | 166,194 | 135,931 | 30,263 | 18.21 | 164,002 | 134,748 | 29,254 | 17.84 | 171,105 | 140,772 | 30,333 | 17.73 |
| 17 | 141,183 | 115,298 | 25,885 | 18.33 | 141,107 | 117,433 | 23,674 | 16.78 | 139,916 | 113,209 | 26,707 | 19.09 | 150,595 | 125,584 | 25,011 | 16.61 |
| 18 | 141,259 | 108,956 | 32,303 | 22.87 | 139,704 | 108,268 | 31,436 | 22.50 | 137,584 | 105,299 | 32,285 | 23.47 | 144,134 | 111,593 | 32,541 | 22.58 |
| 19 | 114,518 | 97,214 | 17,304 | 15.11 | 116,982 | 97,186 | 19,796 | 16.92 | 111,115 | 90,530 | 20,585 | 18.53 | 116,586 | 98,619 | 17,967 | 15.41 |
| 20 | 114,760 | 96,896 | 17,864 | 15.57 | 120,655 | 102,872 | 17,783 | 14.74 | 116,883 | 97,734 | 19,149 | 16.38 | 113,170 | 94,965 | 18,205 | 16.09 |
| 21 | 93,051 | 78,256 | 14,795 | 15.90 | 95,280 | 80,503 | 14,777 | 15.51 | 88,823 | 73,722 | 15,101 | 17.00 | 91,740 | 76,374 | 15,366 | 16.75 |
| 22 | 70,197 | 59,724 | 10,473 | 14.92 | 72,992 | 61,902 | 11,090 | 15.19 | 71,356 | 61,994 | 9,362 | 13.12 | 76,564 | 65,426 | 11,138 | 14.55 |
| X | 109,924 | 78,173 | 31,751 | 28.88 | 155,003 | 127,580 | 27,423 | 17.69 | 153,143 | 124,112 | 29,031 | 18.96 | 105,958 | 77,079 | 28,879 | 27.26 |
| Y | 23,186 | 22,353 | 833 | 3.59 | 21,726 | 21,614 | 112 | 0.52 | 18,826 | 18,881 | -55 | -0.29 | 23,125 | 23,152 | -27 | -0.12 |
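The "Difference" and "Difference %" columns follow from the two call counts: the difference is the hg19 count minus the UAERG count, and the percentage is that difference relative to the hg19 count. The short sketch below recomputes these values for the first data row of the table; the function name is hypothetical, and rounding to two decimals is assumed to match the table.

```python
def call_reduction(hg19_calls: int, uaerg_calls: int) -> tuple[int, float]:
    """Return (absolute difference, percent reduction relative to hg19)."""
    difference = hg19_calls - uaerg_calls
    percent = round(100 * difference / hg19_calls, 2)
    return difference, percent


# Call counts from the first data row of the table: (hg19, UAERG) per sample.
first_row = {
    "UAE S015": (402_961, 323_321),
    "UAE S016": (405_401, 321_405),
    "UAE S017": (392_219, 308_402),
    "UAE S018": (402_864, 320_065),
}
reductions = {sample: call_reduction(*counts) for sample, counts in first_row.items()}
for sample, (difference, percent) in reductions.items():
    print(sample, difference, percent)
```

A negative value, as in the last row for two of the samples, means that slightly more variants were called against the UAERG than against hg19 for that chromosome.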
Characterization of variants by their GEMINI functional impact.
| Impact severity | Variants impact | All | In repeat regions | In repeat regions “novel” | Not in repeat regions “novel” |
| High | Frameshift | 3,403 | 516 | 198 | 1,468 |
| High | Initiator codon | 16 | 1 | 0 | 1 |
| High | Splice acceptor | 1,063 | 208 | 34 | 182 |
| High | Splice donor | 1,057 | 155 | 18 | 234 |
| High | Start lost | 231 | 9 | 4 | 26 |
| High | Stop gained | 1,537 | 113 | 38 | 426 |
| High | Stop lost | 122 | 12 | 2 | 19 |
| Med | Disruptive inframe insertion | 691 | 370 | 68 | 55 |
| Med | Inframe deletion | 643 | 301 | 37 | 52 |
| Med | Inframe insertion | 590 | 245 | 53 | 76 |
| Med | Missense | 94,941 | 3,849 | 661 | 10,378 |
| Med | Disruptive inframe deletion | 1,423 | 831 | 99 | 113 |
| Low | 3 prime UTR | 223,737 | 49,374 | 5,076 | 16,416 |
| Low | 5 prime UTR premature start codon | 7,215 | 853 | 74 | 627 |
| Low | 5 prime UTR | 44,770 | 7,197 | 1,121 | 4,533 |
| Low | Downstream gene | 999,067 | 531,083 | 52,728 | 41,412 |
| Low | Exon | 40,623 | 14,979 | 1,326 | 2,030 |
| Low | Intergenic | 12,636,385 | 7,615,615 | 707,666 | 392,549 |
| Low | Intragenic | 23 | 13 | 2 | 0 |
| Low | Intron | 10,384,632 | 5,464,131 | 527,433 | 411,413 |
| Low | Start retained | 1 | 0 | 0 | 0 |
| Low | Stop retained | 272 | 4 | 0 | 4 |
| Low | Synonymous | 77,460 | 2,301 | 296 | 4,755 |
| Low | Upstream gene | 1,230,946 | 644,829 | 67,224 | 55,295 |
United Arab Emirates specific variants with high and medium functional impact severity.
| Impact severity | Chr | RS_id | Ref | Alt | Variant type | Functional impact | Gene symbol | Disease association (GeneCards) | UAE AF | gnomAD AF_all |
| High | 1 | rs765451626 | CTG | C | Indel | Frameshift | SPEN | Breast liposarcoma; breast cancer; and brain cancer | 0.05592 | 0.00334 |
| High | 1 | rs753994746 | CAGCTT | C | Indel | Frameshift | ESPN | Deafness (autosomal recessive 36) with or without vestibular involvement and Usher syndrome type I | 0.08609 | 0.00210 |
| High | 2 | rs527478913 | TCGCA | T | Indel | Frameshift | NRP2 | Wallerian degeneration; capillary hemangioma; and Hirschsprung disease 1 | 0.05556 | 0.00391 |
| High | 3 | rs749453662 | G | GTT | Indel | Frameshift | ZNF717 | None | 0.34545 | 0.00307 |
| High | 11 | rs368342230 | TG | T | Indel | Frameshift | MUC6 | | 0.05229 | 0.00000 |
| High | 11 | rs376177791 | G | GT | Indel | Frameshift | MUC6 | Pancreatic ductal carcinoma; tumor of exocrine pancreas; endocervical adenocarcinoma; signet ring cell adenocarcinoma; and gastric cancer | 0.05882 | 0.00001 |
| High | 11 | rs780061827 | AAT | A | Indel | Frameshift | MUC6 | | 0.27124 | 0.00000 |
| High | 11 | rs769713098 | G | GCA | Indel | Frameshift | MUC6 | | 0.28431 | 0.00000 |
| High | 19 | rs770233746 | GGCTT | G | Indel | Frameshift | MUC16 | Clear cell adenocarcinoma and ovarian cyst | 0.29934 | 0.00000 |
| High | X | rs1325813675 | C | CTT | Indel | Splice acceptor | STAG2 | Neurodevelopmental disorder; X-linked; with craniofacial abnormalities; and Xq25 duplication syndrome | 0.16807 | 0.00298 |
| High | X | rs1325813675 | C | CTTT | Indel | Splice acceptor | STAG2 | | 0.12832 | 0.00327 |
| High | X | rs139484145 | A | G | SNP | Stop lost | ARSD | Chondrodysplasia punctata (tibia-metacarpal) and atrial septal defect 2 | 0.32895 | 0.00014 |
| Medium | 2 | rs143372458 | C | T | SNP | Missense | ANKRD23 | Tibial muscular dystrophy; total anomalous pulmonary venous return 1; and dilated cardiomyopathy | 0.05298 | 0.00464 |
| Medium | 2 | rs146511220 | C | G | SNP | Missense | ACADL | Acyl-CoA dehydrogenase, very long-chain, and deficiency of acyl-CoA dehydrogenase deficiency | 0.05556 | 0.00180 |
| Medium | 2 | rs141080282 | G | A | SNP | Missense | LTBP1 | Geleophysic dysplasia and brachydactyly, Type C | 0.06250 | 0.00453 |
| Medium | 2 | rs142955097 | A | G | SNP | Missense | WDR35 | Short-rib thoracic dysplasia 7 with or without polydactyly and cranioectodermal dysplasia 2 | 0.07843 | 0.00148 |
| Medium | 2 | rs55660827 | A | G | SNP | Missense | MCM6 | Lactose intolerance, adult type, and lactose intolerance | 0.17974 | 0.00027 |
| Medium | 2 | rs1400511133 | GGGC | G | Indel | Disruptive inframe deletion | GDF7 | Barrett esophagus | 0.23077 | None |
| Medium | 4 | rs370593066 | A | C | SNP | Missense | DCLK2 | Lissencephaly; band heterotopia | 0.07190 | 0.00085 |
| Medium | 5 | rs144066680 | A | C | SNP | Missense | GPR151 | Brachial plexus lesion | 0.05229 | 0.00074 |
| Medium | 7 | rs1351676248 | G | T | SNP | Missense | ZSCAN21 | Retinitis pigmentosa 58 | 0.08553 | 0.00005 |
| Medium | 7 | rs75910050 | G | A | SNP | Missense | FAM220A | Pancreatic squamous cell carcinoma | 0.05882 | 0.00094 |
| Medium | 11 | rs675 | T | C | SNP | Missense | APOA4 | Carotenemia and demyelinating polyneuropathy | 0.07292 | |
| Medium | 16 | rs149365469 | C | T | SNP | Missense | ACD | Dyskeratosis congenita (autosomal dominant 6) and Hoyeraal-Hreidarsson syndrome | 0.05229 | 0.00075 |
| Medium | 16 | rs1799917 | A | C | SNP | Missense | GNAO1 | Epileptic encephalopathy, early infantile, 17, and neurodevelopmental disorder with involuntary movements | 0.05556 | 0.00000 |
| Medium | 17 | rs140375987 | C | T | SNP | Missense | CASC3 | Bone marrow cancer; myeloma (multiple); chronic polyneuropathy; spherocytosis, Type 5; and Kabuki syndrome 1 | 0.05882 | 0.00111 |
| Medium | 17 | rs758821377 | CTGT | C | Indel | Inframe deletion | KDM6B | Neurodevelopmental disorder with coarse facies and mild distal skeletal abnormalities; myelodysplastic syndrome; kidney cancer; and brain cancer | 0.22000 | 0.00060 |
| Medium | 19 | rs8107444 | A | T | SNP | Missense | ZNF28 | None | 0.07813 | 0.00004 |
| Medium | 19 | rs1427739410 | GGGC | G | Indel | Inframe deletion | BTBD2 | None | 0.56818 | None |
| Medium | 20 | rs778174473 | AGGGCCAGGGCCG | A | Indel | Disruptive inframe deletion | TAF4 | Huntington disease | 0.16667 | 0.00069 |
| Medium | X | rs78034736 | G | T | SNP | Missense | ARSD | Chondrodysplasia punctata, tibia-metacarpal type; and atrial septal defect 2 | 0.30921 | 0.00006 |
| Medium | X | rs73632978 | G | A | SNP | Missense | ARSD | | 0.32026 | 0.00014 |
| Medium | X | rs67272620 | A | T | SNP | Missense | ARSD | | 0.32353 | 0.00008 |
| Medium | X | rs67359049 | C | T | SNP | Missense | ARSD | | 0.32353 | 0.00007 |
| Medium | X | rs73632975 | A | T | SNP | Missense | ARSD | | 0.32353 | 0.00004 |
| Medium | X | rs73632976 | C | T | SNP | Missense | ARSD | | 0.32353 | 0.00006 |
| Medium | X | rs370769167 | C | T | SNP | Missense | ARSD | | 0.32680 | 0.00001 |
| Medium | X | rs115332247 | C | A | SNP | Missense | ARSD | | 0.32680 | 0.00002 |
| Medium | X | rs73632977 | A | T | SNP | Missense | ARSD | | 0.32680 | 0.00012 |
| Medium | X | rs73632953 | T | C | SNP | Missense | ARSD | | 0.32895 | 0.00015 |
| Medium | X | rs73632954 | A | G | SNP | Missense | ARSD | | 0.32895 | 0.00002 |
| Medium | X | rs143238998 | A | C | SNP | Missense | ARSD | | 0.32895 | 0.00001 |
| Medium | X | rs150899882 | C | A | SNP | Missense | ARSD | | 0.32895 | 0.00001 |
| Medium | X | rs113556864 | CCCACGCCGG | C | Indel | Disruptive inframe deletion | ARSD | | 0.32895 | 0.00001 |
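A comparison of local and global allele frequencies like the one in this table can be sketched as a simple filter. The thresholds below (UAE AF of at least 0.05 and gnomAD AF below 0.005) are assumptions chosen only because they are consistent with the values listed; the record does not state the selection criteria actually used. Treating a missing gnomAD frequency as passing the filter is likewise an assumption, and the `Variant` type, function name, and the last input row are hypothetical.

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    rs_id: str
    gene: str
    uae_af: float
    gnomad_af: Optional[float]  # None when no gnomAD frequency is listed


# Illustrative thresholds only (assumptions, not taken from the source).
MIN_UAE_AF = 0.05
MAX_GNOMAD_AF = 0.005


def is_population_enriched(v: Variant,
                           min_local: float = MIN_UAE_AF,
                           max_global: float = MAX_GNOMAD_AF) -> bool:
    """True when a variant is common locally but rare or absent globally."""
    if v.uae_af < min_local:
        return False
    # Assumption: a variant without a gnomAD frequency counts as globally rare.
    return v.gnomad_af is None or v.gnomad_af < max_global


variants = [
    Variant("rs765451626", "SPEN", 0.05592, 0.00334),     # values from the table
    Variant("rs55660827", "MCM6", 0.17974, 0.00027),      # values from the table
    Variant("rs1427739410", "BTBD2", 0.56818, None),      # values from the table
    Variant("rs_hypothetical", "GENE_X", 0.30000, 0.25000),  # made-up counterexample
]
enriched = [v.rs_id for v in variants if is_population_enriched(v)]
print(enriched)  # ['rs765451626', 'rs55660827', 'rs1427739410']
```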
FIGURE 4. Circos plot of the spatial distribution of short variants and structural variants (SVs) across all chromosomes (outer ring). From the outer to the inner rings: (A) short variants called from the 153 UAE samples (light red), novel (red), and novel UAE population-specific (dark red) variations, which indicate regional variability characteristics (note that scales are modified for visibility); (B) loss-of-function (yellow) and UAE population-specific variants (black line); (C) SV consensus set (dark purple), of which the (D) through (G) rings show insertions (dark green), deletions (blue), duplications (gray), translocations (dark yellow), and inversions (orange), respectively. The heatmap in the innermost plot of the figure displays the frequency of SVs.