Jina Kim1,2, Joohon Sung1,2,3, Kyudong Han4,5,6, Wooseok Lee4,5, Seyoung Mun4,5,7, Jooyeon Lee3, Kunhyung Bahk1,2, Inchul Yang8, Young-Kyung Bae8, Changhoon Kim9, Jong-Il Kim10,11, Jeong-Sun Seo9,12,13.
Abstract
The current human reference genome (GRCh38), with its superior quality, has contributed significantly to genome analysis. However, GRCh38 may still underrepresent ethnic genomes, specifically those of Asians, and exactly what is missing remains elusive. Here, we juxtaposed GRCh38 with a high-contiguity genome assembly of one Korean individual (AK1) to show that part of the AK1 genome is missing in GRCh38 and that the missing regions harbor ~1390 putative coding elements. Furthermore, by analyzing the reads of fourteen individuals (five East Asians, four Europeans, and five Africans) that were "unmapped" to GRCh38, we found that multiple populations share parts of the missing genome, amounting to ~5.3 Mb (~0.2% of AK1) of the total genomic regions. The AK1 regions recovered from the "unmapped" reads, which are the estimated missing regions absent from GRCh38, harbored candidate coding elements. We verified that most of the common missing regions (shared by ≥7 individuals) exist in human and chimpanzee DNA. Moreover, we identified the occurrence mechanism and ethnic heterogeneity of the common missing regions, as well as their presence. This study illuminates a potential advantage of using a pangenome reference and raises the need for further investigation of the various features of regions globally missing from GRCh38.
Keywords: human reference genome; missing information; occurrence mechanism; precise ethnic genome
Year: 2020 PMID: 33202901 PMCID: PMC7697454 DOI: 10.3390/genes11111350
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. A systematic comparison between AK1 scaffolds (n = 2832) and GRCh38.p12. Based on the degree of match computed with LASTZ [14], the AK1 scaffolds fell into three distinct patterns of synteny. The x axis (and the vertical pop-up axis for Group 1) represents the percentage of matches between an AK1 scaffold and the GRCh38.p12 chromosomes, and the y axis represents the number of scaffolds. GRCh38.p12, Genome Reference Consortium Human Build 38 patch release 12.
Figure 2. The process of realigning reads that did not map to GRCh38 onto AK1.
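The realignment summarized in Figure 2 starts from reads flagged as unmapped against GRCh38 and ends with reads placed on AK1. The following is a minimal sketch of the two filters such a workflow relies on, assuming the alignments are available as SAM text records. It is not the authors' pipeline: the function names are hypothetical and the records are invented toy data. It selects records whose SAM flag has the unmapped bit (0x4) set and, after realignment, keeps records whose mapping quality exceeds 10, the cutoff named in the unmapped-read table.

```python
# Minimal sketch: filter SAM records by the unmapped flag and by mapping quality.
# All records below are invented toy data, not reads from the study.

UNMAPPED_FLAG = 0x4  # SAM flag bit meaning "segment unmapped"


def parse_sam_record(line):
    """Extract the fields needed here from one tab-separated SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "name": fields[0],
        "flag": int(fields[1]),
        "reference": fields[2],
        "mapq": int(fields[4]),
    }


def iter_records(sam_lines):
    """Yield parsed alignment records, skipping '@' header lines and blank lines."""
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue
        yield parse_sam_record(line)


def unmapped_read_names(sam_lines):
    """Names of reads whose flag has the unmapped bit set."""
    return [r["name"] for r in iter_records(sam_lines) if r["flag"] & UNMAPPED_FLAG]


def confidently_mapped(sam_lines, min_mapq=10):
    """Records that are mapped and have mapping quality strictly above min_mapq."""
    return [
        r for r in iter_records(sam_lines)
        if not (r["flag"] & UNMAPPED_FLAG) and r["mapq"] > min_mapq
    ]


# Toy alignment against GRCh38: read2 and read3 are unmapped (flag 4).
grch38_sam = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGGTT\tIIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tCCATG\tIIIII",
]

# Toy realignment of those reads against an AK1 scaffold (positions are made up).
ak1_sam = [
    "read2\t0\tKV784719.1\t30209980\t42\t5M\t*\t0\t0\tGGGTT\tIIIII",
    "read3\t0\tKV784719.1\t500\t3\t5M\t*\t0\t0\tCCATG\tIIIII",
]

unmapped = unmapped_read_names(grch38_sam)
recovered = confidently_mapped(ak1_sam)
print(unmapped)
print([r["name"] for r in recovered])
```

In this toy input, read3 aligns to AK1 but with a mapping quality of 3, so it would count toward an "overall" mapped tally but not toward the "Mapping Quality > 10" tally.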
Statistics of the three groups of AK1 scaffolds according to a systematic comparison between AK1 scaffolds (n = 2832) and GRCh38.p12. Fix, patches that represent changes (error corrections or assembly improvements) to the GRCh38 genome; Random, the unlocalized contigs of GRCh38; GRCh38.p12, Genome Reference Consortium Human Build 38 patch release 12; * Sum of the sizes of the minor contributing chromosomes.
| | All | Group 1 | Group 2 | Group 3 |
|---|---|---|---|---|
| Number of Scaffolds | 2832 | 945 | 467 | 1420 |
| Total scaffold size (Scaffold N50) | 2904 Mb | 2697 Mb | 165 Mb | 41 Mb |
| Size matched with GRCh38.p12 (%) | 2851 Mb (98.2) | 2691 Mb (99.8) | 160 Mb (96.2) | 0 |
| by Sequence types | 2839 Mb | 2681 Mb | 158 Mb | 0 |
| Fix | 8047 kb | 7831 kb | 216 kb | 0 |
| Random | 2783 kb | 1906 kb | 878 kb | 0 |
| Unknown chromosomes | 1005 kb | 648 kb | 358 kb | 0 |
| Scaffolds matched multiple chromosomes of GRCh38.p12 | 487 | 343 | 144 | 0 |
| Total size of scaffolds contributed from multiple chromosomes * | 22.2 Mb | 21.1 Mb | 1.1 Mb | 0 |
The read counts of unmapped reads by samples.
| Sample ID | Ancestry | Population | Total Number of Unmapped Reads (K) | Unpaired Reads, Counts (K) (%) | Mapped on AK1, Overall | Mapped on AK1, Mapping Quality > 10 | Suggestive Microbial Origin, Read Count |
|---|---|---|---|---|---|---|---|
| HG02922 | AFR | Esan | 59,751 | 36,871 (61.7) | 205 (0.9) | 110 (0.5) | 318 |
| HG03052 | AFR | Mende | 34,958 | 21,174 (60.6) | 127 (0.9) | 67 (0.5) | 401 |
| NA19625 | AFR | African-American SW | 48,718 | 34,396 (70.6) | 121 (0.8) | 63 (0.4) | 353 |
| HG01879 | AFR | African-Caribbean | 35,674 | 198,064 (55.5) | 165 (1.0) | 78 (0.5) | 1191 |
| NA19017 | AFR | Luhya | 33,965 | 20,442 (60.2) | 96 (0.7) | 56 (0.4) | 2188 |
| HG00419 | EAS | Southern Han Chinese | 34,935 | 22,398 (64.1) | 131 (1.0) | 66 (0.5) | 527 |
| NA18525 | EAS | Han Chinese | 15,620 | 8,759 (56.1) | 51 (0.7) | 34 (0.5) | 517 |
| HG01595 | EAS | Kinh Vietnamese | 59,355 | 31,507 (53.1) | 265 (1.0) | 140 (0.5) | 3405 |
| NA18939 | EAS | Japanese | 27,950 | 15,520 (55.5) | 127 (1.0) | 66 (0.5) | 522 |
| HG00759 | EAS | Dai Chinese | 44,510 | 21,418 (48.1) | 234 (1.0) | 117 (0.5) | 512 |
| NA20502 | EUR | Tuscan | 26,343 | 19,640 (74.6) | 57 (0.9) | 33 (0.5) | 1557 |
| HG00096 | EUR | British | 29,915 | 16,773 (56.1) | 108 (0.8) | 64 (0.5) | 1878 |
| HG01500 | EUR | Spanish | 31,331 | 15,726 (50.2) | 164 (1.1) | 76 (0.5) | 2423 |
| HG00268 | EUR | Finnish | 19,255 | 12,139 (63.0) | 58 (0.8) | 36 (0.5) | 289 |
| Total Average (Mean ± SD) | | | 35,877 ± 13,193 | 21,184 ± 8091 (59.0%) | 137 ± 65 (0.92%) | 71 ± 31 (0.49%) | 1149 ± 988 |
Suggestive microbial origin was analyzed with GATK PathSeq. African-American SW, African-American Southwest.
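The summary row of the table can be recomputed from the per-sample values. The sketch below does this for the total number of unmapped reads, assuming the reported spread is the sample standard deviation (n − 1 denominator); the variable names are illustrative, and the input values are copied from the table.

```python
from statistics import mean, stdev

# Total number of unmapped reads (K) per sample, copied from the table above.
unmapped_reads_k = {
    "AFR": {"HG02922": 59751, "HG03052": 34958, "NA19625": 48718,
            "HG01879": 35674, "NA19017": 33965},
    "EAS": {"HG00419": 34935, "NA18525": 15620, "HG01595": 59355,
            "NA18939": 27950, "HG00759": 44510},
    "EUR": {"NA20502": 26343, "HG00096": 29915, "HG01500": 31331,
            "HG00268": 19255},
}

# Flatten to one list of the fourteen per-sample values.
all_values = [v for group in unmapped_reads_k.values() for v in group.values()]

overall_mean = mean(all_values)
overall_sd = stdev(all_values)  # sample standard deviation (n - 1), an assumption

# Per-ancestry means, computed the same way from the same column.
ancestry_means = {name: mean(group.values()) for name, group in unmapped_reads_k.items()}

print(f"All samples: {overall_mean:.0f} ± {overall_sd:.0f} (K reads)")
for name, value in ancestry_means.items():
    print(f"{name}: {value:.1f} (K reads)")
```

Under that assumption, the overall figures round to the 35,877 ± 13,193 given in the last row of the table.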
The distribution of repetitive sequences in the putative missing regions on AK1 scaffolds. The missing regions estimated from unmapped reads, 110 regions (≥10×, ≥2 individuals) and 38 regions (≥10×, ≥7 individuals), were examined for repetitive sequences with RepeatMasker. Values are mean % (SD).
| Family | Subfamily | 110 Regions (More than Ten Reads Are Mapped in More than Two Samples), Mean % (SD) | 38 Regions (More than Ten Reads Are Mapped in More than Seven Samples), Mean % (SD) |
|---|---|---|---|
| SINE | All | 8.01 (9.85) | 2.54 (5.41) |
| | ALUs | 6.41 (12.25) | 0.27 (1.63) |
| | MIRs | 1.6 (6.65) | 2.27 (7.37) |
| LINE | All | 7.34 (13.35) | 3.64 (13.80) |
| | LINE1 | 5.13 (15.50) | 3.64 (13.80) |
| | LINE2 | 2.21 (10.77) | 0 |
| | L3/CR1 | 0 | 0 |
| LTR | All | 2.47 (4.79) | 0.56 (2.50) |
| | ERVL | 0.88 (5.86) | 0 |
| | ERVL-MaLRs | 0.98 (4.35) | 0.56 (2.50) |
| | ERV-class I | 0.60 (3.93) | 0 |
| | ERV-class II | 0 | 0 |
| DNA | All | 0.14 (0.70) | 0 |
| | hAT-Charlie | 0.14 (0.70) | 0 |
| | TcMar-Tigger | 0 | 0 |
| Unclassified | | 0.48 (5.01) | 0 |
| Small RNA | | 0.05 (0.51) | 0 |
| Satellite | | 8.94 (26.92) | 7.85 (26.82) |
| Simple repeats | | 17.62 (33.73) | 10.82 (29.95) |
| Low complexity | | 11.80 (31.59) | 0.52 (2.00) |
SINE = Short interspersed nuclear elements; MIR = Mammalian-wide interspersed repeats; LINE = Long interspersed nuclear elements; LTR = Long terminal repeat; ERVL = Endogenous retrovirus-L; ERVL-MaLRs = Endogenous retrovirus-L-Mammalian apparent LTR Retrotransposons; ERV = Endogenous retroviruses.
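As a rough reading aid for the repeat table, the class-level means can be summed to approximate the mean repeat fraction of each region set. This is only a sketch: it assumes the repeat classes do not overlap within a region, which the table does not state, and it uses only the mean values copied from the table. It also checks that each "All" row agrees with the sum of its subfamilies up to rounding.

```python
# Class-level mean percentages copied from the table: (110 regions, 38 regions).
class_means = {
    "SINE": (8.01, 2.54),
    "LINE": (7.34, 3.64),
    "LTR": (2.47, 0.56),
    "DNA": (0.14, 0.0),
    "Unclassified": (0.48, 0.0),
    "Small RNA": (0.05, 0.0),
    "Satellite": (8.94, 7.85),
    "Simple repeats": (17.62, 10.82),
    "Low complexity": (11.80, 0.52),
}

# Subfamily means for the classes that the table splits further.
subfamily_means = {
    "SINE": [(6.41, 0.27), (1.6, 2.27)],
    "LINE": [(5.13, 3.64), (2.21, 0.0), (0.0, 0.0)],
    "LTR": [(0.88, 0.0), (0.98, 0.56), (0.60, 0.0), (0.0, 0.0)],
    "DNA": [(0.14, 0.0), (0.0, 0.0)],
}

# Approximate total repeat content, assuming non-overlapping classes.
total_110 = sum(pair[0] for pair in class_means.values())
total_38 = sum(pair[1] for pair in class_means.values())

# Largest gap between an "All" value and the sum of its subfamilies.
max_gap = max(
    abs(class_means[name][i] - sum(sub[i] for sub in subs))
    for name, subs in subfamily_means.items()
    for i in (0, 1)
)

print(f"110 regions: ~{total_110:.2f}% repeats")
print(f"38 regions:  ~{total_38:.2f}% repeats")
print(f"largest All-vs-subfamily gap: {max_gap:.2f}")
```

Under the non-overlap assumption, the regions shared by at least seven individuals carry a smaller mean repeat fraction than the larger set, driven mainly by fewer Alu, simple-repeat, and low-complexity sequences.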
Characteristics and verifications of the presence of the estimated globally missing regions on Group 1 scaffolds. The common candidate regions globally missing with ±2 kb of flanking sequences were searched and 20 of 31 globally missing regions (shared by ≥7 individuals) were verified by PCR.
| ID | Scaffold of AK1 | Estimated Location of Globally Missing Region (≥7 indiv), Start | Estimated Location of Globally Missing Region (≥7 indiv), End | UCSC BLAT: Human (GRCh38) | UCSC BLAT: Chimp (panTro6) | UCSC BLAT: Gorilla (gorGor4) | PCR: Eur 1 | PCR: Eur 2 | PCR: Eur 3 | PCR: Eur 4 | Hg38 Position | Verified Actual Indel Size (bp) | Breakpoint Structure | Mechanism | Microhomology (bp) | Microhomology Sequence or Homologous Sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| G1-1 | KV784719.1 | 30,209,977 | 30,210,924 | X | O | O | O | O | O | O | chr13:48910547-48914294 | 1198 | Unique-Unique | NHEJ | 0 | |
| G1-2 | KV784719.1 | 79,001,655 | 79,002,640 | X | N | O | X | X | X | X | chr13:97337324-97340428 | 1333 | SR-SR | NHEJ | 4 | TGTG |
| G1-3 | KV784719.1 | 93,452,303 | 93,455,222 | - | O | O | X | X | X | X | N/A | N/A | N/A | N/A | N/A | N/A |
| G1-4 | KV784719.1 | 93,470,705 | 93,471,918 | - | O | O | X | X | X | X | N/A | N/A | N/A | N/A | N/A | N/A |
| G1-5 | KV784720.1 | 27,885,647 | 27,886,104 | X | O | O | O | O | O | O | chr4:79781761-79785451 | 767 | Alu-Unique | NHEJ | 2 | CT |
| G1-6 | KV784723.1 | 8,349,171 | 8,349,628 | X | O | O | O | O | O | O | chr4:181366776-181370046 | 1192 | Unique-Unique | NHEJ | 0 | |
| G1-7 | KV784723.1 | 10,288,012 | 10,288,493 | X | O | O | Del | Del | Pol | Del | chr4:179430209-179433860 | 827 | Unique-Unique | NHEJ | 4 | ATTT |
| G1-8 | KV784723.1 | 34,400,763 | 34,401,227 | X | O | O | X | X | X | X | chr4:155347518-155351075 | 901 | Unique-Unique | NAHR | 38 | TTTCTTGTCTCCTGCCTTCTGCCAAGCCTTAGTCACAA |
| G1-9 | KV784731.1 | 15,610,509 | 15,611,959 | X | O | N | O | O | O | O | chr5:6446724-6450554 | 1636 | SR-Unique | NHEJ | 4 | CTGC |
| G1-10 | KV784736.1 | 6,179,476 | 6,184,176 | X | O | O | O | O | O | O | chr6:67607329-67611067 | 4961 | Alu-L1 | NHEJ | 4 | AAAA |
| G1-11 | KV784736.1 | 18,433,040 | 18,435,697 | X | O | O | O | O | O | O | chr6:79899617-79903449 | 2892 | Unique-Unique | NHEJ | 5 | GGACT |
| G1-12 | KV784738.1 | 33,432,222 | 33,432,240 | X | O | N | X | X | X | X | chr10:2389608-2395439 | 4163 | Unique-Unique | NHEJ | 5 | CCCTC |
| G1-13 | KV784747.1 | 1,225,842 | 1,227,344 | X | O | O | Del | Del | Pol | O | chr6:28174388-28177850 | 2035 | Unique-Unique | NHEJ | 2 | AG |
| G1-14 | KV784754.1 | 50,234,036 | 50,235,663 | X | O | O | O | O | O | O | chr8:136025060-136028726 | 1957 | Unique-Alu | NHEJ | 5 | ATCTC |
| G1-15 | KV784761.1 | 2,374,855 | 2,374,857 | X | - | - | O | O | O | O | chr18:13980325-13983782 | 543 | Unique-Unique | NHEJ | 4 | TCCT |
| G1-16 | KV784762.1 | 646,396 | 646,455 | X | N | N | X | X | X | X | chr19:869056-876703 | 2372 | G-rich-G-rich | NHEJ | 4 | GGGG |
| G1-17 | KV784762.1 | 942,159 | 943,260 | X | O | O | O | O | O | O | chr19:1160489-1162472 | 3127 | Alu-Alu | NAHR | 25 | CCTGTAATCCCAGCACTTTGGGAGG |
| G1-18 | KV784774.1 | 387,226 | 387,651 | X | O | O | X | X | X | X | chrX:47084676-47092500 | 2920 | SR-Alu | NHEJ | 3 | ATG |
| G1-19 | KV784797.1 | 27,753,978 | 27,754,392 | X | O | O | O | O | O | O | chr1:93874952-93876859 | 2521 | Unique-Alu | NHEJ | 0 | |
| G1-20 | KV784800.1 | 13,617,523 | 13,617,941 | X | O | O | Pol | Pol | O | Pol | chr10:63781277-63784929 | 763 | Alu-Unique | NHEJ | 4 | AGAA |
| G1-21 | KV784803.1 | 15,594,978 | 15,595,455 | X | O | O | O | Pol | Del | Del | chr14:88710100-88713185 | 1390 | LTR-LTR | NHEJ | 6 | GAACTG |
| G1-22 | KV784803.1 | 21,188,206 | 21,188,829 | X | O | O | Del | Del | O | O | chr14:83119034-83122153 | 1504 | L1-Unique | NHEJ | 3 | AGA |
| G1-23 | KV784804.1 | 4,078,861 | 4,078,900 | X | O | O | O | O | O | O | chr17:40521389-40524617 | 820 | Unique-Alu | NHEJ | 1 | G |
| G1-24 | KV784806.1 | 65,330,325 | 65,332,270 | X | O | O | O | O | O | O | chr2:21821760-21825542 | 2160 | L1-Unique | NHEJ | 1 | T |
| G1-25 | KV784811.1 | 3,734,091 | 3,735,143 | X | O | O | O | O | O | O | chr7:68760760-68763395 | 2414 | Alu-Unique | NHEJ | 3 | AAG |
| G1-26 | LPVO02000186.1 | 2,132,760 | 2,132,810 | X | O | O | Pol | Pol | O | Pol | chr3:95822539-95830080 | 2497 | L1-Unique | NHEJ | 0 | |
| G1-27 | LPVO02000191.1 | 8,716,140 | 8,716,258 | X | O | N | X | X | X | X | chr3:194273873-194277269 | 720 | G-rich-G-rich | NHEJ | 2 | GG |
| G1-28 | LPVO02000230.1 | 3,020,537 | 3,020,573 | X | X | N | X | X | X | X | chr5:181099166-181102877 | 615 | SR-SR | NHEJ | 3 | CCT |
| G1-29 | LPVO02000423.1 | 11,658,530 | 11,658,908 | X | O | O | X | X | X | X | chr11:101923894-101927461 | 806 | Alu-Alu | NHEJ | 8 | GTGCAGTG |
| G1-30 | LPVO02000423.1 | 13,811,264 | 13,811,292 | X | O | O | Del | Pol | Del | Pol | chr11:104076897-104080443 | 579 | LTR-Unique | NHEJ | 2 | TT |
| G1-31 | LPVO02000621.1 | 1,217,413 | 1,217,481 | X | N | N | X | X | X | X | chrX:2318537-2323680 | 4923 | Alu-Alu | NAHR | 24 | GTGGAGGTTGCAGTGAGCCGAGAT |
The Estimated Location of Globally Missing Region start/end (≥7 indiv) = start/end position of the sequence covered by unmapped reads from 7 or more samples; Eur, European; X, Does not exist; O, Same as AK1; "-", Not matched to the primate reference genome; N, Matched but ambiguous sequences (Ns) were included; Del, Deletion; Pol, Polymorphic; SR, Simple repeat; NHEJ, Non-homologous end-joining; NAHR, Non-allelic homologous recombination; N/A, Not available.
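The per-region calls in the table can be tallied to cross-check its caption and to summarize the mechanisms. The sketch below transcribes three columns from the table (the PCR calls for Eur 1 to Eur 4, the mechanism, and the microhomology length). The rule that a region counts as PCR-verified when at least one European sample is not marked X is an assumption made for this sketch, not a definition given in the source.

```python
from collections import Counter

# (ID, PCR calls for Eur 1-4, mechanism, microhomology in bp); None where the table says N/A.
ROWS = [
    ("G1-1", "O O O O", "NHEJ", 0), ("G1-2", "X X X X", "NHEJ", 4),
    ("G1-3", "X X X X", None, None), ("G1-4", "X X X X", None, None),
    ("G1-5", "O O O O", "NHEJ", 2), ("G1-6", "O O O O", "NHEJ", 0),
    ("G1-7", "Del Del Pol Del", "NHEJ", 4), ("G1-8", "X X X X", "NAHR", 38),
    ("G1-9", "O O O O", "NHEJ", 4), ("G1-10", "O O O O", "NHEJ", 4),
    ("G1-11", "O O O O", "NHEJ", 5), ("G1-12", "X X X X", "NHEJ", 5),
    ("G1-13", "Del Del Pol O", "NHEJ", 2), ("G1-14", "O O O O", "NHEJ", 5),
    ("G1-15", "O O O O", "NHEJ", 4), ("G1-16", "X X X X", "NHEJ", 4),
    ("G1-17", "O O O O", "NAHR", 25), ("G1-18", "X X X X", "NHEJ", 3),
    ("G1-19", "O O O O", "NHEJ", 0), ("G1-20", "Pol Pol O Pol", "NHEJ", 4),
    ("G1-21", "O Pol Del Del", "NHEJ", 6), ("G1-22", "Del Del O O", "NHEJ", 3),
    ("G1-23", "O O O O", "NHEJ", 1), ("G1-24", "O O O O", "NHEJ", 1),
    ("G1-25", "O O O O", "NHEJ", 3), ("G1-26", "Pol Pol O Pol", "NHEJ", 0),
    ("G1-27", "X X X X", "NHEJ", 2), ("G1-28", "X X X X", "NHEJ", 3),
    ("G1-29", "X X X X", "NHEJ", 8), ("G1-30", "Del Pol Del Pol", "NHEJ", 2),
    ("G1-31", "X X X X", "NAHR", 24),
]


def is_pcr_verified(calls):
    """Assumed rule: verified if any European sample has a call other than X."""
    return any(call != "X" for call in calls.split())


verified_ids = [rid for rid, calls, _, _ in ROWS if is_pcr_verified(calls)]
mechanism_counts = Counter(mech for _, _, mech, _ in ROWS if mech is not None)

# Range of microhomology lengths observed for each mechanism.
microhomology_range = {}
for mech in mechanism_counts:
    lengths = [mh for _, _, m, mh in ROWS if m == mech]
    microhomology_range[mech] = (min(lengths), max(lengths))

print(f"PCR-verified regions: {len(verified_ids)} of {len(ROWS)}")
print(dict(mechanism_counts))
print(microhomology_range)
```

With this rule the tally gives 20 of 31 regions, matching the caption. In the transcribed rows, the NHEJ events have microhomologies of 0 to 8 bp, whereas the three NAHR events have homologous stretches of 24 to 38 bp.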
Figure 3. An example of a region globally missing from GRCh38, examined with the UCSC Genome Browser, and the experimental verification of its existence. The region (Group 1), covered at high depth in 7 or more samples, was discovered within the inserted sequence (yellow block). The G1-26 region (insertion at chr3:95,825,553–95,825,555) was near L1M2. The yellow block is the estimated insertion relative to GRCh38 in the chain file. The grey blocks are repetitive sequences. The pink block is sequence present only in the GRCh38 genome. Chimp, chimpanzee; Eur, European.