Alaina Shumate, Aleksey V Zimin, Rachel M Sherman, Daniela Puiu, Justin M Wagner, Nathan D Olson, Mihaela Pertea, Marc L Salit, Justin M Zook, Steven L Salzberg.
Abstract
BACKGROUND: Thousands of experiments and studies use the human reference genome as a resource each year. This single reference genome, GRCh38, is a mosaic created from a small number of individuals, representing a very small sample of the human population. There is a need for reference genomes from multiple human populations to avoid potential biases.
Year: 2020 PMID: 32487205 PMCID: PMC7265644 DOI: 10.1186/s13059-020-02047-7
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Sequence data for assembly of the HG002 genome, all taken from the Genome In A Bottle Project
| Sequencing technology | Number of reads | Mean read length (bp) | Total sequence (bp) | Genome coverage |
|---|---|---|---|---|
| Illumina | 883,914,482 | 249 | 219,763,641,914 | 71x |
| ONT | 2,090,962 | 33,889 | 70,861,178,054 | 23x |
| PacBio HiFi | 9,270,502 | 9,567 | 88,695,245,383 | 29x |
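The coverage column can be approximately reproduced as total sequenced bases divided by genome size. The sketch below is a rough check only: the genome size of 3.1 Gb is an assumption (the table does not state the value used), and `ASSUMED_GENOME_SIZE_BP` and `coverage` are illustrative names.

```python
# Rough check of the "Genome coverage" column:
# coverage is approximately total sequenced bases / genome size.
# ASSUMPTION: a 3.1 Gb genome size; the table does not state the value used.
ASSUMED_GENOME_SIZE_BP = 3_100_000_000

# Total sequence (bp) per technology, copied from the table above.
total_bases = {
    "Illumina": 219_763_641_914,
    "ONT": 70_861_178_054,
    "PacBio HiFi": 88_695_245_383,
}


def coverage(total_bp: int, genome_size_bp: int = ASSUMED_GENOME_SIZE_BP) -> float:
    """Return fold coverage as total sequenced bases divided by genome size."""
    return total_bp / genome_size_bp


for tech, bp in total_bases.items():
    print(f"{tech}: {coverage(bp):.1f}x")
```

Under that assumed genome size, the rounded values agree with the 71x, 23x, and 29x figures in the table.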
Comparison of chromosome lengths and gaps between Ash1 and GRCh38. Chromosome lengths exclude all “N” characters. Every sequence of Ns was counted as a gap except for leading and trailing Ns. Several GRCh38 chromosomes begin or end with lengthy sequences of Ns numbering millions of bases; these were not counted as gaps here
| Chr | Ash1 v1.7 length (bp) | Ash1 v1.7 gap length (bp) | Ash1 v1.7 no. of gaps | GRCh38.p13 length (bp) | GRCh38.p13 gap length (bp) | GRCh38.p13 no. of gaps |
|---|---|---|---|---|---|---|
| 1 | 232,280,045 | 18,214,772 | 193 | 230,481,014 | 18,455,408 | 164 |
| 2 | 241,581,444 | 1,282,527 | 66 | 240,548,237 | 1,625,292 | 24 |
| 3 | 199,411,976 | 76,238 | 57 | 198,100,142 | 125,417 | 20 |
| 4 | 190,408,510 | 301,999 | 18 | 189,752,667 | 441,888 | 16 |
| 5 | 181,608,321 | 176,942 | 62 | 181,265,378 | 202,881 | 35 |
| 6 | 170,304,801 | 502,300 | 23 | 170,078,523 | 607,456 | 13 |
| 7 | 160,669,899 | 205,711 | 66 | 158,970,135 | 355,838 | 15 |
| 8 | 144,953,907 | 151,700 | 15 | 144,768,136 | 250,500 | 10 |
| 9 | 122,110,712 | 16,459,698 | 110 | 121,790,553 | 16,534,164 | 41 |
| 10 | 134,496,302 | 289,022 | 41 | 133,262,998 | 514,424 | 42 |
| 11 | 135,108,547 | 191,392 | 72 | 134,533,742 | 482,880 | 15 |
| 12 | 135,338,731 | 36,440 | 82 | 133,137,819 | 117,490 | 25 |
| 13 | 98,916,572 | 129,842 | 57 | 97,983,128 | 371,200 | 18 |
| 14 | 90,842,875 | 254,999 | 49 | 90,568,149 | 315,569 | 23 |
| 15 | 91,928,716 | 336,427 | 34 | 84,641,325 | 339,864 | 17 |
| 16 | 82,665,194 | 8,252,197 | 64 | 81,805,944 | 8,412,401 | 19 |
| 17 | 83,177,337 | 171,631 | 30 | 82,920,216 | 267,225 | 34 |
| 18 | 81,463,364 | 66,719 | 72 | 80,089,605 | 163,680 | 59 |
| 19 | 67,231,982 | 98,278 | 16 | 58,440,758 | 106,858 | 7 |
| 20 | 65,005,954 | 106,299 | 121 | 63,944,257 | 329,910 | 88 |
| 21 | 40,375,064 | 758,589 | 80 | 40,088,622 | 1,601,361 | 47 |
| 22 | 42,624,612 | 729,999 | 117 | 39,159,782 | 1,138,686 | 42 |
| X | 153,528,413 | 671,671 | 38 | 154,893,034 | 1,127,861 | 27 |
| Y | 27,085,372 | 33,413,257 | 33 | 26,415,048 | 30,792,367 | 54 |
| Total | 2,973,118,650 | 82,878,649 | | 2,937,639,212 | 84,680,620 | |
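The counting rules in the caption above can be sketched in a few lines. This is a minimal illustration on a toy string, not the authors' script: `gap_stats` is a hypothetical helper, and it assumes that the gap length column, like the gap count, leaves out leading and trailing runs of Ns.

```python
import re


def gap_stats(seq: str) -> dict:
    """Summarise a chromosome sequence using the counting rules in the caption."""
    seq = seq.upper()
    # Chromosome length excludes every N, including leading and trailing runs.
    length = len(seq) - seq.count("N")
    # Leading and trailing N runs are not counted as gaps, so trim them first.
    internal = seq.strip("N")
    # Each remaining maximal run of Ns is one gap.
    gaps = re.findall(r"N+", internal)
    return {
        "length_bp": length,
        # ASSUMPTION: gap length also ignores the leading/trailing Ns.
        "gap_length_bp": sum(len(g) for g in gaps),
        "num_gaps": len(gaps),
    }


# Toy sequence: 3 leading Ns, two internal gaps (2 and 4 Ns), 2 trailing Ns.
toy = "NNNACGTNNACGTTNNNNGGNN"
print(gap_stats(toy))
# {'length_bp': 11, 'gap_length_bp': 6, 'num_gaps': 2}
```

For real chromosomes the sequence would be read from a FASTA file one record at a time; the in-memory string here is only for illustration.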
The proportion of variant sites in the Ashkenazi reference genome that agree with major alleles from the gnomAD large-scale survey of the Ashkenazi population. Column headers show the frequency ranges of the Ashkenazi alternative alleles (ALT) from the gnomAD database. Row 3 shows the proportion of positions in Ash1 that agree with the gnomAD major allele where gnomAD differs from GRCh38
| Frequency (f) in Ashkenazi population | [0.25, 0.5] | (0.5, 0.6] | (0.6, 0.7] | (0.7, 0.8] | (0.8, 0.9] | (0.9, 1.0] | Total |
|---|---|---|---|---|---|---|---|
| Total no. of sites at Ashkenazi ALT allele frequency (f) | 1,706,379 | 442,352 | 369,541 | 300,969 | 252,859 | 424,967 | |
| Proportion of Ash1 sites that match gnomAD Ashkenazi allele | 0.317 | 0.759 | 0.846 | 0.910 | 0.955 | 0.982 | |
Eleven genes from GRCh38, 4 of them protein coding, that map to a different chromosome on Ash1. Genes are sorted by their position on GRCh38. Genes that appear to have moved in a block via a single translocation are highlighted in colored rows. Subtelomeric coordinates are indicated by (T) next to the coordinates. Abbreviations: NC noncoding
Fig. 1 Snapshot showing alignments of long PacBio reads to the Ash1 genome, centered on the left end of the location in chromosome 20 (position 65,079,275) where a translocation occurred between chromosome 15 (GRCh38) and 20 (Ash1). The top portion of the figure shows the coordinates on chr20. Below that is a histogram of read coverage, and the individual reads fill the bottom part of the figure. The indels in the reads, shown as colored bars on each read, are due to the relatively high error rate of the long reads
Ninety-four genes that are completely or mostly missing in Ash1. The Mapping status column shows “unmapped” if the gene is entirely missing from Ash1, and “partial” if less than 50% of the gene appears in Ash1. Forty of the genes are protein-coding and 54 are noncoding. All of the protein-coding genes are members of multi-gene families. Abbreviations: NC, noncoding
| CHESS ID | Gene name | Gene type | GRCh38 location | Mapping status |
|---|---|---|---|---|
| CHS.5 | LOC105379212 | NC | chr1:51943-53959 | Unmapped |
| CHS.6 | OR4F5 | Protein | chr1:69091-70008 | Unmapped |
| CHS.8 | LOC729737 | NC | chr1:134773-140566 | Unmapped |
| CHS.461 | PRAMEF9 | Protein | chr1:13175281-13179132 | Unmapped |
| CHS.2763 | LOC107985199 | Protein | chr1:143318207-143319096 | Unmapped |
| CHS.2764 | LOC105371172 | NC | chr1:143323047-143327009 | Unmapped |
| CHS.3550 | FCGR3B | Protein | chr1:161623196-161631963 | Unmapped |
| CHS.4311 | LOC103021295 | NC | chr1:205957925-205958388 | Unmapped |
| CHS.30466 | LIMS3-LOC440895 | NC | chr2:109898432-109968577 | Unmapped |
| CHS.32660 | LOC728323 | NC | chr2:242088633-242169503 | Unmapped |
| CHS.39504 | GTF2IP18 | NC | chr3:198185965-198189923 | Unmapped |
| CHS.39507 | Unnamed | NC | chr3:198219778-198222386 | Unmapped |
| CHS.45102 | LOC107986552 | NC | chr6:109026-111100 | Unmapped |
| CHS.52504 | OR4F21 | Protein | chr8:166086-167024 | Unmapped |
| CHS.52763 | LOC100133267 | Protein | chr8:12064389-12071747 | Unmapped |
| CHS.54931 | DDX11L5 | NC | chr9:11987-14525 | Unmapped |
| CHS.54937 | LINC01388 | NC | chr9:100804-114246 | Unmapped |
| CHS.54939 | FOXD4 | Protein | chr9:116231-118417 | Unmapped |
| CHS.56331 | LOC107987034 | Protein | chr9:104234781-104235568 | Unmapped |
| CHS.56391 | Unnamed | Protein | chr9:107257286-107261972 | Unmapped |
| CHS.7894 | OR51A2 | Protein | chr11:4954772-4955713 | Unmapped |
| CHS.11017 | PRB2 | Protein | chr12:11391540-11395564 | Unmapped |
| CHS.14171 | PRR20A | Protein | chr13:57140918-57143939 | Unmapped |
| CHS.14613 | METTL21C | Protein | chr13:102685747-102704311 | Unmapped |
| CHS.14764 | LOC102724510 | NC | chr13:111754561-111757459 | Unmapped |
| CHS.18131 | GOLGA6L5P | NC | chr15:84506168-84516847 | Unmapped |
| CHS.18488 | OR4F4 | Protein | chr15:101922142-101923059 | Unmapped |
| CHS.19166 | NPIPA3 | Protein | chr16:14704711-14726338 | Unmapped |
| CHS.20776 | LOC107987239 | NC | chr16:90220197-90225200 | Unmapped |
| CHS.19681 | TP53TG3B | Protein | chr16:33358385-33363478 | Unmapped |
| CHS.20874 | LOC105377826 | NC | chr17:61388-97400 | Unmapped |
| CHS.20875 | LOC101929823 | NC | chr17:97711-133841 | Unmapped |
| CHS.20876 | LOC101929828 | NC | chr17:110296-111566 | Unmapped |
| CHS.22187 | KRTAP9-6 | Protein | chr17:41265378-41265860 | Unmapped |
| CHS.23950 | LOC102724130 | NC | chr18:11103-15928 | Unmapped |
| CHS.23951 | Unnamed | NC | chr18:14195-14958 | Unmapped |
| CHS.23952 | LOC105371950 | NC | chr18:42666-4701 | Unmapped |
| CHS.34254 | LOC102724184 | NC | chr21:5011163-5017158 | Unmapped |
| CHS.34255 | LOC105379484 | NC | chr21:5011976-5012684 | Unmapped |
| CHS.34256 | LOC102723996 | Protein | chr21:5022044-5046678 | Unmapped |
| CHS.34276 | LOC102724370 | NC | chr21:6070758-6073132 | Unmapped |
| CHS.34887 | LOC107987302 | NC | chr21:43434853-43442401 | Unmapped |
| CHS.34888 | LINC00319 | NC | chr21:43450024-43453893 | Unmapped |
| CHS.34889 | LINC00313 | NC | chr21:43462094-43478223 | Unmapped |
| CHS.34912 | PWP2 | Protein | chr21:44107262-44131181 | Unmapped |
| CHS.34913 | C21orf33 | Protein | chr21:44133612-44145723 | Unmapped |
| CHS.34914 | LOC105377138 | Protein | chr21:44158746-44160189 | Unmapped |
| CHS.35279 | LOC105377190 | NC | chr22:21359596-21360702 | Unmapped |
| CHS.58009 | GAGE12J | Protein | chrX:49322030-49329387 | Unmapped |
| CHS.58010 | GAGE13 | Protein | chrX:49331603-49338952 | Unmapped |
| CHS.58011 | GAGE12B | Protein | chrX:49341183-49529921 | Unmapped |
| CHS.58270 | FAM226B | NC | chrX:72777073-72779095 | Unmapped |
| CHS.58374 | LOC102724150 | NC | chrX:89403129-89455254 | Unmapped |
| CHS.58376 | TGIF2LX | Protein | chrX:89921941-89922883 | Unmapped |
| CHS.58675 | RHOXF2B | Protein | chrX:120072264-120077742 | Unmapped |
| CHS.58694 | CT47A12 | Protein | chrX:120877490-120932399 | Unmapped |
| CHS.58695 | CT47A11 | Protein | chrX:120933840-120937260 | Unmapped |
| CHS.58696 | CT47A10 | Protein | chrX:120938701-120942121 | Unmapped |
| CHS.58697 | CT47A9 | Protein | chrX:120943561-120946981 | Unmapped |
| CHS.58854 | CT45A2 | Protein | chrX:135811668-135820062 | Unmapped |
| CHS.58856 | CT45A8 | Protein | chrX:135846497-135854588 | Unmapped |
| CHS.58857 | CT45A9 | Protein | chrX:135863418-135871812 | Unmapped |
| CHS.1790 | LOC107984964 | NC | chr1:61637114-61650098 | Partial |
| CHS.2787 | LOC105371206 | NC | chr1:144153168-144170705 | Partial |
| CHS.3547 | HSPA7 | NC | chr1:161601221-161608551 | Partial |
| CHS.3548 | FCGR2C | NC | chr1:161562688-161604463 | Partial |
| CHS.4366 | LOC105372881 | NC | chr1:207365822-207373252 | Partial |
| CHS.5223 | Unnamed | NC | chr1:248535005-248536680 | Partial |
| CHS.30144 | LOC105374854 | NC | chr2:88825277-88886154 | Partial |
| CHS.31297 | PHOSPHO2-KLHL23 | Protein | chr2:169694454-169751886 | Partial |
| CHS.39506 | Unnamed | NC | chr3:198198959-198219542 | Partial |
| CHS.50848 | NSUN5P2 | NC | chr7:72948293-72954763 | Partial |
| CHS.50952 | LOC541473 | NC | chr7:75391949-75395461 | Partial |
| CHS.54613 | LOC107986982 | Protein | chr8:140620807-140625255 | Partial |
| CHS.54936 | PGM5P3-AS1 | NC | chr9:72674-88826 | Partial |
| CHS.55501 | ZNF658B | NC | chr9:39443815-39464526 | Partial |
| CHS.55736 | LOC105376078 | NC | chr9:70669974-70714251 | Partial |
| CHS.56296 | LOC105376181 | NC | chr9:100901764-100906823 | Partial |
| CHS.6710 | LOC105378410 | NC | chr10:87189779-87194905 | Partial |
| CHS.8878 | PGA3 | Protein | chr11:61203307-61216278 | Partial |
| CHS.14172 | PRR20B | Protein | chr13:57147488-57150509 | Partial |
| CHS.17645 | LOC105376718 | NC | chr15:66858141-66867024 | Partial |
| CHS.18489 | LOC107987229 | NC | chr15:101936986-101939014 | Partial |
| CHS.18491 | FAM138E | NC | chr15:101954885-101956355 | Partial |
| CHS.20774 | LOC105371423 | NC | chr16:90186142-90219472 | Partial |
| CHS.34257 | LOC105372832 | NC | chr21:5055735-5062892 | Partial |
| CHS.34279 | LOC102724428 | Protein | chr21:6111134-6123778 | Partial |
| CHS.34916 | LOC105377139 | NC | chr21:44172147-44191773 | Partial |
| CHS.34917 | Unnamed | NC | chr21:44175401-44179738 | Partial |
| CHS.57466 | Unnamed | NC | chrX:3891438-3902000 | Partial |
| CHS.58012 | GAGE12C | Protein | chrX:49532177-49539541 | Partial |
| CHS.58377 | LOC105373292 | NC | chrX:90234591-90265462 | Partial |
| CHS.59131 | WASIR1 | NC | chrX:156014615-156017057 | Partial |
| CHS.59270 | VCY1B | Protein | chrY:14056222-14056958 | Partial |
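The rule behind the Mapping status column can be written as a small function. This is a sketch under stated assumptions: `mapping_status` is a hypothetical helper, the "Mapped" label for genes at or above the 50% threshold is invented here (such genes are not listed in the table), and the mapped base counts in the usage lines are made up for illustration.

```python
def mapping_status(gene_length_bp: int, mapped_bp: int) -> str:
    """Classify a gene by how much of it appears in the target assembly."""
    if gene_length_bp <= 0:
        raise ValueError("gene_length_bp must be positive")
    if mapped_bp == 0:
        # The gene is entirely missing from the assembly.
        return "Unmapped"
    if mapped_bp / gene_length_bp < 0.5:
        # Less than 50% of the gene appears in the assembly.
        return "Partial"
    # ASSUMPTION: label for genes that do not qualify for the table.
    return "Mapped"


# OR4F5 spans chr1:69091-70008, i.e. 918 bp if the coordinates are
# 1-based and inclusive (an assumption about the coordinate convention).
or4f5_length = 70008 - 69091 + 1

print(mapping_status(or4f5_length, 0))    # Unmapped
print(mapping_status(or4f5_length, 300))  # Partial (hypothetical mapped length)
print(mapping_status(or4f5_length, 918))  # Mapped (hypothetical mapped length)
```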
Comparison of protein-coding sequences between Ash1 and GRCh38. Here, “insertion” means an insertion in Ash1 relative to GRCh38, and the other terms likewise describe changes in Ash1 relative to GRCh38. “Truncated” indicates the transcript was only partially mapped. “Stop gained” refers to premature stop codons caused by a SNP
| Variant type | Number of coding sequences |
|---|---|
| Identical | 92,600 |
| Missense variant | 26,566 |
| In-frame deletion | 956 |
| In-frame insertion | 605 |
| Frameshift variant | 2,158 |
| Start lost | 169 |
| Stop gained | 416 |
| Stop lost | 58 |
| Truncated | 564 |
| Unmapped | 138 |
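To put the counts above in proportion, the sketch below converts them to shares of the listed total. It assumes the categories are mutually exclusive, so that summing them gives the number of coding sequences compared; the table does not state this, so the shares should be read as approximate.

```python
# Counts copied from the table above.
counts = {
    "Identical": 92_600,
    "Missense variant": 26_566,
    "In-frame deletion": 956,
    "In-frame insertion": 605,
    "Frameshift variant": 2_158,
    "Start lost": 169,
    "Stop gained": 416,
    "Stop lost": 58,
    "Truncated": 564,
    "Unmapped": 138,
}

# ASSUMPTION: categories do not overlap, so the sum is the total compared.
total = sum(counts.values())
share = {name: n / total for name, n in counts.items()}

print(f"Total coding sequences listed: {total:,}")
for name, frac in share.items():
    print(f"{name}: {frac:.1%}")
```

Under that assumption, roughly three quarters of the listed coding sequences are identical between Ash1 and GRCh38.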