Yun Sung Cho, Hyunho Kim, Hak-Min Kim, Sungwoong Jho, JeHoon Jun, Yong Joo Lee, Kyun Shik Chae, Chang Geun Kim, Sangsoo Kim, Anders Eriksson, Jeremy S Edwards, Semin Lee, Byung Chul Kim, Andrea Manica, Tae-Kwang Oh, George M Church, Jong Bhak.
Abstract
Human genomes are routinely compared against a universal reference. However, this strategy could miss population-specific and personal genomic variations, which may be detected more efficiently using an ethnically relevant or personal reference. Here we report a hybrid assembly of a Korean reference genome (KOREF) for constructing personal and ethnic references by combining sequencing and mapping methods. We also build its consensus variome reference, providing information on millions of variants from 40 additional ethnically homogeneous genomes from the Korean Personal Genome Project. We find that the ethnically relevant consensus reference can be beneficial for efficient variant detection. Systematic comparison of human assemblies shows the importance of assembly quality, suggesting the necessity of new technologies to comprehensively map ethnic and personal genomic structure variations. In the era of large-scale population genome projects, the leveraging of ethnicity-specific genome assemblies as well as the human reference genome will accelerate mapping all human genome diversity.
Year: 2016 PMID: 27882922 PMCID: PMC5123046 DOI: 10.1038/ncomms13637
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Figure 1. Schematic overview of KOREF assembly procedure.
(a) Short and long insert size libraries by Illumina whole-genome sequencing strategy. (b) Contig assembly using K-mers from short insert size libraries. (c) Scaffold assembly using long insert size libraries. (d) Super-scaffold assembly using OpGen whole-genome mapping approach. (e) Gap closing using PacBio long reads and Illumina TSLR. (f) Assembly assessment using BioNano consensus maps. (g) Chromosome sequence building using whole-genome alignment information into the human reference (GRCh38). (h) Common variants substitution using 40 Korean whole-genome sequences.
KOREF build statistics along the assembly steps.
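| Statistic | Contigs: length | Contigs: no. | Scaffolds: length | Scaffolds: no. | Super-scaffolds: length | Super-scaffolds: no. | Gap-closed scaffolds: length | Gap-closed scaffolds: no. | Chromosomes*: length | Chromosomes*: no. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N90 | 8.59 | 89,240 | 3.09 | 178 | 3.86 | 140 | 3.53 | 143 | 81.54 | 19 |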
| N80 | 14.62 | 63,987 | 6.45 | 116 | 9.45 | 92 | 9.26 | 93 | 103.05 | 16 |
| N70 | 20.42 | 47,417 | 10.45 | 81 | 14.47 | 67 | 14.53 | 67 | 136.43 | 13 |
| N60 | 26.58 | 35,099 | 16.16 | 59 | 19.56 | 49 | 19.36 | 50 | 137.59 | 11 |
| N50 | 33.38 | 25,446 | 19.85 | 42 | 25.93 | 36 | 26.08 | 36 | 155.88 | 8 |
| Longest | 334.16 | — | 81.91 | — | 101.22 | — | 101.48 | — | 251.92 | — |
| Gaps | 0% | — | 1.65% | — | 1.75% | — | 1.06% | — | 9.44% | — |
| Total (≥200 bp) | 2.87 Gb | 230,514 | 2.92 Gb | 68,170 | 2.92 Gb | 68,103 | 2.94 Gb | 68,451 | 3.12 Gb | 24 |
| Total (≥10 Kb) | 2.52 Gb | 82,254 | 2.88 Gb | 1,243 | 2.88 Gb | 1,176 | 2.90 Gb | 1,369 | 3.12 Gb | 24 |
*unplaced scaffolds were excluded.
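The N50 to N90 rows above each pair a length with a sequence count. As a minimal sketch of how such values are conventionally computed (not the authors' pipeline; the function name `nx_stats` and the toy lengths are illustrative assumptions), sequences are sorted from longest to shortest and accumulated until the chosen fraction of the total assembly length is reached:

```python
def nx_stats(lengths, fraction):
    """Return (Nx length, number of sequences needed) for a fraction such as 0.5 (N50).

    Sequences are taken from longest to shortest until their cumulative
    length reaches `fraction` of the total; the length of the last sequence
    taken is Nx, and how many were taken is the accompanying count.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("fraction must be between 0 and 1")


# Hypothetical toy assembly (total length 300), not data from the table.
toy_lengths = [100, 80, 60, 40, 20]
n50 = nx_stats(toy_lengths, 0.5)  # 100 + 80 = 180 >= 150
n90 = nx_stats(toy_lengths, 0.9)  # 100 + 80 + 60 + 40 = 280 >= 270
print(n50, n90)
```

Under this definition a stricter threshold such as N90 yields a shorter length and a larger count than N50, which matches the direction of the rows in the table.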
Systematic comparison of assembly quality.
| GRCh38<sup>C</sup> | 3,209,286,105 | 67.79/16 | — | 212,777,868 (6.63%) | 1,564,209,365 (48.74%) | 20,135 |
| KOREF_C<sup>S,L,M</sup> | 3,211,075,818 | 26.46/35 | 88.47 (scaffolds) | 149,353,191 (4.65%) | 1,452,404,484 (45.23%) | 17,758 |
| CHM1_PacBio_r2<sup>L</sup> | 2,996,426,293 | 26.90/30 | 88.02 | 205,559,250 (6.86%) | 1,541,211,387 (51.43%) | 17,657 |
| CHM1_1.1<sup>S,B</sup> | 3,037,866,619 | 50.36/20 | — | 157,426,845 (5.18%) | 1,417,977,130 (46.68%) | 18,040 |
| NA12878_single<sup>L,M</sup> | 3,176,574,379 | 26.83/37 | 88.26 | 168,652,649 (5.31%) | 1,545,168,387 (48.64%) | 6,610 |
| NA12878_Allpaths<sup>S</sup> | 2,786,258,565 | 12.08/67 | 82.89 | 90,343,965 (3.24%) | 1,250,655,296 (44.89%) | 16,995 |
| HuRef<sup>C</sup> | 2,844,000,504 | 17.66/48 | 85.85 | 134,317,812 (4.72%) | 1,411,487,301 (49.63%) | 16,968 |
| Mongolian<sup>S</sup> | 2,881,945,563 | 7.63/111 | 86.54 | 121,384,034 (4.21%) | 1,399,420,366 (48.56%) | 17,189 |
| YH_2.0<sup>S</sup> | 2,911,235,363 | 20.52/39 | 86.31 | 127,254,909 (4.37%) | 1,397,013,571 (47.99%) | 17,125 |
| African<sup>S</sup> | 2,676,008,911 | 0.062/11,689 | 69.47 | 55,830,170 (2.09%) | 968,988,149 (36.21%) | 9,167 |
NGS, next-generation sequencing. Major sequencing and mapping data used in the assembly are marked by superscript letters: C, chain-terminating Sanger sequences; B, indexed BAC end sequences; L, long reads; M, genome maps; S, NGS short reads.
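The column headers of this comparison table did not survive in this record, so what each base count measures is not stated here. The bracketed percentages do appear to equal each base count divided by the assembly's total length in the second column. The short check below illustrates this arithmetic for the first row; the helper name `percent_of_total` is an assumption for illustration.

```python
def percent_of_total(bases, total_bases):
    """Express a base count as a percentage of total assembly length, to 2 decimals."""
    return round(100.0 * bases / total_bases, 2)


# Values copied from the first row of the table (GRCh38).
grch38_total = 3_209_286_105
first_pct = percent_of_total(212_777_868, grch38_total)
second_pct = percent_of_total(1_564_209_365, grch38_total)
print(first_pct, second_pct)  # compare with 6.63% and 48.74% in the table
```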
Summary of SVs in eight human assemblies compared with GRCh38.
| KOREF_C<sup>S,L,M</sup> | 9,838 | 8,392 (85.7) | 6,992 (71.1) | 912 (9.3) | 6,691 (68.3) | 955 (9.7) |
| Mongolian<sup>S</sup> | 12,830 | 10,775 (87.7) | 8,929 (69.6) | 1,242 (9.7) | 9,101 (74.1) | 834 (6.8) |
| YH_2.0<sup>S</sup> | 5,027 | 4,664 (93.8) | 4,119 (81.9) | 633 (12.6) | 3,063 (61.6) | 148 (3.0) |
| CHM1_PacBio_r2<sup>L</sup> | 3,454 | 3,130 (92.0) | 2,340 (67.7) | 1,002 (29.0) | 2,448 (72.0) | 301 (8.8) |
| CHM1_1.1<sup>S,B</sup> | 3,926 | 3,258 (83.7) | 2,848 (72.5) | 394 (10.0) | 2,800 (71.9) | 487 (12.5) |
| NA12878_single<sup>L,M</sup> | 4,859 | 4,171 (86.7) | 3,339 (68.7) | 1,041 (21.4) | 3,492 (72.6) | 400 (8.3) |
| NA12878_Allpaths<sup>S</sup> | 5,179 | 4,649 (91.0) | 4,014 (77.5) | 378 (7.3) | 3,787 (74.1) | 269 (5.3) |
| African<sup>S</sup> | 10,772 | 10,026 (94.0) | 8,362 (77.6) | 425 (3.9) | 8,935 (83.8) | 212 (2.0) |
NGS, next-generation sequencing. Major sequencing and mapping data used in the assembly are marked by superscript letters: B, indexed BAC end sequences; L, long reads; M, genome maps; S, NGS short reads.
Figure 2. SVs among human assemblies.
(a) The correlation between N50 length of fragments (scaffolds or contigs) and fraction of novel SVs. (b) The correlation between N50 length of fragments and fraction of SVs shared with the CHM1 PacBio read mapping method. (c) Exclusively shared SVs among human assembly sets. SVs shared (reciprocally 50% covered) by only denoted assemblies were considered in this figure. (d) An example of SV that was shared by nine human assemblies. Grey regions denote structural differences shared among all the assemblies, and horizontal lines indicate homologous sequence regions.
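Panel (c) treats two SVs as shared when they are reciprocally 50% covered. A minimal sketch of that test follows, assuming each SV is reduced to a half-open interval `[start, end)` on the same chromosome. The function name, coordinate convention and example coordinates are illustrative assumptions, and the caption does not say whether SV type must also match.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end, min_fraction=0.5):
    """Return True if the overlap covers at least `min_fraction` of BOTH intervals.

    Intervals are half-open [start, end) and assumed to lie on the same chromosome.
    """
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return False  # disjoint or merely touching intervals
    len_a = a_end - a_start
    len_b = b_end - b_start
    return overlap >= min_fraction * len_a and overlap >= min_fraction * len_b


# Hypothetical coordinates, not taken from the paper.
shared = reciprocal_overlap(1000, 2000, 1400, 2200)      # overlap 600: 60% of A, 75% of B
not_shared = reciprocal_overlap(1000, 2000, 1900, 5000)  # overlap 100: only 10% of A
print(shared, not_shared)
```

Requiring the threshold in both directions prevents a short SV that sits inside a much longer one from being counted as the same event.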
Figure 3. Variants difference depending on the reference genome.
The numbers of variants (SNVs and small indels) within the regions shared by KOREFs, GRCh38 and GRCh38_C were compared using whole-genome re-sequencing data from three ethnic groups (Africans: Mandenka, Yoruba, San, Mbuti and Dinka; Caucasians: Sardinian, French and three CEPH/Utah (CEU); East-Asians: Mongolian, two Chinese, two Japanese and five Koreans). (a) Number of homozygous SNVs. (b) Number of homozygous small indels. (c) Number of heterozygous SNVs. (d) Number of heterozygous small indels. (e) The number of variants (referenced by GRCh38 and KOREF_C) at different levels of sharedness. (f) The number of reference-specific variants at different levels of sharedness.
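Panels (a) to (d) split variant calls by type (SNV or small indel) and by zygosity. The sketch below shows one way such a tally could be made, assuming VCF-style inputs: a reference allele, a single alternate allele and a genotype string such as `0/1`. The function name and example calls are hypothetical, and the size cut-off the authors used for "small" indels is not given here, so none is applied.

```python
from collections import Counter


def classify_variant(ref, alt, genotype):
    """Classify one call as (type, zygosity) from VCF-style REF, ALT and GT fields."""
    if len(ref) == 1 and len(alt) == 1:
        kind = "SNV"
    elif len(ref) != len(alt):
        kind = "indel"
    else:
        kind = "other"  # e.g. multi-nucleotide substitutions
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        zygosity = "unknown"       # missing genotype
    elif all(a == "0" for a in alleles):
        zygosity = "reference"     # no alternate allele called
    elif len(set(alleles)) == 1:
        zygosity = "homozygous"
    else:
        zygosity = "heterozygous"
    return kind, zygosity


# Hypothetical example calls, not data from the study.
calls = [("A", "G", "1/1"), ("A", "AT", "0/1"), ("C", "T", "0|1"), ("GTC", "G", "1/1")]
counts = Counter(classify_variant(*call) for call in calls)
print(counts)
```

Because a homozygous call against one reference can disappear when a reference carrying the population's common allele is used, a per-category tally of this kind is what makes the comparison across references in the figure possible.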