Jungeun Kim, Jessica A Weber, Sungwoong Jho, Jinho Jang, JeHoon Jun, Yun Sung Cho, Hak-Min Kim, Hyunho Kim, Yumi Kim, OkSung Chung, Chang Geun Kim, HyeJin Lee, Byung Chul Kim, Kyudong Han, InSong Koh, Kyun Shik Chae, Semin Lee, Jeremy S Edwards, Jong Bhak.
Abstract
High-coverage whole-genome sequencing data of a single ethnicity can provide a useful catalogue of population-specific genetic variations and a critical resource for more accurately identifying pathogenic genetic variants. We report a comprehensive analysis of the Korean population and present the Korean National Standard Reference Variome (KoVariome). As part of the Korean Personal Genome Project (KPGP), we constructed the KoVariome database using 5.5 terabases of whole-genome sequence data from 50 healthy Korean individuals in order to characterize the benign ethnicity-relevant genetic variation present in the Korean population. In total, KoVariome includes 12.7M single-nucleotide variants (SNVs), 1.7M short insertions and deletions (indels), 4K structural variations (SVs), and 3.6K copy number variations (CNVs). Among them, 2.4M (19%) SNVs and 0.4M (24%) indels were identified as novel. We also discovered selective enrichment of 3.8M SNVs and 0.5M indels in Korean individuals, which were used to filter out 1,271 coding SNVs not originally removed by filtering against the 1,000 Genomes Project when prioritizing disease-causing variants. KoVariome health records were used to identify novel disease-causing variants in the Korean population, demonstrating the value of high-quality ethnic variation databases for the accurate interpretation of individual genomes and the precise characterization of genetic variations.
Year: 2018 PMID: 29618732 PMCID: PMC5885007 DOI: 10.1038/s41598-018-23837-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Statistics of KoVariome.
| Statistic | Value |
|---|---|
| Sequencing | |
| No. of samples (Male/Female) | 50 (31/19) |
| Total NGS yield | 5.5 terabases |
| Average sequenced depth | 31x |
| Average mapped read rate | 95% |
| SNVs | |
| Total No. of SNVs | 12,735,004 |
| No. of known variants in 1000GPᵃ | 8,967,464 |
| No. of known variants in dbSNPᵇ | 10,286,599 |
| Average No. of SNVs per sample | 3,813,311 |
| Average No. of coding SNVsᶜ | 20,097 |
| Average No. of non-synonymous SNVsᶜ | 10,394 |
| Average No. of SNVs with high effectsᶜ | 287 |
| Indels | |
| Total No. of indels | 1,743,117 |
| No. of known variants in 1000GPᵃ | 848,471 |
| No. of known variants in dbSNPᵇ | 1,307,000 |
| Average No. of indels per sample | 503,553 |
| Average No. of coding indelsᶜ | 258 |
| Average No. of LOF indelsᶜ | 157 |
Variants deposited in ᵃ1000GP and ᵇdbSNP (ver. 146). ᶜPredicted with SnpEff.
Figure 1. Status of KPGP variomes analyzed using 50 unrelated Korean individuals. (A) Accumulation of novel SNV alleles. The number of novel SNV alleles was defined as the newly identified nucleotides compared with previously constructed SNVs in KoVariome. (B) Genetic distance according to the familial relationships. Abbreviations: Monozygotic Twin (MT), Parent and child (PC), Brothers (Br), Grandparents vs. grandchildren (GPC), Uncle vs. Nephew (UN), and Cousins (Co).
Figure 2. Genetic features of KoVariome. (A) Two-dimensional classification of KoVariome. SNVs and indels observed in 1000GP data were classified based on the minor allele frequencies (MAF): ‘1000GP Common’: MAF >= 5% in all five continents; ‘1000GP Low frequency’: MAF >= 0.1% in any continent; and ‘1000GP Rare’: MAF < 0.1% in all five continents. The five continental populations included African (AFR), European (EUR), Native American (AMR), South Asian (SAS), and East Asian (EAS). The second group was classified by the number of variants in KoVariome: ‘Frequent in KoVariome’ (>= 3) and ‘Rare in KoVariome’ (< 3). (B) The Venn diagrams represent the number of variants enriched in specific continents for both SNVs (left) and indels (right). The enrichment was analyzed by Fisher’s exact test based on odds ratio > 3 and p-value < 0.05. The total numbers of enriched variants in the Korean (KOR) population are denoted in the white space of the Venn diagram. The numbers next to the continental population abbreviations represent the total number of enriched variants in that 1000GP continental group. The numbers within each ellipse denote the number of variants enriched both in KOR and a specific continent (left) and the number of variants enriched exclusively in the represented continent (right), with their relative percentages listed in parentheses below. (C) Rare variant ratios (RVRs) observed in each genomic region. RVRs were calculated by dividing the number of rare variants by the number of frequent variants in KoVariome.
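The classification and enrichment rules in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the input layout (a dict of per-continent MAFs given as fractions, and allele counts for a 2x2 table), and the choice of a one-sided Fisher's exact test are all assumptions made for illustration. The caption does not say whether the test was one- or two-sided, nor whether the KoVariome count refers to alleles or individuals.

```python
from math import comb

# Continental groups named in the caption.
CONTINENTS = ("AFR", "EUR", "AMR", "SAS", "EAS")


def classify_1000gp(maf_by_continent):
    """Assign a 1000GP frequency class from per-continent MAFs (fractions)."""
    mafs = [maf_by_continent[c] for c in CONTINENTS]
    if all(m >= 0.05 for m in mafs):
        return "1000GP Common"
    if any(m >= 0.001 for m in mafs):
        return "1000GP Low frequency"
    return "1000GP Rare"


def classify_kovariome(count):
    """Assign the KoVariome class; `count` is the caption's 'number of variants'."""
    return "Frequent in KoVariome" if count >= 3 else "Rare in KoVariome"


def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact p-value P(X >= a) for the table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(total - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(total, col1)


def is_enriched(alt_kor, ref_kor, alt_other, ref_other, min_or=3.0, alpha=0.05):
    """Apply the caption's thresholds: odds ratio > 3 and p-value < 0.05."""
    numerator = alt_kor * ref_other
    cross = ref_kor * alt_other
    if cross == 0:
        # Odds ratio is infinite when only the comparison cell is empty,
        # and undefined (treated as not enriched) when both products are zero.
        odds = float("inf") if numerator > 0 else float("nan")
    else:
        odds = numerator / cross
    return odds > min_or and fisher_greater(alt_kor, ref_kor, alt_other, ref_other) < alpha
```

For example, `is_enriched(20, 80, 50, 4950)` compares 20 alternative alleles out of 100 Korean alleles against 50 out of 5,000 in a hypothetical comparison group; these counts are made up and do not come from the paper.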
ClinVar annotation of the KoVariome frequent SNVs.
| Chr. | Position | Ref | Alt | rs No.ᵃ | Gene | Codon change | Disease | Inheritance typeᵇ | No.ᶜ | MAFᵈ |
|---|---|---|---|---|---|---|---|---|---|---|
| 17 | 33,445,518 | A | C | rs200564819* | | Splice-site | Familial breast-ovarian cancer 4 | n.a. | 5 | 0.05 |
| 1 | 161,599,571 | T | C | rs2290834 | | I106V | Neutrophil-specific antigens na1/na2 | UNKNOWN | 3 | 0.15 |
| 8 | 100,844,596 | G | T | rs386834119 | | Splice-site | Cohen syndrome | AR | 13 | 0.26 |
| 2 | 158,630,626 | C | G | rs121912678 | | R206P | Fibrodysplasia ossificans progressiva | AD | 14 | 0.14 |
| 1 | 209,961,970 | C | G | rs200166664 | | R400P | Van der Woude syndrome | AD | 14 | 0.14 |
| 11 | 18,290,859 | C | T | rs1136743 | | A70V | Systemic amyloidosis | AR | 22 | 0.66 |
AR: autosomal recessive; AD: autosomal dominant; Chr.: chromosome; Ref: reference allele; Alt: alternative allele.
ᵃKoVariome frequent SNVs with Reference SNP cluster IDs (rs numbers) in ClinVar. Only pathogenic or likely pathogenic (*) SNVs were included.
ᵇInheritance types were searched against the OMIM database using the rs numbers and phenotypes reported in the ClinVar database. ‘n.a.’ indicates that there are no data in the OMIM database. ‘UNKNOWN’ indicates that the inheritance type for the corresponding phenotype was not reported in the OMIM database.
ᶜNo. of alternative alleles in the Korean population. ᵈMinor allele frequency (MAF) in KoVariome.
Figure 3. Individual variants describing functional effects. (A) Classification of individual variants based on frequency in 1000GP and KoVariome. Gray represents the portion of individual variants classified as ‘1000GP Common’ and ‘1000GP Low frequency’. Blue represents the portion of individual variants classified as ‘Frequent in KoVariome’. Red represents variants that are rare in both 1000GP and KoVariome (‘Rare in Both’). (B) Individual variants in the ‘Rare in Both’ class were classified by gene coordinates. To more clearly represent the portion of functionally important rare variants, the 98% of rare variants located in non-coding regions are not shown. (C) Number of pathogenic variants for each individual. Red and blue bars represent the number of pathogenic variants previously reported in dbSNP and novel, respectively.
Statistics of individual SNVs.
| Statistics of individual variants | No. of SNVs | (%) |
|---|---|---|
| 1000GP common and 1000GP low frequency SNVs | 3.4M | (88.70) |
| Frequent SNVs in KoVariome | 0.4M | (9.39) |
| 1000GP rare and KoVariome rare SNVs | 47,957 | (1.26) |
| | | |
| Protein-Coding | 326 | (40.72) |
| Synonymous SNVs | 107 | (13.37) |
| Non-synonymous SNVs | 219 | (27.36) |
| Splice-site SNVs | 7 | (0.87) |
| RNA-Coding | 80 | (9.93) |
| | | |
| Median No. of pathogenic rare SNVsᵃ | 137 | (65.06) |
ᵃPathogenicity of the rare SNVs was predicted by at least one program among SIFT, PolyPhen-2, PROVEAN, MetaSVM, and MetaLR.
Known pathogenic rare variants associated with disease.
| Individual ID | rs No. | Genotype | Codon change | Inheritance typeᵃ | Gene | ClinVar traits |
|---|---|---|---|---|---|---|
| KPGP-00001 | rs563607795 | A/G | L385P | n.a. | | Thiamine metabolism dysfunction syndrome |
| KPGP-00001 | rs199769221* | G/C | R116P | AD | | Hereditary pancreatitis |
| KPGP-00032 | rs387907164 | T/C | C32R | AR | | UV-sensitive syndrome 3 |
| KPGP-00033 | rs119490107 | C/A | D234Y | UNKNOWN | | Carcinoma of colon |
| KPGP-00039 | rs199476197 | A/C | H331P | AR | | Bietti crystalline corneoretinal dystrophy |
| KPGP-00088 | rs28940280 | G/A | D279N | AR | | Ceroid lipofuscinosis neuronal 5 |
| KPGP-00122 | rs587782989 | C/T | R464H | AD | | Spinocerebellar ataxia 40 |
| KPGP-00124 | rs142808899 | C/T | G303R | AR | | Smith-Lemli-Opitz syndrome |
| KPGP-00127 | rs111033744 | A/G | Y100C | AR | | Galactosemia |
| KPGP-00127 | rs137852972 | T/C | N88S | AD | | Silver spastic paraplegia syndrome |
| KPGP-00129 | rs137853022 | C/T | R696Q | AR | | Familial dysautonomia |
| KPGP-00129 | rs386833823* | G/A | S238F | AR | | Lysinuric protein intolerance |
| KPGP-00131 | rs200088377 | G/A | P191L | n.a. | | Delayed puberty |
| KPGP-00136 | rs121908099 | G/A | R405Q | AR | | Cholestanol storage disease |
| KPGP-00136 | rs750218942 | C/G | Splice-site | AR | | Xeroderma pigmentosum |
| KPGP-00136 | rs727502791 | G/A | R158* | AD | | Aortic aneurysm (familial thoracic 9) |
| KPGP-00136 | rs545215807 | G/A | G109S | AR | | VLCAD deficiency |
| KPGP-00139 | rs387907033 | G/C | G401A | AR | | Spinocerebellar ataxia |
| KPGP-00139 | rs748486078 | G/A | S95L | UNKNOWN | | Candidiasis |
| KPGP-00144 | rs119480073 | C/T | R801 | AR | | Myoglobinuria |
| KPGP-00144 | rs104895438 | G/A | A612T | AD | | Sarcoidosis |
| KPGP-00205 | rs121913050 | G/A | R153H | UNKNOWN | | XFE progeroid syndrome |
| KPGP-00220 | rs121918673 | G/C | S439R | AD | | Diabetes mellitus type 2 |
| KPGP-00266 | rs104894085 | G/A | Q258* | AR | | Cholesterol monooxygenase deficiency |
| KPGP-00227 | rs121909569 | A/G | S148P | AD, AR | | Antithrombin III deficiency |
| KPGP-00228 | rs121434426 | G/A | Q356* | UNKNOWN | | Fanconi anemia |
| KPGP-00232 | rs121909385 | T/C | L623P | AR | | Familial hypokalemia hypomagnesemia |
| KPGP-00233 | rs672601312 | G/T | E127* | AR | | Immunodeficiency 38 with basal ganglia calcification |
| KPGP-00233 | rs749462358 | C/T | E924K | n.a. | | Not provided |
| KPGP-00245 | rs137854500 | C/T | D1289N | AR | | Tangier disease |
| KPGP-00254 | rs201968272 | G/A | R237Q | AR | | Warsaw breakage syndrome |
| KPGP-00325 | rs121912749 | C/T | G130R | AD | | Spherocytosis type 4 |
Abbreviations: Chr.: chromosome; Ref.: reference allele; Alt.: alternative allele; AD: autosomal dominant; AR: autosomal recessive.
*The clinical significance of the SNV locus was defined as likely pathogenic in the ClinVar database.
ᵃInheritance types were searched against the OMIM database using the rs numbers and phenotypes in the ClinVar database. ‘n.a.’ indicates that there are no data in the OMIM database. ‘UNKNOWN’ indicates that the inheritance type for the corresponding phenotype was not reported in the OMIM database.
Figure 4. Properties of structural variants discovered in KoVariome. (A) The boxplot represents the number of variants per Korean individual by variant type (n = 50). The lower and upper hinges of the boxes correspond to the 25th and 75th percentiles, and the whiskers extend from the hinges to 1.5x the inter-quartile range (IQR). Abbreviations of the variants: inversions (INV), intra-chromosomal translocations (ITX), insertions (INS), and deletions (DEL). (B) Length of the variants present in the individual genomes. See variant types and boxplot definition in A. (C) Frequency of variants in KoVariome. (D) The upper graph represents the number of SVs identified at specific length ranges. The KoVariome-specific variants were defined by comparing SVs against the Database of Genomic Variants (DGV) with 70% reciprocal overlap. The lower graph represents the portion of repeats distributed in the variants. Repeat classes were defined by the repeat annotations provided by UCSC Genome Bioinformatics. Simple repeats contained both microsatellites and low-complexity sequences (e.g., AT-rich). Abbreviations of repeats: short interspersed element (SINE), long interspersed element (LINE), and long terminal repeat (LTR).
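The 70% reciprocal overlap criterion used to compare SVs against DGV can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names, the tuple layout `(chrom, start, end)`, and the use of closed 1-based coordinates are assumptions, and the sketch ignores SV type, which a real comparison would probably also match.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Return the smaller of the two overlap fractions for closed intervals."""
    overlap = min(end_a, end_b) - max(start_a, start_b) + 1
    if overlap <= 0:
        return 0.0
    len_a = end_a - start_a + 1
    len_b = end_b - start_b + 1
    # Reciprocal overlap requires the shared region to cover the stated
    # fraction of BOTH intervals, so the smaller fraction decides.
    return min(overlap / len_a, overlap / len_b)


def is_known(sv, db_svs, threshold=0.7):
    """True if `sv` reciprocally overlaps any database SV on the same chromosome."""
    chrom, start, end = sv
    return any(
        db_chrom == chrom
        and reciprocal_overlap(start, end, db_start, db_end) >= threshold
        for db_chrom, db_start, db_end in db_svs
    )
```

Under this sketch, an SV with no database entry passing the threshold would be treated as KoVariome-specific. Taking the minimum of the two fractions prevents a short SV nested inside a much longer database entry from counting as a match.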
Figure 5. Properties of copy number variations in KoVariome. (A) The number of CNVs in the Korean population and the portion of the repeats in a specific length range. The conserved CNVs were defined by searching the Database of Genomic Variants (DGV) with 70% reciprocal overlap. See the abbreviations of repeats in Fig. 4B. Korean-enriched CNVs were identified by searching the CNVs reported in the 1000GP. No. represents the number of CNVs predicted in KoVariome. The heatmap represents the odds ratio of the CNVs compared to the CNV ratio in a specific 1000GP continental group. Associated genes were identified by searching the OMIM database. Abbreviations of continental groups: European (EUR), African (AFR), Native American (AMR), South Asian (SAS), and East Asian (EAS).
Copy number variations conserved in 50 Korean individuals.
| Chr. | Start | End | CNV type | Average copy number | Genesᵃ |
|---|---|---|---|---|---|
| chr2 | 132,964,050 | 133,121,849 | Dup. | 4.02 | |
| chr10 | 46,222,900 | 46,946,499 | Del. | 1.0 | |
| chr10 | 46,946,200 | 47,150,299 | Dup. | 4.22 | |
| chr10 | 47,147,400 | 47,384,499 | Del. | 1.0 | |
| chr15 | 21,885,000 | 21,944,149 | Dup. | 6.4 | |
ᵃGenes in the identified CNV region. Chr.: chromosome; Dup.: duplication; Del.: deletion.