Soo Heon Kwak, Jeesoo Chae, Seongmin Choi, Min Jung Kim, Murim Choi, Jong-Hee Chae, Eun-Hae Cho, Tai Ju Hwang, Se Song Jang, Jong-Il Kim, Kyong Soo Park, Yung-Jue Bang.
Abstract
Ethnically specific data on genetic variation are crucial for understanding human biology and for clinical interpretation of variant pathogenicity. We analyzed data obtained by deep sequencing 1303 Korean whole exomes; the data were generated by three independent whole exome sequencing projects (named the KOEX study). The primary focus of this study was to comprehensively analyze the variant statistics, investigate secondary findings that may have clinical actionability, and identify loci that should be cautiously interpreted for pathogenicity. A total of 495 729 unique variants were identified at exonic regions, including 169 380 nonsynonymous variants and 4356 frameshift insertions/deletions. Among these, 76 607 were novel coding variants. On average, each individual had 7136 nonsynonymous single-nucleotide variants and 74 frameshift insertions/deletions. We classified 13 pathogenic and 13 likely pathogenic variants in 56 genes that may have clinical actionability according to the guidelines of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. The carrier frequency of these 26 variants was 2.46% (95% confidence interval 1.73-3.46). To identify loci that require cautious interpretation in clinical sequencing, we identified 18 genes that are prone to sequencing errors, and 671 genes that are highly polymorphic and carry excess nonsynonymous variants. The catalog of identified variants, together with its annotation and frequency information, is publicly available (http://koex.snu.ac.kr). These findings should be useful resources for investigating ethnically specific characteristics in human health and disease.
Year: 2017 PMID: 28706299 PMCID: PMC5565953 DOI: 10.1038/emm.2017.142
Source DB: PubMed Journal: Exp Mol Med ISSN: 1226-3613 Impact factor: 8.718
Brief description of studies and WES procedures
| Study | N | Sex (M/F) | Sequencing platform | Capture kit (target size) | Median on-target coverage (×) |
|---|---|---|---|---|---|
| Type 2 diabetes mellitus whole-exome sequencing study | 910 | 415/495 | HiSeq 2000 (Illumina) | SureSelect v4+UTR (71 Mb) | 103.7 (66.9, 175.0) |
| Phenotypically normal parents of rare disease patients | 191 | 98/93 | HiSeq 2500 (Illumina) | NimbleGen SeqCapV2 (44 Mb) | 65.2 (38.4, 118.1) |
| Hemophilia case study | 202 | 202/0 | HiSeq 2000 (Illumina) | SureSelect v5+UTR (75 Mb) | 58.9 (32.3, 120.0) |
Abbreviations: F, female; M, male; Mb, mega base pair; N, sample size; UTR, untranslated regions.
The present study is based on whole-exome sequence data from 1303 Koreans who participated in three cohort studies. Median on-target coverage ranged from 58.9× to 103.7× across the cohorts, indicating high-quality sequencing data.
Overall and per-sample variant statistics
| Variant class | Total | | | | | | Per sample |
|---|---|---|---|---|---|---|---|
| SNV | | | | | | | |
| CDS | 284 991 | 145 136 | 34 162 | 229 168 | 25 722 | 30 101 | 16 070 |
| Nonsynonymous | 169 380 | 91 616 | 20 917 | 142 224 | 13 680 | 13 476 | 7136 |
| Synonymous | 107 148 | 48 484 | 12 310 | 79 703 | 11 462 | 15 983 | 8591 |
| Splice site | 1665 | 1091 | 177 | 1498 | 96 | 71 | 38 |
| Stop gain/loss | 3642 | 2454 | 405 | 3363 | 177 | 102 | 48 |
| Not in dbSNP 147 | 73 241 | 63 651 | 6807 | 73 137 | 104 | 0 | 68 |
| UTR | 181 064 | 83 062 | 20 659 | 134 072 | 19 814 | 27 178 | 13 130 |
| INDEL | | | | | | | |
| CDS | 8057 | 4854 | 939 | 6979 | 634 | 444 | 224 |
| Frameshift | 4356 | 2888 | 483 | 3920 | 276 | 160 | 74 |
| In-frame | 3221 | 1701 | 402 | 2663 | 308 | 250 | 120 |
| Splice site | 235 | 131 | 22 | 190 | 26 | 19 | 15 |
| Stop gain/loss | 121 | 75 | 15 | 107 | 10 | 4 | 2 |
| Not in dbSNP 147 | 3366 | 2959 | 260 | 3337 | 25 | 4 | 3 |
| UTR | 21 617 | 9449 | 2365 | 15 606 | 2652 | 3359 | 1577 |
| Functional CDS variants | | | | | | | |
| HGMD-DM | 2897 (253) | 1279 (146) | 351 (34) | 2292 (222) | 431 (23) | 174 (8) | 84.1 (3.0) |
| ClinVar-P | 500 (36) | 226 (21) | 57 (7) | 389 (34) | 74 (2) | 37 (0) | 19.2 (0.1) |
Abbreviations: AC, allele count; CDS, coding sequence; ClinVar-P, Pathogenic variant in the ClinVar database; HGMD-DM, Human Gene Mutation Database disease-causing variants (either low or high confidence); MAF, minor allele frequency; UTR, untranslated region.
Variant statistics according to functional annotation and frequency bin are shown.
For functional CDS variants, the number of SNVs is shown with the number of INDELs given in parentheses. Calculations for UTR variants were performed with SNUH project 1 and the Green Cross project.
Figure 1. Population stratification and principal component analyses. The merged data of 1303 Korean participants and the 1000 Genomes Project were investigated (a) for population structure analysis using ADMIXTURE and (b) for principal component analysis. The Korean participants of our study clustered with East Asians and were separated from other populations. ACB, African Caribbeans in Barbados; AFR, African; AMR, American; ASW, Americans of African Ancestry in SW USA; BEB, Bengali from Bangladesh; CDX, Chinese Dai in Xishuangbanna, China; CEU, Utah Residents (CEPH) with Northern and Western European Ancestry, USA; CHB, Han Chinese in Beijing, China; CHS, Southern Han Chinese; CLM, Colombians from Medellin, Colombia; EAS, East Asian; ESN, Esan in Nigeria; EUR, European; FIN, Finnish in Finland; GBR, British in England and Scotland, UK; GIH, Gujarati Indian from Houston, Texas, USA; GWD, Gambian in Western Divisions in the Gambia; IBS, Iberian Population in Spain; ITU, Indian Telugu from the UK; JPT, Japanese in Tokyo, Japan; KHV, Kinh in Ho Chi Minh City, Vietnam; KOR, Korean; LWK, Luhya in Webuye, Kenya; MSL, Mende in Sierra Leone; MXL, Mexican Ancestry from Los Angeles, USA; PEL, Peruvians from Lima, Peru; PJL, Punjabi from Lahore, Pakistan; PUR, Puerto Ricans from Puerto Rico; SAS, South Asian; STU, Sri Lankan Tamil from the UK; TSI, Toscani in Italy; YRI, Yoruba in Ibadan, Nigeria.
Allele count and carrier frequency of P or LP variants in 56 clinically actionable genes
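| Classification | Type 2 diabetes study (N = 910): number | Type 2 diabetes study: carrier frequency (95% CI) | Parents of rare disease patients (N = 191): number | Parents of rare disease patients: carrier frequency (95% CI) | Hemophilia study (N = 202): number | Hemophilia study: carrier frequency (95% CI) | Total (N = 1303): number | Total: carrier frequency (95% CI) |
|---|---|---|---|---|---|---|---|---|
| P | 12 | 1.32% (0.73–2.32%) | 2 | 1.05% (0.04–3.98%) | 1 | 0.50% (0.01–3.03%) | 15 | 1.15% (0.68–1.91%) |
| LP | 13 | 1.43% (0.81–2.45%) | 2 | 1.05% (0.04–3.98%) | 2 | 0.99% (0.04–3.77%) | 17 | 1.30% (0.80–2.10%) |
| P+LP | 25 | 2.75% (1.85–4.04%) | 4 | 2.09% (0.63–5.45%) | 3 | 1.49% (0.30–4.48%) | 32 | 2.46% (1.73–3.46%) |
| VUS | 830 | 91.2% (89.2–92.9%) | 153 | 80.1% (73.8–85.2%) | 185 | 91.6% (86.9–94.8%) | 1168 | 89.6% (87.9–91.2%) |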
Abbreviations: LP, likely pathogenic; P, pathogenic; VUS, variant of uncertain significance.
Allele count and carrier frequency of P or LP variants are shown for each WES project. Data are shown as the number, frequency (95% confidence interval).
The carrier frequency of VUS was calculated using the proportion of subjects with at least one VUS. A modified Wald test was used to calculate 95% confidence intervals.
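The interval calculation can be illustrated with a minimal Python sketch. It assumes that "modified Wald" refers to the Agresti-Coull adjusted Wald interval, which the text does not state explicitly; the function name `adjusted_wald_ci` is hypothetical and is not taken from the study's own pipeline.

```python
from statistics import NormalDist


def adjusted_wald_ci(count, n, confidence=0.95):
    """Agresti-Coull (adjusted Wald) interval for a binomial proportion.

    Assumption: this is the "modified Wald" method named in the table note.
    """
    if n <= 0 or not 0 <= count <= n:
        raise ValueError("need n > 0 and 0 <= count <= n")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Add z^2 pseudo-observations, half of them counted as successes.
    n_adj = n + z * z
    p_adj = (count + z * z / 2) / n_adj
    half_width = z * (p_adj * (1 - p_adj) / n_adj) ** 0.5
    # Clip to the valid range for a proportion.
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# P+LP carriers across all three projects, taken from the table above.
low, high = adjusted_wald_ci(32, 1303)
print(f"{32 / 1303:.2%} ({low:.2%} to {high:.2%})")
```

Under this assumption, 32 carriers among 1303 participants give an interval of about 1.73% to 3.46%, which agrees with the reported total for P+LP variants. For very small counts the unclipped lower bound can fall below zero, so the lowest reported bounds may reflect a slightly different convention.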
Figure 2. Evidence attributes for the 26 P or LP variants in 56 ACMG genes. Individual evidence attributes of the 26 variants classified as P or LP are shown. Exonic functions of the variants are shown on the right. Classification was based on the ACMG guidelines (Calls) or determined using the ClinVar database. ACMG, American College of Medical Genetics and Genomics.
Figure 3. Characterization of highly misinterpretable loci. (a) Example of a sequencing-error-prone gene, MUC6, showing excess coverage (gray) and imbalanced allelic fraction (blue) of VQSR-filtered variants. (b) Distribution of variants with significant deviation from HWE (green), indicating sequencing-error-prone loci. (c–f) The frequency distribution of genes according to the burden of nonsynonymous variants and their cutoff values for highly polymorphic genes. The cutoff values for (c) excess number of entire nonsynonymous variants was 1.88, (d) excess rate of entire nonsynonymous variants was 0.78 × 10⁻³, (e) excess number of rare nonsynonymous variants was 0.064 and (f) excess rate of rare nonsynonymous variants was 0.027 × 10⁻³. The dashed line indicates the cutoff value for each category. The Venn diagrams show how the highly polymorphic genes in Koreans overlap with the 1000 Genomes Project (1KGP). HWE, Hardy-Weinberg equilibrium; VQSR, Variant Quality Score Recalibration.
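Panel (b) relies on deviation from HWE to flag loci prone to sequencing error. The caption does not state which test or significance threshold was used, so the sketch below assumes a one-degree-of-freedom chi-square goodness-of-fit test on genotype counts. The function name `hwe_chi_square` and the example counts are hypothetical.

```python
import math


def hwe_chi_square(n_ref_hom, n_het, n_alt_hom):
    """Chi-square test (1 df) for departure from Hardy-Weinberg proportions.

    Assumption: the study may have used a different (e.g. exact) test.
    """
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        raise ValueError("no genotypes observed")
    # Reference allele frequency estimated from the genotype counts.
    p = (2 * n_ref_hom + n_het) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref_hom, n_het, n_alt_hom)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical site where every sample is called heterozygous, a pattern
# that can arise when reads from paralogous sequence are mismapped.
chi2, p_value = hwe_chi_square(0, 100, 0)
print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}")
```

A site with a very small p-value would be treated as a candidate artifact rather than a true polymorphism; an exact test is often preferred when expected genotype counts are low.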