Seyoung Mun, Songmi Kim, Wooseok Lee, Keunsoo Kang, Thomas J Meyer, Bok-Ghee Han, Kyudong Han, Heui-Soo Kim.
Abstract
Advances in next-generation sequencing (NGS) technology have made personal genome sequencing possible, and indeed, many individual human genomes have now been sequenced. Comparisons of these individual genomes have revealed substantial genomic differences between human populations as well as between individuals from closely related ethnic groups. Transposable elements (TEs) are known to be one of the major sources of these variations and act through various mechanisms, including de novo insertion, insertion-mediated deletion, and TE-TE recombination-mediated deletion. In this study, we carried out de novo whole-genome sequencing of one Korean individual (KPGP9) via multiple insert-size libraries. The de novo whole-genome assembly resulted in 31,305 scaffolds with a scaffold N50 size of 13.23 Mb. Furthermore, through computational data analysis and experimental verification, we revealed that 182 TE-associated structural variation (TASV) insertions and 89 TASV deletions contributed 64,232 bp in sequence gain and 82,772 bp in sequence loss, respectively, in the KPGP9 genome relative to the hg19 reference genome. We also verified structural differences associated with TASVs by comparative analysis with TASVs in recent genomes (AK1 and TCGA genomes) and reported their details. Here, we constructed a new Korean de novo whole-genome assembly and provide the first study, to our knowledge, focused on the identification of TASVs in an individual Korean genome. Our findings again highlight the role of TEs as a major driver of structural variations in individual human genomes.
Year: 2021 PMID: 33833373 PMCID: PMC8102501 DOI: 10.1038/s12276-021-00586-y
Source DB: PubMed Journal: Exp Mol Med ISSN: 1226-3613 Impact factor: 8.718
Summary of KPGP9 de novo assembly statistics.
| Step | Size | Sequences | Total Size (bp) | N50 | N90 | Longest | Scaffold genome coverage (%) |
|---|---|---|---|---|---|---|---|
| Contig | – | 1,439,891 | 2,860,663,260 | 49,974 | 8719 | 604,770 | |
| Scaffolds (filtering steps) | 2 K | 1,344,398 | 2,895,066,623 | 300,579 | 54,077 | 2,055,172 | |
| | 5 K | 1,327,495 | 2,921,071,090 | 1,027,730 | 172,103 | 6,646,083 | |
| | 10 K | 1,321,040 | 2,956,383,059 | 8,116,997 | 977,239 | 45,450,559 | |
| | 20 K | 1,320,902 | 2,961,961,259 | 11,947,405 | 1,320,382 | 77,015,144 | |
| | 40 K | 1,182,591 | 2,963,912,249 | 13,083,805 | 1,336,383 | 75,957,870 | |
| Gap filled | >500 | 31,505 | 2,862,402,237 | 13,235,598 | 1,787,215 | 75,412,104 | 92.30 |
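
The N50 and N90 columns above follow the usual assembly convention: Nx is the length of the shortest sequence in the smallest set of longest sequences that together cover at least x% of the total assembled bases. A minimal Python sketch of that definition follows; the function name `nx` and the toy lengths are illustrative assumptions, not the pipeline or data used in the study.

```python
def nx(lengths, percent):
    """Return the Nx statistic (e.g. percent=50 for N50) for a list of sequence lengths.

    Sequences are taken from longest to shortest until their cumulative length
    reaches `percent` % of the total; the length of the last one taken is Nx.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length


# Hypothetical scaffold lengths (bp), chosen only to illustrate the definition.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50 = nx(toy_lengths, 50)
n90 = nx(toy_lengths, 90)
```

Because longer scaffolds are counted first, joining contigs into scaffolds raises N50 even when the total assembled size barely changes, which is the pattern across the filtering steps in the table.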
Summary of TASV insertions in the KPGP9 individual genome.
| Class | Subfamily | Total candidates | Filtered out | dbRIP | Validation by PCR | False positive | Not working | Total confirmed | Contributions to KPGP9 (loss) |
|---|---|---|---|---|---|---|---|---|---|
| Alu | | 2823 | 2669 | 50 | 104 | 11 | 2 | 84 (7) | 26,333 bp (−35 bp) |
| | | 541 | 528 | 3 | 10 | 2 | – | 8 | 2580 bp |
| | | 1991 | 1923 | 19 | 49 | 5 | – | 40 (4) | 13,568 bp (−54 bp) |
| | | 351 | 332 | 5 | 14 | – | – | 12 (2) | 3946 bp (−3270 bp) |
| | | 556 | 528 | 5 | 23 | 17 | 2 | 4 | 987 bp |
| | | 192 | 183 | – | 9 | 6 | – | 2 (1) | 798 bp (−9 bp) |
| | | 895 | 880 | 5 | 10 | 3 | 3 | 3 (1) | 1150 bp (−1 bp) |
| | | 354 | 339 | 2 | 13 | 11 | 1 | 1 | 135 bp |
| | | 168 | 167 | 0 | 1 | 1 | – | 0 | – |
| | | 127 | 125 | 1 | 1 | – | – | 1 | 111 bp |
| LINE-1 | L1HS | 3124 | 3117 | 3 | 4 | 1 | – | 4 | 6399 bp |
| LTR | HERV-K | 224 | 222 | – | 2 | 1 | – | 1 | 684 bp |
| | HERVK14C | 87 | 85 | – | 2 | 1 | – | 1 | 2725 bp |
| | LTR5_HS | 653 | 649 | – | 4 | – | – | 4 | 2541 bp |
| SVA | SVA_A | 1050 | 1048 | – | 2 | 2 | – | 0 | – |
| | SVA_E | 615 | 613 | – | 2 | 2 | – | 0 | – |
| | SVA_F | 797 | 788 | 4 | 5 | 3 | – | 2 | 2275 bp |
a Non-classical Alu insertion.
Summary of TASV deletions in the KPGP9 individual genome.
| Class | Total candidates | Filtered out | Subclass | Validation by PCR | False positive | Not working | Total confirmed | Contributions to KPGP9 |
|---|---|---|---|---|---|---|---|---|
| ARMD | 3321 | 3081 | NAHR | 131 | 73 | 7 | 51 | 42,909 bp |
| | | | NHEJ | 109 | 32 | 11 | 26 | 7313 bp |
| L1RMD | 1355 | 1315 | NAHR | 7 | 5 | 0 | 2 | 3177 bp |
| | | | NHEJ | 33 | 18 | 5 | 10 | 29,373 bp |
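
As an arithmetic cross-check, the per-row values in the two TASV tables can be summed and compared with the totals quoted in the abstract (182 insertions contributing 64,232 bp; 89 deletions contributing 82,772 bp). The sketch below transcribes the "Total confirmed" and "Contributions to KPGP9" columns by hand. It assumes that the ten rows without a class label are the Alu subfamilies and that the parenthesised counts are non-classical Alu insertions counted in addition to the main figure; both readings are inferred from the table footnote rather than stated outright.

```python
# Confirmed insertion counts per row, transcribed from the TASV insertion table.
alu_confirmed = [84, 8, 40, 12, 4, 2, 3, 1, 0, 1]   # rows assumed to be Alu subfamilies
alu_non_classical = [7, 4, 2, 1, 1]                  # parenthesised counts (assumed additive)
other_confirmed = [4, 1, 1, 4, 0, 0, 2]              # L1HS, three LTR rows, three SVA rows

# Sequence gain (bp) per row; rows marked "–" are entered as 0.
insertion_gain_bp = [
    26333, 2580, 13568, 3946, 987, 798, 1150, 135, 0, 111,  # assumed Alu rows
    6399,                                                    # L1HS
    684, 2725, 2541,                                         # LTR rows
    0, 0, 2275,                                              # SVA rows
]

# Deletion table rows: NAHR-ARMD, NHEJ-ARMD, NAHR-L1RMD, NHEJ-L1RMD.
deletion_confirmed = [51, 26, 2, 10]
deletion_loss_bp = [42909, 7313, 3177, 29373]

total_insertions = sum(alu_confirmed) + sum(alu_non_classical) + sum(other_confirmed)
total_gain_bp = sum(insertion_gain_bp)
total_deletions = sum(deletion_confirmed)
total_loss_bp = sum(deletion_loss_bp)

print(total_insertions, total_gain_bp)  # compare with 182 insertions, 64,232 bp
print(total_deletions, total_loss_bp)   # compare with 89 deletions, 82,772 bp
```

Under these assumptions the four sums reproduce the abstract's figures, which supports the reading that the parenthesised insertion counts are included in the 182 confirmed insertions.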
Clinical implications of genes associated with novel TASVs in the KPGP9 genome.
| Gene symbol | Genes | Types of TASV (density of the given TE) | Disease (# OMIM. Phenotype) | Inheritance | UMLS ID | Source | No. of SNPs | No. of publications |
|---|---|---|---|---|---|---|---|---|
| PRPH2 | Peripherin 2 | | Retinitis Pigmentosa 7 (#608133) | Autosomal dominant inheritance | umls:C1842475 | CLINVAR, CTD_human, MGD, UNIPROT | 9 | 6 |
| | | | Patterned dystrophy of retinal pigment epithelium (#169150, #608161) | Autosomal dominant inheritance | umls:C1868569 | CLINVAR, CTD_human, UNIPROT | 7 | 4 |
| | | | CHOROIDAL DYSTROPHY, CENTRAL AREOLAR 2 (#613105) | Autosomal dominant inheritance | umls:C2751290 | BeFree, CLINVAR, CTD_human, UNIPROT | 4 | 2 |
| DMD | Dystrophin | (6.24%) | Muscular Dystrophy, Duchenne (#310200) | X-linked recessive inheritance | umls:C0013264 | BeFree, CLINVAR, CTD_human, GAD, LHGDN, MGD, ORPHANET, UNIPROT | 175 | 535 |
| | | | Becker Muscular Dystrophy (#300376) | X-linked recessive inheritance | umls:C0917713 | BeFree, CLINVAR, GAD, MGD, ORPHANET, UNIPROT | 136 | 205 |
| | | | DMD-associated dilated cardiomyopathy | X-linked recessive inheritance | umls:C3668940 | BeFree, CLINVAR, CTD_human, GAD, UNIPROT | 213 | 35 |
| EPB41 | Erythrocyte membrane protein band 4.1 | Two (31.54%) | Hereditary Elliptocytosis 1 | Autosomal dominant inheritance | umls:C2678497 | CLINVAR, CTD_human, MGD | 1 | 29 |
| MTPAP | Mitochondrial poly(A) polymerase | (31.51%) | SPASTIC ATAXIA 4, AUTOSOMAL RECESSIVE (#613672) | Autosomal recessive inheritance | umls:C3150925 | CLINVAR, CTD_human, ORPHANET, UNIPROT | 1 | 1 |
| XDH | Xanthine dehydrogenase | (4.74%) | Xanthinuria, Type I | Autosomal recessive inheritance | umls:C0268118 | BeFree, CLINVAR, CTD_human, ORPHANET, UNIPROT | 2 | 8 |
| PDE8B | Phosphodiesterase 8B | (7.07%) | Striatal Degeneration, Autosomal Dominant (#609161) | Autosomal dominant inheritance | umls:C1836694 | BeFree, CLINVAR, CTD_human, ORPHANET | 0 | 1 |
| CASK | Calcium/calmodulin-dependent serine protein kinase | (13.84%) | Mental Retardation and Microcephaly With Pontine And Cerebellar Hypoplasia (#300749) | X-linked dominant inheritance | umls:C2677903 | CLINVAR, CTD_human, ORPHANET, UNIPROT | 20 | 1 |
| FTO | Fat mass and obesity associated | (10.15%) | Diabetes Mellitus, Non-Insulin-Dependent | Autosomal dominant inheritance | umls:C0011860 | BeFree, CTD_human, GAD, GWASCAT | 61 | 116 |
| | | | Obesity/BODY MASS INDEX QUANTITATIVE TRAIT (#612460) | Polygenic inheritance | umls:C0028754 | BeFree, CTD_human, GAD, GWASCAT | 36 | 345 |
| | | | Growth Retardation, Developmental Delay, Coarse Facies, And Early Death (#612938) | Autosomal recessive multiple congenital | umls:C2752001 | CLINVAR, CTD_human, ORPHANET, UNIPROT | 2 | 1 |
| KCNJ6 | Potassium channel, inwardly rectifying subfamily J, member 6 | (4.7%) | KEPPEN-LUBINSKY SYNDROME (#614098) | Undefined | umls:C3279800 | BeFree, CLINVAR, ORPHANET, UNIPROT | 2 | 1 |
| RAF1 | Raf-1 proto-oncogene, serine/threonine kinase | SVA_F insertion (0%) | Noonan Syndrome/CARDIOMYOPATHY, DILATED (#615916) | Autosomal dominant inheritance | umls:C0028326 | BeFree, CLINVAR, CTD_human, GAD, LHGDN, ORPHANET | 5 | 25 |
| FA2H | Fatty acid 2-hydroxylase | NAHR-ARMDs (22.04%) | Leukodystrophy, Dysmyelinating, And Spastic Paraparesis With Or Without Dystonia (#612319) | Autosomal recessive inheritance | umls:C3496228 | CTD_human, MGD, ORPHANET, UNIPROT | 0 | 3 |
| EHMT1 | Euchromatic histone-lysine N-methyltransferase 1 | NAHR-ARMDs (20.66%) | Kleefstra Syndrome (#610253) | Autosomal dominant inheritance | umls:C0795833 | BeFree, CLINVAR, CTD_human, MGD, UNIPROT | 16 | 8 |
| INSR | Insulin receptor | NAHR-ARMDs (40.47%) | Diabetes Mellitus, Non-Insulin-Dependent | Autosomal dominant inheritance | umls:C0011860 | BeFree, CLINVAR, GAD, RGD, UNIPROT | 11 | 114 |
| | | | Insulin Resistance (#262190) | Autosomal dominant inheritance | umls:C0021655 | CLINVAR, CTD_human, GAD, RGD | 3 | 8 |
| | | | Donohue Syndrome (#246200) | Autosomal dominant inheritance | umls:C0265344 | BeFree, CLINVAR, CTD_human, GAD, ORPHANET, UNIPROT | 17 | 46 |
| | | | Rabson-Mendenhall Syndrome | Autosomal recessive inheritance | umls:C0271695 | BeFree, CLINVAR, ORPHANET, UNIPROT | 4 | 19 |
| | | | Hyperinsulinemic Hypoglycemia, Familial, 5 (#609968) | Autosomal dominant inheritance | umls:C1864952 | CLINVAR, CTD_human, ORPHANET, UNIPROT | 3 | 1 |
| SERPINF2 | Serpin peptidase inhibitor, clade F (alpha-2 antiplasmin, pigment epithelium derived factor), member 2 | NHEJ-ARMDs (38.39%) | ALPHA-2-PLASMIN INHIBITOR DEFICIENCY | Autosomal recessive inheritance | umls:C2752081 | CLINVAR, MGD, ORPHANET, UNIPROT | 2 | 1 |
| PCCB | Propionyl CoA carboxylase, beta polypeptide | NHEJ-L1RMDs (40.53%) | Propionic acidemia (#606054) | Autosomal recessive inheritance | umls:C0268579 | BeFree, CLINVAR, CTD_human, ORPHANET, UNIPROT | 24 | 23 |
a The genes annotated in OMIM phenotype data.
Fig. 1 Novel AluYa5 insertion events in the 3’UTRs of the SIRPB1 and XPR1 genes.
Polymorphic insertion testing was performed for two AluYa5 insertions located in the 3’UTRs of the (a) SIRPB1 and (b) XPR1 genes. A screenshot of the UCSC human genome browser (hg19) shows the location of each AluYa5 insertion in the 3’UTR of each gene; adjacent repeats are displayed in gray boxes. The results of PCR amplification from a panel of 80 geographically diverse individuals and 38 Korean samples are shown (Supplementary Table S4). The upper and lower bands denote the presence and absence of an AluYa5 insertion, respectively. (c) The blue box shows the identical miRNA (hsa-miR-619) binding sequences in the AluYa5 elements inserted in the 3’UTRs of the SIRPB1 and XPR1 genes.
Fig. 2 Size distribution of TASV deletion events and Alu subfamily composition involved in ARMD events.
(a) Size distribution of genomic deletions by insertion mechanism. The number of TASV deletions in 500 bp bins is shown on the y-axis. (b) The composition of Alu subfamilies involved in NAHR-ARMD (navy) and NHEJ-ARMD (red) events. (c) The x-axis indicates Alu subfamily contributions to the ARMDs observed in this study. The total number of events by mechanism is shown on the y-axis.
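
The 500 bp binning described for the size-distribution panel can be illustrated with a short Python sketch. The function name `bin_deletion_sizes` and the example sizes are hypothetical; the individual deletion sizes behind the figure are not listed here.

```python
from collections import Counter


def bin_deletion_sizes(sizes_bp, bin_width=500):
    """Count deletions per size bin; keys are the lower bound of each bin in bp."""
    counts = Counter((size // bin_width) * bin_width for size in sizes_bp)
    return dict(sorted(counts.items()))


# Hypothetical deletion sizes (bp), for illustration only.
example_sizes = [120, 499, 500, 1234, 2999]
histogram = bin_deletion_sizes(example_sizes)
# histogram maps 0 -> sizes in [0, 500), 500 -> sizes in [500, 1000), and so on.
```

Running the same function separately on the deletions of each mechanism would give one series per mechanism, as in the plotted panel.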
Fig. 3 Comparisons of TASV insertions among three genome datasets.
(a) The numbers of TASV insertions were compared between KPGP9, another Korean genome (AK1), and TCGA genome data. Only 23 loci were shared among the three genome datasets, and 94 were unique to KPGP9. (b) The red and orange dots represent the margin of error in the TASV insertion points of the AK1 and TCGA data, respectively, relative to the TASV insertion points detected in the KPGP9 genome. The left side shows the margin of error for the 88 TASV insertions shared with KPGP9, and the right side shows the margin of error for the 23 TASV insertions shared with both genome datasets.
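
One way to picture the comparison in this figure is to pair each KPGP9 insertion with the nearest insertion call on the same chromosome in another dataset and record the signed offset between the two breakpoints. The Python sketch below does that under stated assumptions: the function name, the `(chromosome, position)` tuple layout, the 50 bp tolerance, and the coordinates are all hypothetical, and the matching procedure actually used in the study is not described here.

```python
from bisect import bisect_left
from collections import defaultdict


def match_insertions(reference_calls, other_calls, tolerance=50):
    """Pair each reference insertion with the nearest call in another dataset.

    Both inputs are lists of (chromosome, position) tuples. A pair is reported
    as (chromosome, reference_position, offset) when the nearest call on the
    same chromosome lies within `tolerance` bp; offset = other - reference.
    """
    positions_by_chrom = defaultdict(list)
    for chrom, pos in other_calls:
        positions_by_chrom[chrom].append(pos)
    for positions in positions_by_chrom.values():
        positions.sort()

    matches = []
    for chrom, pos in reference_calls:
        positions = positions_by_chrom.get(chrom, [])
        i = bisect_left(positions, pos)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = positions[max(i - 1, 0): i + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda p: abs(p - pos))
        if abs(nearest - pos) <= tolerance:
            matches.append((chrom, pos, nearest - pos))
    return matches


# Hypothetical coordinates, for illustration only.
kpgp9_calls = [("chr1", 1000), ("chr2", 5000), ("chrX", 300)]
ak1_calls = [("chr1", 1012), ("chr2", 9000), ("chrX", 295)]
shared = match_insertions(kpgp9_calls, ak1_calls)
```

The offsets returned for the matched loci correspond to the kind of per-locus margin of error plotted in panel b, while reference calls with no partner inside the tolerance would count as unique to the reference genome.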
Fig. 4 Polymorphic Alu insertions in Asian populations.
The PCR results for two Alu insertions, (a) AluYa5_78 and (b) AluYb8_17, are shown. A screenshot of the UCSC human genome browser (hg19) shows the polymorphic AluYb8 insertion at intron 23 of the CASK gene; adjacent repeats are displayed in gray boxes. PCR amplification was conducted from 80 DNA samples representing four different populations (South American, European, African American, and Asian) and 38 Korean DNA samples (Supplementary Table S5). The upper bands (marked with asterisks) indicate the presence of an Alu insertion, and the lower bands indicate its absence at the corresponding genomic locus.