Brett Trost, Livia O Loureiro, Stephen W Scherer.
Abstract
Over the past 30 years (the timespan of a generation), advances in genomics technologies have revealed tremendous and unexpected variation in the human genome and have provided increasingly accurate answers to long-standing questions of how much genetic variation exists in human populations and to what degree the DNA complement changes between parents and offspring. Tracking the characteristics of these inherited and spontaneous (or de novo) variations has been the basis of the study of human genetic disease. From genome-wide microarray and next-generation sequencing scans, we now know that each human genome contains over 3 million single nucleotide variants when compared with the ~3 billion base pairs in the human reference genome, along with roughly an order of magnitude more DNA (approximately 30 megabase pairs, Mb) being 'structurally variable', mostly in the form of indels and copy number changes. Additional large-scale variations include balanced inversions (average of 18 Mb) and complex, difficult-to-resolve alterations. Collectively, ~1% of an individual's genome will differ from the human reference sequence. When comparing across a generation, fewer than 100 new genetic variants are typically detected in the euchromatic portion of a child's genome. Driven by sequencing technologies of ever higher resolution and throughput, newer and more accurate databases of genetic variation in worldwide populations (for instance, more comprehensive structural variation data and phasing of combinations of variants along chromosomes) will emerge to underpin the next era of discovery in human molecular genetics.
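The proportions quoted in the abstract follow from simple division. The sketch below is only a back-of-envelope check using the abstract's round figures (3 million SNVs, 30 Mb of structurally variable DNA, a 3 billion bp reference); the constant and function names are illustrative and do not come from the paper.

```python
# Back-of-envelope fractions using the round figures quoted in the abstract.
REFERENCE_BP = 3_000_000_000   # ~3 billion bp in the human reference genome
SNV_COUNT = 3_000_000          # "over 3 million" SNVs; each SNV affects 1 bp
STRUCTURAL_BP = 30_000_000     # ~30 Mb described as structurally variable


def fraction_affected(bases_affected: int, genome_size: int = REFERENCE_BP) -> float:
    """Return the share of the reference genome covered by the affected bases."""
    return bases_affected / genome_size


snv_fraction = fraction_affected(SNV_COUNT)
structural_fraction = fraction_affected(STRUCTURAL_BP)

print(f"SNVs:               {snv_fraction:.2%} of the reference")
print(f"Structural variants: {structural_fraction:.2%} of the reference")
print(f"Structural / SNV bases: {STRUCTURAL_BP / SNV_COUNT:.0f}x")
```

With these inputs, SNVs account for about 0.1% of the reference and structural variation for about 1%, which matches the "order of magnitude" comparison in the text.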
Year: 2021 PMID: 34296264 PMCID: PMC8490016 DOI: 10.1093/hmg/ddab209
Source DB: PubMed Journal: Hum Mol Genet ISSN: 0964-6906 Impact factor: 6.150
Figure 1. Types of variation found in the human genome and the primary technologies used to detect them (43). The types of variation, and various (sometimes synonymous) terms used to describe them, are grouped as ‘sequence variation’ and ‘structural variation’, the latter encompassing chromosomal/genome variation. The lower size limit of structural variation is typically defined to fall in the 50–1000 nt range, but definitions vary (9,172). FISH, fluorescence in situ hybridization (here also encompassing spectral karyotyping); PFGE, pulsed-field gel electrophoresis; NGS, next-generation sequencing (including both short-read and long-read technologies, the latter being particularly useful for identifying intermediate-size structural variation). Many other important technologies are used to discover and map genetic variation; we include those that have been most impactful for the original discoveries discussed in this review, including those that are still used by clinical diagnostic laboratories. Important references are provided in Tables 1 and 2 and the main text.
Table 1. Important WGS studies examining the extent of variation in a genome
| Study | Genome | Ancestry | Sex | DNA | SNVs | Indels: Ins | Indels: Del | CNVs/SVs: Ins/Dup | CNVs/SVs: Del | CNVs/SVs: Inv | Non-SNV variation (Mb): Unbalanced | Non-SNV variation (Mb): Balanced | Technologies |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Levy | HuRef (Venter) | EUR | M | Blood | 3 213 401 | 275 512 | 283 961 | 30 | 32 | 90 | – | – | CMA, SS |
| Pang | HuRef (Venter) | EUR | M | Blood | – | 412 304 | 383 775 | 9915 | 13 867 | 167 | 39.5 | 9.3 | CMA, SS |
| Wheeler | Watson | EUR | M | Blood | 3 322 093 | 65 677 | 157 041 | 9 | 14 | – | – | – | 454, CMA |
| Lupski | III-4 (Lupski) | EUR | M | Blood | 3 420 306 | – | – | 123 | 111 | – | – | – | CMA, SOLiD |
| Ebert | NA12878 | EUR | F | LCL | 3 643 864 | 367 945 | 372 590 | 13 954 | 8931 | 108 | 15.3 | 21.7 | PB, S-seq |
| Bentley | NA18507 | AFR | M | Blood | 4 139 196 | 176 221 | 228 195 | 2345 | 5704 | – | – | – | IL |
| Schuster | KB1 | AFR | M | Blood | 4 053 781 | – | – | – | – | – | – | – | 454, IL |
| Ebert | HG03125 | AFR | F | LCL | 4 470 531 | 438 853 | 449 021 | 16 355 | 10 775 | 120 | 17.5 | 22.2 | PB, S-seq |
| Chaisson | NA19240 | AFR | F | LCL | – | 419 842 | 370 245 | 17 026 | 12 421 | 129 | 39.8 | 19.6 | 10x, BN, Hi-C, IL, ON, PB, S-seq |
| Kim | AK1 | EAS | M | Blood | 3 453 653 | 75 141 | 95 061 | 581 | 656 | – | – | – | CMA, IL |
| Seo | AK1 | EAS | M | LCL | 3 472 576 | 169 314 total | (combined with Ins) | 10 077 | 7358 | 71 | 13.6 | 13.5 | 10x, BN, IL, PB |
| Shi | HX1 | EAS | M | Blood | 3 518 309 | 16625 690 total | (combined with Ins) | 10 284 | 9891 | – | 11.0 | – | BN, IL, PB |
| Ebert | HG00512 | EAS | M | LCL | 3 620 202 | 367 796 | 370 030 | 14 055 | 8937 | 122 | 15.5 | 21.0 | PB, S-seq |
| Chaisson | HG00514 | EAS | F | LCL | – | 335 762 | 297 565 | 15 566 | 10 291 | 121 | 39.3 | 14.1 | 10x, BN, Hi-C, IL, ON, PB, S-seq |
| Takayama | JG1 | EAS | M | Blood | 2 501 575 | – | – | 8697 | 6190 | – | – | – | BN, IL, PB |
| Ebert | HG02492 | SAS | M | LCL | 3 565 097 | 372 637 | 347 792 | 13 993 | 8994 | 108 | 16.3 | 20.8 | PB, S-seq |
| Chaisson | HG00733 | AMR | F | LCL | – | 343 950 | 304 170 | 16 566 | 10 607 | 128 | 31.6 | 17.9 | 10x, BN, Hi-C, IL, ON, PB, S-seq |
| Ebert | HG00731 | AMR | M | LCL | 3 693 860 | 379 989 | 379 972 | 14 009 | 8867 | 107 | 15.6 | 20.0 | PB, S-seq |
We selected studies spanning the start of personal genome sequencing in 2007 until 2021, including those from diverse populations analyzed using different technologies. The size definitions used to categorize indels (insertions and deletions) and CNVs (insertions/duplications and deletions) varied between studies, leading to significant differences in the numbers presented. The Levy et al. study (HuRef/Venter genome) provides a composite analysis, demonstrating that relative to the reference genome, ~1.3% of nucleotides were affected by indels and CNVs compared with 0.1% by SNVs. More recent studies further support the idea that non-SNV variation affects several times more nucleotides than SNVs (58,124,141). Where reported, balanced SVs (inversions in most studies) encompass between 9.3 and 22.2 Mb (average 18 Mb). Data from these studies are typically also accessible in public repositories (159,173–175).
The total number of base pairs affected by non-SNV sequence changes. Unbalanced changes include insertions and deletions of all sizes, whereas balanced changes include inversions.
Abbreviations: AFR, African; AMR, Admixed American; EAS, East Asian; EUR, European; SAS, South Asian.
Additions of genetic material are typically described as insertions when detected by comparisons between assembled genomes and as duplications when detected using chromosomal microarray analysis.
The technologies used for sequencing, assembly and variant detection. Abbreviations: 10x, 10x Genomics linked reads (60–62); 454, 454 Life Sciences pyrosequencing (32,33); BN, Bionano Genomics optical mapping (65); CMA, chromosomal microarray analysis (20–26); IL, Illumina (Solexa) sequencing (34); ON, Oxford Nanopore Technologies sequencing (137–140); PB, Pacific Biosciences sequencing (59); SOLiD, sequencing by oligonucleotide ligation and detection (35); SS, Sanger sequencing (49); S-seq, strand-seq (135,136).
Values represent homozygous indels; 292 102 heterozygous indels (not stratified by insertions and deletions in the paper) were also detected.
Insertions and deletions detected using assembly comparison are listed under indels, whereas those detected using other methods are listed under CNVs/SVs.
Detected by Illumina sequencing.
Reflects simple inversions as tabulated in Supplementary Table 9 of Chaisson et al. (141).
Excludes indels.
Composite of three different Japanese males.
LCL, lymphoblastoid cell line.
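The range and average quoted for balanced SVs in the table notes (9.3 to 22.2 Mb, average 18 Mb) can be reproduced from the Balanced column. The sketch below copies the ten reported values from the table; the dictionary keys are just row labels chosen for readability, and rows with no reported value are left out.

```python
from statistics import mean

# Balanced non-SNV variation (Mb), copied from the table rows that report a value.
balanced_mb = {
    "Pang": 9.3,
    "Ebert NA12878": 21.7,
    "Ebert HG03125": 22.2,
    "Chaisson NA19240": 19.6,
    "Seo": 13.5,
    "Ebert HG00512": 21.0,
    "Chaisson HG00514": 14.1,
    "Ebert HG02492": 20.8,
    "Chaisson HG00733": 17.9,
    "Ebert HG00731": 20.0,
}

values = list(balanced_mb.values())
lowest = min(values)
highest = max(values)
average = mean(values)

print(f"Balanced SVs: {lowest} to {highest} Mb, average {average:.1f} Mb (n={len(values)})")
```

The unweighted mean treats each genome equally, which appears to be how the 18 Mb figure was obtained, although the source does not state the method explicitly.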
Table 2. Important genome-wide studies examining de novo variation across a generation
| Study | Families | Phenotype | Technology | DNM rate (events/generation) | Paternal age effect | Maternal age effect |
|---|---|---|---|---|---|---|
| Sebat | 264 | ASD | CMA | 0.01 CNVs | – | – |
| Itsara | 2197 | ASD | CMA | Varies by size | – | – |
| Roach | 1 | See note | WGS | 70 SNVs | – | – |
| Conrad | 2 | NA | WGS | 42 SNVs | – | – |
| Michaelson | 10 | ASD | WGS | 58 SNVs | 1.0 SNVs | – |
| Kong | 78 | ASD, SCZ | WGS | 63 SNVs | 2.0 SNVs | – |
| Campbell | 5 | NA | WGS | 35 SNVs | – | – |
| Gilissen | 50 | ID | WGS | 82 SNVs, 0.16 CNVs | – | – |
| Francioli | 250 | NA | WGS | 43 SNVs | 1.1 SNVs | – |
| Wong | 693 | PTB | WGS | 39 SNVs | 0.64 SNVs | 0.35 SNVs |
| Goldmann | 816 | PTB | WGS | 45 SNVs | 0.91 SNVs | 0.24 SNVs |
| Yuen | 200 | ASD | WGS | 51 SNVs, 4 indels, 0.05 CNVs | – | – |
| Yuen | 1239 | ASD | WGS | 74 SNVs, 13 indels | – | – |
| Jónsson | 1548 | Various | WGS | 65 SNVs, 5 indels | 1.51 SNVs+indels | 0.37 SNVs+indels |
| Maretty | 50 | NA | WGS | 64 SNVs, 6 indels | – | – |
| An | 1902 | ASD | WGS | 62 SNVs, 6 indels | – | – |
| Kessler | 1465 | Various | WGS | 64 SNVs | 1.35 SNVs | 0.42 SNVs |
| Collins | 970 | Various | WGS | 0.29 SVs | – | – |
| Belyeu | 2396 | ASD | WGS | 0.16 SVs | Not significant | Not significant |
| Mitra | 1637 | ASD | WGS | 53 tandem repeat indels | Significant | – |
We selected studies that tested for genome-wide de novo mutation events from population control or disease datasets. Each study has strengths and weaknesses in design, data capture and experimental validation. Four comprehensive studies (90–93) report an average of 64 SNV, 7 indel and 0.05 CNV events per generation.
The phenotype or disease of participants in the study. ‘NA’ means that only healthy controls were used or that no disease phenotype was indicated. ASD, autism spectrum disorder; ID, intellectual disability; PTB, preterm birth; SCZ, schizophrenia.
The technology used for variant detection. CMA, chromosomal microarray analysis; WGS, whole-genome sequencing.
DNM rates are reported in terms of events per generation because this measure is generalizable across variant types (i.e. also including indels and SVs). As mentioned in the text, after adjusting for the proportion of the genome assessed, estimates of per-nucleotide mutation rates for de novo SNVs are consistently reported as ~1.2 × 10⁻⁸ per generation.
The estimated number of additional de novo variants per year of parental age.
CNVs > 99 kb in unaffected individuals only.
CNVs > 30 kb: 0.012; CNVs > 500 kb: 0.0065.
The two siblings in this study each had two recessive disorders.
This study also estimated mutation rates based on heterozygous positions within autozygous segments, giving a per-nucleotide mutation rate of 1.2 × 10⁻⁸ per generation.
CNVs > 10 kb.
Includes 0.15 deletions, 0.1 insertions, 0.04 duplications and 0.001 inversions.
Value is for healthy individuals; DNM rate was significantly higher in ASD-affected individuals (0.21 SVs/generation).
Value is for healthy individuals; DNM rate was slightly but significantly higher in ASD-affected individuals (55 tandem repeat indels/generation).
Paternal age effect was statistically significant, but no slope given.
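The table notes relate two quantities: a per-nucleotide rate (~1.2 × 10⁻⁸ per generation) and events per generation, plus a per-year parental age effect. A rough sketch of how these connect is given below. The callable genome size of 2.7 Gb per haploid genome is an assumption made for illustration only (the source says rates are adjusted "for the proportion of the genome assessed" without giving a number), and the linear age model simply takes the table's "additional de novo variants per year" at face value. Function names are hypothetical.

```python
def expected_dnms_per_child(rate_per_bp: float, callable_haploid_bp: float) -> float:
    """Expected de novo SNVs in a child.

    A child inherits two haploid genomes, so the per-nucleotide,
    per-generation rate is applied to twice the callable haploid size.
    """
    return rate_per_bp * 2 * callable_haploid_bp


def extra_dnms_from_age(slope_per_year: float, extra_years: float) -> float:
    """Additional de novo variants implied by a linear per-year age effect."""
    return slope_per_year * extra_years


PER_BP_RATE = 1.2e-8          # per nucleotide per generation, from the table notes
CALLABLE_HAPLOID_BP = 2.7e9   # ASSUMPTION: illustrative callable genome size

expected_snvs = expected_dnms_per_child(PER_BP_RATE, CALLABLE_HAPLOID_BP)

# Slopes from the Jónsson row (SNVs + indels per year of parental age).
paternal_extra = extra_dnms_from_age(1.51, 10)
maternal_extra = extra_dnms_from_age(0.37, 10)

print(f"Expected de novo SNVs per child: {expected_snvs:.1f}")
print(f"Extra DNMs for a father 10 years older: {paternal_extra:.1f}")
print(f"Extra DNMs for a mother 10 years older: {maternal_extra:.1f}")
```

Under the assumed callable size, the per-nucleotide rate gives roughly 65 de novo SNVs per child, in line with the 64 SNV events per generation reported by the comprehensive studies. A different callable fraction would shift this estimate proportionally.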