Hui Shen, Jian Li, Jigang Zhang, Chao Xu, Yan Jiang, Zikai Wu, Fuping Zhao, Li Liao, Jun Chen, Yong Lin, Qing Tian, Christopher J Papasian, Hong-Wen Deng.
Abstract
Whole genome sequencing studies are essential to obtain a comprehensive understanding of the vast pattern of human genomic variations. Here we report the results of a high-coverage whole genome sequencing study for 44 unrelated healthy Caucasian adults, each sequenced to over 50-fold coverage (averaging 65.8×). We identified approximately 11 million single nucleotide polymorphisms (SNPs), 2.8 million short insertions and deletions, and over 500,000 block substitutions. We showed that, although previous studies, including the 1000 Genomes Project Phase 1 study, have catalogued the vast majority of common SNPs, many of the low-frequency and rare variants remain undiscovered. For instance, approximately 1.4 million SNPs and 1.3 million short indels that we found were novel to both the dbSNP and the 1000 Genomes Project Phase 1 data sets, the majority of which (∼96%) have a minor allele frequency of less than 5%. On average, each individual genome carried ∼3.3 million SNPs and ∼492,000 indels/block substitutions, including approximately 179 variants that were predicted to cause loss of function of the gene products. Moreover, each individual genome carried an average of 44 such loss-of-function variants in a homozygous state, which would completely "knock out" the corresponding genes. Across all 44 genomes, a total of 182 genes were "knocked out" in at least one individual genome, among which 46 genes were "knocked out" in over 30% of our samples, suggesting that a number of genes are commonly "knocked out" in general populations. Gene ontology analysis suggested that these commonly "knocked-out" genes are enriched in biological processes related to antigen processing and immune response. Our results contribute towards a comprehensive characterization of human genomic variation, especially for less-common and rare variants, and provide an invaluable resource for future genetic studies of human variation and diseases.
Year: 2013 PMID: 23577066 PMCID: PMC3618277 DOI: 10.1371/journal.pone.0059494
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary information of population-based whole-genome sequencing studies.
| Study | Aligned Bases (Gb) | Coverage depth | Genome Covered | SNPs (Total) | Indels/Subs (Total) |
| Current study (N = 44) | 188.08 | 65.8× | 96.0% | 3,307,678 (10,871,465) | 492,486 (3,209,732) |
| Korean WGS (N = 10) | 74.09 | 26.1× | NA | 3,602,372 (8,367,302) | 332,561 (1,191,599) |
| Duke WGS | NA | 31.1× | 97.5% | 3,473,639 (10,530,094) | 609,795 (2,736,907) |
| 1000 G_LC² (N = 179) | 10.51 | 3.56× | 86.0% | 3,019,919 (14,894,361) | 361,669 (1,330,158) |
| 1000 G_HCT³ (N = 6) | 118.5 | 41.6× | 79.0% | 3,001,156 (5,907,699) | 352,474 (682,148) |
Numbers shown are average numbers per individual except where indicated otherwise.
¹ For autosomes only; ² 1000 G_LC: low-coverage samples from the 1000 Genomes Project; ³ 1000 G_HCT: high-coverage trios from the 1000 Genomes Project.
Summary of identified SNPs and indels/block substitutions.
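| Variant Type | Total No. of Variants | Average No. of Variants per Individual Genome |
| SNP | 10,871,465 | 3,307,678 |
| SNP > Intergenic | 6,674,155 | 2,054,900 |
| SNP > Intragenic | 4,197,310 | 1,252,778 |
| SNP > Intragenic > Intron | 3,473,672 | 1,043,427 |
| SNP > Intragenic > UTR | 614,753 | 181,267 |
| SNP > Intragenic > Splicing acceptor site | 6,644 | 1,898 |
| SNP > Intragenic > Splicing donor site | 1,547 | 398 |
| SNP > Intragenic > Coding domain | 79,762 | 18,796 |
| SNP > Coding domain > Synonymous | 36,538 | 9,612 |
| SNP > Coding domain > Non-synonymous | 43,224 | 9,184 |
| SNP > Non-synonymous > Missense | 42,549 | 9,082 |
| SNP > Non-synonymous > Nonsense | 616 | 87 |
| SNP > Non-synonymous > Nonstop | 59 | 15 |
| Indels | 2,815,215 | 421,088 |
| Indels > Coding domain | 3,606 | 381 |
| Indels > Coding domain > Frameshift | 2,344 | 217 |
| Indels > Coding domain > Frameshift-preserving | 1,262 | 164 |
| Block substitutions | 394,517 | 71,398 |
| Block substitutions > Coding domain | 1,870 | 274 |
| Block substitutions > Coding domain > Synonymous | 30 | 20 |
| Block substitutions > Coding domain > Frameshift | 264 | 17 |
| Block substitutions > Coding domain > Missense | 1,546 | 234 |
| Block substitutions > Coding domain > Nonsense | 29 | 3 |
| Block substitutions > Coding domain > Nonstop | 1 | <1 |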
Figure 1. Summary characterization of the identified variants.
A–B, Venn diagrams showing the overlap of SNPs and indels identified in the present study with those archived in dbSNP (v131) and the 1000 Genomes Project Phase 1 data sets (released on 5/21/2011). To account for differences in the placement of many indels between data sets, indels were considered to match if they were of the same size and located within 25 bp of each other. Only SNPs and indels mapped to autosomes and the X chromosome were analyzed. C, Genome-wide distribution of novel SNPs. The total number of novel SNPs (compared to dbSNP v131 and the 1000 Genomes Project pilot phase) was calculated in non-overlapping 1-megabase (Mb) windows across the human genome and plotted in ideograms using Idiographica [41]. Diversity is illustrated by color, with red indicating higher numbers or proportions and blue indicating lower numbers or proportions. Genomic regions in which no SNPs were identified or no reference sequence could be determined are shown in grey. D, Allele frequency spectrum of novel SNPs.
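The matching rule and the windowed counts described in this legend can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the `(chrom, pos, size)` tuple layout, the inclusive treatment of the 25 bp limit, and the use of 1-based positions are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import Counter, defaultdict

MAX_DISTANCE_BP = 25  # tolerance stated in the legend; inclusive here by assumption


def build_index(known_indels):
    """Group known indel positions by (chromosome, size), sorted for bisection.

    Each indel is a hypothetical (chrom, pos, size) tuple.
    """
    index = defaultdict(list)
    for chrom, pos, size in known_indels:
        index[(chrom, size)].append(pos)
    for positions in index.values():
        positions.sort()
    return index


def is_known(indel, index, max_distance=MAX_DISTANCE_BP):
    """True if a same-size known indel lies within max_distance bp on the same chromosome."""
    chrom, pos, size = indel
    positions = index.get((chrom, size), [])
    lo = bisect_left(positions, pos - max_distance)
    hi = bisect_right(positions, pos + max_distance)
    return hi > lo


def count_per_window(snps, window_bp=1_000_000):
    """Count SNPs in non-overlapping windows, assuming 1-based positions.

    Keys are (chrom, zero-based window index).
    """
    return Counter((chrom, (pos - 1) // window_bp) for chrom, pos in snps)
```

Whether insertions and deletions of equal length should count as "the same size" is not stated in the legend; with this layout a caller could encode the distinction by using signed sizes.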
Top 10 GO terms significantly enriched or depleted for deleterious coding variants, and enriched for “knocked-out” genes.
| GO Accession # | Biological Process | P-value |
| Enriched for deleterious coding variants | | |
| GO:0050907 | detection of chemical stimulus involved in sensory perception | 2.97E-09 |
| GO:0007606 | sensory perception of chemical stimulus | 3.47E-08 |
| GO:0050911 | detection of chemical stimulus involved in sensory perception of smell | 1.03E-07 |
| GO:0009593 | detection of chemical stimulus | 3.22E-07 |
| GO:0007608 | sensory perception of smell | 7.21E-07 |
| Depleted for deleterious coding variants | | |
| GO:0044260 | cellular macromolecule metabolic process | 1.38E-19 |
| GO:0009987 | cellular process | 1.31E-18 |
| GO:0016070 | RNA metabolic process | 1.52E-17 |
| GO:0043170 | macromolecule metabolic process | 1.96E-15 |
| GO:0006139 | nucleobase-containing compound metabolic process | 2.04E-15 |
| Enriched for “knocked-out” genes | | |
| GO:0002474 | antigen processing and presentation of peptide antigen via MHC class I | 1.79E-23 |
| GO:0019882 | antigen processing and presentation | 1.18E-21 |
| GO:0048002 | antigen processing and presentation of peptide antigen | 1.65E-18 |
| GO:0006611 | protein export from nucleus | 2.95E-12 |
| GO:0006955 | immune response | 3.56E-08 |
P-values for the significance of enrichment were computed with GOrilla (http://cbl-gorilla.cs.technion.ac.il/).
Figure 2. Identification of “knocked-out” genes.
A, Frequency spectrum of observed “knocked-out” genes. Genes containing homozygous loss-of-function (LoF) variants were expected to be silent or knocked out. The number of “knocked-out” genes was counted with respect to the frequency of “knock-out” occurrence in the 44 genomes.
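The counting described in panel A can be sketched as follows. This is a minimal Python illustration, not the authors' code: the input layout (a mapping from sample ID to the genes carrying a homozygous LoF variant in that sample) and both function names are hypothetical, and the strict "more than 30%" threshold follows the wording of the abstract.

```python
from collections import Counter


def knockout_spectrum(homozygous_lof_genes_by_sample):
    """Tally knocked-out genes by how many samples they are knocked out in.

    Returns (samples_per_gene, spectrum), where spectrum[k] is the number
    of genes knocked out in exactly k samples.
    """
    samples_per_gene = Counter()
    for genes in homozygous_lof_genes_by_sample.values():
        samples_per_gene.update(set(genes))  # count each gene once per sample
    spectrum = Counter(samples_per_gene.values())
    return samples_per_gene, spectrum


def commonly_knocked_out(samples_per_gene, n_samples, min_fraction=0.30):
    """Genes knocked out in more than min_fraction of the samples."""
    return sorted(
        gene
        for gene, count in samples_per_gene.items()
        if count / n_samples > min_fraction
    )
```

Applied to the 44 genomes, `len(samples_per_gene)` would correspond to the number of genes knocked out in at least one individual, and `commonly_knocked_out(..., n_samples=44)` to the subset reported as commonly knocked out.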
Figure 3. The number of novel SNPs and indels discovered as the number of sequenced genomes increased.
We evaluated how many additional “new” A) SNPs and B) indels were identified per genome as the number of sequenced genomes increased, treating as already known both the variants archived in databases (dbSNP v131 and the 1000 Genomes Project Phase 1 data) and the variants “discovered” in previously considered genomes. The 44 genomes were added to the analyses in a random order. The average numbers of novel variants added per genome over 1000 permutations are shown, along with the best-fitting trend line for each plot.
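The permutation procedure in this legend can be sketched in Python as below. It is an illustrative reconstruction under stated assumptions rather than the authors' implementation: each genome is represented as a set of variant identifiers, `known` stands for the database variants, and the function name, the `seed` parameter, and the in-memory set approach are all hypothetical choices that would need adapting for millions of variants per genome.

```python
import random


def discovery_curve(genomes, known, n_permutations=1000, seed=0):
    """Average number of new variants contributed by the k-th genome added.

    genomes: list of sets of variant identifiers, one set per genome.
    known:   set of variants already archived in reference databases.
    Returns a list whose k-th entry is the mean count of variants seen
    neither in `known` nor in any of the genomes added before position k.
    """
    rng = random.Random(seed)
    totals = [0] * len(genomes)
    order = list(range(len(genomes)))
    for _ in range(n_permutations):
        rng.shuffle(order)       # add the genomes in a random order
        seen = set(known)        # start each pass from the database variants only
        for k, idx in enumerate(order):
            new = genomes[idx] - seen
            totals[k] += len(new)
            seen |= new
    return [total / n_permutations for total in totals]
```

A trend line like the one mentioned in the legend could then be fitted to the returned averages; the legend does not state which functional form was used.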