Pauline C Ng, Samuel Levy, Jiaqi Huang, Timothy B Stockwell, Brian P Walenz, Kelvin Li, Nelson Axelrod, Dana A Busam, Robert L Strausberg, J Craig Venter.
Abstract
There is much interest in characterizing the variation in a human individual, because this may elucidate what contributes significantly to a person's phenotype, thereby enabling personalized genomics. We focus here on the variants in a person's 'exome,' which is the set of exons in a genome, because the exome is believed to harbor much of the functional variation. We provide an analysis of the approximately 12,500 variants that affect the protein coding portion of an individual's genome. We identified approximately 10,400 nonsynonymous single nucleotide polymorphisms (nsSNPs) in this individual, of which approximately 15-20% are rare in the human population. We predict that approximately 1,500 nsSNPs affect protein function, and these tend to be heterozygous, rare, or novel. Of the approximately 700 coding indels, approximately half have lengths that are a multiple of three, which causes insertions/deletions of amino acids in the corresponding protein rather than introducing frameshifts. Coding indels also occur frequently at the termini of genes, so even if an indel causes a frameshift, an alternative start or stop site in the gene can still be used to make a functional protein. In summary, we reduced the set of approximately 12,500 nonsilent coding variants by approximately 8-fold to a set of variants that are most likely to have major effects on their proteins' functions. This is our first glimpse of an individual's exome and a snapshot of the current state of personalized genomics. The majority of coding variants in this individual are common and appear to be functionally neutral. Our results also indicate that some variants can be used to improve the current NCBI human reference genome. As more genomes are sequenced, many rare variants and non-SNP variants will be discovered. We present an approach to analyze the coding variation in humans by proposing multiple bioinformatic methods to home in on possible functional variation.
Year: 2008 PMID: 18704161 PMCID: PMC2493042 DOI: 10.1371/journal.pgen.1000160
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
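Number of coding SNPs in HuRef.
| SNP type | Zygosity | Status | Count |
|---|---|---|---|
| Synonymous | All | All | 10,413 |
| Synonymous | Heterozygous | Novel | 551 |
| Synonymous | Heterozygous | dbSNP | 5,183 |
| Synonymous | Homozygous | Novel | 98 |
| Synonymous | Homozygous | dbSNP | 4,581 |
| Nonsynonymous | All | All | 10,389 |
| Nonsynonymous | Heterozygous | Novel | 557* |
| Nonsynonymous | Heterozygous | dbSNP | 5,047 |
| Nonsynonymous | Homozygous | Novel | 215 |
| Nonsynonymous | Homozygous | dbSNP | 4,570 |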
*: All heterozygous novel nonsynonymous SNPs were manually inspected (see Methods).
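As a quick consistency check on the table above, the following Python sketch sums the four zygosity/status categories for each SNP type and compares the result with the reported totals. The counts are copied from the table; the dictionary layout and all function names are illustrative choices, not part of the original analysis.

```python
# Counts copied from the table "Number of coding SNPs in HuRef".
# The nested-dict layout and all names are illustrative assumptions.
counts = {
    "synonymous": {
        ("heterozygous", "novel"): 551,
        ("heterozygous", "dbSNP"): 5183,
        ("homozygous", "novel"): 98,
        ("homozygous", "dbSNP"): 4581,
    },
    "nonsynonymous": {
        ("heterozygous", "novel"): 557,
        ("heterozygous", "dbSNP"): 5047,
        ("homozygous", "novel"): 215,
        ("homozygous", "dbSNP"): 4570,
    },
}
reported_totals = {"synonymous": 10413, "nonsynonymous": 10389}


def class_total(snp_type):
    """Sum all four zygosity/status categories for one SNP type."""
    return sum(counts[snp_type].values())


def novel_count(snp_type):
    """Sum the heterozygous and homozygous novel counts for one SNP type."""
    return sum(n for (_, status), n in counts[snp_type].items() if status == "novel")


for snp_type, reported in reported_totals.items():
    total = class_total(snp_type)
    novel = novel_count(snp_type)
    print(f"{snp_type}: sum={total}, reported={reported}, "
          f"novel={novel} ({novel / total:.1%})")
```

The category counts add up to the reported totals for both SNP types, and the novel nonsynonymous count (557 + 215 = 772) agrees with the number given in the exome comparison table.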
Figure 1. The allele frequencies of heterozygous and homozygous nsSNPs in HuRef.
For heterozygous SNPs, the minor allele frequency is plotted. For homozygous nsSNPs, the frequency for the observed allele in HuRef is plotted.
Figure 2. The percentage of nsSNPs predicted to affect protein function, by category.
A higher fraction of heterozygous, novel, and rare nsSNPs are predicted to affect function compared to homozygous and common nsSNPs. Rare nsSNPs have allele frequencies < 0.05; common nsSNPs have allele frequencies ≥ 0.05.
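The rare/common cutoff in this caption can be sketched in a few lines of Python. This is a minimal illustration, assuming biallelic sites and assuming that heterozygous sites are classified by their minor allele frequency while homozygous sites are classified by the frequency of the observed allele (as plotted in Figure 1). The function names and example inputs are hypothetical.

```python
# Threshold from the Figure 2 caption: rare < 0.05, common >= 0.05.
RARE_THRESHOLD = 0.05


def minor_allele_frequency(allele_freq: float) -> float:
    """Frequency of the less common allele at a biallelic site."""
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return min(allele_freq, 1.0 - allele_freq)


def frequency_class(freq: float) -> str:
    """Label a frequency as 'rare' or 'common' using the 0.05 cutoff."""
    return "rare" if freq < RARE_THRESHOLD else "common"


# Hypothetical inputs, for illustration only.
# Heterozygous site: classify by minor allele frequency.
print(frequency_class(minor_allele_frequency(0.98)))
# Homozygous site: classify by the frequency of the observed allele.
print(frequency_class(0.48))
```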
Figure 3. The size distribution of coding indels.
Coding indels are predominantly the size of 3n, where n is an integer. 3n coding indels do not cause frameshifts, whereas non-3n coding indels do.
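The 3n rule can be expressed directly in code. The sketch below classifies indels by length only; the function names and the example lengths are hypothetical and are not the HuRef data.

```python
def causes_frameshift(indel_length_bp: int) -> bool:
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    if indel_length_bp <= 0:
        raise ValueError("indel length must be a positive number of base pairs")
    return indel_length_bp % 3 != 0


def summarize(lengths):
    """Count in-frame (3n) and frameshifting (non-3n) indels."""
    in_frame = sum(1 for n in lengths if not causes_frameshift(n))
    return {"in_frame": in_frame, "frameshift": len(lengths) - in_frame}


# Hypothetical indel lengths in base pairs, for illustration only.
example_lengths = [1, 2, 3, 3, 6, 7, 9, 12]
print(summarize(example_lengths))
```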
Figure 4. Location of coding indels.
On the x-axis is the relative protein location of the coding indel, which is the first amino acid position of the indel divided by the protein length. A relative protein location near zero indicates that the indel is located near the N-terminus of the protein and a relative protein location near one indicates that the indel is located near the C-terminus of the protein. Indels occur frequently at the N- and C-termini of proteins.
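The relative protein location defined in this caption is a simple ratio, sketched below in Python. The `terminal_region` helper and its `cutoff` of 0.1 are assumptions added for illustration; the caption speaks only of locations "near zero" and "near one" and does not specify a cutoff.

```python
def relative_protein_location(first_aa_position: int, protein_length: int) -> float:
    """First amino acid position of the indel divided by the protein length."""
    if protein_length <= 0:
        raise ValueError("protein length must be positive")
    if not 1 <= first_aa_position <= protein_length:
        raise ValueError("position must lie within the protein")
    return first_aa_position / protein_length


def terminal_region(rel_loc: float, cutoff: float = 0.1) -> str:
    """Label a relative location; the cutoff is an arbitrary illustrative choice."""
    if rel_loc <= cutoff:
        return "N-terminal"
    if rel_loc >= 1.0 - cutoff:
        return "C-terminal"
    return "internal"


# Hypothetical indel starting at residue 5 of a 500-residue protein.
loc = relative_protein_location(5, 500)
print(loc, terminal_region(loc))
```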
Figure 5. An example of a homozygous indel located near an exon boundary.
The HuRef assembly has a homozygous insertion of A at chr11:44881936. The insertion lies inside a coding exon of the gene TP53I11, near an annotated 2 bp intron. With the new base inserted, a single amino acid is introduced into the protein sequence, which is a more likely scenario than a 2 bp intron.
HuRef nsSNPs with known disease associations.
| SNP | Gene | Genotype-Phenotype |
|---|---|---|
| rs6265 (V74M) | BDNF | MAF = 0.18 |
| | | Increased risk for eating disorder, OR = 1.6 |
| | | Met/Met: Inferior episodic memory |
| | | Met/Met: Later onset of Parkinson's disease |
| | | Met: In bipolar patients, less adaptive to change |
| | | Predicted to affect protein function. |
| rs1800556 (R171W) | ACADS | MAF = 0.17 |
| | | ACADS with 171W has residual activity (45%). Because this is polymorphic in the control population, this is a predisposition allele that can cause SCAD deficiency if additional factors are present |
| | | Predicted to affect protein function. |
| rs1805389 (A3V) | LIG4 | MAF = 0.02; MAF = 0.07 in |
| | | 1.5-fold reduction in risk of developing multiple myeloma for heterozygotes |
| | | Predicted to be functionally neutral. |
| rs13073139 (A171T) | BTD | MAF = 0.17 |
| | | Biotinidase deficiency, asymptomatic in heterozygous form |
| | | A171T is predicted to be functionally neutral. |
| | | D444H is predicted to affect protein function. |
| rs2303067 (E422K) | SPINK5 | MAF = 0.48. Associated with allergies, atopic dermatitis, asthma, and total serum IgE. Paternally derived alleles tended to be less often associated with disease than maternal alleles |
| | | Predicted to be functionally neutral. |
| rs11556045 (K121R) | HEXB | MAF = 0.22. Observed in a patient with juvenile Sandhoff disease, but the patient had another mutation which activated a cryptic splice site. This SNP is unlikely to be the causative variant. |
| | | Predicted to be functionally neutral. |
| rs4880 (A16V) | SOD2 | MAF = 0.44. Risk of prostate cancer depends on genotype, vitamin E uptake, and smoking status. For the heterozygote, there is an increased risk of prostate cancer. Val/Val: OR = 1; Val/Ala: OR = 1.17; Ala/Ala: OR = 1.28 |
| | | This risk is increased with smoking and low vitamin E uptake |
| | | Within hereditary haemochromatosis patients, carriers of the Val allele have a higher prevalence of cardiomyopathy |
| | | Mutant protein has 30–40% lower activity. |
| | | Predicted to be functionally neutral. |
All of these SNPs were heterozygous in HuRef.
OR = odds ratio. We note that these associations should be interpreted with caution because there are disagreements in the published literature [94], [96]–[98]. Additional genotype-phenotype relationships for this individual can also be found in Table 13 of [16].
Three nsSNPs have been shown to cause reduced protein activity. Of these, two are predicted to affect function.
Three nsSNPs are associated with disease but, to the best of our knowledge, have not yet been tested in enzymatic assays; they are presumed to be involved in disease. It is possible that these nsSNPs are not the etiological variants but are instead in linkage disequilibrium with the etiological variants. One of these three nsSNPs is predicted to affect protein function.
Diversity Rates for Autosomal Chromosomes.
| Region | Dr. Venter: SNP Diversity (×10⁻⁴) | Dr. Venter: Indel Diversity (×10⁻⁴) | Dr. Watson: SNP Diversity (×10⁻⁴) |
|---|---|---|---|
| Total | 6.2 | 0.9 | 6.5 |
| | | | |
| CDS | 3.6 | 0.08 (filtered: 0.09) | 4.0 |
| CDS of disease genes | 2.9 | 0.06 (filtered: 0.04) | 3.0 |
| Constitutive exons | 3.5 | 0.08 (filtered: 0.08) | 4.0 |
| Alternative exons | 4.6 | 0.1 (filtered: 0.06) | 4.8 |
| 5′UTR | 4.4 | 0.3 | 4.9 |
| 3′UTR | 4.7 | 0.7 | 5.1 |
| Splice sites | 4.0 | 0.6 | 4.6 |
| Promoter (1 kb upstream) | 5.4 | 0.8 | 6.1 |
| Introns | 5.6 | 0.9 | 6.2 |
| | | | |
| All | 4.3 | 0.5 | 4.9 |
| Intronic conserved | 3.8 | 0.5 | 4.3 |
| Intergenic conserved | 4.7 | 0.5 | 5.2 |
| | | | |
| All | 6.6 | 1.2 | 7.5 |
| Alu | 7.2 | 2.6 | 7.5 |
| MIR | 5.6 | 0.3 | 6.1 |
| MER | 6.3 | 0.5 | 7.0 |
| LTR | 7.2 | 0.4 | 8.0 |
| L1 | 6.6 | 0.6 | 7.2 |
| L2 | 5.6 | 0.4 | 6.3 |
| Simple repeats (xxx)n | 8.8 | 15 | 19 |
The diversity rates based on Dr. Venter's and Dr. Watson's genomes are probably underestimated by ∼25%, the percentage of heterozygotes missed due to low read coverage [16],[20].
For coding indels, we show the diversity values before and after filtering. Some diversity values for indels are higher after filtering because homozygous indels were re-classified as heterozygous indels.
For Dr. Watson's genome, we assume that the entire genome was covered by reads, which will also lead to an underestimate (see Methods). There is a large difference between the diversity values for Dr. Watson's and Dr. Venter's genomes for simple repeats. This may be due to the methodological differences between Sanger and 454 technologies.
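To give a feel for the size of the ∼25% underestimate mentioned in the footnotes, the sketch below rescales an observed diversity value. It rests on a simplifying assumption that is not stated in the source: that the measured diversity scales linearly with the fraction of heterozygous variants detected, so that observed = true × (1 − missed fraction). The function name and default are illustrative.

```python
# Approximate fraction of heterozygotes missed due to low read coverage,
# as stated in the table footnote.
MISSED_HET_FRACTION = 0.25


def corrected_diversity(observed: float,
                        missed_fraction: float = MISSED_HET_FRACTION) -> float:
    """Rescale an observed diversity, ASSUMING it is proportional to the
    fraction of heterozygous variants that were actually detected."""
    if not 0.0 <= missed_fraction < 1.0:
        raise ValueError("missed fraction must lie in [0, 1)")
    return observed / (1.0 - missed_fraction)


# Total autosomal SNP diversity from the first column of the table (6.2 x 10^-4).
observed_total_snp = 6.2e-4
print(f"{corrected_diversity(observed_total_snp):.2e}")
```

Under this assumption the correction is a constant factor, so ratios between regions (for example CDS versus introns) are unchanged.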
Figure 6. The Ka/Ks ratios of Commonly-Affected genes and Rarely-Affected Genes.
Commonly-Affected genes have a higher Ka/Ks ratio than Rarely-Affected genes, which suggests that Commonly-Affected genes are under weaker selection.
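For readers unfamiliar with the statistic, Ka/Ks is the ratio of the nonsynonymous substitution rate (Ka) to the synonymous substitution rate (Ks). The sketch below only computes the ratio from values that are already estimated; it does not estimate Ka or Ks from alignments. The function name and the example numbers are hypothetical.

```python
def ka_ks_ratio(ka: float, ks: float) -> float:
    """Ratio of nonsynonymous (Ka) to synonymous (Ks) substitution rates."""
    if ka < 0:
        raise ValueError("Ka must be non-negative")
    if ks <= 0:
        raise ValueError("Ks must be positive for the ratio to be defined")
    return ka / ks


# Hypothetical per-gene values, for illustration only.
tightly_constrained = ka_ks_ratio(ka=0.02, ks=0.10)
weakly_constrained = ka_ks_ratio(ka=0.06, ks=0.10)

# A ratio well below 1 indicates purifying selection; a ratio closer to 1
# indicates weaker constraint on amino acid changes.
print(tightly_constrained < weakly_constrained)
```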
Characterization of Dr. Venter's and Dr. Watson's exomes. Numbers for Dr. Watson's exome are taken from [20].
| | Dr. Venter's Exome | Dr. Watson's Exome |
|---|---|---|
| Total Number of Nonsynonymous SNPs | 10,389 | 10,569 |
| Number of Novel Nonsynonymous SNPs | 772 (7% of total nsSNPs) | 1,573 (15% of total nsSNPs) |
| % nsSNPs predicted to affect protein function* | 14% (predictions made for 7,781 nsSNPs) | 20% (predictions made for 3,898 nsSNPs) |
| Number of Coding Indels** | 739 | 345 |
*: Different prediction algorithms were used [30],[33], and this may account for the difference between the two exomes.
**: Indels of size 2 bp and greater were considered; 1 bp indels were discarded. If we removed 1 bp indels from Dr. Venter's exome in order to compare with Dr. Watson's exome, Dr. Venter would have 423 coding indels.
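The percentages and the like-for-like indel comparison in this table can be reproduced with a short Python sketch. All numbers are copied from the table and its footnotes; the dictionary keys and function name are illustrative.

```python
# Values copied from the exome comparison table and its footnotes.
venter = {"nssnps": 10389, "novel_nssnps": 772,
          "indels_all": 739, "indels_ge_2bp": 423}
watson = {"nssnps": 10569, "novel_nssnps": 1573,
          "indels_ge_2bp": 345}


def novel_fraction(exome):
    """Fraction of nonsynonymous SNPs that are novel."""
    return exome["novel_nssnps"] / exome["nssnps"]


for name, exome in (("Venter", venter), ("Watson", watson)):
    print(f"{name}: {novel_fraction(exome):.1%} of nsSNPs are novel")

# Number of 1 bp coding indels implied for Dr. Venter's exome.
venter_1bp_indels = venter["indels_all"] - venter["indels_ge_2bp"]
print(f"Venter 1 bp indels: {venter_1bp_indels}")

# Like-for-like comparison of indels of 2 bp and greater.
print(f"Indels >= 2 bp: Venter {venter['indels_ge_2bp']}, "
      f"Watson {watson['indels_ge_2bp']}")
```

Restricting both exomes to indels of 2 bp and greater narrows the gap from 739 versus 345 to 423 versus 345.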
Figure 7. A summary of the nonsilent coding variants and their observed trends.