Distribution and effects of nonsense polymorphisms in human genes.

Yumi Yamaguchi-Kabata, Makoto K Shimada, Yosuke Hayakawa, Shinsei Minoshima, Ranajit Chakraborty, Takashi Gojobori, Tadashi Imanishi.

Abstract

BACKGROUND: A great amount of data has been accumulated on genetic variations in the human genome, but we still do not know much about how these variations affect gene function. In particular, little is known about the distribution of nonsense polymorphisms in human genes despite their drastic effects on gene products.
METHODOLOGY/PRINCIPAL FINDINGS: To detect polymorphisms affecting gene function, we analyzed all publicly available polymorphisms in a database for single nucleotide polymorphisms (dbSNP build 125) located in the exons of 36,712 known and predicted protein-coding genes that were defined in an annotation project of all human genes and transcripts (H-InvDB ver3.8). We found a total of 252,555 single nucleotide polymorphisms (SNPs) and 8,479 insertions and deletions in the representative transcripts in these genes. The SNPs located in ORFs include 40,484 synonymous and 53,754 nonsynonymous SNPs, and 1,258 SNPs that were predicted to be nonsense SNPs or read-through SNPs. We estimated the density of nonsense SNPs to be 0.85×10^-3 per site, which is lower than that of nonsynonymous SNPs (2.1×10^-3 per site). On average, nonsense SNPs were located 250 codons upstream of the original termination codon, with the substitution occurring most frequently at the first codon position. Of the nonsense SNPs, 581 were predicted to cause nonsense-mediated decay (NMD) of transcripts that would prevent translation. We found that nonsense SNPs causing NMD were more common in genes involving kinase activity and transport. The remaining 602 nonsense SNPs are predicted to produce truncated polypeptides, with an average truncation of 75 amino acids. In addition, 110 read-through SNPs at termination codons were detected.
CONCLUSION/SIGNIFICANCE: Our comprehensive exploration of nonsense polymorphisms showed that nonsense SNPs exist at a lower density than nonsynonymous SNPs, suggesting that nonsense mutations have more severe effects than amino acid changes. The correspondence of nonsense SNPs to known pathological variants suggests that phenotypic effects of nonsense SNPs have been reported for only a small fraction of nonsense SNPs, and that nonsense SNPs causing NMD are more likely to be involved in phenotypic variations. These nonsense SNPs may include pathological variants that have not yet been reported. These data are available from Transcript View of H-InvDB and VarySysDB (http://h-invitational.jp/varygene/).

Year:  2008        PMID: 18852891      PMCID: PMC2561068          DOI: 10.1371/journal.pone.0003393

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Genetic variations in the human genome are maintained by a balance of mutation, selection and random genetic drift. Some of these polymorphisms cause phenotypic variations and diseases. Therefore, many studies have attempted to identify causative variants of genetic diseases and the relationships between genetic variations and phenotypic effects. Genetic variations at linked loci tend to be inherited together in the same gamete. Based on this linkage, loci that contain disease-causing genes have been mapped by using polymorphic markers. At present, about 14 million clusters of genetic polymorphisms have been identified in the human genome [1]. On average, two haploid genomes are estimated to differ by one single nucleotide polymorphism (SNP) in every 1200–1500 bp [2]. SNPs have recently been used in genome-wide association studies to find genomic regions associated with susceptibility to diseases and with phenotypic variations [3], [4], [5], [6]. In this approach, causative polymorphisms for diseases or phenotypic variations are usually identified after susceptible genomic regions have been located by using SNP markers. Such SNPs are called landmark SNPs, and the indirect relationships between these polymorphisms and phenotypic variations are examined to identify the genomic regions where causative genes are located. Another approach to finding pathological variants is to extract polymorphisms that alter amino acids in functional genes or affect gene expression or splicing, using a comprehensive set of functional elements of the human genome. Several studies have analyzed nonsynonymous SNPs to predict pathological variants [7], [8], [9], [10], [11], [12], [13], [14]. A large number of nonsynonymous SNPs have also been examined for associations with diseases [15], [16].
Although many pathological mutations have been identified [17], [18], the number of such variants is small compared to the number of known polymorphisms, and it is still unclear which polymorphisms have biological effects. In a study of consanguineous marriage [19], it was estimated that each person carries deleterious alleles that are equivalent to a few lethal genes. Gene-centric SNP surveys have shown that the ratio of nonsynonymous to synonymous SNPs is significantly higher in the low frequency class than in the common frequency class [20], [21], [22]. These results suggest that a large fraction of the low frequency nonsynonymous SNPs are deleterious. To understand the molecular basis of the effects of human genetic variations on phenotypic variations, a prediction analysis of the possible effects of polymorphisms on gene function in all human genes is needed. In this study, to detect polymorphisms affecting gene function, we analyzed all publicly available polymorphisms in the Single Nucleotide Polymorphism Database (dbSNP) (build 125) in the exons of all 36,712 protein-coding genes that were defined in an annotation project of all human genes and transcripts (H-InvDB ver3.8) [23], [24]. Using representative transcripts (one transcript per gene), we detected 53,754 nonsynonymous SNPs and 1,417 SNPs causing changes between amino acids and stop codons. Among possible point mutations in ORFs, nonsense mutations cause the most drastic changes in gene products. In fact, several reports have shown that nonsense mutations cause genetic diseases [25], [26], [27], [28]. Truncation of a polypeptide by a premature termination codon causes a drastic change in the gene product. Furthermore, it is known that a nonsense mutation can cause decay of mRNA, resulting in the absence of the gene product. This process, called ‘nonsense-mediated decay’ (NMD), limits the synthesis of abnormal proteins [29], [30], [31].
On the other hand, the loss of a termination codon in a transcript also appears to cause decay of mRNA (referred to as non-stop decay) and thus to prevent translation [32], [33]. Despite the severe effects of nonsense mutations, the distribution of nonsense SNPs in human genes is poorly understood. In this study, we examined the density of nonsense SNPs in human genes, and showed that nonsense SNPs exist at a lower density than nonsynonymous SNPs, possibly because premature stop codons have more severe effects than amino acid changes. About half of the nonsense SNPs are predicted to cause NMD. The correspondence between known pathological variants and nonsense SNPs suggests that nonsense SNPs causing NMD are more likely to be involved in phenotypic variations.

Results

Selection and classification of polymorphisms in exon regions

We analyzed 9,235,997 polymorphisms (dbSNP build 125) in the human genome with exon positions and predicted ORFs that were revealed in our annotation project of human genes (H-InvDB) (Figure 1). In all of the 36,712 protein-coding loci in the genome, we detected 252,555 SNPs and 8,479 insertions and deletions (indels) that exist in exon regions of the representative transcript (one transcript per gene) (Table 1). The polymorphisms in the exon regions were further classified according to the predicted ORFs. We detected 96,164 SNPs within the ORFs, 51,881 SNPs in the 5′UTR regions and 104,510 SNPs in the 3′UTR regions. Among the SNPs in the ORFs, 40,484 were synonymous and 53,754 were nonsynonymous (further analyses of nonsynonymous SNPs are described in Results S1). Most of the indels were detected in the UTR regions. The ORF regions contained 1,258 SNPs that cause changes between amino acids and stop codons (Table S1). Of these 1,258 SNPs, 1,183 were regarded as nonsense SNPs, while 75 were found to have stop codons as ancestral alleles. We also detected 247 SNPs at termination codon sites, 88 of which were synonymous. The remaining 159 SNPs were changes between stop codons and amino acids. After checking ancestral alleles, 110 of the 159 SNPs were inferred to be read-through SNPs, while the other 49 were inferred to be changes to stop codons.
Figure 1

Analysis of polymorphisms with gene structure.

Top: Scheme of analysis pipeline of polymorphisms with gene structure. Bottom: Screen shots taken from ‘Transcript View’ in H-InvDB that show classified SNPs and their positions (blue bars) in the CASP12 gene.

Table 1

SNPs and indels in exon, intron and other genomic regions.

Polymorphism   Exon       Intron       Other genomic regions
SNPs           249,182    3,332,537    5,209,127
Indels         9,742      185,761      249,648

Polymorphisms mapped on single positions were analyzed with 36,712 protein-coding genes.

Distribution of polymorphisms in exon regions

Densities of polymorphisms were estimated for 23,717 genes: those whose functions are clearly defined or suggested (similarity categories I–III, see Materials and Methods) and those annotated as conserved hypothetical proteins (similarity category IV). To estimate the densities of synonymous, nonsynonymous and nonsense SNPs in the ORFs, we calculated the numbers of potential nucleotide sites for synonymous, nonsynonymous and nonsense mutations in the coding regions. The fractions of sites in the coding regions for synonymous, nonsynonymous, and nonsense mutations were estimated to be 28.5%, 68.1%, and 3.4%, respectively. Of the three types of SNPs, synonymous SNPs had the highest density in ORFs, 4.1×10^-3 per synonymous site (Table 2). The estimated density of nonsynonymous SNPs was 2.1×10^-3 per site (Table 2). The lower density of nonsynonymous SNPs, 51% of that of synonymous SNPs, is attributable to the functional constraint on amino acid changes, and is in agreement with previous studies [20], [22], [34]. However, the ratio of nonsynonymous to synonymous SNPs per site is higher in this study than in previous studies (32–34%) [20], [21], [22], which focused on specific populations. The higher ratio in this study may be because our study is based on pooled data from various populations worldwide. This study includes many nonsynonymous SNPs that exist at relatively low frequencies and are likely to be more population-specific than synonymous SNPs [20].
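As a rough consistency check, the reported densities can be approximately recomputed from the SNP counts in Tables 2 and 3, the average ORF length given in the footnote to Table 2, and the site fractions above. The following is a minimal sketch, not the authors' calculation: it assumes that density = SNP count / (total ORF length × fraction of sites of that type), and all variable names are illustrative.

```python
# Approximate per-site SNP densities from values reported in the text and tables.
# Assumption (not stated in the source): density = count / (total ORF bp * site fraction).
N_GENES = 23_717        # genes in similarity categories I-IV
MEAN_ORF_BP = 1343.5    # average ORF length in these genes (Table 2 footnote)

# Fractions of coding sites at which a mutation would be of each type.
site_fraction = {"synonymous": 0.285, "nonsynonymous": 0.681, "nonsense": 0.034}
# SNP counts for genes in categories I-IV (Table 2; nonsense count from Table 3).
snp_count = {"synonymous": 37_484, "nonsynonymous": 46_261, "nonsense": 910}

total_orf_bp = N_GENES * MEAN_ORF_BP

density = {
    kind: snp_count[kind] / (total_orf_bp * site_fraction[kind])
    for kind in snp_count
}

for kind, value in density.items():
    print(f"{kind:14s} {value * 1e3:.2f} x 10^-3 per site")
```

Under this assumption the three values come out close to the reported 4.1×10^-3, 2.1×10^-3 and 0.85×10^-3 per site; small differences are expected because the inputs are rounded.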
Table 2

Classified SNPs in exon regions.

Region       Effects on translation   Genes in category I–IV (a)     All protein-coding genes (b)
5′UTR        -                        23454 [3.3×10^-3/site] (c)     51881
ORF          Total                    85233 [2.7×10^-3/site]         96164
ORF          Synonymous               37484 [4.1×10^-3/site]         40484
ORF          Nonsynonymous            46261 [2.1×10^-3/site]         53754
ORF          AA↔Ter (d)               938                            1258
ORF          Unclassified (e)         398                            421
Stop codon   Total                    152                            247
Stop codon   Synonymous               63                             88
Stop codon   Ter↔AA (d)               89                             159
3′UTR        -                        69691 [3.3×10^-3/site]         104510
Total        -                        178378                         252555

Representative transcripts in 23,717 genes whose function were defined or suggested (similarity category I–III) and genes annotated as conserved hypothetical proteins (similarity category IV).

Representative transcripts in all protein-coding genes (36,712) including genes in similarity category I–IV plus similarity category V–VII (hypothetical protein, hypothetical short protein, and pseudogene candidate, respectively).

Densities of polymorphisms are shown in brackets as average number of polymorphisms per site. The average lengths of the 5′UTR, ORF and 3′UTR regions in 23717 genes were 303.9 bp, 1343.5 bp, and 877.6 bp, respectively. The densities of SNPs for synonymous, nonsynonymous and nonsense SNPs in ORFs were calculated based on the numbers of potential nucleotide sites for synonymous, nonsynonymous and nonsense mutations in coding regions. The density of nonsense SNPs is shown in Table 3.

SNPs causing changes between amino acids and stop codons.

Table 3

SNPs causing changes between amino acids and stop codons.

Region       Effects on translation   Genes in category I–IV (a)     All protein-coding genes (a)
ORF          Nonsense                 910 [0.85×10^-3/site] (d)      1183
ORF          Read-through (b)         28                             75
Stop codon   Read-through             67                             110
Stop codon   Nonsense (c)             22                             49

These two gene sets are the same as Table 2.

Possible read-through SNPs in which alleles coding stop codons were ancestral type. This may be due to existence of shorter ORFs in the ancestral population.

Possible nonsense SNPs in which alleles coding stop codons were derived alleles. This may be due to existence of longer ORFs in the ancestral population.

The densities of nonsense SNPs in ORFs were calculated based on the numbers of potential nucleotide sites for nonsense mutations in coding regions.

Among random nucleotide mutations in ORFs, 3.4% would be expected to be nonsense mutations; however, the distribution of nonsense SNPs has not previously been evaluated or reported. The density of nonsense SNPs was estimated to be 0.85×10^-3 per site (Table 3), which is only 21% of the density of synonymous SNPs and 40% of the density of nonsynonymous SNPs. Nonsense SNPs may have the lowest density because premature stop codons have more severe effects than amino acid changes.
In the exons of the 36,712 loci, 8,479 indels were detected, and 1,532 of them were found in ORFs. Among the latter, 1,331 are expected to cause frame shifts, resulting in drastic changes in proteins. The density of indels in ORFs was much lower than in the UTR regions (Table 4). The lower density of indels in the 5′UTRs than in the 3′UTRs suggests that functional constraint on insertions and deletions is stronger in the 5′UTR regions than in the 3′UTR regions.
Table 4

Insertions and deletions in exon regions.

Region   Genes in category I–IV (a)   All protein-coding genes (a)
5′UTR    785 [0.11×10^-3] (b)         2005
ORF      1120 [0.035×10^-3]           1532
3′UTR    3323 [0.16×10^-3]            4942
Total    5225 (c)                     8479

These two gene sets are the same as Table 2.

Densities of polymorphisms are shown in brackets as average number of polymorphisms per site.

Three indels were located on both of ORF and UTR.

Nonsense SNPs

We examined the patterns and the positions of the nonsense SNPs. There are 23 possible ways to change codons into stop codons (nine, seven and seven for the first, second and third positions, respectively), and all 23 were found (Table 5). Nonsense SNPs were more frequent at the first codon position than at the second and third positions (p<0.005, chi-square test). The most frequent type of nonsense mutation is the change from CGA to TGA (Table 5), which is a transitional change at CpG mutation hotspots [35]. However, it is notable that there were frequent transversional mutations such as GAA to TAA and GAG to TAG. Our analyses of nonsense polymorphisms revealed that changes between hydrophilic amino acids and termination codons by nucleotide changes at the first codon positions were very frequent.
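The count of 23 possible codon changes (nine, seven and seven by codon position) can be verified by enumerating every single-nucleotide substitution that turns a sense codon into one of the three stop codons. This is an illustrative sketch; the function and variable names are not from the original study.

```python
from collections import Counter

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def nonsense_changes():
    """Yield (codon_position, sense_codon, stop_codon) for every single-nucleotide
    substitution that converts a sense codon into a stop codon."""
    for stop in sorted(STOP_CODONS):
        for pos in range(3):
            for base in BASES:
                if base == stop[pos]:
                    continue  # same nucleotide, so not a substitution
                source = stop[:pos] + base + stop[pos + 1:]
                if source not in STOP_CODONS:  # skip stop-to-stop changes
                    yield pos + 1, source, stop


changes = list(nonsense_changes())
per_position = Counter(pos for pos, _, _ in changes)
print(len(changes), dict(per_position))
```

Stop-to-stop substitutions such as TAA to TGA are excluded because they do not create a new termination codon.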
Table 5

Frequency of each type of codon change for nonsense SNPs.

Codon position   To TAA            To TAG             To TGA             Total
1st              Aaa→Taa 33        Aag→Tag 31         Aga→Tga 20
1st              Caa→Taa 62 (ts)   Cag→Tag 162 (ts)   Cga→Tga 203 (ts)   748*
1st              Gaa→Taa 80        Gag→Tag 125        Gga→Tga 32
2nd              tCa→tAa 27        tCg→tAg 19         tCa→tGa 25
2nd              -                 tGg→tAg 80 (ts)    -                  200
2nd              tTa→tAa 18        tTg→tAg 18         tTa→tGa 13
3rd              taC→taA 25        taC→taG 25         tgC→tgA 22
3rd              -                 -                  tgG→tgA 85 (ts)    235
3rd              taT→taA 19        taT→taG 27         tgT→tgA 32
Total            264               487                432                1183

(ts) marks nucleotide changes by transition.

*P<0.005 by chi-square test.

We examined the positions of the 1,183 nonsense polymorphisms in the coding regions. On average, nonsense SNPs were located 250 codons upstream of the original termination codons. To predict whether a nonsense mutation causes nonsense-mediated decay (NMD) of mRNA, we examined the locations of nonsense SNPs in the exon-intron structure of the genes (Table 6). Of the 1,183 nonsense SNPs, 581 were predicted to cause NMD, and thus to prevent translation. The other 602 nonsense SNPs were predicted to result in truncated proteins. In the cases where truncated proteins are produced, the average truncation was estimated to be 75 amino acids.
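The NMD prediction rule given in the footnote to Table 6 (a stop codon on the 5′ side of a boundary set 50 nucleotides upstream of the 3′ end of the second-to-last exon) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name, the use of 0-based cDNA coordinates, the choice of the first nucleotide of the stop codon as its position, and the strict inequality at the boundary are all assumptions.

```python
def predicts_nmd(exon_lengths, stop_pos, boundary_offset=50):
    """Return True if a premature stop codon is predicted to trigger NMD.

    exon_lengths    -- exon lengths in transcript (5' to 3') order, in nucleotides
    stop_pos        -- 0-based cDNA index of the first base of the premature stop
                       codon (assumed convention)
    boundary_offset -- distance upstream of the last exon-exon junction (50 nt here)
    """
    if len(exon_lengths) < 2:
        # Single-exon genes have no exon-exon junction and are counted as "not NMD".
        return False
    # cDNA index just past the 3' end of the second-to-last exon.
    last_junction = sum(exon_lengths[:-1])
    return stop_pos < last_junction - boundary_offset


# Hypothetical three-exon transcript: the last junction lies at cDNA position 350.
print(predicts_nmd([200, 150, 300], stop_pos=100))  # stop far upstream of the boundary
print(predicts_nmd([200, 150, 300], stop_pos=320))  # stop within 50 nt of the junction
print(predicts_nmd([900], stop_pos=100))            # single-exon gene
```

Stops in the last exon, or within the final 50 nucleotides of the second-to-last exon, fall on the 3′ side of the boundary and are therefore treated as producing truncated proteins.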
Table 6

Nonsense SNPs and prediction of NMD.

SNP set                       Predicted to cause NMD (a)   Not for NMD (b)   Total
Known pathological variants   8 (c)                        0                 8
Other nonsense SNPs           573                          602               1175
Total                         581                          602               1183

This prediction is based on that mRNA would be destroyed if a stop codon occurs in the 5′ side of the boundary, which is 50–55 nucleotides upstream from the 3′ end of the second to last exon. Here, the nonsense SNPs located in the 5′ side of the boundary, which was set at 50 nucleotides upstream from the 3′ end of the second to last exon, were predicted to cause NMD.

This number includes SNPs in genes consisting of only one exon.

P = 0.0033 by Fisher's exact test.

To see which of these nonsense SNPs were known pathological mutations, we compared them with allelic variants in the Online Mendelian Inheritance in Man (OMIM) database. Only eight of the 1,183 nonsense SNPs (rs17602729 in AMPD1, rs283413 in ADH1C, rs10250779 in PGAM2, rs17215500 in KCNQ1, rs497116 in CASP12, rs2228325 in ACTN3, rs3092891 in RB1 and rs28989186 in BUB1B) matched variants in the OMIM database that are known to be associated with phenotypic variations (Table 7). This low number suggests that the biological effects of most nonsense SNPs have not yet been reported. Interestingly, each of the eight cases that matched known pathological variants was predicted to cause NMD (Table 7).
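The P value attached to Table 6 can be approximated from the 2×2 counts with a hypergeometric (Fisher's exact) calculation. The sketch below assumes a one-sided test, which the source does not state explicitly; the function name is illustrative and only the standard library is used.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """Upper-tail Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns the probability of observing `a` or more in the top-left cell,
    given fixed row and column totals.
    """
    row1 = a + b          # known pathological variants
    col1 = a + c          # SNPs predicted to cause NMD
    n = a + b + c + d     # all nonsense SNPs
    upper = min(row1, col1)
    numerator = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1)
    )
    return numerator / comb(n, row1)


# Rows: known pathological / other; columns: NMD / not NMD (Table 6).
p = fisher_one_sided(8, 0, 573, 602)
print(f"one-sided P = {p:.4f}")
```

With all eight known variants in the NMD column, the tail consists of a single term, so the result is simply the chance that eight SNPs drawn at random from the 1,183 all fall among the 581 predicted to cause NMD.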
Table 7

Nonsense SNPs with known pathological effects.

Acc#       Chr   Gene symbol   SNP          Variation   OMIM     Biological effects
M60092     1     AMPD1         rs17602729   Gln12Ter    102770   AMPD deficiency
M12272     4     ADH1C         rs283413     Gly78Ter    103730   Parkinson disease
BC073741   7     PGAM2         rs10250779   Trp78Ter    261670   Myopathy
AF000571   11    KCNQ1         rs17215500   Arg518Ter   607542   Long QT syndrome 1
AY358222   11    CASP12        rs497116     Arg125Ter   608633   Sepsis susceptibility
M86407     11    ACTN3         rs2228325    Arg577Ter   102574   Athletic performance
L41870     13    RB1           rs3092891    Arg445Ter   180200   Bilateral retinoblastoma
AF068760   15    BUB1B         rs28989186   Arg194Ter   602860   Premature chromatid separation trait and mosaic variegated aneuploidy syndrome

SNPs that cause read-through of the original termination codon

Among the 247 SNPs at termination codon sites, 119 SNP-mRNA pairs were found to be read-through mutations. If the allele having the stop codon is the ancestral type, the SNP is regarded as a change causing elongation of the polypeptide. However, an extended polypeptide would be expected only if there is an additional termination codon downstream. For 108 SNP-mRNA pairs, an additional termination codon was found in the 3′UTR region. The average extension was estimated to be 29 amino acids. Interestingly, we found five SNP-mRNA pairs that have no stop codons in the 3′UTR at all (the remaining six SNP-mRNA pairs do not have 3′UTR regions). For example, the T-to-C substitution (rs15941) in the DDR2 gene (X74764) is predicted to be a read-through mutation (from TAG to CGA), and the transcript has no other stop codon in the 3′UTR region. The frequency of this SNP is unknown (it is monomorphic in the four populations of the HapMap project [4]). However, if this polymorphism really exists, transcripts having this read-through mutation would not produce a protein. Another example is the T-to-C substitution (rs17850833) in the MFSD3 gene (CR620962), which causes a change from TGA to CGA, resulting in a change to arginine.
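The search for an additional in-frame termination codon in the 3′UTR can be illustrated with a short function. This is a sketch under stated assumptions: the 3′UTR sequence is taken to begin immediately after the original stop codon (so the reading frame carries over), the extension is counted as the former stop codon plus every codon read before the next in-frame stop, and the function name is hypothetical. The source does not specify how the extension length was counted.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def readthrough_extension(utr3):
    """Return the number of amino acids added by a read-through mutation.

    utr3 -- 3'UTR sequence starting right after the original stop codon.
    Returns None when no in-frame stop codon exists in the 3'UTR, in which
    case the transcript would be a candidate for non-stop decay.
    """
    utr3 = utr3.upper()
    for i in range(0, len(utr3) - 2, 3):
        if utr3[i:i + 3] in STOP_CODONS:
            # One residue for the mutated stop codon plus i // 3 codons from the 3'UTR.
            return 1 + i // 3
    return None


print(readthrough_extension("GCTGCATAAGG"))  # in-frame TAA after two codons
print(readthrough_extension("GCTGCAGG"))     # no in-frame stop codon
```

A return value of None corresponds to the five SNP-mRNA pairs described above that have no stop codon in the 3′UTR, and an empty string corresponds to transcripts without a 3′UTR.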

Functional bias of genes having nonsense SNPs

To see whether there is any functional bias in genes having nonsense SNPs, we examined the frequent biological terms in the genes having nonsense SNPs. We classified the genes having nonsense SNPs into two categories: genes with nonsense SNPs that are predicted to cause NMD and genes with nonsense SNPs that are not predicted to cause NMD. For genes having nonsense SNPs that would cause NMD (Table 8), the molecular functions that are most overrepresented included phosphorylation, ATP binding, iron/calcium ion binding, nucleotide/RNA binding and transporter activity. The localization of these genes was also biased to the cell membrane and the proteinaceous extracellular matrix. On the other hand, the genes having nonsense SNPs predicted to not cause NMD showed less bias in biological function (Table 9).
Table 8

Functional bias of genes having nonsense SNPs causing NMD.

Top level            Gene Ontology no.   Gene Ontology                              Observed gene no. (a)   Expected gene no. (b)   Ratio of enrichment   P value (c)
Biological process   0006118             electron transport                         15                      4.23                    3.55                  5.03×10^-5
Biological process   0006468             protein amino acid phosphorylation         16                      7.28                    2.20                  4.98×10^-3
Cellular component   0016020             membrane                                   41                      22.55                   1.82                  5.57×10^-4
Cellular component   0005578             proteinaceous extracellular matrix         8                       1.21                    6.62                  2.17×10^-6
Molecular function   0005524             ATP binding                                35                      17.15                   2.04                  1.79×10^-4
Molecular function   0004713             protein tyrosine kinase activity           16                      6.46                    2.48                  1.56×10^-3
Molecular function   0004674             protein serine/threonine kinase activity   16                      6.78                    2.36                  2.51×10^-3
Molecular function   0000166             nucleotide binding                         14                      5.61                    2.50                  2.79×10^-3
Molecular function   0004672             protein kinase activity                    16                      7.15                    2.24                  4.21×10^-3
Molecular function   0003723             RNA binding                                10                      3.11                    3.22                  1.82×10^-3
Molecular function   0005506             iron ion binding                           8                       2.00                    4.00                  1.32×10^-3
Molecular function   0005509             calcium ion binding                        16                      7.65                    2.09                  7.89×10^-3
Molecular function   0005215             transporter activity                       10                      3.44                    2.91                  3.76×10^-3
Molecular function   0016491             oxidoreductase activity                    11                      4.24                    2.59                  5.76×10^-3
Molecular function   0003779             actin binding                              6                       1.27                    4.74                  2.24×10^-3
Molecular function   0004759             carboxylesterase activity                  5                       0.24                    20.44                 4.19×10^-6

Number of genes with a molecular function in the 581 genes in which nonsense SNPs causing NMD were found.

Expected number of genes that have a biological function in a sample of 581 genes, assuming a proportion of genes with a molecular function in all human genes.

Enrichment of a biological term in the genes for nonsense SNPs was statistically evaluated as a upper probability in a hypergeometric distribution.

Table 9

Functional bias of genes having nonsense SNPs not causing NMD.

Top level            Gene Ontology no.   Gene Ontology                                          Observed gene no. (a)   Expected gene no. (b)   Ratio of enrichment   P value (c)
Biological process   0007156             homophilic cell adhesion                               6                       1.42                    4.23                  3.05×10^-3
Biological process   0006310             DNA recombination                                      3                       0.19                    15.50                 8.25×10^-4
Biological process   0006414             translational elongation                               3                       0.34                    8.85                  4.48×10^-3
Biological process   0042254             ribosome biogenesis and assembly                       2                       0.15                    13.77                 8.68×10^-3
Cellular component   0005853             eukaryotic translation elongation factor 1 complex     2                       0.13                    15.50                 6.82×10^-3
Molecular function   0004194             pepsin A activity                                      2                       0.18                    11.27                 1.30×10^-2
Molecular function   0003746             translation elongation factor activity                 2                       0.29                    6.89                  3.35×10^-2

Number of genes with a molecular function in the 602 genes in which nonsense SNPs causing NMD were found.

Expected number of genes that have a biological function in a sample of 602 genes, assuming a proportion of genes with a molecular function in all human genes.

Enrichment of a biological term in the genes for nonsense SNPs was statistically evaluated as a upper probability in a hypergeometric distribution.
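The enrichment statistics in Tables 8 and 9 (expected gene number, ratio of enrichment, and upper-tail hypergeometric probability) can be sketched as below. This is an illustration only: the genome-wide totals used in the example call (20,000 genes, 146 of them annotated with the term) are hypothetical placeholders, since the source does not report them, and the function name is not from the original study.

```python
from math import comb


def enrichment(total_genes, genes_with_term, sample_size, observed):
    """Return (expected count, enrichment ratio, upper-tail hypergeometric P).

    total_genes     -- number of genes in the background set
    genes_with_term -- background genes annotated with the GO term
    sample_size     -- genes in the sample (e.g. 581 genes with NMD-causing SNPs)
    observed        -- sample genes annotated with the GO term
    """
    expected = sample_size * genes_with_term / total_genes
    ratio = observed / expected
    upper = min(genes_with_term, sample_size)
    # P(X >= observed) when drawing sample_size genes without replacement.
    numerator = sum(
        comb(genes_with_term, k) * comb(total_genes - genes_with_term, sample_size - k)
        for k in range(observed, upper + 1)
    )
    p_upper = numerator / comb(total_genes, sample_size)
    return expected, ratio, p_upper


# Hypothetical background counts, for illustration only.
expected, ratio, p_upper = enrichment(
    total_genes=20_000, genes_with_term=146, sample_size=581, observed=15
)
print(f"expected={expected:.2f} ratio={ratio:.2f} P={p_upper:.2e}")
```

The ratio of enrichment is simply the observed count divided by the expected count, which is consistent with the columns of Tables 8 and 9 (for example, 15 / 4.23 ≈ 3.55 for electron transport).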

Discussion

In this study, we conducted an extensive analysis of human genome polymorphisms with a comprehensive catalogue of human genes, and detected more than 50,000 polymorphisms that affect proteins. The densities of polymorphisms differed among the 5′UTR, ORF and 3′UTR. The density of SNPs was lower in ORFs than in the 5′UTR and 3′UTR. The density of synonymous SNPs in the ORFs was higher than the densities of SNPs in the UTR regions. The reduced density of SNPs in the UTR regions is consistent with functional constraints on nucleotide changes in UTRs related to transcriptional and translational efficiency [22]. The density of nonsynonymous SNPs was much lower than the densities of synonymous SNPs and of SNPs in the UTRs, possibly because nucleotide changes that alter amino acids are under strong negative selection [36]. It was not known how nonsense SNPs are distributed in protein-coding regions. Here we showed that the density of nonsense SNPs is much lower than that of nonsynonymous SNPs. Although the biological effects of nonsense mutations appear to vary widely depending on their positions and the genes, the low density of nonsense SNPs that we found suggests that nonsense mutations have more disadvantageous effects than nonsynonymous mutations. While nonsense mutations that cause NMD result in ‘loss of function’, nonsense mutations that do not cause NMD produce truncated proteins, which could have dominant effects. The proportion of predicted nonsense SNPs causing NMD in this study is in agreement with a previous study which showed that dbSNP (build 125) has 1301 nonsense SNPs, about half of which were predicted to result in NMD [37]. In order to understand the biological effects of nonsense SNPs, it is important to know whether or not they cause NMD, because premature stop codons in a gene can lead to distinct disease phenotypes depending on the positions of the mutations [27], [38].
The molecular functions that were overrepresented in the genes having nonsense SNPs included several that were observed in human-specific pseudogenes [39], such as ATP binding, actin binding, calcium ion binding, extracellular matrix, nucleic acid binding and oxidoreductase. This is in accord with the idea that nonsense mutations contribute to ‘pseudogenization’. It is interesting that nonsense SNPs causing NMD were frequently found in genes that encode proteins involved in phosphorylation, cell-cell interaction, signal transduction and transport. This may be because changes in the length of polypeptides caused by nonsense mutations are under strong negative selection in the genes involved in signal transduction or transport, since abnormal translation products could cause dominant effects. Therefore, inactivation of translation by nonsense mutations in those genes could have milder effects than changes in the length of polypeptides. Our results showed a low proportion of matches of nonsense SNPs with known pathological variants in OMIM, suggesting that the effects of most nonsense polymorphisms are unknown or not reported. Furthermore, the correspondence of the nonsense SNPs to the OMIM allelic variants (Table 6, Table 7) suggests that nonsense polymorphisms that are subject to NMD are more likely to be involved in phenotypic variations. There is a possibility that the nonsense SNPs detected here have pathological effects, in particular if non-dispensable genes have nonsense mutations. First, a defect in one gene caused by a nonsense mutation or a frame-shifting indel that introduces a premature termination codon could be a cause of genetic diseases, including complex diseases [40]. Second, there is a possibility that nonsense mutations create recessive lethal alleles that would not be detected as causative variants of diseases.
Focusing on nonsense polymorphisms observed in specific populations would probably be a good way to find variants with deleterious effects. The effect of a single nonsense SNP can be compensated by the products of other genes having similar functions [41] and by other splicing isoforms of the gene [42]. Thus, single nonsense SNPs may not always cause severe phenotypic effects. In fact, some nonsense SNPs with high allele frequencies were found across populations [43]. There is a report of fixation of an inactive form of caspase 12 by a nonsense mutation (rs497116) in non-African populations [43], and this is an example supporting the ‘less is more hypothesis’ [44]. This example suggests that some nonsense mutations are not disadvantageous and that the increase in frequency of a nonsense allele could be driven by positive selection. Elongation of polypeptides by read-through mutations can affect protein folding and aggregation of proteins, which could affect phenotypic variations. Furthermore, a read-through mutation can cause more severe effects on translation when no additional stop codon follows. Such mutations are subject to ‘non-stop decay’ [32], [33], and would result in no gene product. It has been suggested that non-stop decay and NMD serve to remove toxic, aberrant proteins [29]. It is unclear how frequently such mutations prevent mRNA from producing proteins. Therefore, it would be quite useful to be able to predict the effects of various types of genetic changes on mRNA. Although the present results are based on representative transcripts (one transcript per gene), the total number of SNPs causing changes between amino acids and stop codons in all the splicing isoforms was much larger (2,234). These variations, which cause changes in the length of a polypeptide or which determine whether a protein is translated, may include pathological variants that have not yet been reported. Therefore, it is important to examine their presence in human populations.

Materials and Methods

Data of human genetic polymorphisms

We used single nucleotide polymorphisms (SNPs) and insertions and deletions (indels) in dbSNP [1] as the data on genetic polymorphisms in the human genome. The complete set of human SNPs and indels was downloaded from dbSNP (build 125). We used all SNPs and indels that were mapped to a single position in the genome, except for ‘large insertions’ in dbSNP.

Data of human genes

The data on human gene structure were obtained from H-InvDB ver3.8 (http://www.h-invitational.jp/), created by the annotation project of human genes (H-Invitational project) [23], [45]. Our analysis of all human genes, which corresponds to H-InvDB (ver 3.8), predicted 36,712 protein-coding loci. All protein-coding genes were annotated and classified based on similarity to known genes as follows: Category I, Identical to known human protein; Category II, Similar to known protein; Category III, IPR domain containing protein; Category IV, Conserved hypothetical protein; Category V, Hypothetical protein; Category VI, Hypothetical short protein; Category VII, pseudogene candidate. We used the following three kinds of gene structure data: 1) genomic locations of exons on the human genome (build 35), 2) predicted ORF regions in transcripts, and 3) original and curated cDNA sequences.

Analysis

1. Analysis of polymorphism with exons and predicted ORFs

Selection of polymorphisms on exon regions. We selected polymorphisms in exon regions by comparing the genomic positions of polymorphisms with the start and end positions of exons that were obtained from mapping cDNA sequences to the human genome (Figure 1). Polymorphisms in introns were selected in the same way.

Conversion of genomic position of polymorphism into nucleotide position in cDNA sequence. To analyze polymorphisms with a predicted ORF, nucleotide positions of polymorphisms in the human genome sequence were converted into nucleotide positions in cDNA sequences. Because there could be gaps in the alignment of a cDNA sequence and the human genome sequence, the nucleotide position was converted taking possible gaps in the alignment into account. When the cDNA sequence was corrected in ORF prediction because of a frameshift or a remaining intron, the nucleotide position of the SNP was adjusted for the added or deleted nucleotides. As a quality control of the polymorphism data used for classification, we confirmed that one of the nucleotides in each pair of SNP alleles was the same as the nucleotide at the corresponding position in the cDNA sequence.

Classification of polymorphisms with predicted ORF. Polymorphisms within an ORF were classified according to their effect on the ORF. For SNPs with two alleles, the nucleotide alleles were converted into ‘alleles in codon’ by adding the two other nucleotides of the codon from the cDNA sequence. When a cDNA sequence was corrected in the annotation process by removing a remaining intron or by correcting a frameshift error, the corrected cDNA sequence was used. If the alleles in codon did not contain any stop codon, they were classified as synonymous or nonsynonymous. If a stop codon was included in the alleles in codon, they were classified as 1) premature termination (nonsense) codon, 2) read-through of the original stop codon, or 3) synonymous at the stop codon site, assuming that the cDNA sequence carries the ancestral allele.
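The classification rules above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `classify_snp`, its arguments, and the returned labels are hypothetical, and the standard genetic code is assumed. The cDNA codon is treated as the ancestral state, and the quality-control check that one allele matches the cDNA nucleotide is included.

```python
from itertools import product

# Standard genetic code in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snp(cdna_codon, pos, alleles):
    """Classify a biallelic SNP at 0-based position `pos` of `cdna_codon`.

    `alleles` is a pair of nucleotides; the cDNA nucleotide is assumed to be
    the ancestral allele (hypothetical interface, for illustration only).
    """
    ref = cdna_codon[pos]
    # Quality control: one of the two alleles must match the cDNA sequence.
    if len(set(alleles)) != 2 or ref not in alleles:
        raise ValueError("SNP alleles do not match the cDNA nucleotide")
    alt = next(a for a in alleles if a != ref)
    alt_codon = cdna_codon[:pos] + alt + cdna_codon[pos + 1:]
    ref_aa, alt_aa = CODON_TABLE[cdna_codon], CODON_TABLE[alt_codon]

    if ref_aa != "*" and alt_aa != "*":
        return "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    if ref_aa != "*":
        return "nonsense"  # amino acid codon changed into a stop codon
    if alt_aa != "*":
        return "read-through"  # original stop codon changed into an amino acid
    return "synonymous at stop codon"
```

For instance, `classify_snp("TTA", 1, ("T", "A"))` compares TTA (leucine) with TAA (stop) and is reported as a nonsense SNP under these assumptions.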
Indels were classified based on whether they are located in an ORF. The indels within an ORF were further classified by whether the insertion or deletion causes a frameshift in translation.

Inference of direction of nonsense and read-through mutations. Ancestral alleles were obtained from dbSNP (build 128) to check the direction of mutation for SNPs causing changes between amino acids and stop codons. For nonsense SNPs in protein-coding regions, we checked whether the ancestral allele codes for an amino acid. If the ancestral allele codes for a stop codon, we regarded the SNP not as a nonsense SNP but as a read-through mutation, assuming that there was a variant having a shorter ORF. For read-through SNPs at the termination codon site, we checked whether the ancestral allele codes for a stop codon. If the ancestral allele codes for an amino acid, we regarded the SNP not as a read-through mutation but as a nonsense mutation in a variant having a longer ORF.

Number of sites for synonymous, nonsynonymous and nonsense mutations. To estimate the densities of synonymous, nonsynonymous and nonsense SNPs, the numbers of potential synonymous, nonsynonymous and nonsense sites for single nucleotide changes were estimated for the ORF sequences. This is an extension of the estimation of the numbers of synonymous and nonsynonymous sites[46], in which the number of synonymous sites is calculated as the number of four-fold degenerate sites plus one-third of the number of two-fold degenerate sites. For the 61 codons encoding amino acids, the numbers of nucleotide sites that would cause synonymous, nonsynonymous and nonsense mutations by a single nucleotide change were estimated with a model of nucleotide change. Here, the relative occurrence of a transitional mutation versus a transversional mutation (r) was set to 4.0 (the expected ratio of the numbers of transitional and transversional mutations was 2.0). For example, for the TTA codon for leucine, the number of nonsense sites was estimated to be 2.0/(r+2.0), because two types of transversional mutations at the second position (T to A and T to G, giving TAA and TGA) cause nonsense mutations.
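The site counting can be sketched in Python as follows. This is an illustrative reading of the model, not the authors' code: it assumes that at each codon position the single transition has weight r and each of the two transversions has weight 1, so the three possible changes have probabilities r/(r+2), 1/(r+2) and 1/(r+2). The function name `site_counts` is hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Each nucleotide has exactly one transitional partner.
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def site_counts(codon, r=4.0):
    """Return the numbers of synonymous, nonsynonymous and nonsense sites
    of a sense codon under the assumed transition/transversion weighting."""
    ref_aa = CODON_TABLE[codon]
    if ref_aa == "*":
        raise ValueError("site counts are defined for the 61 sense codons only")
    counts = {"synonymous": 0.0, "nonsynonymous": 0.0, "nonsense": 0.0}
    for pos, ref in enumerate(codon):
        for alt in BASES:
            if alt == ref:
                continue
            # Probability of this particular change at this position.
            weight = (r if TRANSITION[ref] == alt else 1.0) / (r + 2.0)
            alt_aa = CODON_TABLE[codon[:pos] + alt + codon[pos + 1:]]
            if alt_aa == "*":
                counts["nonsense"] += weight
            elif alt_aa == ref_aa:
                counts["synonymous"] += weight
            else:
                counts["nonsynonymous"] += weight
    return counts


tta = site_counts("TTA")  # tta["nonsense"] equals 2/(r+2) = 1/3 for r = 4.0
```

Under this weighting the three categories always sum to 3 sites per codon, and the TTA example from the text is reproduced: only the two transversions at the second position create a stop codon.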

2. Correspondence to known pathological variants

To check whether the polymorphisms that alter proteins are known pathological variants with phenotypic effects, we examined the correspondence of SNPs with data on known pathological variants. We used the ‘allelic variant’ data in the Online Mendelian Inheritance in Man (OMIM) database [18] as information on variants with phenotypic effects. For nonsynonymous and nonsense SNPs, their effects on translation and their positions in the ORF were compared with the ‘list of alleles’ in OMIM (e.g. described as “TRP324TER” or “ALA279THR” for the NGAS gene).

3. Prediction of nonsense SNPs causing NMD

Some nonsense mutations cause nonsense-mediated decay (NMD), which prevents translation. It has been reported that an mRNA is destroyed if a stop codon occurs on the 5′ side of a boundary located 50–55 nucleotides upstream from the end of the second to last exon [30], [31]. To predict whether a nonsense SNP causes NMD, we examined on which side of this boundary the nonsense SNP is located in the exon-intron structure, with the boundary set at 50 nucleotides upstream from the end of the second to last exon. A nonsense SNP on the 5′ side of the boundary was predicted to cause NMD, whereas one on the 3′ side was predicted to produce a truncated polypeptide. This method is the same as the method in SNP2NMD [37] when the ‘NMD distance’ is 50 nucleotides.
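The boundary rule can be written as a short Python sketch. It is an illustration under stated assumptions rather than the SNP2NMD code: the function name `predicts_nmd` is hypothetical, positions are 0-based cDNA coordinates of the first nucleotide of the premature stop codon, exon lengths are given in transcript order, and single-exon transcripts are assumed not to undergo NMD because they have no exon-exon junction.

```python
def predicts_nmd(exon_lengths, ptc_pos, nmd_distance=50):
    """Return True if a premature stop codon at cDNA position `ptc_pos`
    lies on the 5' side of the NMD boundary.

    The boundary is placed `nmd_distance` nucleotides upstream from the end
    of the second to last exon (assumed coordinate convention: 0-based,
    first nucleotide of the stop codon).
    """
    if len(exon_lengths) < 2:
        # No exon-exon junction, so the rule does not apply (assumption).
        return False
    end_of_second_to_last_exon = sum(exon_lengths[:-1])
    boundary = end_of_second_to_last_exon - nmd_distance
    return ptc_pos < boundary


# Three exons of 200, 150 and 300 nt: the last junction is at 350,
# so the boundary is at cDNA position 300.
early = predicts_nmd([200, 150, 300], ptc_pos=100)  # 5' of the boundary
late = predicts_nmd([200, 150, 300], ptc_pos=320)   # 3' of the boundary
```

Stop codons for which the function returns False correspond to the class predicted to yield truncated polypeptides rather than transcript decay.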

4. Functional bias of genes with nonsense SNPs

For each biological term from Gene Ontology (www.geneontology.org), the proportion of genes annotated with the term among the genes having nonsense SNPs was compared with that among all human genes (representative transcripts of all human genes in H-InvDB ver 5.0), and the significance of overrepresentation of a molecular function in the genes having nonsense SNPs was evaluated as the upper-tail probability of the hypergeometric distribution.

Supplementary results and a table for analyses of nonsynonymous SNPs. (0.70 MB DOC)

Nonsense SNPs and read-through SNPs on representative transcripts. (4.24 MB DOC)
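The overrepresentation test used in the functional-bias analysis (the upper probability of the hypergeometric distribution) can be sketched in Python with the standard library only. The function and parameter names are hypothetical: `total_genes` is the number of all genes, `genes_with_term` the number annotated with a given Gene Ontology term, `sample_size` the number of genes having nonsense SNPs, and `observed` the number of those that carry the term. No multiple-testing correction is shown, since the text does not describe one.

```python
from math import comb


def hypergeom_upper_tail(total_genes, genes_with_term, sample_size, observed):
    """P(X >= observed) when `sample_size` genes are drawn without
    replacement from `total_genes`, of which `genes_with_term` carry the term."""
    max_hits = min(genes_with_term, sample_size)
    favourable = sum(
        comb(genes_with_term, i)
        * comb(total_genes - genes_with_term, sample_size - i)
        for i in range(observed, max_hits + 1)
    )
    return favourable / comb(total_genes, sample_size)


# Toy numbers: 10 genes, 4 with the term, 3 sampled, at least 2 hits.
p = hypergeom_upper_tail(10, 4, 3, 2)  # (6*6 + 4*1) / 120 = 1/3
```

A small value of this probability for a term indicates that the term is seen more often among genes with nonsense SNPs than expected from its frequency among all genes.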