
Novel multi-nucleotide polymorphisms in the human genome characterized by whole genome and exome sequencing.

Jeffrey A Rosenfeld, Anil K Malhotra, Todd Lencz.

Abstract

Genomic sequence comparisons between individuals are usually restricted to the analysis of single nucleotide polymorphisms (SNPs). While the interrogation of SNPs is efficient, they are not the only form of divergence between genomes. In this report, we expand the scope of polymorphism detection by investigating the occurrence of double nucleotide polymorphisms (DNPs) and triple nucleotide polymorphisms (TNPs), in which two or three consecutive nucleotides are altered compared to the reference sequence. We have found such DNPs and TNPs throughout two complete genomes and eight exomes. Within exons, these novel polymorphisms are over-represented amongst protein-altering variants; nearly all DNPs and TNPs result in a change in amino acid sequence and, in some cases, two adjacent amino acids are changed. DNPs and TNPs represent a potentially important new source of genetic variation which may underlie human disease and they should be included in future medical genetics studies. As a confirmation of the damaging nature of xNPs, we have identified changes in the exome of a glioblastoma cell line that are important in glioblastoma pathogenesis. We have found a TNP causing a single amino acid change in LAMC2 and a TNP causing a truncation of HUWE1.

Year: 2010   PMID: 20488869   PMCID: PMC2952858   DOI: 10.1093/nar/gkq408

Source DB: PubMed   Journal: Nucleic Acids Res   ISSN: 0305-1048   Impact factor: 16.971


INTRODUCTION

While all human genomes are extremely similar to one another, there is variability that allows for the uniqueness of each individual. This variability can take the form of copy number variation, chromosomal rearrangements, or nucleotide polymorphisms. The overwhelming majority of recent studies of human variability have utilized microarray technology because of its relatively low cost and ready availability. For example, Genome Wide Association Studies (GWAS) have been performed for numerous diseases (1), with varying levels of success. In a GWAS, numerous individuals with a specific disease or trait and ethnically matched controls are profiled using a microarray for single nucleotide polymorphisms (SNPs). SNP alleles that are more prevalent in the affected individuals relative to the controls are considered to be associated with illness. For a few diseases, such as age-related macular degeneration (2), common SNPs with large effects on risk have been identified. In other cases, even when large sample sizes were utilized, the risk alleles identified by GWAS were only able to explain a small percentage of disease heritability (3).

One potential reason for the disappointing results of GWAS is their limitation to SNPs. The microarray platforms are designed to robustly identify single nucleotide variations (4), but they are not effective in detecting variations involving more than one consecutive nucleotide. If two sequences are identical except for two adjacent nucleotides being altered (e.g. one sequence has AC and the other sequence has GT), this cannot be effectively measured using a microarray. Additionally, considerations regarding DNA melting temperature and the exclusion of repetitive sequences restrict the probes that can be used on a microarray (5).

Recently, high-throughput DNA sequencing techniques (6,7) have been developed and have begun to replace microarrays for genome analysis studies (8).
These sequencing techniques are free of the single nucleotide mismatch and melting temperature restrictions of microarrays. In addition, sequencing can produce a more comprehensive picture of a genome than the particular features included on a given microarray. By analyzing raw sequencing reads, multiple nucleotide polymorphisms can be studied just as easily as SNPs. We have used the raw sequencing data from two complete genomes, the Venter/HuRef Genome (9) and the Chinese Genome (10), and eight complete exomes (11), to analyze nucleotide polymorphism beyond the single nucleotide level. We have aligned the sequencing reads to the human reference genome and identified thousands of loci with polymorphisms of 2 or 3 nt. These polymorphisms are denoted as double nucleotide polymorphisms (DNPs) and triple nucleotide polymorphisms (TNPs) (see Figure 1 for examples). For simplicity, SNPs, DNPs and TNPs are referred to collectively as xNPs. These xNPs do not include indels, where nucleotides are inserted or deleted in one sequence relative to another. We focus on xNPs where the sequence length remains the same, but one, two or three nucleotides are changed. Indels in human genomes and exomes have been extensively characterized in (11) and (12).
Figure 1.

An example of a SNP, a DNP and a TNP between two DNA sequences.

While SNPs are certainly an important source of variation between human genomes, there are a few reasons why DNPs and TNPs have a greater propensity to be involved in disease-causing mutations. First, SNPs have a strong propensity to be synonymous (13), whereby they change the nucleotide sequence but do not alter the amino acid sequence, owing to the wobble allowed by the genetic code. These synonymous changes are usually silent and do not affect the phenotype, but there are notable exceptions (14,15). In contrast, a DNP or a TNP would affect multiple positions in a codon. Second, a SNP can at most result in the change of one amino acid, whereas a DNP or a TNP can change the residues at two adjacent positions and cause a more dramatic change.

Before looking for the xNPs in genomic sequence, we first computationally determined their predicted effects on amino acid sequence under an assumption of randomness (Table 1). For example, given all possible single-nucleotide changes to each codon, 24% of SNPs would be expected to result in a synonymous mutation due to the degeneracy of the genetic code. On the other hand, a DNP or a TNP would randomly produce a synonymous mutation only 0.9 or 0.4% of the time, respectively. The rare possibility of a synonymous TNP can only occur when it overlaps two codons and changes both of them in a way that they still code for the same amino acids. As also displayed in Table 1, when a DNP causes an amino acid change, it is much more likely to be a single change rather than a double. Similarly, a TNP has a greater chance of causing one amino acid change than two changes, but the ratio is smaller. These theoretical results support the premise that DNPs and TNPs can be important sources of genomic variation, and our analysis of real data will be compared against these predicted results.
Table 1.

Theoretical calculations of the percentages of each type of change that will be caused by SNPs, DNPs and TNPs

Type of xNP | Number of nucleotides changed | xNPs resulting in the same amino acids, % | Stop codon read through (from stop to coding), % | Changes resulting in stop codons (premature stop), % | xNPs changing 1 amino acid, % | xNPs changing 2 amino acids, %
SNPs | 1 | 24 | 4 | 4 | 68 | N/A
DNPs | 2 | 0.90 | 7 | 6 | 78 | 8
TNPs | 3 | 0.40 | 8 | 7 | 51 | 33
We have found that in the human genome there is a considerable amount of variation with regard to DNPs and TNPs. For the two complete genomes that we analyzed, we found tens of thousands of DNPs and thousands of TNPs throughout the genome. As with all genomic variation, the majority of this variation was found outside of coding regions. Even so, a substantial number of xNPs are found within coding exons and they have a strong potential to be involved in disease pathology. In order to test this hypothesis, we have applied our technique to the analysis of an exome from a glioblastoma cell line. In this exome, we have found xNPs causing amino acid changes and a truncated protein in genes whose mis-expression has been previously found in glioblastoma.

MATERIALS AND METHODS

Theoretical calculations

For SNPs, each codon was iterated through, and each position in the codon was changed to each of the three possible alternative nucleotides. The percentages of changes that caused an amino acid change or no change were tallied. For DNPs and TNPs, two adjacent codons were used and every possible set of two (for DNPs) or three (for TNPs) adjacent changes was applied. In order to allow for the querying of each position in each codon, the last positions of the second codon were wrapped onto the first codon. To illustrate, if the six positions in the two codons from 5′ to 3′ are listed as integers from 1 to 6, the list of TNPs is: 123, 234, 345, 456, 561, 612.
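
The SNP part of this enumeration is small enough to sketch directly. The Python below is an illustration only, not the authors' code: it assumes the standard genetic code, counts a stop-to-stop change as "same", and uses hypothetical function names. Under those assumptions it reproduces the SNP row of Table 1 (24, 4, 4 and 68%), and the helper wrapped_windows generates the wrap-around position lists described above.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify(before, after):
    """Classify one codon change by its effect on the encoded residue."""
    if before == after:
        return "same"  # assumption: stop-to-stop is counted here as well
    if before == "*":
        return "read_through"
    if after == "*":
        return "premature_stop"
    return "one_amino_acid"


def snp_tally():
    """Apply every single-nucleotide change to every codon and count the outcomes."""
    counts = {"same": 0, "read_through": 0, "premature_stop": 0, "one_amino_acid": 0}
    for codon in CODON_TABLE:
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                counts[classify(CODON_TABLE[codon], CODON_TABLE[mutant])] += 1
    return counts


def wrapped_windows(width, length=6):
    """1-based positions of each window of `width`, wrapping past position `length`."""
    return ["".join(str((start + k) % length + 1) for k in range(width))
            for start in range(length)]


counts = snp_tally()
total = sum(counts.values())  # 64 codons x 3 positions x 3 alternatives = 576
for outcome, n in counts.items():
    print(f"{outcome}: {n} ({100 * n / total:.0f}%)")
print(wrapped_windows(3))
```

Extending the tally to DNPs and TNPs would mean applying the same classification to both codons of each wrapped window; that step is left out here because the text does not fully specify how the two-codon context was enumerated.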

Sequencing data and alignment

The Chinese genome data were obtained from (10) as raw FASTQ sequencing reads, and only those paired-end reads that were 35 bp in length were used in the analysis. The Venter/HuRef raw sequencing reads were obtained from (9). Since these sequencing reads were from an Applied Biosystems 3730xl, they were much longer than 36-bp Illumina reads. To allow for comparison, the long reads were cut into non-overlapping 36-bp reads. The eight exome sequences were from (11). The glioblastoma exome sequence was from (16). The sequences were aligned using the Bowtie (17) alignment program, and up to three mismatches were allowed. The reference genome used was hg18. The consensus repeat elements were taken from the UCSC genome browser annotations (18) and the genes used were the CCDS gene set (19).
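
As a rough illustration of the read-splitting step, the sketch below cuts a long read into consecutive non-overlapping 36-base pieces. The function name is hypothetical, and discarding a trailing piece shorter than 36 bases is an assumption; the text does not say how leftover bases were handled.

```python
def split_read(sequence, size=36):
    """Cut a long read into consecutive non-overlapping pieces of `size` bases.

    Assumption: a trailing piece shorter than `size` is discarded.
    """
    usable = len(sequence) - len(sequence) % size
    return [sequence[i:i + size] for i in range(0, usable, size)]


long_read = "ACGT" * 25  # hypothetical 100-base capillary read
pieces = split_read(long_read)
print(len(pieces), [len(p) for p in pieces])
```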

xNP determination

After the reads were aligned to the genome, any single base mismatch was counted as a putative SNP, and two or three consecutive mismatches within a read were marked as putative DNPs or TNPs, respectively. For each putative xNP, the numbers of sequencing reads supporting the xNP and the reference sequence were tallied. The following criteria were used for calling an xNP. If there were no reads matching the reference at that position, there needed to be a minimum of three reads supporting the xNP, and the xNP was called as homozygous. If there were reads matching the reference, two requirements needed to be met to call a heterozygous xNP. First, there needed to be at least three xNP-supporting reads at that location. Second, a binomial distribution was computed at each genomic position, based on the total number of reads at that location and a 50% allele probability; a heterozygous xNP was called if the number of xNP reads was at least half of the total number of reads at that location minus twice the standard deviation of the binomial distribution. This threshold allowed for a <5% false-negative rate in calling heterozygotes.
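
The calling rules above can be sketched as follows. This is an illustrative reading of the criteria rather than the authors' implementation: the function names are hypothetical, the alignment is assumed to be ungapped, and the total read count at a site is assumed to be the sum of xNP-supporting and reference-matching reads.

```python
import math


def mismatch_runs(read, reference):
    """Group consecutive mismatches of an ungapped alignment into (start, length) runs.

    A run of length 1, 2 or 3 is a putative SNP, DNP or TNP, respectively.
    """
    runs, start = [], None
    for i, (r, g) in enumerate(zip(read, reference)):
        if r != g and start is None:
            start = i
        elif r == g and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, min(len(read), len(reference)) - start))
    return runs


def call_xnp(xnp_reads, ref_reads, min_support=3, allele_prob=0.5):
    """Return 'homozygous', 'heterozygous' or None for one putative xNP."""
    if xnp_reads < min_support:
        return None
    if ref_reads == 0:
        return "homozygous"
    total = xnp_reads + ref_reads  # assumption: only these two read classes are counted
    sd = math.sqrt(total * allele_prob * (1 - allele_prob))
    threshold = total * allele_prob - 2 * sd
    return "heterozygous" if xnp_reads >= threshold else None


print(mismatch_runs("ACGTACGT", "ACCAACGA"))
print(call_xnp(5, 0), call_xnp(5, 5), call_xnp(3, 97))
```

With 100 reads at a site, the standard deviation is 5 reads, so at least 40 xNP-supporting reads are needed for a heterozygous call under this reading.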

Analysis

The functional categorization of genes with xNPs was performed using DAVID (20). The lethality analysis of the xNPs was performed using Polyphen version 1.1.7 (21) and SIFT version 4.0.3 (22). Additionally, we analyzed the xNPs using PANTHER version 6.1 (23). For a substantial number of the polymorphisms, PANTHER was not able to give a prediction of the probability of it being deleterious. This is because the amino acid substitution occurred in a part of the protein that was not covered by the multi-sequence alignments underlying the predictions. This is a known shortcoming of PANTHER (23). Overall, the percentages of polymorphisms predicted to be deleterious by PANTHER were much lower than the percentages from both Polyphen and SIFT. A strong cause of this was polymorphisms not being scored and therefore not having a possibility of being predicted to be damaging. We therefore have not reported the PANTHER predictions.

RESULTS

Determination of xNPs in the genomes

In order to analyze xNPs in complete human genomes, we selected the Venter (9) and the Chinese (10) genomes as examples for our analysis. The sequencing reads for each of the genomes were aligned to the genome reference hg18 (‘Materials and methods’ section). For each alignment, up to three mismatches were allowed in order to capture SNPs, DNPs or TNPs. Any two adjacent mismatches were marked as a DNP, while three adjacent mismatches indicated a TNP. The number of xNPs found throughout the genome and their locations are shown in Table 2. For each genome and type of xNP, the total number of xNPs and the number that are homozygous are listed. Since the SNPs for these two genomes have been previously determined, we compared our results to the published counts. For the Venter genome, 3.2 million SNPs were reported (9), as compared to our finding of 2.89 million SNPs. The Chinese genome was reported to have 3.07 million SNPs (10), while our method yielded 3.68 million. It should be noted that these differences in SNP counts are in the expected directions, given the different alignment and SNP calling techniques and thresholds that were utilized (‘Discussion’ section).
Table 2.

Genome-wide distribution of SNPs, DNPs and TNPs for the Chinese and Venter Genomes

xNP location | Chinese SNPs, total | Chinese SNPs, homozygous | Venter SNPs, total | Venter SNPs, homozygous | Chinese DNPs, total | Chinese DNPs, homozygous | Venter DNPs, total | Venter DNPs, homozygous | Chinese TNPs, total | Chinese TNPs, homozygous | Venter TNPs, total | Venter TNPs, homozygous
Downstream of genes (5 kb) | 103 004 | 39 296 | 75 988 | 40 143 | 1277 | 436 | 935 | 360 | 92 | 41 | 50 | 24
Introns | 904 259 | 367 472 | 695 131 | 378 275 | 9898 | 3646 | 7260 | 2978 | 823 | 334 | 469 | 179
Upstream of genes (5 kb) | 102 359 | 39 096 | 75 803 | 39 814 | 1187 | 423 | 925 | 380 | 100 | 39 | 59 | 29
Exons | 25 381 | 7935 | 15 079 | 8042 | 164 | 48 | 127 | 45 | 3 | 1 | 6 | 1
Intergenic | 2 547 066 | 1 015 539 | 2 032 962 | 1 030 241 | 32 947 | 10 460 | 27 454 | 8974 | 2154 | 767 | 1362 | 406
Total xNPs | 3 682 069 | 1 469 338 | 2 894 963 | 1 496 515 | 45 473 | 15 013 | 36 701 | 12 737 | 3172 | 1182 | 1946 | 639
Consensus repeats | 1 740 523 | 627 051 | 1 413 418 | 650 635 | 23 384 | 6469 | 21 661 | 5848 | 1304 | 461 | 1055 | 265
Overall, the numbers of DNPs and TNPs are greatly reduced relative to the numbers of SNPs. This is expected because the production of a DNP or a TNP requires the mutation of two or three adjacent nucleotides whereas a SNP only requires one change. For all of the xNPs, the greatest percentage occurs in intergenic regions, followed by introns, both of which are non-coding and are expected to have relatively lower levels of consistency across individuals. In contrast, far less than 1% of xNPs occur in coding exons, which are under selective pressure to prevent amino acid mutations. TNPs are almost completely absent from coding exons and there are only three coding TNPs from the Chinese genome and six from the Venter genome. Approximately half of all xNPs were observed in portions of the genome that are defined as repeats by RepeatMasker (24); this is expected since such regions cover 45% of the human genome (25).

xNPs within coding exons

Since coding exons are important regions of the genome for protein production, we focused on the analysis of xNPs in these regions. The xNPs were classified according to whether they caused no amino acid change (neutral/synonymous), caused one amino acid change, caused two amino acid changes, changed from a stop codon to a coding codon (read-through), or changed from a coding amino acid to a stop codon (premature stop). These results are shown in Table 3, and a complete list of each gene that had any xNPs along with the change produced by each type of xNP is shown in Supplementary Table S1.
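
The five categories above can be assigned mechanically once an xNP is placed on a coding sequence. The sketch below is one possible way to do it, not the authors' pipeline: it assumes an in-frame coding sequence, the standard genetic code and a 0-based offset, and the function names and the toy sequence are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq):
    """Translate an in-frame sequence, ignoring any incomplete trailing codon."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def classify_xnp(cds, offset, alt):
    """Classify an xNP that replaces len(alt) bases of `cds` starting at `offset`."""
    mutated = cds[:offset] + alt + cds[offset + len(alt):]
    first, last = offset // 3, (offset + len(alt) - 1) // 3
    before = translate(cds)[first:last + 1]
    after = translate(mutated)[first:last + 1]
    changed = [(b, a) for b, a in zip(before, after) if b != a]
    if any(a == "*" for _, a in changed):
        return "premature stop"
    if any(b == "*" for b, _ in changed):
        return "read through"
    return ["no change", "1 amino acid change", "2 amino acid changes"][len(changed)]


cds = "ATGGCTAAATGA"  # hypothetical coding sequence: Met-Ala-Lys-stop
print(classify_xnp(cds, 5, "C"))    # SNP, GCT -> GCC
print(classify_xnp(cds, 4, "AG"))   # DNP within one codon, GCT -> GAG
print(classify_xnp(cds, 4, "AGG"))  # TNP spanning two codons
print(classify_xnp(cds, 6, "T"))    # SNP, AAA -> TAA
print(classify_xnp(cds, 9, "C"))    # SNP, TGA -> CGA
```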
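Table 3.

Effects of SNPs, DNPs and TNPs within the genes of the Chinese and Venter Genomes

Type of change | Chinese SNPs | Venter SNPs | Chinese DNPs | Venter DNPs | Chinese TNPs | Venter TNPs
1 amino acid change: number | 14 784 | 7092 | 152 | 114 | 0 | 6
1 amino acid change: percentage of changes | 58 | 47 | 93 | 90 | 0 | 100
1 amino acid change: number of genes affected | 7841 | 4407 | 141 | 98 | 0 | 6
2 amino acid changes: number | N/A | N/A | 8 | 10 | 3 | 0
2 amino acid changes: percentage of changes | N/A | N/A | 5 | 8 | 100 | 0
2 amino acid changes: number of genes affected | N/A | N/A | 8 | 10 | 3 | 0
Read through (from stop to coding): number | 24 | 6 | 0 | 0 | 0 | 0
Read through (from stop to coding): percentage of changes | 0.09 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00
Read through (from stop to coding): number of genes affected | 23 | 6 | 0 | 0 | 0 | 0
No change: number | 10 346 | 7925 | 0 | 2 | 0 | 0
No change: percentage of changes | 41 | 53 | 0 | 2 | 0 | 0
No change: number of genes affected | 6211 | 5002 | 0 | 2 | 0 | 0
Premature stop codon: number | 227 | 56 | 4 | 1 | 0 | 0
Premature stop codon: percentage of changes | 0.89 | 0.37 | 2 | 0.79 | 0.00 | 0.00
Premature stop codon: number of genes affected | 217 | 50 | 4 | 1 | 0 | 0
Total | 25 381 | 15 079 | 164 | 127 | 3 | 6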
We first compared the results with the theoretical calculations from Table 1. For the SNPs, a lower percentage resulted in amino acid changes than would be predicted at random. Theoretically, 68% of SNPs should change one amino acid, whereas this was found for 58 and 47% of the SNPs in the Chinese genome and the Venter genome, respectively. This decrease was caused by a greater than predicted number of synonymous SNPs. We predicted that there would be 24% synonymous SNPs and we found that 41 and 53% of the Chinese and Venter genome SNPs, respectively, were synonymous. For both genomes, the synonymous to non-synonymous SNP ratio is around 50–50, as has been previously found for the genomes of multiple species (26–28). In contrast to the excess of synonymous SNPs relative to the calculations, we found a strong bias towards non-synonymous DNPs. For both genomes almost all of the exonic DNPs resulted in an amino acid change. The theoretical calculations predicted 86% of the DNPs causing amino acid changes and we found 98% of the DNPs for each genome causing a change. There were only a small number of exonic TNPs, but these were completely non-synonymous.

For all types of xNPs, the occurrence of both premature stop codons and stop codon read-through was less than predicted. For example, while 7% of DNPs were predicted to change a stop codon to a coding codon and result in stop codon read-through, neither genome had any DNPs producing read-through. Premature stop codons were predicted to result from 4 to 6% of SNPs and DNPs, but they were only found in 2% or less of all such events. These findings are presumably due to selective pressure against the potentially catastrophic results of either a protein truncation or elongation.

Analysis of exonic DNPs and TNPs

We found 225 DNPs located within 200 genes in the Venter and Chinese genomes. Sixty-six (29.3%) of these DNPs are found in both genomes, and may reflect the presence of variants or errors in the reference genome. In both the Venter and Chinese genomes, over 90% of the DNPs resulted in a single amino acid change. For the Venter genome, 30% of them were predicted to be damaging by both Polyphen (21) and SIFT (22). For the Chinese genome, 35% were predicted by Polyphen to be damaging while 34% were predicted to be deleterious by SIFT. We then used DAVID (20) to functionally annotate the genes containing DNPs. For the Chinese genome, the top Gene Ontology category to describe the DNPs was the MHC protein complex (Benjamini corrected P = 0.032); for the Venter genome, no categories were found to be significant.

Both genomes had a small number of TNPs within their exons, with the Venter genome containing only six exonic TNPs and the Chinese genome only three. None of these TNPs are shared between the two genomes. For the Venter exonic TNPs, only one is homozygous, but they all cause a single amino acid change. Four of these TNPs are predicted by Polyphen to be damaging to the protein structure, while two are predicted to be deleterious by SIFT. For the Chinese TNPs, only one is homozygous, but all cause two amino acid changes. Since Polyphen and SIFT only look at single amino acid changes, we were unable to evaluate these double amino acid changes for their damaging potential. The exonic TNPs were not concentrated in proteins of one type; the affected proteins include structural proteins (RDX, KRTAP10-1), a signal regulatory protein (SIRPA), an RNA-binding protein (GPATC4), an olfactory receptor (OR5L2), a protein kinase A-binding protein (AKAP3), a metalloprotease (ADAMTS9) and an antigen presenter (HLA-DRB5).

xNPs in sequenced exomes

To further investigate the occurrence of xNPs in human genes and their consistency, we utilized complete exome sequencing data from eight individuals (11). Since only the exomes of these individuals were sequenced, we were not able to quantify xNPs outside of genes. The SNPs, DNPs and TNPs in each exome were determined using the same criteria that were used for our initial two genomes (Venter and Chinese) and the results are shown in Table 4. We found an average of 17 164 exonic SNPs per exome, which is very close to the reported count (11) of 17 272. We found a smaller number of DNPs and TNPs, with an average of 164 DNPs per exome and five TNPs per exome. Notably, the average number of exonic DNPs was identical to that observed in the Chinese genome, which was sequenced on the same platform (Illumina). We then looked at the pervasiveness of each xNP among the eight exomes (Supplementary Table S2). On average, the same SNP, DNP or TNP was found in 2.5, 1.8 and 1.4 exomes, respectively. Despite this low average, there were SNPs and DNPs that were found across all eight exomes. For these loci, the occurrence of a SNP or a DNP relative to the reference genome in eight samples indicates that the reference genome probably does not contain the most common sequence. This is the case for 1450 SNPs (8.4%) and 18 DNPs (11%). These low percentages of pervasive xNPs indicate that the majority of xNPs are true variations between genomes rather than reflecting inaccuracies in the reference genome.
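
The pervasiveness measure can be sketched as a simple tally over per-sample call sets. The example below is illustrative only: the function name, the tuple layout of a call and the three toy samples are hypothetical, and real call sets would come from the xNP calling step.

```python
from collections import Counter


def sharing_summary(calls_by_sample):
    """Return (mean number of samples per distinct xNP, xNPs present in every sample).

    `calls_by_sample` maps a sample name to a set of (chrom, pos, ref, alt) tuples.
    """
    seen_in = Counter(call for calls in calls_by_sample.values() for call in calls)
    mean_samples = sum(seen_in.values()) / len(seen_in)
    in_all = sorted(c for c, n in seen_in.items() if n == len(calls_by_sample))
    return mean_samples, in_all


# Hypothetical toy call sets for three samples.
toy_calls = {
    "sample_A": {("chr1", 100, "AC", "GT"), ("chr2", 50, "A", "G")},
    "sample_B": {("chr1", 100, "AC", "GT"), ("chr3", 75, "C", "T")},
    "sample_C": {("chr1", 100, "AC", "GT"), ("chr2", 50, "A", "G")},
}
mean_samples, in_all = sharing_summary(toy_calls)
print(mean_samples, in_all)
```

An xNP that appears in every sample, like the first call in the toy data, is the kind of locus where the reference sequence may carry the less common allele.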
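Table 4.

Effects of SNPs, DNPs and TNPs in the eight sequenced exomes

Each cell lists the values for SNPs / DNPs / TNPs.

Type of change | NA12156 | NA12878 | NA18507 | NA18517 | NA18555 | NA18956 | NA19129 | NA19240 | Average
1 amino acid change: number | 7604 / 141 / 4 | 7228 / 158 / 4 | 8540 / 150 / 5 | 8685 / 170 / 5 | 7350 / 121 / 5 | 7073 / 154 / 1 | 8355 / 169 / 5 | 8385 / 172 / 2 | 7903 / 154 / 4
1 amino acid change: percentage of changes | 47 / 94 / 80 | 47 / 93 / 80 | 46 / 94 / 83 | 46 / 97 / 63 | 46 / 94 / 83 | 46 / 92 / 100 | 45 / 91 / 100 | 46 / 96 / 100 | 46 / 94 / 86
1 amino acid change: number of genes affected | 4758 / 132 / 3 | 4563 / 153 / 4 | 5230 / 141 / 3 | 5229 / 154 / 5 | 4672 / 117 / 5 | 4515 / 143 / 1 | 5145 / 154 / 4 | 5208 / 162 / 2 | 4915 / 145 / 3
2 amino acid changes: number | N/A / 5 / 1 | N/A / 7 / 1 | N/A / 6 / 1 | N/A / 3 / 3 | N/A / 5 / 1 | N/A / 9 / 0 | N/A / 9 / 0 | N/A / 8 / 0 | N/A / 7 / 1
2 amino acid changes: percentage of changes | N/A / 3 / 20 | N/A / 4 / 20 | N/A / 4 / 17 | N/A / 2 / 38 | N/A / 4 / 17 | N/A / 5 / 0 | N/A / 5 / 0 | N/A / 4 / 0 | N/A / 4 / 14
2 amino acid changes: number of genes affected | N/A / 5 / 1 | N/A / 7 / 1 | N/A / 6 / 1 | N/A / 3 / 3 | N/A / 5 / 1 | N/A / 9 / 0 | N/A / 9 / 0 | N/A / 8 / 0 | N/A / 7 / 1
Read through (from stop to coding): number | 10 / 0 / 0 | 7 / 0 / 0 | 17 / 1 / 0 | 12 / 0 / 0 | 8 / 0 / 0 | 4 / 0 / 0 | 9 / 0 / 0 | 10 / 0 / 0 | 10 / 0 / 0
Read through (from stop to coding): percentage of changes | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 1 / 0 | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0 | 0 / 0 / 0 | 0.06 / 0 / 0
Read through (from stop to coding): number of genes affected | 10 / 0 / 0 | 7 / 0 / 0 | 17 / 1 / 0 | 12 / 0 / 0 | 8 / 0 / 0 | 4 / 0 / 0 | 9 / 0 / 0 | 10 / 0 / 0 | 10 / 0 / 0
No change: number | 8578 / 3 / 0 | 8266 / 2 / 0 | 9994 / 2 / 0 | 10 141 / 1 / 0 | 8565 / 2 / 0 | 8149 / 2 / 0 | 10 170 / 5 / 0 | 9833 / 0 / 0 | 9212 / 2 / 0
No change: percentage of changes | 53 / 2 / 0 | 53 / 1 / 0 | 54 / 1 / 0 | 54 / 1 / 0 | 54 / 2 / 0 | 53 / 1 / 0 | 55 / 3 / 0 | 54 / 0 / 0 | 54 / 1 / 0
No change: number of genes affected | 5397 / 3 / 0 | 5220 / 2 / 0 | 6139 / 2 / 0 | 6240 / 1 / 0 | 5365 / 2 / 0 | 5216 / 2 / 0 | 6265 / 5 / 0 | 6094 / 0 / 0 | 5742 / 2 / 0
Premature stop codon: number | 39 / 1 / 0 | 42 / 2 / 0 | 35 / 0 / 0 | 43 / 2 / 0 | 35 / 1 / 0 | 38 / 2 / 0 | 45 / 2 / 0 | 42 / 0 / 0 | 40 / 1 / 0
Premature stop codon: percentage of changes | 0 / 1 / 0 | 0 / 1 / 0 | 0 / 0 / 0 | 0 / 1 / 0 | 0 / 1 / 0 | 0 / 1 / 0 | 0 / 1 / 0 | 0 / 0 / 0 | 0.23 / 1 / 0
Premature stop codon: number of genes affected | 38 / 1 / 0 | 42 / 2 / 0 | 34 / 0 / 0 | 42 / 2 / 0 | 33 / 1 / 0 | 37 / 2 / 0 | 44 / 2 / 0 | 41 / 0 / 0 | 39 / 1 / 0
Total exonic xNPs | 16 231 / 150 / 5 | 15 543 / 169 / 5 | 18 586 / 159 / 6 | 18 881 / 176 / 8 | 15 958 / 129 / 6 | 15 264 / 167 / 1 | 18 579 / 185 / 5 | 18 270 / 180 / 2 | 17 164 / 164 / 5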
The most common TNP was found in four exomes in the KRTAP10-1 gene, and it was determined by Polyphen and SIFT to be a benign change. As with the TNPs found in the two full genomes, a substantial number of them result in the change of two adjacent amino acids, which cannot be easily evaluated.

To further confirm the findings of xNPs in the exomes, we compared our findings for one exome (NA19240) to the complete genome of that individual, which has recently been sequenced using the Complete Genomics technology (29). In our analysis of the data from the exome sequencing (Table 4), we identified 180 DNPs and two TNPs in coding regions, while using the Complete Genomics data, we identified 155 coding DNPs and five coding TNPs. Seventy of the DNPs and one of the TNPs were found using both techniques. Thus, the results of xNP analysis could to some extent depend upon the sequencing platform. At the same time, the overall abundance of DNPs and TNPs observed in the human genome and exome appears to be relatively consistent across various sequencing technologies.

We then looked at the positions within codons where xNPs occur. SNPs should preferentially occur in the third codon position, as has been previously found (26,30). This is because of the wobble nature of the genetic code, whereby a change in this position is often silent. DNPs and TNPs have not been previously profiled, but it is expected that a DNP would preferentially occur in either positions 1 and 2 or 2 and 3 of a codon so as not to overlap two codons. Similarly, TNPs would be expected to completely overlap an individual codon. A plot of each type of xNP and the percentage that begin in each position in a codon is shown in Figure 2. As expected, the largest percentage of SNPs occur in the third codon position, but this was not an overwhelming majority (43%). For DNPs, unexpectedly, there appears to be little preference for any codon position. For TNPs, there is a bias towards their beginning in the first codon position (54%) and covering a single codon rather than overlapping two adjacent codons.
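
The codon position at which an xNP begins, and whether it overlaps two codons, follow from its offset within the coding sequence. A minimal sketch, assuming a 0-based offset into an in-frame coding sequence and using hypothetical function names:

```python
def codon_start_position(cds_offset):
    """1-based codon position (1, 2 or 3) of a 0-based coding-sequence offset."""
    return cds_offset % 3 + 1


def codons_spanned(cds_offset, length):
    """Number of codons touched by an xNP of `length` bases starting at `cds_offset`."""
    return (cds_offset + length - 1) // 3 - cds_offset // 3 + 1


for offset, length in [(3, 3), (4, 3), (4, 2), (5, 2)]:
    print(offset, length, codon_start_position(offset), codons_spanned(offset, length))
```

A TNP beginning in codon position 1 stays within one codon, whereas a DNP beginning in position 3 always overlaps two.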
Figure 2.

The percentage of SNPs, DNPs and TNPs that begin in each codon position.

Nucleotide substitutions can be categorized as either transitions or transversions depending upon the two nucleotides involved. There is generally considered to be a strong bias towards transitions as compared to transversions in metazoan genomes (31). This was confirmed by our findings for SNPs in the combined set of eight exomes, in which 37 457 transitions and 14 792 transversions were observed. For DNPs and TNPs, the terms transition and transversion do not directly apply since they are associated with individual nucleotides. Nevertheless, we were able to classify each position within a DNP or a TNP as a transition or a transversion (Table 5). For DNPs, the first position was dominated (66%) by transitions, while there was much less preference at the second position. In contrast, there was a preference among TNPs (46%) for three transversions in a row.
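
The per-position labelling can be sketched as below. This is an illustration, not the authors' code: the function names are hypothetical, labels are joined with a plain hyphen, and the DNP counts are the ones reported in Table 5.

```python
PURINES = {"A", "G"}


def substitution_type(ref, alt):
    """Transition: purine<->purine or pyrimidine<->pyrimidine; otherwise transversion."""
    if ref == alt:
        raise ValueError("reference and alternate bases are identical")
    return "Transition" if (ref in PURINES) == (alt in PURINES) else "Transversion"


def xnp_pattern(ref, alt):
    """Label each position of an xNP, e.g. 'Transition-Transversion' for a DNP."""
    return "-".join(substitution_type(r, a) for r, a in zip(ref, alt))


print(xnp_pattern("AC", "GT"))

# DNP counts as reported in Table 5, keyed by (first position, second position).
dnp_counts = {("Ts", "Ts"): 249, ("Ts", "Tv"): 256, ("Tv", "Ts"): 111, ("Tv", "Tv"): 147}
first_is_transition = sum(n for (first, _), n in dnp_counts.items() if first == "Ts")
share = 100 * first_is_transition / sum(dnp_counts.values())
print(f"DNPs with a transition at the first position: {share:.0f}%")
print(f"SNP transition/transversion ratio: {37457 / 14792:.2f}")
```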
Table 5.

A tally of each combination of nucleotide changes for DNPs and TNPs

DNP changes | Count
Transition–Transition | 249
Transition–Transversion | 256
Transversion–Transition | 111
Transversion–Transversion | 147

TNP changes | Count
Transition–Transition–Transition | 1
Transition–Transition–Transversion | 2
Transition–Transversion–Transition | 1
Transition–Transversion–Transversion | 3
Transversion–Transition–Transition | 4
Transversion–Transition–Transversion | 2
Transversion–Transversion–Transition | 2
Transversion–Transversion–Transversion | 13

xNPs in a cancerous exome

Finally, we applied our analysis to the sequenced exome of the U87 glioblastoma cell line (16). Rather than sequencing a complete exome, this study only sequenced the exons of 5253 cancer-associated genes. Using our analysis, we found 53 DNPs and eight TNPs. For the DNPs, four caused a double amino acid change. Of the 49 that caused a single amino acid change, 37% were predicted by Polyphen to be damaging while SIFT determined that 31% would be deleterious. For the TNPs, four caused a double amino acid change and three caused a single amino acid change. All of the single amino acid changes were predicted by both Polyphen and SIFT to be damaging. In addition, one TNP caused a premature stop codon in this exome. A summary of the mutations found in each gene is shown in Table 6.
Table 6.

A summary of the DNPs and TNPs found in the glioblastoma U87 exome

DNPs

CCDS name | Gene name | Result of xNP
CCDS10326 | ADAMTSL3 | 1 Amino acid change
CCDS10411 | PIGQ | 1 Amino acid change
CCDS11509 | CDC27 | 1 Amino acid change
CCDS12096 | ZNF555 | 1 Amino acid change
CCDS13754 | PRODH | 1 Amino acid change
CCDS14006 | ST13 | 1 Amino acid change
CCDS14228 | DMD | 1 Amino acid change
CCDS14596 | ZBTB33 | 1 Amino acid change
CCDS14607 | STAG2 | 1 Amino acid change
CCDS14711 | CSAG1 | 1 Amino acid change
CCDS14713 | MAGEA2 | 1 Amino acid change
CCDS2057 | IL1RL1 | 1 Amino acid change
CCDS2397 | BARD1 | 1 Amino acid change
CCDS30824 | PDE4DIP | 1 Amino acid change
CCDS30947 | ABL2 | 1 Amino acid change
CCDS31702 | OR10G4 | 1 Amino acid change
CCDS32595 | KIAA0100 | 1 Amino acid change
CCDS33119 | NLRP13 | 1 Amino acid change
CCDS33432 | TCF15 | 1 Amino acid change
CCDS33539 | SYNJ1 | 1 Amino acid change
CCDS33876 | MED12L | 1 Amino acid change
CCDS34389 | CDSN | 1 Amino acid change
CCDS34654 | TRIM50 | 1 Amino acid change
CCDS35277 | SHROOM4 | 1 Amino acid change
CCDS35277 | SHROOM4 | 1 Amino acid change
CCDS35305 | FAM104B | 1 Amino acid change
CCDS35417 | MAGEC1 | 1 Amino acid change
CCDS35431 | PASD1 | 1 Amino acid change
CCDS42086 | LRRK1 | 1 Amino acid change
CCDS42437 | MAPK4 | 1 Amino acid change
CCDS42535 | ZNF626 | 1 Amino acid change
CCDS42959 | KRTAP10-6 | 1 Amino acid change
CCDS42965 | KRTAP12-2 | 1 Amino acid change
CCDS42965 | KRTAP12-2 | 1 Amino acid change
CCDS43248 | DSPP | 1 Amino acid change
CCDS44016 | MAGEA2B | 1 Amino acid change
CCDS4881 | CUL7 | 1 Amino acid change
CCDS6415 | CYC1 | 1 Amino acid change
CCDS7148 | MYO3A | 1 Amino acid change
CCDS7566 | DUSP5 | 1 Amino acid change
CCDS7693 | NLRP6 | 1 Amino acid change
CCDS7927 | DDB2 | 1 Amino acid change
CCDS7954 | PRG3 | 1 Amino acid change
CCDS9300 | SACS | 1 Amino acid change
CCDS9737 | DAAM1 | 1 Amino acid change
CCDS9766 | HSPA2 | 1 Amino acid change
CCDS9867 | SNW1 | 1 Amino acid change
CCDS9928 | SERPINA5 | 1 Amino acid change
CCDS9998 | JAG2 | 1 Amino acid change
CCDS11281 | CCL13 | 2 Amino acid change
CCDS14446 | FAM46D | 2 Amino acid change
CCDS4269 | PCDH12 | 2 Amino acid change
CCDS5989 | DLC1 | 2 Amino acid change

TNPs

CCDS name | Gene name | Result of xNP
CCDS11966 | ALPK2 | 1 Amino acid change
CCDS1352 | LAMC2 | 1 Amino acid change
CCDS4452 | RASGEF1C | 1 Amino acid change
CCDS11905 | FAM59A | 2 Amino acid change
CCDS2012 | PROM2 | 2 Amino acid change
CCDS34698 | ZKSCAN1 | 2 Amino acid change
CCDS7328 | PPP3CB | 2 Amino acid change
CCDS35301 | HUWE1 | Premature stop codon
In order to determine whether any of these xNPs were in genes that have been previously found to be related to glioblastoma, we conducted PubMed searches for each of the genes that was found to have a damaging xNP, or an xNP changing the location of a stop codon. We found a TNP in the LAMC2 gene that results in a single amino acid change, L952D, which is predicted to be damaging. This gene has been found to be amplified in glioblastomas (32) as well as other cancers (33,34). A TNP in the HUWE1 gene causes a truncation of the protein by the introduction of a premature stop codon, reducing its length from 4374 residues to 1668 residues. The HUWE1 gene has been found to be important in brain development, and its deletion has been found to be important in malignant brain tumors (35). In the case of this cell line, HUWE1 is not deleted, but it is truncated and therefore most likely not functional.

DISCUSSION

We have characterized a novel source of genomic variation, DNPs and TNPs, which occur with a frequency of ∼1% of the total number of SNPs. In the two genomes we examined, we found tens of thousands of DNPs and thousands of TNPs. While only a small percentage of these changes are found in coding sequence and directly affect the transcribed protein, the non-exonic xNPs could of course be located in regulatory regions. Although not directly examined in this report, alteration of sequence in a promoter or an enhancer could change the expression dynamics of the associated gene (36). The coding xNPs, while small in number, could be very important clinically. In order to cause a disease, only a single amino acid change in the genome may be required. Since DNPs and TNPs cause an amino acid change in at least 90% of instances, they could very easily be a cause of a disease. Those DNPs and TNPs that cause two amino acid changes are especially intriguing since they are likely to have a pronounced effect on the protein structure and function. Exonic DNPs and TNPs are approximately 3-fold over-represented amongst amino acid-changing polymorphisms and produce more than 100 such changes in each normal human genome.

Based upon the average frequency of SNPs and simple probability, DNPs and TNPs should be much rarer than what we found. Assuming the occurrence of a SNP to be ∼1 in 1000 bp (3 million SNPs in a 3 billion base genome) and assuming independence of all SNPs, there should be one DNP every 1 million base pairs (1000²), which would total 3000 DNPs in the entire human genome. There should also be one TNP every 1 billion base pairs (1000³), which would total three TNPs for the entire human genome. These numbers are vastly lower than the numbers that we observed, supporting the idea that the mutations in a DNP or a TNP are not independent.
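
The independence estimate can be reproduced with a few lines of integer arithmetic. In this sketch the constant and function names are hypothetical, and the observed totals are the genome-wide DNP and TNP counts from Table 2.

```python
GENOME_SIZE = 3_000_000_000   # bases, as assumed in the text
BASES_PER_SNP = 1000          # one SNP per ~1000 bp


def expected_adjacent(k):
    """Expected number of runs of k adjacent SNPs if every SNP were independent."""
    return GENOME_SIZE // BASES_PER_SNP ** k


observed = {
    "Chinese": {2: 45_473, 3: 3_172},
    "Venter": {2: 36_701, 3: 1_946},
}
for genome, counts in observed.items():
    for k, n in counts.items():
        expected = expected_adjacent(k)
        print(f"{genome}: {k} adjacent changes, expected {expected}, "
              f"observed {n} ({n / expected:.0f}-fold)")
```

Under these assumptions the observed DNP counts exceed the independence expectation by more than tenfold, and the TNP counts by several hundredfold.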
It has been found that SNPs tend to cluster along the genome rather than being evenly distributed; certain regions of the genome have large numbers of SNPs and other regions are devoid of SNPs (37). In a region of the genome with a high density of SNPs, it is statistically more likely that a DNP or TNP would occur. The occurrence of DNPs and TNPs (as well as clusters of nearby, though non-adjacent, SNPs) can be explained by the results of polymerase mis-incorporation experiments. In these assays, it was found that if a polymerase incorporates the incorrect nucleotide at a particular location, the likelihood that another nearby nucleotide will be incorporated incorrectly is increased (38,39).

In considering our results, it is important to recognize that the number of variants identified by sequencing can vary as a function of numerous factors, including sequencing platform, read length, depth of coverage, and read alignment parameters (including quality control filters). The present study utilized different alignment parameters from prior whole genome sequencing studies in order to permit detection of DNPs and TNPs. For example, the Chinese genome (10) was aligned to the reference using the SOAP (40) tool, and the paired-end reads were aligned together allowing for two mismatches in each read. In our analysis, we aligned all of the reads using Bowtie (17), because it is very fast at aligning reads and this speed can be further increased by allowing it to use multiple threads on a multiprocessor machine (41). Moreover, we permitted up to three mismatches in each read in order to be able to detect TNPs; if we had used the standard cutoff of two mismatches, any read providing evidence for a TNP would have been discarded as an unmappable read. Given this less stringent filter, it is not surprising that we identified a somewhat larger number of SNPs in the Chinese genome than originally reported (Wang et al. 2008).
By contrast, the eight exomes were originally aligned using Maq (Li et al. 2008a), which does not have an explicit cutoff for the number of mismatches allowed; our alignment procedures resulted in an average number of SNPs that was nearly identical to the original report (Shendure et al. 2009). Finally, the original Venter genome analysis (9,42) was based upon the traditional Sanger sequencing and assembly of the Venter genome (43). As such, the SNPs between this genome and the human reference genome were determined by comparing the two genome assemblies de novo. Our analysis of the Venter/HuRef genome was completely different in that we utilized their raw sequencing reads, which were truncated into non-overlapping 36-bp reads to simulate Illumina sequencing reads (‘Materials and Methods’ section). This procedure resulted in a slight under-estimate of the total number of SNPs compared to the Venter Institute report, presumably because some variation in regions with low depth of coverage failed to meet our criteria for SNP calling.

As an application of our technique to a real disease, we investigated the U87 glioblastoma cell line. We found two TNPs that cause pathogenic changes in genes that have already been implicated in the disease. Besides these mutations, it is very likely that a further understanding of glioblastoma could be gained from an analysis of the xNPs that were found in genes that have not already been suspected of involvement in glioblastoma.
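Two of the alignment considerations discussed above, the truncation of long reads into non-overlapping 36-bp reads and the effect of the mismatch cutoff on reads carrying a TNP, can be sketched as follows. This is a simplified, ungapped comparison against a known reference window and is not the algorithm used by Bowtie, SOAP or Maq; the function names are hypothetical, and the choice to drop any trailing fragment shorter than 36 bp is an assumption of this example.

```python
def split_into_reads(sequence, read_length=36):
    """Cut a long read into non-overlapping pieces of read_length bases.
    Assumption: a trailing fragment shorter than read_length is dropped."""
    last_start = len(sequence) - read_length
    return [sequence[i:i + read_length]
            for i in range(0, last_start + 1, read_length)]


def count_mismatches(read, reference_window):
    """Number of positions at which the read differs from the reference."""
    return sum(a != b for a, b in zip(read, reference_window))


def passes_cutoff(read, reference_window, max_mismatches):
    """True if the read would be retained under the given mismatch cutoff."""
    return count_mismatches(read, reference_window) <= max_mismatches


# Hypothetical 36-bp reference window and a read carrying a TNP
# (GTA -> ACG) at positions 10-12.
reference_window = "ACGT" * 9
tnp_read = reference_window[:10] + "ACG" + reference_window[13:]

print(count_mismatches(tnp_read, reference_window))   # 3
print(passes_cutoff(tnp_read, reference_window, 2))   # False
print(passes_cutoff(tnp_read, reference_window, 3))   # True
print(len(split_into_reads("A" * 100)))               # 2
```

Under a two-mismatch cutoff the read supporting the TNP is rejected even though it is otherwise a perfect match, which is why the cutoff was raised to three.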

CONCLUSION

In conclusion, the detection of DNPs and TNPs has not been previously studied, to our knowledge, and would be impractical using microarrays. With the recent advent of high-throughput sequencing and the possibility of sequencing complete exomes (11) and genomes (29), the investigation of DNPs and TNPs should be relatively straightforward. Their identification could be accomplished computationally in a manner similar to that in which SNPs are called in sequenced genomes. It is hoped that the investigation of DNPs and TNPs in genomes will lead to the identification of causative mutations for genetic diseases that have thus far eluded SNP-based studies.
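One way such identification might be layered on top of an existing SNP-calling step is sketched below: single-base calls at consecutive positions on the same chromosome are merged into one multi-nucleotide call. This is a minimal illustration and not necessarily the procedure used in this study; the tuple layout and the function names are assumptions, and the sketch does not check that the adjacent changes lie on the same haplotype, which in practice would require them to be supported by the same reads.

```python
def merge_adjacent_snps(snps):
    """Merge SNP calls at consecutive positions into multi-nucleotide calls.

    snps: iterable of (chrom, pos, ref, alt) tuples with single-base alleles.
    Assumption: adjacent calls are on the same haplotype (not verified here).
    """
    merged = []
    for chrom, pos, ref, alt in sorted(snps):
        if merged:
            last = merged[-1]
            if last["chrom"] == chrom and last["pos"] + len(last["ref"]) == pos:
                last["ref"] += ref
                last["alt"] += alt
                continue
        merged.append({"chrom": chrom, "pos": pos, "ref": ref, "alt": alt})
    return merged


def classify(variant):
    """Label a merged call by the number of consecutive altered bases."""
    return {1: "SNP", 2: "DNP", 3: "TNP"}.get(len(variant["ref"]), "longer run")


# Hypothetical calls for illustration only.
calls = [
    ("chr1", 100, "A", "G"), ("chr1", 101, "C", "T"),
    ("chr1", 200, "G", "A"),
    ("chr2", 5, "T", "C"), ("chr2", 6, "A", "G"), ("chr2", 7, "C", "A"),
]
for v in merge_adjacent_snps(calls):
    print(v["chrom"], v["pos"], v["ref"], ">", v["alt"], classify(v))
```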

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Institutes of Health (R01MH084098 to T.L. and P50MH080173 to A.K.M.). Funding for open access charge: R01MH084098.

Conflict of interest statement. None declared.
  42 in total

1.  Accounting for human polymorphisms predicted to affect protein function.

Authors:  Pauline C Ng; Steven Henikoff
Journal:  Genome Res       Date:  2002-03       Impact factor: 9.043

Review 2.  Origins of spontaneous mutations: specificity and directionality of base-substitution, frameshift, and sequence-substitution mutageneses.

Authors:  Hisaji Maki
Journal:  Annu Rev Genet       Date:  2002-06-11       Impact factor: 16.830

3.  Gene-based SNP discovery as part of the Japanese Millennium Genome Project: identification of 190,562 genetic variations in the human genome. Single-nucleotide polymorphism.

Authors:  Hisanori Haga; Ryo Yamada; Yozo Ohnishi; Yusuke Nakamura; Toshihiro Tanaka
Journal:  J Hum Genet       Date:  2002       Impact factor: 3.172

4.  Complement factor H polymorphism in age-related macular degeneration.

Authors:  Robert J Klein; Caroline Zeiss; Emily Y Chew; Jen-Yue Tsai; Richard S Sackler; Chad Haynes; Alice K Henning; John Paul SanGiovanni; Shrikant M Mane; Susan T Mayne; Michael B Bracken; Frederick L Ferris; Jurg Ott; Colin Barnstable; Josephine Hoh
Journal:  Science       Date:  2005-03-10       Impact factor: 47.728

5.  The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.

Authors:  Kim D Pruitt; Jennifer Harrow; Rachel A Harte; Craig Wallin; Mark Diekhans; Donna R Maglott; Steve Searle; Catherine M Farrell; Jane E Loveland; Barbara J Ruef; Elizabeth Hart; Marie-Marthe Suner; Melissa J Landrum; Bronwen Aken; Sarah Ayling; Robert Baertsch; Julio Fernandez-Banet; Joshua L Cherry; Val Curwen; Michael Dicuccio; Manolis Kellis; Jennifer Lee; Michael F Lin; Michael Schuster; Andrew Shkeda; Clara Amid; Garth Brown; Oksana Dukhanina; Adam Frankish; Jennifer Hart; Bonnie L Maidak; Jonathan Mudge; Michael R Murphy; Terence Murphy; Jeena Rajan; Bhanu Rajput; Lillian D Riddick; Catherine Snow; Charles Steward; David Webb; Janet A Weber; Laurens Wilming; Wenyu Wu; Ewan Birney; David Haussler; Tim Hubbard; James Ostell; Richard Durbin; David Lipman
Journal:  Genome Res       Date:  2009-06-04       Impact factor: 9.043

6.  Large-scale discovery and genotyping of single-nucleotide polymorphisms in the mouse.

Authors:  K Lindblad-Toh; E Winchester; M J Daly; D G Wang; J N Hirschhorn; J P Laviolette; K Ardlie; D E Reich; E Robinson; P Sklar; N Shah; D Thomas; J B Fan; T Gingeras; J Warrington; N Patil; T J Hudson; E S Lander
Journal:  Nat Genet       Date:  2000-04       Impact factor: 38.330

7.  Human non-synonymous SNPs: server and survey.

Authors:  Vasily Ramensky; Peer Bork; Shamil Sunyaev
Journal:  Nucleic Acids Res       Date:  2002-09-01       Impact factor: 16.971

8.  The effects of dNTP pool imbalances on frameshift fidelity during DNA replication.

Authors:  K Bebenek; J D Roberts; T A Kunkel
Journal:  J Biol Chem       Date:  1992-02-25       Impact factor: 5.157

9.  Synonymous mutations in the human dopamine receptor D2 (DRD2) affect mRNA stability and synthesis of the receptor.

Authors:  Jubao Duan; Mark S Wainwright; Josep M Comeron; Naruya Saitou; Alan R Sanders; Joel Gelernter; Pablo V Gejman
Journal:  Hum Mol Genet       Date:  2003-02-01       Impact factor: 6.150

10.  The diploid genome sequence of an individual human.

Authors:  Samuel Levy; Granger Sutton; Pauline C Ng; Lars Feuk; Aaron L Halpern; Brian P Walenz; Nelson Axelrod; Jiaqi Huang; Ewen F Kirkness; Gennady Denisov; Yuan Lin; Jeffrey R MacDonald; Andy Wing Chun Pang; Mary Shago; Timothy B Stockwell; Alexia Tsiamouri; Vineet Bafna; Vikas Bansal; Saul A Kravitz; Dana A Busam; Karen Y Beeson; Tina C McIntosh; Karin A Remington; Josep F Abril; John Gill; Jon Borman; Yu-Hui Rogers; Marvin E Frazier; Stephen W Scherer; Robert L Strausberg; J Craig Venter
Journal:  PLoS Biol       Date:  2007-09-04       Impact factor: 8.029

