Eric Arehart, Scott Gleim, Bill White, John Hwa, Jason H Moore.
Abstract
BACKGROUND: The fidelity of DNA replication serves as the nidus for both genetic evolution and genomic instability fostering disease. Single nucleotide polymorphisms (SNPs) constitute greater than 80% of the genetic variation between individuals. A new theory regarding DNA replication fidelity has emerged in which selectivity is governed by base-pair geometry through interactions between the selected nucleotide, the complementary strand, and the polymerase active site. We hypothesize that specific nucleotide combinations in the flanking regions of SNP fragments are associated with mutation.
Year: 2009 PMID: 19331672 PMCID: PMC2669078 DOI: 10.1186/1756-0381-2-2
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Figure 1. Summary of the general steps involved in implementing the MDR method. "High-risk" cells are labeled G1 and "low-risk" cells are labeled G0, constructing a new one-dimensional attribute with levels G0 and G1. The list of attributes includes the flanking positions ten nucleotides long in both the 3' and 5' directions from the SNP site. Each position has four levels (A, C, G, T) or, alternatively, two levels (purine, pyrimidine), while the class has two levels (0, 1) that code for controls and cases. A) illustrates the distribution of cases (left bars) and controls (right bars) for each of the four possible classes of attributes. The dark-shaded cells have been labeled 'high-risk' using a threshold of T = 1. The light-shaded cells have been labeled 'low-risk'. B) illustrates the distribution of cases and controls when the two functional positions are considered jointly; this is the distribution of cases and controls for the new single attribute constructed using MDR.
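The labeling step described in this caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `mdr_construct`, the strict `>` comparison against the threshold, and the treatment of cells with no controls are all assumptions, and the toy data are invented for the example.

```python
from collections import Counter

def mdr_construct(cells, labels, threshold=1.0):
    """Collapse multi-position cells into a single G0/G1 attribute.

    cells  : one tuple of nucleotide levels per sequence
    labels : 1 for a case, 0 for a control
    """
    cases, controls = Counter(), Counter()
    for cell, label in zip(cells, labels):
        if label == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1

    risk = {}
    for cell in set(cases) | set(controls):
        n_case, n_ctrl = cases[cell], controls[cell]
        # Assumption: a cell with cases but no controls is high-risk.
        ratio = float("inf") if n_ctrl == 0 else n_case / n_ctrl
        # Assumption: strictly greater than the threshold means high-risk.
        risk[cell] = "G1" if ratio > threshold else "G0"

    # The new one-dimensional attribute: one G0/G1 value per sequence.
    new_attribute = [risk[cell] for cell in cells]
    return risk, new_attribute

# Toy data (hypothetical): two flanking positions per sequence.
cells = [("A", "G"), ("A", "G"), ("A", "G"), ("C", "T"), ("C", "T"), ("C", "T")]
labels = [1, 1, 0, 0, 0, 1]
risk, attribute = mdr_construct(cells, labels)
```

In the toy data the ("A", "G") cell holds two cases and one control, so its ratio exceeds T = 1 and it is pooled into G1, while ("C", "T") falls into G0.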
Figure 2. Schematic of flanking regions in the 3' and 5' directions. Each position in the NCBI dataset was evaluated with regard to nucleotide identity; each position in the Broad Institute dataset was evaluated by both nucleotide identity and purine/pyrimidine identity.
Broad Dataset with Flanking Positions Identified by A, G, C, and T Character
| SNP | Count | Nucleotide positions | Testing accuracy | | |
| | 371 | +1 | 0.558 | 0.05 | 0.001 |
| | 84 | +1 | 0.613 | 0.05 | 0.001 |
| | 95 | +1 | 0.521 | <0.01 | 0.001 |
| | 322 | +1 | 0.571 | 0.02 | 0.001 |
MDR analysis of SNP flanking regions, where the upstream region 3' to the SNP site spans positions -10 through -1 and the downstream region 5' of the SNP site spans positions +1 through +10. All positions were evaluated with regard to nucleotide identity, i.e., adenine, cytosine, guanine, and thymine.
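The position numbering used in this caption can be illustrated with a short Python sketch. It assumes, purely for illustration, that each sequence is a string in which lower indices correspond to the negative positions and that the SNP site sits at a known index; the helper name `flanking_positions` is hypothetical.

```python
def flanking_positions(sequence, snp_index, width=10):
    """Map signed flanking positions (-width..-1, +1..+width) to nucleotides."""
    positions = {}
    for offset in range(1, width + 1):
        left = snp_index - offset
        right = snp_index + offset
        if left >= 0:
            positions[-offset] = sequence[left]
        if right < len(sequence):
            positions[offset] = sequence[right]
    return positions

# Toy sequence (hypothetical): ten bases, the SNP site marked N, ten bases.
seq = "ACGTACGTAC" + "N" + "GTACGTACGT"
flanks = flanking_positions(seq, snp_index=10)
```

Positions that would run past either end of the sequence are simply omitted, so sequences shorter than the full window still yield a partial attribute set.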
Broad Dataset with Flanking Positions Identified as either purine or pyrimidine
| SNP | Count | Nucleotide positions | Testing accuracy | | |
| | 322 | +1 | 0.571 | 0.02 | 0.001 |
| | 371 | +2 | 0.536 | <0.01 | 0.001 |
| | 87 | -2, +2 | 0.621 | 0.02 | 0.001 |
MDR analysis of SNP flanking regions, where the upstream region 3' to the SNP site spans positions -10 through -1 and the downstream region 5' of the SNP site spans positions +1 through +10. All positions were evaluated with regard to pyrimidine or purine identity.
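The two-level encoding amounts to recoding each flanking nucleotide by chemical class: adenine and guanine are purines, cytosine and thymine are pyrimidines. A minimal Python sketch follows; the function name and the use of the letters R and Y as class labels are assumptions made for the example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def to_purine_pyrimidine(nucleotide):
    """Recode a nucleotide as 'R' (purine) or 'Y' (pyrimidine)."""
    base = nucleotide.upper()
    if base in PURINES:
        return "R"
    if base in PYRIMIDINES:
        return "Y"
    raise ValueError(f"unexpected nucleotide: {nucleotide!r}")

# Recode a short flanking fragment position by position.
recoded = [to_purine_pyrimidine(base) for base in "ACGT"]
```

Reducing four levels to two shrinks the number of cells MDR must evaluate for any combination of positions, at the cost of no longer distinguishing bases within a class.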
NCBI Dataset Distributions
| SNP type | Total count | Total % | Exonic count | Exonic % | Intronic count | Intronic % |
| A or C | 75838 | 8.241639417 | 1981 | 6.591249 | 73857 | 8.297365 |
| A or G | 312882 | 34.00222348 | 11324 | 37.67759 | 301558 | 33.87813 |
| A or T | 58562 | 6.364182699 | 1008 | 3.353851 | 57554 | 6.465826 |
| C or G | 83070 | 9.027571749 | 2621 | 8.720679 | 80449 | 9.037934 |
| C or T | 312689 | 33.98124934 | 11181 | 37.2018 | 301508 | 33.87251 |
| G or T | 76497 | 8.313255762 | 1831 | 6.092164 | 74666 | 8.388251 |
SNP model codes are based on established NCBI nomenclature. SNP type identifies the nucleotides potentially occupying the SNP site in those models.
Figure 3. Single nucleotide polymorphism dataset collected from NCBI. Codes for cases are shown in the included table. Sequences were collected for both intronic and exonic regions, and their numbers are given in the column marked Occurrences. Nucleotide flanking pattern searches were conducted for exonic sequences only.
Figure 4. Distribution of SNP types within the NCBI dataset. Representation of percent transition, transversion, and multiple variant types in both exons and introns for the NCBI dataset.
NCBI Transition and Transversion Distributions in Exonic and Intronic Sequences
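| Mutation type | Exonic count | Exonic % | Intronic count | Intronic % | Total count | Total % |
| Transition | 22505 | 74.87 | 603066 | 67.75 | 625571 | 67.98 |
| Transversion | 7441 | 24.75 | 286526 | 32.18 | 293991.8 | 31.94 |
| Multiple variants | 109 | 0.362 | 534 | 0.0599 | 643.3627 | 0.06991 |
| Total | 30055 | | 890126 | | 920181 | |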
The number and percent of transition and transversion mutation types for the NCBI dataset are listed. Multiple variants are used for those SNP models that were not clearly a transition or transversion, but rather included both types of errors.
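The transition/transversion split can be illustrated with a short Python sketch: a substitution between two purines or two pyrimidines is a transition, and any purine/pyrimidine exchange is a transversion. The function name `classify_snp` is hypothetical. The counts are the 1981, 11324, 1008, 2621, 11181, and 1831 entries from the NCBI Dataset Distributions table, taken here on the assumption that they form its exonic column.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")

def classify_snp(allele_a, allele_b):
    """Return 'transition' or 'transversion' for a two-allele SNP."""
    pair = {allele_a.upper(), allele_b.upper()}
    if len(pair) != 2 or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError(f"expected two distinct nucleotides, got {allele_a!r}, {allele_b!r}")
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"

# Counts copied from the NCBI Dataset Distributions table
# (assumed to be the exonic column).
counts = {
    ("A", "C"): 1981,
    ("A", "G"): 11324,
    ("A", "T"): 1008,
    ("C", "G"): 2621,
    ("C", "T"): 11181,
    ("G", "T"): 1831,
}

totals = {"transition": 0, "transversion": 0}
for (a, b), n in counts.items():
    totals[classify_snp(a, b)] += n
```

Summing by class gives 22505 transitions and 7441 transversions, which agree with the first two count entries of the table above.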
NCBI Dataset Motifs
| SNP type | Motif | |
| A/C | +1, +2 | 0.001 |
| C/G | -1, +1, +2 | 0.001 |
| A/T | -1, +1, +2 | 0.001 |
| G/T | -1, +1, +2 | 0.001 |
| A/G | -1, +1, +2, +3 | 0.001 |
| C/T | -2, -1, +1, +2, +3 | 0.001 |
SNP type represents identity of nucleic acid at SNP position. Motif represents positions within templating strand found to have significant association with sequences carrying an identified SNP compared to control sequences. Here position -1 is adjacent to the SNP site on the 3' end and position +1 is adjacent to the SNP site on the 5' end.
Figure 5. Nucleic acid distribution within each SNP flanking region. The nucleotide distribution across all SNPs in our final dataset is shown center top. All four potential nucleic acids for each position within each motif are shown as bars, where a positive association is shown as a positive value on the y-axis and a negative association is depicted as a negative value on the y-axis. Nucleic acid distributions are shown for all identified SNP types.
Flanking nucleotide distribution for each SNP-type
| SNP Model | SNP | Nucleotide Positions | Nucleotide Identity | % |
| Y | C/T | -3 | C | |
| | | -2 | G | |
| | | +1 | A | |
| | | +2 | G | |
| | | +3 | G | |
| R | A/G | -2 | C | |
| | | -1 | C | |
| | | +1 | T | |
| | | +2 | X | |
| S | C/G | -2 | C | |
| | | -1 | X | |
| | | +1 | X | |
| W | A/T | -2 | G | |
| | | -1 | C | |
| | | +1 | G | |
| K | G/T | -2 | C | |
| | | -1 | C | |
| | | +1 | G | |
SNP models are identified by their respective NCBI single letter code and by their potential mutation identity. Nucleotide frequency is given as percent, with a lack of significance denoted as ns.
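The single-letter SNP model codes follow the standard IUPAC two-base ambiguity codes, so the mapping between a code and its allele pair can be written as a small lookup. The sketch below is illustrative only: the names `IUPAC_SNP_CODES` and `snp_code` are hypothetical, and M (A/C) is included from the IUPAC standard even though it does not appear in the table above.

```python
# Standard IUPAC two-base ambiguity codes; each pair is in alphabetical order.
IUPAC_SNP_CODES = {
    "R": ("A", "G"),
    "Y": ("C", "T"),
    "S": ("C", "G"),
    "W": ("A", "T"),
    "K": ("G", "T"),
    "M": ("A", "C"),
}

def snp_code(allele_a, allele_b):
    """Return the single-letter code for an unordered pair of alleles."""
    pair = tuple(sorted((allele_a.upper(), allele_b.upper())))
    for code, alleles in IUPAC_SNP_CODES.items():
        if alleles == pair:
            return code
    raise ValueError(f"no two-base code for {allele_a!r}/{allele_b!r}")

code_ct = snp_code("T", "C")
code_ag = snp_code("g", "a")
```

Sorting the alleles before the lookup makes the function independent of the order in which the two alleles are written.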