Leah E Mechanic, Brian T Luke, Julie E Goodman, Stephen J Chanock, Curtis C Harris.
Abstract
BACKGROUND: The risk of common diseases is likely determined by the complex interplay between environmental and genetic factors, including single nucleotide polymorphisms (SNPs). Traditional methods of data analysis are poorly suited for detecting complex interactions due to sparseness of data in high dimensions, which often occurs when data are available for a large number of SNPs for a relatively small number of samples. Validation of associations observed using multiple methods should be implemented to minimize likelihood of false-positive associations. Moreover, high-throughput genotyping methods allow investigators to genotype thousands of SNPs at one time. Investigating associations for each individual SNP or interactions between SNPs using traditional approaches is inefficient and prone to false positives.
Year: 2008 PMID: 18325117 PMCID: PMC2335300 DOI: 10.1186/1471-2105-9-146
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Description of the method for scoring functions 1–5. In this example, a study consists of 200 cases and 200 controls, and a 10-fold cross-validation is performed. Only two SNPs are examined: A (with alleles A and a) and B (with alleles B and b). The order of samples is scrambled before training. In (a), the training samples (180 cases and 180 controls) are assigned to the 9 × 2 genotype-phenotype table (classification). The genotype-phenotype table is the distribution of phenotypes (i.e., case vs. control) across all possible genotype combinations for the SNPs examined, and it is used for classification. In this example, PIA v. 2.0 designates AABB and AABb as case-genotypes, aaBB as an undetermined-genotype, and the remaining six genotypes as control-genotypes. If the training data are selected to contribute to scoring (if the jackknife analysis, LOO, is selected), a contingency table is generated using the training data (b). The contingency table compares the observed genotype-phenotype distribution with the distribution expected from the genotype assignments in (a). The testing data are placed into the appropriate cells of the genotype-phenotype table (c). The contingency table for the testing data (d) is generated using the genotype assignments from the training data (a). Since the AABB genotype represents a case-phenotype (based on the training data), the seven case samples are added to the number of true positives (NTP) and the three control samples are added to the number of false positives (NFP) in the contingency table (d). Conversely, AaBB is a control-phenotype, so the five controls are added to the number of true negatives (NTN) and the three cases are added to the number of false negatives (NFN). If a testing sample is assigned to an undetermined-phenotype (aaBB), PIA counts the assignment as half-right and half-wrong. Therefore, the three cases cause NTP and NFP to be increased by 1.5; the two controls increase NTN and NFN by 1.0.
After processing all testing samples, the corresponding contingency table is shown in (d). The process is then repeated for the remaining 9 sets of testing and training samples, and all contingency tables arising from the testing samples are summed.
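The tallying step described in the Figure 1 legend can be sketched in a few lines of Python. This is an illustrative sketch only, not the PIA source code: the function names `tally_fold` and `sum_folds` and the dictionary layout are hypothetical, the genotype labels are taken as given from the training step, and the increments for undetermined genotypes follow the legend exactly as written (cases add to NTP and NFP, controls add to NTN and NFN).

```python
def tally_fold(test_counts, genotype_label):
    """Build the testing contingency counts for one cross-validation fold.

    test_counts: {genotype: (n_cases, n_controls)} for the testing samples.
    genotype_label: {genotype: "case" | "control" | "undetermined"},
        assigned beforehand from the training data.
    """
    table = {"NTP": 0.0, "NFP": 0.0, "NTN": 0.0, "NFN": 0.0}
    for genotype, (n_cases, n_controls) in test_counts.items():
        label = genotype_label[genotype]
        if label == "case":
            table["NTP"] += n_cases       # cases in a case-genotype cell
            table["NFP"] += n_controls    # controls in a case-genotype cell
        elif label == "control":
            table["NTN"] += n_controls    # controls in a control-genotype cell
            table["NFN"] += n_cases       # cases in a control-genotype cell
        else:
            # Undetermined: half-right and half-wrong, with the increments
            # copied from the figure legend as written (an assumption about
            # how PIA itself books these samples).
            table["NTP"] += 0.5 * n_cases
            table["NFP"] += 0.5 * n_cases
            table["NTN"] += 0.5 * n_controls
            table["NFN"] += 0.5 * n_controls
    return table


def sum_folds(tables):
    """Sum the contingency counts from all testing folds."""
    total = {"NTP": 0.0, "NFP": 0.0, "NTN": 0.0, "NFN": 0.0}
    for t in tables:
        for key in total:
            total[key] += t[key]
    return total


# The three cells spelled out in the legend for one testing fold.
labels = {"AABB": "case", "AaBB": "control", "aaBB": "undetermined"}
fold = tally_fold({"AABB": (7, 3), "AaBB": (3, 5), "aaBB": (3, 2)}, labels)
```

Only the three genotype cells whose testing counts are stated in the legend are used here; a full fold would cover all nine cells, and `sum_folds` would then combine the ten per-fold tables.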
Definition of the seven scoring functions used in PIA v. 2.0.
| Metric | Description | Formula^a |
| 1 | %Correct | (NTP + NTN)/(NTP + NFN + NFP + NTN) |
| 2 | Sensitivity + Specificity | [NTP/(NTP+NFN)] + [NTN/(NFP+NTN)] |
| 3 | Positive Predictive Value (PPV) + Negative Predictive Value (NPV) | [NTP/(NTP+NFP)] + [NTN/(NFN+NTN)] |
| 4 | Risk Ratio | [(NTP)(NFP+NTN)]/[(NFP)(NTP+NFN)] |
| 5 | Odds Ratio | [(NTP)(NTN)]/[(NFP)(NFN)] |
| 6 | Gini Index^b | $\mathrm{GINI}_{\mathrm{parent}} - \mathrm{GINI}_{\mathrm{split}}$ |
| | | $\mathrm{GINI}(k) = 1.0 - \sum_{j=1}^{J} [p(j \mid k)]^{2}$ |
| | | $\mathrm{GINI}_{\mathrm{split}} = \sum_{k=1}^{K} (n_k/n)\,\mathrm{GINI}(k)$ |
| 7 | Absolute Probability Difference^c | $\sum_{k=1}^{K} \lvert P_1(k) - P_2(k) \rvert$ |
a NTP, Number of true positives; NTN, number of true negatives; NFN, number of false negatives; NFP, number of false positives.
b Gini Index is used in CART decision trees [25]. The scoring for Gini index is described under "Algorithm."
c The score is the absolute difference between the probability of finding a case (P1) in cell k and the probability of observing a control (P2) in cell k, summed over all K cells in the genotype-phenotype table.
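As a rough illustration of the definitions in this table, the seven scoring functions can be written out in Python. This is a sketch under stated assumptions, not PIA's implementation: all function names are hypothetical; the Gini index uses the standard CART form with squared class proportions, with the parent node taken to be the whole case/control sample; and P1(k) and P2(k) are assumed to be the fraction of all cases and of all controls, respectively, that fall in cell k. No guard against empty cells or zero denominators is included.

```python
def scores_1_to_5(ntp, nfp, ntn, nfn):
    """Scoring functions 1-5 from the summed contingency counts."""
    return {
        "pct_correct": (ntp + ntn) / (ntp + nfn + nfp + ntn),
        "sens_plus_spec": ntp / (ntp + nfn) + ntn / (nfp + ntn),
        "ppv_plus_npv": ntp / (ntp + nfp) + ntn / (nfn + ntn),
        "risk_ratio": (ntp * (nfp + ntn)) / (nfp * (ntp + nfn)),
        "odds_ratio": (ntp * ntn) / (nfp * nfn),
    }


def gini(class_counts):
    """Gini impurity of one node: 1 - sum of squared class proportions."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)


def gini_index(cells):
    """Function 6: GINI_parent - GINI_split.

    cells: list of (n_cases, n_controls), one pair per genotype cell k.
    """
    n_cases = sum(a for a, _ in cells)
    n_controls = sum(b for _, b in cells)
    n = n_cases + n_controls
    parent = gini([n_cases, n_controls])
    # Weighted impurity over the non-empty genotype cells.
    split = sum(((a + b) / n) * gini([a, b]) for a, b in cells if a + b > 0)
    return parent - split


def abs_probability_difference(cells):
    """Function 7: sum over cells of |P1(k) - P2(k)| (assumed definition)."""
    n_cases = sum(a for a, _ in cells)
    n_controls = sum(b for _, b in cells)
    return sum(abs(a / n_cases - b / n_controls) for a, b in cells)


example = scores_1_to_5(ntp=40, nfp=10, ntn=30, nfn=20)
separated = [(10, 0), (0, 10)]   # genotype cells that split cases from controls
mixed = [(5, 5), (5, 5)]         # genotype cells that carry no information
```

With perfectly separating cells the Gini index reaches its maximum for a balanced sample (0.5) and the absolute probability difference reaches 2.0; uninformative cells score 0 on both.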
Number of times the interacting alleles were observed as the highest-scoring model (rank = 1) or the second-highest-scoring model (rank = 2) using PIA 2.0^a for 2-SNP interactions in balanced simulated data sets.
| | | Scoring Function | | | | | | | |
| Model Number^b | Rank | %Correct | Sensitivity + Specificity | PPV + NPV | Risk Ratio | Odds Ratio | Gini Index | Probability Difference | Total (Overall)^c |
| 55 | 1 | 66 | 66 | 50 | 36 | 50 | 83 | 67 | 79 |
| | 2 | 4 | 4 | 14 | 11 | 14 | 10 | 15 | 6 |
| 56 | 1 | 55 | 55 | 46 | 33 | 46 | 84 | 58 | 68 |
| | 2 | 3 | 3 | 10 | 11 | 9 | 2 | 5 | 5 |
| 57 | 1 | 58 | 58 | 52 | 19 | 52 | 82 | 66 | 78 |
| | 2 | 7 | 7 | 10 | 12 | 10 | 8 | 7 | 9 |
| 58 | 1 | 88 | 88 | 76 | 58 | 76 | 96 | 89 | 92 |
| | 2 | 1 | 1 | 11 | 18 | 11 | 3 | 4 | 5 |
| 59 | 1 | 50 | 50 | 45 | 36 | 45 | 64 | 48 | 61 |
| | 2 | 13 | 13 | 13 | 16 | 12 | 10 | 14 | 11 |
a Results were generated using cell counts to assign case versus control status (IFRACT = 0); the training data were excluded from scoring, and ten 10-fold cross-validations were run for functions 1–5 (ITRAIN = 0, FRACT = 0.1, NTIME = 10). Results that include the training data in scoring are presented in Additional file 1, Table S1.
b Simulated data sets were described previously [14] and were obtained from Dr. Moore by request.
c Total score is the summation over all scoring functions after linearly scaling the score for each individual function such that the top score is 50.0.
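The total score in footnote c can be illustrated with a short Python sketch. It assumes that "linearly scaling" means multiplying each function's scores by 50.0 divided by that function's top score (scaling through the origin, with positive scores); the footnote does not say whether an offset is also applied, so this is one plausible reading. The function name `total_scores`, the input layout, and the example values are hypothetical.

```python
def total_scores(scores_by_function, top=50.0):
    """Sum per-function scores after scaling each function's best score to `top`.

    scores_by_function: {function_name: {model: raw_score}}, with positive
    raw scores and every model scored by every function.
    """
    totals = {}
    for per_model in scores_by_function.values():
        best = max(per_model.values())
        for model, raw in per_model.items():
            # Assumed scaling: proportional, so the best model gets `top`.
            totals[model] = totals.get(model, 0.0) + top * raw / best
    return totals


# Two made-up scoring functions applied to two made-up models.
demo = total_scores({
    "odds_ratio": {"model_A": 2.0, "model_B": 1.0},
    "gini_index": {"model_A": 0.25, "model_B": 1.0},
})
```

Scaling before summation keeps a function with a large numeric range, such as the odds ratio, from dominating functions that are bounded by 1 or 2.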
Number of times the interacting alleles were observed as the highest (rank = 1) or second-highest (rank = 2) pairs in the top 10 triplets using PIA 2.0^a for 3-SNP interactions in balanced simulated data sets.
| | | Scoring Function | | | | | | | |
| Model Number^b | Rank | %Correct | Sensitivity + Specificity | PPV + NPV | Risk Ratio | Odds Ratio | Gini Index | Probability Difference | Total (Overall)^c |
| 55 | 1 | 73 | 73 | 79 | 58 | 79 | 81 | 67 | 63 |
| | 2 | 9 | 9 | 5 | 9 | 5 | 7 | 15 | 12 |
| 56 | 1 | 61 | 61 | 62 | 50 | 62 | 79 | 51 | 68 |
| | 2 | 10 | 10 | 6 | 8 | 6 | 6 | 10 | 8 |
| 57 | 1 | 72 | 72 | 71 | 47 | 71 | 78 | 55 | 75 |
| | 2 | 7 | 7 | 10 | 12 | 10 | 8 | 15 | 7 |
| 58 | 1 | 92 | 92 | 90 | 85 | 90 | 96 | 86 | 92 |
| | 2 | 3 | 3 | 2 | 4 | 2 | 2 | 4 | 3 |
| 59 | 1 | 57 | 57 | 55 | 52 | 55 | 61 | 42 | 60 |
| | 2 | 10 | 10 | 10 | 8 | 10 | 10 | 9 | 10 |
a Results were generated using cell counts to assign case versus control status (IFRACT = 0); the training data were excluded from scoring, and ten 10-fold cross-validations were run for functions 1–5 (ITRAIN = 0, FRACT = 0.1, NTIME = 10). Results that include the training data in scoring are presented in Additional file 1, Table S1.
b Simulated data sets were described previously [14] and were obtained from Dr. Moore by request.
c Total score is the summation over all scoring functions after linearly scaling the score for each individual function such that the top score is 50.0.
Number of times the interacting alleles were observed as the highest-scoring model (rank = 1) or the second-highest-scoring model (rank = 2) using PIA 2.0^a for 2-SNP interactions in imbalanced simulated data sets.
| | | | Scoring Function | | | | | | |
| Model Number^b | Case:Control Ratio | Rank | Sensitivity + Specificity | PPV + NPV | Risk Ratio | Odds Ratio | Gini Index | Probability Difference | Total (Overall)^c |
| 55 | 1:2 | 1 | 72 | 72 | 69 | 72 | 81 | 74 | 82 |
| | | 2 | 6 | 7 | 7 | 7 | 9 | 8 | 6 |
| | 1:4 | 1 | 56 | 50 | 50 | 50 | 61 | 63 | 67 |
| | | 2 | 9 | 13 | 10 | 13 | 11 | 11 | 7 |
| 56 | 1:2 | 1 | 69 | 67 | 59 | 67 | 83 | 73 | 79 |
| | | 2 | 8 | 10 | 10 | 10 | 11 | 9 | 10 |
| | 1:4 | 1 | 45 | 43 | 38 | 43 | 62 | 52 | 62 |
| | | 2 | 10 | 11 | 12 | 11 | 15 | 12 | 11 |
| 57 | 1:2 | 1 | 63 | 62 | 45 | 62 | 82 | 69 | 76 |
| | | 2 | 6 | 7 | 12 | 7 | 7 | 14 | 8 |
| | 1:4 | 1 | 51 | 49 | 34 | 49 | 67 | 53 | 66 |
| | | 2 | 10 | 8 | 16 | 8 | 10 | 14 | 9 |
| 58 | 1:2 | 1 | 87 | 86 | 79 | 86 | 96 | 92 | 95 |
| | | 2 | 7 | 5 | 8 | 5 | 4 | 4 | 4 |
| | 1:4 | 1 | 73 | 70 | 66 | 69 | 84 | 79 | 84 |
| | | 2 | 8 | 8 | 15 | 9 | 10 | 8 | 7 |
| 59 | 1:2 | 1 | 64 | 61 | 62 | 61 | 72 | 72 | 70 |
| | | 2 | 6 | 9 | 12 | 9 | 7 | 7 | 10 |
| | 1:4 | 1 | 54 | 50 | 45 | 50 | 48 | 60 | 61 |
| | | 2 | 8 | 10 | 15 | 10 | 19 | 8 | 13 |
a Results were generated using fractional occupations to assign case versus control status (IFRACT = 1); the training data were excluded from scoring, and ten 10-fold cross-validations were run for functions 1–5 (ITRAIN = 0, FRACT = 0.1, NTIME = 10). Results that include the training data in scoring are presented in Additional file 1, Table S2.
b Simulated data sets were described previously [14] and were obtained from Dr. Moore by request.
c Total score is the summation over all scoring functions after linearly scaling the score for each individual function such that the top score is 50.0.
Highest-scoring SNP combinations associated with colon cancer using PIA v. 2.0^a.
| No. of SNPs | %Correct | Sensitivity + Specificity | PPV + NPV | Risk Ratio | Odds Ratio | Gini Index | Probability Difference | Total (Overall)^b |
| 1 | GSTT1_02 | GSTT1_02 | PTGS2_11 | PTGS2_11 | PTGS2_11 | IL4_01 | GSTT1_02 | PTGS2_11 |
| 2 | CASP8_03 | CASP8_03 | TGFB1_02 | TGFB1_02 | TGFB1_02 | IL1B_01 | CASP8_03 | IL1B_01 |
| | GSTT1_02 | GSTT1_02 | PTGS2_11 | PTGS2_11 | PTGS2_11 | IL1B_03 | GSTT1_02 | IL1B_03 |
| 3 | MTRR_01 | ESR1_03 | TGFB1_02 | TGFB1_02 | TGFB1_02 | IL1B_01 | IL4_01 | IL4_01 |
| | IL1B_03 | GPX1_03 | CDC25A_02 | CDC25A_02 | CDC25A_02 | IL1B_03 | MTRR_01 | MTRR_01 |
| | SOD2_01 | GSTT1_02 | PTGS2_11 | PTGS2_11 | PTGS2_11 | SOD2_01 | DIO1_04 | DIO1_04 |
| 4 | race | race | CHEK1_02 | CHEK1_02 | CHEK1_02 | WRN_03 | IL4R_02 | ESR1_03 |
| | IL1B_01 | IL1B_01 | TGFB1_02 | TGFB1_02 | TGFB1_02 | IL5_02 | SOD2_01 | GPX1_06 |
| | IL1B_03 | IL1B_03 | TNF_02 | TNF_02 | TNF_02 | IL10_02 | GSTT1_02 | CYP19A1_06 |
| | MTHFR_02 | MTHFR_02 | CDC25A_02 | CDC25A_02 | CDC25A_02 | CYP19A1_06 | CYP19A1_09 | GSTT1_02 |
a Results were generated using cell counts to assign case versus control status (IFRACT = 0); the training data were excluded from scoring, and ten 10-fold cross-validations were run for functions 1–5 (ITRAIN = 0, FRACT = 0.1, NTIME = 10).
b Total score is the summation over all scoring functions after linearly scaling the score for each individual function such that the top score is 50.0.
Top 10 most frequently observed SNP pairs in the 100 highest-scoring triplet SNP combinations associated with colon cancer^a.
| SNP 1 | SNP 2 | %Correct | Sensitivity + Specificity | PPV + NPV | Risk Ratio | Odds Ratio | Gini Index | Probability Difference | Total |
| IL1B_01 | IL1B_03 | 9 | 7 | 0 | 0 | 1 | 71 | 3 | 91 |
| CASP8_03 | GSTT1_02 | 20 | 21 | 4 | 0 | 4 | 0 | 3 | 52 |
| CHEK1_02 | TGFB1_02 | 0 | 0 | 14 | 17 | 14 | 0 | 0 | 45 |
| MTRR_01 | SOD2_01 | 9 | 9 | 3 | 0 | 3 | 4 | 11 | 39 |
| CDC25A_02 | PTGS2_11 | 0 | 0 | 10 | 12 | 10 | 0 | 0 | 32 |
| CHEK1_02 | CDC25A_02 | 0 | 0 | 10 | 11 | 10 | 0 | 0 | 31 |
| CHEK1_02 | PTGS2_11 | 0 | 0 | 8 | 8 | 8 | 0 | 0 | 24 |
| CHEK1_02 | ALOX5_07 | 0 | 0 | 7 | 7 | 7 | 0 | 0 | 21 |
| IL4_01 | XRCC1_1 | 4 | 7 | 3 | 0 | 3 | 1 | 2 | 20 |
| MTRR_01 | DIO1_04 | 3 | 3 | 1 | 0 | 1 | 3 | 9 | 20 |
a The top 100 triplet SNP combinations were generated using cell counts to assign case versus control status (IFRACT = 0). The training data were excluded from scoring, and ten 10-fold cross-validations were run for functions 1–5 (ITRAIN = 0, FRACT = 0.1, NTIME = 10). If SNPs were randomly assigned to the top 100 triplet models, a given pair would be expected to be observed 3.3 times overall.