Mohsen Hajiloo, Babak Damavandi, Metanat Hooshsadat, Farzad Sangi, John R Mackey, Carol E Cass, Russell Greiner, Sambasivarao Damaraju.
Abstract
BACKGROUND: This paper introduces and applies a genome wide predictive study to learn a model that predicts whether a new subject will develop breast cancer, based on her SNP profile.
Year: 2013 PMID: 24266904 PMCID: PMC3891310 DOI: 10.1186/1471-2105-14-S13-S3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. A schema of a genome wide predictive study (GWPS). Given a labelled training dataset of subjects, each described by a genome wide scan of SNPs, feature selection and learning methods are applied to learn a classifier that can predict the labels of a set of novel subjects.
Confusion matrix for comparison of actual and predicted labels on 623 breast cancer study subjects
| Actual Label | Predicted Case | Predicted Control |
|---|---|---|
| Case | 187 | 115 |
| Control | 137 | 184 |
Accuracy = (TP+TN)/(TP+FP+TN+FN) = 59.55%; Precision = TP/(TP+FP) = 57.72%;
Recall/Sensitivity = TP/(TP+FN) = 61.92%; Specificity = TN/(TN+FP) = 57.32%.
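The four formulas above can be checked directly against the confusion-matrix counts. The following is a minimal sketch; the function name `confusion_metrics` is hypothetical, and the assignment of counts (TP=187, FN=115, FP=137, TN=184, i.e. rows are actual labels and columns are predicted labels) is an assumption inferred from the reported recall and specificity.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return accuracy, precision, recall and specificity as fractions."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Counts for the 623 study subjects.
# Assumption: rows of the table are actual labels, columns are predicted labels.
m = confusion_metrics(tp=187, fn=115, fp=137, tn=184)
for name, value in m.items():
    print(f"{name}: {value:.2%}")
```

Running the same function on the counts of any other confusion matrix gives a quick consistency check between a table and the metrics quoted beneath it.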
Figure 2. Accuracy of a hundred "Permute, Learn, and Evaluate" instances. The accuracies obtained in 100 random permutation tests. None of these accuracies exceeded the 59.55% accuracy of our model, which means that our result is significantly better than the baseline, with a confidence of more than 99%.
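A permutation test of this kind can be sketched as follows: shuffle the labels, rerun the whole learn-and-evaluate procedure, and count how often the permuted accuracy reaches the observed one. This is an illustrative sketch, not the authors' pipeline. The names `permutation_test` and `loo_nearest_mean_accuracy` are hypothetical, the toy one-feature data are invented for the demonstration, and a leave-one-out nearest-class-mean classifier stands in for the real learner. With none of 100 permuted accuracies reaching the observed value, the add-one estimate used here gives p = 1/101 ≈ 0.0099, which is in line with a confidence above 99%.

```python
import random

def loo_nearest_mean_accuracy(xs, labels):
    """Leave-one-out accuracy of a nearest-class-mean classifier (toy stand-in learner)."""
    correct = 0
    for i, (x_i, y_i) in enumerate(zip(xs, labels)):
        means = {}
        for cls in (0, 1):
            vals = [x for j, (x, y) in enumerate(zip(xs, labels)) if j != i and y == cls]
            means[cls] = sum(vals) / len(vals)
        pred = min(means, key=lambda c: abs(x_i - means[c]))
        correct += (pred == y_i)
    return correct / len(xs)

def permutation_test(xs, labels, evaluate, n_permutations=100, seed=0):
    """Return (observed accuracy, permuted accuracies, empirical p-value)."""
    rng = random.Random(seed)
    observed = evaluate(xs, labels)
    permuted = []
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # break any link between features and labels
        permuted.append(evaluate(xs, shuffled))
    n_at_least = sum(acc >= observed for acc in permuted)
    # Add-one estimate so the p-value is never exactly zero.
    p_value = (n_at_least + 1) / (n_permutations + 1)
    return observed, permuted, p_value

# Toy data (invented): one feature that separates the two classes.
xs = [0.1 * i for i in range(20)]
labels = [0] * 10 + [1] * 10
observed, permuted, p_value = permutation_test(xs, labels, loo_nearest_mean_accuracy)
print(observed, max(permuted), p_value)
```

The important design point is that the entire procedure, including any feature selection, is repeated on each permuted label set, so the permuted accuracies reflect what the pipeline achieves when there is no real signal.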
Figure 3. Accuracy of the BestKNN algorithm for different numbers of MeanDiff selected SNPs. Accuracy of the classifiers built using BestKNN on sets of SNPs with the top {500, 600, ..., 1500} MeanDiff scores. This suggests that our model is fairly robust to the number of MeanDiff-selected SNPs, when selecting more than 500 SNPs.
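Selecting the top-scoring SNPs might look like the sketch below. The scoring rule is an assumption, since this summary does not define MeanDiff: here a SNP's score is taken to be the absolute difference between its mean genotype value among cases and among controls, with genotypes encoded as 0, 1 or 2. The function names and the toy data are hypothetical.

```python
def mean_diff_scores(genotypes, labels):
    """Score each SNP by |mean among cases - mean among controls| (assumed definition)."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    scores = []
    for j in range(len(genotypes[0])):
        mean_case = sum(g[j] for g in cases) / len(cases)
        mean_control = sum(g[j] for g in controls) / len(controls)
        scores.append(abs(mean_case - mean_control))
    return scores

def top_k_snps(genotypes, labels, k):
    """Return the column indices of the k highest-scoring SNPs."""
    scores = mean_diff_scores(genotypes, labels)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

# Toy data (invented): 4 subjects x 3 SNPs, genotypes encoded 0/1/2 (assumption).
genotypes = [[2, 0, 1], [2, 1, 1], [0, 0, 1], [0, 1, 1]]
labels = [1, 1, 0, 0]  # 1 = case, 0 = control
scores = mean_diff_scores(genotypes, labels)
selected = top_k_snps(genotypes, labels, k=1)
print(scores, selected)
```

Sweeping `k` over a range such as 500 to 1500 and re-evaluating the classifier at each value would reproduce the kind of robustness check shown in the figure.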
Confusion matrix for comparison of actual and predicted labels on the 2287 subjects of the CGEMS breast cancer dataset
| Actual Label | Predicted Case | Predicted Control |
|---|---|---|
| Case | 683 | 462 |
| Control | 447 | 695 |
Accuracy = (TP+TN)/(TP+FP+TN+FN) = 60.25%; Precision = TP/(TP+FP) = 60.44%;
Recall/Sensitivity = TP/(TP+FN) = 59.65%; Specificity = TN/(TN+FP) = 60.86%.
Accuracy of a dozen different combinations of feature selection and learning methods
| Learning Method | Information Gain | MeanDiff | mRMR | PCA |
|---|---|---|---|---|
| | 50.88% | 52.06% | 51.20% | 51.69% |
| | 56.17% | 58.71% | 57.78% | 51.36% |
| | 55.37% | 57.30% | 56.18% | 51.84% |
The 10-fold cross-validation accuracies of the combinations of 4 feature selection methods and 3 learning methods show that none of these combinations is more accurate than our suggested combination of MeanDiff500 feature selection and BestKNN learning (59.55%); indeed, several do not even beat the baseline of 51.52%.
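The 51.52% baseline appears consistent with always predicting the majority class: 321 of the 623 subjects are controls, and 321/623 ≈ 51.52%. The sketch below shows a plain 10-fold cross-validation loop with a majority-class predictor plugged in. All function names are hypothetical, the fold assignment is a simple shuffled split rather than the authors' exact protocol, and the case/control counts (302/321) are taken from the row sums of the 623-subject confusion matrix.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(X, y, fit_predict, k=10, seed=0):
    """Pooled accuracy over k folds; fit_predict(X_train, y_train, X_test) -> predictions."""
    correct = 0
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        correct += sum(p == y[i] for p, i in zip(preds, test_idx))
    return correct / len(y)

def majority_baseline(X_train, y_train, X_test):
    """Ignore the features and predict the most common training label."""
    majority = max(set(y_train), key=y_train.count)
    return [majority] * len(X_test)

# 302 cases (1) and 321 controls (0); features are irrelevant to this baseline.
y = [1] * 302 + [0] * 321
X = [[0]] * len(y)
acc = cross_val_accuracy(X, y, majority_baseline)
print(f"majority share: {321 / 623:.2%}, cross-validated baseline: {acc:.2%}")
```

Any feature selection and learning combination can be passed as `fit_predict`. Performing the feature selection inside that function, on the training fold only, keeps the held-out fold from influencing which SNPs are chosen.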
List of breast cancer associated SNPs reported by recent genome wide association studies
| dbSNP ID | Gene | Reference |
|---|---|---|
| rs2981579 | FGFR2 | Hunter et al., 2007 |
| rs2420946 | FGFR2 | Hunter et al., 2007 |
| rs11200014 | FGFR2 | Hunter et al., 2007 |
| rs7696175 | TLR1/TLR6 | Hunter et al., 2007 |
| rs17157903 | RELN | Hunter et al., 2007 |
| rs1219648 | FGFR2 | Hunter et al., 2007 |
| rs3803662 | TNRC9/LOC643714 | Easton et al., 2007 |
| rs889312 | MAP3K1 | Easton et al., 2007 |
| rs13281615 | 8q | Easton et al., 2007 |
| rs3817198 | LSP1 | Easton et al., 2007 |
| rs2981582 | FGFR2 | Easton et al., 2007 |
| rs2075555 | COL1A1 | Murabito et al., 2007 |
| rs1978503 | FLJ45743 | Murabito et al., 2007 |
| rs1926657 | ABCC4 | Murabito et al., 2007 |
| rs13387042 | 2q35 | Stacey et al., 2007 |
| rs3012642 | PHKA/HDAC8 | Gold et al., 2008 |
| rs7203563 | A2BP1 | Gold et al., 2008 |
| rs6569479 | ECHDC1/RNF146 | Gold et al., 2008 |
| rs2180341 | ECHDC1/RNF146 | Gold et al., 2008 |
| rs6569480 | ECHDC1/RNF146 | Gold et al., 2008 |
| rs4415084 | 5p12 | Stacey et al., 2008 |
| rs10941679 | 5p12 | Stacey et al., 2008 |
| rs2067980 | MRPS30 | Thomas et al., 2008 |
| rs7716600 | MRPS30 | Thomas et al., 2008 |
| rs11249433 | 1p11.2 | Thomas et al., 2008 |
| rs999737 | RAD51L1 | Thomas et al., 2008 |
| rs4973768 | SLC4A7 | Ahmed et al., 2009 |
| rs6504950 | STXBP4 | Ahmed et al., 2009 |
28 SNPs identified by the 8 recent genome wide association studies on breast cancer. The accuracy of the classifier learned over these 28 genotyped SNPs was not better than the baseline of 51.52%.