Rosanna Upstill-Goddard, Diana Eccles, Sarah Ennis, Sajjad Rafiq, William Tapper, Joerg Fliege, Andrew Collins.
Abstract
Two major breast cancer sub-types are defined by the expression of estrogen receptors on tumour cells. Cancers with large numbers of receptors are termed estrogen receptor positive and those with few are estrogen receptor negative. Using genome-wide single nucleotide polymorphism genotype data for a sample of early-onset breast cancer patients we developed a Support Vector Machine (SVM) classifier from 200 germline variants associated with estrogen receptor status (p<0.0005). Using a linear kernel Support Vector Machine, we achieved classification accuracy exceeding 93%. The model indicates that polygenic variation in more than 100 genes is likely to underlie the estrogen receptor phenotype in early-onset breast cancer. Functional classification of the genes involved identifies enrichment of functions linked to the immune system, which is consistent with the current understanding of the biological role of estrogen receptors in breast cancer.
Year: 2013 PMID: 23894323 PMCID: PMC3716652 DOI: 10.1371/journal.pone.0068606
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Weka kernels and classification results using 200 SNPs with the strongest ER+/− association.
| Kernel type | Percentage correctly classified | True positive rate | False positive rate | True negative rate | False negative rate | Area under ROC |
| Linear | 93.28±3.07 | 0.88±0.07 | 0.04±0.03 | 0.96±0.03 | 0.12±0.07 | 0.92±0.04 |
| Normalized quadratic polynomial | 95.69±2.69 | 0.89±0.08 | 0.01±0.02 | 0.99±0.02 | 0.11±0.08 | 0.94±0.04 |
| Quadratic Polynomial | 93.89±3.06 | 0.89±0.07 | 0.04±0.03 | 0.96±0.03 | 0.11±0.07 | 0.93±0.04 |
| Cubic Polynomial | 94.54±2.94 | 0.89±0.07 | 0.03±0.03 | 0.97±0.03 | 0.11±0.07 | 0.93±0.04 |
| RBF | 95.95±2.61 | 0.89±0.07 | 0.01±0.02 | 0.99±0.02 | 0.11±0.07 | 0.94±0.04 |
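The kernel families named in the table rows can be written down in a few lines. The following is a minimal sketch using standard textbook forms, not a reproduction of the Weka implementation: the genotype encoding (minor-allele counts 0, 1, 2), the homogeneous form of the polynomial kernel, and the `gamma` value are all assumptions, since the record does not state the parameters used.

```python
import math

def dot(x, y):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def poly_kernel(x, y, degree=2):
    # Homogeneous polynomial kernel; degree=2 is quadratic, degree=3 is cubic.
    return dot(x, y) ** degree

def normalized_poly_kernel(x, y, degree=2):
    # Polynomial kernel rescaled so that K(x, x) == 1 for any non-zero x.
    denom = math.sqrt(poly_kernel(x, x, degree) * poly_kernel(y, y, degree))
    return poly_kernel(x, y, degree) / denom if denom else 0.0

def rbf_kernel(x, y, gamma=0.01):
    # Gaussian radial basis function; gamma here is an arbitrary illustrative value.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Hypothetical samples: minor-allele counts at five SNPs (assumed encoding).
sample_a = [0, 1, 2, 1, 0]
sample_b = [1, 1, 2, 0, 0]

for name, kernel in [
    ("linear", linear_kernel),
    ("quadratic", poly_kernel),
    ("normalized quadratic", normalized_poly_kernel),
    ("rbf", rbf_kernel),
]:
    print(name, kernel(sample_a, sample_b))
```

The normalized polynomial and RBF kernels both return 1 when a sample is compared with itself, so they measure similarity on a bounded scale regardless of how many non-zero genotypes a sample carries.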
Figure 1. ROC curves for ER+ and ER− classification using linear and RBF kernels.
ROC curves and area under the ROC curve (AUC) values provide more robust measures of classifier performance than overall classification accuracy alone. (A) ROC curves for ER+ classification. (B) ROC curves for ER− classification. In both cases the linear model is represented by a dashed line and the RBF kernel model is represented by a solid line. The point on each curve corresponds to the true positive/negative and false positive/negative values obtained from 100 iterations of 10-fold cross-validation carried out on 542 samples with 200 SNP features. The ROC curve for any meaningful classifier must lie above the y = x line, which represents the case where equal proportions of cases are classified correctly and incorrectly, as would occur if class values were assigned at random.
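To make the quantities in this caption concrete, the sketch below computes the rates that locate a single point on a ROC curve and estimates AUC as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The counts and scores are hypothetical and the function names are illustrative; none of them come from the study.

```python
def confusion_rates(tp, fn, fp, tn):
    """Rates derived from a 2x2 confusion matrix of counts."""
    positives, negatives = tp + fn, fp + tn
    return {
        "accuracy": (tp + tn) / (positives + negatives),
        "tpr": tp / positives,   # true positive rate (sensitivity)
        "fnr": fn / positives,   # false negative rate = 1 - tpr
        "fpr": fp / negatives,   # false positive rate = 1 - tnr
        "tnr": tn / negatives,   # true negative rate (specificity)
    }

def auc_from_scores(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical confusion matrix: the ROC point is (fpr, tpr).
rates = confusion_rates(tp=45, fn=5, fp=10, tn=40)
roc_point = (rates["fpr"], rates["tpr"])

# Hypothetical classifier scores; label 1 marks the positive class.
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = auc_from_scores(scores, labels)

# A classifier that gives every sample the same score sits on the y = x line.
uninformative_auc = auc_from_scores([0.5] * 6, labels)

print(roc_point, auc, uninformative_auc)
```

An AUC of 0.5 corresponds to the y = x diagonal, which is why a useful classifier needs a curve above that line.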
Figure 2. Relationship between weights under a linear classifier and chi-square values used in feature selection.
SVM models were constructed on 542 study samples with genotype data for a subset of 200 SNPs chosen based on ER+/− association, determined from the chi-square statistic. SNP feature weights were obtained from the linear SVM model and used as an indicator of the importance of each feature for classification; SNPs with the largest absolute weight values are the most important for classification. Chi-square values used in feature selection and SVM classifier weight values are uncorrelated (Pearson’s correlation coefficient r = −0.026). SNPs with absolute weight values > 0.5 are annotated with the name of the gene in which they reside or to which they lie closest.
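The two statistics mentioned in this caption can be sketched as follows. The example assumes the chi-square test is applied to a 2×3 table of ER status by genotype; the record does not say whether genotype or allele counts were used, and all numbers below are hypothetical.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical genotype counts (columns: AA, Aa, aa) for one SNP.
snp_table = [
    [30, 50, 20],  # ER+ samples
    [50, 40, 10],  # ER- samples
]
snp_chi2 = chi_square(snp_table)

# Hypothetical per-SNP chi-square values and linear SVM weights.
chi2_values = [9.4, 12.1, 15.3, 13.0, 18.7]
svm_weights = [0.61, -0.12, 0.05, -0.48, 0.20]
r = pearson_r(chi2_values, svm_weights)

print(round(snp_chi2, 3), round(r, 3))
```

A correlation near zero, as reported in the caption, would mean that a SNP's rank in univariate feature selection says little about how heavily the multivariate classifier relies on it.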
DAVID Annotation Clusters: Enriched gene ontology (GO) terms from the ER+/− classification.
Cluster 1: Enrichment Score: 1.97 (GO: Biological Process)
| Term | No. genes | % genes | P value | Fold Enrichment |
| calcium ion transport | 7 | 6.03 | 0.00018 | 8.34 |
| T cell proliferation | 4 | 3.45 | 0.00051 | 25.05 |
| di-, tri-valent inorganic cation transport | 7 | 6.03 | 0.00056 | 6.73 |
| T cell activation | 6 | 5.17 | 0.00084 | 8.05 |
| lymphocyte proliferation | 4 | 3.45 | 0.00187 | 16.10 |
| leukocyte proliferation | 4 | 3.45 | 0.00214 | 15.37 |
| mononuclear cell proliferation | 4 | 3.45 | 0.00214 | 15.37 |
| lymphocyte activation | 6 | 5.17 | 0.00613 | 5.09 |
| positive regulation of immune system process | 6 | 5.17 | 0.01269 | 4.26 |
| leukocyte activation | 6 | 5.17 | 0.01356 | 4.19 |
| cell proliferation | 8 | 6.90 | 0.01360 | 3.10 |
| response to abiotic stimulus | 7 | 6.03 | 0.02048 | 3.22 |
| cell activation | 6 | 5.17 | 0.02619 | 3.54 |
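The "% genes" and "Fold Enrichment" columns follow from simple ratios. The sketch below assumes a submitted list of 116 genes, a figure inferred from the table (7 genes corresponding to 6.03%) rather than stated in the record, and uses the usual definition of fold enrichment as the proportion of list genes carrying a term divided by the proportion of background genes carrying it. The background counts in the example are hypothetical.

```python
def percent_of_list(hits, list_size):
    """Percentage of the submitted gene list annotated with a term."""
    return 100.0 * hits / list_size

def fold_enrichment(list_hits, list_size, background_hits, background_size):
    """Ratio of the term's frequency in the list to its frequency in the background."""
    return (list_hits / list_size) / (background_hits / background_size)

# Inferred, not stated in the source: 7 / 116 rounds to 6.03%.
LIST_SIZE = 116

# Gene counts taken from the Cluster 1 table.
term_counts = {
    "calcium ion transport": 7,
    "T cell proliferation": 4,
    "T cell activation": 6,
    "cell proliferation": 8,
}
percentages = {
    term: round(percent_of_list(count, LIST_SIZE), 2)
    for term, count in term_counts.items()
}

# Hypothetical background: 100 of 10,000 genes annotated with the term.
example_fold = fold_enrichment(7, LIST_SIZE, 100, 10000)

print(percentages, round(example_fold, 2))
```

With the assumed list size, the computed percentages match the values tabulated for these four terms, which supports the inference but does not confirm it.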
Significant enrichment of genes in KEGG pathway identified by DAVID.
| Pathway | Genes | P value |
| Axon guidance | EPHA4, FYN, NRP1, NTN4, PPP3CA | 0.007 |
| T cell receptor signalling pathway | FYN, IL5, PPP3CA, PTPRC | 0.027 |
| Fc epsilon RI signalling pathway | FYN, IL5, MAP2K4 | 0.081 |