Zena M Hira, Duncan F Gillies.
Abstract
In order to provide the most effective therapy for cancer, it is important to be able to determine whether a patient's cancer will respond to a proposed treatment. Methylation profiling could contain information from which such predictions could be made. Currently, hypothesis testing is used to determine whether possible biomarkers for cancer progression produce statistically significant results. However, this approach requires the identification of individual genes, or sets of genes, as candidate hypotheses, and with the increasing size of modern microarrays, this task is becoming progressively harder. Exhaustive testing of small sets of genes is computationally infeasible, and so hypothesis generation depends either on the use of established biological knowledge or on heuristic methods. As an alternative, machine learning methods can be used to identify groups of genes that are acting together within sets of cancer data and associate their behaviors with cancer progression. These methods have the advantage of being multivariate and unbiased, but unfortunately they also rapidly become computationally infeasible as the number of gene probes and datasets increases. To address this problem, we have investigated a way of utilizing prior knowledge to segment microarray datasets in such a way that machine learning can be used to identify candidate sets of genes for hypothesis testing. A methylation dataset is divided into subsets, where each subset contains only the probes that relate to a known gene pathway. Each of these pathway subsets is used independently for classification. The classification method is AdaBoost with decision trees as weak classifiers. Since each pathway subset contains a relatively small number of gene probes, it is possible to train a classifier on it, test its accuracy quickly, and determine whether the subset has valuable diagnostic information. Finally, genes from successful pathway subsets can be combined to create a classifier of high accuracy.
Keywords: cancer progression; machine learning; methylation profiling
Year: 2016 PMID: 27688706 PMCID: PMC5030825 DOI: 10.4137/CIN.S39859
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Figure 1. Pathway algorithm: in the first step, the original methylation dataset is split into several smaller subsets in which all the genes of one subset belong to one pathway in the ConsensusPath database. AdaBoost is applied to the subsets to build classifiers for disease progression. The classification accuracy of each subset is calculated using stratified cross-validation to account for unbalanced classes. Randomly picked subsets of the probes in the original dataset are created so that the pathway sets with the highest accuracies can be tested for significance using z-scores and P values.
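The steps in this caption can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the pathway-to-probe mapping (`pathways`) and the helper names (`cv_accuracy`, `score_pathway`) are hypothetical, scikit-learn is assumed to be available, and the z-score is assumed to compare a pathway's accuracy with the accuracies of random probe subsets of the same size.

```python
import math

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def cv_accuracy(X, y, n_splits=5, seed=0):
    """Mean stratified k-fold accuracy of AdaBoost on the given probe columns."""
    # scikit-learn's default weak learner for AdaBoost is a depth-1 decision tree.
    clf = AdaBoostClassifier(n_estimators=20, random_state=seed)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return float(cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean())


def score_pathway(X, y, probe_idx, n_random=10, seed=0):
    """Return (accuracy, z-score, one-tailed p) for one pathway subset."""
    rng = np.random.default_rng(seed)
    acc = cv_accuracy(X[:, probe_idx], y)
    # Null distribution (assumed design): random probe subsets of equal size.
    null = np.array([
        cv_accuracy(X[:, rng.choice(X.shape[1], size=len(probe_idx), replace=False)], y)
        for _ in range(n_random)
    ])
    sd = null.std(ddof=1)
    z = (acc - null.mean()) / sd if sd > 0 else float("nan")
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return acc, z, p


# Synthetic stand-in for a methylation matrix: 60 samples, 40 probes,
# unbalanced classes, with signal placed in probes 0-4 only.
rng = np.random.default_rng(1)
y = np.array([0] * 40 + [1] * 20)
X = rng.random((60, 40))
X[y == 1, :5] += 0.8

# Hypothetical pathway-to-probe mapping; in practice this would come from a pathway database.
pathways = {"pathway_signal": [0, 1, 2, 3, 4], "pathway_noise": [20, 21, 22, 23, 24]}
results = {name: score_pathway(X, y, idx) for name, idx in pathways.items()}
for name, (acc, z, p) in results.items():
    print(f"{name}: accuracy={acc:.3f} z={z:.2f} p={p:.4g}")
```

Because each pathway subset is scored independently, the loop over `pathways` parallelizes trivially, which is the practical benefit of segmenting the dataset.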
Notes: 1. http://globocan.iarc.fr/Default.aspx. 6. https://www.etriks.org/.
Table 1. Accuracies obtained using the complete datasets with linear (PCA) and nonlinear (Isomap) forms of dimensionality reduction and AdaBoost.
| DATASET | ACCURACY | VARIANCE |
|---|---|---|
| CML with PCA | 0.6044 | 0.0222 |
| CML with Isomap | 0.5155 | 0.0159 |
| LGG with PCA | 0.7083 | 0.0177 |
| LGG with Isomap | 0.6347 | 0.0289 |
Note: The results are not significant.
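A baseline of this kind, dimensionality reduction on the complete dataset followed by AdaBoost, might look like the sketch below. It assumes scikit-learn and uses synthetic data; the number of components, the neighbourhood size, and the fold count are illustrative choices rather than values taken from the paper, and whether the reported variance is a population or sample variance is not stated in the source.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.manifold import Isomap
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a complete dataset: 60 samples, 200 probes.
rng = np.random.default_rng(0)
y = np.array([0] * 30 + [1] * 30)
X = rng.random((60, 200))
X[y == 1, :10] += 0.5  # weak signal in a handful of probes

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
reducers = {
    "PCA": PCA(n_components=5),                        # linear reduction
    "Isomap": Isomap(n_neighbors=8, n_components=5),   # nonlinear reduction
}

summary = {}
for name, reducer in reducers.items():
    # The pipeline refits the reducer inside each training fold,
    # so the held-out fold never influences the embedding.
    model = make_pipeline(reducer, AdaBoostClassifier(n_estimators=25, random_state=0))
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    summary[name] = (float(scores.mean()), float(scores.var()))

for name, (mean_acc, var_acc) in summary.items():
    print(f"{name}: accuracy={mean_acc:.4f} variance={var_acc:.4f}")
```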
Figure 2. ROC curve for the prediction of disease progression using the complete LGG dataset.
Figure 3. ROC curve for the prediction of disease progression using the complete CML dataset.
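For readers unfamiliar with how such ROC curves are built, the sketch below computes ROC points and the area under the curve from classifier scores using only the Python standard library. The labels and scores are made-up illustrative values, and the function names `roc_points` and `auc_trapezoid` are hypothetical; both classes are assumed to be present.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) points, sweeping the threshold downwards."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]
    prev = None
    for i in order:
        # Emit a point only when the score changes, so tied scores share one threshold.
        if prev is not None and scores[i] != prev:
            points.append((fp / neg, tp / pos))
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        prev = scores[i]
    points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Illustrative labels (1 = progression) and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
auc = auc_trapezoid(points)
print(points, auc)
```

An AUC of 0.5 corresponds to the diagonal of the ROC plot (chance-level ranking), and 1.0 to perfect separation of the two classes.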
Table 2. Best performing pathway sets for LGG and their accuracies and variances using 10-fold stratified cross-validation.
| PATHWAY | ACCURACY | VARIANCE | Z-SCORE | P VALUE |
|---|---|---|---|---|
| A | 0.904 | 0.0191 | 3.9713 | 0.000036 |
| B | 0.891 | 0.0163 | 3.68249905 | 0.000116 |
| C | 0.890 | 0.0070 | 3.65034371 | 0.000131 |
| D | 0.879 | 0.0056 | 3.39268741 | 0.000346 |
Abbreviations: Pathway A, pantothenate and CoA biosynthesis; pathway B, transcription factor creb; pathway C, pyrimidine metabolism; pathway D, IL2.
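The final two columns of the table above appear to be z-scores and P values, and the P values are consistent with the upper tail of a standard normal distribution. The short check below makes that relationship explicit using only the standard library; the one-tailed interpretation is an inference from the numbers, not something the source states, and `upper_tail_p` is a hypothetical helper name.

```python
import math


def upper_tail_p(z):
    """One-tailed P value: probability that a standard normal variable exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# (z-score, reported P value) pairs copied from the LGG table above.
lgg_rows = {
    "A": (3.9713, 0.000036),
    "B": (3.68249905, 0.000116),
    "C": (3.65034371, 0.000131),
    "D": (3.39268741, 0.000346),
}

p_values = {name: upper_tail_p(z) for name, (z, _) in lgg_rows.items()}
for name, (z, reported) in lgg_rows.items():
    print(f"pathway {name}: z={z:.4f} computed p={p_values[name]:.6f} reported p={reported:.6f}")
```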
Figure 4. ROC curves for the four pathway sets with the highest accuracy on the LGG dataset.
Figure 5. Comparison between pantothenate and CoA biosynthesis and retinoate biosynthesis II pathway sets.
Figure 6. Comparison between pantothenate and CoA biosynthesis and activation of Rac pathway sets.
Table 3. Accuracy results for logistic regression on the best LGG pathway sets.
| PATHWAY | LOGISTIC REGRESSION |
|---|---|
| A | 0.708 |
| B | 0.697 |
| C | 0.650 |
| D | 0.674 |
Note: The pathways are the same as in Table 2.
Figure 7. The gene selection algorithm, based on accuracy thresholds and on how important each feature is when constructing the decision tree.
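The caption gives only the outline of the selection step, so the following is a loose sketch of one way such a rule could work, not the published algorithm. It assumes each pathway subset has already been scored, keeps only pathways whose accuracy meets a threshold, and then keeps genes whose feature importance meets a second threshold. The thresholds, the gene and pathway names, and the function `select_genes` are all hypothetical; in scikit-learn, importances of this kind could be read from a fitted `AdaBoostClassifier` via its `feature_importances_` attribute.

```python
def select_genes(pathway_results, acc_threshold=0.85, importance_threshold=0.05):
    """Pool important genes from pathway subsets that classify well.

    pathway_results maps pathway name -> (accuracy, {gene: importance}).
    Returns gene names sorted by their highest importance across kept pathways.
    """
    selected = {}
    for pathway, (accuracy, importances) in pathway_results.items():
        if accuracy < acc_threshold:
            continue  # discard pathway subsets that do not classify well enough
        for gene, importance in importances.items():
            if importance >= importance_threshold:
                # A gene can belong to several pathways; keep its best importance.
                selected[gene] = max(importance, selected.get(gene, 0.0))
    return sorted(selected, key=lambda g: (-selected[g], g))


# Placeholder inputs for illustration only; not results from the paper.
pathway_results = {
    "pathway_1": (0.90, {"gene_a": 0.40, "gene_b": 0.02, "gene_c": 0.10}),
    "pathway_2": (0.60, {"gene_d": 0.90}),
    "pathway_3": (0.88, {"gene_c": 0.30, "gene_e": 0.05}),
}
chosen = select_genes(pathway_results)
print(chosen)
```

In this toy input, `gene_d` is dropped despite its high importance because its pathway fails the accuracy threshold, which is the sense in which the accuracy filter acts before the importance filter.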
Table 4. The most discriminant genes for the LGG dataset.
| SYMBOL | FUNCTIONAL ANNOTATION |
|---|---|
| DDOST | Dolichyl-Diphosphooligosaccharide |
| PRKAR2B | protein kinase, cAMP-dependent |
| PDPK1 | 3-phosphoinositide dependent protein kinase 1 |
| | phosphatidylinositol-4,5-bisphosphate 3-kinase |
| | cell division cycle 16 |
| OAT | ornithine aminotransferase |
| | Kirsten rat sarcoma viral oncogene homolog |
| | neurotrophic tyrosine kinase, receptor, type 1 |
| NF1 | neurofibromin 1 |
| BTRC | beta-transducin repeat containing E3 ubiquitin protein ligase |
| | phosphoinositide-3-kinase, regulatory subunit 3 (gamma) |
| KCNMB4 | potassium large conductance calcium-activated channel, Mβ4 |
| IFNGR1 | interferon gamma receptor 1 |
| SC5DL | sterol-C5-desaturase |
| | activating transcription factor 2 |
| GABRB2 | gamma-aminobutyric acid (GABA) A receptor, beta 2 |
| | syntaxin 1A (brain) |
| GPX4 | glutathione peroxidase 4 |
| GAB2 | GRB2-associated binding protein 2 |
| EIF2AK1 | eukaryotic translation initiation factor 2-alpha kinase 1 |
| SOS1 | son of sevenless homolog 1 (Drosophila) |
| EXOC6 | exocyst complex component 6 |
| | insulin receptor substrate 1 |
| ANK1 | ankyrin 1, erythrocytic 2 |
| IL6R | interleukin 6 receptor |
| NRCAM | neuronal cell adhesion molecule |
| SLC22A2 | solute carrier family 22 (organic cation transporter), member 2 |
| PPCDC | phosphopantothenoylcysteine decarboxylase |
| UPB1 | ureidopropionase, beta |
| PTK2B | protein tyrosine kinase 2 beta |
| ITGA2 | integrin, alpha 2 (CD49B, alpha 2 subunit of VLA-2 receptor) |
| | signal transducer and activator of transcription 3 |
| SLCO4A1 | solute carrier organic anion transporter family, member 4A1 |
| SLCO2A1 | solute carrier organic anion transporter family, member 2A1 |
Table 5. Best performing pathway sets for CML and their accuracies and variances after 10-fold stratified cross-validation.
| PATHWAY | ACCURACY | VARIANCE | Z-SCORE | P VALUE |
|---|---|---|---|---|
| A | 0.9888 | 0.0011 | 6.44028444 | <0.00001 |
| B | 0.9888 | 0.0011 | 6.44028444 | <0.00001 |
| C | 0.8244 | 0.0176 | 2.11346295 | 0.0176 |
Abbreviations: Pathway A, regulation of KIT signaling; pathway B, signaling events mediated by stem cell factor receptor (c-Kit); pathway C, superpathway of D-myo-inositol(1,4,5)-trisphosphate metabolism.
Figure 8. ROC curve for the regulation of KIT signaling pathway set.
Figure 9. Comparison between regulation of KIT signaling and arrestins in GPCR desensitization pathway sets.
Figure 10. Comparison between regulation of KIT signaling and NF-kappa B signaling – Homo sapiens pathway sets.
Figure 11. Comparison between regulation of KIT signaling and acetylcholine synthesis pathway sets.
Table 6. Accuracy results for logistic regression applied to CML pathway sets.
| PATHWAY | LOGISTIC REGRESSION |
|---|---|
| A | 0.703 |
| B | 0.693 |
| C | 0.682 |
Note: The pathways are defined in Table 5.
Table 7. The most discriminative genes for the CML data.
| GENE NAME | FUNCTIONAL ANNOTATION |
|---|---|
| Polyphosphate-5-phosphatase, 40 kDa | |
| Inositol polyphosphate-5-phosphatase, 75 kDa | |
| Inositol monophosphatase domain containing 1 | |
| Inositol polyphosphate-1-phosphatase | |
| Inositol polyphosphate-5-phosphatase J | |
| Inositol 1,4,5-trisphosphate 3-kinase B | |
| SH2B adaptor protein 3 | |
| Synaptojanin 2 | |