Charalampos S Floudas, Jeya Balaji Balasubramanian, Marjorie Romkes, Vanathi Gopalakrishnan.
Abstract
Technology is constantly evolving, necessitating the development of workflows for efficient use of high-dimensional data. We develop and test an empirical workflow for predictive modeling based on single nucleotide polymorphisms (SNPs) from genome-wide association study (GWAS) datasets. To this end, we use as a case study the SNP-based prediction of survival for non-small cell lung cancer (NSCLC) with a Bayesian rule learner system (BRL+). Lung cancer is a leading cause of mortality. Standard treatment for early stages of NSCLC is surgery. Adjuvant chemotherapy would be beneficial for patients with early recurrence; consequently, we need models capable of predicting such recurrence. The workflow outlines the challenges involved in processing GWAS datasets from one popular platform (Affymetrix®), from the results files of the hybridization experiment through to model construction. Our results show that the workflow is feasible and efficient for processing such data, and that it yields SNP-based models with high predictive accuracy over cross-validation.
Year: 2013 PMID: 24303297 PMCID: PMC3814469
Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc
Figure 1. The SNPR workflow for SNP-based Prediction with Bayesian Rule Learning. Column “Process & Platform” refers to the phase and the platform (array/software) used. Column “Data format” refers to the format of the file that contains the data. Rectangular boxes: processing steps; parallelogram boxes: input/output. The dotted horizontal lines group steps that occur within the same platform.
Performance measures across 5-fold cross-validation for the three rule-based classifiers in the BRL+ module: Rule Learner (RL), Global Bayesian Rule Learner, and Local Bayesian Rule Learner - Decision Tree (DT). The average and standard error values are computed over the five folds.
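| Accuracy | Balanced accuracy | Sensitivity | Specificity | Number of variables selected |
| --- | --- | --- | --- | --- |
| 90.91 (incl. abstentions) | 86.67 ± 9.71 | 100 ± 0.0 | 73.33 ± 19.44 | 14 |
| 80.30 | 70 ± 6.45 | 90 ± 0.0 | 50 ± 12.91 | 3 |
| 83.33 | 80.67 ± 9.11 | 88 ± 3.74 | 73.33 ± 19.44 | 4 |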
Accuracy: computed from the confusion matrix obtained at the end of the 5-fold cross-validation;
balanced accuracy: mean of sensitivity and specificity, with standard error;
sensitivity: average sensitivity and standard error for predicting the “poor survival” class correctly;
specificity: average specificity and standard error for predicting the “good survival” class correctly;
number of variables selected: total number of attributes that compose the rule model;
abstentions: Rule Learner abstains from classifying an instance when it fails to match any of the learnt rules. Rule Learner treats this as an incorrect classification.
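The definitions above can be illustrated with a minimal Python sketch. It is not the BRL+ implementation: the function names `fold_metrics` and `mean_and_se`, the class labels `"poor"` and `"good"`, and the use of `None` to mark an abstention are assumptions made for illustration, and the standard error is assumed to be the sample standard deviation across folds divided by the square root of the number of folds.

```python
from math import sqrt
from statistics import mean, stdev


def fold_metrics(y_true, y_pred, positive="poor", negative="good"):
    """Return (sensitivity, specificity, balanced accuracy) for one fold.

    A prediction of None stands for an abstention (hypothetical encoding)
    and is counted as an incorrect classification.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:  # wrong label or abstention
                fn += 1
        else:
            if pred == negative:
                tn += 1
            else:  # wrong label or abstention
                fp += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("each fold must contain both classes")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy


def mean_and_se(values):
    """Mean and standard error (sample stdev / sqrt(n)) over folds."""
    return mean(values), stdev(values) / sqrt(len(values))


# Toy fold, invented for illustration only (not data from the study).
y_true = ["poor", "poor", "good", "good"]
y_pred = ["poor", None, "good", "poor"]
sens, spec, bal = fold_metrics(y_true, y_pred)
```

Calling `mean_and_se` on the per-fold sensitivities (or specificities, or balanced accuracies) gives a pair in the same "average ± standard error" form as the table entries.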
Functional analysis of the 33 genes mapped from the 100 SNPs of the feature selection stage.
| Cancer | 6.75 ×10−4 − 4.27 ×10−2 | 9 |
| Inflammatory Response | 6.75 ×10−4 − 4.43 ×10−2 | 6 |
| Cardiovascular Disease | 1.61 ×10−3 − 4.74 ×10−2 | 5 |
| Cell-To-Cell Signaling and Interaction | 2.49 ×10−5 − 4.74 ×10−2 | 8 |
| Cell Morphology | 1.36 ×10−4 − 4.58 ×10−2 | 10 |
| Cellular Movement | 3.59 ×10−4 − 4.89 ×10−2 | 10 |