Hyunsu Ju, Allan R Brasier, Alexander Kurosky, Bo Xu, Victor E Reyes, David Y Graham.
Abstract
BACKGROUND: The development of accurate classification models depends upon the methods used to identify the most relevant variables. The aim of this article is to evaluate variable selection methods for identifying important variables when predicting a binary response with nonlinear statistical models. Our goals in model selection include producing stable models that do not overfit, are interpretable, generate accurate predictions, and have minimum bias. This work was motivated by data on clinical and laboratory features of Helicobacter pylori infections obtained from 60 individuals enrolled in a prospective observational study.
Keywords: Amino acid analysis; Classification; Helicobacter pylori; Peptic ulcer disease; Variable selection
Year: 2014 PMID: 25132738 PMCID: PMC4132894 DOI: 10.4172/jpb.1000308
Source DB: PubMed Journal: J Proteomics Bioinform ISSN: 0974-276X
Amino acid measurements in subjects with and without peptic ulcers.
| Characteristic | Without ulcer, n = 30 (50%) | With ulcer, n = 30 (50%) | All subjects, n = 60 |
|---|---|---|---|
| | 0.15 ± 0.04 | 0.19 ± 0.08 | 0.17 ± 0.07 |
| | 1.3 ± 0.49 | 1.43 ± 0.84 | 1.36 ± 0.69 |
| | 42.29 ± 11.41 | 54.17 ± 22.42 | 48.23 ± 18.63 |
| | 1.39 ± 0.58 | 1.46 ± 0.6 | 1.43 ± 0.59 |
| | 1.85 ± 0.48 | 2.02 ± 0.52 | 1.94 ± 0.5 |
| | 2.49 ± 0.63 | 2.69 ± 0.81 | 2.59 ± 0.73 |
| | 7.04 ± 1.78 | 7.92 ± 1.69 | 7.48 ± 1.78 |
| | 3.97 ± 0.64 | 4.64 ± 1.09 | 4.31 ± 0.95 |
| | 5.74 ± 1.53 | 5.77 ± 2.01 | 5.76 ± 1.77 |
| | 0.34 ± 0.1 | 0.54 ± 0.14 | 0.44 ± 0.16 |
| | 0.19 ± 0.07 | 0.21 ± 0.08 | 0.2 ± 0.07 |
| | 2.89 ± 0.69 | 2.94 ± 0.8 | 2.91 ± 0.74 |
| | 0.9 ± 0.25 | 1 ± 0.3 | 0.95 ± 0.28 |
| | 2.59 ± 1.08 | 2.69 ± 1.34 | 2.64 ± 1.21 |
| | 0.88 ± 0.3 | 0.9 ± 0.32 | 0.89 ± 0.31 |
| | 1.25 ± 0.31 | 1.37 ± 0.39 | 1.31 ± 0.36 |
| | 2.95 ± 0.59 | 2.95 ± 0.88 | 2.95 ± 0.74 |
| | 1.09 ± 0.36 | 1.39 ± 0.74 | 1.24 ± 0.6 |
| | 2.95 ± 1 | 3.23 ± 1.7 | 3.09 ± 1.39 |
| | 0.97 ± 0.28 | 0.94 ± 0.31 | 0.95 ± 0.29 |
| | 3.21 ± 1.47 | 3.2 ± 1.34 | 3.21 ± 1.39 |
| | 0.07 ± 0.06 | 0.1 ± 0.09 | 0.09 ± 0.08 |
| | 0.02 ± 0.03 | 0.04 ± 0.07 | 0.03 ± 0.05 |
| | 0.09 ± 0.12 | 0.14 ± 0.17 | 0.12 ± 0.15 |
| | 0.02 ± 0.02 | 0.03 ± 0.03 | 0.02 ± 0.02 |
| | 0.02 ± 0.02 | 0.03 ± 0.06 | 0.03 ± 0.05 |
| | 0.05 ± 0.08 | 0.03 ± 0.05 | 0.04 ± 0.07 |
| | 0.07 ± 0.09 | 0.11 ± 0.18 | 0.09 ± 0.14 |
P<0.05
P<0.001
Sparse penalized logistic regression coefficients.
| Characteristic | LASSO | EN | SCAD |
|---|---|---|---|
| | 0 | 1.204 | 0 |
| | 0 | 0 | 0 |
| | 0 | 0 | 0 |
| | 0 | 0 | 0 |
| | 0 | 0 | 0 |
| | 0 | 0 | 0 |
| | 0 | 0 | 0 |
| | 0 | 0.137 | 0 |
| | 0 | 0 | 0 |
| | 7.38 | 6.649 | 9.092 |
| | 0 | 0 | 0 |
| | 0 | 0 | 0 |
| | 0 | 0 | 0 |
| | 0 | 0 | 0 |
| | 0 | 0 | 0 |
| | 0 | 0 | 0 |
| | 0 | −0.019 | 0 |
| | 0 | 0 | 0 |
| | 0 | 0 | 0 |
| | 0 | −0.545 | 0 |
| | 0 | 0 | 0 |
| | 0 | 0 | 0 |
| | 0 | 0 | 0 |
| | 0 | 0.518 | 0 |
| | 0 | 4.491 | 0 |
| | 0 | 0 | 0 |
| | 0 | 0 | 0 |
| | 0 | 0 | 0 |
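The three penalties named in the table shrink coefficients in different ways. The following is a minimal sketch using the textbook forms of the LASSO, elastic net (EN), and SCAD penalties; it is not the authors' fitting code. The tuning values `lam`, `alpha`, and `a` are illustrative assumptions, since this record does not report the values used.

```python
def lasso_penalty(beta, lam):
    """L1 penalty: grows linearly with |beta| without bound."""
    return lam * abs(beta)


def elastic_net_penalty(beta, lam, alpha=0.5):
    """Weighted mix of L1 and L2 penalties (glmnet-style parameterization)."""
    return lam * (alpha * abs(beta) + (1.0 - alpha) * beta ** 2 / 2.0)


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty: L1-like near zero, quadratic blend, then constant."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2.0 * a * lam * b - b * b - lam * lam) / (2.0 * (a - 1.0))
    return lam * lam * (a + 1.0) / 2.0


if __name__ == "__main__":
    lam = 0.5  # hypothetical tuning value, chosen only for illustration
    for beta in (0.1, 1.0, 9.092):
        print(
            beta,
            lasso_penalty(beta, lam),
            elastic_net_penalty(beta, lam),
            scad_penalty(beta, lam),
        )
```

Because the SCAD penalty is flat for large coefficients, it shrinks a strong signal less than LASSO does. That is one plausible reading of why the single coefficient retained by SCAD (9.092) is larger than its LASSO counterpart (7.38).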
Model analysis of deviance tests. Two GAM analyses are shown.
GAM analysis incorporating both linear and smoothing components.

| Parameters | df | Chi-square | Pr > chisq |
|---|---|---|---|
| | 1 | 14.4400 | 0.0004** |
| | 1 | 1.040 | 0.315 |
| | 1 | 2.280 | 0.139 |
| | 1 | 0.078 | 0.784 |
| | 2 | 8.596 | 0.014** |
| | 2 | 0.145 | 0.930 |
| | 2 | 6.284 | 0.043** |
| | 2 | 2.141 | 0.343 |
| | 1 | 13.0320 | 0.0007** |
| | 1 | 0.476 | 0.494 |
| | 1 | 1.638 | 0.205 |
| | 1 | 0.0001 | 0.989 |
For each parameter, the degrees of freedom (df) are shown; dominant factors significant at the level alpha = 0.1 (*) and alpha = 0.05 (**) are marked.
Marginal posterior inclusion probability and term importance. Shown are the posterior model probabilities from 8000 MCMC samples drawn from 8 chains, each run for 5000 iterations after a burn-in of 500.
| Coefficients | P(gamma=1) | Pi | Dimension |
|---|---|---|---|
| | 0.499 | 0.657 | 1 |
| | 0.750 | 0.336 | 8 |
| | 0.006 | 0.000 | 1 |
| | 0.006 | 0.000 | 6 |
| | 0.026 | −0.003 | 1 |
| | 0.212 | 0.010 | 7 |
| | 0.016 | −0.001 | 1 |
| | 0.015 | 0.000 | 7 |
:P(gamma=1)>.25;
:P(gamma=1)>.5
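The two footnote thresholds can be applied directly to the P(gamma=1) column. A minimal sketch: the values are copied from the table in row order, while the term names (`term_1` to `term_8`) are hypothetical placeholders, because the coefficient labels are not available in this record.

```python
# Marginal posterior inclusion probabilities, in table row order.
# Keys are placeholder names (an assumption), not the real term labels.
inclusion_prob = {
    "term_1": 0.499,
    "term_2": 0.750,
    "term_3": 0.006,
    "term_4": 0.006,
    "term_5": 0.026,
    "term_6": 0.212,
    "term_7": 0.016,
    "term_8": 0.015,
}


def flag(p):
    """Return the strongest footnote threshold that p exceeds."""
    if p > 0.5:
        return "P(gamma=1) > .5"
    if p > 0.25:
        return "P(gamma=1) > .25"
    return "below both thresholds"


flags = {name: flag(p) for name, p in inclusion_prob.items()}

if __name__ == "__main__":
    for name, label in flags.items():
        print(name, inclusion_prob[name], label)
```

Only the first two rows clear either threshold under this reading; the row with P(gamma=1) = 0.212 falls just short of the lower cutoff.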
Figure 1. Partial Residual Plots
For each predictor, the solid line is a spline fit and the dotted lines are its 95% confidence band. Each panel shows the relationship between the predictor and the residualized (adjusted) dependent variable values.
Confusion matrix for the MARS model.
| Class | Total | Predicted H. pylori (n=27) | Predicted PUD (n=33) |
|---|---|---|---|
| H. pylori | 30 | 26 | 4 |
| PUD | 30 | 1 | 29 |
| Total | 60 | correct = 86.67% | correct = 96.67% |
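The percentages in the confusion matrix follow from its counts. The sketch below reproduces them, reading the first data row as the true H. pylori group and the second as the true PUD group; that reading is an assumption, chosen because it is the one consistent with the reported column totals (27 and 33) and percentages. The overall accuracy is derived here and is not a figure stated in the table.

```python
# Rows: true class; inner keys: predicted class (counts from the table).
confusion = {
    "H. pylori": {"H. pylori": 26, "PUD": 4},
    "PUD": {"H. pylori": 1, "PUD": 29},
}


def class_accuracy(matrix, label):
    """Percent of subjects with true class `label` that were predicted correctly."""
    row = matrix[label]
    return 100.0 * row[label] / sum(row.values())


def overall_accuracy(matrix):
    """Percent of all subjects on the diagonal of the confusion matrix."""
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return 100.0 * correct / total


def predicted_total(matrix, label):
    """Number of subjects assigned to `label`, regardless of true class."""
    return sum(row[label] for row in matrix.values())


if __name__ == "__main__":
    for label in confusion:
        print(label, round(class_accuracy(confusion, label), 2))
    print("overall", round(overall_accuracy(confusion), 2))
```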
Figure 2. ROC analysis. Shown is a receiver operating characteristic (ROC) curve for the predictive model for peptic ulcer disease. Y axis, sensitivity; X axis, 1 − specificity.
Figure 3. Variable Importance for MARS model of PUD
Variable importance was computed for each feature in the MARS model. Y axis, percent contribution for each analyte.
MARS Basis Functions. Shown are the basis functions (BF) for the MARS model of PUD prediction. Bm, individual basis function; am, coefficient of the basis function.
| Bm | Definition | am | Variable descriptor |
|---|---|---|---|
| | 0.7615 − Histidine | 3.49646 | Histidine |
| | Citrulline − 0.338 | 6.6434 | Citrulline |
| | Citrulline − 0.49 | −7.00747 | Citrulline |
| | Lysine − 1.401 | 0.121036 | Lysine |
| | Arginine − 2.719 | −0.104259 | Arginine |

(y)+ = max(0, y)
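The basis functions can be evaluated with the hinge definition (y)+ = max(0, y). The sketch below assumes that each entry in the Definition column is the argument of that hinge, and it returns only the weighted sum of basis functions: the model intercept is not reported in this record, so it is left out. The sample values are hypothetical and serve only to exercise the code.

```python
def hinge(y):
    """(y)+ = max(0, y), as defined in the table footnote."""
    return max(0.0, y)


# (variable, basis function, coefficient a_m), values taken from the table.
# Assumption: each Definition entry is wrapped in the hinge function.
BASIS_FUNCTIONS = [
    ("Histidine", lambda x: hinge(0.7615 - x), 3.49646),
    ("Citrulline", lambda x: hinge(x - 0.338), 6.6434),
    ("Citrulline", lambda x: hinge(x - 0.49), -7.00747),
    ("Lysine", lambda x: hinge(x - 1.401), 0.121036),
    ("Arginine", lambda x: hinge(x - 2.719), -0.104259),
]


def basis_sum(sample):
    """Sum of a_m * B_m(x) over all basis functions (intercept NOT included)."""
    return sum(a * bf(sample[var]) for var, bf, a in BASIS_FUNCTIONS)


if __name__ == "__main__":
    # Hypothetical measurements, not taken from the study data.
    sample = {"Histidine": 1.0, "Citrulline": 0.4, "Lysine": 1.0, "Arginine": 2.0}
    print(basis_sum(sample))
```

Under this reading, citrulline enters through two hinges: between 0.338 and 0.49 only the first is active, and above 0.49 the second, negative term offsets it, giving a net slope of 6.6434 − 7.00747 = −0.36407.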
Figure 4. Box Plots
For each variable shown, the distribution of the predictor is plotted separately for cases (with ulcer) and controls (without ulcer). PUD: endoscopy-documented peptic ulcer.
Figure 5. GAM Check Plots
GAM check plots show deviance residuals against approximate theoretical quantiles of the deviance residual distribution. GAM: Generalized Additive Models.