| Literature DB >> 23888262 |
Faezeh Hosseinzadeh, Amir Hossein Kayvanjoo, Mansour Ebrahimi, Bahram Goliaei.
Abstract
Early diagnosis of lung cancer and discrimination between its tumor types (small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC)) are critical for improving patient survival. Here we propose a diagnostic system based on sequence-derived structural and physicochemical attributes of the proteins involved in both tumor types, combining feature extraction, feature selection and prediction models. We computed 1497 protein attributes, selected the important ones with 12 attribute weighting models, and then applied machine learning models, comprising seven SVM models, three ANN models and two Naïve Bayes (NB) models, to the original database and to the new datasets created by the attribute weighting models; model accuracies were estimated by 10-fold cross-validation and, for the SVM algorithms only, by wrapper validation. In line with our previous findings, dipeptide composition, autocorrelation and the distribution descriptor were the protein features most often selected by the bioinformatics tools. The algorithms predicted lung cancer tumor type more accurately when applied to the datasets created by attribute weighting models than to the original dataset. Wrapper-Validation performed better than X-Validation; the best tumor type prediction came from the SVM and SVM Linear models (82%). The best ANN accuracy was obtained when the Neural Net model was applied to the SVM dataset (88%). This is the first report suggesting that combining protein features and attribute weighting models with machine learning algorithms can be effectively used to predict the type of lung cancer tumor (SCLC or NSCLC).
Keywords: Artificial neural network; Attribute weighting; Lung cancer; Naïve Bayes; Prediction; Structural and physicochemical features; Support vector machine
Year: 2013 PMID: 23888262 PMCID: PMC3710575 DOI: 10.1186/2193-1801-2-238
Source DB: PubMed Journal: Springerplus ISSN: 2193-1801
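The abstract's pipeline starts from sequence-derived protein descriptors. As a rough illustration of the kind of features involved, here is a minimal Python sketch of amino acid and dipeptide composition, the latter being one of the descriptor families the study found most informative. The example sequence and function names are illustrative, not the paper's actual PROFEAT/RapidMiner implementation.

```python
# Illustrative sketch (not the paper's code): two simple families of
# sequence-derived descriptors used for protein feature extraction.
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def aa_composition(seq):
    """Fraction of each of the 20 amino acids (20 features)."""
    n = len(seq)
    return {a: seq.count(a) / n for a in AA}

def dipeptide_composition(seq):
    """Fraction of each ordered amino-acid pair (400 features) --
    one of the descriptor families ranked most informative here."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n = len(pairs)
    return {a + b: pairs.count(a + b) / n for a, b in product(AA, AA)}

# Toy sequence, for illustration only
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
features = {**aa_composition(seq), **dipeptide_composition(seq)}
```

Descriptor sets like these (plus autocorrelation, distribution, and other families) are what sum to the 1497 attributes per protein mentioned in the abstract.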
The list of overexpressed genes in three classes of lung tumors (SCLC, NSCLC and COMMON) defined by microarray analysis; extracted from the GSEA database
| Tumor type | Gene Symbol |
|---|---|
| SCLC | APAF1, BCL2, BCL2L1, BIRC2, BIRC3, CCNE1, CCNE2, CDK2, CDKN1B, CDKN2B, CHUK, CKS1B, COL4A1, COL4A2, COL4A4, COL4A6, CYCS, FN1, IKBKB, IKBKG, ITGA2, ITGA2B, ITGA3, ITGA6, ITGAV, ITGB1, LAMA1, LAMA2, LAMA3, LAMA4, LAMA5, LAMB1, LAMB2, LAMB3, LAMB4, LAMC1, LAMC2, LAMC3, MAX, MYC, NFKB1, NFKBIA, NOS2, PIAS1, PIAS2, PIAS3, PIAS4, PTEN, PTGS2, PTK2, RELA, SKP2, TRAF1, TRAF2, TRAF3, TRAF4, TRAF5, TRAF6, XIAP |
| NSCLC | ARAF, BAD, BRAF, CDKN2A, EGF, EGFR, ERBB2, FOXO3, GRB2, HRAS, KRAS, MAP2K1, MAP2K2, MAPK1, MAPK3, NRAS, PDPK1, PLCG1, PLCG2, PRKCA, PRKCB, PRKCG, RAF1, RASSF1, RASSF5, SOS1, SOS2, STK4, TGFA |
| COMMON | AKT1, AKT2, AKT3, CASP9, CCND1, CDK4, CDK6, E2F1, E2F2, E2F3, FHIT, PIK3CA, PIK3CB, PIK3CD, PIK3CG, PIK3R1, PIK3R2, PIK3R3, PIK3R5, RARB, RB1, RXRA, RXRB, RXRG, TP53 |
Figure 1. The most important protein attributes, selected by more than fifty percent of the attribute weighting algorithms. As is evident, the distribution descriptor (F5.3), dipeptide composition (F1.2) and autocorrelation (F3.1) features were identified as important by 80% of the attribute weighting models.
Figure 2. Dispersion of the protein attributes assigned a weight between 0 and 1 by the SAM attribute weighting model (the protein attribute indices are defined in Additional file : Table S1).
Figure 3. Dispersion of the protein attributes assigned a weight between 0 and 1 by the Maximum Relevance attribute weighting model (the protein attribute indices are defined in Additional file : Table S1).
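Figures 2 and 3 plot per-attribute weights normalized into [0, 1]. A minimal sketch of how such weights might be produced and thresholded, assuming simple min-max scaling of raw importance scores; the scores and attribute names below are toy values, not the paper's data.

```python
# Hypothetical attribute-weighting step: scale raw importance scores
# into [0, 1], then keep attributes above a cutoff (toy data).

def normalize_weights(raw):
    """Min-max scale raw importance scores into [0, 1]."""
    lo, hi = min(raw.values()), max(raw.values())
    span = (hi - lo) or 1.0
    return {k: (v - lo) / span for k, v in raw.items()}

raw_scores = {"F1.2_dipeptide": 8.4, "F3.1_autocorrelation": 7.9,
              "F5.3_distribution": 9.1, "F2.0_other": 1.2}
weights = normalize_weights(raw_scores)
selected = [k for k, w in weights.items() if w >= 0.5]
```

Each of the 12 weighting models produces its own weight vector; the study then built one reduced dataset per model from the attributes that model retained.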
The total accuracy and Kappa obtained from applying the seven SVM algorithms on 13 datasets (FCdb and the 12 datasets derived from attribute weighting models)
| Dataset | Measure | SVM | Linear | Lib | Evolutionary | PSO | Hyper | Fast |
|---|---|---|---|---|---|---|---|---|
| FCdb | Accuracy | 67.42% | 67.42% | 47.12% | 51.67% | 45.30% | 64.77% | 33.26% |
| | Kappa | 40.85% | 40.85% | 25.76% | 0.00% | 0.00% | 44.54% | 2.48% |
| Chi Squared | Accuracy | 70.30% | 70.30% | 55.91% | 61.36% | 47.50% | 68.48% | 32.42% |
| | Kappa | 44.09% | 44.09% | 29.36% | 31.01% | 18.46% | 46.71% | 2.21% |
| Deviation | Accuracy | 50.68% | 50.68% | 58.71% | 51.97% | 49.24% | 34.02% | 31.44% |
| | Kappa | 2.08% | 2.08% | 32.56% | 13.00% | 22.16% | 0.06% | -0.24% |
| Gini Index | Accuracy | 67.42% | 67.42% | 64.77% | 63.94% | 43.86% | 63.18% | 31.44% |
| | Kappa | 42.19% | 42.19% | 45.05% | 31.19% | 0.00% | 42.74% | -0.24% |
| Info Gain | Accuracy | 74.39% | 74.39% | 60.30% | 65.61% | 45.30% | 63.18% | 31.44% |
| | Kappa | 54.43% | 54.43% | 38.77% | 34.02% | 0.00% | 43.16% | -0.24% |
| Info Gain Ratio | Accuracy | 65.76% | 65.76% | 64.62% | 54.17% | 45.30% | 67.42% | 34.17% |
| | Kappa | 34.46% | 34.46% | 45.06% | 6.46% | 0.00% | 47.55% | -1.65% |
| PCA | Accuracy | 50.68% | 50.68% | 58.71% | 51.97% | 49.24% | 34.02% | 31.44% |
| | Kappa | 2.08% | 2.08% | 32.56% | 13.00% | 22.16% | 0.06% | -0.24% |
| Relief | Accuracy | 71.29% | 71.29% | 58.03% | 56.89% | 56.36% | 73.79% | 30.00% |
| | Kappa | 47.88% | 47.88% | 26.55% | 13.61% | 26.89% | 56.22% | -1.21% |
| SVM | Accuracy | 81.67% | 81.67% | 59.55% | 66.74% | 43.79% | 78.18% | 36.14% |
| | Kappa | 69.09% | 69.09% | 34.98% | 40.22% | 0.00% | 64.73% | 3.19% |
| Uncertainty | Accuracy | 69.32% | 69.32% | 61.14% | 58.64% | 45.30% | 64.55% | 32.35% |
| | Kappa | 44.57% | 44.57% | 39.90% | 16.82% | 0.00% | 43.04% | 1.22% |
| Rule | Accuracy | 64.92% | 64.92% | 59.39% | 51.67% | 45.30% | 61.14% | 31.36% |
| | Kappa | 36.01% | 36.01% | 37.90% | 0.00% | 0.00% | 38.09% | -6.10% |
| SAM | Accuracy | 62.20% | 62.20% | 56.67% | 52.50% | 46.21% | 54.77% | 33.26% |
| | Kappa | 31.04% | 31.04% | 36.30% | 2.00% | 0.00% | 26.61% | 2.48% |
| MR | Accuracy | 78.03% | 78.03% | 58.86% | 63.03% | 53.64% | 76.36% | 32.65% |
| | Kappa | 63.43% | 63.43% | 28.59% | 31.73% | 24.33% | 61.02% | -4.27% |
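The tables report both accuracy and Cohen's kappa, which corrects accuracy for the agreement expected by chance. A short sketch of the kappa computation from a confusion matrix; the 2-class matrix below is a toy example, not one of the paper's results.

```python
# Cohen's kappa from a square confusion matrix (toy numbers).

def cohen_kappa(confusion):
    """Kappa from a confusion matrix (rows: true class, cols: predicted)."""
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    expected = sum(
        (sum(confusion[i]) / total) * (sum(r[i] for r in confusion) / total)
        for i in range(len(confusion))
    )
    return (observed - expected) / (1 - expected)

# Toy 2-class matrix (e.g. SCLC vs NSCLC): 85% accuracy, kappa 0.70
cm = [[40, 10],
      [ 5, 45]]
```

This is why rows with near-50% accuracy on a two-class problem show kappa near 0: the classifier does no better than chance.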
The total accuracy obtained from running the seven methods on the 12 datasets derived from attribute weighting models
| Dataset | SVM | Linear | Lib | Evolutionary | PSO | Hyper | Fast |
|---|---|---|---|---|---|---|---|
| SAM | 70.98% | 68.86% | 51.67% | 41.97% | 45.68% | 41.82% | 66.52% |
| MR | 70.38% | 70.00% | 51.67% | 47.35% | 40.15% | 35.15% | 69.32% |
| Chi Squared | 69.02% | 70.83% | 51.67% | 43.79% | 44.39% | 32.58% | 71.89% |
| Deviation | 67.80% | 71.14% | 51.67% | 47.42% | 39.24% | 40.38% | 67.12% |
| Gini Index | 63.94% | 68.18% | 51.67% | 45.76% | 47.12% | 32.73% | 68.56% |
| Info Gain | 70.00% | 70.98% | 51.67% | 45.83% | 49.17% | 42.95% | 70.30% |
| Info Gain Ratio | 71.97% | 68.48% | 51.67% | 46.59% | 43.79% | 23.86% | 67.50% |
| PCA | 70.15% | 68.26% | 51.67% | 45.68% | 41.29% | 29.77% | 72.73% |
| Relief | 70.83% | 66.74% | 51.67% | 45.83% | 37.95% | 28.26% | 70.23% |
| Rule | 71.89% | 68.33% | 51.67% | 47.58% | 43.71% | 28.03% | 72.88% |
| SVM | 70.00% | 68.26% | 51.67% | 47.35% | 44.02% | 33.03% | 67.42% |
| Uncertainty | 67.42% | 69.17% | 51.67% | 47.35% | 43.48% | 29.92% | 66.44% |
Figure 5. Average performance of the two validation methods (X-Validation and Wrapper-Validation) applied to the seven SVM algorithms (SVM, Linear, Lib, Evolutionary, PSO, Hyper and Fast).
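The distinction Figure 5 draws can be made concrete: in plain X-Validation the model is simply trained and scored per fold, whereas in Wrapper-Validation the feature-selection step itself is rerun inside every training fold, so the held-out fold never influences which attributes are kept. A structural sketch, assuming placeholder `select_features`, `fit` and `score` callables (all hypothetical names, not RapidMiner operators):

```python
# Sketch of wrapper validation: feature selection is repeated inside
# each cross-validation fold. The three callables are placeholders.

def k_folds(n, k):
    """Yield (train_idx, test_idx) index pairs for k-fold CV."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def wrapper_validate(data, labels, k, select_features, fit, score):
    accs = []
    for train, test in k_folds(len(data), k):
        X_tr = [data[i] for i in train]
        y_tr = [labels[i] for i in train]
        feats = select_features(X_tr, y_tr)  # selection inside the fold
        model = fit(X_tr, y_tr, feats)
        accs.append(score(model, [data[i] for i in test],
                          [labels[i] for i in test], feats))
    return sum(accs) / len(accs)

# Dummy components, just to exercise the scaffold
acc = wrapper_validate(list(range(10)), [i % 2 for i in range(10)], 5,
                       lambda X, y: None,
                       lambda X, y, f: None,
                       lambda m, X, y, f: 1.0)
```

Because the test fold is excluded from selection, wrapper estimates are less optimistically biased, which is one plausible reading of why the paper trusts its Wrapper-Validation figures.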
The total accuracy and Kappa index obtained from the three Neural Network models on 13 datasets (FCdb and the 12 datasets derived from attribute weighting models)
| Data Base | Auto MLp Accuracy | Neural Net Accuracy | Perceptron Accuracy | Auto MLp Kappa | Neural Net Kappa | Perceptron Kappa |
|---|---|---|---|---|---|---|
| Chi Squared | 73.79% | 70.23% | 54.09% | 56.39% | 51.54% | 24.84% |
| Info Gain Ratio | 80.76% | 83.41% | 52.58% | 68.53% | 71.55% | 13.60% |
| FCdb | 69.24% | 81.59% | 50.76% | 50.05% | 69.36% | 3.43% |
| SVM | 85.15% | 87.73% | 57.80% | 75.66% | 79.66% | 20.61% |
| Uncertainty | 82.58% | 81.59% | 52.42% | 71.33% | 69.92% | 18.58% |
| PCA | 51.67% | 51.67% | 30.98% | 0.77% | 0.00% | -5.04% |
| Relief | 77.27% | 75.61% | 51.67% | 62.52% | 60.30% | 16.09% |
| Rule | 76.06% | 80.53% | 48.03% | 60.96% | 67.58% | 5.45% |
| Deviation | 51.67% | 52.50% | 30.98% | 1.73% | 3.33% | -5.04% |
| Gini Index | 76.29% | 76.21% | 48.86% | 61.62% | 61.97% | 11.38% |
| Info Gain | 85.91% | 85.98% | 51.44% | 76.98% | 77.09% | 16.98% |
| SAM | 66.32% | 64.62% | 52.65% | 45.56% | 42.92% | 15.51% |
| MR | 76.29% | 75.45% | 58.56% | 61.88% | 60.12% | 25.92% |
The total accuracy and Kappa index obtained from the two Naïve Bayes models on 13 datasets (FCdb and the 12 datasets derived from attribute weighting models)
| Data Base | Bayes Accuracy | Bayes Kernel Accuracy | Bayes Kappa | Bayes Kernel Kappa |
|---|---|---|---|---|
| Rule | 57.93% | 48.03% | 13.54% | 25.30% |
| SVM | 66.97% | 77.35% | 42.20% | 63.30% |
| Uncertainty | 58.21% | 55.45% | 14.33% | 32.65% |
| Relief | 66.74% | 72.65% | 42.94% | 55.55% |
| PCA | 54.39% | 44.02% | 19.52% | 11.38% |
| Info Gain Ratio | 61.24% | 61.14% | 25.22% | 41.00% |
| Info Gain | 69.32% | 70.23% | 44.63% | 54.66% |
| Gini Index | 65.68% | 66.74% | 38.20% | 48.97% |
| Deviation | 54.39% | 44.02% | 19.52% | 11.38% |
| Chi Squared | 63.18% | 64.09% | 37.20% | 42.02% |
| FCdb | 58.21% | 14.30% | 42.20% | 32.60% |
| MR | 70.00% | 77.20% | 50.84% | 63.60% |
| SAM | 61.20% | 62.95% | 26.95% | 39.75% |
Figure 4. The mechanism of kernel trick models. These machines map a non-linearly separable problem into a higher-dimensional space in which it becomes linearly separable.
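The idea in Figure 4 can be shown with a tiny example: a 1-D dataset that no threshold can separate becomes linearly separable after a nonlinear lift to two dimensions. The explicit map x → (x, x²) below stands in for the implicit mapping a kernel performs; the data points are invented for illustration.

```python
# Toy illustration of the kernel-trick idea: lift non-separable 1-D
# points into 2-D, where a linear boundary exists. (Illustrative data.)

pts = [(-2, 1), (-1, 0), (1, 0), (2, 1)]  # (x, class): class 1 lies outside

# No single threshold t on x separates the two classes in 1-D...
one_d_separable = any(
    all((x > t) == (c == 1) for x, c in pts) for t in range(-3, 4)
)

# ...but after lifting to (x, x**2), the line x2 = 2.5 separates them.
lifted = [((x, x * x), c) for x, c in pts]
two_d_separable = all((x2 > 2.5) == (c == 1) for (x1, x2), c in lifted)
```

An SVM with a nonlinear kernel (as in the LibSVM and Evolutionary models above) finds such a separating hyperplane without ever computing the lifted coordinates explicitly.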