Xiao-Ying Liu, Sheng-Bing Wu, Wen-Quan Zeng, Zhan-Jiang Yuan, Hong-Bo Xu.
Abstract
Biomarker selection and cancer classification play an important role in knowledge discovery from genomic data. Successful identification of gene biomarkers and biological pathways can significantly improve diagnostic accuracy and help machine learning models classify different types of cancer more reliably. In this paper, we propose a LogSum + L2 penalized logistic regression model and use a coordinate descent algorithm to solve it. Results on simulated and real datasets indicate that the proposed method is highly competitive with several state-of-the-art methods and performs well in group feature selection and classification problems.
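For orientation, one common way to write an objective of this kind is sketched below. The exact formulation used in the paper is not reproduced on this page, so the symbols $\lambda_1$, $\lambda_2$, $\varepsilon$ and the scaling of each term are assumptions rather than the authors' definitions.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{n}\sum_{i=1}^{n}\Big[\,y_i\,(\beta_0 + x_i^{\top}\beta)
  - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big)\Big]
+ \lambda_1 \sum_{j=1}^{p} \log\big(|\beta_j| + \varepsilon\big)
+ \lambda_2 \sum_{j=1}^{p} \beta_j^{2}
```

Under this reading, the first term is the negative log-likelihood of logistic regression with labels $y_i \in \{0, 1\}$, the log term is a non-convex sparsity penalty, and the squared term is the usual ridge penalty that tends to keep correlated features together.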
Year: 2020 PMID: 33335163 PMCID: PMC7747646 DOI: 10.1038/s41598-020-79028-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Contour plots (two-dimensional) for the regularization methods.
Figure 2. Exact solutions, panels (a) to (d), in an orthogonal design.
Figure 3. Flowchart of the coordinate descent algorithm for the penalized logistic regression model.
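The flowchart itself is not reproduced here. As a rough illustration of what a coordinate descent solver for such a model can look like, the sketch below cycles through the coefficients and updates each one by minimizing a quadratic upper bound on the logistic loss plus the penalty. The penalty form `lam1 * log(|b| + eps) + lam2 * b**2`, the curvature bound of 1/4, the function names, and the default parameter values are all assumptions made for this example; the authors' algorithm may differ in each of these respects.

```python
import math

def sigmoid(t):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + math.tanh(0.5 * t))

def objective(X, y, b0, beta, lam1, lam2, eps):
    """Mean logistic loss plus the assumed LogSum + L2 penalty."""
    loss = 0.0
    for xi, yi in zip(X, y):
        eta = b0 + sum(b * v for b, v in zip(beta, xi))
        # log(1 + exp(eta)) evaluated without overflow
        loss += max(eta, 0.0) + math.log1p(math.exp(-abs(eta))) - yi * eta
    pen = sum(lam1 * math.log(abs(b) + eps) + lam2 * b * b for b in beta)
    return loss / len(X) + pen

def threshold(z, L, lam1, lam2, eps):
    """Global minimiser of 0.5*L*(b - z)**2 + lam1*log(|b| + eps) + lam2*b**2."""
    A = L + 2.0 * lam2
    az = abs(z)
    # Stationary points for b > 0 solve a quadratic; disc is its discriminant.
    disc = (A * eps + L * az) ** 2 - 4.0 * A * lam1
    if disc < 0.0:
        return 0.0
    t = (L * az - A * eps + math.sqrt(disc)) / (2.0 * A)  # larger root = local minimum
    if t <= 0.0:
        return 0.0

    def h(u):
        return 0.5 * L * (u - az) ** 2 + lam1 * math.log(u + eps) + lam2 * u * u

    # The penalty is non-convex, so the local minimum must beat b = 0 explicitly.
    return math.copysign(t, z) if h(t) < h(0.0) else 0.0

def fit(X, y, lam1=0.01, lam2=0.01, eps=0.01, n_sweeps=50):
    """Cyclic coordinate descent; X is a list of rows, y a list of 0/1 labels."""
    n, p = len(X), len(X[0])
    b0, beta, eta = 0.0, [0.0] * p, [0.0] * n
    cols = [[row[j] for row in X] for j in range(p)]
    # Upper bound on the curvature of the mean logistic loss along each coordinate.
    L = [sum(v * v for v in col) / (4.0 * n) for col in cols]
    for _ in range(n_sweeps):
        # Unpenalised intercept: gradient step with step size 1 / (1/4).
        g0 = sum(sigmoid(e) - yi for e, yi in zip(eta, y)) / n
        b0 -= 4.0 * g0
        eta = [e - 4.0 * g0 for e in eta]
        for j in range(p):
            if L[j] == 0.0:
                continue
            g = sum(v * (sigmoid(e) - yi) for v, e, yi in zip(cols[j], eta, y)) / n
            new = threshold(beta[j] - g / L[j], L[j], lam1, lam2, eps)
            delta = new - beta[j]
            if delta != 0.0:
                eta = [e + v * delta for e, v in zip(eta, cols[j])]
                beta[j] = new
    return b0, beta
```

Calling `fit(X, y)` on standardized features returns the intercept and a coefficient list in which unselected features are exactly zero. Because each update minimizes an upper bound that touches the objective at the current point, the objective cannot increase from one sweep to the next, although with a non-convex penalty the result is only a local solution.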
Training results of different methods on the simulated datasets.
|  | Method | Accuracy (Scenario 1) | Accuracy (Scenario 2) | Sensitivity (Scenario 1) | Sensitivity (Scenario 2) | Specificity (Scenario 1) | Specificity (Scenario 2) | AUC (Scenario 1) | AUC (Scenario 2) |
|---|---|---|---|---|---|---|---|---|---|
| 0.2 |  | 90.00% (1.85%) | 98.78% (0.37%) | 91.30% (1.12%) | 99.82% (0.01%) | 88.73% (2.14%) | 97.45% (0.49%) | 97.12% (0.53%) | 98.12% (0.32%) |
|  |  | 87.14% (2.28%) | 99.16% (0.12%) | 86.96% (2.97%) | 99.78% (0.03%) | 87.32% (3.17%) | 97.33% (0.52%) | 95.08% (0.93%) | 97.89% (0.35%) |
|  |  | 94.29% (0.59%) | 98.65% (0.35%) | 95.65% (0.62%) | 99.82% (0.01%) | 92.96% (1.26%) | 98.31% (0.38%) | 98.84% (0.21%) | 98.53% (0.33%) |
|  |  | (0%) | (0.01%) | (0%) | (0.01%) | (0%) | (0.01%) | (0%) | (0.06%) |
| 0.6 |  | 91.43% (1.35%) | 98.65% (0.26%) | 87.69% (2.61%) | 98.76% (0.13%) | 94.67% (0.92%) | 97.24% (0.29%) | 97.37% (0.31%) | 98.16% (0.28%) |
|  |  | 85.71% (2.03%) | 97.76% (0.31%) | 69.23% (2.84%) | 97.84% (0.20%) | 100% (0%) | 98.86% (0.22%) | 96.04% (0.48%) | 98.07% (0.34%) |
|  |  | 90.71% (1.76%) | 98.65% (0.23%) | 87.69% (1.48%) | 99.12% (0.04%) | 93.33% (0.81%) | 98.21% (0.26%) | 97.58% (0.40%) | 98.54% (0.32%) |
|  |  | (0.21%) | (0.02%) | (0.62%) | (0.02%) | (0%) | (0.02%) | (0%) | (0.09%) |
Numbers in parentheses are the standard deviations and the best results are highlighted in bold.
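The four columns reported here are standard binary-classification metrics. A minimal sketch of how they are usually computed from true labels and predicted scores follows; the 0.5 cutoff, the choice of label 1 as the positive class, and the function names are assumptions for illustration, since this page does not state how the paper computed them.

```python
def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

def classification_metrics(y_true, scores, cutoff=0.5):
    """Accuracy, sensitivity, specificity at a cutoff, plus threshold-free AUC."""
    y_pred = [1 if s >= cutoff else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "auc": auc(y_true, scores),
    }

# Small hand-checkable example: 3 positives, 3 negatives.
example = classification_metrics([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.6, 0.3, 0.2])
```

In the example, two of three positives and two of three negatives fall on the correct side of the cutoff, so accuracy, sensitivity, and specificity are all 2/3, while 8 of the 9 positive-negative pairs are ranked correctly, giving an AUC of 8/9.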
Test results of different methods on the simulated datasets.
|  | Method | Accuracy (Scenario 1) | Accuracy (Scenario 2) | Sensitivity (Scenario 1) | Sensitivity (Scenario 2) | Specificity (Scenario 1) | Specificity (Scenario 2) | AUC (Scenario 1) | AUC (Scenario 2) |
|---|---|---|---|---|---|---|---|---|---|
| 0.2 |  | 75.00% (3.82%) | 71.67% (3.19%) | 78.31% (2.83%) | 78.57% (2.88%) | 74.19% (3.64%) | 63.50% (4.31%) | 86.76% (1.63%) | 80.80% (2.37%) |
|  |  | 78.33% (3.15%) | 66.67% (4.18%) | 79.54% (2.68%) | 75.00% (3.07%) | 87.10% (2.63%) | 53.13% (6.16%) | 84.43% (1.87%) | 76.23% (3.49%) |
|  |  | 80.00% (1.86%) | 65.00% (4.03%) | 79.63% (2.66%) | 71.43% (3.92%) | 87.10% (2.52%) | 58.76% (5.83%) | 88.65% (1.31%) | 77.23% (3.46%) |
|  |  | (1.55%) | (3.17%) | (1.43%) | (3.17%) | (1.96%) | (3.05%) | (0.62%) | (2.29%) |
| 0.6 |  | 68.33% (4.01%) | 58.33% (4.92%) | 62.07% (4.65%) | 59.09% (4.83%) | 70.97% (3.62%) | 57.89% (5.52%) | 82.76% (1.94%) | 65.67% (4.46%) |
|  |  | 71.67% (3.61%) | 56.67% (5.32%) | 55.17% (5.18%) | 63.64% (4.78%) | 77.42% (2.95%) | 44.74% (8.03%) | 81.76% (2.43%) | 59.93% (5.04%) |
|  |  | 73.33% (3.33%) | 55.00% (5.57%) | 58.62% (4.96%) | 59.09% (5.02%) | 80.65% (2.31%) | 52.63% (5.24%) | 86.76% (1.88%) | 52.27% (5.71%) |
|  |  | (1.73%) | (2.83%) | (2.04%) | (3.11%) | (1.78%) | (2.71%) | (0.50%) | (4.32%) |
Numbers in parentheses are the standard deviations and the best results are highlighted in bold.
Results of -sensitivity, -specificity obtained by four methods. (Numbers in parentheses are the standard deviations and the best results are highlighted in bold).
|  | Method | Sensitivity (Scenario 1) | Sensitivity (Scenario 2) | Specificity (Scenario 1) | Specificity (Scenario 2) |
|---|---|---|---|---|---|
| 0.2 |  | 73.45% (2.95%) | 71.53% (2.94%) | 99.90% (0.01%) | 95.81% (0.52%) |
|  |  | 73.16% (2.26%) | 71.31% (2.32%) | 99.95% (0.01%) | 76.65% (2.73%) |
|  |  | 74.62% (3.05%) | 73.15% (2.89%) | 99.95% (0.01%) | 95.45% (1.92%) |
|  |  | (2.58%) | (2.74%) | (0.01%) | (0.01%) |
| 0.6 |  | 64.18% (3.56%) | 62.43% (4.62%) | 99.70% (0.01%) | 95.00% (0.73%) |
|  |  | 65.36% (3.63%) | 63.34% (4.13%) | 99.95% (0.01%) | 76.00% (3.04%) |
|  |  | 65.41% (3.81%) | 63.62% (4.51%) | 99.90% (0.01%) | 95.96% (0.65%) |
|  |  | (2.92%) | (3.86%) | (0.01%) | (0.01%) |
Three publicly available lung cancer gene expression datasets.
| Dataset | No. of probes | Classes (Class1/Class2) | No. of samples (Class1/Class2) |
|---|---|---|---|
| GSE10072 | 22,284 | Normal/Lung Cancer | 107 (49/58) |
| GSE19188 | 54,675 | Normal/Lung Cancer | 156 (65/91) |
| GSE19804 | 54,675 | Normal/Lung Cancer | 120 (60/60) |
Training accuracy, test accuracy, and number of selected genes for the four methods on the three lung cancer datasets.
| Data | Method | Training accuracy | Test accuracy | No. of selected genes |
|---|---|---|---|---|
| GSE10072 |  | 98.32% (0.14%) | 95.12% (0.31%) | 23 (1.97) |
|  |  | 99.04% (0.04%) | 98.4% (0.17%) | 72 (8.45) |
|  |  | 98.21% (0.16%) | 92.1% (0.94%) | 11 (1.32) |
|  |  | (0.02%) | (0.08%) | 7 (0.82) |
| GSE19188 |  | 97.11% (0.21%) | 51.46% (6.05%) | 72 (9.33) |
|  |  | 98.33% (0.09%) | 47.56% (7.41%) | 121 (10.34) |
|  |  | 96.3% (0.28%) | 46.19% (5.23%) | 17 (2.03) |
|  |  | (0.01%) | (3.44%) | 10 (1.21) |
| GSE19804 |  | 99.05% (0.02%) | 95.2% (0.61%) | 37 (4.32) |
|  |  | 99.05% (0.02%) | 94.6% (0.64%) | 70 (7.73) |
|  |  | 97.14% (0.22%) | 96.6% (0.58%) | 9 (1.03) |
|  |  | (0.01%) |  | 6 (0.82) |
Numbers in parentheses are the standard deviations and the best results are highlighted in bold.
Figure 4. Venn diagram analysis of the results of the regularization methods.
Figure 5. Maximum integrative network of features selected by the LogSum + L2 penalized logistic regression model in the GSE10072 dataset.
Figure 6. Maximum integrative network of features selected by the LogSum + L2 penalized logistic regression model in the GSE19188 dataset.
Figure 7. Maximum integrative network of features selected by the LogSum + L2 penalized logistic regression model in the GSE19804 dataset.
Genes selected by the LogSum + L2 penalized logistic regression model for the different datasets.
| Probe ID | Gene symbol | Gene name |
|---|---|---|
| 201839_s_at | EPCAM | Epithelial cell adhesion molecule (EPCAM) |
| 200685_at | SRSF11 | Serine and arginine rich splicing factor 11 (SRSF11) |
| 204600_at | EPHB3 | EPH receptor B3 (EPHB3) |
| 205297_s_at | CD79B | CD79b molecule (CD79B) |
| 202932_at | YES1 | YES proto-oncogene 1, Src family tyrosine kinase (YES1) |
| 201983_s_at | EGFR | Epidermal growth factor receptor (EGFR) |
| 201596_x_at | KRT18 | Keratin 18 (KRT18) |
| 204292_x_at | STK11 | Serine/threonine kinase 11 (STK11) |
| 205880_at | PRKD1 | Protein kinase D1 (PRKD1) |
| 208694_at | PRKDC | Protein kinase, DNA-activated, catalytic polypeptide (PRKDC) |
| 205868_s_at | PTPN11 | Protein tyrosine phosphatase, non-receptor type 11 (PTPN11) |
| 214250_at | NUMA1 | Nuclear mitotic apparatus protein 1 (NUMA1) |
| 231657_s_at | CCDC74A | Coiled-coil domain containing 74A (CCDC74A) |
| 220939_s_at | DPP8 | Dipeptidyl peptidase 8 (DPP8) |
| 210704_at | FEZ2 | Fasciculation and elongation protein zeta 2 (FEZ2) |
| 208601_s_at | TUBB1 | Tubulin beta 1 class VI (TUBB1) |
| 207660_at | DMD | Dystrophin (DMD) |
| 1553655_at | CDC20B | Cell division cycle 20B (CDC20B) |
| 201839_s_at | EPCAM | Epithelial cell adhesion molecule (EPCAM) |
| 1552370_at | C4ORF33 | Chromosome 4 open reading frame 33 (C4orf33) |
| 1556925_at | SMC3 | Structural maintenance of chromosomes 3 (SMC3) |
| 207611_at | HIST1H2BL | Histone cluster 1 H2B family member l (HIST1H2BL) |
| 1554600_s_at | LMNA | Lamin A/C (LMNA) |