Hua Chai, Yong Liang, Sai Wang, Hai-Wei Shen.
Abstract
Traditional supervised classifiers need many labeled samples to perform well, but many biological datasets contain only a small number of labeled samples, and the rest are unlabeled. Labeling these samples manually is difficult or expensive. Techniques such as active learning and semi-supervised learning have been proposed to exploit unlabeled samples and improve model performance. However, active learning models tend to be short-sighted or biased, and they still require some manual labeling. Semi-supervised learning methods are easily affected by noisy samples. In this paper we propose a novel logistic regression model based on the complementarity of active learning and semi-supervised learning, which uses the unlabeled samples at the least cost to improve disease classification accuracy. In addition, a mechanism for updating pseudo-labeled samples is designed to reduce the number of falsely pseudo-labeled samples. The experimental results show that the new model achieves better performance than widely used semi-supervised learning and active learning methods in disease classification and gene selection.
Year: 2018 PMID: 30158596 PMCID: PMC6115447 DOI: 10.1038/s41598-018-31395-5
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. The workflow of the proposed logistic regression model combining SSL and AL.
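The combination described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the query rule, the confidence threshold, or the penalty, so everything below is an assumption. The names `assl_fit`, `oracle`, `n_query`, and `confident` are hypothetical, uncertainty is measured as distance of the predicted probability from 0.5, and the logistic regression is unpenalized (a gene-selection model would normally add a sparsity penalty). Pseudo-labels are rebuilt in every round, which is one simple way to realize an update mechanism for pseudo-labeled samples.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.1, epochs=300):
    # Plain, unpenalized logistic regression fitted by batch gradient descent.
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba((w, b), xi) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def assl_fit(X_lab, y_lab, X_unl, oracle, rounds=5, n_query=2, confident=0.9):
    # Hypothetical AL + SSL loop; oracle(i) returns the true label of X_unl[i].
    X_lab, y_lab = list(X_lab), list(y_lab)
    queried, pseudo = {}, {}  # index -> label
    model = fit_logistic(X_lab, y_lab)
    for _ in range(rounds):
        probs = {i: predict_proba(model, x)
                 for i, x in enumerate(X_unl) if i not in queried}
        # Active learning: query the n_query samples whose predicted
        # probability is closest to 0.5 (the most uncertain ones).
        for i in sorted(probs, key=lambda i: abs(probs[i] - 0.5))[:n_query]:
            queried[i] = oracle(i)
        # Semi-supervised learning: pseudo-labels are rebuilt from scratch,
        # so a label from an earlier round is dropped or flipped when the
        # current model no longer supports it.
        pseudo = {i: int(p >= 0.5) for i, p in probs.items()
                  if i not in queried and max(p, 1.0 - p) >= confident}
        extra = {**pseudo, **queried}
        X_train = X_lab + [X_unl[i] for i in extra]
        y_train = y_lab + [extra[i] for i in extra]
        model = fit_logistic(X_train, y_train)
    return model, queried, pseudo


# Toy data (hypothetical): class 1 in the upper-right quadrant, class 0 mirrored.
pos = [(1.0 + 0.2 * k, 1.5 - 0.1 * k) for k in range(10)]
neg = [(-x1, -x2) for x1, x2 in pos]
X_lab, y_lab = pos[:2] + neg[:2], [1, 1, 0, 0]
X_unl, y_unl = pos[2:] + neg[2:], [1] * 8 + [0] * 8

model, queried, pseudo = assl_fit(X_lab, y_lab, X_unl, oracle=lambda i: y_unl[i])
```

In this sketch the manual cost is capped at `rounds * n_query` oracle calls, while the remaining unlabeled samples contribute only when the current model is confident about them.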
The gene selection performances of different methods in simulation experiments.
| Method | Group A: NC | Group A: NS | Group A: sensitivity | Group A: specificity | Group B: NC | Group B: NS | Group B: sensitivity | Group B: specificity |
|---|---|---|---|---|---|---|---|---|
| logistic | 3.15 | 14.05 | 0.315 | 0.995 | 4.82 | 26.58 | 0.482 | 0.989 |
| AL-lo | 3.65 | 17.80 | 0.365 | 0.992 | 5.19 | 28.16 | 0.519 | 0.988 |
| SSL-lo | 3.87 | 23.61 | 0.387 | 0.990 | 5.51 | 45.40 | 0.551 | 0.979 |
| ASSL-lo | 5.32 | 63.90 | 0.532 | 0.971 | 6.74 | 96.27 | 0.674 | 0.955 |
| Auto-ASSL(A) | 3.59 | 57.68 | 0.359 | 0.973 | 5.26 | 97.39 | 0.526 | 0.953 |
| Auto-ASSL(B) | 4.17 | 27.45 | 0.417 | 0.988 | 5.75 | 53.60 | 0.575 | 0.976 |
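In every row of the table above, the reported sensitivity equals NC divided by 10. This suggests, although the record does not say so, that NC is the average number of correctly selected genes out of 10 truly relevant ones. The short check below makes that relationship explicit. The constant `N_TRUE` and the function name `sensitivity` are assumptions introduced for illustration. Specificity is not reproduced because the total number of simulated genes is not given here.

```python
# (NC, sensitivity) pairs copied from the table: Group A rows, then Group B rows.
rows = [
    (3.15, 0.315), (3.65, 0.365), (3.87, 0.387),
    (5.32, 0.532), (3.59, 0.359), (4.17, 0.417),
    (4.82, 0.482), (5.19, 0.519), (5.51, 0.551),
    (6.74, 0.674), (5.26, 0.526), (5.75, 0.575),
]

N_TRUE = 10  # assumption inferred from the numbers, not stated in the record


def sensitivity(nc, n_true=N_TRUE):
    # Fraction of the truly relevant genes that were selected.
    return nc / n_true


# True when every tabulated sensitivity matches NC / N_TRUE.
consistent = all(abs(sensitivity(nc) - s) < 1e-9 for nc, s in rows)
```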
Figure 2. The classification accuracy of different methods in simulation experiments.
Figure 3. The ROC curves of different methods in simulation experiments.
The AUC obtained by different methods in simulation experiments.
| AUC | logistic | AL-lo | SSL-lo | ASSL-lo | Auto-ASSL(A) | Auto-ASSL(B) |
|---|---|---|---|---|---|---|
| Group A | 0.9584 | 0.9723 | 0.9709 | 0.9874 | 0.9448 | 0.9810 |
| Group B | 0.9682 | 0.9855 | 0.9796 | 0.9943 | 0.9738 | 0.9917 |
Details of real datasets used in the experiments.
| Dataset | Number of genes | Number of samples | Number of labeled samples | Disease types |
|---|---|---|---|---|
| DLBCL | 2648 | 77 | 26 | diffuse large B-cell lymphoma |
| Prostate | 2135 | 102 | 34 | prostate cancer |
| GSE21050 | 54613 | 310 | 103 | soft tissue sarcomas |
| GSE32603 | 13200 | 231 | 77 | breast cancer |
The classification accuracy obtained by different methods in the real datasets.
| Method | DLBCL | Prostate | GSE21050 | GSE32603 |
|---|---|---|---|---|
| logistic | 77.94% | 86.54% | 79.01% | 69.69% |
| AL-lo | 83.15% | 91.53% | 84.43% | 73.68% |
| SSL-lo | 81.82% | 88.97% | 80.92% | 70.57% |
| ASSL-lo | 87.14% | 94.42% | 89.34% | 80.63% |
| Auto-ASSL(A) | 80.67% | 88.55% | 78.33% | 68.48% |
| Auto-ASSL(B) | 85.62% | 93.36% | 86.37% | 76.46% |
Figure 4. ROC curves obtained by different methods in real datasets: (a) DLBCL, (b) Prostate, (c) GSE21050, (d) GSE32603.
The AUC obtained by different methods in the real datasets.
| Method | DLBCL | Prostate | GSE21050 | GSE32603 |
|---|---|---|---|---|
| logistic | 0.9199 | 0.9569 | 0.8975 | 0.7557 |
| AL-lo | 0.9295 | 0.9749 | 0.9394 | 0.8328 |
| SSL-lo | 0.8942 | 0.9708 | 0.9232 | 0.7962 |
| ASSL-lo | 0.9583 | 0.9862 | 0.9596 | 0.9023 |
| Auto-ASSL(A) | 0.8333 | 0.9646 | 0.8835 | 0.7757 |
| Auto-ASSL(B) | 0.9391 | 0.9785 | 0.9432 | 0.8390 |
Figure 5. The number of genes selected by different methods in real datasets.
The genes selected by different methods in DLBCL.
| Rank | logistic | AL-lo | SSL-lo | ASSL-lo | Auto-ASSL(A) | Auto-ASSL(B) |
|---|---|---|---|---|---|---|
| 1 | … | … | … | … | … | … |
| 2 | KIF2C | MT2A | PURA | CD34 | GPR18 | … |
| 3 | MT2A | MIF | MT2A | TXNIP | ESD | MIF |
| 4 | MORC3 | … | … | MT2A | … | MORC3 |
| 5 | TLE4 | SELL | TLE4 | PURA | SELL | TLE4 |
| 6 | SELL | BMI1 | MIF | TRIB2 | MYCLP1 | SELL |
| 7 | N4BP2L1 | IFITM2 | N4BP2L1 | GAPDH | TRIM23 | N4BP2L1 |
| 8 | … | GAPDH | SELL | MYCLP1 | TLE4 | … |
| 9 | EFNA3 | CCL21 | CCL21 | … | KIF2C | CCL21 |
| 10 | MYCLP1 | SMAD6 | ESD | MIF | GAPDH | MT2A |
The genes selected by different methods in GSE32603.
| Rank | logistic | AL-lo | SSL-lo | ASSL-lo | Auto-ASSL(A) | Auto-ASSL(B) |
|---|---|---|---|---|---|---|
| 1 | … | GRB2 | LOC642236 | … | … | … |
| 2 | LOC642236 | MS4A1 | … | … | … | … |
| 3 | GRB2 | … | GRB2 | ZSCAN9 | GRB2 | GRB2 |
| 4 | … | … | … | CDKN1B | TMEM242 | TMEM242 |
| 5 | MAST1 | UBE2W | UFC1 | HPSE | | |
| 6 | C2orf70 | EPHB1 | … | | | |
| 7 | TAF8 | SUPT20H | | | | |
| 8 | MTSS1 | ARL2BP | | | | |
| 9 | PRDM4 | STK3 | | | | |
The genes selected by different methods in Prostate.
| Rank | logistic | AL-lo | SSL-lo | ASSL-lo | Auto-ASSL(A) | Auto-ASSL(B) |
|---|---|---|---|---|---|---|
| 1 | … | … | … | … | PTGDS | … |
| 2 | … | XBP1 | XBP1 | XBP1 | … | … |
| 3 | MYOF | NELL2 | … | … | NELL2 | MYOF |
| 4 | XBP1 | TGFB3 | PTGDS | NELL2 | RRAD | XBP1 |
| 5 | PTGDS | … | NELL2 | RBM3 | HSBP1 | … |
| 6 | NELL2 | ATP5ME | MYOF | PTGDS | MYOF | NELL2 |
| 7 | SERPINA3 | TRIM29 | ATP5ME | SDC1 | … | SERPINA3 |
| 8 | RBM3 | MYOF | SERPINA3 | CFD | PDLIM5 | … |
| 9 | TGFB3 | RBM3 | TGFB3 | ATP5ME | ATP5ME | TGFB3 |
| 10 | TRIM29 | SERPINA3 | TRIM29 | HSBP1 | SERPINA3 | TRIM29 |
The genes selected by different methods in GSE21050.
| Rank | logistic | AL-lo | SSL-lo | ASSL-lo | Auto-ASSL(A) | Auto-ASSL(B) |
|---|---|---|---|---|---|---|
| 1 | C15orf41 | … | … | FADS1 | … | … |
| 2 | … | SNORD35B | C8orf82 | SNORD35B | ADD3 | SNORD35B |
| 3 | C8orf82 | … | … | IFT43 | … | … |
| 4 | … | NFATC2IP | SLC1A4 | C8orf82 | SNORD35B | ADD3 |
| 5 | LPAR1 | C8orf82 | PML | CDC42EP3 | FHL2 | C8orf82 |
| 6 | AKT2 | NUP155 | PLD1 | … | PCDH18 | XPO6 |
| 7 | XPO6 | XPO6 | WDHD1 | DCN | NFATC2IP | ATP6V1D |
| 8 | SLC1A4 | IFT43 | AKT2 | … | YEATS2 | IFT43 |
| 9 | PLD1 | PCDH18 | RPL13A | XPO6 | LIMK2 | NUP155 |
| 10 | SNORD35B | WDHD1 | NFATC2IP | ADD3 | SMAD4 | … |