Lei-Ming Yuan, Yiye Sun, Guangzao Huang.
Abstract
A novel multi-class classification method that integrates the elastic net and the probabilistic support vector machine was proposed for cancer detection from gene expression profile data of platelets, a multi-class classification problem characterized by high dimensionality, small sample sizes, and collinear data. The one-against-all (OVA) strategy was employed to decompose the multi-class problem into a series of binary classification problems. The elastic net was used to select class-specific features for each binary problem, and the probabilistic support vector machine was used to make the outputs of the binary classifiers, each built on its own class-specific features, comparable. Simulation data and gene expression profile data were used to verify the effectiveness of the proposed method. The results indicate that the proposed method automatically selects class-specific features and achieves better classification performance than conventional multi-class classification methods, which are mainly based on global feature selection. This study indicates that the proposed method is suitable for general multi-class classification problems with high-dimensional, small-sample, collinear data.
Keywords: cancer detection; class-specific features; elastic net; multi-class classification; platelets; probabilistic support vector machine
Year: 2020 PMID: 32164283 PMCID: PMC7085688 DOI: 10.3390/s20051528
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. The idea of solving the multi-class classification problem using class-specific features by one-against-all (OVA) strategy.
Figure 2. The flowchart of the proposed method.
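The pipeline described in the abstract (OVA decomposition, elastic-net feature selection per binary problem, then a probabilistic SVM whose outputs are compared across classes) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the selection rule (nonzero elastic-net regression coefficients), the hyperparameters `alpha` and `l1_ratio`, the linear kernel, Platt scaling via `SVC(probability=True)`, and the toy data generator `make_toy_data` are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVC


def fit_ova_class_specific(X, y, alpha=0.05, l1_ratio=0.5):
    """Fit one (feature mask, probabilistic SVM) pair per class."""
    models = {}
    for cls in np.unique(y):
        target = np.where(y == cls, 1, 0)  # one-against-all labels
        # Elastic-net regression on the 0/1 labels; features with nonzero
        # coefficients form the class-specific subset (assumed selection rule).
        enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000)
        enet.fit(X, target)
        mask = enet.coef_ != 0
        if not mask.any():  # fall back to all features if none survive
            mask[:] = True
        # Platt-scaled SVM so that classifiers trained on different feature
        # subsets produce comparable probability outputs.
        svm = SVC(kernel="linear", probability=True, random_state=0)
        svm.fit(X[:, mask], target)
        models[cls] = (mask, svm)
    return models


def predict_ova(models, X):
    """Assign each sample to the class whose binary model is most confident."""
    classes = sorted(models)
    # Column 1 of predict_proba is P(positive class) because classes_ == [0, 1].
    proba = np.column_stack(
        [models[c][1].predict_proba(X[:, models[c][0]])[:, 1] for c in classes]
    )
    return np.asarray(classes)[proba.argmax(axis=1)]


def make_toy_data(n_per_class=40, n_features=120, seed=0):
    """Hypothetical data: each class shifts its own block of 20 features."""
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [], []
    for cls in range(3):
        block = rng.normal(size=(n_per_class, n_features))
        block[:, 20 * cls:20 * (cls + 1)] += 2.0
        X_parts.append(block)
        y_parts.append(np.full(n_per_class, cls))
    return np.vstack(X_parts), np.concatenate(y_parts)


X_train, y_train = make_toy_data(seed=0)
X_test, y_test = make_toy_data(seed=1)
models = fit_ova_class_specific(X_train, y_train)
y_pred = predict_ova(models, X_test)
accuracy = float((y_pred == y_test).mean())
print(f"toy accuracy: {accuracy:.3f}")
```

Each binary model keeps its own feature mask, so the final decision compares calibrated probabilities rather than raw SVM margins, which would not be on a common scale across different feature subsets.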
Structure of dataset 1.
| Classes | Features 1 to 60 | | | Features 61 to 120 | | |
|---|---|---|---|---|---|---|
| | 1 to 20 | 21 to 40 | 41 to 60 | 61 to 80 | 81 to 100 | 101 to 120 |
| Class 1 | | | | | | |
| Class 2 | | | | | | |
| Class 3 | | | | | | |
Descriptions of the platelets data.
| Data | Classes | No. of Features | No. of Samples |
|---|---|---|---|
| Data 10 | HD+BrCa | 1809 | 94 |
| Data 11 | HD+BrCa+CRC | 3200 | 136 |
| Data 12 | HD+BrCa+CRC+NSCLC | 3398 | 196 |
| Data 13 | HD+BrCa+CRC+NSCLC+GBM | 3575 | 236 |
| Data 14 | HD+BrCa+CRC+NSCLC+GBM+PAAD | 3876 | 271 |
| Data 15 | HD+BrCa+CRC+NSCLC+GBM+PAAD+HBC | 3728 | 285 |
Note: The sizes for each class, including HD, BrCa, CRC, NSCLC, GBM, PAAD, HBC are 55, 39, 42, 60, 40, 35, 14, respectively.
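The sample counts in the table follow from the class sizes in the note, since each dataset adds one class to the previous one. A short check of that arithmetic, using only the numbers given above (the variable names are illustrative):

```python
from itertools import accumulate

# Class sizes as listed in the note, in the order the classes are added.
class_sizes = {"HD": 55, "BrCa": 39, "CRC": 42, "NSCLC": 60,
               "GBM": 40, "PAAD": 35, "HBC": 14}

cumulative = list(accumulate(class_sizes.values()))
# Data 10 already contains two classes, so skip the HD-only total.
dataset_sizes = dict(zip(range(10, 16), cumulative[1:]))
print(dataset_sizes)
# {10: 94, 11: 136, 12: 196, 13: 236, 14: 271, 15: 285}
```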
Figure 3. The average gene expression levels of each class in dataset 15.
Descriptions of microarray data for cancer detection.
| Data | No. of Classes | No. of Features | No. of Samples |
|---|---|---|---|
| Leukemia1 | 3 | 5327 | 72 |
| Leukemia2 | 3 | 11,225 | 72 |
| Small Round Blue Cell Tumors (SRBCT) | 4 | 2308 | 83 |
| Brain_Tumor2 | 4 | 10,367 | 50 |
| Brain_Tumor1 | 5 | 5920 | 90 |
| Tumors_9 | 9 | 5726 | 60 |
| Tumors11 | 11 | 12,534 | 174 |
Figure 4. The selected class-specific features of dataset 1 by EPSVM: (a) treating class 1 as the positive class; (b) treating class 2 as the positive class; (c) treating class 3 as the positive class. Separate markers indicate the selected features and the average feature vector of one class.
Figure 5. The selected global features: (a) support vector machine recursive feature elimination (SVM-RFE); (b) Relieff. The marker indicates the selected features.
NER of simulation data.
| Simulation Data | EPSVM | SVM-RFE + SVM | Relieff + SVM |
|---|---|---|---|
| Data 1 | 1.0000 | 0.9344 | 0.9750 |
| Data 2 | 0.9896 | 0.9792 | 0.9854 |
| Data 3 | 0.9859 | 0.9734 | 0.9594 |
| Data 4 | 0.9825 | 0.9738 | 0.9362 |
| Data 5 | 0.9896 | 0.9646 | 0.9333 |
| Data 6 | 0.9920 | 0.9268 | 0.9196 |
| Data 7 | 0.9883 | 0.9148 | 0.8914 |
| Data 8 | 0.9806 | 0.8896 | 0.8799 |
| Data 9 | 0.9731 | 0.8544 | 0.8456 |
NER and feature selection results for dataset 10 (HD+BrCa).
| Ranking Features | EPSVM ( | | Relieff + SVM | | SVM-RFE + SVM | |
|---|---|---|---|---|---|---|
| | No. of Selected Features | Mean NER | No. of Selected Features | Mean NER | No. of Selected Features | Mean NER |
| 100 | 46.1000 | 0.9294 | 38.7500 | 0.9268 | 15.0000 | 0.9222 |
| 200 | 52.3000 | 0.9464 | 67.5000 | 0.9519 | 22.5000 | 0.9335 |
| 400 | 68.8000 | 0.9461 | 140.0000 | 0.9283 | 40.0000 | 0.9200 |
| 600 | 64.6000 | 0.9409 | 187.5000 | 0.9473 | 60.0000 | 0.9488 |
| 800 | 81.3000 | 0.9463 | 290.0000 | 0.9415 | 90.0000 | 0.9498 |
| 1000 | 78.0000 | 0.9414 | 162.5000 | 0.9524 | 100.0000 | 0.9312 |
| 1200 | 80.3000 | 0.9646 | 240.0000 | 0.9424 | 120.0000 | 0.9623 |
| 1400 | 85.4000 | 0.9422 | 332.5000 | 0.9243 | 140.0000 | 0.9681 |
| 1600 | 79.4000 | 0.9385 | 200.0000 | 0.9472 | 160.0000 | 0.9372 |
| 1800 | 84.2000 | 0.9556 | 292.5000 | 0.9280 | 180.0000 | 0.9488 |
The mean NER of different methods for platelet data.
| Methods | | Dataset 11 | Dataset 12 | Dataset 13 | Dataset 14 | Dataset 15 |
|---|---|---|---|---|---|---|
| EPSVM | ( | 0.8928 | 0.8216 | 0.8153 | 0.8067 | 0.8133 |
| EPSVM | ( | 0.8758 | 0.8203 | 0.8184 | 0.7943 | 0.8014 |
| EPSVM | ( | 0.9035 | 0.8391 | 0.8269 | 0.8030 | 0.7904 |
| SVM-RFE+SVM | | 0.8624 | 0.7996 | 0.7947 | 0.7432 | 0.7599 |
| Relieff+SVM | | 0.8708 | 0.7386 | 0.7980 | 0.7346 | 0.7603 |
Figure 6. Principal component scores scatter plot (PC 1 × PC 2) of Dataset 11 with different features: (a) raw features; (b) global features selected by SVM-RFE; (c) global features selected by Relieff; (d) class-specific features treating healthy donors (HD) class as positive class; (e) class-specific features treating BrCa class as positive class; (f) class-specific features treating colorectal cancer (CRC) class as positive class.
Sensitivity (SN) and specificity (SP) of dataset 11 for each class ( = 0.5).
| Class | SN | | | SP | | |
|---|---|---|---|---|---|---|
| | EPSVM | SVM-RFE + SVM | Relieff + SVM | EPSVM | SVM-RFE + SVM | Relieff + SVM |
| HD | 0.9074 | 0.8333 | 0.9074 | 1.0000 | 0.9877 | 0.9753 |
| BrCa | 0.7949 | 0.8205 | 0.8718 | 0.8958 | 0.8438 | 0.8333 |
| CRC | 0.7857 | 0.7857 | 0.6905 | 0.8710 | 0.9032 | 0.9462 |
Sensitivity (SN) and specificity (SP) of dataset 15 for each class ( = 0.5).
| Class | SN | | | SP | | |
|---|---|---|---|---|---|---|
| | EPSVM | SVM-RFE + SVM | Relieff + SVM | EPSVM | SVM-RFE + SVM | Relieff + SVM |
| HD | 0.8889 | 0.9444 | 0.8667 | 0.9688 | 0.9455 | 0.9610 |
| BrCa | 0.5692 | 0.4462 | 0.4769 | 0.9244 | 0.9098 | 0.9098 |
| CRC | 0.7500 | 0.5900 | 0.6400 | 0.9040 | 0.8987 | 0.8693 |
| NSCLC | 0.8462 | 0.7538 | 0.6923 | 0.9805 | 0.9707 | 0.9659 |
| GBM | 0.4833 | 0.4500 | 0.5000 | 0.9687 | 0.9494 | 0.9422 |
| PAAD | 0.6429 | 0.5429 | 0.5571 | 0.8988 | 0.8864 | 0.9136 |
| HBC | 0.4000 | 0.3600 | 0.3600 | 0.9933 | 0.9911 | 0.9889 |
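For reference, per-class SN and SP values like those tabulated above can be derived from a multi-class confusion matrix by treating each class in turn as the positive class. The sketch below also computes NER under the assumption that it denotes the non-error rate, taken here as the mean of the per-class sensitivities; that definition is common in chemometrics but is not spelled out in this record. The confusion matrix is hypothetical and not taken from the paper.

```python
def per_class_sn_sp(confusion):
    """Return [(SN, SP), ...] per class.

    confusion[i][j] is the number of samples of true class i
    that were predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    results = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                        # class k missed
        fp = sum(confusion[i][k] for i in range(n)) - tp   # others called k
        tn = total - tp - fn - fp
        results.append((tp / (tp + fn), tn / (tn + fp)))
    return results


def non_error_rate(confusion):
    """Mean of per-class sensitivities (assumed definition of NER)."""
    sensitivities = [sn for sn, _ in per_class_sn_sp(confusion)]
    return sum(sensitivities) / len(sensitivities)


# Hypothetical 3-class confusion matrix, 20 samples per true class.
confusion = [
    [18, 1, 1],
    [2, 15, 3],
    [0, 4, 16],
]
sn_sp = per_class_sn_sp(confusion)
ner = non_error_rate(confusion)
for k, (sn, sp) in enumerate(sn_sp):
    print(f"class {k}: SN={sn:.4f} SP={sp:.4f}")
print(f"NER={ner:.4f}")
```

Averaging sensitivities rather than using overall accuracy keeps small classes from being swamped by large ones, which matters when class sizes are unequal.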
Figure 7. Feature selection results of the proposed method with different parameter values in one experiment for Dataset 11 (only the top 500 features are shown), where ×, ▽, ○ indicate the selected class-specific features of each class, respectively, and – indicates the gene expression levels of one sample: (a) the total number of selected features is 477; (b) the total number of selected features is 272; (c) the total number of selected features is 253.
NER of microarray data for tumor detection.
| Data | EPSVM | SVM-RFE+SVM | Relieff+SVM |
|---|---|---|---|
| Leukemia1 | 0.9593 | 0.9333 | 0.9445 |
| Leukemia2 | 0.9852 | 0.8987 | 0.9589 |
| SRBCT | 0.9958 | 0.9644 | 0.9844 |
| Brain_Tumor2 | 0.7581 | 0.7590 | 0.7979 |
| Brain_Tumor1 | 0.8293 | 0.7244 | 0.7862 |
| Tumors_9 | 0.8293 | 0.7244 | 0.7862 |
| Tumors11 | 0.9595 | 0.9016 | 0.9182 |