Jian Liu, Xuesong Wang, Yuhu Cheng, Lin Zhang.
Abstract
Because tumors seriously harm human health, effective diagnostic measures are urgently needed to guide tumor therapy, and early detection is particularly important for better treatment of patients. A central problem is how to discriminate tumor samples from normal ones effectively. Many classification methods, such as Support Vector Machines (SVMs), have been proposed for tumor classification. Recently, deep learning has achieved satisfactory performance on classification tasks in many areas. However, deep learning is rarely applied to tumor classification because gene expression datasets contain too few training samples. In this paper, a Sample Expansion method is proposed to address this problem. Inspired by the Denoising Autoencoder (DAE), the method obtains a large number of samples by randomly cleaning partially corrupted input many times. The expanded samples not only retain the merits of corrupted data in the DAE but also alleviate, to a certain extent, the shortage of training samples in gene expression data. Because Stacked Autoencoder (SAE) and Convolutional Neural Network (CNN) models show excellent performance in classification tasks, the applicability of SAE and 1-dimensional CNN (1DCNN) to gene expression data is analyzed. Finally, two deep learning models, Sample Expansion-Based SAE (SESAE) and Sample Expansion-Based 1DCNN (SE1DCNN), are designed to classify tumor gene expression data using the expanded samples. Experimental studies indicate that SESAE and SE1DCNN are very effective in tumor classification.
Keywords: 1-dimensional convolutional neural network; classification; deep learning; gene expression data; sample expansion
Year: 2017; PMID: 29312636; PMCID: PMC5752549; DOI: 10.18632/oncotarget.22762
Source DB: PubMed; Journal: Oncotarget; ISSN: 1949-2553
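The sample expansion idea described in the abstract can be illustrated with a minimal sketch. It assumes that corruption means masking noise (setting a fixed number of randomly chosen genes to zero, as is common for denoising autoencoders) and that every corrupted copy keeps the label of its source sample. The function names `expand_sample` and `expand_dataset`, their parameters, and the toy values are hypothetical and not taken from the paper.

```python
import random

def expand_sample(sample, n_corrupted, n_copies, rng):
    """Return n_copies corrupted variants of one gene expression profile.

    Assumption: "corruption" is masking noise, i.e. n_corrupted randomly
    chosen genes are set to zero in each copy.
    """
    n_genes = len(sample)
    if not 0 <= n_corrupted <= n_genes:
        raise ValueError("n_corrupted must lie between 0 and the number of genes")
    copies = []
    for _ in range(n_copies):
        # Pick a fresh random subset of gene indices for every copy.
        masked = set(rng.sample(range(n_genes), n_corrupted))
        copies.append([0.0 if i in masked else value
                       for i, value in enumerate(sample)])
    return copies

def expand_dataset(samples, labels, n_corrupted, n_copies, seed=0):
    """Expand every sample; each corrupted copy inherits its source label."""
    rng = random.Random(seed)
    expanded_x, expanded_y = [], []
    for sample, label in zip(samples, labels):
        for copy in expand_sample(sample, n_corrupted, n_copies, rng):
            expanded_x.append(copy)
            expanded_y.append(label)
    return expanded_x, expanded_y

# Toy data: 2 samples with 6 genes each (values are made up for illustration).
toy_x = [[0.5, 1.2, 3.1, 0.9, 2.2, 1.7], [1.1, 0.4, 2.8, 3.3, 0.6, 1.9]]
toy_y = [1, 2]
big_x, big_y = expand_dataset(toy_x, toy_y, n_corrupted=2, n_copies=4)
print(len(big_x), len(big_y))  # 8 8
```

With `n_copies` corrupted copies per sample, a dataset of 20 profiles grows to `20 * n_copies` training examples, which is the sense in which the method addresses the small sample size.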
Summary of tumor gene expression datasets
| Dataset | Data Labels | Number of Genes | Number of Samples |
|---|---|---|---|
| Breast cancer | 1=non-IBC, 2=IBC | 30006 | 20 |
| Leukemia | 1=MP, 2=HDMTX, 3=HDMTX+MP, 4=LDMTX+MP | 12600 | 60 |
| Colon cancer | 1=cancer, 2=normal | 2000 | 62 |
The parameters and classification accuracies of SESAE for different numbers of corrupted genes on the breast cancer dataset
| Layer | Parameter | | | | | |
|---|---|---|---|---|---|---|
| Hidden Layer 1 | Number of Nodes | 50 | 50 | 50 | 50 | 50 |
| Hidden Layer 2 | Number of Nodes | 50 | 50 | 50 | 50 | 50 |
| Accuracy (%) | | 87.33 | 86.67 | 86.00 | 86.67 | 86.00 |
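To give a sense of scale for the SESAE configuration in this table, the sketch below counts trainable parameters for the breast cancer network. It assumes fully connected layers with bias terms and a 2-unit output layer (one unit per class label in the dataset summary); neither detail is stated in the table, and the helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a chain of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 30006 input genes, two hidden layers of 50 nodes, 2 output classes (assumed).
breast_sesae_layers = [30006, 50, 50, 2]
breast_sesae_params = dense_param_count(breast_sesae_layers)
print(breast_sesae_params)  # 1503002
```

Under these assumptions the network has roughly 1.5 million parameters against only 20 original samples, which illustrates why the abstract treats insufficient training data as the main obstacle.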
The parameters and classification accuracies of SE1DCNN for different numbers of corrupted genes on the breast cancer dataset
| Layer | Parameter | | | | | |
|---|---|---|---|---|---|---|
| C1 Filter | Number | 11 | 11 | 5 | 11 | 11 |
| | Size | 21 | 21 | 21 | 21 | 21 |
| M1 Filter | Size | 4 | 4 | 4 | 4 | 4 |
| C2 Filter | Number | 5 | 5 | 5 | 5 | 5 |
| | Size | 21 | 21 | 21 | 21 | 21 |
| M2 Filter | Size | 4 | 4 | 4 | 4 | 4 |
| Accuracy (%) | | 95.33 | 95.33 | 93.33 | 94.67 | 94.00 |
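The filter and pooling sizes in this table determine how long each feature map is as a gene expression vector passes through the network. The sketch below traces those lengths, assuming unpadded ("valid") convolutions with stride 1 and non-overlapping max pooling that drops any remainder; the table does not state padding or stride, and the function names are hypothetical.

```python
def conv_out(length, kernel_size):
    """Output length of an unpadded 1-D convolution with stride 1."""
    return length - kernel_size + 1

def pool_out(length, pool_size):
    """Output length of non-overlapping max pooling (remainder dropped)."""
    return length // pool_size

def feature_map_lengths(n_genes, c1_size=21, m1_size=4, c2_size=21, m2_size=4):
    """Trace the feature-map length through C1, M1, C2 and M2."""
    c1 = conv_out(n_genes, c1_size)
    m1 = pool_out(c1, m1_size)
    c2 = conv_out(m1, c2_size)
    m2 = pool_out(c2, m2_size)
    return {"C1": c1, "M1": m1, "C2": c2, "M2": m2}

breast_lengths = feature_map_lengths(30006)
print(breast_lengths)  # {'C1': 29986, 'M1': 7496, 'C2': 7476, 'M2': 1869}

# With 5 C2 filters, the fully connected layer would receive 5 * 1869 inputs.
print(5 * breast_lengths["M2"])  # 9345
```

Under the same assumptions, the 2000-gene colon cancer vectors shrink to feature maps of length 118 after M2.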
The parameters and classification accuracies of SESAE for different numbers of corrupted genes on the leukemia dataset
| Layer | Parameter | | | | | |
|---|---|---|---|---|---|---|
| Hidden Layer 1 | Number of Nodes | 30 | 30 | 30 | 30 | 30 |
| Hidden Layer 2 | Number of Nodes | 30 | 30 | 30 | 30 | 30 |
| Accuracy (%) | | 49.79 | 49.36 | 48.72 | 48.30 | 48.51 |
The parameters and classification accuracies of SE1DCNN for different numbers of corrupted genes on the leukemia dataset
| Layer | Parameter | | | | | |
|---|---|---|---|---|---|---|
| C1 Filter | Number | 22 | 17 | 22 | 9 | 17 |
| | Size | 21 | 21 | 21 | 21 | 21 |
| M1 Filter | Size | 4 | 4 | 4 | 4 | 4 |
| C2 Filter | Number | 5 | 5 | 5 | 16 | 5 |
| | Size | 21 | 21 | 21 | 21 | 21 |
| M2 Filter | Size | 4 | 4 | 4 | 4 | 4 |
| Accuracy (%) | | 57.87 | 57.02 | 57.24 | 56.17 | 55.96 |
The parameters and classification accuracies of SESAE for different numbers of corrupted genes on the colon cancer dataset
| Layer | Parameter | | | | | |
|---|---|---|---|---|---|---|
| Hidden Layer 1 | Number of Nodes | 100 | 100 | 100 | 100 | 100 |
| Hidden Layer 2 | Number of Nodes | 100 | 100 | 100 | 100 | 100 |
| Accuracy (%) | | 84.49 | 83.68 | 83.28 | 83.89 | 83.69 |
The parameters and classification accuracies of SE1DCNN for different numbers of corrupted genes on the colon cancer dataset
| Layer | Parameter | | | | | |
|---|---|---|---|---|---|---|
| C1 Filter | Number | 25 | 5 | 20 | 12 | 20 |
| | Size | 21 | 21 | 21 | 21 | 21 |
| M1 Filter | Size | 4 | 4 | 4 | 4 | 4 |
| C2 Filter | Number | 20 | 10 | 7 | 9 | 5 |
| | Size | 21 | 21 | 21 | 21 | 21 |
| M2 Filter | Size | 4 | 4 | 4 | 4 | 4 |
| Accuracy (%) | | 84.90 | 85.51 | 85.30 | 84.49 | 85.10 |
The classification accuracies (%) of SESAE and SE1DCNN on the three datasets for different values of a when all samples are expanded
| Dataset | Method | | | | | |
|---|---|---|---|---|---|---|
| Breast | SESAE | 99.88 | 99.78 | 99.75 | 99.53 | 99.49 |
| | SE1DCNN | 99.94 | 99.84 | 99.81 | 99.74 | 99.68 |
| Leukemia | SESAE | 99.78 | 99.55 | 99.46 | 99.20 | 98.94 |
| | SE1DCNN | 99.84 | 99.67 | 99.54 | 99.37 | 99.12 |
| Colon | SESAE | 99.96 | 99.94 | 99.93 | 99.86 | 99.86 |
| | SE1DCNN | 99.98 | 99.95 | 99.93 | 99.88 | 99.86 |
The classification accuracies (%) of different methods on three datasets
| Methods | Breast Cancer | Leukemia | Colon Cancer |
|---|---|---|---|
| SE1DCNN | 95.33 | 57.87 | 85.51 |
| 1DCNN | 86.00 | 51.49 | 83.67 |
| SESAE | 87.33 | 49.79 | 84.49 |
| SAE | 80.67 | 32.55 | 82.07 |
| SAE in [ | 63.33 | 33.71 | 66.67 |
| SAE (Fine tuning) in [ | 83.33 | 33.71 | 83.33 |
| Softmax/SVM | 85.0 | 46.33 | 83.33 |
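The comparison can be made concrete by computing the accuracy difference between SESAE and SAE for each dataset. The sketch below copies the two rows from the table and assumes that the SAE row is the same model trained without sample expansion, so the difference approximates the effect of expansion; the names `accuracy` and `gain` are hypothetical.

```python
# Accuracies (%) copied from the comparison table, in column order.
datasets = ("Breast Cancer", "Leukemia", "Colon Cancer")
accuracy = {
    "SESAE": (87.33, 49.79, 84.49),
    "SAE": (80.67, 32.55, 82.07),
}

def gain(method, baseline):
    """Per-dataset accuracy difference in percentage points."""
    return {name: round(a - b, 2)
            for name, a, b in zip(datasets, accuracy[method], accuracy[baseline])}

sesae_gain = gain("SESAE", "SAE")
print(sesae_gain)
```

On these figures SESAE is ahead of SAE by 6.66, 17.24, and 2.42 percentage points on the breast cancer, leukemia, and colon cancer datasets respectively.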
Figure 1. The graphical representation of Autoencoder (A) and Denoising Autoencoder (B).
Figure 2. The schematic representation of sample expansion method.
Figure 3. The SESAE architecture consisting of sample expansion process, one input layer, two hidden layers and one output layer.
Figure 4. The SE1DCNN architecture consisting of sample expansion process, two convolutional layers, two max pooling layers and one fully-connected layer.
Figure 5. The flowchart of tumor classification by using SESAE or SE1DCNN.