Qiang Zhang, Peng Peng, Yi Gu.
Abstract
Traditional algorithms for this task have three drawbacks: (1) they focus on a single aspect of the genetic data, or on local feature data, of osteosarcoma patients and do not consider the extracted feature information as a whole; (2) they do not balance the sample data between classes; and (3) the resulting models generalize poorly, which makes it difficult to classify the survival status of osteosarcoma patients well. In this context, this paper designs a survival status prediction model for osteosarcoma patients based on E-CNN-SVM and multisource data fusion, taking into account the small sample size, high dimensionality, and interclass imbalance of osteosarcoma patients' genetic data. First, the model reduces the dimensionality of four types of gene sequencing data that are highly correlated with bone tumors using the random forest algorithm, fuses them, and then balances the data with a hybrid sampling method that combines the SMOTE algorithm and the TomekLink algorithm. Second, a CNN model with the incentive module extracts feature information from the data more accurately. Finally, the extracted features are passed to an SVM model to further improve the stability and classification performance of the model. In the reported experiments, the model improves the accuracy of survival status classification for patients with osteosarcoma.
Year: 2022 PMID: 35855803 PMCID: PMC9288314 DOI: 10.1155/2022/9464182
Source DB: PubMed Journal: Comput Intell Neurosci
Clinical characteristics of these patients.
| Clinical feature | Value |
|---|---|
| Number of cases | 65 |
| Age at illness (years) | 15.09 ± 4.89 |
| Gender | |
| Male | 36 |
| Female | 29 |
| Race | |
| White people | 39 |
| Black people | 10 |
| Asian | 7 |
| Unknown | 9 |
| Primary site | |
| Lower limbs | 59 |
| Pelvis | 2 |
| Upper limb | 4 |
| Survival time (days) | 1339.31 ± 982.08 |
Figure 1. The flow chart of the E-CNN-SVM and multisource algorithm.
The steps of the E-CNN-SVM and multisource algorithm.
| Step | Operation |
|---|---|
| Step 1 | Reduce the dimensionality of the copy number variation data, DNA methylation data, RNA gene sequencing data, and RNA homologue sequencing data separately, using the random forest algorithm |
| Step 2 | Fuse these four types of data with equal weights |
| Step 3 | Combine the SMOTE algorithm with the TomekLink algorithm to clean and balance the data |
| Step 4 | Pretrain the E-CNN model and save the optimal model |
| Step 5 | Extract features from the data using the layers of the E-CNN model from the input layer through the fully connected layer |
| Step 6 | Train the SVM model on the extracted features and use the trained model for classification |
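
The cleaning half of Step 3 can be illustrated with a small sketch. A Tomek link is a pair of samples from different classes that are each other's nearest neighbours, so such pairs mark noisy or borderline regions. The code below is a minimal pure-Python illustration, not the authors' implementation: the function names, the toy data, and the choice to drop both members of each link are assumptions made for this example.

```python
import math

def nearest_neighbor(points, i):
    """Return the index of the point closest to points[i], excluding i itself."""
    best_j, best_d = None, math.inf
    for j, p in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], p)
        if d < best_d:
            best_j, best_d = j, d
    return best_j

def tomek_links(points, labels):
    """Find pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    links = []
    for i, j in enumerate(nn):
        if j is not None and i < j and nn[j] == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

def remove_tomek_links(points, labels):
    """Drop both members of every Tomek link (an assumed cleaning policy)."""
    drop = {k for pair in tomek_links(points, labels) for k in pair}
    kept = [(p, y) for k, (p, y) in enumerate(zip(points, labels)) if k not in drop]
    return [p for p, _ in kept], [y for _, y in kept]

# Hypothetical 2-D toy data: the samples at index 2 and 3 sit on the class boundary.
points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 1.0), (2.0, 2.0), (2.1, 2.2)]
labels = [0, 0, 0, 1, 1, 1]

cleaned_points, cleaned_labels = remove_tomek_links(points, labels)
```

In this toy set only the pair at indices 2 and 3 forms a Tomek link, so four samples remain after cleaning. Some variants remove only the majority-class member of each link; which policy the paper uses is not stated in this section.
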
Figure 2. The random forest construction process.
Figure 3. Schematic diagram of the SMOTE algorithm for constructing new nodes.
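
As a rough illustration of how SMOTE constructs a new node, the standard formulation picks a minority-class sample, picks one of its k nearest minority neighbours, and places a synthetic sample at a random position on the line segment between them. The sketch below assumes that standard formulation; the function names, the value of `k`, and the toy data are hypothetical and are not taken from the paper.

```python
import math
import random

def k_nearest(sample, others, k):
    """Return the k points in `others` closest to `sample` (Euclidean distance)."""
    return sorted(others, key=lambda p: math.dist(sample, p))[:k]

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating towards random neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbor = rng.choice(k_nearest(base, others, k))
        gap = rng.random()  # position along the segment base -> neighbour, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical minority-class samples in two dimensions.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.4, 1.2)]
new_points = smote(minority, n_new=3)
```

Because every synthetic sample is a convex combination of two existing minority samples, the new points always stay inside the region already occupied by the minority class.
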
Figure 4. The schematic diagram of CNN and E-CNN. (a) The schematic diagram of the CNN model. (b) The schematic diagram of the E-CNN model.
The E-CNN-SVM model parameters.
| Model | Layer / parameter | Setting |
|---|---|---|
| CNN | First convolutional layer | Filters: 128; kernel_size: 1 |
| CNN | First dropout layer | Rate: 0.3 |
| CNN | Second convolutional layer | Filters: 64; kernel_size: 1 |
| CNN | Second dropout layer | Rate: 0.3 |
| CNN | Motivation module | Compression ratio: 4 |
| CNN | Fully connected layer | Units: 2 |
| SVM | Penalty factor | 0.9 |
| SVM | Cache_size | 3000 |
| SVM | Kernel | rbf |
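
This section does not describe the internals of the motivation module. If it is read as a squeeze-and-excitation style block (an assumption), a compression ratio of 4 on 64 channels would mean a bottleneck of 16 units that produces one gate per channel. The sketch below shows that reading in pure Python; the random weights, the feature-map length of 5, and all names are hypothetical.

```python
import math
import random

def squeeze_excite(feature_map, w1, w2):
    """Rescale each channel of feature_map (channels x length) by a gate in (0, 1)."""
    # Squeeze: global average over the length axis gives one number per channel.
    z = [sum(channel) / len(channel) for channel in feature_map]
    # Excitation: bottleneck dense layer with ReLU, then dense layer with sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[g * v for v in channel] for g, channel in zip(gates, feature_map)]
    return scaled, gates

channels, length, ratio = 64, 5, 4  # 64 filters and ratio 4 follow the parameter table
bottleneck = channels // ratio      # 16 hidden units under the assumed design

rng = random.Random(0)
# Random weights stand in for trained ones (illustration only).
w1 = [[rng.uniform(-0.1, 0.1) for _ in range(channels)] for _ in range(bottleneck)]
w2 = [[rng.uniform(-0.1, 0.1) for _ in range(bottleneck)] for _ in range(channels)]
feature_map = [[rng.random() for _ in range(length)] for _ in range(channels)]

scaled, gates = squeeze_excite(feature_map, w1, w2)
```

Under this reading, the module leaves the shape of the feature map unchanged and only reweights channels, which is why it can be inserted between the convolutional layers and the fully connected layer without altering the rest of the network.
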
Confusion matrix.
| Patient survival status | Predicted as death | Predicted as survival |
|---|---|---|
| Death | TP | FN |
| Survival | FP | TN |
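
The confusion matrix above treats death as the positive class. Assuming the usual definitions, accuracy and recall follow directly from the four counts; precision and F1 are included in the sketch for completeness, although they are not named in this section. The counts used are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute standard metrics from confusion-matrix counts (death = positive class)."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators, e.g. when the model never predicts death.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=8, fn=2, fp=1, tn=9)

# A classifier that predicts survival for every patient has recall 0
# even though its accuracy can still look moderate.
always_survival = classification_metrics(tp=0, fn=4, fp=0, tn=6)
```

The second call shows why accuracy alone is a weak indicator on imbalanced data: a model that never predicts death still scores 60% accuracy on these hypothetical counts while its recall is zero.
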
E-CNN-SVM experimental data.
| Number | Model | Data type | Accuracy (%) | Recall (%) | (unlabeled) | Variance |
|---|---|---|---|---|---|---|
| 1 | E-CNN-SVM | Multisource | 100 | 100 | 100 | 0 |
| 2 | E-CNN-SVM | Copy number variation | 100 | 100 | 100 | 0 |
| 3 | E-CNN-SVM | DNA methylation | 100 | 100 | 100 | 0 |
| 4 | E-CNN-SVM | RNA-seq-gene | 100 | 100 | 100 | 0 |
| 5 | E-CNN-SVM | RNA-seq-Ios | 92 | 75 | 86 | 0 |
E-CNN experimental data.
| Number | Model | Data type | Accuracy (%) | Recall (%) | (unlabeled) | Variance |
|---|---|---|---|---|---|---|
| 1 | E-CNN | Multisource | 97 | 97 | 96 | 0.0015 |
| 2 | E-CNN | Copy number variation | 79 | 69 | 69 | 0.0036 |
| 3 | E-CNN | DNA methylation | 86 | 76 | 77 | 0.0031 |
| 4 | E-CNN | RNA-seq-gene | 82 | 71 | 78 | 0.0105 |
| 5 | E-CNN | RNA-seq-Ios | 88 | 88 | 84 | 0.0012 |
CNN experimental data.
| Number | Model | Data type | Accuracy (%) | Recall (%) | (unlabeled) | Variance |
|---|---|---|---|---|---|---|
| 1 | CNN | Multisource | 86 | 97 | 81 | 0.0031 |
| 2 | CNN | Copy number variation | 75 | 69 | 66 | 0.0036 |
| 3 | CNN | DNA methylation | 80 | 74 | 74 | 0.0041 |
| 4 | CNN | RNA-seq-gene | 79 | 71 | 73 | 0.0036 |
| 5 | CNN | RNA-seq-Ios | 83 | 75 | 80 | 0.0054 |
Model comparison experimental data.
| Number | Model | Data type | Accuracy (%) | Recall (%) | (unlabeled) | Variance |
|---|---|---|---|---|---|---|
| 1 | E-CNN-SVM | Multisource | 100 | 100 | 100 | 0.0000 |
| 2 | E-CNN | Multisource | 97 | 97 | 96 | 0.0015 |
| 3 | CNN | Multisource | 86 | 97 | 81 | 0.0031 |
| 4 | SVM | Multisource | 76 | 35 | 51 | 0.0081 |
| 5 | XGBoost | Multisource | 76 | 66 | 64 | 0.0065 |
| 6 | CNN-LSTM | Multisource | 59 | 0 | 0 | 0.0000 |