Abstract
Early diagnosis of cancer helps in formulating the best treatment plan and can improve both the survival rate and the quality of patient life. However, the imaging and needle biopsy methods usually used have difficulty diagnosing tumors at an early stage and can also do considerable harm to the human body. Because changes in a patient's health status cause changes in blood protein indexes, diagnosing cancer at an early stage from changes in those indexes would make it convenient to track the course of treatment while also reducing patient pain and cost. In this paper, 39 serum protein markers were taken as research objects. The differences in the entropies of serum protein marker sequences across different types of patients were analyzed, and on this basis a cost-sensitive analysis model was established to improve the accuracy of cancer recognition. The results showed significant differences in entropy among different cancer patients, and the complexity of serum protein markers in normal people was higher than that in cancer patients. Although the dataset was rather imbalanced (897 instances: 799 normal, 44 liver cancer, and 54 ovarian cancer), the accuracy of our model still reached 95.21%. Other evaluation indicators were also stable and satisfactory: precision, recall, F1 and AUC reached 0.807, 0.833, 0.819 and 0.92, respectively. This study has theoretical and practical significance for cancer prediction and clinical application and can also provide a research basis for intelligent medical treatment.
Keywords: KNN; approximate entropy; cancer prediction; cost-sensitive learning; imbalanced dataset; sample entropy
Year: 2022 PMID: 35205547 PMCID: PMC8871087 DOI: 10.3390/e24020253
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Two-category Cost Matrix.
| | Actual Positive | Actual Negative |
|---|---|---|
| Predict positive | C(0,0) | C(0,1) |
| Predict negative | C(1,0) | C(1,1) |
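A cost matrix like this is typically used to pick the prediction with the lowest expected cost rather than the most probable class. The following is a minimal sketch of that rule, not the paper's implementation; the function name `min_cost_decision` and the numeric costs are hypothetical and chosen only for illustration.

```python
def min_cost_decision(p_positive, cost):
    """Return the cheaper prediction and both expected costs.

    cost[i][j] is C(i, j) from the table: the cost of predicting class i
    when the actual class is j (index 0 = positive, 1 = negative).
    """
    p_actual = [p_positive, 1.0 - p_positive]
    expected = [
        sum(cost[i][j] * p_actual[j] for j in range(2))
        for i in range(2)
    ]
    label = "positive" if expected[0] <= expected[1] else "negative"
    return label, expected


# Hypothetical costs (not from the paper): a missed positive, C(1,0),
# costs 9 times as much as a false alarm, C(0,1).
example_cost = [[0.0, 1.0],
                [9.0, 0.0]]

label, expected = min_cost_decision(0.2, example_cost)
print(label, expected)
```

With these illustrative costs, the rule predicts positive whenever the estimated probability of the positive class is at least 0.1, so a sample that is only 20% likely to be positive is still flagged.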
The average entropies.
| | Normal | Liver | Ovary |
|---|---|---|---|
| ApEn | 0.510 | 0.285 | 0.403 |
| SaEn | 0.575 | 0.237 | 0.365 |
| FuzzyEn | 0.018 | 0.017 | 0.019 |
| InfoEn | 5.281 | 5.284 | 5.284 |
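To make the entropy rows more concrete, here is a minimal sketch of the standard sample entropy definition, which the SaEn row presumably refers to (sample entropy is listed in the keywords). The embedding dimension `m`, the absolute tolerance `r`, and the toy series are assumptions; the source does not state the parameters or preprocessing it used.

```python
import math
import random


def sample_entropy(series, m=2, r=0.2):
    """Standard SampEn: -ln(A / B), where B counts pairs of length-m
    templates within Chebyshev distance r and A counts the same for
    length m + 1. Self-matches are excluded. r is an absolute tolerance."""
    n = len(series)

    def count_matches(length):
        # Use the same n - m starting positions for both template lengths.
        templates = [series[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                dist = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if dist <= r:
                    count += 1
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return float("inf")  # undefined when no matches are found
    return -math.log(a / b)


# Toy series (hypothetical, not the paper's data).
regular = [0.0, 1.0] * 10                      # perfectly periodic
rng = random.Random(0)
noisy = [rng.random() for _ in range(200)]     # irregular

regular_en = sample_entropy(regular, m=2, r=0.1)
noisy_en = sample_entropy(noisy, m=2, r=0.2)
```

A perfectly periodic series yields a sample entropy of zero, while an irregular one yields a positive value. This is the sense in which the higher ApEn and SaEn averages for the normal group are read as higher complexity.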
Figure 1. Entropy scatter plots of different types.
The cost matrix.
| Real Category | Predicted 0 | Predicted 1 | Predicted 2 |
|---|---|---|---|
| 0 | 0 | 1 | 1 |
| 1 | 9 | 0 | 1.3 |
| 2 | 7 | 1 | 0 |
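One common way to combine such a matrix with KNN is to turn the neighbour votes into class probabilities and then choose the prediction with the lowest expected cost. The sketch below illustrates that idea with the costs from the table above. It is not necessarily the CS-KNN variant used in the paper, and the function name `cs_knn_predict`, the value of `k`, and the toy data are hypothetical.

```python
import math
from collections import Counter

# COST[real][predicted], copied from the table above.
COST = [[0, 1, 1],
        [9, 0, 1.3],
        [7, 1, 0]]


def cs_knn_predict(train_x, train_y, query, k=5, cost=COST):
    """Predict the class with minimum expected cost among k neighbours."""
    # Rank training points by Euclidean distance to the query.
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    neighbours = order[:k]
    votes = Counter(train_y[i] for i in neighbours)
    n_classes = len(cost)
    # Vote fractions act as estimated probabilities of the real class.
    probs = [votes.get(c, 0) / len(neighbours) for c in range(n_classes)]
    expected = [
        sum(probs[real] * cost[real][pred] for real in range(n_classes))
        for pred in range(n_classes)
    ]
    best = min(range(n_classes), key=lambda pred: expected[pred])
    return best, expected


# Toy 1-D data (hypothetical): four class-0 points, one class-1 point
# close to the query, and one distant class-2 point.
train_x = [[0.0], [0.1], [0.2], [0.3], [0.15], [5.0]]
train_y = [0, 0, 0, 0, 1, 2]

pred, expected = cs_knn_predict(train_x, train_y, [0.15], k=5)
print(pred, expected)
```

In this toy case four of the five neighbours belong to class 0, so plain majority voting would return class 0. Because misclassifying a real class 1 sample as class 0 costs 9, the expected cost of predicting 0 is higher than that of predicting 1, and the cost-sensitive rule returns class 1 instead.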
Evaluation of classification results of blood protein index.
| | KNN | CS-KNN |
|---|---|---|
| accuracy | 0.948 | 0.952 |
| precision | 0.902 | 0.828 |
| Recall | 0.691 | 0.806 |
| F1_score | 0.766 | 0.814 |
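The two accuracy values are close, which is expected on a dataset this imbalanced. As a point of comparison, the sketch below computes what a trivial classifier that always predicts the majority class would score, using the class counts given in the abstract. The macro-averaged recall shown here assumes macro averaging across the three classes; the source does not state how the table's recall was averaged.

```python
# Class counts from the abstract.
counts = {"normal": 799, "liver": 44, "ovarian": 54}

total = sum(counts.values())
majority = max(counts, key=counts.get)

# Always predicting the majority class is right for every majority sample.
baseline_accuracy = counts[majority] / total

# Per-class recall is 1 for the majority class and 0 for the others.
per_class_recall = [1.0 if name == majority else 0.0 for name in counts]
baseline_macro_recall = sum(per_class_recall) / len(counts)

print(f"{baseline_accuracy:.3f} {baseline_macro_recall:.3f}")  # 0.891 0.333
```

Accuracy alone therefore separates the models only slightly from this baseline, whereas recall and F1 are where the difference between KNN and CS-KNN is visible.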
Figure 2. ROC curve of blood protein index.
Properties of dataset for validation.
| Dataset | Number of Properties | Number of Classes | Samples per Class |
|---|---|---|---|
| Heart Disease | 10 | 3 | 179/35/26 |
| Breast Cancer Wisconsin | 9 | 2 | 444/239 |
| Speaker Accent Recognition | 12 | 3 | 165/45/30 |
Average accuracy and F1 of two KNN algorithms.
| Dataset | KNN Average Accuracy | KNN Accuracy SD | KNN F1 | KNN F1 SD | KNN k | CS-KNN Average Accuracy | CS-KNN Accuracy SD | CS-KNN F1 | CS-KNN F1 SD | CS-KNN k |
|---|---|---|---|---|---|---|---|---|---|---|
| Heart Disease | 0.746 | 0.391 | 0.715 | 0.436 | 14 | 0.746 | 0.383 | 0.720 | 0.424 | 6 |
| Breast Cancer Wisconsin | 0.969 | 0.026 | 0.979 | 0.025 | 5 | 0.977 | 0.017 | 0.986 | 0.010 | 8 |
| Speaker Accent Recognition | 0.738 | 0.229 | 0.796 | 0.204 | 5 | 0.746 | 0.231 | 0.796 | 0.220 | 5 |
Evaluation results of cost-sensitive KNN.
| | Original Index | Entropy Index | Original Index and Entropy Index |
|---|---|---|---|
| accuracy | 0.952 | 0.705 | 0.952 |
| precision | 0.828 | 0.438 | 0.807 |
| Recall | 0.806 | 0.568 | 0.833 |
| F1_score | 0.814 | 0.451 | 0.819 |
Figure 3. ROC curve of cost-sensitive KNN.