| Literature DB >> 32528221 |
Zeinab Sajjadnia, Raof Khayami, Mohammad Reza Moosavi.
Abstract
In recent years, the rising incidence of different cancers has made a variety of data sources available in this field. Consequently, many researchers have become interested in discovering useful knowledge from these data to support faster decision-making by doctors and to reduce the negative consequences of such diseases. Data mining offers a set of techniques for discovering knowledge from data: detecting hidden patterns and finding unknown relations. However, these techniques face several challenges with real-world data; in particular, inconsistencies, errors, noise, and missing values require appropriate preprocessing and data preparation procedures. In this article, we investigate the impact of preprocessing on providing high-quality data for classification techniques. A wide range of preprocessing and data preparation methods are studied, and a set of preprocessing steps is leveraged to obtain appropriate classification results. The preprocessing is performed on a real-world breast cancer dataset from the Reza Radiation Oncology Center in Mashhad, which has varied features and a large percentage of null values, and the results are reported in this article. To evaluate the impact of the preprocessing steps on the results of classification algorithms, the case study was divided into 3 experiments: (1) breast cancer recurrence prediction without data preprocessing; (2) recurrence prediction after error removal; and (3) recurrence prediction after error removal and filling of null values. In each experiment, dimensionality reduction techniques are then used to select a suitable subset of features for the problem at hand. The recurrence prediction models are constructed using 3 widely used classification algorithms, namely, naïve Bayes, k-nearest neighbor, and sequential minimal optimization.
The experiments are evaluated in terms of accuracy, sensitivity, precision, F-measure, and G-mean. Our results show that recurrence prediction improves significantly after data preprocessing, especially in terms of sensitivity, precision, F-measure, and G-mean.
Keywords: Data preprocessing; breast cancer; classification; data mining techniques; recurrence
Year: 2020 PMID: 32528221 PMCID: PMC7262833 DOI: 10.1177/1176935120917955
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
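The paper ran its experiments in Weka. As a rough, self-contained sketch of the shape of the third experiment (fill null values, then classify), here is a version using scikit-learn stand-ins on synthetic data: GaussianNB for naïve Bayes, KNeighborsClassifier for IBk (k-nearest neighbor), and a linear-kernel SVC (whose training algorithm is SMO) for the SVM. The imputation strategy, feature counts, and data below are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic "recurrence" label
X[rng.random(X.shape) < 0.2] = np.nan    # inject ~20% null values

models = {
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "smo_like_svm": SVC(kernel="linear"),
}

# "Experiment 3" shape: fill nulls (median imputation here), then classify,
# scored with 5-fold cross-validation.
scores = {}
for name, clf in models.items():
    pipe = make_pipeline(SimpleImputer(strategy="median"), clf)
    scores[name] = cross_val_score(pipe, X, y, cv=5).mean()
print(scores)
```

Putting the imputer inside the pipeline matters: it is refit on each training fold, so no statistics leak from the held-out fold.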
Reza Radiation Oncology Center breast cancer data: description of attributes.
| Explanation | Attribute name |
|---|---|
| Patient’s age | Age |
| Age at marriage | marriage |
| Gravidity | g |
| Number of pregnancies | p |
| Number of stillbirths | d |
| Number of live births | al |
| Number of abortions | ab |
| TNM classification of the tumor {t1, t2, t3, t4} | t-tnm |
| TNM classification of the lymph nodes {n0, n1, n2, n3} | n-tnm |
| TNM classification of metastasis {m0, m1} | m-tnm |
| Number of lymph nodes involved | node involve |
| Type of surgery {MRM, BSC} | surgery |
| Does the patient have a family history of cancer? {yes, no} | familyh |
| Family relationship to the relative with cancer {mother, father, etc.} | relation |
| Type of cancer in the patient’s relatives {breast, lung, etc.} | sort |
| Was radiotherapy applied to the patient? {yes, no} | Rt |
| Radiation dose rate | dose |
| Was adjuvant chemotherapy applied to the patient? {yes, no} | Chemotherapy |
| Was neoadjuvant chemotherapy applied to the patient? {yes, no} | Neo |
| Was hormone therapy applied to the patient? {yes, no} | Hormone1 |
| Estrogen receptor status {negative, positive} | er |
| Progesterone receptor status {negative, positive} | pr |
| The latest condition of the patient | lastcon |
| Has the patient’s cancer recurred? {yes, no} | recc |
Abbreviations: MRM, modified radical mastectomy; TNM, tumour, node, metastasis; BSC, breast-conserving surgery.
Figure 1. The steps of the case study in RROC.
RBF indicates radial basis function; RROC, Reza Radiation Oncology Center; SMO, sequential minimal optimization.
Figure 2. The overall procedure of the decision methods.
SMO indicates sequential minimal optimization.
Confusion matrix for a 2-class problem.
| | | Predicted class | |
|---|---|---|---|
| | | Class = recurred | Class = non-recurred |
| Actual class | Class = recurred | True positive (TP) | False negative (FN) |
| | Class = non-recurred | False positive (FP) | True negative (TN) |
TP: The number of samples labeled as recurrence by the physician and correctly predicted by the classification algorithm as recurrence.
FN: The number of samples labeled as recurrence by the physician but wrongly predicted by the classification algorithm as no recurrence.
FP: The number of samples labeled as no recurrence by the physician but wrongly predicted by the classification algorithm as recurrence.
TN: The number of samples labeled as no recurrence by the physician and correctly predicted by the classification algorithm as no recurrence.
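The five evaluation measures follow directly from these four counts; the G-mean is the geometric mean of sensitivity and specificity, its usual definition. A minimal sketch (the counts in the example are made up for illustration, not taken from the paper):

```python
import math

def evaluate(tp, fn, fp, tn):
    """Compute the paper's five measures (as percentages) from a
    2-class confusion matrix."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)          # recall on the recurred class
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    specificity = tn / (tn + fp)          # recall on the non-recurred class
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "accuracy": 100 * accuracy,
        "sensitivity": 100 * sensitivity,
        "precision": 100 * precision,
        "f_measure": 100 * f_measure,
        "g_mean": 100 * g_mean,
    }

m = evaluate(tp=30, fn=10, fp=20, tn=140)
print(m)
```

On a class-imbalanced problem like recurrence prediction, accuracy alone is misleading (predicting "no recurrence" for everyone already scores well), which is why the G-mean and F-measure are reported alongside it.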
Setting the parameters for the prediction models in Weka software.
| Algorithm | Parameter | Value | |
|---|---|---|---|
| IBk | meanSquared | True | |
| | nearestNeighbourSearchAlgorithm | KDTree | |
| Naïve Bayes | useKernelEstimator | False | True |
| | useSupervisedDiscretization | False | False |
| SMO | No change | | |
Abbreviations: SMO, sequential minimal optimization.
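For readers replicating outside Weka, the IBk setting above maps roughly onto scikit-learn's k-NN classifier; the names below are scikit-learn's, not Weka's, and Weka's meanSquared flag (the error criterion IBk uses when cross-validating k) has no direct scikit-learn counterpart and is omitted.

```python
from sklearn.neighbors import KNeighborsClassifier

# Rough analogue of the Weka IBk configuration in the table:
# nearestNeighbourSearchAlgorithm = KDTree  ->  algorithm="kd_tree".
knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")

# Tiny illustrative fit/predict to show the configured model is usable.
X = [[0.0], [0.5], [1.0], [2.0], [2.5], [3.0]]
y = [0, 0, 0, 1, 1, 1]
knn.fit(X, y)
pred = knn.predict([[0.1]])[0]  # nearest neighbors are mostly class 0
```

Note also that Weka's useKernelEstimator=True swaps naïve Bayes's per-feature Gaussian for a kernel density estimate; scikit-learn's GaussianNB has no such switch, so that choice would have to be rebuilt manually (e.g., with per-class KernelDensity models).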
Evaluating classification algorithms before data preprocessing.
| Classifier | Dimension | Accuracy | Sensitivity | Precision | F-measure | G-mean |
|---|---|---|---|---|---|---|
| IBk | (manual)25 | 92.66 | 9.74 | 10.75 | 10.22 | 30.64 |
| | (manual)59 | 92.26 | 8.97 | 9.11 | 9.04 | 29.35 |
| | (manual)79 | 91.56 | 27.82 | 18.24 | 22.03 | 51.25 |
| | (manual)93 | 89.87 | | | | |
| | (weka)22 | | 3.85 | 13.76 | 6.01 | 19.51 |
| | (weka)93 | 89.87 | | | | |
| Naïve Bayes | (manual)25 | | 8.33 | | 12.13 | 28.68 |
| | (manual)59 | 93.98 | 16.15 | 22.18 | 18.69 | 39.68 |
| | (manual)79 | 93.51 | 16.67 | 19.64 | 18.03 | 40.20 |
| | (manual)93 | 86.37 | 32.69 | 11.53 | 17.05 | 53.87 |
| | (weka)22 | 85.74 | | 13.60 | | |
| | (weka)93 | 86.37 | 32.69 | 11.53 | 17.05 | 53.87 |
| SMO | (manual)25 | 95.97 | 11.54 | | 19.69 | 33.93 |
| | (manual)59 | 95.77 | 11.03 | 53.42 | 18.28 | 33.13 |
| | (manual)79 | 95.37 | 18.85 | 41.18 | 25.86 | 43.15 |
| | (manual)93 | | | 64.47 | | |
| | (weka)22 | 95.88 | 14.74 | 57.79 | 23.49 | 38.30 |
| | (weka)93 | | | 64.47 | | |
Abbreviations: SMO, sequential minimal optimization.
Evaluating classification algorithms after filling null values.
| Classifier | Dimension | Accuracy | Sensitivity | Precision | F-measure | G-mean |
|---|---|---|---|---|---|---|
| IBk | (manual)25 | | | | | |
| | (manual)59 | 96.37 | 34.10 | 64.41 | 44.59 | 58.15 |
| | (manual)79 | 95.92 | 42.95 | 52.92 | 47.42 | 64.97 |
| | (manual)93 | 95.60 | 49.36 | 48.67 | 49.01 | 69.43 |
| | (weka)22 | 97.51 | 58.08 | 78.24 | 66.67 | 75.93 |
| | (weka)93 | 95.63 | 48.97 | 49.04 | 49.01 | 69.18 |
| Naïve Bayes | (manual)25 | 96.88 | 74.49 | 61.22 | 67.21 | 85.39 |
| | (manual)59 | | | | | |
| | (manual)79 | 98.59 | 73.08 | 92.38 | 81.60 | 85.37 |
| | (manual)93 | 97.39 | 70.90 | 69.13 | 70.00 | 83.60 |
| | (weka)22 | 97.78 | 74.10 | 74.10 | 74.10 | 85.58 |
| | (weka)93 | 97.44 | 72.05 | 69.47 | 70.74 | 84.28 |
| SMO | (manual)25 | | | 99.85 | | |
| | (manual)59 | | | 99.85 | | |
| | (manual)79 | | 84.49 | | 91.59 | 91.92 |
| | (manual)93 | 99.30 | 83.85 | 99.85 | 91.15 | 91.56 |
| | (weka)22 | 99.29 | | 98.65 | 91.10 | 91.96 |
| | (weka)93 | 99.29 | 83.72 | 99.69 | 91.01 | 91.49 |
Abbreviations: SMO, sequential minimal optimization.
Figure 3. Comparing the accuracy of the recurrence prediction models before and after preprocessing.
SMO indicates sequential minimal optimization.
Figure 4. Comparing the sensitivity of the recurrence prediction models before and after preprocessing.
SMO indicates sequential minimal optimization.
Figure 5. Comparing the precision of the recurrence prediction models before and after preprocessing.
SMO indicates sequential minimal optimization.
Figure 6. Comparing the F-measures of the recurrence prediction models before and after preprocessing.
SMO indicates sequential minimal optimization.
Figure 7. Comparing the G-mean charts of the recurrence prediction models before and after preprocessing.
SMO indicates sequential minimal optimization.
Comparing the results of 3 steps of the case study in terms of the sensitivity, precision, and G-mean measures.
| Measure | Preprocessing stage | Mean | Standard deviation | F | P value |
|---|---|---|---|---|---|
| Sensitivity | Before preprocessing | 21.56 | 12.27 | 82.88 | <.001 |
| | Basic preprocessing | 18.46 | 11.29 | | |
| | Final preprocessing | 69.57 | 16.03 | | |
| Precision | Before preprocessing | 29.88 | 21.51 | 34.19 | <.001 |
| | Basic preprocessing | 29.89 | 21.80 | | |
| | Final preprocessing | 79.64 | 19.11 | | |
| G-mean | Before preprocessing | 43.34 | 12.60 | 69.85 | <.001 |
| | Basic preprocessing | 40.09 | 12.69 | | |
| | Final preprocessing | 82.48 | 10.50 | | |
Comparing the 3 steps of the case study in terms of the F-measure and accuracy measures.
| Measure | Preprocessing stage | Mean | Standard deviation | F | P value |
|---|---|---|---|---|---|
| Accuracy | Before preprocessing | 92.63 | 3.59 | 31.362 | <.001 |
| | Basic preprocessing | 93.35 | 2.68 | | |
| | Final preprocessing | 97.88 | 1.37 | | |
| F-measure | Before preprocessing | 20.17 | 8.41 | 80.126 | <.001 |
| | Basic preprocessing | 18.74 | 8.48 | | |
| | Final preprocessing | 73.85 | 17.04 | | |
Evaluating classification algorithms after error removal.
| Classifier | Dimension | Accuracy | Sensitivity | Precision | F-measure | G-mean |
|---|---|---|---|---|---|---|
| IBk | (manual)25 | 93.07 | 7.69 | 10.00 | 8.70 | 27.30 |
| | (manual)59 | 92.46 | 8.21 | 8.90 | 8.54 | 28.10 |
| | (manual)79 | 91.91 | 26.79 | | 22.13 | 50.41 |
| | (manual)93 | 90.63 | | 17.73 | | |
| | (weka)22 | | 3.72 | 14.36 | 5.91 | 19.19 |
| | (weka)93 | 90.63 | 32.44 | 17.68 | 22.89 | 54.99 |
| Naïve Bayes | (manual)25 | | 6.54 | | 10.47 | 25.47 |
| | (manual)59 | 94.30 | 13.08 | 22.17 | 16.45 | 35.79 |
| | (manual)79 | 94.06 | 15.26 | 22.12 | 18.06 | 38.59 |
| | (manual)93 | 88.18 | 36.28 | 14.62 | 20.84 | 57.30 |
| | (weka)22 | 92.00 | 16.54 | 13.81 | 15.05 | 39.72 |
| | (weka)93 | 87.85 | | 14.81 | | |
| SMO | (manual)25 | 96.01 | 11.54 | | 19.87 | 33.93 |
| | (manual)59 | 95.99 | 11.54 | 69.23 | 19.78 | 33.93 |
| | (manual)79 | 95.57 | 16.15 | 45.32 | 23.82 | 40.02 |
| | (manual)93 | 96.15 | 24.74 | 62.87 | 35.51 | 49.58 |
| | (weka)22 | 95.24 | 5.38 | 24.71 | 8.84 | 23.12 |
| | (weka)93 | | | 63.14 | | |
Abbreviations: SMO, sequential minimal optimization.