Muhammad Fazal Ijaz, Muhammad Attique, Youngdoo Son.
Abstract
Globally, cervical cancer remains among the most prevalent cancers in females. Hence, it is necessary to identify the important risk factors of cervical cancer in order to classify potential patients. The present work proposes a cervical cancer prediction model (CCPM) that offers early prediction of cervical cancer using risk factors as inputs. The CCPM first removes outliers using outlier detection methods such as density-based spatial clustering of applications with noise (DBSCAN) and isolation forest (iForest), and then balances the dataset by increasing the number of minority cases, for example, through the synthetic minority over-sampling technique (SMOTE) and SMOTE with Tomek link (SMOTETomek). Finally, it employs random forest (RF) as a classifier. Thus, the CCPM covers four scenarios: (1) DBSCAN + SMOTETomek + RF, (2) DBSCAN + SMOTE + RF, (3) iForest + SMOTETomek + RF, and (4) iForest + SMOTE + RF. A dataset of 858 potential patients was used to validate the performance of the proposed method. We found that the combinations of iForest with SMOTE and iForest with SMOTETomek performed better than those of DBSCAN with SMOTE and DBSCAN with SMOTETomek. We also observed that RF performed the best among several popular machine learning classifiers. Furthermore, the proposed CCPM showed better accuracy than previously proposed methods for forecasting cervical cancer. In addition, a mobile application that collects cervical cancer risk factor data and provides results from the CCPM was developed to support prompt and proper action at the initial stage of cervical cancer.
Keywords: artificial intelligence; cancer; cervical cancer; digital health; imbalanced data analysis; machine learning; medical information systems; outlier detection
Year: 2020 PMID: 32429090 PMCID: PMC7284557 DOI: 10.3390/s20102809
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Dataset features, number of entries, and missing values.
| Number | Attribute Name | Type | Missing Values |
|---|---|---|---|
| 1 | Age | Int | 0 |
| 2 | Number of sexual partners | Int | 26 |
| 3 | First sexual intercourse (age) | Int | 7 |
| 4 | Num of pregnancies | Int | 56 |
| 5 | Smokes | bool | 13 |
| 6 | Smokes (years) | bool | 13 |
| 7 | Smokes (packs/year) | bool | 13 |
| 8 | Hormonal Contraceptives | bool | 108 |
| 9 | Hormonal Contraceptives (years) | Int | 108 |
| 10 | Intrauterine Device (IUD) | bool | 117 |
| 11 | IUD (years) | Int | 117 |
| 12 | Sexually Transmitted Disease (STD) | bool | 105 |
| 13 | STDs (number) | Int | 105 |
| 14 | STDs: condylomatosis | bool | 105 |
| 15 | STDs: cervical condylomatosis | bool | 105 |
| 16 | STDs: vaginal condylomatosis | bool | 105 |
| 17 | STDs: vulvo-perineal condylomatosis | bool | 105 |
| 18 | STDs: syphilis | bool | 105 |
| 19 | STDs: pelvic inflammatory disease | bool | 105 |
| 20 | STDs: genital herpes | bool | 105 |
| 21 | STDs: molluscum contagiosum | bool | 105 |
| 22 | STDs: AIDS | bool | 105 |
| 23 | STDs: HIV | bool | 105 |
| 24 | STDs: Hepatitis B | bool | 105 |
| 25 | STDs: HPV | bool | 105 |
| 26 | STDs: Number of diagnosis | Int | 0 |
| 27 | STDs: Time since first diagnosis | Int | 787 |
| 28 | STDs: Time since last diagnosis | Int | 787 |
| 29 | Dx: Cancer | bool | 0 |
| 30 | Dx: Cervical Intraepithelial Neoplasia (CIN) | bool | 0 |
| 31 | Dx: Human Papillomavirus (HPV) | bool | 0 |
| 32 | Diagnosis: Dx | bool | 0 |
| 33 | Hinselmann: target variable | bool | |
| 34 | Schiller: target variable | bool | |
| 35 | Cytology: target variable | bool | |
| 36 | Biopsy: target variable | bool | |
Figure 1. Prediction model for cervical cancer.
Performance metrics for the classification model.
| Performance Metric | Formula |
|---|---|
| Precision | TP/(TP + FP) |
| Recall/Sensitivity | TP/(TP + FN) |
| Specificity/True Negative Rate | TN/(TN + FP) |
| F1 Score | 2 × (Precision × Recall)/(Precision + Recall) |
| Accuracy | (TP + TN)/(TP + TN + FP + FN) |
Different outcomes of two-class prediction.
| | Predicted as “Yes” | Predicted as “No” |
|---|---|---|
| Actual “Yes” | True Positive (TP) | False Negative (FN) |
| Actual “No” | False Positive (FP) | True Negative (TN) |
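The metrics reported in the result tables can be computed directly from the four outcome counts above. The following is a minimal sketch, assuming the standard definitions (precision = TP/(TP + FP), recall = TP/(TP + FN), accuracy = (TP + TN)/(TP + TN + FP + FN)); the class and function names are hypothetical, and the example counts are made up for illustration rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for a two-class problem, named as in the table above."""
    tp: int  # actual "Yes", predicted "Yes"
    fn: int  # actual "Yes", predicted "No"
    fp: int  # actual "No", predicted "Yes"
    tn: int  # actual "No", predicted "No"


def classification_metrics(c: ConfusionCounts) -> dict:
    """Return the five metrics as fractions in [0, 1].

    No guard against empty denominators: the sketch assumes every class
    is both present and predicted at least once.
    """
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    f1 = 2 * (precision * recall) / (precision + recall)
    accuracy = (c.tp + c.tn) / (c.tp + c.tn + c.fp + c.fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }


# Made-up counts for illustration only; they do not come from the paper.
example = ConfusionCounts(tp=90, fn=10, fp=5, tn=95)
metrics = classification_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.3f}%")
```

Multiplying by 100, as in the print loop, gives values on the same percentage scale as the result tables.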
Results of Chi-square.
| No | Feature Name | Feature Score |
|---|---|---|
| 1 | Smokes (years) | 421.4689 |
| 2 | Hormonal Contraceptives (years) | 246.6243 |
| 3 | Sexually Transmitted Diseases (STDs) (number) | 87.28867 |
| 4 | STDs: genital herpes | 43.73654 |
| 5 | STDs: HIV | 29.35086 |
| 6 | STDs: Number of diagnosis | 21.74795 |
| 7 | Dx: Cancer | 21.74795 |
| 8 | Dx: cervical intraepithelial neoplasia (CIN) | 20.71644 |
| 9 | Dx: human papillomavirus (HPV) | 12.64184 |
| 10 | Dx | 12.44904 |
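To make the chi-square scores concrete, the sketch below scores one non-negative feature against a binary target by comparing the per-class feature totals with the totals expected if the feature were independent of the class. This is one common recipe for chi-square feature ranking; whether the authors used this exact variant is an assumption, and the function name and toy values are hypothetical.

```python
from collections import defaultdict


def chi_square_score(feature_values, labels):
    """Chi-square score of one non-negative feature against class labels.

    observed[c] = sum of the feature over samples of class c
    expected[c] = (share of samples in class c) * (total of the feature)
    score       = sum over classes of (observed - expected)^2 / expected
    """
    if len(feature_values) != len(labels):
        raise ValueError("feature_values and labels must have equal length")
    if any(v < 0 for v in feature_values):
        raise ValueError("chi-square scoring needs non-negative features")

    observed = defaultdict(float)
    class_counts = defaultdict(int)
    for value, label in zip(feature_values, labels):
        observed[label] += value
        class_counts[label] += 1

    total = sum(feature_values)
    n_samples = len(labels)
    score = 0.0
    for label, count in class_counts.items():
        expected = total * count / n_samples
        if expected > 0:  # skip classes with no expected mass
            score += (observed[label] - expected) ** 2 / expected
    return score


# Toy data, invented for illustration; not values from the dataset.
target = [1, 0, 1, 0]
features = {
    "informative": [1, 0, 3, 0],  # concentrated in the positive class
    "uninformative": [2, 2, 2, 2],  # identical in both classes
}
scores = {name: chi_square_score(vals, target) for name, vals in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

A higher score means the feature's mass is distributed across the classes less like chance would predict, which is why the top-ranked features are kept.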
Figure 2. Optimal eps value for DBSCAN.
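A common heuristic for choosing eps is to plot each point's distance to its k-th nearest neighbour in ascending order and read eps off the bend of the curve; the source does not state that the figure was produced this way, so treat the sketch below as an assumption. It also shows how DBSCAN flags outliers: a point is noise when it is neither a core point nor within eps of one. Function names, the toy points, and the chosen `eps` and `min_samples` are all hypothetical.

```python
import math


def k_distances(points, k):
    """Sorted distance from every point to its k-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result)


def dbscan_noise(points, eps, min_samples):
    """Indices of the points that DBSCAN would label as noise."""
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # Core points have at least min_samples points in their neighbourhood.
    core = {i for i, nb in enumerate(neighbours) if len(nb) >= min_samples}
    # Noise: not a core point and not within eps of any core point.
    return [
        i
        for i, nb in enumerate(neighbours)
        if i not in core and not any(j in core for j in nb)
    ]


# Toy data: a tight square of four points plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
curve = k_distances(points, k=2)
noise = dbscan_noise(points, eps=1.5, min_samples=3)
print(curve)
print(noise)
```

In this toy case the sorted curve is flat at 1.0 and then jumps sharply, so any eps between those two levels separates the cluster from the outlier.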
Results of synthetic minority over-sampling technique (SMOTE) and SMOTETomek.
| Before SMOTE: Minority (%) | Before SMOTE: Majority (%) | After SMOTE: Minority (%) | After SMOTE: Majority (%) | Before SMOTETomek: Minority (%) | Before SMOTETomek: Majority (%) | After SMOTETomek: Minority (%) | After SMOTETomek: Majority (%) |
|---|---|---|---|---|---|---|---|
| 55 (6.41%) | 803 (93.59%) | 803 (50.00%) | 803 (50.00%) | 55 (6.41%) | 803 (93.59%) | 803 (50.00%) | 803 (50.00%) |
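Raising 55 minority cases to 803 requires 748 synthetic samples. The sketch below illustrates the core SMOTE idea under simplifying assumptions: each synthetic point is placed at a random position on the line segment between a minority sample and one of its k nearest minority neighbours. The function name, the random two-feature data, and the default `k=5` are hypothetical, and the Tomek-link cleaning step of SMOTETomek is not shown.

```python
import math
import random


def smote_oversample(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Invented two-feature minority class of 55 points; not the paper's data.
data_rng = random.Random(42)
minority = [(data_rng.random(), data_rng.random()) for _ in range(55)]
n_majority = 803

synthetic = smote_oversample(minority, n_majority - len(minority))
balanced_minority = minority + synthetic
print(len(synthetic), len(balanced_minority))
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region already occupied by the minority class instead of duplicating existing rows.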
Performance evaluation results based on density-based spatial clustering of applications with noise (DBSCAN) and SMOTE for Biopsy.
| Method | Precision (%) | Recall/Sensitivity (%) | Specificity (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| SVM | 92.797 | 91.666 | 93.908 | 92.768 | 92.768 |
| MLP | 96.049 | 97.549 | 94.416 | 96.000 | 96.001 |
| Logistic Regression | 94.020 | 93.627 | 94.416 | 94.015 | 94.014 |
| Naïve Bayes | 93.666 | 96.568 | 90.355 | 93.506 | 93.516 |
| KNN | 94.289 | 98.039 | 89.847 | 94.001 | 94.014 |
| Proposed CCPM (Random Forest) | 97.025 | 98.039 | 95.939 | 97.006 | 97.007 |
Performance evaluation results based on DBSCAN and SMOTETomek for Biopsy.
| Method | Precision (%) | Recall/Sensitivity (%) | Specificity (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| SVM | 94.692 | 93.782 | 95.544 | 94.682 | 94.683 |
| MLP | 95.697 | 95.854 | 95.544 | 95.696 | 95.696 |
| Logistic Regression | 95.697 | 95.854 | 95.544 | 95.696 | 95.696 |
| Naïve Bayes | 93.587 | 96.373 | 90.594 | 93.416 | 93.417 |
| KNN | 94.430 | 94.300 | 94.554 | 94.430 | 94.430 |
| Proposed CCPM (Random Forest) | 96.720 | 97.409 | 96.039 | 96.720 | 96.708 |
Performance evaluation results based on iForest and SMOTE for Biopsy.
| Method | Precision (%) | Recall/Sensitivity (%) | Specificity (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| SVM | 95.501 | 90.957 | 99.456 | 95.154 | 95.161 |
| MLP | 96.432 | 93.085 | 99.456 | 96.233 | 96.236 |
| Logistic Regression | 96.131 | 93.085 | 98.913 | 95.965 | 95.967 |
| Naïve Bayes | 95.656 | 92.021 | 98.913 | 95.426 | 95.430 |
| KNN | 98.668 | 97.872 | 99.456 | 98.655 | 98.655 |
| Proposed CCPM (Random Forest) | 98.924 | 98.936 | 98.130 | 98.924 | 98.925 |
Performance evaluation results based on iForest and SMOTETomek for Biopsy.
| Method | Precision (%) | Recall/Sensitivity (%) | Specificity (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| SVM | 94.726 | 91.935 | 97.282 | 94.591 | 94.594 |
| MLP | 96.845 | 94.623 | 98.913 | 96.755 | 96.756 |
| Logistic Regression | 94.853 | 90.860 | 98.369 | 94.587 | 94.594 |
| Naïve Bayes | 94.619 | 90.322 | 98.369 | 94.316 | 94.324 |
| KNN | 97.302 | 96.774 | 97.826 | 97.297 | 97.297 |
| Proposed CCPM (Random Forest) | 98.918 | 98.924 | 98.913 | 98.918 | 98.919 |
Performance evaluation results based on DBSCAN and SMOTE for Schiller.
| Method | Precision (%) | Recall/Sensitivity (%) | Specificity (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| SVM | 95.582 | 96.292 | 94.807 | 95.572 | 95.572 |
| MLP | 97.759 | 96.825 | 95.710 | 96.759 | 97.662 |
| Logistic Regression | 95.297 | 91.594 | 98.498 | 95.096 | 95.106 |
| Naïve Bayes | 93.165 | 90.963 | 96.033 | 93.589 | 93.575 |
| KNN | 92.205 | 93.440 | 91.119 | 92.247 | 92.244 |
| Proposed CCPM (Random Forest) | 98.216 | 99.208 | 99.487 | 99.217 | 99.217 |
Performance evaluation results based on DBSCAN and SMOTETomek for Schiller.
| Method | Precision (%) | Recall/Sensitivity (%) | Specificity (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| SVM | 95.393 | 93.298 | 97.368 | 95.393 | 95.393 |
| MLP | 97.912 | 97.883 | 97.938 | 97.762 | 97.911 |
| Logistic Regression | 93.641 | 87.891 | 94.680 | 93.188 | 93.105 |
| Naïve Bayes | 93.587 | 96.373 | 90.594 | 93.416 | 93.417 |
| KNN | 94.580 | 97.387 | 91.150 | 94.261 | 91.260 |
| Proposed CCPM (Random Forest) | 99.509 | 99.484 | 99.463 | 99.474 | 99.479 |
Performance evaluation results based on iForest and SMOTE for Schiller.
| Method | Precision (%) | Recall/Sensitivity (%) | Specificity (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| SVM | 96.613 | 94.623 | 97.563 | 96.141 | 96.143 |
| MLP | 98.771 | 97.291 | 98.677 | 98.774 | 97.724 |
| Logistic Regression | 94.497 | 91.714 | 96.938 | 94.363 | 94.373 |
| Naïve Bayes | 93.048 | 92.746 | 93.343 | 93.098 | 93.098 |
| KNN | 93.317 | 94.514 | 91.344 | 93.881 | 93.881 |
| Proposed CCPM (Random Forest) | 98.714 | 97.314 | 100.00 | 98.714 | 98.714 |
Performance evaluation results based on iForest and SMOTETomek for Schiller.
| Method | Precision (%) | Recall/Sensitivity (%) | Specificity (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| SVM | 94.011 | 93.625 | 94.329 | 94.054 | 94.010 |
| MLP | 97.369 | 96.808 | 96.913 | 97.509 | 97.164 |
| Logistic Regression | 94.142 | 91.635 | 96.681 | 94.073 | 94.072 |
| Naïve Bayes | 93.085 | 92.000 | 95.172 | 93.866 | 93.866 |
| KNN | 94.762 | 94.707 | 91.344 | 93.072 | 93.072 |
| Proposed CCPM (Random Forest) | 98.463 | 98.907 | 98.074 | 98.499 | 98.495 |
Performance evaluation results based on DBSCAN and SMOTE for Hinselmann.
| Method | Precision (%) | Recall/Sensitivity (%) | Specificity (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| SVM | 98.500 | 98.492 | 98.507 | 98.500 | 98.500 |
| MLP | 98.759 | 100.00 | 98.760 | 98.759 | 98.759 |
| Logistic Regression | 97.796 | 98.994 | 96.568 | 97.766 | 97.766 |
| Naïve Bayes | 97.165 | 98.963 | 95.433 | 97.089 | 97.087 |
| KNN | 96.905 | 98.440 | 95.433 | 96.847 | 96.844 |
| Proposed CCPM (Random Forest) | 99.016 | 100.00 | 97.948 | 98.997 | 98.997 |
Performance evaluation results based on DBSCAN and SMOTETomek for Hinselmann.
| Method | Precision (%) | Recall/Sensitivity (%) | Specificity (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| SVM | 99.004 | 99.512 | 98.461 | 98.999 | 98.053 |
| MLP | 98.792 | 100.00 | 97.512 | 98.762 | 98.620 |
| Logistic Regression | 98.034 | 98.989 | 97.073 | 98.015 | 98.015 |
| Naïve Bayes | 93.587 | 96.373 | 90.594 | 93.416 | 93.417 |
| KNN | 97.580 | 98.507 | 96.550 | 97.561 | 97.560 |
| Proposed CCPM (Random Forest) | 99.509 | 100.00 | 98.963 | 99.504 | 99.504 |
Performance evaluation results based on iForest and SMOTE for Hinselmann.
| Method | Precision (%) | Recall/Sensitivity (%) | Specificity (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| SVM | 99.514 | 97.012 | 97.024 | 98.509 | 98.530 |
| MLP | 98.771 | 96.050 | 98.677 | 98.774 | 98.270 |
| Logistic Regression | 98.537 | 97.014 | 98.058 | 98.533 | 98.533 |
| Naïve Bayes | 98.048 | 98.238 | 96.172 | 98.048 | 98.048 |
| KNN | 98.317 | 99.514 | 97.044 | 98.288 | 98.288 |
| Proposed CCPM (Random Forest) | 99.514 | 99.014 | 100.00 | 99.504 | 99.504 |
Performance evaluation results based on iForest and SMOTETomek for Hinselmann.
| Method | Precision (%) | Recall/Sensitivity (%) | Specificity (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| SVM | 97.755 | 96.431 | 99.481 | 99.754 | 98.754 |
| MLP | 97.369 | 97.326 | 98.913 | 99.509 | 98.509 |
| Logistic Regression | 98.529 | 99.065 | 97.927 | 98.525 | 98.525 |
| Naïve Bayes | 97.085 | 98.000 | 96.172 | 97.066 | 97.066 |
| KNN | 97.782 | 98.507 | 97.044 | 97.772 | 97.772 |
| Proposed CCPM (Random Forest) | 98.514 | 100.00 | 98.974 | 99.509 | 99.509 |
Performance evaluation results based on DBSCAN and SMOTE for Cytology.
| Method | Precision (%) | Recall/Sensitivity (%) | Specificity (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| SVM | 94.475 | 99.519 | 87.807 | 93.852 | 93.872 |
| MLP | 91.759 | 90.465 | 92.710 | 94.759 | 94.682 |
| Logistic Regression | 84.999 | 80.000 | 89.393 | 84.637 | 84.635 |
| Naïve Bayes | 80.655 | 71.065 | 88.345 | 79.792 | 79.900 |
| KNN | 94.002 | 99.000 | 87.878 | 93.467 | 93.521 |
| Proposed CCPM (Random Forest) | 97.225 | 96.428 | 97.989 | 97.215 | 97.217 |
Performance evaluation results based on DBSCAN and SMOTETomek for Cytology.
| Method | Precision (%) | Recall/Sensitivity (%) | Specificity (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| SVM | 94.413 | 98.098 | 90.000 | 94.165 | 94.187 |
| MLP | 99.452 | 99.083 | 91.052 | 95.182 | 95.175 |
| Logistic Regression | 86.326 | 77.860 | 93.137 | 85.490 | 85.606 |
| Naïve Bayes | 81.145 | 84.882 | 85.912 | 80.123 | 80.128 |
| KNN | 91.111 | 90.952 | 80.888 | 89.560 | 89.620 |
| Proposed CCPM (Random Forest) | 97.228 | 97.428 | 97.989 | 97.715 | 97.716 |
Performance evaluation results based on iForest and SMOTE for Cytology.
| Method | Precision (%) | Recall/Sensitivity (%) | Specificity (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| SVM | 94.333 | 97.963 | 90.293 | 94.041 | 94.043 |
| MLP | 93.771 | 94.291 | 93.677 | 91.774 | 91.724 |
| Logistic Regression | 83.098 | 79.487 | 87.980 | 84.671 | 84.641 |
| Naïve Bayes | 81.635 | 77.830 | 85.106 | 81.261 | 81.250 |
| KNN | 89.497 | 97.129 | 78.971 | 88.242 | 88.366 |
| Proposed CCPM (Random Forest) | 97.518 | 97.448 | 97.584 | 97.518 | 97.514 |
Performance evaluation results based on iForest and SMOTETomek for Cytology.
| Method | Precision (%) | Recall/Sensitivity (%) | Specificity (%) | F1 Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| SVM | 93.231 | 99.038 | 85.556 | 92.512 | 92.537 |
| MLP | 91.998 | 92.788 | 94.913 | 92.523 | 92.234 |
| Logistic Regression | 85.784 | 77.669 | 92.422 | 84.093 | 84.900 |
| Naïve Bayes | 93.085 | 92.000 | 95.172 | 93.866 | 93.866 |
| KNN | 89.321 | 93.333 | 84.532 | 89.071 | 89.108 |
| Proposed CCPM (Random Forest) | 97.463 | 97.907 | 97.074 | 97.499 | 97.495 |
Comparison of biopsy test results of CCPM with past studies.
| Studies | Method | No of Features | Sensitivity (%) | Specificity (%) | Accuracy (%) |
|---|---|---|---|---|---|
| Wu and Zhu | SVM-RFE | 6 | 100 | 87.32 | 92.39 |
| Wu and Zhu | SVM-RFE | 18 | 100 | 90.05 | 94.03 |
| Wu and Zhu | SVM-PCA | 8 | 100 | 89.09 | 93.45 |
| Wu and Zhu | SVM-PCA | 11 | 100 | 90.05 | 94.03 |
| Abdoh et al. | Smote-RF-RFE | 6 | 94.94 | 95.52 | 95.23 |
| Abdoh et al. | Smote-RF-RFE | 18 | 94.42 | 97.26 | 95.87 |
| Abdoh et al. | Smote-RF-PCA | 8 | 93.77 | 97.26 | 95.55 |
| Abdoh et al. | Smote-RF-PCA | 11 | 94.16 | 97.76 | 95.74 |
| Present work | DBSCAN + SMOTETomek + RF | 10 | 97.409 | 96.039 | 96.708 |
| Present work | DBSCAN + SMOTE + RF | 10 | 98.039 | 95.939 | 97.007 |
| Present work | iForest + SMOTETomek + RF | 10 | 98.924 | 98.913 | 98.919 |
| Present work | iForest + SMOTE + RF | 10 | 98.936 | 98.130 | 98.925 |
Comparison of Schiller test results of CCPM with past studies.
| Studies | Method | No of Features | Sensitivity (%) | Specificity (%) | Accuracy (%) |
|---|---|---|---|---|---|
| Wu and Zhu | SVM-RFE | 7 | 98.73 | 84.46 | 90.18 |
| Wu and Zhu | SVM-RFE | 18 | 98.73 | 84.63 | 90.18 |
| Wu and Zhu | SVM-PCA | 6 | 98.99 | 83.14 | 89.49 |
| Wu and Zhu | SVM-PCA | 12 | 98.99 | 84.30 | 90.18 |
| Abdoh et al. | Smote-RF-RFE | 7 | 93.24 | 90.31 | 91.73 |
| Abdoh et al. | Smote-RF-RFE | 18 | 93.51 | 92.35 | 92.91 |
| Abdoh et al. | Smote-RF-PCA | 6 | 92.70 | 96.17 | 94.49 |
| Abdoh et al. | Smote-RF-PCA | 12 | 92.03 | 97.58 | 94.88 |
| Present work | DBSCAN + SMOTETomek + RF | 10 | 99.48 | 99.46 | 99.48 |
| Present work | DBSCAN + SMOTE + RF | 10 | 99.20 | 99.49 | 99.22 |
| Present work | iForest + SMOTETomek + RF | 10 | 98.91 | 98.07 | 98.50 |
| Present work | iForest + SMOTE + RF | 10 | 97.31 | 100 | 98.71 |
Comparison of Hinselmann test results of CCPM with past studies.
| Studies | Method | No of Features | Sensitivity (%) | Specificity (%) | Accuracy (%) |
|---|---|---|---|---|---|
| Wu and Zhu | SVM-RFE | 5 | 100 | 84.63 | 90.77 |
| Wu and Zhu | SVM-RFE | 15 | 100 | 84.49 | 93.69 |
| Wu and Zhu | SVM-PCA | 5 | 100 | 84.63 | 92.09 |
| Wu and Zhu | SVM-PCA | 11 | 100 | 84.65 | 93.79 |
| Abdoh et al. | Smote-RF-RFE | 5 | 96.52 | 93.80 | 95.14 |
| Abdoh et al. | Smote-RF-RFE | 15 | 96.65 | 95.14 | 95.88 |
| Abdoh et al. | Smote-RF-PCA | 5 | 96.52 | 98.30 | 97.42 |
| Abdoh et al. | Smote-RF-PCA | 11 | 96.52 | 98.42 | 97.48 |
| Present work | DBSCAN + SMOTETomek + RF | 10 | 100 | 98.96 | 99.50 |
| Present work | DBSCAN + SMOTE + RF | 10 | 100 | 97.95 | 99.01 |
| Present work | iForest + SMOTETomek + RF | 10 | 100 | 98.97 | 99.50 |
| Present work | iForest + SMOTE + RF | 10 | 99.01 | 100 | 99.50 |
Comparison of Cytology test results of CCPM with past studies.
| Studies | Method | No of Features | Sensitivity (%) | Specificity (%) | Accuracy (%) |
|---|---|---|---|---|---|
| Wu and Zhu | SVM-RFE | 8 | 100 | 84.42 | 90.65 |
| Wu and Zhu | SVM-RFE | 15 | 100 | 87.28 | 92.37 |
| Wu and Zhu | SVM-PCA | 8 | 100 | 86.65 | 91.98 |
| Wu and Zhu | SVM-PCA | 11 | 100 | 87.44 | 92.46 |
| Abdoh et al. | Smote-RF-RFE | 8 | 87.37 | 97.54 | 92.52 |
| Abdoh et al. | Smote-RF-RFE | 15 | 93.56 | 98.15 | 95.89 |
| Abdoh et al. | Smote-RF-PCA | 8 | 95.58 | 97.17 | 96.39 |
| Abdoh et al. | Smote-RF-PCA | 11 | 95.32 | 98.40 | 96.89 |
| Present work | DBSCAN + SMOTETomek + RF | 10 | 97.43 | 98.01 | 97.72 |
| Present work | DBSCAN + SMOTE + RF | 10 | 96.43 | 98.01 | 97.22 |
| Present work | iForest + SMOTETomek + RF | 10 | 97.91 | 97.08 | 97.50 |
| Present work | iForest + SMOTE + RF | 10 | 97.45 | 97.58 | 97.51 |
Time and Space Complexities of Machine Learning Algorithms.
| Model Name | Time Complexity | Space Complexity |
|---|---|---|
| KNN | | |
| Logistic Regression | | |
| SVM | O( | O( |
| Naive Bayes | O( | O( |
| Random Forest | O( | O( |
Figure 3. Cervical Cancer Prediction Model architecture framework.
Figure 4. (a) Interface of the mobile application used to gather the user’s data. (b) Prediction result interface of the CCPM mobile application.