Md Martuza Ahamad1, Sakifa Aktar1, Md Jamal Uddin1, Tasnia Rahman2, Salem A Alyami3, Samer Al-Ashhab3, Hanan Fawaz Akhdar4, Akm Azad5,6, Mohammad Ali Moni7.
Abstract
Ovarian cancer is one of the most common cancers in women, and at present no drug therapy can reliably cure it. Early-stage detection, however, could improve patients' life expectancy. The main aim of this work is to apply machine learning models together with statistical methods to clinical data from 349 patients to support predictive analytics for early diagnosis. In the statistical analysis, Student's t-test and the log fold change between the two groups are used to identify significant blood biomarkers. In addition, a set of machine learning models, including Random Forest (RF), Support Vector Machine (SVM), Decision Tree (DT), Extreme Gradient Boosting Machine (XGBoost), Logistic Regression (LR), Gradient Boosting Machine (GBM) and Light Gradient Boosting Machine (LGBM), is used to build classifiers that separate patients with benign ovarian tumors from patients with ovarian cancer. Both analyses identified the serum markers carbohydrate antigen 125, carbohydrate antigen 19-9, carcinoembryonic antigen and human epididymis protein 4 as the most significant biomarkers, along with neutrophil ratio, thrombocytocrit and hematocrit among the blood tests, and alanine aminotransferase, calcium, indirect bilirubin, uric acid and natrium among the general chemistry tests. Moreover, the predictive analysis suggests that the machine learning models can distinguish malignant from benign patients with an accuracy of up to 91%. Because early-stage detection is generally not available, machine learning-based detection could play a significant role in cancer diagnosis.
Keywords: benign ovarian tumors; machine learning; ovarian cancer; statistical analysis; tumor marker
Year: 2022 PMID: 35893305 PMCID: PMC9394434 DOI: 10.3390/jpm12081211
Source DB: PubMed Journal: J Pers Med ISSN: 2075-4426
Figure 1. The schematic diagram of the overall workflow.
The attribute list for different subgroups of the dataset.
| Blood Routine Test | General Chemistry | Tumor Marker |
|---|---|---|
| Neutrophil ratio | Albumin | Carbohydrate antigen 72-4 |
| Thrombocytocrit | Indirect bilirubin | Alpha-fetoprotein |
| Hematocrit | Uric acid | Carbohydrate antigen 19-9 |
| Mean corpuscular hemoglobin | Natrium | Menopause |
| Lymphocyte | Total protein | Carbohydrate antigen 125 |
| Platelet distribution width | Alanine aminotransferase | Carcinoembryonic antigen |
| Mean corpuscular volume | Total bilirubin | Age |
| Platelet count | Blood urea nitrogen | Human epididymis protein 4 |
| Hemoglobin | Magnesium | |
| Eosinophil ratio | Glucose | |
| Mean platelet volume | Creatinine | |
| Basophil cell count | Phosphorus | |
| Red blood cell count | Globulin | |
| Mononuclear cell count | Gamma-glutamyl transferase | |
| Red blood cell distribution width | Alkaline phosphatase | |
| Basophil cell ratio | Kalium | |
| | Direct bilirubin | |
| | Carbon dioxide-combining power | |
| | Chlorine | |
| | Aspartate aminotransferase | |
| | Anion gap | |
Association between benign ovarian tumor and ovarian cancer patients. The results of independent sample t-test with blood samples, general biochemistry tests and tumor markers. N.B. BOT: Benign Ovarian Tumor; OC: Ovarian Cancer; SD: Standard Deviation.
| Abbreviation | Biomarker | Type | Unit | Mean ± SD (BOT) | Mean ± SD (OC) | 95% CI | p-value |
|---|---|---|---|---|---|---|---|
| MPV | Mean platelet volume | full blood | fL | | | (−0.48, 0.25) | 0.55 |
| BASO# | Basophil cell count | full blood | | | | (−0.006, 0.002) | 0.28 |
| PHOS | Phosphorus | serum | mmol/L | | | (−0.05, 0.03) | 0.67 |
| GLU | Glucose | serum | mmol/L | | | (0.18, 0.69) | <0.01 |
| CA72-4 | Carbohydrate antigen 72-4 | serum | U/mL | | | (2.18, 8.01) | <0.01 |
| K | Kalium | serum | mmol/L | | | (−0.4, −1.17) | 0.92 |
| AST | Aspartate aminotransferase | serum | U/L | | | (1.87, 5.32) | <0.01 |
| BASO% | Basophil cell ratio | full blood | % | | | (−0.15, −0.001) | 0.05 |
| Mg | Magnesium | serum | mmol/L | | | (−0.03, 0.02) | 0.78 |
| CL | Chlorine | serum | mmol/L | | | (−0.05, 1.07) | 0.6 |
| CEA | Carcinoembryonic antigen | serum | ng/mL | | | (1.55, 5.98) | <0.01 |
| EO# | Eosinophil count | full blood | | | | (−0.03, 0.003) | 0.13 |
| CA19-9 | Carbohydrate antigen 19-9 | serum | U/mL | | | (15.48, 66.01) | <0.01 |
| ALB | Albumin | serum | g/L | | | (−5.29, −3.1) | <0.01 |
| IBIL | Indirect bilirubin | serum | µmol/L | | | (−1.77, −0.57) | <0.01 |
| GGT | Gamma-glutamyl transferase | serum | U/L | | | (−0.85, 6.68) | 0.13 |
| MCH | Mean corpuscular hemoglobin | full blood | pg | | | (−1.39, −0.32) | <0.01 |
| GLO | Globulin | serum | g/L | | | (0.83, 2.68) | <0.01 |
| ALT | Alanine aminotransferase | serum | U/L | | | (−2.44, 2.23) | 0.93 |
| DBIL | Direct bilirubin | serum | µmol/L | | | (−0.74, −0.15) | <0.01 |
| RDW | Red blood cell distribution width | full blood | % | | | (−0.13, 0.62) | 0.2 |
| PDW | Platelet distribution width | full blood | % | | | (−1.46, −0.21) | <0.01 |
| CREA | Creatinine | serum | µmol/L | | | (−4.23, 0.7) | 0.16 |
| AFP | Alpha-fetoprotein | serum | ng/mL | | | (−2.28, 37.66) | 0.08 |
| HGB | Hemoglobin | full blood | g/L | | | (−9.35, −2.93) | <0.01 |
| Na | Natrium | serum | mmol/L | | | (0.22, 1.42) | <0.01 |
| HE4 | Human epididymis protein 4 | serum | pmol/L | | | (202.8, 347.34) | <0.01 |
| LYM# | Lymphocyte count | full blood | | | | (−0.4, −0.17) | <0.01 |
| CA125 | Carbohydrate antigen 125 | serum | U/mL | | | (449.57, 751.81) | <0.01 |
| BUN | Blood urea nitrogen | serum | mmol/L | | | (−0.28, 0.26) | 0.94 |
| LYM% | Lymphocyte ratio | full blood | % | | | (−8.61, −4.46) | <0.01 |
| Ca | Calcium | serum | mmol/L | | | (−0.21, −0.06) | <0.01 |
| AG | Anion gap | serum | mmol/L | | | (−0.75, 1.08) | 0.73 |
| MONO# | Mononuclear cell count | full blood | | | | (−0.03, 0.78) | <0.01 |
| PLT | Platelet count | full blood | | | | (32.06, 70.74) | <0.01 |
| NEU | Neutrophil ratio | full blood | % | | | (5.16, 9.09) | <0.01 |
| EO% | Eosinophil ratio | full blood | | | | (−0.48, −0.004) | 0.05 |
| TP | Total protein | serum | g/L | | | (−4.29, −1.29) | <0.01 |
| UA | Uric acid | serum | | | | (−9.46, 19.45) | 0.5 |
| RBC | Red blood cell count | full blood | | | | (−0.19, 0.005) | 0.06 |
| PCT | Thrombocytocrit | full blood | L/L | | | (0.02, 0.06) | <0.01 |
| CO2CP | Carbon dioxide-combining power | serum | mmol/L | | | (−0.05, 1.07) | 0.08 |
| TBIL | Total bilirubin | serum | | | | (−2.46, −0.77) | <0.01 |
| HCT | Hematocrit | full blood | L/L | | | (−0.02, −0.002) | 0.02 |
| MONO% | Monocyte ratio | full blood | % | | | (−0.03, 0.78) | 0.07 |
| MCV | Mean corpuscular volume | full blood | fL | | | (−1.82, 0.72) | 0.4 |
| ALP | Alkaline phosphatase | serum | U/L | | | (9.56, 27.59) | <0.01 |
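The per-biomarker comparison reported in the table above (an independent-sample t-test with a 95% CI), together with the log fold change mentioned in the abstract, can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `compare_groups` is hypothetical, the difference is assumed to be OC minus BOT (which matches the positive intervals of the tumor markers), the log fold change is assumed to be the base-2 log of the ratio of group means, and the confidence interval uses a normal quantile as a large-sample stand-in for the exact t quantile. The input values are made-up numbers, not patient data.

```python
from math import log2, sqrt
from statistics import NormalDist, mean, variance


def compare_groups(bot, oc, level=0.95):
    """Pooled-variance two-sample comparison of one biomarker (sketch)."""
    n1, n2 = len(bot), len(oc)
    m1, m2 = mean(bot), mean(oc)
    # Pooled sample variance, as in the equal-variance Student's t-test.
    pooled_var = ((n1 - 1) * variance(bot) + (n2 - 1) * variance(oc)) / (n1 + n2 - 2)
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = m2 - m1  # assumed direction: OC minus BOT
    # Assumption: normal quantile approximates the t quantile for large samples.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return {
        "diff": diff,
        "t": diff / se,
        "df": n1 + n2 - 2,
        "ci": (diff - z * se, diff + z * se),
        "log2_fc": log2(m2 / m1),  # assumes positive group means
    }


# Hypothetical values for a single marker in each group.
result = compare_groups(bot=[1.0, 2.0, 3.0], oc=[2.0, 4.0, 6.0])
print(result)
```

A positive `diff` and `log2_fc` would indicate a higher mean in the OC group; for small samples the exact t quantile should replace the normal one.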
Figure 2. The analysis results for the blood samples: (A) the feature importance of blood samples calculated by ML algorithms according to coefficient values after model training; (B) the association between benign ovarian tumor and ovarian cancer patients from the independent-sample t-test, where lighter and larger bubbles represent stronger association; (C) box plots of the five most strongly associated blood samples.
Figure 3. The analysis results for the general chemistry tests: (A) the feature importance of general chemistry tests calculated by ML algorithms according to coefficient values after model training; (B) the association between benign ovarian tumor and ovarian cancer patients from the independent-sample t-test, where lighter and larger bubbles represent stronger association; (C) box plots of the five most strongly associated general chemistry tests.
Figure 4. The analysis results for the cancer markers: (A) the feature importance of cancer markers calculated by ML algorithms according to coefficient values after model training; (B) the association between benign ovarian tumor and ovarian cancer patients from the independent-sample t-test, where lighter and larger bubbles represent stronger association; (C) box plots of the four most strongly associated cancer markers together with patient age.
Accuracy and evaluation metric scores for each of the data groups.
| Dataset | Model | Accuracy | Precision | Recall | F1-Score | AUC | Log Loss |
|---|---|---|---|---|---|---|---|
| Blood Samples | RF | 0.81 | 0.76 | 0.92 | 0.82 | 0.78 | 7.6 |
| | SVM | 0.81 | 0.77 | 0.89 | 0.82 | 0.78 | 7.8 |
| | DT | 0.81 | 0.83 | 0.78 | 0.81 | 0.81 | 6.71 |
| | XGBoost | 0.81 | 0.78 | 0.86 | 0.82 | 0.77 | 7.6 |
| | LR | 0.80 | 0.79 | 0.81 | 0.80 | 0.78 | 7.6 |
| | GBM | 0.82 | 0.82 | 0.84 | 0.83 | 0.82 | 6.23 |
| | LGBM | 0.82 | 0.80 | 0.86 | 0.83 | 0.82 | 6.2 |
| General Chemistry | RF | 0.81 | 0.80 | 0.83 | 0.82 | 0.80 | 6.71 |
| | SVM | 0.80 | 0.76 | 0.90 | 0.81 | 0.79 | 7.11 |
| | DT | 0.68 | 0.70 | 0.68 | 0.69 | 0.68 | 11.03 |
| | XGBoost | 0.76 | 0.76 | 0.78 | 0.78 | 0.77 | 8.15 |
| | LR | 0.80 | 0.75 | 0.89 | 0.82 | 0.79 | 7.11 |
| | GBM | 0.75 | 0.76 | 0.76 | 0.76 | 0.75 | 8.63 |
| | LGBM | 0.75 | 0.87 | 0.82 | 0.84 | 0.76 | 7.11 |
| OC Marker | RF | 0.86 | 0.80 | 0.97 | 0.87 | 0.86 | 4.79 |
| | SVM | 0.85 | 0.80 | 0.95 | 0.86 | 0.84 | 5.27 |
| | DT | 0.85 | 0.81 | 0.92 | 0.86 | 0.85 | 5.2 |
| | XGBoost | 0.86 | 0.80 | 0.97 | 0.86 | 0.86 | 4.79 |
| | LR | 0.83 | 0.80 | 0.92 | 0.85 | 0.83 | 5.7 |
| | GBM | 0.85 | 0.80 | 0.95 | 0.86 | 0.84 | 5.27 |
| | LGBM | 0.85 | 0.80 | 0.95 | 0.86 | 0.84 | 5.27 |
| Combined | RF | 0.88 | 0.83 | 0.95 | 0.89 | 0.87 | 4.31 |
| | SVM | 0.81 | 0.77 | 0.89 | 0.83 | 0.80 | 6.71 |
| | DT | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 7.6 |
| | XGBoost | 0.86 | 0.82 | 0.95 | 0.86 | 0.86 | 4.79 |
| | LR | 0.82 | 0.79 | 0.89 | 0.84 | 0.82 | 6.23 |
| | GBM | 0.88 | 0.83 | 0.95 | 0.89 | 0.87 | 4.31 |
| | LGBM | 0.88 | 0.85 | 0.92 | 0.88 | 0.87 | 4.31 |
Accuracy and evaluation metric scores for each of the data groups for the dataset of 106 patients.
| Dataset | Model | Accuracy | Precision | Recall | F1-Score | AUC | Log Loss |
|---|---|---|---|---|---|---|---|
| Blood Samples | RF | 0.86 | 0.82 | 1 | 0.9 | 0.81 | 4.71 |
| | SVM | 0.81 | 0.77 | 1 | 0.88 | 0.75 | 6.28 |
| | DT | 0.77 | 0.8 | 0.86 | 0.83 | 0.74 | 7.85 |
| | XGBoost | 0.77 | 0.8 | 0.86 | 0.83 | 0.74 | 7.85 |
| | LR | 0.82 | 0.78 | 1 | 0.88 | 0.75 | 6.28 |
| | GBM | 0.73 | 0.72 | 0.73 | 0.72 | 0.68 | 9.42 |
| | LGBM | 0.64 | 0.64 | 1 | 0.78 | 0.5 | 12.56 |
| General Chemistry | RF | 0.77 | 0.76 | 0.93 | 0.84 | 0.71 | 7.85 |
| | SVM | 0.77 | 0.76 | 0.93 | 0.84 | 0.71 | 7.85 |
| | DT | 0.59 | 0.67 | 0.71 | 0.69 | 0.54 | 14.13 |
| | XGBoost | 0.73 | 0.75 | 0.86 | 0.8 | 0.68 | 9.42 |
| | LR | 0.77 | 0.76 | 0.93 | 0.84 | 0.71 | 7.85 |
| | GBM | 0.73 | 0.72 | 0.73 | 0.72 | 0.68 | 9.42 |
| | LGBM | 0.64 | 0.64 | 1 | 0.78 | 0.5 | 12.56 |
| OC Marker | RF | 0.91 | 1 | 0.86 | 0.92 | 0.93 | 3.14 |
| | SVM | 0.82 | 0.92 | 0.79 | 0.85 | 0.83 | 6.28 |
| | DT | 0.59 | 1 | 0.36 | 0.53 | 0.68 | 14.13 |
| | XGBoost | 0.68 | 1 | 0.5 | 0.67 | 0.75 | 10.99 |
| | LR | 0.82 | 0.92 | 0.79 | 0.85 | 0.83 | 6.28 |
| | GBM | 0.81 | 0.84 | 0.82 | 0.82 | 0.83 | 6.28 |
| | LGBM | 0.64 | 0.64 | 1 | 0.78 | 0.5 | 12.56 |
| Combined | RF | 0.86 | 0.87 | 0.93 | 0.9 | 0.84 | 4.71 |
| | SVM | 0.64 | 0.8 | 0.57 | 0.67 | 0.66 | 12.56 |
| | DT | 0.68 | 1 | 0.5 | 0.67 | 0.75 | 10.99 |
| | XGBoost | 0.86 | 1 | 0.79 | 0.88 | 0.89 | 4.71 |
| | LR | 0.86 | 0.82 | 1 | 0.9 | 0.81 | 4.71 |
| | GBM | 0.86 | 0.87 | 0.86 | 0.87 | 0.87 | 4.71 |
| | LGBM | 0.64 | 0.64 | 1 | 0.78 | 0.5 | 12.56 |
A comparison between proposed methods and previous methods.
| References | Dataset | Classifiers | Accuracy | Sensitivity | AUC |
|---|---|---|---|---|---|
| [ | Clinical data (349 patients with 49 features) | DT | 0.87 | 0.82 | - |
| [ | Clinical data (202 patients with 32 features) | XGBoost | 0.80 | - | - |
| [ | Image data (348 patients) | SVM, ELM | 0.87 | 0.87 | 0.89 |
| Proposed | Clinical data (349 patients with 49 features) | RF, GBM, LGBM | 0.88 | 0.97 | 0.87 |
| Proposed | Clinical data (106 patients with OC marker features) | RF | 0.91 | 0.86 | 0.93 |