Peyman Almasinejad, Amin Golabpour, Mohammad Reza Mollakhalili Meybodi, Kamal Mirzaie, Ahmad Khosravi
Abstract
Missing data occur in all research, especially in medical studies. Data are missing when part of the research data has not been reported. This makes the sample unrepresentative of the population and leads to misguided conclusions; the extent of the missing data determines how badly the conclusions are distorted. All methods of parameter estimation and prediction models are based on the assumption that the data are complete. Extensive missing data result in false predictions and increased bias. In the present study, a novel method is proposed for the imputation of missing medical data. The method determines which algorithm is suitable for imputing the missing data, using a multiobjective particle swarm optimization algorithm. The algorithm imputes the missing data so that, when a prediction model is applied to the data, both specificity and sensitivity are optimized. The proposed model was evaluated using real data on gastric cancer and acute T-cell leukemia (ATLL). First, the proposed model was used to impute the missing data. Then, the missing data were imputed using deletion, average, expectation maximization, MICE, and missForest methods. Finally, the prediction model was applied to both imputed datasets. The accuracy of the prediction model for the first and the second imputation methods was 0.5 and 16.5, respectively. The novel imputation method was more accurate than similar algorithms such as expectation maximization and MICE.
Year: 2021 PMID: 34659677 PMCID: PMC8519720 DOI: 10.1155/2021/1203726
Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 3.822
Figure 1. The steps of the proposed model.
Figure 2. The process of data classification and missing data simulation.
The continuous and discrete algorithms that were used in the proposed model.
| No. | Algorithm name |
|---|---|
| Discrete algorithms | |
| 1 | Support vector machines-linear |
| 2 | Support vector machines-quadratic |
| 3 | Support vector machines-polynomial |
| 4 | Support vector machines-RBF = 5 |
| 5 | Support vector machines-RBF = 2 |
| 6 | Support vector machines-RBF = 1 |
| 7 | Support vector machines-RBF = 0.5 |
| 8 | Support vector machines-RBF = 0.2 |
| 9 | Support vector machines-RBF = 0.1 |
| 10 | 1-NN |
| 11 | 3-NN |
| 12 | 5-NN |
| 13 | 7-NN |
| 14 | 9-NN |
| 15 | Decision tree (C4.5) |
| 16 | Artificial neural network feed forward |
| 17 | Logistic regression |
| 18 | Naïve Bayesian |
| Continuous algorithms | |
| 1 | Support vector regression (SVR) |
| 2 | 1-NN |
| 3 | 3-NN |
| 4 | 5-NN |
| 5 | 7-NN |
| 6 | 9-NN |
| 7 | Continuous decision tree (CART) |
| 8 | Artificial neural network feed forward |
| 9 | Multiple regression |
Figure 3. The structure of the proposed particle.
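
One way to picture a particle that selects an imputation algorithm for each variable is sketched below. The encoding is an assumption made for illustration and is not taken from Figure 3: each variable with missing values gets one position value in [0, 1], which is mapped to an index into a candidate list. The names `decode_particle`, `DISCRETE`, and `CONTINUOUS` are hypothetical, and the candidate lists are abbreviated from the algorithm table.

```python
# Candidate lists abbreviated from the algorithm table; the encoding is hypothetical.
DISCRETE = ["SVM-linear", "1-NN", "Decision tree (C4.5)",
            "Logistic regression", "Naive Bayesian"]
CONTINUOUS = ["SVR", "1-NN", "CART", "Multiple regression"]


def decode_particle(position, variable_kinds):
    """Map each position value in [0, 1] to one candidate algorithm name."""
    chosen = []
    for value, kind in zip(position, variable_kinds):
        candidates = DISCRETE if kind == "discrete" else CONTINUOUS
        # Clamp so that a value of exactly 1.0 still yields a valid index.
        index = min(int(value * len(candidates)), len(candidates) - 1)
        chosen.append(candidates[index])
    return chosen


# Three variables with missing values: two discrete, one continuous.
choices = decode_particle([0.05, 0.95, 0.5],
                          ["discrete", "discrete", "continuous"])
print(choices)
```

Under this assumed encoding, a swarm optimizer can move particles in a continuous space while each particle still decodes to a concrete choice of imputation algorithm per variable.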
Figure 4. The proposed fitness function.
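
Because the abstract states that sensitivity and specificity are optimized together, candidate solutions can be compared by Pareto dominance rather than by a single score. The sketch below shows that comparison only; it is not the fitness function from Figure 4. The function names and the (sensitivity, specificity) pairs are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [
        name for name, score in candidates.items()
        if not any(dominates(other, score) for other in candidates.values())
    ]


# Hypothetical (sensitivity, specificity) pairs for three candidate imputations.
candidates = {
    "A": (0.75, 0.50),
    "B": (0.50, 0.75),
    "C": (0.50, 0.50),
}
front = pareto_front(candidates)
print(sorted(front))
```

Here `C` is dominated by both `A` and `B`, while `A` and `B` trade one objective against the other, so both remain on the front.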
The characteristics of gastric cancer variables.
| ID | Variable name | Variable type | Notes |
|---|---|---|---|
| 1 | Sex | Nominal | 61 males and 19 females |
| 2 | Birth year | Interval | Minimum = 1305, maximum = 1346 |
| 3 | Education | Ordinal | (1) Illiterate, (2) underdiploma |
| 4 | Race | Ordinal | (1) Fars, (2) Kurd, (3) Turk |
| 5 | PMH | Ordinal | (1) Hypertension (HTN), (2) coronary artery disease (CAD), (3) diabetes mellitus (DM), (4) DM + HTN, (5) DM + HTN + CAD, (6) HTN + CAD |
| 6 | Age at diagnosis | Interval | Minimum = 46, maximum = 87 |
| 7 | FH of gastric cancer | Ordinal | Family history of gastric cancer: (1) first-degree relative (FDR), (2) second-degree relatives (SDR) |
| 8 | Age at dx of family GC | Interval | Family's age at diagnosis: minimum = 45, maximum = 82 |
| 9 | Hx of other GI cancer | Ordinal | History of other GI cancer |
| 10 | Types of other GI cancer | Ordinal | (1) Small intestine, (2) liver, (3) esophagus, (4) large intestine |
| 11 | Hx of non-GI cancer | Ordinal | (1) First-degree relative |
| 12 | Treatment | Ordinal | (1) Surgery, (2) surgery + chemo + radio, (3) chemo |
| 13 | Cause of death | Ordinal | (1) cancer, (2) MI, (3) PTE |
| 14 | Pathology | Ordinal | (1) Adenocarcinoma, (2) inflammatory tumour, (3) mucinous adenocarcinoma, (4) neuroendocrine carcinoma, (5) signet ring cell carcinoma, (6) GIST tumour, (7) undifferentiated carcinoma |
| 15 | Addiction | Nominal | 17 subjects: addicted, 63 subjects: non-addicted |
| 16 | Survival | Nominal | 33 and 67 subjects passed away after one and two years, respectively |
The percent of missing data in independent variables of gastric cancer data.
| ID | Variable name | Missing (n) | Missing (%) | Valid (n) |
|---|---|---|---|---|
| 1 | Hx of non-GI cancer | 71 | 88.75 | 9 |
| 2 | Type of other GI cancer | 64 | 80.00 | 16 |
| 3 | Hx of other GI cancer | 64 | 80.00 | 16 |
| 4 | Age at Dx of family GC | 58 | 72.50 | 22 |
| 5 | FH of gastric cancer | 57 | 71.25 | 23 |
| 6 | PMH | 35 | 43.75 | 45 |
| 7 | Age at diagnosis | 4 | 5.00 | 76 |
| 8 | Birth year | 1 | 1.25 | 79 |
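
The percentages in this table follow from the missing and valid counts, which sum to 80 in every row. As a quick check, the sketch below recomputes three rows; the helper name `missing_percent` is hypothetical.

```python
def missing_percent(missing, valid):
    """Percentage of missing values among all records for one variable."""
    total = missing + valid
    return round(100.0 * missing / total, 2)


# (missing, valid) counts copied from three rows of the table.
rows = {
    "Hx of non-GI cancer": (71, 9),
    "PMH": (35, 45),
    "Birth year": (1, 79),
}
percents = {name: missing_percent(m, v) for name, (m, v) in rows.items()}
print(percents)
```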
Figure 5. The structure of model design for the prediction of gastric cancer survival time.
Figure 6. The structure of model design for the prediction of gastric cancer survival time.
The number and percent of missing data of independent variables.
| ID | Variable name | Missing (n) | Missing (%) | Valid (n) |
|---|---|---|---|---|
| 1 | FBS | 12 | 48.0 | 13 |
| 2 | Rb | 8 | 32.0 | 17 |
| 3 | P53 | 8 | 32.0 | 17 |
| 4 | CDK4 | 8 | 32.0 | 17 |
| 5 | CDK2 | 8 | 32.0 | 17 |
| 6 | Creat | 5 | 20.0 | 20 |
| 7 | Urea | 5 | 20.0 | 20 |
| 8 | CA | 5 | 20.0 | 20 |
| 9 | MCV | 1 | 4.0 | 24 |
| 10 | MCHC | 1 | 4.0 | 24 |
| 11 | MCH | 1 | 4.0 | 24 |
| 12 | RBC | 1 | 4.0 | 24 |
The comparison of the proposed model of imputation with EM algorithm for ATLL patients' data.
| Algorithm name | Sensitivity (%) | Specificity (%) | Accuracy (%) | PPV+ (%) | PPV− (%) | |
|---|---|---|---|---|---|---|
| Delete missing | 47.00 | 40.60 | 45.95 | 47.1 | 38.95 | 45.47 |
| Mean algorithm | 47.37 | 51.20 | 53.77 | 44.77 | 49.83 | 43.49 |
| Expectation maximization | 62.57 | 69.25 | 70.23 | 64.44 | 65.35 | 61.31 |
| MICE algorithm | 46.16 | 49.28 | 53.37 | 45.92 | 46.88 | 43.09 |
| missForest algorithm | 58.30 | 62.65 | 64.65 | 58.15 | 61.00 | 56.09 |
| Proposed algorithm | 86.15 | 82.4 | 86.75 | 83.57 | 84.67 | 83.50 |
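
For reference, the kinds of metrics reported in this table can all be derived from the four confusion-matrix counts of a binary classifier. The sketch below assumes that PPV+ and PPV− denote positive and negative predictive value, which the source does not state. The counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return five metrics as percentages from confusion-matrix counts."""
    def pct(numerator, denominator):
        # Guard against an empty class or an empty set of predictions.
        return 100.0 * numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": pct(tp, tp + fn),  # true positives among actual positives
        "specificity": pct(tn, tn + fp),  # true negatives among actual negatives
        "accuracy": pct(tp + tn, tp + fn + tn + fp),
        "ppv": pct(tp, tp + fp),          # positive predictive value
        "npv": pct(tn, tn + fn),          # negative predictive value
    }


# Hypothetical counts for 25 patients.
metrics = classification_metrics(tp=8, fn=2, tn=12, fp=3)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```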