Mi-Mi Liu, Li Wen, Yong-Jia Liu, Qiao Cai, Li-Ting Li, Yong-Ming Cai.
Abstract
BACKGROUND: Although gastric cancer is a malignancy with high morbidity and mortality in China, the survival rate of patients with early gastric cancer (EGC) is high after surgical resection. Strengthening diagnosis and screening is therefore key to improving the survival and quality of life of patients with EGC. This study applied data mining methods to improve screening for the risk of EGC on the basis of noninvasive factors, and to show which factors most influence that risk.
Keywords: C5.0 decision tree; Early gastric cancer; Logistic regression; Multilayer perceptron; SMOTE; Stomach neoplasms; Tree augmented naive bayesian network
Year: 2018 PMID: 30526601 PMCID: PMC6284275 DOI: 10.1186/s12911-018-0689-4
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
The demographic characteristics of the participants
| | Low risk of EGC (n = 487) | High risk of EGC (n = 131) |
|---|---|---|
| Sex | ||
| Male | 237 (48.67) | 65 (49.62) |
| Female | 250 (51.33) | 66 (50.38) |
| Age (year)a | 51.36 (11.49) | 53.37 (10.75) |
| Weight (kg)a | 59.43 (9.54) | 58.84 (9.77) |
| Height (cm)a,b | 161.99 (7.57) | 161.68 (7.31) |
| BMIa | 22.61 (3.00) | 22.43 (2.81) |
| Education levels | ||
| Illiterate | 10 (2.05) | 1 (0.76) |
| Primary school | 97 (19.92) | 34 (25.95) |
| Junior school | 156 (32.03) | 47 (35.88) |
| Senior school | 116 (23.82) | 22 (16.79) |
| College | 108 (22.18) | 27 (20.62) |
| Occupations | ||
| Cadre | 162 (33.26) | 44 (33.59) |
| Worker | 183 (37.58) | 62 (47.33) |
| Peasant | 142 (29.16) | 25 (19.08) |
| Languages | ||
| Mandarin | 71 (14.58) | 20 (15.27) |
| Cantonese | 154 (31.62) | 60 (45.80) |
| Hakka | 161 (33.06) | 34 (25.95) |
| Teochew | 101 (20.74) | 17 (12.98) |
| Residences | ||
| City | 217 (44.56) | 57 (43.51) |
| Townlet | 142 (29.16) | 30 (22.90) |
| Village | 128 (26.28) | 44 (33.59) |
a Data are presented as mean (SD); all other values are presented as number (percentage)
b Items were eliminated because of weak correlation with the risk of EGC
The eating habits of the participants
| | Low risk of EGC (n = 487) | High risk of EGC (n = 131) |
|---|---|---|
| High salt intake | ||
| Yes | 137 (28.13) | 39 (29.77) |
| No | 350 (71.87) | 92 (70.23) |
| Pickled foods | ||
| Often | 57 (11.70) | 16 (12.21) |
| Seldom | 430 (88.30) | 115 (87.79) |
| Fried/smoke foodsa | ||
| Often | 43 (8.83) | 6 (4.58) |
| Seldom | 444 (91.17) | 125 (95.42) |
| Fruit | ||
| Often | 240 (49.28) | 75 (57.25) |
| Seldom | 247 (50.72) | 56 (42.75) |
| Vegetablea | ||
| Often | 456 (93.63) | 128 (97.71) |
| Seldom | 31 (6.37) | 3 (2.29) |
| Tea | ||
| Often | 168 (34.50) | 37 (28.24) |
| Seldom | 319 (65.50) | 94 (71.76) |
| Smoking | ||
| Yes | 149 (30.60) | 43 (32.82) |
| No | 338 (69.40) | 88 (67.18) |
| Drinking | ||
| Yes | 79 (16.22) | 21 (16.03) |
| No | 408 (83.78) | 110 (83.97) |
| Drinking-water source | ||
| Tap water supply | 422 (86.65) | 124 (94.66) |
| Well water | 50 (10.27) | 7 (5.34) |
| River water | 15 (3.08) | 0 (0.00) |
| Drinking hot water | ||
| Yes | 204 (41.89) | 68 (51.91) |
| No | 283 (58.11) | 63 (48.09) |
| Speed of eating | ||
| Fast | 306 (62.83) | 70 (53.44) |
| Slow | 181 (37.17) | 61 (46.56) |
All data are presented as a number (percentage)
a Items were eliminated because of weak correlation with the risk of EGC
The main symptoms of the participants during the preceding 3 months
| | Low risk of EGC (n = 487) | High risk of EGC (n = 131) |
|---|---|---|
| Abdominal pain | ||
| Yes | 228 (46.82) | 64 (48.85) |
| No | 259 (53.18) | 67 (51.15) |
| Abdominal distension | ||
| Yes | 220 (45.17) | 66 (50.38) |
| No | 267 (54.83) | 65 (49.62) |
| Acid reflux | ||
| Yes | 143 (29.36) | 48 (36.64) |
| No | 344 (70.64) | 83 (63.36) |
| Belching | ||
| Yes | 125 (25.67) | 40 (30.53) |
| No | 362 (74.33) | 91 (69.47) |
| Early satiety | ||
| Yes | 57 (11.70) | 19 (14.50) |
| No | 430 (88.30) | 112 (85.50) |
| Postprandial distress | ||
| Yes | 91 (18.69) | 31 (23.66) |
| No | 396 (81.31) | 100 (76.34) |
| Heartburn | ||
| Yes | 61 (12.53) | 22 (16.79) |
| No | 426 (87.47) | 109 (83.21) |
| Melaenaa | ||
| Yes | 36 (7.39) | 9 (6.87) |
| No | 451 (92.61) | 122 (93.13) |
| Emaciationa | ||
| Yes | 37 (7.60) | 7 (5.34) |
| No | 450 (92.40) | 124 (94.66) |
| Poor appetitea | ||
| Yes | 39 (8.01) | 9 (6.87) |
| No | 448 (91.99) | 122 (93.13) |
| Dysphagiaa | ||
| Yes | 6 (1.23) | 3 (2.29) |
| No | 481 (98.77) | 128 (97.71) |
| Nauseaa | ||
| Yes | 42 (8.62) | 14 (10.69) |
| No | 445 (91.38) | 117 (89.31) |
| Poststernal discomforta | ||
| Yes | 44 (9.03) | 16 (12.21) |
| No | 443 (90.97) | 115 (87.79) |
| No obvious symptom | ||
| Yes | 56 (11.50) | 16 (12.21) |
| No | 431 (88.50) | 115 (87.79) |
All data are presented as a number (percentage)
a Items were eliminated because of weak correlation with the risk of EGC
The family and personal disease histories of the participants
| | Low risk of EGC (n = 487) | High risk of EGC (n = 131) |
|---|---|---|
| Esophageal cancera | ||
| Yes | 14 (2.87) | 2 (1.53) |
| No | 473 (97.13) | 129 (98.47) |
| Gastric cancera | ||
| Yes | 25 (5.13) | 9 (6.87) |
| No | 462 (94.87) | 122 (93.13) |
| Colorectal cancera | ||
| Yes | 8 (1.64) | 3 (2.29) |
| No | 479 (98.36) | 128 (97.71) |
| Diabetes mellitusa | ||
| Yes | 30 (6.16) | 14 (10.69) |
| No | 457 (93.84) | 117 (89.31) |
| Hypertension | ||
| Yes | 78 (16.02) | 19 (14.50) |
| No | 409 (83.98) | 112 (85.50) |
| Hyperlipidemia | ||
| Yes | 68 (13.96) | 27 (20.61) |
| No | 419 (86.04) | 104 (79.39) |
| HP infection | ||
| Negative | 23 (4.72) | 12 (9.16) |
| Positive | 29 (5.95) | 17 (12.98) |
| Unidentified | 435 (89.32) | 102 (77.86) |
| Gastroscopy | ||
| Yes | 96 (19.71) | 25 (19.08) |
| No | 391 (80.29) | 106 (80.92) |
| Gastric ulcera | ||
| Yes | 28 (5.75) | 10 (7.63) |
| No | 459 (94.25) | 121 (92.37) |
All data are presented as a number (percentage)
a Items were eliminated because of weak correlation with the risk of EGC
The serological examinations of the participants
| | Low risk of EGC (n = 487) | High risk of EGC (n = 131) |
|---|---|---|
| Pepsinogen I (μg/L)a | 139.18 (94.03) | 140.32 (91.61) |
| Pepsinogen II (μg/L)a | 16.68 (27.80) | 17.26 (23.95) |
| Gastrin 17 (pmol/L)a | 8.04 (13.72) | 8.67 (16.18) |
| Pepsinogen I/IIa | 12.74 (6.19) | 12.35 (6.55) |
| HP antibody | ||
| Negative | 315 (64.68) | 72 (54.96) |
| Weakly positive | 55 (11.29) | 19 (14.50) |
| Positive | 117 (24.02) | 40 (30.53) |
a Data are presented as mean (SD); all other values are presented as number (percentage)
The confusion matrix, accuracy and AUC of the four models on the testing set
| Model | Actual risk | Predicted L | Predicted H | Accuracy (%) | AUC |
|---|---|---|---|---|---|
| C5.0 DT | L | 129 | 14 | 77.84 | 0.66 |
| | H | 25 | 8 | | |
| TAN | L | 127 | 16 | 77.27 | 0.65 |
| | H | 24 | 9 | | |
| MLP | L | 122 | 21 | 77.84 | 0.74 |
| | H | 18 | 15 | | |
| LR | L | 120 | 23 | 73.30 | 0.62 |
| | H | 24 | 9 | | |
The confusion matrix shows the number of testing-set cases for each combination of actual and predicted risk of EGC. The rows denote the actual risk and the columns denote the predicted risk: the rows of every model sum to the same 143 low-risk and 33 high-risk cases, whereas the column totals differ between models. L and H stand for low risk of EGC and high risk of EGC, respectively
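The reported accuracies can be reproduced from the confusion-matrix counts. The sketch below is illustrative only: it assumes that each matrix is laid out with the actual class in the rows (the row totals, 143 and 33, are identical for all four models, which suggests this orientation) and the predicted class in the columns. The sensitivity and specificity values are derived quantities that the table does not report, and the helper name `summarize` is hypothetical.

```python
# Counts copied from the table. Assumed layout: rows = actual (L, H),
# columns = predicted (L, H). High risk (H) is treated as the positive class.
matrices = {
    "C5.0 DT": [[129, 14], [25, 8]],
    "TAN": [[127, 16], [24, 9]],
    "MLP": [[122, 21], [18, 15]],
    "LR": [[120, 23], [24, 9]],
}


def summarize(cm):
    """Return accuracy, sensitivity and specificity (all in percent)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": 100 * (tn + tp) / total,
        "sensitivity": 100 * tp / (tp + fn),  # share of actual H predicted as H
        "specificity": 100 * tn / (tn + fp),  # share of actual L predicted as L
    }


for name, cm in matrices.items():
    s = summarize(cm)
    print(
        f"{name}: accuracy={s['accuracy']:.2f}%, "
        f"sensitivity={s['sensitivity']:.2f}%, "
        f"specificity={s['specificity']:.2f}%"
    )
```

Accuracy does not depend on the orientation assumption, because it uses only the diagonal counts and the total; sensitivity and specificity do.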
Important independent variables for the risk of EGC
| Variables | C5.0 DT | TAN | MLP | LR | Total |
|---|---|---|---|---|---|
| Occupations | 0.03 | 0.10 | 0.04 | 0.04 | 0.21 |
| HP infection | 0.03 | 0.05 | 0.04 | 0.09 | 0.21 |
| HP antibody | 0.03 | 0.02 | 0.04 | 0.11 | 0.20 |
| Weight | 0.03 | 0.08 | 0.07 | 0.02 | 0.20 |
| Drinking-water source | 0.03 | 0.04 | 0.03 | 0.06 | 0.16 |
| Age | 0.03 | 0.03 | 0.06 | 0.03 | 0.15 |
| Pepsinogen I | 0.03 | 0.02 | 0.08 | 0.02 | 0.15 |
| Gastrin 17 | 0.04 | 0.02 | 0.07 | 0.02 | 0.15 |
| Education levels | 0.03 | 0.02 | 0.03 | 0.05 | 0.13 |
| Residences | 0.03 | 0.04 | 0.02 | 0.04 | 0.13 |
| BMI | 0.03 | 0.03 | 0.04 | 0.02 | 0.12 |
| Pepsinogen I/II | 0.03 | 0.02 | 0.05 | 0.02 | 0.12 |
| Languages | 0.03 | 0.02 | 0.03 | 0.04 | 0.12 |
| Tea | 0.03 | 0.06 | 0.01 | 0.02 | 0.12 |
| Drinking hot water | 0.03 | 0.03 | 0.02 | 0.04 | 0.12 |
| Gastroscopy | 0.03 | 0.03 | 0.03 | 0.03 | 0.12 |
| High salt intake | 0.03 | 0.05 | 0.01 | 0.02 | 0.11 |
| Abdominal pain | 0.03 | 0.03 | 0.02 | 0.03 | 0.11 |
| Hypertension | 0.03 | 0.03 | 0.03 | 0.02 | 0.11 |
| Hyperlipidemia | 0.03 | 0.03 | 0.03 | 0.02 | 0.11 |
| Smoking | 0.03 | 0.03 | 0.02 | 0.02 | 0.10 |
| Heartburn | 0.03 | 0.02 | 0.03 | 0.02 | 0.10 |
| Pepsinogen II | 0.03 | 0.02 | 0.03 | 0.02 | 0.10 |
| Fruit | 0.03 | 0.02 | 0.02 | 0.02 | 0.09 |
| Acid reflux | 0.03 | 0.02 | 0.02 | 0.02 | 0.09 |
| Postprandial distress | 0.02 | 0.02 | 0.03 | 0.02 | 0.09 |
| Speed of eating | 0.03 | 0.03 | 0.02 | 0.01 | 0.09 |
| Abdominal distension | 0.03 | 0.01 | 0.02 | 0.03 | 0.09 |
| Drinking | 0.03 | 0.02 | 0.01 | 0.02 | 0.08 |
| Sex | 0.03 | 0.02 | 0.01 | 0.01 | 0.07 |
| Pickled foods | 0.03 | 0.01 | 0.01 | 0.02 | 0.07 |
| Early satiety | 0.03 | 0.01 | 0.01 | 0.02 | 0.07 |
| Belching | 0.03 | 0.01 | 0.01 | 0.01 | 0.06 |
| No obvious symptom | 0.01 | 0.01 | 0.01 | 0.02 | 0.05 |
Within each model, the importances of the 34 independent variables sum to 1. The total importances of the 34 independent variables therefore sum to 4
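The Total column is the row-wise sum of the four per-model importances. A minimal sketch of that arithmetic, using only the first five rows of the table (the dictionary layout and variable names are assumptions made for illustration):

```python
# Per-model importances in the order (C5.0 DT, TAN, MLP, LR),
# copied from the first five rows of the table.
importance = {
    "Occupations": (0.03, 0.10, 0.04, 0.04),
    "HP infection": (0.03, 0.05, 0.04, 0.09),
    "HP antibody": (0.03, 0.02, 0.04, 0.11),
    "Weight": (0.03, 0.08, 0.07, 0.02),
    "Drinking-water source": (0.03, 0.04, 0.03, 0.06),
}

# Total importance = sum over the four models, rounded to two decimals
# to match the precision of the table.
totals = {name: round(sum(values), 2) for name, values in importance.items()}

# Rank variables from most to least important overall.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
for name, total in ranked:
    print(f"{name}: {total:.2f}")
```

Because the published values are rounded to two decimals, a column recomputed from them may differ slightly from 1 even though the unrounded importances sum to 1.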
Fig. 1 The gains charts of the four models. In each chart, the top polygonal line is the ideal curve, and the irregular curve lying between the ideal curve and the diagonal is the gains curve of the model. For a good model, the gains curve rises steeply toward 100% and then levels off. A model with no predictive power follows the diagonal from lower left to upper right. The gains curves of the three data mining models (panels a, b and c) bowed upward and lay close to the ideal curves, especially that of the MLP model, whereas the gains curve of the LR model (panel d) rose slowly and stayed far from the ideal curve
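To make the construction of a gains curve concrete, the sketch below ranks cases by predicted score and tracks the cumulative share of high-risk cases captured. It is a generic illustration, not the software used in the study; the function name `gains_curve` and the toy labels and scores are hypothetical.

```python
def gains_curve(y_true, scores):
    """Return (fraction of cases screened, fraction of positives captured) points.

    Cases are ranked from highest to lowest predicted score, as in a
    cumulative gains chart. y_true holds 1 for high risk and 0 for low risk.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(y_true)
    points = [(0.0, 0.0)]
    captured = 0
    for rank, i in enumerate(order, start=1):
        captured += y_true[i]
        points.append((rank / len(order), captured / total_pos))
    return points


# Hypothetical toy data: 8 cases, 3 of them high risk.
y_true = [1, 0, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]

for x, y in gains_curve(y_true, scores):
    print(f"{x:.3f} of cases -> {y:.3f} of high-risk cases")
```

In this toy example an ideal model would capture all three high-risk cases within the first 3/8 of the ranked cases, while a model with no predictive power would capture them in proportion to the fraction screened, which is the diagonal.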