Siqing Jiang1,2, Haojun Gao1,2, Jiajin He1, Jiaqi Shi1,2, Yuling Tong3, Jian Wu1,2.
Abstract
Gastric cancer remains an enormous threat to human health, and a clear diagnosis followed by timely treatment of gastrointestinal tumors is extremely important. Traditional methods for diagnosing gastric cancer (endoscopy, surgery, and pathological tissue extraction) are usually invasive, expensive, and time-consuming. Machine learning methods are fast and low-cost, and applying them to the diagnosis of gastric cancer can overcome these limitations. This work aims to construct a cheap, non-invasive, rapid, and high-precision gastric cancer diagnostic model using personal behavioral lifestyles and non-invasive characteristics. A retrospective study was conducted on 3,630 participants. The developed models (extreme gradient boosting, decision tree, random forest, and logistic regression) were evaluated by cross-validation and by their generalization ability on our test set. We found that the model developed using fingerprints based on the extreme gradient boosting (XGBoost) algorithm produced better results than the other models. On the test set, its overall accuracy was 85.7%, AUC 89.6%, sensitivity 78.7%, specificity 76.9%, and positive predictive value 73.8%, indicating that the proposed model has significant medical value and good application prospects.
Keywords: behavioral lifestyles; gastric cancer; machine learning; non-invasive; retrospective study
Year: 2022 PMID: 36052291 PMCID: PMC9424643 DOI: 10.3389/frai.2022.956385
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
Figure 1. The workflow of the current research.
Variable inclusion in ML algorithms.
| Variable | Type | Description and coding |
|---|---|---|
| Gender | Categorical variable | Female (0), male (1) |
| Age | Continuous variable | Age of subjects |
| Family gastric cancer | Categorical variable | Gastric cancer history of first-degree relatives; no (0), yes (1) |
| H. pylori test | Categorical variable | H. pylori infection; negative (–, 0), positive (+, 1) |
| Vegetable intake | Categorical variable | Regular (0), occasional (1) |
| Fruit intake | Categorical variable | Regular (0), occasional (1) |
| Protein intake | Categorical variable | Intake of milk or beans; regular (0), occasional (1) |
| PGI | Continuous variable | Serum pepsinogen I |
| PGII | Continuous variable | Serum pepsinogen II |
| High-salt diet | Categorical variable | Salt ≤ 10 grams a day (0), salt > 10 grams a day (1) |
| PGR | Continuous variable | The ratio of pepsinogen I to pepsinogen II |
| Smoking | Categorical variable | >1 cigarette a day for more than a year (1), else (0) |
| Alcohol | Categorical variable | At least once a day for more than a year (1), else (0) |
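The codings in the table above translate directly into a numeric feature vector. The following is a minimal sketch in Python, assuming a hypothetical `encode_subject` helper and dictionary keys chosen here for illustration (neither appears in the paper). PGR is computed as PGI divided by PGII, as the table defines it.

```python
from typing import Dict, List

# Illustrative only: the field names and this helper are assumptions,
# not the authors' code. The 0/1 codings follow the variable table.
FEATURE_ORDER = [
    "gender", "age", "family_gastric_cancer", "h_pylori",
    "vegetable_intake", "fruit_intake", "protein_intake",
    "pgi", "pgii", "high_salt_diet", "pgr", "smoking", "alcohol",
]

def encode_subject(record: Dict) -> List[float]:
    """Encode one subject record into a numeric feature vector."""
    intake = {"regular": 0, "occasional": 1}
    yes_no = {"no": 0, "yes": 1}
    features = {
        "gender": {"female": 0, "male": 1}[record["gender"]],
        "age": record["age"],
        "family_gastric_cancer": yes_no[record["family_gastric_cancer"]],
        "h_pylori": {"negative": 0, "positive": 1}[record["h_pylori"]],
        "vegetable_intake": intake[record["vegetable_intake"]],
        "fruit_intake": intake[record["fruit_intake"]],
        "protein_intake": intake[record["protein_intake"]],
        "pgi": record["pgi"],
        "pgii": record["pgii"],
        # More than 10 g of salt a day is coded 1, otherwise 0.
        "high_salt_diet": 1 if record["salt_grams_per_day"] > 10 else 0,
        # PGR is the ratio of pepsinogen I to pepsinogen II.
        "pgr": record["pgi"] / record["pgii"],
        "smoking": yes_no[record["smoking"]],
        "alcohol": yes_no[record["alcohol"]],
    }
    return [float(features[name]) for name in FEATURE_ORDER]

# Hypothetical subject, invented for the example.
example = {
    "gender": "male", "age": 58, "family_gastric_cancer": "no",
    "h_pylori": "positive", "vegetable_intake": "regular",
    "fruit_intake": "occasional", "protein_intake": "regular",
    "pgi": 120.0, "pgii": 12.0, "salt_grams_per_day": 12,
    "smoking": "yes", "alcohol": "no",
}
vector = encode_subject(example)
print(vector)
```

Keeping the column order in a single list such as `FEATURE_ORDER` is one way to make sure training and prediction see the features in the same positions.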
Figure 2. The ethical review process of subjects.
Details of annual survey subjects and information collection.
| | | | | Total |
|---|---|---|---|---|
| 859 | 1,251 | 1,287 | 985 | 4,382 |
Figure 3. The process of inclusion and elimination.
Figure 4. The framework of the extreme gradient boosting algorithm.
Preset parameters of the XGBoost algorithm.
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| learning_rate | 0.305 | n_estimator | 18 |
| max_depth | 5 | min_child_weight | 1 |
| sub_sample | 0.9 | n_jobs | 48 |
| reg_lambda | 0.4 | reg_alpha | 0 |
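As a rough illustration of how these settings could be passed to a model, the sketch below collects the table values in a dictionary and maps them to keyword arguments of `xgboost.XGBClassifier`. The mapping of `n_estimator` to `n_estimators` and of `sub_sample` to `subsample` is an assumption about what the table intends. The helper `to_xgboost_kwargs` is hypothetical, and the paper does not say which XGBoost interface the authors used.

```python
# Values are copied from the parameter table; the renaming is an assumption.
PAPER_PARAMS = {
    "learning_rate": 0.305,
    "n_estimator": 18,
    "max_depth": 5,
    "min_child_weight": 1,
    "sub_sample": 0.9,
    "n_jobs": 48,
    "reg_lambda": 0.4,
    "reg_alpha": 0,
}

# Assumed correspondence between the table's spellings and XGBClassifier keywords.
RENAMES = {"n_estimator": "n_estimators", "sub_sample": "subsample"}

def to_xgboost_kwargs(params):
    """Return a copy of params with names rewritten to XGBClassifier keywords."""
    return {RENAMES.get(name, name): value for name, value in params.items()}

xgb_kwargs = to_xgboost_kwargs(PAPER_PARAMS)

try:
    # Optional dependency: only needed to actually build the classifier.
    from xgboost import XGBClassifier
    model = XGBClassifier(**xgb_kwargs)
except Exception:
    # xgboost is missing or unusable here; the keyword dict is still available.
    model = None

print(xgb_kwargs)
```

With only 18 trees at depth 5, a configuration like this trains quickly; `n_jobs` controls parallelism only and would normally be set to match the available cores.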
Characteristics of the subjects.
| Characteristic | Total | | | | | | P value |
|---|---|---|---|---|---|---|---|
| Gender | | | | | | | 0.408 |
| Male, n (%) | 2,034 | 1,662 (55.7) | 372 (57.5) | 251 (56.3) | 46 (59.7) | 75 (60.5) | |
| Female, n (%) | 1,596 | 1,321 (44.3) | 275 (42.5) | 195 (43.7) | 31 (40.3) | 49 (39.5) | |
| Age (years) | | | | | | | <0.001 |
| <45 | 760 | 659 (22.1)ac | 101 (15.6) | 63 (14.1)b | 20 (26.0)a | 18 (14.5) | |
| 45–65 | 2,252 | 1,863 (62.5) | 389 (60.1) | 298 (66.8) | 46 (59.7) | 45 (36.3) | |
| >65 | 618 | 461 (15.5)[ | 157 (24.3) | 85 (19.1)[ | 11 (14.3) | 61 (49.2) | |
| Family history | | | | | | | 0.024 |
| Yes, n (%) | 433 | 339 (11.4) | 94 (14.5) | 68 (15.2) | 11 (14.3) | 15 (12.1) | |
| No, n (%) | 3,197 | 2,644 (88.6) | 553 (85.5) | 378 (84.8) | 66 (85.7) | 109 (87.9) | |
| Vegetable | | | | | | | <0.001 |
| Occasional, n (%) | 1,019 | 843 (28.3)[ | 176 (27.2) | 120 (26.9) | 27 (35.1) | 29 (23.4) | |
| Regular, n (%) | 2,611 | 2,140 (71.7) | 471 (72.8) | 326 (73.1) | 50 (64.9) | 95 (76.6) | |
| Fruits | | | | | | | 0.011 |
| Occasional, n (%) | 2,508 | 2,088 (70.0)[ | 420 (64.9) | 266 (59.6) | 50 (64.9) | 114 (91.9) | |
| Regular, n (%) | 1,122 | 895 (30.0)[ | 227 (35.1) | 180 (40.4) | 27 (35.1) | 10 (8.1) | |
| Alcohol | | | | | | | 0.023 |
| Yes, n (%) | 2,805 | 2,301 (77.1) | 504 (77.9) | 349 (78.3) | 57 (74.0) | 98 (79.0) | |
| No, n (%) | 825 | 682 (22.9) | 143 (22.1) | 97 (21.7) | 20 (26.0) | 26 (21.0) | |
| Smoking | | | | | | | 0.025 |
| Yes, n (%) | 2,756 | 2,266 (76.0) | 490 (75.7) | 328 (73.5) | 64 (83.1) | 98 (79.0) | |
| No, n (%) | 874 | 717 (24.0) | 157 (24.3) | 118 (26.5) | 13 (16.9) | 26 (21.0) | |
| H. pylori | | | | | | | <0.001 |
| Positive, n (%) | 2,501 | 1,930 (64.7)[ | 571 (88.3) | 376 (84.3) | 71 (92.2) | 124 (100) | |
| Negative, n (%) | 1,129 | 1,053 (35.3)[ | 76 (11.7) | 70 (15.7)[ | 6 (7.8) | 0 (0) | |
| | | | | | | | 0.251 |
| Regular, n (%) | 500 | 420 (14.1) | 80 (12.4) | 52 (11.7) | 12 (15.6) | 16 (12.9) | |
| Occasional, n (%) | 3,130 | 2,563 (85.9) | 567 (87.6) | 394 (88.3) | 65 (84.4) | 108 (87.1) | |
| | | | | | | | 0.006 |
| Regular, n (%) | 1,094 | 928 (31.1)[ | 166 (25.7) | 114 (25.6)[ | 53 (68.8) | 96 (77.4) | |
| Occasional, n (%) | 2,536 | 2,055 (68.9)[ | 481 (74.3) | 332 (74.4)[ | 24 (31.2) | 28 (22.6) | |
| | | | | | | | 0.024 |
| Yes, n (%) | 381 | 329 (11.0)[ | 52 (8.0) | 30 (6.7) | 5 (6.5) | 17 (13.7) | |
| No, n (%) | 3,249 | 2,654 (89.0) | 595 (92.0) | 461 (93.3) | 72 (93.5) | 107 (86.3) | |
| PGI (ng/ml) | 118.9 (98.0) | 119.5[ | 115.8 (76.0) | 110.1 | 102.7 | 144.6 | 0.261 |
| PGII (ng/ml) | 11.5 (9.7) | 11.3 | 12.3 (10.4) | 11.7 | 10.5 | 15.1 | 0.022 |
| PGR | 13.6 (13.5) | 13.8 | 13.0 (16.9) | 12.9 | 12.8 (8.9) | 13.2 | 0.201 |
GC, gastric cancer; SAG, severe atrophic gastritis; MAG, mild–moderate atrophic gastritis; NAG, non-atrophic gastritis.
Regular (≥ 3 times/week) and occasional (<3 times/week).
(MAG, SAG, and NAG) group compared with the GC group, p < 0.05.
(MAG and NAG) group compared with the SAG group, p < 0.05.
NAG group compared with the MAG group, p < 0.05.
P.
Multivariate logistic regression analysis of influencing factors of gastric cancer.
| Variable | Level | OR | 95% CI | P value | β |
|---|---|---|---|---|---|
| Age (years) | <45 | Reference | | | |
| | 45–65 | 3.172 | 1.786–5.629 | 0.379 | 1.154 |
| | >65 | 4.199 | 2.716–6.495 | 0.043 | 1.435 |
| Family History | No | Reference | | | |
| | Yes | 1.118 | 0.605–2.067 | 0.026 | 0.111 |
| Vegetable | Occasional | Reference | | | |
| | Regular | 0.384 | 0.255–0.583 | 0.031 | −0.959 |
| Fruits | Occasional | Reference | | | |
| | Regular | 0.156 | 0.113–0.287 | <0.01 | −1.873 |
| Alcohol | No | Reference | | | |
| | Yes | 2.951 | 2.398–4.650 | <0.01 | 1.082 |
| Smoking | No | Reference | | | |
| | Yes | 1.547 | 0.840–3.791 | 0.038 | 0.438 |
| H. pylori | Negative | Reference | | | |
| | Positive | 4.039 | 2.641–6.207 | <0.01 | 1.396 |
| PGII (ng/ml) | <9.2 | Reference | | | |
| | ≥9.2 | 1.758 | 1.324–4.016 | <0.01 | 0.564 |
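In logistic regression the odds ratio for a factor is the exponential of its coefficient, OR = exp(β). The sketch below checks that relation for each non-reference row of the table, taking the first numeric column as the odds ratio and the last as β. That reading is an interpretation, supported by the values satisfying the relation. The row labels and the `odds_ratio` helper are illustrative.

```python
import math

# (label, reported odds ratio, reported coefficient) copied from the regression table.
ROWS = [
    ("Age 45-65", 3.172, 1.154),
    ("Age >65", 4.199, 1.435),
    ("Family history", 1.118, 0.111),
    ("Regular vegetable intake", 0.384, -0.959),
    ("Regular fruit intake", 0.156, -1.873),
    ("Alcohol", 2.951, 1.082),
    ("Smoking", 1.547, 0.438),
    ("H. pylori positive", 4.039, 1.396),
    ("PGII >= 9.2 ng/ml", 1.758, 0.564),
]

def odds_ratio(beta):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(beta)

# Largest absolute gap between a reported OR and exp(beta) across all rows.
max_gap = max(abs(reported - odds_ratio(beta)) for _, reported, beta in ROWS)

for label, reported, beta in ROWS:
    print(f"{label:26s} reported OR={reported:.3f}  exp(beta)={odds_ratio(beta):.3f}")
```

A coefficient below zero gives an odds ratio below 1 (as for regular vegetable and fruit intake), and a coefficient above zero gives an odds ratio above 1.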
Clinicopathologic characteristics of subjects in the training and test dataset.
| Characteristic | Level | Training set | Test set |
|---|---|---|---|
| Age (years) | <45 | 657 (22.6%) | 103 (14.2%) |
| | 45–65 | 1,713 (59.0%) | 539 (74.2%) |
| | >65 | 534 (18.4%) | 84 (11.6%) |
| Family History | No | 351 (12.1%) | 82 (11.3%) |
| | Yes | 2,553 (87.9%) | 644 (88.7%) |
| Vegetable | Occasional | 817 (28.1%) | 202 (27.8%) |
| | Regular | 2,087 (71.9%) | 524 (72.2%) |
| Fruits | Occasional | 2,064 (71.1%) | 444 (61.2%) |
| | Regular | 840 (28.9%) | 282 (38.8%) |
| Alcohol | No | 663 (22.8%) | 162 (22.3%) |
| | Yes | 2,241 (77.2%) | 564 (77.7%) |
| Smoking | No | 703 (24.2%) | 171 (23.6%) |
| | Yes | 2,201 (75.8%) | 555 (76.4%) |
| H. pylori | Negative | 913 (31.4%) | 216 (29.8%) |
| | Positive | 1,991 (68.6%) | 510 (70.2%) |
| PGII (ng/ml) | <9.2 | 1,328 (45.7%) | 323 (43.6%) |
| | ≥9.2 | 1,576 (54.3%) | 403 (56.4%) |
| Gastric cancer | No | 2,805 (96.6%) | 701 (96.5%) |
| | Yes | 99 (3.4%) | 25 (3.5%) |
Performance among algorithms on the diagnosis of gastric cancer.
| Algorithm | AUC (SD) | Accuracy (SD) | Sensitivity (SD) | Specificity (SD) | PPV (SD) |
|---|---|---|---|---|---|
| XGBoost | 0.896 (0.005) | 0.857 (0.008) | 0.787 (0.010) | 0.769 (0.011) | 0.738 (0.009) |
| Random forest | 0.786 (0.019) | 0.724 (0.015) | 0.559 (0.003) | 0.983 (0.010) | 0.663 (0.011) |
| Logistic regression | 0.782 (0.017) | 0.715 (0.010) | 0.628 (0.019) | 0.885 (0.013) | 0.531 (0.004) |
| Decision Tree | 0.833 (0.018) | 0.784 (0.021) | 0.745 (0.010) | 0.821 (0.015) | 0.689 (0.006) |
AUC, area under the receiver operating characteristic curve; SD, standard deviation; PPV, positive predictive value; XGBoost: extreme gradient boosting.
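The threshold-based metrics in the table (accuracy, sensitivity, specificity, PPV) are all functions of the confusion-matrix counts, while AUC depends on how the classifier ranks positive cases against negative ones. The sketch below uses the standard definitions with hypothetical counts and scores that are not taken from the study. The function names `diagnostic_metrics` and `auc` are illustrative.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive and negative counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,     # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),     # share of diseased cases detected
        "specificity": tn / (tn + fp),     # share of healthy cases cleared
        "ppv": tp / (tp + fp),             # share of positive calls that are correct
    }

def auc(scores_pos, scores_neg):
    """AUC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical confusion-matrix counts, invented for the example.
metrics = diagnostic_metrics(tp=40, fp=10, tn=90, fn=10)

# Hypothetical predicted probabilities for positive and negative cases.
example_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])

print(metrics)
print(round(example_auc, 3))
```

The pairwise form of AUC is quadratic in the number of cases, which is fine for an illustration; library routines compute the same quantity from sorted scores.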
Figure 5. Receiver operating characteristic curves for the algorithms.
Figure 6. Characteristics ranking among different algorithms.
Comparison of different studies.
| Study | Features | Limitations | Accuracy (%) |
|---|---|---|---|
| Kuo et al. ( | serum TFF2, serum TFF3, etc. | Invasive (bleeding, stress reaction, etc.) | 73.4 |
| Zhu et al. ( | miR-(16, 25, 92a, etc.) | Complexity and expense | 78.2 |
| Liu et al. ( | miR-(20a, 34, 423-5p, etc.) | | 84.9 |
| Zhu et al. ( | CEA, CA19-9, CA-125, NLR, Hb, Alb, etc. | Unsuitable for early diagnosis | 83.1 |
| Ours | PGII, H. pylori, smoking, etc. | Subjectivity | 85.7 |
TFF, trefoil factor; CEA, carcinoma embryonic antigen; CA, carbohydrate antigen; NLR, neutrophil-lymphocyte ratio; Alb, albumin; Hb, hemoglobin; PGII, Pepsinogen II; H. pylori, Helicobacter pylori.
The maximum accuracy obtained.