Fangtao Yin1, Hongyu Zhu1, Songlin Hong2, Chen Sun2, Jie Wang3,4, Mengting Sun3,4, Lin Xu1, Xiaoxiao Wang5, Rong Yin1,3,4.
Abstract
BACKGROUND: Lung cancer is the malignant tumor that poses the greatest threat to human health and life. Using a variety of machine learning algorithms and statistical analyses on a large sample of real-world clinical data, this paper identifies new indicators for the early diagnosis of lung cancer and evaluates their diagnostic performance.
Keywords: Machine learning; early diagnosis; fibrinogen; lung cancer; sex
Year: 2021 PMID: 33987321 PMCID: PMC8106088 DOI: 10.21037/atm-20-4704
Source DB: PubMed Journal: Ann Transl Med ISSN: 2305-5839
Figure 1. Overview of the study workflow and ROC curves of the machine learning models. (A) Overview of the study workflow. First, clinical data were collected from Jiangsu Cancer Hospital and cleaned. The minimum description length (MDL) algorithm was used to select and explore data characteristics. Then, based on medical knowledge, fibrinogen, alkaline phosphatase and sex were screened as clinical indicators to distinguish between the stages of lung cancer. Further statistical tests were used to verify the reliability of the results. Finally, nonnegative matrix factorization (NMF) and decision tree (DT) algorithms were used to explore the main characteristic expressions of fibrinogen with sex in the two stages of lung cancer. (B) NB model with the variables fibrinogen and sex (NB_Fibrinogen, Sex). AUC, 0.715. (C) NB model with the variables fibrinogen, sex and alkaline phosphatase (NB_TrainingData). AUC, 0.735. (D) NB model with the same variables used in the training model (sex, fibrinogen and alkaline phosphatase), built from the validation data (NB_ValidationData). AUC, 0.79. (E) NB model established from the training data and tested with the validation data (NB_TestWithVal.Data). AUC, 0.74. AUC, area under the curve; NB, naive Bayesian; ROC, receiver operating characteristic.
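To make the NB models in panels B to E more concrete, the following is a minimal sketch of a naive Bayes classifier over sex and fibrinogen. It is not the authors' implementation: the Gaussian likelihood for fibrinogen, the Laplace smoothing, the function names `fit_nb` and `predict_proba`, and the toy records are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_nb(rows):
    """Fit a naive Bayes model from (sex, fibrinogen, stage) rows, stage in {0, 1}."""
    by_class = defaultdict(list)
    for sex, fib, stage in rows:
        by_class[stage].append((sex, fib))
    model = {}
    n = len(rows)
    for stage, items in by_class.items():
        fibs = [f for _, f in items]
        mean = sum(fibs) / len(fibs)
        # Small constant keeps the variance strictly positive.
        var = sum((f - mean) ** 2 for f in fibs) / len(fibs) + 1e-9
        # Laplace-smoothed P(sex | stage) over the two sex categories.
        sex_p = {
            s: (sum(1 for x, _ in items if x == s) + 1) / (len(items) + 2)
            for s in ("M", "F")
        }
        model[stage] = {"prior": len(items) / n, "mean": mean, "var": var, "sex": sex_p}
    return model


def predict_proba(model, sex, fib):
    """Posterior probability of stage 1 (here: stage II, III or IV)."""
    log_post = {}
    for stage, p in model.items():
        log_gauss = (-0.5 * math.log(2 * math.pi * p["var"])
                     - (fib - p["mean"]) ** 2 / (2 * p["var"]))
        log_post[stage] = math.log(p["prior"]) + math.log(p["sex"][sex]) + log_gauss
    top = max(log_post.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - top) for k, v in log_post.items()}
    return weights[1] / sum(weights.values())


# Made-up toy records for illustration only; these are NOT study data.
toy = [("F", 2.4, 0), ("F", 2.7, 0), ("M", 2.6, 0), ("F", 2.9, 0),
       ("M", 3.4, 1), ("M", 3.8, 1), ("F", 3.3, 1), ("M", 3.1, 1)]
model = fit_nb(toy)
p_low = predict_proba(model, "F", 2.5)   # female, low fibrinogen
p_high = predict_proba(model, "M", 3.6)  # male, high fibrinogen
```

On this toy data the posterior for the later-stage class is below 0.5 for the first query and above 0.5 for the second, which mirrors the direction of the association described in the caption but says nothing about its size in the real cohort.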
Characteristics of the study population: training cohort (n=2,502) and validation cohort (n=447)
| Variables | Categories | Training cohort (n=2,502): Stage I (n=1,561) | Training cohort (n=2,502): Stage II, III and IV (n=941) | Validation cohort (n=447): Stage I (n=336) | Validation cohort (n=447): Stage II, III and IV (n=111) |
|---|---|---|---|---|---|
| Sex | Male | 767 | 676 | 147 | 73 |
| | Female | 794 | 265 | 189 | 38 |
| Age | Median | 61 | 63 | 60 | 64 |
| | Range | 28–86 | 25–87 | 30–80 | 43–82 |
| Tobacco use | Yes | 233 | 365 | 19 | 15 |
| | No | 376 | 430 | 313 | 93 |
| | Quit | 0 | 0 | 2 | 3 |
| | Unknown | 952 | 146 | 2 | 0 |
| Histological type | Adenocarcinoma | 940 | 453 | 216 | 58 |
| | Squamous cell carcinoma | 122 | 207 | 10 | 28 |
| | Others | 499 | 281 | 110 | 25 |
| Fibrinogen | Median | 2.71 | 3.22 | 2.79 | 3.39 |
| | Range | 1.19–5.76 | 0.93–5.96 | 1.3–5.4 | 1.23–5.58 |
| Pleural effusion | Yes | 194 | 223 | 2 | 4 |
| | No | 281 | 411 | 63 | 15 |
| | Null | 1,086 | 307 | 271 | 92 |
| Chlorine | Median | 103 | 102 | 103.3 | 102.85 |
| | Range | 82–118 | 82–112 | 87.7–116.1 | 91.6–112.3 |
| Albumin-globulin (A/G) ratio | Median | 1.7 | 1.585 | 1.7 | 1.5 |
| | Range | 1.04–2.9 | 0.64–2.8 | 1–2.6 | 0.8–2.5 |
| Glutamic-oxaloacetic transaminase | Median | 21 | 22 | 20 | 22 |
| | Range | 6–244 | 8–257 | 0.8–162 | 1–746 |
| Alkaline phosphatase | Median | 71 | 79 | 65 | 75 |
| | Range | 14–319 | 4–1,449 | 27–291 | 43–215 |
| Albumin | Median | 43 | 42 | 34.15 | 34.65 |
| | Range | 25–54 | 22–55 | 2.33–52.1 | 1.32–54.1 |
| Monocytes | Median | 0.445 | 0.46 | 7.47 | 7.605 |
| | Range | 0–1.74 | 0–4.46 | 3.01–30.32 | 3.75–21.79 |
| Initial symptoms | Hemoptysis | 5 | 18 | 1 | 2 |
| | Coughing | 72 | 154 | 45 | 39 |
| | Physical findings | 281 | 180 | 272 | 55 |
| | Expectoration | 3 | 5 | 0 | 1 |
| | Chest pain | 9 | 14 | 6 | 5 |
| | Other symptoms/null | 1,191 | 570 | 12 | 9 |
| High-density lipoprotein | Median | 1.32 | 1.27 | 1.11 | 1.08 |
| | Range | 0.72–2.87 | 0.53–3.66 | 0.51–2.62 | 0.57–1.94 |
Table of candidate indicators
| Variable | Order | Importance |
|---|---|---|
| Fibrinogen* | 1 | 0.039434316 |
| Sex* | 2 | 0.03285276 |
| Pleural effusion* | 3 | 0.005590354 |
| Chlorine* | 4 | 0.005446061 |
| Albumin-globulin (A/G) ratio* | 5 | 0.003882843 |
| Glutamic-oxaloacetic transaminase* | 6 | 0.00385326 |
| Alkaline phosphatase* | 7 | 0.003628277 |
| Lymphatic metastasis (CT) | 8 | 0.002124314 |
| Tumor (X-ray) | 9 | 0.001673887 |
| Albumin | 10 | 0.001413229 |
| Monocytes | 11 | 0.001240597 |
| Initial symptoms | 12 | 0.001149529 |
| Blood coagulation | 13 | 0.001130126 |
| High-density lipoprotein | 14 | 0.001040269 |
As recommended by the MDL algorithm, indicators with importance values greater than 0 were selected as predictive candidate indicators to explore the distinctions between early- and middle- to late-stage lung cancer. *, the top 7 nontumor marker variables are shown.
Summary statistics for lung cancer stage classification models
| Variables | Algorithm | Average accuracy | Positive predictive value | Negative predictive value | AUC | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|
| Sex, fibrinogen | Naive Bayesian | 67.4% | 72.4% | 62.5% | 71.5% | 72.4% | 37.5% |
| Sex, fibrinogen, alkaline phosphatase | Naive Bayesian | 67.8% | 65.3% | 70.3% | 73.5% | 73.5% | 37.1% |
AUC, area under the curve.
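For readers unfamiliar with the AUC column, the sketch below computes the area under the ROC curve from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `auc` and the toy scores are hypothetical and unrelated to the study data.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up model scores and true labels, for illustration only.
toy_scores = [0.9, 0.8, 0.4, 0.35, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)  # 8 of 9 pairs are ordered correctly
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a sketch; a sort-based rank sum gives the same value for large cohorts.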
Model validation
| Variable | Average accuracy | Positive predictive value | Negative predictive value | AUC | Sensitivity | Specificity |
|---|---|---|---|---|---|---|
| NB_TrainingData | 67.77% | 65.29% | 70.25% | 0.73 | 73.53% | 62.88% |
| NB_ValidationData | 66.38% | 54.05% | 78.70% | 0.79 | 81.08% | 66.67% |
| NB_TestWithVal.Data | 69.55% | 67.37% | 71.72% | 0.74 | 72.63% | 65.86% |
NB_TrainingData, NB model with variables fibrinogen, sex and alkaline phosphatase; NB_ValidationData, NB model with the same variables used in the training model, sex, fibrinogen, and alkaline phosphatase, built from the validation data; NB_TestWithVal.Data, NB model established by the training data tested with the validation data; AUC, area under the curve.
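The columns of the model validation table can be related to one another with a short calculation. The sketch below gives the standard definitions of sensitivity, specificity, PPV and NPV from confusion-matrix counts, then checks an arithmetic pattern in the reported figures: in each row, the average accuracy agrees to rounding with the mean of the reported PPV and NPV. The source does not state how average accuracy was computed, so this is an observation about the numbers rather than a description of the authors' method; the counts passed to `rates` are hypothetical.

```python
def rates(tp, fp, tn, fn):
    """Standard diagnostic rates from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical counts, only to show the definitions in use.
example = rates(tp=8, fp=2, tn=6, fn=4)

# (average accuracy, PPV, NPV) in percent, as reported in the validation table.
reported = {
    "NB_TrainingData": (67.77, 65.29, 70.25),
    "NB_ValidationData": (66.38, 54.05, 78.70),
    "NB_TestWithVal.Data": (69.55, 67.37, 71.72),
}

# Absolute gap between mean(PPV, NPV) and the reported average accuracy.
gaps = {
    name: abs((ppv + npv) / 2 - acc)
    for name, (acc, ppv, npv) in reported.items()
}
```

Every gap is within rounding of the two-decimal values in the table.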
The clustering results showed that the KM algorithm divided fibrinogen into 7 categories
| ID | Lower limit | Upper limit | Confidence (%) =100.00 | Support (n) |
|---|---|---|---|---|
| 1 | 1.255 | 2.265 | 95.97 | 262 |
| 2 | 2.265 | 2.77 | 85.63 | 280 |
| 3 | 2.265 | 3.275 | 100 | 657 |
| 4 | 3.275 | 3.78 | 100 | 214 |
| 5 | 3.78 | 4.285 | 80.28 | 114 |
| 6 | 4.285 | 4.79 | 81.72 | 76 |
| 7 | 4.79 | 5.8 | 100 | 110 |
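The categories in this table can be applied to a raw fibrinogen value with a simple lookup. The sketch below is an illustration under stated assumptions: it uses the listed upper limits as contiguous cut points (row 3 is printed with a lower limit of 2.265, which overlaps row 2, so the sketch takes the upper limit of row 2 as the boundary instead), treats each bin as closed on the right, and returns `None` for values outside 1.255 to 5.8. The source does not say which side of each interval is closed, and `fib_category` is a hypothetical helper name.

```python
from bisect import bisect_left

# Upper limits of categories 1-7 as listed in the table.
UPPER_LIMITS = [2.265, 2.77, 3.275, 3.78, 4.285, 4.79, 5.8]
LOWEST_LIMIT = 1.255  # lower limit of category 1


def fib_category(value):
    """Map a fibrinogen value to category 1-7, or None if out of range.

    Assumes contiguous, right-closed bins (an assumption, not from the source).
    """
    if not (LOWEST_LIMIT <= value <= UPPER_LIMITS[-1]):
        return None
    return bisect_left(UPPER_LIMITS, value) + 1


categories = [fib_category(v) for v in (2.5, 3.0, 5.8, 1.0)]
```

Note that the fibrinogen ranges in the cohort characteristics table extend slightly beyond these limits, so a real pipeline would need a rule for such values; the source does not describe one.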
Figure 2. Schematic diagram of the NMF used to reduce the dimensionality of the data. NMF was used as a data-reduction technique to reduce the dimensions of the data in the early-stage lung cancer group. For this group, three variables were included, namely, patient ID, sex (male/female) and discretized fibrinogen (from category 1 to 7), with a total sample size of 1,176 that formed a nonsparse matrix. NMF, nonnegative matrix factorization.
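As a rough illustration of the factorization described in this caption, the sketch below factors a small nonnegative matrix V into W and H using the classic multiplicative update rules for squared error. Several things here are assumptions rather than facts from the source: the one-hot encoding of sex and fibrinogen category as columns, the choice of multiplicative updates, the helper names, and the toy patients. The source does not specify which NMF solver or encoding was used.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def frob_error(v, w, h):
    """Squared Frobenius distance between V and W @ H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Rank-k NMF via multiplicative updates; all factors stay nonnegative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


# Made-up patients, one-hot encoded; columns are hypothetical labels.
COLUMNS = ["Sex=M", "Sex=F", "Fib=1", "Fib=2", "Fib=3"]
toy = [
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
]
w_init, h_init = nmf(toy, k=2, iters=0)  # random starting point only
w_fit, h_fit = nmf(toy, k=2, iters=200)  # same start, after updates
error_before = frob_error(toy, w_init, h_init)
error_after = frob_error(toy, w_fit, h_fit)
```

Each row of `h_fit` plays the role of one "feature" in the sense of the results table: a set of nonnegative coefficients over the sex and fibrinogen-level columns.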
NMF algorithm results
| Stage | Feature 1: feature | Feature 1: coefficient | Feature 2: feature | Feature 2: coefficient |
|---|---|---|---|---|
| Early-stage lung cancer | Sex = M | 0.867523 | Sex = F | 1.009724 |
| | Fib. level = 2 | 0.243168 | Fib. level = 2 | 0.385596 |
| | Fib. level = 3 | 0.203545 | Fib. level = 3 | 0.251904 |
| | Fib. level = 1 | 0.190356 | Fib. level = 1 | 0.166057 |
| | Fib. level = 4 | 0.101417 | Fib. level = 4 | 0.079301 |
| Middle- to late-stage lung cancer | Sex = M | 1.177668 | Sex = F | 0.5871693 |
| | Fib. level = 7 | 0.247422 | Fib. level = 2 | 0.2327963 |
| | Fib. level = 3 | 0.198573 | Fib. level = 3 | 0.162359 |
| | Fib. level = 4 | 0.153598 | Fib. level = 1 | 0.1020188 |
| | Fib. level = 2 | 0.120517 | Fib. level = 4 | 0.0931605 |
| | Fib. level = 6 | 0.102849 | | |
| | Fib. level = 5 | 0.091088 | | |
Two expression features were extracted from the early-stage and middle- to late-stage lung cancer groups. M, male; F, female; Fib. Level,
Figure 3. Decision tree (DT) classification model for lung cancer stage. DT classification model established with the variables sex and fibrinogen. Lung cancer stage 0: stage I; 1: stage II, III and IV. Sex 1: male; 2: female. Rectangle box: parent node. Oval box: leaf node; green: leaf node with high confidence and support; yellow: leaf node with low confidence or support.
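To illustrate how a decision tree might evaluate a split on sex, the sketch below computes the entropy-based information gain of that split using the training-cohort counts from the characteristics table (male: 767 stage I and 676 stage II to IV; female: 794 and 265). The source does not state which split criterion the DT algorithm used, so information gain is an assumption chosen for illustration.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a class-count tuple."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


# (stage I, stage II-IV) counts in the training cohort, by sex.
male = (767, 676)
female = (794, 265)
parent = (male[0] + female[0], male[1] + female[1])

n = sum(parent)
# Weighted entropy of the two child nodes after splitting on sex.
children = (sum(male) / n) * entropy(male) + (sum(female) / n) * entropy(female)
gain = entropy(parent) - children
```

The female branch is noticeably purer than the male branch (a larger share of stage I cases), which is why the split yields a positive gain; a tree would then consider fibrinogen thresholds within each branch.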