Jing Li1, Yuwei Zhang2, Qing Chen3, Zhenhua Pan4, Jun Chen4, Meixiu Sun1, Junfeng Wang5, Yingxin Li1, Qing Ye2.
Abstract
Objectives: Lung cancer (LC) is the largest single cause of death from cancer worldwide, and the lack of effective screening methods for early detection currently results in unsatisfactory curative treatments. We herein aimed to use breath analysis, a noninvasive and very simple method, to identify and validate biomarkers in breath for the screening of lung cancer.
Materials and methods: We enrolled a total of 2308 participants from two centers for online breath analyses using proton transfer reaction time-of-flight mass spectrometry (PTR-TOF-MS). The derivation cohort included 1007 patients with primary LC and 1036 healthy controls, and the external validation cohort included 158 LC patients and 107 healthy controls. We used eXtreme Gradient Boosting (XGBoost) to create a panel of predictive features and derived a prediction model to identify LC. The optimal number of features was determined by the greatest area under the receiver-operating characteristic (ROC) curve (AUC).
Keywords: PTR-TOF-MS; breath analysis; lung cancer; machine learning; screening
Year: 2022 PMID: 36203414 PMCID: PMC9531270 DOI: 10.3389/fonc.2022.975563
Source DB: PubMed Journal: Front Oncol ISSN: 2234-943X Impact factor: 5.738
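The abstract uses the AUC as the criterion for choosing the number of features. As a minimal sketch of what that quantity measures, the AUC can be computed as the probability that a randomly chosen LC patient receives a higher model score than a randomly chosen control. The function name `roc_auc` and the scores below are hypothetical and for illustration only; they are not study data, and this is not the authors' implementation.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical model scores (illustrative, not taken from the study).
lc_scores = [0.9, 0.8, 0.4]
control_scores = [0.7, 0.3, 0.2]
auc = roc_auc(lc_scores, control_scores)  # 8 of 9 pairs are ranked correctly
```

The pairwise loop is quadratic in the cohort size, which is acceptable for a sketch; a rank-based formulation would typically be used for larger cohorts.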
Figure 1. Flow chart for patient recruitment in model development and validation cohorts.
Baseline characteristics of the individuals included in the study.
| | Derivation: lung cancer (n = 1007) | Derivation: healthy control (n = 1036) | Validation: lung cancer (n = 158) | Validation: healthy control (n = 107) |
|---|---|---|---|---|
| Male (%) | 559 (55.51%) | 536 (51.74%) | 63 (39.87%) | 40 (37.38%) |
| Age (IQR) [range] | 61 (54, 66) [21-81] | 45 (35, 58) [22-90] | 63 (58, 69) [33-78] | 30 (24, 43) [19-74] |
| Smoking (%) | | | | |
| Smokers | 382 (37.94%) | 209 (20.17%) | 78 (49.37%) | 12 (11.22%) |
| Ex-smokers | 171 (16.98%) | 51 (4.92%) | 24 (15.19%) | 1 (0.93%) |
| Non-smokers | 454 (45.08%) | 776 (74.91%) | 56 (35.44%) | 94 (87.85%) |
| BMI (IQR) | 24.03 (22.04, 26.30) | 24.06 (21.97, 26.30) | 23.96 (21.64, 25.92) | 22.48 (20.28, 25.45) |
| Fasting (%) | 387 (38.43%) | 857 (82.72%) | 17 (10.76%) | 9 (8.41%) |
| Adenocarcinoma (%) | 628 (62.36%) | NA | 133 (84.18%) | NA |
| Squamous cell carcinoma (%) | 160 (15.89%) | NA | 15 (9.49%) | NA |
| Small-cell lung cancer (%) | 91 (9.04%) | NA | 4 (2.53%) | NA |
| Missing | 128 (12.71%) | | 6 (3.80%) | |
| Stage (%) | | | | |
| 0 | 31 (3.08%) | NA | – | NA |
| I | 273 (27.11%) | NA | – | NA |
| II | 121 (12.02%) | NA | – | NA |
| III | 128 (12.71%) | NA | – | NA |
| IV | 170 (16.88%) | NA | – | NA |
| Missing | 284 (28.20%) | | | |
NA, Not applicable for healthy control group.
Figure 2. Feature distributions on the derivation dataset (A) and validation dataset (B), ranked by their importance (the first feature from the left on the first row is the most important). For each feature, the distributions from LC patients (green) and healthy subjects (red) are both shown.
Figure 3. Relation between the number of features selected in the model and model performance. Green bars correspond to feature importance. The black solid line corresponds to the AUC calculated with the top 1-14 features. The black dotted lines indicate the number of features selected when 99% of the maximum AUC is achieved.
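The 99%-of-maximum rule in the Figure 3 caption can be sketched as follows. This assumes the rule means "the smallest number of top-ranked features whose AUC reaches 99% of the best AUC seen", which is one plausible reading of the caption rather than a confirmed description of the authors' procedure. The function name and the AUC values are hypothetical placeholders, not results from the study.

```python
def smallest_k_reaching(aucs, fraction=0.99):
    """Return the smallest k such that the top-k AUC is >= fraction * max AUC.

    aucs[i] is assumed to be the AUC obtained with the top (i + 1) features.
    """
    if not aucs:
        raise ValueError("aucs must be non-empty")
    target = fraction * max(aucs)
    for k, auc in enumerate(aucs, start=1):
        if auc >= target:
            return k


# Hypothetical AUCs for the top 1..14 features (illustrative only).
aucs_by_k = [0.80, 0.86, 0.90, 0.93, 0.945, 0.955, 0.958,
             0.960, 0.961, 0.962, 0.963, 0.963, 0.962, 0.963]
k_selected = smallest_k_reaching(aucs_by_k)
```

With these placeholder values the threshold is 0.99 × 0.963 ≈ 0.953, which is first reached with six features, so later features add less than 1% of the maximum AUC.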
Figure 4. The ROC curves for (A) internal 10-fold cross-validation and (B) external validation. In (A), the darker line represents the averaged result.
Figure 5. Probability calibration plots for (A) internal 10-fold cross-validation and (B) external validation.
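A probability calibration plot compares predicted probabilities with the observed fraction of positives. The sketch below shows one common way to obtain the plotted points, by grouping predictions into equal-width probability bins. The binning scheme, the function name `calibration_bins`, and the data are assumptions for illustration; the source does not state how the plots in Figure 5 were produced.

```python
def calibration_bins(probs, labels, n_bins=5):
    """Group predictions into equal-width bins on [0, 1].

    Returns one tuple per non-empty bin:
    (mean predicted probability, observed positive fraction, count).
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    sums = [0.0] * n_bins
    positives = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums[idx] += p
        positives[idx] += y
        counts[idx] += 1
    return [
        (sums[i] / counts[i], positives[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i] > 0
    ]


# Hypothetical predicted probabilities and true labels (1 = LC, 0 = control).
probs = [0.05, 0.15, 0.45, 0.55, 0.85, 0.95]
labels = [0, 0, 0, 1, 1, 1]
curve = calibration_bins(probs, labels, n_bins=2)
```

For a well-calibrated model, the observed fraction in each bin lies close to the mean predicted probability, so the points fall near the diagonal of the plot.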
Confusion matrix of the derivation and validation datasets.
| Derivation data | | Lung cancer present (current gold standard) | Lung cancer absent (current gold standard) | Total |
|---|---|---|---|---|
| Model prediction | Positive | True positive = 877 | False positive = 67 | 944 |
| | Negative | False negative = 130 | True negative = 969 | 1099 |
| Total | | 1007 | 1036 | 2043 |

| Validation data | | Lung cancer present (current gold standard) | Lung cancer absent (current gold standard) | Total |
|---|---|---|---|---|
| Model prediction | Positive | True positive = 107 | False positive = 29 | 136 |
| | Negative | False negative = 51 | True negative = 78 | 129 |
| Total | | 158 | 107 | 265 |
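The confusion-matrix counts above determine the usual diagnostic accuracy measures. The sketch below applies the standard definitions to the reported counts; the function name `diagnostic_metrics` is hypothetical. Point estimates obtained this way need not coincide exactly with figures reported alongside confidence intervals, since the source does not state how those were estimated.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard diagnostic accuracy measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # recall: detected LC / all LC
    specificity = tn / (tn + fp)          # correctly cleared / all controls
    ppv = tp / (tp + fp)                  # precision: true LC / predicted LC
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_score = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "accuracy": accuracy,
        "f_score": f_score,
    }


# Counts taken from the confusion matrices above.
derivation = diagnostic_metrics(tp=877, fp=67, fn=130, tn=969)
validation = diagnostic_metrics(tp=107, fp=29, fn=51, tn=78)

for name, metrics in (("derivation", derivation), ("validation", validation)):
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```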
Model performance of diagnostic accuracy in the derivation and validation datasets.
| | Training (95% CI) | Validation (95% CI) |
|---|---|---|
| AUC | 0.963 (0.941–0.982) | 0.771 (0.718–0.823) |
| Accuracy | 0.904 (0.888–0.925) | 0.704 (0.654–0.753) |
| Sensitivity/Recall | 0.871 (0.822–0.926) | 0.677 (0.598–0.750) |
| Specificity | 0.935 (0.884–0.967) | 0.730 (0.660–0.798) |
| PPV/Precision | 0.930 (0.883–0.961) | 0.706 (0.631–0.779) |
| F-score | 0.899 (0.880–0.924) | 0.690 (0.625–0.750) |