Maria Markaki1, Ioannis Tsamardinos2, Arnulf Langhammer3, Vincenzo Lagani2, Kristian Hveem4, Oluf Dimitri Røe5.
Abstract
Lung cancer causes >1·6 million deaths annually, with early diagnosis being paramount to effective treatment. Here we present a validated risk assessment model for lung cancer screening. The prospective HUNT2 population study in Norway examined 65,237 people aged >20 years in 1995-97. After a median of 15·2 years, 583 lung cancer cases had been diagnosed; 552 (94·7%) ever-smokers and 31 (5·3%) never-smokers. We performed multivariable analyses of 36 candidate risk predictors, using multiple imputation of missing data and backwards feature selection with Cox regression. The resulting model was validated in an independent Norwegian prospective dataset of 45,341 ever-smokers, in which 675 lung cancers had been diagnosed after a median follow-up of 11·6 years. Our final HUNT Lung Cancer Model included age, pack-years, smoking intensity, years since smoking cessation, body mass index, daily cough, and hours of daily indoor exposure to smoke. External validation showed a 0·879 concordance index (95% CI [0·866-0·891]) with an area under the curve of 0·87 (95% CI [0·85-0·89]) within 6 years. Only 22% of ever-smokers would need screening to identify 81·85% of all lung cancers within 6 years. Our model of seven variables is simple, accurate, and useful for screening selection.
Keywords: All ages; All smokers; Data-driven; Early diagnosis; Ever-smokers; External validation; Feature selection; Lung cancer prediction
Year: 2018 PMID: 29678673 PMCID: PMC6013755 DOI: 10.1016/j.ebiom.2018.03.027
Source DB: PubMed Journal: EBioMedicine ISSN: 2352-3964 Impact factor: 11.205
All 36 variables included in the backwards feature selection analysis. In univariate analysis, all variables except those indicated in green had significant p values regarding the risk for LC diagnosis (p < 0.05, see Supplementary Table S1b). However, in multivariate analysis, only the red ones were selected and included in the final model.
*The variables were selected from the HUNT2 Baseline Questionnaire 1, 2 and Measurements (NT2BLQ1, NT2BLQ2, and NT2BLM respectively, see link https://www.ntnu.no/hunt/variabler and Supplementary Table S1).
Fig. 1. Schematic representation of the analysis protocol. The analyses were performed on 36 selected predictors measured on a cohort of 58,343 participants involved in the HUNT study. Thirty distinct datasets were created after missing data imputation, and selected variables were transformed in a non-linear fashion. For each of the 30 complete datasets, 200 bootstrapped datasets were created, leading to a total of 6000 training datasets. A backward feature selection (with the Akaike Information Criterion as a stopping rule) was repeated on each training dataset to select the most relevant risk factors for lung cancer. A final model was built over all variables chosen at least once during the feature selection procedure, using Rubin's rule to estimate model coefficients from the 30 complete datasets. A total of 6000 separate bootstrapped validation datasets were created to assess the predictiveness of the model and to correct for overfitting. Both the fitting of the model and its evaluation were repeated for the whole cohort and for ever-smokers only.
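The pooling step in this protocol can be illustrated with a minimal sketch of the standard Rubin's rules for combining one coefficient across imputed datasets. The function name `pool_rubin` and the input numbers are hypothetical; the caption does not say how variances were pooled, so the textbook total-variance formula is assumed here rather than taken from the authors' code.

```python
from statistics import mean, variance

N_IMPUTATIONS = 30   # complete datasets after multiple imputation (from the caption)
N_BOOTSTRAPS = 200   # bootstrap resamples per imputed dataset (from the caption)
N_TRAINING_SETS = N_IMPUTATIONS * N_BOOTSTRAPS  # 6000 training datasets


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets (standard Rubin's rules).

    estimates: the coefficient estimated in each imputed dataset
    variances: the squared standard error of that coefficient in each dataset
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    pooled = mean(estimates)          # pooled point estimate
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical numbers for three imputations, purely for illustration.
pooled_beta, total_var = pool_rubin([1.0, 2.0, 3.0], [0.1, 0.1, 0.1])
```

The between-imputation term is inflated by (1 + 1/m) to account for using a finite number of imputations.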
The HUNT Lung Cancer Model. Variables and questions to participants. Cox prediction model of lung cancer risk for 33,521 HUNT2 participants who had ever smoked⁎ and did not develop any other type of cancer in a mean follow-up time of 13·2 years.
| Variable | Questions to participants | Hazard ratio (95% CI) | P value | Beta coefficient |
|---|---|---|---|---|
| Sex | - | 1·128 (0·941–1·352) | 0·188 | 0·1205819 |
| Age | Age at participation at screening | 0·135 (0·098–0·186) | <0·001 | -2·0020557 |
| Pack-years (log) | Estimated number of pack-years | 3·200 (2·451–4·176) | <0·001 | 1·1630181 |
| Smoking quit time, years (log) | If you previously smoked, how long has it been since you stopped? (Number of years) | 0·786 (0·705–0·876) | <0·001 | -0·2407998 |
| Body mass index (log) | BMI | 0·288 (0·153–0·539) | <0·001 | -1·2462656 |
| Cough daily, yes vs no | Do you cough daily during periods of the year? | 1·501 (1·250–1·802) | <0·001 | 0·4059355 |
| Smoke exposure, hours (log) | How long are you usually in a smoky room each day? (Number of hours) | 1·181 (1·062–1·313) | 0·002 | 0·1663201 |
| Smoking intensity per 1 cigarette increase | How many cigarettes do you or did you usually smoke daily? | 0·971 (0·951–0·991) | 0·004 | -0·0295406 |
To calculate the 16-year lung cancer risk for one person: for categorical variables, multiply the beta coefficient of the variable by 1 if the factor is present and by 0 if it is absent. For continuous variables other than age, multiply their value (or their log value, if indicated) by the beta coefficient of the variable. For age, divide the age by 100, raise the result to the power −1 (that is, use 100/Age), and multiply by the beta coefficient of the variable. The sum of all these products is denoted Xβ. To obtain the person's 16-year LC risk, calculate 1 − 0.06^exp(Xβ). CI denotes confidence interval.
Age had a non-linear association with LC and was transformed as (100/Age).
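The calculation described in the table footnote can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the risk is taken to have the usual Cox form 1 − S0^exp(Xβ), with the 0.06 in the footnote read as the baseline survival S0; "log" is assumed to be the natural logarithm; the coding of sex (which level is 1) and the handling of zero values before taking logs are not given in the text, so the sketch simply requires positive inputs. All function and parameter names are hypothetical.

```python
import math

# Beta coefficients copied from the table above.
BETA = {
    "sex": 0.1205819,            # 0/1 indicator; coding not stated in the text
    "age": -2.0020557,           # applied to 100 / age
    "pack_years": 1.1630181,     # applied to log(pack-years)
    "quit_years": -0.2407998,    # applied to log(years since quitting)
    "bmi": -1.2462656,           # applied to log(BMI)
    "daily_cough": 0.4059355,    # 1 if yes, 0 if no
    "smoke_hours": 0.1663201,    # applied to log(hours per day in a smoky room)
    "cigs_per_day": -0.0295406,  # per 1 cigarette increase
}


def transform(sex, age, pack_years, quit_years, bmi,
              daily_cough, smoke_hours, cigs_per_day):
    """Apply the transformations named in the table (natural log assumed)."""
    return {
        "sex": float(sex),
        "age": 100.0 / age,
        "pack_years": math.log(pack_years),
        "quit_years": math.log(quit_years),
        "bmi": math.log(bmi),
        "daily_cough": float(daily_cough),
        "smoke_hours": math.log(smoke_hours),
        "cigs_per_day": float(cigs_per_day),
    }


def linear_predictor(x):
    """Sum of beta times transformed covariate, i.e. the X-beta term."""
    return sum(BETA[name] * value for name, value in x.items())


def risk_16y(xb, baseline_survival=0.06):
    """Assumed Cox form: 1 - S0 ** exp(X-beta); S0 = 0.06 is an assumption."""
    return 1.0 - baseline_survival ** math.exp(xb)


# Hypothetical person, for illustration only.
person = transform(sex=1, age=60, pack_years=30, quit_years=5, bmi=25,
                   daily_cough=1, smoke_hours=2, cigs_per_day=20)
print(f"Estimated 16-year risk: {risk_16y(linear_predictor(person)):.3f}")
```

Keeping the transformation separate from the linear predictor makes it easy to swap in a different convention for zero values once that convention is known.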
Key differences of the HUNT Lung Cancer Model over externally validated risk prediction models developed in prospective population-based cohorts. AUC refers to prediction of 1-, 5- (EPIC), or 6-year cancer risk (PLCO, HUNT2).
| Key studies | LLPi | EPIC | PLCOM2012 | HUNT2 | CONOR |
|---|---|---|---|---|---|
| Study group characteristics | |||||
| Cohort type | Random selection (n=8760) | Multi-country health study (n=399 393) | Multicentre randomized screening (n=80 375) | ||
| Age limit | 45–79 | 35+ | 55–74 | ||
| Median Pack-years | 18·9 | ≈30 | ≈30 | ||
| Never-smokers analysed | Yes | Yes | No | ||
| Follow-up, years | 8·7 mean | 5 max | 6 max | ||
| Feature selection | Yes backward | Yes, based on AUC and tdNRI | No, pre-specified | ||
| Number of variables | 14 (6 selected) | 12 (4 selected) | 11 | ||
| Coding of non-linearities of continuous variables | No | Yes, including stratification | Yes | ||
| Report on missing data | Yes | Yes | No | ||
| MI | No | No | No | ||
| MI with feature selection | No | No | No | ||
| Internal validation | Yes | No | Yes | ||
| External validation | Yes | EPIC test set | Yes | ||
| Discriminatory power (AUC) | |||||
| Total Population | 0·849 | NR | NR | ||
| Ever-smokers | NR | NR | 0·803 (6 y) | ||
| External validation | 0·67, 0·76, 0·82 | 0·787 (5 y) | 0·797 (6 y) | ||
NR = not reported.
Years of smoking >15 cigarettes per day.
Bootstrap in each of 30 multiply imputed datasets.
MI = multiple imputation.
Area under the receiver operating curve.
Concordance index.
Fig. 2. A. Nomogram to calculate the personal 5-, 10-, and 15-year risk of lung cancer with the use of seven independent factors discovered by backwards feature selection. B. Low-, medium-, and high-risk groups for lung cancer according to the risk prediction model for ever-smokers. The Kaplan–Meier curves are plotted for risk groups defined from 50% and 84% quantiles. Differences among the three curves were highly significant (p = 0·0008) according to the log-rank test. The number of event-free participants in every risk group at different time points is shown above the x-axis.
Fig. 4. Characteristics of the lung cancer events in the validation population (CONOR) with 0–20 years of follow-up. A. Overall distribution of smoke exposure in lung cancer cases in CONOR (n = 709). B. Lung cancer event appearance in ever-smokers (n = 675) in the CONOR population after baseline (x-axis) according to age groups (colour code) and pack-years (y-axis), showing that all age groups are represented in all pack-year groups. In the three vertical boxplots (B, C and D), median time to diagnosis is not significantly different between pack-year groups (not shown). C. Lung cancer event appearance in the CONOR population (all dots) after baseline (x-axis) and pack-years (y-axis) according to NLST criteria (red dots; >30 pack-years, 55–74 years of age and <15 years quit time). Within 6 years, fewer than one third of the total cases would be included in the NLST screening (cases in red within the red quadrant). D. HUNT Lung Cancer Model applied to the CONOR population after baseline registration. Lung cancer event appearance in the CONOR population after baseline (x-axis) and pack-years (y-axis). Red dots are lung cancer cases predicted using the model according to medium- plus high-risk groups in HUNT (Fig. 2B), corresponding to the 16% quantile of the risk of events in HUNT, equal to a risk of 1·75% for developing LC within 16 years (>15 points in the nomogram). Based on this threshold, 221/270 LC events within 6 years (red dots within the red square) and 527/675 events within ~20 years were correctly predicted. More specifically, using this threshold, one would need to examine 9998 out of 45,387 (22%) ever-smokers to identify 82% of future events in a 6-year period or 78% of future events in a 20-year period (median 11·6 years).
Performance of the HUNT Lung Cancer Model versus NLST criteria for lung cancer (LC) diagnosis within 6 years in the validation cohort (CONOR) of ever-smokers with complete data. The threshold is the 16% quantile of risk of events in HUNT, corresponding to an LC risk of at least 1·75% in ~16 years, 0·64% in 6 years, or ~15 points in the nomogram. Of the 45,117 ever-smokers, 1986 were picked by the NLST criteria. Sensitivity, specificity, PPV and NPV are calculated over all 10,000 participants selected by the HUNT Lung Cancer Model.
| Criteria | Participants with LC (N) | Participants without LC (N) | Participants total (N) | Predictive value |
|---|---|---|---|---|
| HUNT Lung Cancer Model criteria | 270 | 45 117 | 45 387 | |
| Criteria positive | 221 TP (2·21%) | 9 779 FP (97·79%) | 10 000 | PPV 2·21% |
| Criteria negative | 49 FN (0·14%) | 35 338 TN (99·86%) | 35 387 | NPV 99·86% |
| Sensitivity | 81·85% | |||
| Specificity | 78·31% | |||
| NLST criteria | ||||
| Criteria positive | 66 TP (0·66%) | 9 934 FP (99·34%) | 10 000 | PPV 0·66% |
| Criteria negative | 204 FN (0·57%) | 35 183 TN (99·43%) | 35 387 | NPV 99·43% |
| Sensitivity | 24·44% | |||
| Specificity | 77·98% |
FN = false negative; FP = false positive; NPV = negative predictive value; PPV = positive predictive value; TN = true negative; TP = true positive.
Total criteria positive selected by the HUNT Lung Cancer Model includes the 1986 picked by the NLST.
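The four summary measures in this table follow directly from the confusion-matrix counts. The sketch below recomputes them for the HUNT Lung Cancer Model rows; the function name `screening_metrics` is hypothetical, and values recomputed this way may differ from the printed percentages in the last decimal place because of rounding.

```python
def screening_metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV and NPV as fractions in [0, 1]."""
    return {
        "sensitivity": tp / (tp + fn),  # share of LC cases flagged by the criteria
        "specificity": tn / (tn + fp),  # share of non-cases not flagged
        "ppv": tp / (tp + fp),          # share of flagged participants with LC
        "npv": tn / (tn + fn),          # share of unflagged participants without LC
    }


# Counts from the HUNT Lung Cancer Model rows of the table above.
hunt = screening_metrics(tp=221, fp=9779, fn=49, tn=35338)
for name, value in hunt.items():
    print(f"{name}: {100 * value:.2f}%")
```

The same function applies to the NLST rows by substituting their counts.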
Accuracy of NLST versus the HUNT Lung Cancer Model for lung cancer (LC) diagnosis within 6 years, using the same number of screenings as NLST in the CONOR ever-smokers with complete data. As compared with NLST criteria, our model's criteria identified 103 vs 69 out of 270 cases showing an improved sensitivity (38.14% vs 25.6%, P = 0.0216) and positive predictive value (4.95% vs 3.3%, P < 0.000001), with the same specificity (95.61% vs 95.5%, P = 0.7321) and similar negative predictive value (99.6% vs. 99.5%, P = 0.95374).
| Criteria | Participants with LC (N) | Participants without LC (N) | Participants total (N) | Predictive value |
|---|---|---|---|---|
| NLST criteria | 270 | 45,117 | 45,387 | |
| Criteria positive | 69 TP (3·3%) | 2012 FP (96·7%) | 2081 | PPV 3·3% |
| Criteria negative | 201 FN (0·5%) | 43,105 TN (99·5%) | 43,306 | NPV 99·5% |
| Sensitivity | 25·6% | |||
| Specificity | 95·5% | |||
| HUNT Lung Cancer Model criteria | ||||
| Criteria positive | 103 TP (4·95%) | 1978 FP (95·05%) | 2081 | PPV 4·95% |
| Criteria negative | 167 FN (0·4%) | 43,139 TN (99·6%) | 43,306 | NPV 99·6% |
| Sensitivity | 38·14% | |||
| Specificity | 95·61% |
FN = false negative; FP = false positive; NPV = negative predictive value; PPV = positive predictive value; TN = true negative; TP = true positive.
NLST criteria for study entry included a history of cigarette smoking of at least 30 pack-years, age between 55 and 74 years and, for former smokers, cessation within the previous 15 years.
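The NLST entry rule stated above can be written as a small predicate. This is a sketch with hypothetical names; treating the age range as inclusive and "within the previous 15 years" as a quit time of at most 15 years are assumptions about boundary cases that the sentence does not settle.

```python
from typing import Optional


def meets_nlst_criteria(age: float, pack_years: float,
                        quit_years: Optional[float] = None) -> bool:
    """Check the NLST entry criteria as described in the footnote.

    quit_years is None for current smokers and the number of years since
    cessation for former smokers.
    """
    if pack_years < 30:              # at least 30 pack-years
        return False
    if not 55 <= age <= 74:          # age between 55 and 74 (inclusive, assumed)
        return False
    if quit_years is not None and quit_years > 15:
        return False                 # former smokers must have quit within 15 years
    return True


# Hypothetical participants, for illustration only.
print(meets_nlst_criteria(age=60, pack_years=35))                 # current smoker
print(meets_nlst_criteria(age=60, pack_years=35, quit_years=20))  # quit too long ago
```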
Fig. 3. Smoking status among lung cancer cases in HUNT2. A. Pack-years distribution at enrolment (not at diagnosis). Of importance, the majority were current smokers, and the 10, 20, and 30 pack-years groups all had a similar size, with 70% of those who developed cancer having smoked <30 pack-years at baseline. B. Distribution of current and former smokers at enrolment; 27% of those who developed lung cancer had a smoking quit time of <30 years.