J Krois, C Graetz, B Holtfreter, P Brinkmann, T Kocher, F Schwendicke.
Abstract
Prediction models learn patterns from available data (training) and are then validated on new data (testing). Prediction modeling is increasingly common in dental research. We aimed to evaluate how different model development and validation steps affect the predictive performance of tooth loss prediction models for patients with periodontitis. Two independent cohorts (627 patients, 11,651 teeth) were followed over a mean ± SD of 18.2 ± 5.6 y (Kiel cohort) and 6.6 ± 2.9 y (Greifswald cohort). Tooth loss and 10 patient- and tooth-level predictors were recorded. The impact of different model development and validation steps was evaluated: 1) model complexity (logistic regression, recursive partitioning, random forest, extreme gradient boosting), 2) sample size (full data set or 10%, 25%, or 75% of cases dropped at random), 3) prediction periods (maximum 10, 15, or 20 y or uncensored), and 4) validation schemes (internal or external by centers/time). Tooth loss was generally a rare event (880 teeth were lost). All models showed limited sensitivity but high specificity. Patients' age and tooth loss at baseline as well as probing pocket depths showed high variable importance. More complex models (random forest, extreme gradient boosting) had no consistent advantages over simpler ones (logistic regression, recursive partitioning). Internal validation (in sample) overestimated the predictive power (area under the curve up to 0.90), while external validation (out of sample) yielded lower areas under the curve (range 0.62 to 0.82). Reducing the sample size decreased the predictive power, particularly for more complex models. Censoring the prediction period had only limited impact. When the model was trained in one period and tested in another, model outcomes were similar to the base case, indicating that temporal validation is a valid option. No model showed higher accuracy than the no-information rate.
In conclusion, none of the developed models would be useful in a clinical setting, despite high accuracy. During modeling, rigorous development and external validation should be applied and reported accordingly.
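The gap between high accuracy and clinical usefulness follows from the rarity of tooth loss. The following is a minimal Python sketch of that arithmetic, using only the counts reported in the abstract (880 of 11,651 teeth lost). The function name `classification_metrics` and the "always predict retained" classifier are illustrative assumptions, not the authors' code.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, accuracy, and no-information rate (NIR)."""
    total = tp + fp + tn + fn
    positives = tp + fn  # teeth actually lost
    negatives = tn + fp  # teeth actually retained
    return {
        "sensitivity": tp / positives,
        "specificity": tn / negatives,
        "accuracy": (tp + tn) / total,
        # NIR: accuracy obtained by always predicting the majority class
        "nir": max(positives, negatives) / total,
    }


# Hypothetical trivial classifier: predicts "retained" for every tooth.
LOST, TEETH = 880, 11_651
trivial = classification_metrics(tp=0, fp=0, tn=TEETH - LOST, fn=LOST)
print(trivial)
```

Such a classifier identifies no lost tooth at all, yet its accuracy equals the no-information rate of roughly 0.92. This is why accuracy alone says little when the event is rare.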
Keywords: biostatistics; dental; periodontal disease; periodontitis; regression analysis; treatment planning
Year: 2019 PMID: 31361174 PMCID: PMC6710618 DOI: 10.1177/0022034519864889
Source DB: PubMed Journal: J Dent Res ISSN: 0022-0345 Impact factor: 6.116
Characteristics of the Sample at Different Time Points.
| Parameter | Kiel | Greifswald |
|---|---|---|
| Patients, male:female | 164:226 | 102:135 |
| Age at T0, y | 45.9 ± 10.2 | 47.1 ± 10.4 |
| SPT (T1 to T2), y | 18.2 ± 5.6 | 6.6 ± 2.9 |
| Smoker:former smoker:never smoker (T0) | 50:88:252 | 31:81:125 |
| No. of teeth lost per patient per year (SPT) | 0.11 ± 0.15 | 0.14 ± 0.58 |
APT ranged from T0 to T1 (first visit to last APT visit) and SPT from T1 to T2 (last APT visit to last SPT visit).
APT, active periodontal therapy; SPT, supportive periodontal therapy.
Distribution of Tooth Loss according to Different Patient- and Tooth-Level Variables in the Full (Unrestricted) Data Set.
| Patient Level | Patients with Tooth Loss, n of N (%) | Tooth Level | Teeth Lost, n of N (%) |
|---|---|---|---|
| Age at T1, y (mean ± SD) | | Tooth type | |
| Lost | 47.1 ± 9.5 | Molar | 466 of 3,107 (15.0) |
| Retained | 45.6 ± 10.9 | Nonmolar | 414 of 8,544 (4.8) |
| Smoking status | | Probing pocket depth, mm | |
| Never | 191 of 377 (50.7) | <5 | 529 of 9,848 (5.4) |
| Former smoker | 75 of 169 (44.4) | 5 to 7 | 294 of 1,654 (17.8) |
| Current smoker | 50 of 81 (61.7) | >7 | 57 of 149 (38.3) |
| Sex | | Furcation involvement | |
| Male | 132 of 266 (49.6) | Grade 0 to 1 | 682 of 10,813 (6.3) |
| Female | 184 of 361 (51.0) | Grade 2 to 3 | 198 of 838 (23.6) |
| | | Bone loss | |
| | | ≤25 | 104 of 3,810 (2.7) |
| | | >25 to 50 | 311 of 5,253 (5.9) |
| | | >50 to 70 | 325 of 2,153 (15.1) |
| | | >70 | 140 of 435 (32.2) |
| | | Mobility | |
| | | 0 | 685 of 10,685 (6.4) |
| | | 1 | 115 of 645 (17.8) |
| | | 2 | 48 of 257 (18.7) |
| | | 3 | 32 of 64 (50.0) |
| | | Dental arch | |
| | | Lower | 530 of 5,293 (10.0) |
| | | Upper | 350 of 6,358 (5.5) |
N = 627 patients, 11,651 teeth.
Age is given as mean ± SD.
Metrics for the Different Model Validation Schemas.
| Model | AUC, Test | AUC, Training (95% CI) | Specificity | Sensitivity | Accuracy (95% CI) | NIR | P Value |
|---|---|---|---|---|---|---|---|
| Scenario 1: Base case. Training set: 8,821 teeth, 472 patients. Test set: 2,830 teeth, 155 patients. | |||||||
| RPA | 0.74 | 0.76 (0.74 to 0.78) | 0.97 | 0.13 | 0.91 (0.9 to 0.92) | 0.92 | 0.999 |
| RFO | 0.77 | 0.84 (0.83 to 0.85) | 0.99 | 0.1 | 0.92 (0.91 to 0.93) | 0.92 | 0.846 |
| XGB | 0.76 | 0.84 (0.84 to 0.85) | 0.98 | 0.16 | 0.91 (0.9 to 0.92) | 0.92 | 0.962 |
| logR | 0.8 | 0.8 (0.79 to 0.81) | 0.99 | 0.1 | 0.92 (0.91 to 0.93) | 0.92 | 0.406 |
| Scenario 2: Training Greifswald–test Kiel. Training set: 4,141 teeth, 237 patients. Test set: 7,510 teeth, 390 patients | |||||||
| RPA | 0.62 | 0.72 (0.68 to 0.77) | 1.0 | 0.03 | 0.9 (0.9 to 0.91) | 0.9 | 0.402 |
| RFO | 0.75 | 0.9 (0.88 to 0.91) | 1.0 | 0.0 | 0.9 (0.9 to 0.91) | 0.9 | 0.587 |
| XGB | 0.72 | 0.89 (0.88 to 0.9) | 1.0 | 0.03 | 0.9 (0.9 to 0.91) | 0.9 | 0.632 |
| logR | 0.77 | 0.84 (0.82 to 0.86) | 1.0 | 0.03 | 0.9 (0.9 to 0.91) | 0.9 | 0.343 |
| Scenario 3: Training Kiel–test Greifswald. Training set: 7,510 teeth, 390 patients. Test set: 4,141 teeth, 237 patients | |||||||
| RPA | 0.76 | 0.75 (0.73 to 0.77) | 0.95 | 0.21 | 0.93 (0.92 to 0.93) | 0.96 | ≥0.999 |
| RFO | 0.78 | 0.84 (0.83 to 0.85) | 0.97 | 0.19 | 0.94 (0.93 to 0.94) | 0.96 | ≥0.999 |
| XGB | 0.79 | 0.83 (0.83 to 0.84) | 0.98 | 0.2 | 0.95 (0.94 to 0.96) | 0.96 | ≥0.999 |
| logR | 0.82 | 0.8 (0.79 to 0.8) | 0.98 | 0.21 | 0.96 (0.95 to 0.96) | 0.96 | 0.989 |
| Scenario 4: Training Kiel–test Kiel. Training set: 5,694 teeth, 294 patients. Test set: 1,816 teeth, 96 patients | |||||||
| RPA | 0.74 | 0.73 (0.71 to 0.75) | 0.95 | 0.15 | 0.88 (0.86 to 0.89) | 0.91 | ≥0.999 |
| RFO | 0.77 | 0.84 (0.83 to 0.84) | 0.98 | 0.12 | 0.9 (0.89 to 0.92) | 0.91 | 0.73 |
| XGB | 0.73 | 0.84 (0.83 to 0.85) | 0.97 | 0.17 | 0.9 (0.88 to 0.91) | 0.91 | 0.949 |
| logR | 0.81 | 0.79 (0.78 to 0.8) | 0.99 | 0.15 | 0.91 (0.89 to 0.92) | 0.91 | 0.488 |
| Scenario 5: Training Greifswald–test Greifswald. Training set: 3,127 teeth, 178 patients. Test set: 1,014 teeth, 59 patients | |||||||
| RPA | 0.68 | 0.66 (0.6 to 0.71) | 1.0 | 0.04 | 0.95 (0.94 to 0.97) | 0.95 | 0.48 |
| RFO | 0.77 | 0.88 (0.86 to 0.9) | 1.0 | 0.08 | 0.95 (0.94 to 0.97) | 0.95 | 0.422 |
| XGB | 0.76 | 0.88 (0.86 to 0.9) | 1.0 | 0.08 | 0.95 (0.94 to 0.96) | 0.95 | 0.538 |
| logR | 0.75 | 0.85 (0.83 to 0.86) | 1.0 | 0.02 | 0.95 (0.94 to 0.96) | 0.95 | 0.594 |
| Scenario 6: Training Kiel–test Kiel. Training set (cohort from 1980 on, censored by 15 y) 4,397 teeth, 233 patients. Test set (cohort from 1995 on, censored by 15 y): 3,113 teeth, 157 patients | |||||||
| RPA | 0.72 | 0.75 (0.73 to 0.78) | 0.97 | 0.15 | 0.9 (0.89 to 0.91) | 0.91 | 0.999 |
| RFO | 0.77 | 0.84 (0.82 to 0.86) | 0.99 | 0.1 | 0.91 (0.9 to 0.92) | 0.91 | 0.838 |
| XGB | 0.75 | 0.85 (0.83 to 0.86) | 0.98 | 0.14 | 0.91 (0.89 to 0.92) | 0.91 | 0.939 |
| logR | 0.81 | 0.8 (0.78 to 0.82) | 0.99 | 0.11 | 0.91 (0.9 to 0.92) | 0.91 | 0.39 |
Six scenarios were tested, with 4 models in each scenario. For each model, the table reports AUC, sensitivity, specificity, accuracy, NIR, and the P value for the comparison between accuracy and NIR.
AUC, area under the curve; logR, logistic regression; NIR, no-information rate; RFO, random forest; RPA, recursive partitioning; XGB, extreme gradient boosting.
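The record does not state which test produced the P values for accuracy versus NIR. One common choice, assumed here, is a one-sided exact binomial test of whether the observed accuracy exceeds the NIR. The Python sketch below implements that test with the standard library only. The function names and the count of correctly classified teeth are hypothetical; only the test-set size (2,830 teeth) and the NIR (0.92) are taken from Scenario 1.

```python
import math


def binom_upper_tail(successes, n, p):
    """P(X >= successes) for X ~ Binomial(n, p), summed in log space for stability."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for k in range(successes, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * log_p + (n - k) * log_q
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)  # guard against rounding slightly above 1


def accuracy_vs_nir_p_value(n_correct, n_total, nir):
    """One-sided P value for H1: true accuracy > NIR (assumed test, see lead-in)."""
    return binom_upper_tail(n_correct, n_total, nir)


# Hypothetical count of correct predictions on a 2,830-tooth test set.
p_value = accuracy_vs_nir_p_value(n_correct=2_575, n_total=2_830, nir=0.92)
print(f"P(accuracy > NIR) one-sided: {p_value:.3f}")
```

Under this test, a large P value means the model's accuracy is not demonstrably better than always predicting the majority class.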
Figure 1. Baseline models. (a) Receiver operating characteristic curves of the different models and AUC values. The different models showed similar performance. (b) Standardized variable importance for different models (see Appendix for details). Different models relied on different predictor variables. AUC, area under the curve; logR, logistic regression; PPD, probing pocket depth; RFO, random forest; RPA, recursive partitioning; XGB, extreme gradient boosting.
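As a reminder of what the AUC values summarize: the AUC equals the probability that a randomly chosen lost tooth receives a higher predicted risk than a randomly chosen retained tooth, with ties counted as one half. The following Python sketch computes it by direct pairwise comparison. The function name and the toy labels and scores are made up for illustration and are not study data.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy example: 1 = tooth lost, 0 = tooth retained (illustrative values only).
toy_labels = [0, 0, 1, 0, 1, 0]
toy_scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]
print(auc_from_scores(toy_labels, toy_scores))  # 0.75
```

The pairwise loop is quadratic in the number of cases, so it is meant for illustration; a rank-based implementation would be preferable for data sets of the size described here.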
Figure 2. Analyses on restricted data sets, testing the impact of (a) sample size and (b) prediction periods. Receiver operating characteristic curves of the different models and AUC values are displayed. (a) Individuals were dropped from the cohort at random to assess the impact of cohort size on model performance. Smaller sample sizes came with lower model performance. (b) The prediction period was censored to assess whether short-term predictions are more accurate than long-term ones. Prediction periods had only limited impact on model performance. AUC, area under the curve; logR, logistic regression; RFO, random forest; RPA, recursive partitioning; XGB, extreme gradient boosting.
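The sample-size analysis in panel (a) drops individuals, so all teeth of a dropped patient leave the data set together. The following is a minimal Python sketch of such patient-level subsampling. The function name, the record layout, and the seed are assumptions made for illustration, and the toy data are not from the study.

```python
import random


def drop_patients(records, drop_fraction, seed=0):
    """Remove a random fraction of patients, together with all of their teeth.

    records: list of (patient_id, tooth_record) tuples (hypothetical layout).
    """
    patient_ids = sorted({pid for pid, _ in records})
    n_drop = round(drop_fraction * len(patient_ids))
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    kept = set(rng.sample(patient_ids, len(patient_ids) - n_drop))
    return [(pid, tooth) for pid, tooth in records if pid in kept]


# Toy cohort: 20 patients with 3 teeth each (illustrative only).
toy_records = [(pid, f"tooth_{t}") for pid in range(20) for t in range(3)]
subsample = drop_patients(toy_records, drop_fraction=0.25, seed=1)
print(len({pid for pid, _ in subsample}), "patients,", len(subsample), "teeth")
```

Sampling at the patient level rather than the tooth level avoids leaving some teeth of the same patient on both sides of a comparison, which matters because teeth within a patient are not independent.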