Matthias Neumair, Michael W Kattan, Stephen J Freedland, Alexander Haese, Lourdes Guerrios-Rivera, Amanda M De Hoedt, Michael A Liss, Robin J Leach, Stephen A Boorjian, Matthew R Cooperberg, Cedric Poyet, Karim Saba, Kathleen Herkommer, Valentin H Meissner, Andrew J Vickers, Donna P Ankerst.
Abstract
BACKGROUND: We compared six commonly used logistic regression methods for accommodating missing risk factor data from multiple heterogeneous cohorts, in which some cohorts do not collect some risk factors at all, and developed an online risk prediction tool that accommodates missing risk factors from the end-user.
Keywords: Clinical risk prediction; Missing data; Prostate cancer; Validation
Year: 2022 PMID: 35864460 PMCID: PMC9306143 DOI: 10.1186/s12874-022-01674-x
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.612
Fig. 1 Sample sizes represented by the height of rectangles and prevalence of significant prostate cancer represented by the width of rectangles for the 11 PBCG cohorts used in the study. The cohorts have been numbered according to their rank of clinically significant prostate cancer prevalence. The 3rd cohort in black outline was withheld to serve as an external validation cohort with the remaining 10 cohorts used for training prediction models
Fig. 2 Amount of missing risk factor data by cohort on the x-axis; all patients were required to have prostate-specific antigen (PSA) and age, hence 0% missing for these covariates. The 3rd cohort separated by the black vertical line is used as an external validation set, and leave-one-cohort-out cross-validation was applied to the other cohorts. Cohorts were sorted by missing data pattern
Methods for fitting individual predictor-specific risk models for members of a test set by combining data from multiple cohorts. All individuals in the training and test cohorts have 2 predictors, PSA and age, and then any subset, including none, of 10 additional predictors for a total of 12 predictors, denoted by . The set of predictors available for the new individual is denoted by . All models use logistic regression for prediction of clinically significant prostate cancer. MICE = Multiple imputation by chained equations; BIC = Bayesian Information Criterion, defined as -2 × (maximized log-likelihood) + (number of covariates) × log(sample size)
| Method | Definition |
|---|---|
| Available cases | Pool individual-level data that have |
| Iterative BIC selection | Same as available cases, but with an iterative stepwise BIC-based model selection to determine the optimal subset of |
| Cohort ensemble | Separate models are built for each cohort by using the coinciding variables of the cohort and the patient |
| Categorization | All individuals in all cohorts are used. Predictors are categorized with missing as one of the categories so that the complete list of predictors |
| Missing indicator | Include an indicator for missing a continuous predictor value and the interaction with the predictor as additional variables in the analysis. Mostly similar to Categorization |
| Imputation | Impute missing covariates in the training set following the MICE method. Mean imputation for missing values in prediction |
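The row-selection and encoding steps behind two of the methods above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the "Available cases" definition is truncated in the table, so the sketch assumes it means pooling every training row in which all predictors available for the new individual are observed. The function names, the toy rows, and the choice to code a missing continuous value as 0 alongside its missing flag are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

# One biopsy record; None marks a risk factor that was not collected.
Row = Dict[str, Optional[float]]


def available_cases(training: List[Row], available: Set[str]) -> List[Row]:
    """Assumed reading of 'Available cases': keep the training rows in which
    every predictor available for the new individual is observed."""
    return [
        row for row in training
        if all(row.get(name) is not None for name in available)
    ]


def missing_indicator_terms(value: Optional[float]) -> Tuple[float, float]:
    """Assumed coding for 'Missing indicator': return (missing flag, value),
    where the value is set to 0 whenever the flag is 1."""
    if value is None:
        return 1.0, 0.0
    return 0.0, value


# Hypothetical toy rows (not data from the study).
training = [
    {"age": 62.0, "log2_psa": 2.3, "log2_volume": 5.4},
    {"age": 70.0, "log2_psa": 3.1, "log2_volume": None},
    {"age": 55.0, "log2_psa": 1.8, "log2_volume": 5.0},
]

# A new individual who reports prostate volume only pools rows that have it.
pooled_with_volume = available_cases(training, {"age", "log2_psa", "log2_volume"})
# A new individual with only PSA and age can use every row.
pooled_minimal = available_cases(training, {"age", "log2_psa"})
```

Under this reading, the available-cases training set shrinks as the new individual supplies more risk factors, whereas the missing-indicator coding keeps every row and lets the model estimate a separate effect for records with the value missing.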
Fig. 3 Calibration-in-the-large (CIL) and area under the receiver operating characteristic curve (AUC) from leave-one-cohort-out cross-validation on 10 PBCG cohorts. Median values are indicated with numbers and as vertical lines in the boxes
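The leave-one-cohort-out scheme can be sketched as follows. This is only an illustration of the splitting logic, assuming cohorts are identified by their rank numbers 1 to 11 and that the 3rd cohort is set aside before cross-validation, as the figure captions describe; the function name is hypothetical and no model fitting is shown.

```python
from typing import Iterator, List, Tuple


def leave_one_cohort_out(cohort_ids: List[int]) -> Iterator[Tuple[List[int], int]]:
    """Yield (training cohorts, held-out cohort) pairs, one per cohort."""
    for held_out in cohort_ids:
        train = [c for c in cohort_ids if c != held_out]
        yield train, held_out


all_cohorts = list(range(1, 12))   # 11 PBCG cohorts, numbered by rank
external_cohort = 3                # withheld for external validation
cv_cohorts = [c for c in all_cohorts if c != external_cohort]

folds = list(leave_one_cohort_out(cv_cohorts))
for train, held_out in folds:
    # A model would be fit on `train` and scored (CIL, AUC) on `held_out`.
    pass
```

Each of the 10 folds trains on 9 cohorts and evaluates on the remaining one, so a boxplot of per-fold results summarizes 10 values per method.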
External validation CIL and AUC values with risks as percentages along with 95% confidence intervals (CI)
| Method | CIL | (95% CI) | AUC | (95% CI) |
|---|---|---|---|---|
| Available cases | -2.9 | (-4.0, -1.8) | 75.7 | (74.4, 77.1) |
| Iterative BIC selection | -8.6 | (-9.7, -7.5) | 75.4 | (74.0, 76.8) |
| Cohort ensemble | -7.1 | (-8.2, -6.0) | 76.4 | (75.1, 77.7) |
| Categorization | 3.5 | (2.4, 4.6) | 76.6 | (75.2, 77.9) |
| Missing indicator | 4.2 | (3.1, 5.3) | 77.4 | (76.1, 78.7) |
| Imputation | -13.3 | (-14.4, -12.2) | 75.9 | (74.5, 77.2) |
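For readers who want to reproduce metrics of this kind, the sketch below computes both on a percentage scale. It rests on two assumptions not confirmed by this record: that CIL is the mean predicted risk minus the observed prevalence, and that AUC is the usual rank-based concordance with ties counted as one half. The function names and the toy predictions are hypothetical.

```python
from typing import Sequence


def calibration_in_the_large(pred: Sequence[float], outcome: Sequence[int]) -> float:
    """Assumed definition: mean predicted risk minus observed prevalence,
    reported in percentage points (negative means under-prediction)."""
    mean_pred = sum(pred) / len(pred)
    prevalence = sum(outcome) / len(outcome)
    return 100.0 * (mean_pred - prevalence)


def auc(pred: Sequence[float], outcome: Sequence[int]) -> float:
    """Concordance: share of case/non-case pairs in which the case has the
    higher predicted risk (ties count one half), as a percentage."""
    cases = [p for p, y in zip(pred, outcome) if y == 1]
    controls = [p for p, y in zip(pred, outcome) if y == 0]
    wins = 0.0
    for case_risk in cases:
        for control_risk in controls:
            if case_risk > control_risk:
                wins += 1.0
            elif case_risk == control_risk:
                wins += 0.5
    return 100.0 * wins / (len(cases) * len(controls))


# Hypothetical toy predictions (not data from the study).
toy_pred = [0.10, 0.40, 0.35, 0.80]
toy_outcome = [0, 0, 1, 1]

toy_cil = calibration_in_the_large(toy_pred, toy_outcome)
toy_auc = auc(toy_pred, toy_outcome)
```

The pairwise AUC loop is quadratic in sample size, which is fine for a sketch; a rank-based formula would be preferable for a validation set of several thousand biopsies.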
Fig. 4 Calibration plots with shaded pointwise 95% confidence intervals for the 6 modeling methods applied to 10 PBCG training cohorts and validated on the external cohort. The diagonal black line is where predicted risks equal observed risks, lines below the diagonal indicate over-prediction, and lines above under-prediction, on the validation set
Fig. 5 Marginal and pairwise comparisons of predictions from the 6 methods for the 5543 biopsies of the external validation set, pooled and stratified by clinically significant prostate cancer status (31.7% with clinically significant prostate cancer). Corr indicates Pearson correlation. Turquoise indicates individuals with clinically significant prostate cancer and purple those without
Odds ratios from the largest, standard, and smallest models, in terms of how many of the 12 risk factors are available from an end-user. Sample sizes are the number of individuals in the training set with all of the model's risk factors available (complete cases), together with the number of cohorts contributing those complete cases. In total, 1,024 models are available, since each of the 10 risk factors other than PSA and age can be either included or excluded
| Risk factor | Odds ratio | 95% CI | P-value |
|---|---|---|---|
| Odds ratios for the full model containing 12 risk factors based on a fit to 1334 prostate biopsies from 3 cohorts | |||
| Age | 1.07 | (1.05, 1.09) | < 0.0001 |
| PSA (log2) | 2.38 | (1.98, 2.89) | < 0.0001 |
| African ancestry | |||
| No | Ref | – | – |
| Yes | 0.68 | (0.45, 1.03) | 0.08 |
| Prostate volume (log2) | 0.25 | (0.20, 0.32) | < 0.0001 |
| DRE | |||
| Normal | Ref | – | – |
| Abnormal | 1.95 | (1.46, 2.60) | < 0.0001 |
| Prior negative biopsy | |||
| No | Ref | – | – |
| Yes | 0.32 | (0.22, 0.45) | < 0.0001 |
| Hispanic ethnicity | |||
| No | Ref | – | – |
| Yes | 1.08 | (0.78, 1.50) | 0.6 |
| 5-alpha-reductase-inhibitor use | |||
| No | Ref | – | – |
| Yes | 0.96 | (0.63, 1.44) | 0.8 |
| Prior PSA screen | |||
| No | Ref | – | – |
| Yes | 0.71 | (0.38, 1.34) | 0.3 |
| First-degree prostate cancer family history | |||
| No | Ref | – | – |
| Yes | 1.93 | (1.38, 2.69) | 0.0001 |
| Second-degree prostate cancer family history | |||
| No | Ref | – | – |
| Yes | 1.30 | (0.86, 1.96) | 0.2 |
| First-degree breast cancer family history | |||
| No | Ref | – | – |
| Yes | 1.15 | (0.77, 1.70) | 0.5 |
| Odds ratios for the model containing the 6 standard risk factors based on a fit to 8432 prostate biopsies from 9 cohorts | |||
| Age | 1.05 | (1.04, 1.06) | < 0.0001 |
| PSA (log2) | 1.99 | (1.86, 2.12) | < 0.0001 |
| African ancestry | |||
| No | Ref | – | – |
| Yes | 1.26 | (1.11, 1.44) | 0.0005 |
| DRE | |||
| Normal | Ref | – | – |
| Abnormal | 2.57 | (2.29, 2.88) | < 0.0001 |
| Prior negative biopsy | |||
| No | Ref | – | – |
| Yes | 0.28 | (0.24, 0.32) | < 0.0001 |
| First-degree prostate cancer family history | |||
| No | Ref | – | – |
| Yes | 1.94 | (1.70, 2.22) | < 0.0001 |
| Odds ratios for the smallest model containing 2 risk factors based on a fit to 12,703 prostate biopsies from 10 cohorts | |||
| Age | 1.05 | (1.05, 1.06) | < 0.0001 |
| PSA (log2) | 1.72 | (1.64, 1.80) | < 0.0001 |
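Two pieces of arithmetic behind this table can be checked with a short sketch: the count of 1,024 models, and how an odds ratio scales with a change in a continuous risk factor. The code assumes the age odds ratio is per year (the table does not state the unit) and uses short hypothetical labels for the 10 optional risk factors. The table reports no intercepts, so only relative odds, not absolute risks, can be computed from it.

```python
import math
from itertools import combinations

# The 10 risk factors other than PSA and age (labels are illustrative).
OPTIONAL_FACTORS = [
    "african_ancestry", "prostate_volume", "dre", "prior_negative_biopsy",
    "hispanic_ethnicity", "five_ari_use", "prior_psa_screen",
    "first_degree_pca_history", "second_degree_pca_history",
    "first_degree_breast_cancer_history",
]

# Every subset of the optional factors, including the empty one,
# defines one model on top of PSA and age.
n_models = sum(
    1
    for k in range(len(OPTIONAL_FACTORS) + 1)
    for _ in combinations(OPTIONAL_FACTORS, k)
)


def odds_multiplier(odds_ratio_per_unit: float, delta_units: float) -> float:
    """Multiplicative change in odds for a change of delta_units."""
    return odds_ratio_per_unit ** delta_units


# Smallest (2-factor) model: PSA enters on the log2 scale, so doubling PSA
# from 4 to 8 ng/mL is a change of exactly one unit.
psa_doubling = odds_multiplier(1.72, math.log2(8) - math.log2(4))

# Assuming the age odds ratio of 1.05 is per year, ten years compound.
age_decade = odds_multiplier(1.05, 10)
```

Here `n_models` equals 2 to the power of 10, a doubling of PSA multiplies the odds by 1.72, and a ten-year age difference multiplies them by about 1.63 under the per-year assumption.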