Katrin Madjar, Jörg Rahnenführer.
Abstract
BACKGROUND: An important task in clinical medicine is the construction of risk prediction models for specific subgroups of patients based on high-dimensional molecular measurements such as gene expression data. Major objectives in modeling high-dimensional data are good prediction performance and feature selection to find a subset of predictors that are truly associated with a clinical outcome such as a time-to-event endpoint. In clinical practice, this task is challenging since patient cohorts are typically small and can be heterogeneous with regard to the relationship between predictors and outcome. When data of several subgroups of patients with the same or similar disease are available, it is tempting to combine them to increase sample size, such as in multicenter studies. However, heterogeneity between subgroups can lead to biased results, and subgroup-specific effects may remain undetected.
Keywords: Cox proportional hazards model; Heterogeneous cohorts; High-dimensional data; Subgroup analysis; Weighted regression
Year: 2021 PMID: 34876106 PMCID: PMC8650299 DOI: 10.1186/s12911-021-01698-1
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 Simulation set-up. Analysis pipeline for the simulation study. Brighter regions in the training and test sets indicate the observations of the subgroup
Effects in the simulation study
| Gene | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0.5 | 0.75 | 0.25 | ||||||
| 0 | 0 | 1 | 1 | 0.5 | 0.25 | 0.75 |
Effects of the first 12 genes for the simulation of survival outcome
Parameter combinations in the simulation study
| Parameter | Values (per subgroup) |
|---|---|
| | 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 500, 1000 |
| | 12, 100, 200 |
| | 0, 0.1, 0.2, 0.3, 0.4, 0.5, 1 |
All parameters tested in the simulation study with their respective values, resulting in 252 different combinations in total
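The count of 252 follows from crossing the three value lists in the table (12 × 3 × 7). A minimal sketch of that enumeration is given below. The parameter symbols are not legible in this record, so the names `param_a`, `param_b` and `param_c` are placeholders, not the authors' notation.

```python
from itertools import product

# Values copied from the parameter table above.
# The variable names are hypothetical placeholders: the original
# parameter symbols are missing from this record.
param_a = [20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 500, 1000]
param_b = [12, 100, 200]
param_c = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 1]

# Full factorial design: every value of each parameter is combined
# with every value of the other two.
grid = list(product(param_a, param_b, param_c))

print(len(grid))  # 12 * 3 * 7 = 252
```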
Fig. 2 Estimated weights in the simulation. Estimated weights for subgroup 1A obtained by random forest based on simulated training data with and
Fig. 3 Estimated effects in the simulation. Mean estimated regression coefficients of the first 12 prognostic genes in all Cox models for group 1 (averaged across all simulated training data sets and across subgroups 1A and 1B, regardless of whether a gene is selected in individual splits) for and . The symbol ‘x’ represents the true simulated effects of both groups: in black the group of interest (here group 1) and in grey the other group
Fig. 4 Prediction performance in the simulation. Mean C-index, averaged across all test data sets and subgroups, for
Fig. 5 Estimated weights in the application. Estimated weights for all lung cancer cohorts using random forest and the top-1000-variance genes as gene filter
Fig. 6 Estimated effects in the application. Different types of Cox models including the top-1000-variance genes as covariates. Mean estimated regression coefficients of selected genes (averaged across all training sets, regardless of whether a gene is selected in individual splits). For each subgroup, genes with a mean inclusion frequency greater than or equal to 0.5 in any model type are selected
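The selection rule in the Fig. 6 caption can be illustrated with a small sketch. It assumes that a gene's inclusion frequency is the fraction of training splits in which the gene enters the model; the record does not spell out the definition, so this is an assumption. The gene names, model-type labels and function names are hypothetical.

```python
def inclusion_frequency(selected_per_split):
    """Fraction of training splits in which each gene was selected.

    selected_per_split: list of sets, one set of selected genes per split.
    """
    n_splits = len(selected_per_split)
    genes = set().union(*selected_per_split)
    return {g: sum(g in s for s in selected_per_split) / n_splits
            for g in genes}


def genes_to_report(selections_by_model, threshold=0.5):
    """Genes whose inclusion frequency reaches the threshold in ANY model type."""
    keep = set()
    for splits in selections_by_model.values():
        freq = inclusion_frequency(splits)
        keep |= {g for g, f in freq.items() if f >= threshold}
    return sorted(keep)


# Toy example with made-up genes and two made-up model types.
selections = {
    "model_type_1": [{"G1", "G2"}, {"G1"}, {"G1", "G3"}, {"G2"}],
    "model_type_2": [{"G3"}, {"G3", "G4"}, {"G1"}, {"G3"}],
}
reported = genes_to_report(selections)
print(reported)
```

In the toy data, G4 appears in only one of four splits for either model type and is therefore not reported, while G2 sits exactly on the 0.5 threshold and is kept.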
Fig. 7 Prediction performance in the application. Different types of Cox models including the top-1000-variance genes as covariates. Boxplots of C-index based on all test sets for the prediction of each subgroup
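Figs. 4 and 7 report prediction performance as the C-index. For readers unfamiliar with the measure, the sketch below computes Harrell's concordance index for right-censored data in plain Python. The record does not say which C-index estimator the authors used, so this is only an illustration of the general idea. The function name and the toy data are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times:       observed follow-up times
    events:      1 if the event was observed, 0 if censored
    risk_scores: predicted risk (higher score = shorter expected survival)

    A pair (i, j) is comparable when subject i had an observed event
    strictly before subject j's observed time. Pairs with tied times
    are skipped in this simplified version.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier event has the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: four subjects, the third one is censored.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [4, 3, 2, 1])
reversed_ = concordance_index(times, events, [1, 2, 3, 4])
constant = concordance_index(times, events, [0, 0, 0, 0])
print(perfect, reversed_, constant)
```

A value of 1 means the predicted risks order all comparable pairs correctly, 0.5 corresponds to uninformative predictions, and 0 means the ordering is fully reversed.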