Daniel Samaga, Roman Hornung, Herbert Braselmann, Julia Hess, Horst Zitzelsberger, Claus Belka, Anne-Laure Boulesteix, Kristian Unger.
Abstract
BACKGROUND: Prognostic models based on high-dimensional omics data generated from clinical patient samples, such as tumor tissues or biopsies, are increasingly used to predict radiotherapeutic success. Model development requires two independent data sets, one for discovery and one for validation. Each may consist of samples collected at a single center or of samples pooled from multiple centers. Multi-center data tend to be more heterogeneous than single-center data but are less affected by potential site-specific biases. Making the best use of limited data resources for discovery and validation, with respect to the expected success of a study, requires objective decision-making. In this work, we addressed the impact of choosing single-center or multi-center data as discovery and validation data sets, and assessed how this impact depends on three data characteristics: signal strength, number of informative features, and sample size.
Keywords: Feature selection; Omics data; Predictive model; Predictive performance; Study design; Validation
Year: 2020 PMID: 32410693 PMCID: PMC7227093 DOI: 10.1186/s13014-020-01543-1
Source DB: PubMed Journal: Radiat Oncol ISSN: 1748-717X Impact factor: 3.481
Fig. 1 Simulation scheme for the computation of performance scores of molecular prognostic models with center effects. For each parameter set, 1000 Monte-Carlo runs are performed. In single-center (SC) and multi-center (MC) data sets, independent variables X and dependent variables Y are generated from a true state a, noise, and the center batch pattern. Each center shares realizations of randomly sampled batch-specific parameters among its samples. MC data sets are batch corrected; a signature is then fitted to the discovery data and used to predict the validation data. Performance scores are calculated to measure the average quality of prediction
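The data-generation and batch-correction steps of this scheme can be sketched as follows. This is a minimal illustration, not the authors' simulation model: the additive per-gene batch shift, the unit-variance gene noise, the per-center mean-centering used as batch correction, and all function and parameter names are assumptions made for the example. The latent factors listed among the simulation parameters are left out.

```python
import random


def simulate_center(n_samples, n_genes, informative, signal, rng,
                    batch_sd=1.0, noise_sd=0.1):
    """Simulate one center; all of its samples share the same batch offsets.

    Assumed toy model (not taken from the paper):
    x_g = signal * a (informative genes only) + batch_g + gene noise
    y   = a + observation noise
    """
    # Center-specific shift per gene, drawn once and shared by all samples.
    batch = [rng.gauss(0.0, batch_sd) for _ in range(n_genes)]
    X, Y = [], []
    for _ in range(n_samples):
        a = rng.gauss(0.0, 1.0)  # latent true state of this sample
        row = [
            (signal * a if g in informative else 0.0)
            + batch[g]
            + rng.gauss(0.0, 1.0)
            for g in range(n_genes)
        ]
        X.append(row)
        Y.append(a + rng.gauss(0.0, noise_sd))
    return X, Y


def center_genes(X):
    """Crude batch correction: subtract the per-gene mean within one center."""
    n = len(X)
    means = [sum(row[g] for row in X) / n for g in range(len(X[0]))]
    return [[row[g] - means[g] for g in range(len(row))] for row in X]


def simulate_multicenter(n_centers, n_per_center, n_genes, informative,
                         signal, rng):
    """Pool several centers, batch-correcting each one before pooling."""
    X_all, Y_all = [], []
    for _ in range(n_centers):
        X, Y = simulate_center(n_per_center, n_genes, informative, signal, rng)
        X_all.extend(center_genes(X))
        Y_all.extend(Y)
    return X_all, Y_all


rng = random.Random(0)
X_mc, Y_mc = simulate_multicenter(8, 10, 20, set(range(5)), 0.25, rng)
X_sc, Y_sc = simulate_center(80, 20, set(range(5)), 0.25, rng)
```

A single-center data set keeps its one batch pattern uncorrected, whereas the pooled multi-center set has each center's shift removed before a signature would be fitted.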
Table 1 Parameters of simulations
| Parameter | | Sc1 | Sc2 | Sc3 |
|---|---|---|---|---|
| signal strength | = | [0; 0.5] | 0.25 | 0.125 |
| number of genes, informative | = | 300 | [1; 1000] | 300 |
| sample size | = | 100 | 100 | [40; 500] |
| number of genes, total | = | 10³ | 10³ | 10³ |
| number of centers in MC | = | 8 | 8 | 8 |
| minimum samples per center | = | 10 | 10 | 5 |
| basal level gene | ∼ | | | |
| target | ∼ | | | |
| fixed batch effect gene | ∼ | | | |
| number of latent factors | = | 5 | 5 | 5 |
| factor loadings | ∼ | | | |
| impact of factor | ∼ | | | |
| noise scaling of gene | ∼ | | | |
| noise | ∼ | | | |
| standard deviation of observation noise | = | 0.1 | 0.1 | 0.1 |
Each column shows the parameter set for one of the three simulated scenarios. Intervals indicate the range over which a parameter was varied in the respective scenario. Fixed parameters are indicated by ‘=’, while sources of heterogeneity such as signal, noise and batch effects are characterized by the parameters of their densities, indicated by the ‘∼’ symbol
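For readers who want to reproduce the scenario grid, the fixed parameters can be written down as a small configuration. The sketch below only restates the ‘=’ rows of the table; the dictionary layout, the key names and the helper functions are hypothetical, intervals are encoded as `(low, high)` tuples, and the density rows marked ‘∼’ are omitted because their values are not available here.

```python
# Parameters shared by all three scenarios ('=' rows with identical values).
SHARED = {
    "n_genes_total": 1000,      # total number of genes, read as 10^3
    "n_centers_mc": 8,          # number of centers in multi-center data
    "n_latent_factors": 5,
    "obs_noise_sd": 0.1,        # standard deviation of observation noise
}

# Scenario-specific parameters; a tuple marks the range that is varied.
SCENARIOS = {
    "Sc1": {
        "signal_strength": (0.0, 0.5),
        "n_informative": 300,
        "sample_size": 100,
        "min_samples_per_center": 10,
    },
    "Sc2": {
        "signal_strength": 0.25,
        "n_informative": (1, 1000),
        "sample_size": 100,
        "min_samples_per_center": 10,
    },
    "Sc3": {
        "signal_strength": 0.125,
        "n_informative": 300,
        "sample_size": (40, 500),
        "min_samples_per_center": 5,
    },
}


def varied_parameters(name):
    """Return the names of the parameters that are varied in a scenario."""
    return [key for key, value in SCENARIOS[name].items()
            if isinstance(value, tuple)]


def full_config(name):
    """Merge the shared and the scenario-specific parameters."""
    return {**SHARED, **SCENARIOS[name]}
```

Each scenario varies exactly one characteristic (signal strength, number of informative genes, or sample size) while the other two stay fixed.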
Table 2 Quality of MSPE-estimation by SC and MC validation data sets
| Signature | approx. MSPE | SEM (approx. MSPE) | Validation | estim. MSPE | SEM (estim. MSPE) | squared error | SEM (squared error) |
|---|---|---|---|---|---|---|---|
| SC | 5.74 | 0.14 | SC | 5.49 | 0.17 | 15.68 | 2.04 |
| SC | 5.74 | 0.14 | MC | 5.49 | 0.12 | 2.73 | 0.31 |
| MC | 0.87 | <0.01 | SC | 0.87 | 0.01 | 0.06 | <0.01 |
| MC | 0.87 | <0.01 | MC | 0.88 | 0.01 | 0.02 | <0.01 |
The average true MSPE of a signature discovered in SC or MC data is approximated in 1000 iterations from 10⁵ sample data sets, with a different random batch pattern on each sample. The approximated MSPE is reported with its standard error of the mean (SEM), alongside the MSPE estimated in the validation data and its SEM. The average squared error of this estimator, (estimated MSPE − approximated MSPE)², was calculated from 1000 discovery data sets with 100 independent validation data sets each
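The quantities in this legend can be illustrated with a few helper functions. This is a sketch under the usual textbook definitions (mean squared prediction error, standard error of the mean with the n−1 variance estimator, squared deviation of an estimate from a reference value); the function names are hypothetical and the paper's exact computation may differ.

```python
def mspe(y_true, y_pred):
    """Mean squared prediction error over one validation data set."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_and_sem(values):
    """Mean of the values and its standard error (needs at least 2 values)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, (variance / n) ** 0.5


def estimator_squared_errors(estimates, reference):
    """Squared error of each MSPE estimate against the reference MSPE."""
    return [(e - reference) ** 2 for e in estimates]


# Hypothetical numbers: three validation sets yield three MSPE estimates,
# compared against an approximated "true" MSPE of 5.5.
estimates = [5.0, 6.0, 5.5]
avg_sq_error, sem_sq_error = mean_and_sem(
    estimator_squared_errors(estimates, 5.5))
```

Averaging the squared errors over many validation sets shows how tightly an SC or MC validation set pins down the true MSPE, which is what the last two columns of the table report.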
Fig. 2 Performance scores under varying signal strength. Performance scores and 99% confidence bands for (a) “expected fraction of false findings in signature” (FDR), (b) “expected error on single future predictions” (MSPE), (c) “chance of successful validation” (SV), and (d) “average calibration slope” (CS), calculated from 10³ simulation runs. The parameter values are given in Table 1, signal strength is varied in terms of the parameter
Fig. 3 Performance scores under varying number of informative genes. Performance scores and 99% confidence bands for (a) “expected fraction of false findings in signature” (FDR), (b) “expected error on single future predictions” (MSPE), (c) “chance of successful validation” (SV), and (d) “average calibration slope” (CS), calculated from 10³ simulated prognostic modeling iterations. The parameter values are given in Table 1 and the number of informative genes is varied. Note that the overall signal () is kept constant by adapting to the number of informative features n
Fig. 4 Performance scores under varying sample size. Performance scores and 99% confidence bands for (a) “expected fraction of false findings in signature” (FDR), (b) “expected error on single future predictions” (MSPE), (c) “chance of successful validation” (SV), and (d) “average calibration slope” (CS), calculated from 10³ simulated prognostic modeling iterations. The parameter values are given in Table 1 and the sample size is varied
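Two of the four scores named in these captions can be sketched directly from their verbal descriptions. The code below assumes that the fraction of false findings is the share of signature genes that are not truly informative, and that the calibration slope is the least-squares slope of observed on predicted values. Both are common definitions and may differ in detail from the ones used in the paper. The function names and the convention of returning 0.0 for an empty signature are assumptions. SV is omitted because its success criterion is not stated in this excerpt.

```python
def false_finding_fraction(selected, informative):
    """Share of selected signature genes that are not truly informative.

    Returns 0.0 for an empty signature (a convention assumed here).
    """
    selected = set(selected)
    if not selected:
        return 0.0
    return len(selected - set(informative)) / len(selected)


def calibration_slope(y_pred, y_obs):
    """Least-squares slope of observed values regressed on predictions.

    A slope of 1 indicates well-calibrated predictions; a slope below 1
    indicates predictions that are too extreme.
    """
    n = len(y_pred)
    mean_pred = sum(y_pred) / n
    mean_obs = sum(y_obs) / n
    sxx = sum((p - mean_pred) ** 2 for p in y_pred)
    sxy = sum((p - mean_pred) * (o - mean_obs)
              for p, o in zip(y_pred, y_obs))
    return sxy / sxx


# Hypothetical example: genes 1 and 2 are informative, 3 and 4 are not.
fraction = false_finding_fraction([1, 2, 3, 4], {1, 2})
slope = calibration_slope([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Averaging such per-run values over the simulation runs would give the expected scores plotted in the figures.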