Stefanie Hieke, Axel Benner, Richard F Schlenk, Martin Schumacher, Lars Bullinger, Harald Binder.
Abstract
Clinical cohorts with time-to-event endpoints are increasingly characterized by measurements of a number of single nucleotide polymorphisms (SNPs) that is an order of magnitude larger than the number of measurements typically considered at the gene level. At the same time, the size of clinical cohorts is often still limited, calling for novel analysis strategies for identifying potentially prognostic SNPs that can help to better characterize disease processes. We propose such a strategy, drawing on univariate testing ideas from epidemiological case-control studies on the one hand, and multivariable regression techniques as developed for gene expression data on the other hand. In particular, we focus on stable selection of a small set of SNPs and corresponding genes for subsequent validation. For univariate analysis, a permutation-based approach is proposed to test at the gene level. We use regularized multivariable regression models for considering all SNPs simultaneously and selecting a small set of potentially important prognostic SNPs. Stability is judged according to resampling inclusion frequencies for both the univariate and the multivariable approach. The overall strategy is illustrated with data from a cohort of acute myeloid leukemia patients and explored in a simulation study. The multivariable approach is seen to automatically focus on a smaller set of SNPs compared to the univariate approach, roughly in line with blocks of correlated SNPs. This more targeted extraction of SNPs results in more stable selection at the SNP as well as at the gene level. Thus, the multivariable regression approach with resampling provides a perspective in the proposed analysis strategy for SNP data in clinical cohorts, highlighting what can be added by regularized regression techniques compared to univariate analyses.
Year: 2016 PMID: 27159447 PMCID: PMC4861340 DOI: 10.1371/journal.pone.0155226
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Componentwise likelihood-based boosting algorithm for time-to-event endpoint.
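The idea behind componentwise likelihood-based boosting can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Cox partial likelihood with Breslow-type risk sets, a penalized score statistic as the selection criterion, and a hypothetical penalty value and toy data set. In each boosting step, only the single covariate that best improves the fit receives a penalized one-step update, so most coefficients stay at zero.

```python
import math
import random


def score_and_info(x, time, status, eta):
    """Score and information of the Cox partial log-likelihood for one
    covariate x, evaluated at coefficient 0 with eta as a fixed offset."""
    w = [math.exp(e) for e in eta]
    score = info = 0.0
    for i in range(len(time)):
        if not status[i]:
            continue  # censored observations contribute only via risk sets
        risk = [k for k in range(len(time)) if time[k] >= time[i]]
        s0 = sum(w[k] for k in risk)
        s1 = sum(w[k] * x[k] for k in risk)
        s2 = sum(w[k] * x[k] ** 2 for k in risk)
        score += x[i] - s1 / s0
        info += s2 / s0 - (s1 / s0) ** 2
    return score, info


def componentwise_boost(X, time, status, steps=30, penalty=10.0):
    """Return the coefficient vector after `steps` componentwise updates.
    `penalty` is an assumed, untuned value for this illustration."""
    p = len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    beta = [0.0] * p
    for _ in range(steps):
        # Current linear predictor, used as offset for all candidate updates
        eta = [sum(b * v for b, v in zip(beta, row)) for row in X]
        best_stat, best_j, best_step = -1.0, 0, 0.0
        for j in range(p):
            u, info = score_and_info(cols[j], time, status, eta)
            stat = u * u / (info + penalty)  # penalized score statistic
            if stat > best_stat:
                best_stat, best_j, best_step = stat, j, u / (info + penalty)
        beta[best_j] += best_step  # update only the best-fitting component
    return beta


# Hypothetical toy data: 60 patients, 5 SNPs coded 0/1/2.
# Only SNP 0 affects the hazard; there is no censoring.
rng = random.Random(0)
X = [[rng.choice([0, 1, 2]) for _ in range(5)] for _ in range(60)]
time = [rng.expovariate(math.exp(1.5 * row[0])) for row in X]
status = [1] * 60
beta = componentwise_boost(X, time, status)
```

The number of boosting steps acts as the main tuning parameter; the results reported here compare a choice by 10-fold cross-validation with a fixed number of 500 steps.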
Mean type I error and power of the different procedures with respect to non-informative and informative SNPs.
| Approach | Mean type I error (cor) | Mean type I error (uncor) | Power (SNP2) | Power (SNP5) | Power (SNP9) | Power (SNP33) | Power (SNP37) | Power (SNP3023) |
|---|---|---|---|---|---|---|---|---|
| univariate approach | 0.36 | ≪ 0.05 | 0.98 | 1.00 | 0.95 | 0.73 | 0.79 | 0.25 |
| multivariable CV | 0.08 | ≪ 0.05 | 0.96 | 0.97 | 0.90 | 0.88 | 0.87 | 0.85 |
| multivariable 500 steps | 0.10 | ≪ 0.05 | 0.97 | 0.99 | 0.91 | 0.92 | 0.92 | 0.92 |
Approaches are univariate testing, multivariable modeling using componentwise likelihood-based boosting with number of boosting steps selected by 10-fold cross-validation, and alternatively, with a fixed number of 500 boosting steps. cor: 24 non-informative SNPs located within one of the three different blocks carrying six informative SNPs. uncor: 299970 non-informative SNPs not included within the blocks carrying the informative SNPs. SNP2, SNP5, SNP9, SNP33 and SNP37: informative SNPs located within the first and second block, respectively, on one selected gene and SNP3023: informative SNP located on another selected gene.
Fig 2. Relapse-free survival curves.
The numbers of patients at risk at different time points are given below the graphs.
Number of SNPs (top part) and number of genes (bottom part) with resampling inclusion frequencies (IF) larger than 0 and 10, and maximum IF in 100 resampling data sets as well as the overlap between the multivariable model and the univariate approach.
| | multivariable model | overlap | univariate approach |
|---|---|---|---|
| SNP level | | | |
| IF > 0 | 395 | 58 | 218 |
| IF ≥ 10 | 3 | 3 | 3 |
| max | 45 | rs256215 | 28 |
| gene level | | | |
| IF > 0 | 556 | 102 | 235 |
| IF ≥ 10 | 4 | 2 | 2 |
| max | 72 | | 34 |
The values are given for the multivariable and the univariate approach. For the latter, SNPs with an FDR smaller than or equal to 0.05 are considered.
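The summaries in the table above can be reproduced in principle from the sets of SNPs selected in each resampling data set. The following sketch shows one way to do this; the SNP and gene names are hypothetical placeholders, and the rule that a gene counts as included whenever at least one of its SNPs is selected is an assumption made for this illustration.

```python
from collections import Counter


def inclusion_frequencies(selections):
    """Count in how many resampling data sets each item was selected."""
    counts = Counter()
    for selected in selections:
        counts.update(set(selected))
    return counts


def to_gene_level(selections, snp_to_gene):
    """Map each selected SNP set to the set of genes it touches
    (assumption: one selected SNP is enough to include the gene)."""
    return [{snp_to_gene[s] for s in sel if s in snp_to_gene}
            for sel in selections]


def summarize(counts, threshold):
    """Number of items with IF > 0, with IF >= threshold, and the maximum IF."""
    return {
        "IF > 0": len(counts),
        "IF >= threshold": sum(c >= threshold for c in counts.values()),
        "max": max(counts.values(), default=0),
    }


# Hypothetical selections from 4 resampling data sets per approach
multi = [{"s1", "s2"}, {"s1"}, {"s1", "s3"}, {"s2"}]
uni = [{"s1", "s4"}, {"s4"}, {"s1"}, set()]
snp_to_gene = {"s1": "geneA", "s2": "geneA", "s3": "geneB", "s4": "geneC"}

snp_if_multi = inclusion_frequencies(multi)
snp_if_uni = inclusion_frequencies(uni)
gene_if_multi = inclusion_frequencies(to_gene_level(multi, snp_to_gene))
gene_if_uni = inclusion_frequencies(to_gene_level(uni, snp_to_gene))

# Overlap: items selected at least once by both approaches
snp_overlap = set(snp_if_multi) & set(snp_if_uni)
gene_overlap = set(gene_if_multi) & set(gene_if_uni)
snp_summary = summarize(snp_if_multi, threshold=2)
```

With 100 resampling data sets, as used here, the raw counts can be read directly as percentages.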
Inclusion frequencies (IF) for gene FSTL4 and inclusion frequencies of the corresponding SNPs (ordered by position) included in at least one resampling data set from the multivariable and the univariate approach, respectively.
| Gene/SNP (total 17659/390443) | multivariable IF, gene level | multivariable IF, SNP level | univariate IF, gene level | univariate IF, SNP level | p-value |
|---|---|---|---|---|---|
| FSTL4 | 72 | | 34 | | <0.001 |
| rs10479044 | | 0 | | 2 | |
| rs256258 | | 3 | | 2 | |
| rs256259 | | 25 | | 23 | |
| rs256209 | | 1 | | 2 | |
| rs256215 | | 45 | | 28 | |
| rs256219 | | 1 | | 2 | |
| rs256221 | | 0 | | 1 | |
| rs256225 | | 20 | | 21 | |
| rs256228 | | 1 | | 0 | |
| rs4958139 | | 0 | | 1 | |
| rs2867328 | | 0 | | 1 | |
p-value: permutation-based p-value from the gene region-level summary for moving from the SNP to the gene level in the univariate approach (Section Gene level analyses).
*: Selected SNPs from the boosting model for the original data set.
⋄: Selected SNPs from the univariate Cox models with FDR smaller than or equal to 0.05 in the original data.
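The permutation-based gene-level p-value mentioned in the footnotes can be illustrated with a short sketch. The exact gene region-level summary is not specified in this record, so the sketch makes several assumptions: the per-SNP statistic is a Cox score test at coefficient zero, the gene-level summary is the maximum over the gene's SNPs, and the toy data are hypothetical. Outcomes are permuted jointly across patients while genotypes stay fixed, which preserves the correlation between SNPs.

```python
import math
import random


def cox_score_stat(x, time, status):
    """Score test statistic for one SNP in a univariate Cox model
    (evaluated at coefficient 0, Breslow-type risk sets)."""
    score = info = 0.0
    for i, (t_i, d_i) in enumerate(zip(time, status)):
        if not d_i:
            continue  # censored patients do not contribute an event term
        risk = [x[k] for k, t_k in enumerate(time) if t_k >= t_i]
        m1 = sum(risk) / len(risk)
        m2 = sum(v * v for v in risk) / len(risk)
        score += x[i] - m1
        info += m2 - m1 * m1
    return score * score / info if info > 0 else 0.0


def gene_stat(snps, time, status):
    """Assumed gene-level summary: maximum SNP statistic within the gene."""
    return max(cox_score_stat(x, time, status) for x in snps)


def permutation_p_value(snps, time, status, n_perm=200, seed=1):
    """Permute (time, status) pairs across patients; genotypes stay fixed."""
    rng = random.Random(seed)
    observed = gene_stat(snps, time, status)
    idx = list(range(len(time)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(idx)
        t = [time[k] for k in idx]
        d = [status[k] for k in idx]
        if gene_stat(snps, t, d) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


# Hypothetical gene with 3 SNPs (coded 0/1/2) in 60 patients;
# only the first SNP affects the hazard, and there is no censoring.
rng = random.Random(0)
snps = [[rng.choice([0, 1, 2]) for _ in range(60)] for _ in range(3)]
time = [rng.expovariate(math.exp(1.2 * g)) for g in snps[0]]
status = [1] * 60
p_value = permutation_p_value(snps, time, status)
```

With `n_perm` permutations the smallest attainable p-value is 1 / (`n_perm` + 1), so a reported value such as <0.001 requires more than 1000 permutations.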
Fig 3. Linkage disequilibrium (R²) for the SNPs mapped to FSTL4.
SNPs with non-zero inclusion frequencies are indicated by stars. The left panel shows all SNPs, the right panel only the region with the most frequently selected SNPs (IF > 20).
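A pairwise linkage disequilibrium matrix of the kind shown in such a figure can be approximated from unphased genotype data. The sketch below assumes that R² is taken as the squared Pearson correlation between allele counts (0/1/2), which is a common approximation and not necessarily the estimator used for the figure; the SNP names and genotypes are hypothetical.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined for monomorphic SNPs
    return sxy * sxy / (sxx * syy)


def r2_matrix(genotypes):
    """Pairwise R² for a dict mapping SNP name to genotype vector."""
    names = list(genotypes)
    return {(a, b): genotype_r2(genotypes[a], genotypes[b])
            for a in names for b in names}


# Hypothetical genotypes for three SNPs in four patients
genotypes = {
    "snp_a": [0, 0, 2, 2],
    "snp_b": [0, 0, 2, 2],  # in perfect LD with snp_a
    "snp_c": [0, 2, 0, 2],  # uncorrelated with snp_a
}
ld = r2_matrix(genotypes)
```

Blocks of SNPs with high pairwise R² correspond to the correlated SNP blocks among which the multivariable approach tends to pick only a few representatives.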