Katrin Hainke1, Sebastian Szugat1, Roland Fried1, Jörg Rahnenführer2.
Abstract
BACKGROUND: Disease progression models are important for understanding the critical steps during the development of diseases. The models are embedded in a statistical framework to deal with random variation due to biology and to the sampling process when only a finite population is observed. Conditional probabilities describe dependencies between the events that characterise the critical steps in the disease process. Many different model classes have been proposed in the literature, from simple path models to complex Bayesian networks. Oncogenetic trees are a popular model class that is easy to understand yet flexible. They have been applied to describe the accumulation of genetic aberrations in cancer and HIV data. However, the number of potentially relevant aberrations is often far larger than the maximal number of events that can be used to estimate the progression models reliably. Still, only a few approaches to variable selection exist, and they have not yet been investigated in detail.
Keywords: Disease progression model; Oncogenetic tree; Variable selection
Year: 2017 PMID: 28764644 PMCID: PMC5539896 DOI: 10.1186/s12859-017-1762-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Example of an oncogenetic tree model with n=6 events. The edge weights represent the conditional probability that the child event occurs given that the parent event has already occurred
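The edge-weight interpretation in Fig. 1 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tree, the event names and the weights below are hypothetical, and the sketch only assumes the usual reading that an event's marginal probability is the product of the conditional probabilities along the path from the root to that event.

```python
# Hypothetical oncogenetic tree with n=6 events: child -> (parent, P(child | parent)).
# Event names and weights are invented for illustration only.
TREE = {
    "A": ("root", 0.8),
    "B": ("root", 0.5),
    "C": ("A", 0.5),
    "D": ("A", 0.25),
    "E": ("C", 0.5),
    "F": ("B", 0.4),
}


def marginal_probability(event, tree=TREE):
    """Multiply the edge weights on the path from `event` up to the root."""
    prob = 1.0
    while event != "root":
        parent, weight = tree[event]
        prob *= weight
        event = parent
    return prob
```

For example, `marginal_probability("E")` multiplies the weights of the edges C→E, A→C and root→A, so events deep in the tree can never be more frequent than their ancestors.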
Overview of all variable selection methods considered here
| Name | Short name | Short description of criterion for selected events |
|---|---|---|
| Univariate Frequency | freq | Frequency above cutoff |
| Method of Brodeur | brod | High frequency, compared to uniform distribution |
| Pairwise Correlation | cor | Event pairs with high correlation |
| Fisher’s Exact Test | fisher | Event pairs with significant dependence |
| Fisher’s z-transformation | z | Event pairs with significant dependence |
| Weights of Edmonds’ Algorithm | weight | Event pairs with large weights in algorithm |
| Conditional Probabilities in Tree | OT | Large conditional probabilities in oncogenetic tree |
| Independence in Tree | single | Remove single independent events in fitted tree |
| Largest Clique Identification | lcliq | Member of the largest subgraph |
| Maximal Clique Identification | mcliq | Member of the maximum weight subgraph |
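As a rough illustration of the two simplest criteria in the table, the following Python sketch applies a frequency cutoff (the idea behind freq) and a pairwise-correlation cutoff (the idea behind cor) to a binary sample-by-event matrix. The function names, the toy data and the exact decision rules (strict "greater than", signed rather than absolute correlation) are assumptions made for this example and may differ from the authors' definitions.

```python
from itertools import combinations
from math import sqrt


def select_by_frequency(data, names, cutoff):
    """Keep events whose relative frequency exceeds `cutoff`."""
    n = len(data)
    return [name for j, name in enumerate(names)
            if sum(row[j] for row in data) / n > cutoff]


def phi(x, y):
    """Pearson correlation of two binary vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    return cov / sqrt(var) if var > 0 else 0.0


def select_by_correlation(data, names, cutoff):
    """Keep events that appear in at least one pair with correlation > cutoff."""
    cols = list(zip(*data))
    keep = set()
    for i, j in combinations(range(len(names)), 2):
        if phi(cols[i], cols[j]) > cutoff:
            keep.update((names[i], names[j]))
    return [name for name in names if name in keep]


# Toy data: rows are samples, columns are the hypothetical events A, B, C.
NAMES = ["A", "B", "C"]
DATA = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
]
```

With these toy data, both `select_by_frequency(DATA, NAMES, 0.2)` and `select_by_correlation(DATA, NAMES, 0.3)` keep A and B and drop the rare, weakly correlated event C.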
Fig. 2 Results of the simulation study. The eight different parameter settings are displayed on the x-axis, whereas the means of the 100 L1-distances for combinations of method and threshold are shown on the y-axis. Top left: Results for the univariate frequency method with all chosen thresholds. Top right: Results for the largest cliques method with all chosen thresholds. Bottom left: Comparison of seven different selection methods, each with one threshold that was globally best for all parameter situations. Bottom right: Comparison of three different selection methods. The chosen threshold is given in brackets, because there was no globally best one
Recommendation of the thresholds to be used for each method and each data situation
|  | 50/1000 | 50/1000 | 50/1000 | 50/1000 |
|---|---|---|---|---|
|  | 2 | 12 | 2 | 12 |
|  | [0,0.2] | [0,0.2] | [0.2,0.4] | [0.2,0.4] |
| freq | 0.2 | 0.2 | 0.2 | 0.2 |
| cor | 0.3 | 0.3 | 0.3 | 0.3 |
| fisher | 0.01 | 0.01 | 0.01 | 0.01 |
| z | 0.9 | 0.9 | 0.9 | 0.9 |
| weight | 0.3 | 0.05 | 0.3 | 0.05 |
| OT | 0.25 | 0.25 | 0.25 | 0.25 |
| lcliq | 0.05 | 0.05 | 0.2 | 0.2 |
| mcliq | 0.05 | 0.05 | 0.15 | 0.15 |
The method of Brodeur generates its threshold implicitly and the single method does not need any threshold at all
Fig. 3 Comparison of all variable selection methods. Based on the results from Fig. 2 we need to distinguish between situations with low and high proportion of noise variables (π ∈ I_0.1 vs. π ∈ I_0.3)
Fig. 4 Results of the simulation study. The eight different parameter settings are displayed on the x-axis, whereas the means of the 100 values for sens and spec are shown on the y-axis. For all figures, α = 0.5 (instead of α = 0.2 for the L1-distance). Top row: Results for the criterion sens, left: comparison of all seven methods with one overall best threshold, right: comparison of all three methods with two thresholds depending on the underlying data situation. Middle row: Results for the criterion spec, left: comparison of all seven methods with one overall best threshold, right: comparison of all three methods with two thresholds depending on the underlying data situation. Bottom row: Comparison of all variable selection methods for the two criteria sens (left) and spec (right)
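The criteria sens and spec are not defined in this excerpt. The Python sketch below assumes the standard reading for a variable selection problem: sensitivity is the share of true model events that were selected, and specificity is the share of noise events that were correctly left out. The function name and the example events are hypothetical.

```python
def sensitivity_specificity(selected, true_events, all_events):
    """Return (sens, spec) for one selection result.

    Assumes both the set of true events and the set of noise events
    (all events minus true events) are non-empty.
    """
    selected = set(selected)
    true_events = set(true_events)
    noise = set(all_events) - true_events
    sens = len(selected & true_events) / len(true_events)
    spec = len(noise - selected) / len(noise)
    return sens, spec
```

For instance, with true events A to D, noise events E and F, and a selection of A, B, C and E, three of four true events are found (sens = 0.75) and one of two noise events is excluded (spec = 0.5).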
Events (meningioma and HIV data sets) and number of events (glioblastoma data set) chosen by our variable selection methods using the thresholds from the simulation study (x = event was selected)
| Method | freq | brod | cor | fisher | z | weight | weight | OT | single | lcliq | lcliq | mcliq | mcliq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| threshold | 0.2 | - | 0.3 | 0.01 | 0.9 | 0.05 | 0.3 | 0.25 | - | 0.05 | 0.2 | 0.05 | 0.15 |
| MENINGIOMA data set | |||||||||||||
| Chr14- | x | x | x | x | x | x | x | x | x | ||||
| Chr22- | x | x | x | x | x | x | x | x | x | ||||
| Chr1p- | x | x | x | x | x | x | x | x | |||||
| Chr6- | x | x | x | x | x | x | |||||||
| Chr10- | x | x | x | x | x | x | |||||||
| Chr18- | x | x | x | x | x | x | |||||||
| Chr19- | x | x | x | x | x | ||||||||
| ChrY- | x | x | x | x | |||||||||
| ChrX- | x | x | x | ||||||||||
| HIV data set | |||||||||||||
| 215 F,Y | x | x | x | x | x | x | x | x | x | x | x | x | |
| 41 L | x | x | x | x | x | x | x | x | x | x | x | ||
| 70 R | x | x | x | x | x | x | x | x | x | x | |||
| 67 N | x | x | x | x | x | x | x | x | x | ||||
| 219 E,Q | x | x | x | x | x | x | x | x | |||||
| 210 W | x | x | x | x | x | x | x | ||||||
| GLIOBLASTOMA data set | |||||||||||||
| Number of events | 23 | 29 | 73 | 99 | 102 | 89 | 102 | 85 | 131 | 22 | 10 | 22 | 11 |
The thresholds for the method of Brodeur are 0.1, 0.33 and 0.17 for the meningioma, HIV and glioblastoma data sets, respectively
Fig. 5 Some trees resulting from the variable selection process for the three data sets. The three rows represent the meningioma (top), HIV (middle) and glioblastoma (bottom) data sets, respectively. The left column shows the tree with all events as a reference; the middle and right columns show trees based on frequency selection and clique selection, respectively
List of events from the glioblastoma data set that were chosen by our variable selection methods (x = event was selected)
| Method | freq | brod | cor | fisher | weight | OT | lcliq | mcliq |
|---|---|---|---|---|---|---|---|---|
| threshold | 0.41 | 0.1725 | 0.70 | 10^-26 | 0.0018 | 0.90 | 0.2 | 0.15 |
|  | x | x | x | x | x | x | x | |
| Chr7q+ | x | x | x | x | x | x | x | |
| Chr19p+ | x | x | x | x | x | x | ||
| Chr20p+ | x | x | x | x | x | x | ||
| Chr20q+ | x | x | x | x | x | x | ||
|  | x | x | x | x | x | |||
|  | x | x | x | x | x | |||
|  | x | x | x | x | x | |||
|  | x | x | x | x | ||||
| Chr19q+ | x | x | x | x | ||||
| Chr9q++ | x | x | x | |||||
| Chr12p++ | x | x | x | |||||
| Chr18p++ | x | x | x | |||||
| Chr18q++ | x | x | x | |||||
| Chr21q++ | x | x | x | |||||
|  | x | x | x | |||||
|  | x | x | ||||||
| Chr2q++ | x | x | ||||||
| Chr3p++ | x | x | ||||||
| Chr8q++ | x | x | ||||||
| Chr11p- | x | x | ||||||
| Chr11q- | x | x | ||||||
|  | x | x | ||||||
| Chr14q- | x | x | ||||||
| Chr15q- | x | x | ||||||
| Chr1q+ | x | |||||||
| Chr1q- | x | |||||||
| Chr3q- | x | |||||||
| Chr4q- | x | |||||||
| Chr6p- | x | |||||||
| Chr6q- | x | |||||||
| Chr8p- | x | |||||||
| Chr9q- | x | |||||||
| Chr12q+ | x | |||||||
| Chr12q- | x | |||||||
| Chr15q+ | x | |||||||
| Chr21p- | x | |||||||
| Chr7q- | x | |||||||
| Chr13q+ | x | |||||||
| Chr18p+ | x |
The events are sorted according to their selection frequency. Events already mentioned in the literature are printed in bold
Overview of the occurrence rates for all events for simulated data and the three data sets
| Data set | Minimum | 1st quartile | Median | Mean | 3rd quartile | Maximum |
|---|---|---|---|---|---|---|
| simulation data | 0.00 | 0.14 | 0.29 | 0.33 | 0.49 | 0.96 |
| simulation data with noise | 0.00 | 0.12 | 0.23 | 0.26 | 0.36 | 0.96 |
| meningioma | 0.02 | 0.07 | 0.04 | 0.08 | 0.06 | 0.38 |
| HIV | 0.12 | 0.20 | 0.24 | 0.27 | 0.36 | 0.42 |
| glioblastoma | 0.00 | 0.00 | 0.04 | 0.12 | 0.14 | 0.85 |
Proportion of events from the three data sets that fit the estimated model
| Data set | All events | Frequency tree | Clique tree |
|---|---|---|---|
| meningioma | 0.90 | 1.00 | 0.97 |
| HIV | 0.87 | 0.94 | 0.88 |
| glioblastoma | 0.79 | 0.69 | 0.59 |
For the glioblastoma data the numbers are lower due to the tree depth of 4 and 6 for freq and cliq, respectively. For the simulated data, minimum, 1st quartile, median, mean, 3rd quartile and maximum are 0.47, 0.94, 0.99, 0.92, 1.00 and 1.00