Abstract
BACKGROUND: Cancer progression is caused by the sequential accumulation of mutations, but not all orders of accumulation are equally likely. When the fixation of some mutations depends on the presence of previous ones, identifying restrictions in the order of accumulation of mutations can lead to the discovery of therapeutic targets and diagnostic markers. The purpose of this study is to conduct a comprehensive comparison of the performance of all available methods to identify these restrictions from cross-sectional data. I used simulated data sets (where the true restrictions are known) but, in contrast to previous work, I embedded restrictions within evolutionary models of tumor progression that included passengers (mutations not responsible for the development of cancer, known to be very common). This allowed me to assess, for the first time, the effects of having to filter out passengers, of sampling schemes (when, how, and how many samples), and of deviations from order restrictions.
Year: 2015 PMID: 25879190 PMCID: PMC4339747 DOI: 10.1186/s12859-015-0466-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Inferring order restrictions. (a) Main steps in the analysis of patient data. (b) Main steps used in this paper for the generation (simulation) of data and its analysis. Terms in monospaced blue font are those in Table 1, and terms in italics, as in Table 1, correspond to within-data set factors. Numbers indicate the chronological order of the steps. In step 1, cancer development is simulated for the specified values of Model, sh, and True Graph. This simulation generates tumor cell data for the equivalent of a single patient in panel (a). In step 2, data for S.Size patients are sampled (cross-sectional sampling) according to the settings of S.Time and S.Type, producing a data set (a collection of genotypes: a matrix of subjects by genes). If the identity of the true drivers is not known, Filtering in step 3 removes from the data set the genes that do not meet certain frequency criteria. The data set is then passed on, in step 4, to one of the specified methods to infer the graph that encodes the order restrictions. This inferred graph is compared, in step 5, with the true graph (which was used in step 1 to generate the cancer cell data), yielding the four performance measures Diff, PFD, PND and FPF. The process illustrated here was repeated 20 times for all possible combinations of Model, sh, True Graph, S.Time, S.Type, and S.Size. Every data set was subjected to all Filtering procedures and analyzed with all six Methods.
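Step 2 of the caption (sampling one genotype per simulated tumor under a given S.Type) can be illustrated with a minimal Python sketch. It assumes a tumor is summarized as a list of clones, each a genotype tuple with a cell count. The function names, the clone representation, the example numbers, and the use of `>=` for the detection threshold are all assumptions made for illustration, not details taken from the paper.

```python
import random

def sample_single_cell(clones, rng):
    """Return the genotype of one cell drawn at random (singleC-style sampling).

    clones: list of (genotype, n_cells); a clone is drawn with probability
    proportional to its size (assumed representation).
    """
    genotypes = [g for g, _ in clones]
    sizes = [n for _, n in clones]
    return rng.choices(genotypes, weights=sizes, k=1)[0]

def sample_whole_tumor(clones, threshold):
    """Return the genotype seen in a whole-tumor sample (wholeT-style sampling).

    A gene is reported as mutated when the fraction of cells carrying it
    reaches the detection threshold (the >= comparison is an assumption).
    """
    total = sum(n for _, n in clones)
    n_genes = len(clones[0][0])
    observed = []
    for gene in range(n_genes):
        carriers = sum(n for g, n in clones if g[gene] == 1)
        observed.append(1 if carriers / total >= threshold else 0)
    return tuple(observed)

# Hypothetical tumor: three clones over three genes.
clones = [((1, 0, 0), 600), ((1, 1, 0), 350), ((1, 1, 1), 50)]

whole_05 = sample_whole_tumor(clones, 0.5)    # only the gene present in most cells
whole_001 = sample_whole_tumor(clones, 0.01)  # rare mutations are detected too
cell = sample_single_cell(clones, random.Random(0))
```

Repeating either call over S.Size simulated tumors would give the subjects-by-genes matrix that the caption describes.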
Factors considered and their levels or possible values, together with acronyms used through the text
| Factor (acronym) | Description | Levels or possible values |
|---|---|---|
| Model | Evolutionary model of cancer progression | exp, Bozic, McF_4, McF_6 |
| sh | Penalization of deviations from monotonicity | 0, Inf (for |
| True graph | The true graph: the structure that encodes the order restrictions. All possible combinations of Number of nodes and Conjunction | 11-A, 11-B, 9-A, 9-B, 7-A, 7-B |
| Number of nodes (NumNodes) | Number of genes or alterations | 11, 9, 7 |
| Conjunction | Whether or not the graph has conjunctions | Yes, No |
| Sample size (S.Size) | Number of samples used for reconstructing the graph | 100, 200, 1000 |
| Sampling time (S.Time) | When the sample is taken | Last, unif (for uniform) |
| Sampling type (S.Type) | How tissue is collected | singleC (for single cell), wholeT_0.5 (whole tumor, detection threshold=0.5), wholeT_0.01 (whole tumor, detection threshold=0.01) |
| *Filtering* | Method for selecting drivers, or filtering passengers, when the true drivers are not known | S1, S5, J1, J5 (for frequency of Single event and Joint frequency of events, with thresholds 1% and 5% respectively) |
| *Method* | Method for inferring the order restrictions | CBN, CBN-A, DiP, DiP-A, OT, OT-A |
The within-data set factors, Filtering and Method (see text), are shown in italics. All other factors are among-data set factors. Sampling scheme, used through the text, refers to when (S.Time) and how (S.Type) we sample.
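The single-event filters (S1, S5) listed in the table can be sketched as a frequency cut-off on the columns of the subjects-by-genes matrix. This is a minimal illustration under stated assumptions: the function names are hypothetical, the data are made up, and keeping genes whose frequency is at or above the threshold (rather than strictly above) is a guess. The joint-frequency filters (J1, J5) are not sketched because their exact rule is not given here.

```python
def gene_frequencies(data):
    """Fraction of subjects carrying each gene; data is a list of 0/1 rows."""
    n_subjects = len(data)
    n_genes = len(data[0])
    return [sum(row[g] for row in data) / n_subjects for g in range(n_genes)]

def filter_single(data, threshold):
    """Keep genes whose single-event frequency reaches the threshold.

    Returns the kept column indices and the reduced matrix.
    The >= comparison is an assumption.
    """
    freqs = gene_frequencies(data)
    keep = [g for g, f in enumerate(freqs) if f >= threshold]
    return keep, [[row[g] for g in keep] for row in data]

# Hypothetical data set: 200 subjects, three genes with frequencies
# 0.30, 0.03 and 0.005.
data = [[1 if i < 60 else 0, 1 if i < 6 else 0, 1 if i < 1 else 0]
        for i in range(200)]

keep_s1, data_s1 = filter_single(data, 0.01)  # S1-like: 1% threshold
keep_s5, data_s5 = filter_single(data, 0.05)  # S5-like: 5% threshold
```

With these made-up frequencies the 1% cut-off keeps the first two genes and the 5% cut-off keeps only the first, which shows how a stricter filter passes fewer genes on to the inference method.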
Main parameters for each of the tumor progression models
| Model | | | | |
|---|---|---|---|---|
| Bozic | 1 | (1− | 10⁻⁶ | >10⁹ cells |
| exp | (1+ | 1 | | >10⁹ cells |
| McF_4 | (1+ | log(1+ | 5×10⁻⁷ | Number of drivers ≥4 |
| McF_6 | (1+ | log(1+ | 5×10⁻⁷ | Number of drivers ≥6 |
j is the number of drivers with their dependencies met, and p is the number of drivers with dependencies not met. In all cases s = 0.1. sh is set to either 0 (so it has no effect) or ∞ (so the fitness of that clone is 0). N: population size. K = 2000. +: Strictly, birth rate = max(0, (1 + s)^j (1 − sh)^p).
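The footnote's birth-rate expression can be sketched in Python. The sketch reads the expression as max(0, (1 + s)^j (1 − sh)^p), which is an assumption about how j, p, and sh enter, and it treats sh = ∞ as giving a birth rate of 0 whenever p > 0, following the footnote's remark that the fitness of such a clone is 0. The function name is hypothetical.

```python
import math

def birth_rate(j, p, s=0.1, sh=0.0):
    """Birth rate of a clone with j drivers whose dependencies are met
    and p drivers whose dependencies are not met (assumed functional form).
    """
    # With sh = infinity, any unmet dependency makes the clone non-viable.
    # Handling this case explicitly avoids (-inf) ** p turning positive
    # for even p.
    if p > 0 and math.isinf(sh):
        return 0.0
    return max(0.0, (1.0 + s) ** j * (1.0 - sh) ** p)

rate_ok = birth_rate(j=2, p=0)                     # dependencies respected
rate_free = birth_rate(j=2, p=1, sh=0.0)           # sh = 0: no penalty
rate_blocked = birth_rate(j=2, p=1, sh=math.inf)   # sh = Inf: clone cannot grow
```

Under this reading, sh = 0 makes order restrictions irrelevant to fitness, while sh = Inf enforces them strictly.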
Ranking of all 36 combinations of Method and Sampling scheme (time, type) when drivers are known with respect to each performance measure
| Method, S.Time, S.Type | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| OT-A, last, singleC | | | 14 | 10 | | | | 15 |
| OT-A, last, wholeT_0.5 | | | 15 | 12 | | | | 14 |
| OT-A, last, wholeT_0.01 | | | 7 | 22 | | 6 | | 22 |
| OT-A, unif, singleC | | 6 | 19 | 13 | | 9 | 12.5 | 9.5 |
| OT, unif, singleC | | | 20 | 11 | | 7 | 12.5 | 9.5 |
| OT-A, unif, wholeT_0.01 | 8 | 11 | 16 | 24 | 8 | 11 | | 24 |
| OT, last, singleC | 10 | 9 | 23 | | 10 | | 17 | |
| OT, last, wholeT_0.01 | 11 | | 18 | 18 | 12 | | 14 | 18 |
| OT, last, wholeT_0.5 | 12 | 7 | 24 | | 11 | | 23 | |
| CBN-A, unif, wholeT_0.01 | 13 | 13 | | 26 | 13 | 13 | | 26 |
| CBN-A, unif, singleC | 14 | 16 | | 28 | 15 | 15 | 9 | 29 |
| CBN-A, unif, wholeT_0.5 | 15 | 18 | | 34 | 14 | 16 | 8 | 31 |
| CBN, unif, singleC | 16 | 17 | | 29 | 17 | 19 | 10 | 34 |
| CBN, unif, wholeT_0.01 | 17 | 14 | | 27 | 18 | 14 | 6 | 27 |
| DiP-A, unif, singleC | 31 | 28 | 31 | | 31 | 30 | 31 | 6 |
| DiP, last, wholeT_0.5 | 33 | 35 | 34 | | 34 | 34 | 36 | |
| DiP, unif, singleC | 35 | 31 | 33 | | 35 | 32 | 33 | |
| DiP, unif, wholeT_0.5 | 36 | 36 | 36 | 6 | 36 | 36 | 35 | |
Methods have been ordered by their performance in the first performance measure. Best five methods are shown in bold. Only methods that are within the best five in at least one performance measure are shown (full table as well as tables split by S.Size are available from Additional file 2).
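The fractional entries in the table (for example 12.5 and 9.5) are what average ranking of tied combinations would produce. Assuming that is how ties were handled, which the text here does not state, a minimal Python sketch of average ranking looks as follows. The function name and the example scores are hypothetical, and "lower score is better" is likewise an assumption.

```python
def average_ranks(scores):
    """Rank scores in ascending order (lower is better).

    Tied scores all receive the mean of the rank positions they occupy,
    so two entries tied for positions 3 and 4 both get rank 3.5.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend 'end' over the run of entries tied with the one at 'pos'.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

# Hypothetical mean scores for four combinations; the first and third tie.
example_ranks = average_ranks([0.30, 0.10, 0.30, 0.20])
```

Here the two tied combinations share rank 3.5, the same kind of half-integer value that appears in the table.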
Frequencies of most common confidence sets using multiple comparisons with the best with a coverage of 0.90, when drivers are known
| Confidence set | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| OT, OT-A | 0.60 | 0.47 | 0.03 | 0.05 | 0.65 | 0.53 | 0.23 | 0.05 |
| DiP, DiP-A, OT, OT-A | 0.02 | 0.08 | - | 0.57 | 0.03 | 0.16 | 0.05 | 0.59 |
| CBN, CBN-A | 0.02 | 0.09 | 0.61 | - | - | 0.04 | 0.32 | 0.01 |
| DiP, DiP-A | - | 0.04 | 0.01 | 0.17 | 0.01 | 0.02 | - | 0.16 |
| CBN, CBN-A, OT, OT-A | 0.02 | 0.02 | 0.07 | - | 0.02 | 0.03 | 0.12 | 0.01 |
| DiP-A, OT, OT-A | - | 0.06 | - | 0.02 | 0.02 | 0.05 | 0.02 | 0.03 |
| OT-A | 0.16 | 0.09 | 0.06 | - | 0.15 | 0.04 | 0.13 | - |
| OT | 0.07 | 0.05 | - | 0.01 | 0.03 | 0.03 | - | - |
| DiP, OT | - | 0.04 | - | 0.08 | 0.01 | 0.05 | - | 0.08 |
| CBN-A | 0.03 | - | 0.06 | - | - | - | 0.01 | - |
Combinations not shown have a frequency less than 0.05 for all columns. Frequencies normalized by column total (N = 432).
Figure 2. Drivers known, plot of the coefficients (posterior mean and 0.025 and 0.975 quantiles) for Conjunction, Method, S.Time, S.Type and S.Size from the GLMMs for each performance measure. X-axis labeled by the exponential of the coefficient (i.e., relative change in the odds ratio or in the scale of the Poisson parameter for Diff): smaller (or further left) is better. The vertical dashed line denotes no change relative to the overall mean (the intercept). The x-axis has been scaled to make it symmetric (e.g., a ratio of 1.25 is the same distance from the vertical line as a ratio of 1/1.25). Coefficients that correspond to a change larger than 25% (i.e., ratio > 1.25 or < 1/1.25) are shown as larger red dots. The coefficients shown are only those that represent a change larger than 25% for at least one performance measure, or coefficients that are marginal to those shown (e.g., any main effect from an interaction that includes it).
Figure 3. Drivers known, plot of the coefficients for model, sh, graph, and their interactions with all other terms from the GLMMs for each performance measure. See legend for Figure 2.
Figure 4. Mean of each performance measure for the different combinations of method and conjunction in (a) the drivers known and (b) the drivers unknown scenarios. Y-axis is in the scale of the variable (fractions for PFD, PND, FPF and sum of differences for Diff). Each mean value shown is the mean of 8640 and 34560 values for drivers known and unknown, respectively.
Figure 5. Mean of each performance measure for the different combinations of method and model in (a) the drivers known and (b) the drivers unknown scenarios. Each mean value shown is the mean of 4320 and 17280 values for drivers known and unknown, respectively.
Figure 6. Mean of each performance measure in the drivers known scenario, for the different combinations of method and sh. Each value shown is the mean of 8640 values.
Figure 7. Mean of each performance measure for the different combinations of model and sh in (a) the drivers known and (b) the drivers unknown scenarios. Each value shown is the mean of 12960 and 51840 values for drivers known and unknown, respectively.
Figure 8. Mean number of genes selected for the different combinations of model and filter by (a) S.Time and (b) sh. Different symbol shapes identify the number of true nodes (NumNodes) of the true graph. Note that the number of genes selected is a function of Filtering, not Method. Each value shown is the mean of 720 values.
Figure 9. Drivers unknown, plot of the coefficients for conjunction, filtering, method, S.Time, S.Type and S.Size from the GLMMs for each performance measure. See legend for Figure 2.
Figure 10. Drivers unknown, plot of the coefficients for model, sh, graph, and their interactions with all other terms from the GLMMs for each performance measure. See legend for Figure 2.
Figure 11. Mean of each performance measure in the drivers unknown scenario, for the different combinations of model and filtering. Each value shown is the mean of 25920 values.
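The counts quoted in the figure captions follow from the factorial design described for Figure 1: all combinations of the among-data set factors, 20 replicates, six methods, and four filtering procedures when drivers are unknown. The following Python check reproduces that arithmetic from the level counts in the table of factors. The variable names are ad hoc.

```python
from math import prod

# Number of levels of each among-data set factor (from the table of factors).
levels = {"Model": 4, "sh": 2, "True graph": 6,
          "S.Size": 3, "S.Time": 2, "S.Type": 3}
replicates = 20
n_methods = 6      # CBN, CBN-A, DiP, DiP-A, OT, OT-A
n_filters = 4      # S1, S5, J1, J5
n_conjunction = 2  # True graph = NumNodes (3 levels) x Conjunction (2 levels)
n_numnodes = 3

n_datasets = prod(levels.values()) * replicates   # simulated data sets
values_known = n_datasets * n_methods             # drivers known
values_unknown = values_known * n_filters         # drivers unknown

# Values averaged per plotted point in each figure (drivers known case).
per_method_conjunction = values_known // (n_methods * n_conjunction)    # Figure 4
per_method_model = values_known // (n_methods * levels["Model"])        # Figure 5
per_method_sh = values_known // (n_methods * levels["sh"])              # Figure 6
per_model_sh = values_known // (levels["Model"] * levels["sh"])         # Figure 7
# Figure 8 counts data sets per filter, split by Model, S.Time and NumNodes.
per_model_stime_numnodes = n_datasets // (
    levels["Model"] * levels["S.Time"] * n_numnodes)
per_model_filter = values_unknown // (levels["Model"] * n_filters)      # Figure 11

# Combinations ranked in the two ranking tables.
ranked_known = n_methods * levels["S.Time"] * levels["S.Type"]
ranked_unknown = ranked_known * n_filters
```

Multiplying the drivers-known figures by the four filters gives the drivers-unknown counts in the same captions (for example, 8640 × 4 = 34560).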
Ranking of all 144 combinations of method, filtering, and sampling scheme (time, type) when drivers are unknown with respect to each performance measure
| Filtering, Method, S.Time, S.Type | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| S5, OT-A, last, singleC | | | 15 | 60 | | 13 | 13 | 60 |
| S5, OT-A, last, wholeT_0.5 | | | 23 | 59 | | | 23 | 59 |
| J5, OT-A, last, wholeT_0.01 | | | 26 | 71 | | | 29 | 74 |
| S5, OT-A, last, wholeT_0.01 | | | | 94 | | | | 94 |
| S5, OT-A, unif, wholeT_0.01 | | 17 | 33 | 69 | | 17 | 32.5 | 68.5 |
| S5, OT, unif, singleC | | 41 | 61.5 | 28 | | 38 | 55.5 | 20.5 |
| S5, OT-A, unif, singleC | | 50 | 61.5 | 31 | | 49 | 55.5 | 20.5 |
| J1, OT-A, last, singleC | | 15 | 38 | 80 | | 12 | 37 | 75 |
| J1, OT-A, last, wholeT_0.5 | | | 39 | 79 | | | 39 | 71 |
| S5, OT-A, unif, wholeT_0.5 | | 47 | 67 | 29.5 | 14 | 45 | 60.5 | 27.5 |
| S5, OT, unif, wholeT_0.01 | 11 | 11 | 34 | 68 | 11 | | 32.5 | 68.5 |
| J1, OT-A, last, wholeT_0.01 | 13 | 21 | | 109 | 15 | 19 | | 108 |
| S1, OT-A, last, wholeT_0.5 | 18 | 39 | | 117 | | 27 | | 116 |
| S1, OT-A, last, singleC | 21 | 38 | | 120 | 12 | 32 | | 120 |
| S5, OT, last, singleC | 23 | 16 | 51 | 21 | 20 | | 42 | 19 |
| S5, OT, last, wholeT_0.5 | 24 | | 55 | 16 | 23 | | 49 | 17 |
| J1, OT, unif, wholeT_0.01 | 29 | | 36 | 91 | 30 | 14 | 36 | 91 |
| S5, CBN-A, unif, wholeT_0.01 | 30 | | 37 | 89 | 40 | 20 | 40 | 103 |
| J5, OT, last, wholeT_0.01 | 31 | | 59 | 32 | 36 | | 62 | 50 |
| S1, OT-A, last, wholeT_0.01 | 36 | 48 | | 134 | 24 | 50 | | 134 |
| S5, OT, last, wholeT_0.01 | 38 | | 29 | 83 | 34 | | 24 | 82 |
| J1, OT, last, wholeT_0.5 | 47 | 14 | 89 | 61 | 38 | | 83 | 43 |
| J5, OT, last, singleC | 48 | 35 | 101 | | 58 | 43 | 109 | |
| J5, DiP-A, unif, singleC | 49 | 125 | 139 | 10.5 | 51 | 128 | 138 | |
| J5, OT, last, wholeT_0.5 | 57 | 20 | 102 | | 75 | 23 | 112 | |
| J5, DiP, unif, singleC | 61 | 131 | 143 | | 60 | 132 | 144 | |
| J5, DiP, unif, wholeT_0.5 | 64 | 144 | 144 | | 64 | 144 | 143 | |
| J5, DiP, last, singleC | 79 | 115 | 140 | | 81 | 119 | 140 | |
| J5, DiP, last, wholeT_0.5 | 82 | 137 | 138 | | 83 | 136 | 139 | |
| S5, DiP, last, wholeT_0.5 | 83 | 134 | 120 | | 87 | 130 | 110 | |
| J1, DiP, last, wholeT_0.5 | 91 | 132 | 134 | | 89 | 134 | 134 | |
| J1, DiP, last, singleC | 92 | 110 | 135 | | 93 | 115 | 132 | |
| S1, OT-A, unif, wholeT_0.01 | 102 | 65 | | 136 | 70 | 59 | | 136 |
| S1, OT, unif, wholeT_0.01 | 109 | 62 | | 135 | 74 | 55 | | 135 |
| S1, CBN-A, unif, wholeT_0.01 | 137 | 82 | | 143 | 140 | 81 | | 143 |
| S1, CBN-A, last, wholeT_0.01 | 142 | 93 | | 137 | 139 | 94 | 12 | 137 |
| S1, CBN, unif, wholeT_0.01 | 143 | 84 | | 144 | 143 | 90 | | 144 |
Methods have been ordered by their performance in the first performance measure. Best 10 methods are shown in bold. Only methods that are within the best 10 in at least one performance measure are shown (full table as well as tables split by S.Size are available from Additional file 2).
Frequencies of most common confidence sets using multiple comparisons with the best with a coverage of 0.90, when drivers are unknown
| Confidence set | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| S1:OT-A, S5:OT-A | 0.01 | - | 0.04 | - | 0.01 | - | 0.07 | - |
| S1:OT, S1:OT-A | 0.07 | 0.07 | 0.11 | - | 0.08 | 0.07 | 0.22 | - |
| S1:OT-A | 0.02 | - | 0.06 | - | 0.02 | - | 0.05 | - |
| S5:OT-A | 0.05 | - | - | - | 0.06 | - | - | - |
| S5:OT, S5:OT-A | 0.10 | 0.03 | - | - | 0.12 | 0.02 | - | - |
| S5:DiP, S5:DiP-A, S5:OT, S5:OT-A | 0.04 | - | - | - | 0.05 | - | - | - |
| S1:OT, S1:OT-A, S5:OT, S5:OT-A | 0.02 | - | 0.02 | - | 0.05 | - | 0.07 | - |
| J5:OT, J5:OT-A, S5:OT, S5:OT-A | 0.01 | 0.06 | - | - | - | 0.04 | - | - |
| S1:CBN, S1:CBN-A | 0.01 | - | 0.05 | - | 0.01 | 0.01 | 0.01 | - |
| S1:CBN, S1:CBN-A, S1:OT, S1:OT-A | - | 0.01 | 0.25 | - | 0.02 | - | 0.14 | - |
| J1:DiP, J1:OT, J5:DiP, J5:OT, S1:DiP, S5:DiP, S5:OT | - | 0.02 | - | 0.04 | - | 0.03 | - | 0.05 |
Combinations not shown have a frequency less than 0.05 for all columns or are composed of more than 10 individual best methods. Frequencies normalized by column total (N = 432). ‘A:B’ denotes filtering with A and using method B.