Abstract
One of the goals of cancer research is to identify a set of genes that cause or control disease progression. However, although multiple such gene sets have been published, they usually agree very poorly with each other, and very few of the genes have proved to be functional therapeutic targets. Furthermore, recent findings from a breast cancer gene-expression cohort showed that sets of genes selected randomly can be used to predict survival with a much higher probability than expected. These results imply that many of the genes identified in breast cancer gene expression analysis may not be causal of cancer progression, even though they can still be highly predictive of prognosis. We performed a similar analysis on all the cancer types available in The Cancer Genome Atlas (TCGA), namely, estimating the predictive power of random gene sets for survival. Our work shows that most cancer types exhibit the property that random selections of genes are more predictive of survival than expected. In contrast to previous work, this property is not removed by using a proliferation signature, which implies that proliferation may not always be the confounder that drives this property. We suggest one possible solution, in the form of data-driven sub-classification, that reduces this property significantly. Our results suggest that the predictive power of random gene sets may be used to identify the existence of sub-classes in the data, and thus may allow better understanding of patient stratification. Furthermore, reducing the observed bias may allow more direct identification of biologically relevant, and potentially causal, genes.
Year: 2018 PMID: 29470520 PMCID: PMC5839591 DOI: 10.1371/journal.pcbi.1006026
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Analysis of random bias in TCGA datasets.
| Dataset | Signif % | P-value | PCNA % | PCNA p-val |
|---|---|---|---|---|
| ACC | 80 | 0 | 45 | 2.7e-122 |
| BLCA | 55 | 2e-187 | 51 | 1e-157 |
| BRCA | 21 | 1.3e-27 | 14 | 1.1e-12 |
| CESC | 6 | 0.38 | 7 | 0.072 |
| CHOL | 1 | 6.7e-10 | 1 | 2.2e-09 |
| COAD | 3 | 0.0063 | 1 | 5.7e-07 |
| COADREAD | 5 | 0.64 | 2 | 0.00012 |
| DLBC | 5 | 0.8 | 3 | 0.033 |
| ESCA | 0 | 1.1e-12 | 0 | 8.9e-12 |
| GBM | 7 | 0.075 | 7 | 0.11 |
| GBMLGG | 99 | 0 | 88 | 0 |
| HNSC | 25 | 8.5e-38 | 28 | 4.9e-48 |
| KICH | 8 | 0.0095 | 6 | 0.4 |
| KIPAN | 64 | 1.9e-279 | 24 | 1.3e-35 |
| KIRC | 82 | 0 | 68 | 0 |
| KIRP | 63 | 3.4e-260 | 10 | 4.1e-05 |
| LAML | 0 | 4.9e-11 | 0 | 1.1e-10 |
| LGG | 80 | 0 | 66 | 5.9e-302 |
| LIHC | 34 | 5.2e-70 | 4 | 0.35 |
| LUAD | 45 | 8.8e-118 | 19 | 3.8e-22 |
| LUSC | 20 | 1.9e-25 | 12 | 5.5e-09 |
| MESO | 53 | 7.7e-172 | 20 | 1.1e-25 |
| OV | 4 | 0.12 | 4 | 0.3 |
| PAAD | 43 | 2.1e-107 | 6 | 0.37 |
| PCPG | 4 | 0.16 | 4 | 0.17 |
| PRAD | 1 | 4.3e-06 | 1 | 3.9e-08 |
| READ | 2 | 0.00012 | 2 | 0.00061 |
| SKCM | 3 | 0.016 | 4 | 0.54 |
| TGCT | 0 | 2e-12 | 0 | 4e-13 |
| THCA | 3 | 0.0049 | 4 | 0.23 |
| THYM | 18 | 6.2e-21 | 20 | 2.7e-25 |
| UCEC | 62 | 1.1e-250 | 46 | 2.8e-127 |
| UCS | 2 | 1.7e-05 | 1 | 7.9e-06 |
| UVM | 38 | 5.3e-84 | 21 | 4e-28 |
In this table, ‘Dataset’ is the dataset abbreviation as defined by the TCGA consortium; ‘Signif %’ is the percentage of random sets of size 64 that are significant; ‘P-value’ is the significance, by a proportion test, of obtaining that percentage; ‘PCNA %’ is the percentage of significant random sets of size 64 after adjusting for the PCNA signature; and ‘PCNA p-val’ is the significance of the value in ‘PCNA %’. The analysis was performed with 5000 random sets.
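The legend names a proportion test but does not specify its exact form. As a minimal sketch, assuming a one-sample, two-sided z-test of the observed fraction of significant random sets against a nominal 5% false-positive rate (the function name, the 5% null rate, and the normal approximation are all assumptions, so this is not expected to reproduce the table's p-values):

```python
import math


def proportion_z_test(n_significant, n_sets, null_rate=0.05):
    """Two-sided one-sample z-test for a proportion (normal approximation).

    null_rate is an assumed nominal significance level, not a value
    taken from the source.
    """
    p_hat = n_significant / n_sets
    # Standard error of the proportion under the null hypothesis.
    se = math.sqrt(null_rate * (1.0 - null_rate) / n_sets)
    z = (p_hat - null_rate) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return p_hat, z, p_value


# Illustrative counts only: 4000 of 5000 random sets significant.
p_hat, z, p_value = proportion_z_test(4000, 5000)
print(f"proportion={p_hat:.2f}, z={z:.1f}, p={p_value:.3g}")
```

Because the test is two-sided, a fraction well below the assumed null rate registers as a deviation just as a fraction well above it does.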
Consistency of random gene sets.
| Dataset | Samples | Signif % | Repeat % | P-value | OR |
|---|---|---|---|---|---|
| ACC | 79 | 52 | 32 | 5.7e-24 | 1.8 |
| BLCA | 408 | 28 | 12 | 9.9e-40 | 2.4 |
| BRCA | 1093 | 13 | 3 | 3e-3 | 1.4 |
| GBMLGG | 667 | 99 | 98 | 0.5 | 1.3 |
| HNSC | 520 | 13 | 3 | 1e-4 | 1.5 |
| KIPAN | 888 | 44 | 21 | 3.1e-5 | 1.3 |
| KIRC | 533 | 71 | 52 | 0.18 | 1.1 |
| KIRP | 289 | 32 | 13 | 2.0e-20 | 1.8 |
| LGG | 515 | 66 | 47 | 5.4e-5 | 1.3 |
| LIHC | 362 | 19 | 6 | 1.0e-14 | 1.9 |
| LUAD | 514 | 25 | 8 | 2.6e-7 | 1.4 |
| LUSC | 499 | 12 | 3 | 1.8e-12 | 2.3 |
| MESO | 86 | 33 | 12 | 4.5e-10 | 1.5 |
| PAAD | 178 | 21 | 7 | 1.1e-15 | 1.9 |
| THYM | 120 | 9 | 2 | 1.5e-10 | 2.5 |
| UCEC | 369 | 39 | 17 | 1.3e-6 | 1.3 |
| UVM | 80 | 25 | 8 | 3.7e-17 | 1.8 |
In this table, ‘Dataset’ is the dataset abbreviation as defined by the TCGA consortium; ‘Samples’ is the number of samples in the dataset; ‘Signif %’ is the average percentage of significant random sets of size 64 in each half of the original dataset; ‘Repeat %’ is the average percentage of random sets that are significantly correlated to survival in both halves of the dataset; ‘P-value’ is the significance, by a Fisher exact test, of obtaining the value in ‘Repeat %’; and ‘OR’ is the odds ratio for obtaining a random set that is significant in both halves compared to random. The analysis was performed for each dataset with a total of 5000 random sets, consisting of 50 choices of halves and 100 random sets for each such division.
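The split-half comparison in this legend can be sketched as a 2x2 contingency analysis. The example below assumes each random set is cross-tabulated by whether it is significant in the first half and in the second half, and that the Fisher exact test is one-sided (testing for more overlap than chance); both choices, and the helper names, are assumptions rather than details given in the source.

```python
from math import comb


def cross_tabulate(sig_half1, sig_half2):
    """Count random sets by significance in each half (assumed layout).

    Returns (both, only_half1, only_half2, neither).
    """
    pairs = list(zip(sig_half1, sig_half2))
    both = sum(1 for x, y in pairs if x and y)
    only1 = sum(1 for x, y in pairs if x and not y)
    only2 = sum(1 for x, y in pairs if y and not x)
    neither = sum(1 for x, y in pairs if not x and not y)
    return both, only1, only2, neither


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact test on the 2x2 table [[a, b], [c, d]].

    Returns the sample odds ratio and P(X >= a) under the
    hypergeometric null with fixed margins.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    upper = min(row1, col1)
    # math.comb returns 0 when k > n, so impossible tables add nothing.
    tail = sum(comb(row1, x) * comb(row2, col1 - x) for x in range(a, upper + 1))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, tail / denom


# Illustrative flags only, not data from the source.
half1 = [True, True, True, False, False, False, True, False]
half2 = [True, True, False, False, False, True, True, False]
odds_ratio, p_value = fisher_exact_greater(*cross_tabulate(half1, half2))
print(f"OR={odds_ratio:.2f}, p={p_value:.3f}")
```

An odds ratio above 1 means that a set significant in one half is more likely than chance to be significant in the other half as well.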
Fig 1. Random bias vs. random set size.
The x axis represents the size of the random gene sets that were chosen, and the y axis represents the proportion of significant sets of that size. A. Results of the analysis using the raw data. B. Results of the analysis after adjusting for the PCNA meta-gene.
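The curve described in this caption can be sketched as a loop over set sizes. The example assumes a caller-supplied predicate, here called `is_significant`, that decides whether a gene set is significantly associated with survival; that predicate, the function names, and the default of 100 sets per size are hypothetical placeholders, since the caption does not state the survival test used.

```python
import random


def significant_fraction(genes, set_size, is_significant, n_sets=100, seed=0):
    """Fraction of random gene sets of one size flagged as significant.

    is_significant is a hypothetical callback standing in for the
    survival test, which is not specified here.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sets):
        gene_set = rng.sample(genes, set_size)  # draw without replacement
        if is_significant(gene_set):
            hits += 1
    return hits / n_sets


def bias_curve(genes, set_sizes, is_significant, n_sets=100, seed=0):
    """Map each set size (x axis) to its fraction of significant sets (y axis)."""
    return {
        size: significant_fraction(genes, size, is_significant, n_sets, seed)
        for size in set_sizes
    }


# Toy stand-in predicate: a set counts as "significant" if it contains gene g0.
genes = [f"g{i}" for i in range(200)]
curve = bias_curve(genes, [2, 8, 32, 64], lambda s: "g0" in s)
print(curve)
```

In a real analysis the predicate would wrap a survival model fitted on the expression of the sampled genes; the toy predicate only exercises the sampling loop.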
Fig 2. The effect of clustering on random bias.
Each horizontal line represents a single TCGA dataset, where the location of each dot along the x axis represents the proportion of significant sets in a single cluster, and the location of the short vertical gray line indicates the proportion of significant sets in the complete dataset. Both proportions were calculated using random sets of size N = 64. A. The analysis using the phenoClust clusters. B. The analysis using random sub-sampling of the datasets, with groups of samples of the same size as the clusters determined by phenoClust.
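The size-matched control in panel B can be sketched as follows. The example assumes the random groups are formed by shuffling the sample indices once and cutting them into blocks whose sizes match the cluster sizes; the caption does not say whether the groups were disjoint or drawn independently, so the partitioning scheme and the function name are assumptions.

```python
import random
from collections import Counter


def size_matched_random_groups(cluster_labels, seed=0):
    """Random groups of sample indices with the same sizes as the clusters.

    Assumption: groups are disjoint blocks of one shuffled index list.
    """
    rng = random.Random(seed)
    sizes = Counter(cluster_labels)  # cluster label -> number of samples
    shuffled = list(range(len(cluster_labels)))
    rng.shuffle(shuffled)

    groups, start = {}, 0
    for label, size in sizes.items():
        groups[label] = shuffled[start:start + size]
        start += size
    return groups


# Illustrative labels only: two clusters of sizes 3 and 2.
labels = ["c1", "c1", "c2", "c1", "c2"]
groups = size_matched_random_groups(labels)
print({label: len(members) for label, members in groups.items()})
```

Running the same random-set analysis on these groups separates the effect of smaller sample size from the effect of the clustering itself.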