Matan Hofree, Hannah Carter, Jason F Kreisberg, Sourav Bandyopadhyay, Paul S Mischel, Stephen Friend, Trey Ideker.
Abstract
Massively parallel sequencing has permitted an unprecedented examination of the cancer exome, leading to predictions that all genes important to cancer will soon be identified by genetic analysis of tumours. To examine this potential, here we evaluate the ability of state-of-the-art sequence analysis methods to specifically recover known cancer genes. While some cancer genes are identified by analysis of recurrence, spatial clustering or predicted impact of somatic mutations, many remain undetected due to lack of power to discriminate driver mutations from the background mutational load (13-60% recall of cancer genes impacted by somatic single-nucleotide variants, depending on the method). Cancer genes not detected by mutation recurrence also tend to be missed by all types of exome analysis. Nonetheless, these genes are implicated by other experiments such as functional genetic screens and expression profiling. These challenges are only partially addressed by increasing sample size and will likely hold even as greater numbers of tumours are analysed.
Year: 2016 PMID: 27417679 PMCID: PMC4947162 DOI: 10.1038/ncomms12096
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 17.694
Figure 1. Original experimental techniques used to identify currently known cancer genes.
(a) Shown is the cumulative number of cancer genes known to be perturbed by somatic single-nucleotide variations, as recorded in the COSMIC CGC, according to the year of first cancer-related publication indexed in PubMed. Each bar is coloured by the experimental technique categories used by these first publications. In parentheses is the number of genes associated with each experimental category as of 2013. (b) Proportion of the different types of somatic alteration included in the CGC. In blue are the proportions for all somatically altered genes; in green are the same proportions for genes also known to have single-nucleotide alterations.
Prominent methods for cancer gene discovery by somatic exome analysis.
| Method | Data type (method)* | Analysis principle | No. tissue cohorts (no. patients) | Total genes identified | Genes non-unique/unique to method | CGC non-unique/unique to method† | Ref. |
|---|---|---|---|---|---|---|---|
| MutSig Suite | SNV (WES) | Combined (frequency, function, clustering) | 21 (4,742) | 260 | 191/69 | 98/7 | |
| OncodriveFM | SNV (WES) | Function | 28 (6,792) | 426 | 281/145 | 127/31 | |
| OncodriveCL | SNV (WES) | Clustering | 28 (6,792) | 79 | 72/7 | 52/2 | |
| ActiveDriver | SNV (WES) | Clustering (+phos-associated mutations) | 12 (3,205) | 106 | 74/32 | 30/5 | |
| MuSIC | SNV (WES) | Combined (frequency, function, clustering, correlation with clinical phenotype) | 12 (3,205) | 182 | 141/41 | 81/3 | |
| Gistic2.0 - amplifications | CNV (SNP6) | Frequency | 34 (10,752) | 1,569 | 432/1,137 | 53/21 | |
| Gistic2.0 - deletions | CNV (SNP6) | Frequency | 34 (10,752) | 6,897 | 671/6,226 | 98/65 | |
| IntOGen - CNV | CNV (SNP6) | Frequency + RNA expression | 16 (4,068) | 29 | 28/1 | 25/0 | |
| Dendrix | SNV (WES) | Mutual exclusivity | 12 (3,281) | 17 | 28/2 | 23/1 | |
| HotNet2 | SNV+CNV (WES+SNP6) | Network | 12 (3,281) | 147 | 96/51 | 43/0 | |
| Fusion/translocations | FUS (RNA-seq) | Recurrent fusions | 13 (4,366) | 492 | 236/256 | 41/18 | |
| TOTALS: | | | 42 | 8,871 | 906/7,967 | 175/153 | |
*Data types: CNV, copy number variant; FUS, gene fusion; SNV, single-nucleotide variant. Methods: RNA-seq, RNA sequencing; SNP6, Affymetrix SNP array; WES, whole-exome sequencing.
†Number of genes identified within the CGC-positive reference set.
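The non-unique/unique columns can be read as set operations over the per-method gene lists. The sketch below assumes that "unique" means a gene reported by that method only and "non-unique" means a gene also reported by at least one other method; the function name and the toy gene lists are hypothetical and are not taken from the paper's data.

```python
def unique_and_shared(calls):
    """Return {method: (non_unique, unique)} gene counts.

    calls maps a method name to the set of genes it reports.
    Assumption: a gene is 'unique' if no other method reports it.
    """
    result = {}
    for method, genes in calls.items():
        # Union of the genes reported by every other method.
        others = set().union(*(g for m, g in calls.items() if m != method))
        result[method] = (len(genes & others), len(genes - others))
    return result


# Toy input (hypothetical gene lists, for illustration only).
calls = {
    "method_A": {"TP53", "KRAS", "GENE1"},
    "method_B": {"TP53", "GENE2"},
    "method_C": {"KRAS", "TP53"},
}
counts = unique_and_shared(calls)
for method, (non_unique, unique) in counts.items():
    print(f"{method}: {non_unique}/{unique}")
```

Restricting each gene set to a reference list such as the CGC before calling the function would give the corresponding CGC column under the same assumption.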
Overview of positive cancer reference sets.
| Reference set | Number of genes | Curation process | Alteration type | Somatic/germline | Ref. |
|---|---|---|---|---|---|
| CGC-Somatic | 532 | Manual | SNV, CNV, Trans/fusion | Somatic | |
| CGC-SNV | 188 | Manual | SNV | Somatic | |
| CGC-TRANS* | 327 | Manual | Trans/fusion (not SNV) | Somatic | |
| CGC-CNV* | 15 | Manual | CNV (not SNV) | Somatic | |
| CGC-Germline† | 38 | Manual | SNV, CNV, Trans/fusion | Germline (not somatic) | |
| UniprotKB | 412 | Manual | Unspecified | Both | |
| Text-mining | 711 | Automated | Unspecified | Both | |
| AGO | 1,430 | Manual | Unspecified | Both | |
*Genes altered by translocations/fusions or CNVs, respectively, but not by SNVs.
†Genes altered in germline only; excludes genes also altered somatically.
Figure 2. Performance of methods.
Heatmaps showing the (a) recall and (b) precision of each method (rows) tested against each positive cancer reference set (columns). Dashed box highlights the performance of MAIN-METHODS on the CGC-SNV reference set. To compute precision, we assume the proportion of cancer genes is 5% of all human genes; precision values for other proportions are shown in Supplementary Fig. 1 with qualitatively similar results. (c) Precision/recall plot detailing results from a and b for CGC-SNV cancer genes. (d) Summary of CGC-SNV genes curated for particular cancer tissues versus their cancer detection status based on genome analysis by four different methods and their union. (e) Count of CGC-SNV genes as a function of the number of cancer tissue types in which each gene has been detected thus far.
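Because precision depends on how common true cancer genes are, the caption fixes that proportion at 5%. One common way to apply such an assumption, shown here as a sketch and not necessarily the authors' exact procedure, is to combine a method's recall with its false-positive rate through Bayes' rule. The function name and the example rates below are hypothetical.

```python
def precision_from_rates(recall, false_positive_rate, prevalence=0.05):
    """Precision implied by recall, false-positive rate and prevalence.

    prevalence is the assumed fraction of all genes that are true
    cancer genes (5% in the caption). This Bayes-rule form is an
    assumption of this sketch, not a confirmed detail of the paper.
    """
    true_pos = recall * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    if true_pos + false_pos == 0:
        return float("nan")  # the method reports no genes at all
    return true_pos / (true_pos + false_pos)


# Hypothetical rates: 50% recall, 1% of non-cancer genes reported.
example = precision_from_rates(recall=0.5, false_positive_rate=0.01)
print(f"precision at 5% prevalence: {example:.3f}")
```

Re-running the function with other `prevalence` values illustrates why the same recall and false-positive rate can yield quite different precision estimates.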
Figure 3. Experimental support for reference cancer gene lists.
(a–c) Support for CGC cancer genes detected by any of the MAIN-METHODS for analysing tumour genomes (cancer detected) versus those cancer genes that were undetected by all of these methods (cancer undetected). Also shown is support for the AGO-NEG negative control set of non-cancer genes (likely non-cancer) and the remainder of genes in the genome-wide background (all other genes). Whisker plots indicate the mean and the 95% confidence interval of the mean. Support is evaluated using: (a) RNA-seq tumour-normal differential expression in The Cancer Genome Atlas (TCGA). (b) Number of times a gene has been identified in independent cancer genetic screens in mice. (c) Number of Project Achilles cell lines with a measured impact (top/bottom 10%) on growth as a result of shRNA knockdown. An asterisk (*) indicates that a significant difference in medians was found between the two sets. (d) The number of cancer publications by year comparing detected and undetected CGC cancer genes.
Figure 4. Power to detect recurrently mutated genes as the number of tumour exomes increases.
(a) Number of patient samples (y axis) necessary for detecting a cancer gene, as a function of the background somatic mutation rate of the tissue (x axis) and the fold increase in mutation rate of the cancer gene above this background (coloured lines). The total 10-year U.S. incidences of major cancer types are indicated (grey circles with horizontal bars), along with the number of patients currently sequenced as listed by the ICGC database v20 (dotted circles). (b) Mutated genes of a single breast adenocarcinoma patient, ranked by mutation frequency within tumours of this tissue type. (c) Same analysis showing the median behaviour for 881 The Cancer Genome Atlas (TCGA) patients with breast cancer. Mutated genes in each patient are ranked by mutation frequency; the median mutation frequency over all patients is plotted for each percentile.
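The relationship in panel a between background mutation rate, fold increase and required cohort size can be illustrated with a textbook sample-size calculation. The sketch below is not the authors' power model: it assumes a one-sided, one-sample test of proportions under a normal approximation, a per-base background rate, a hypothetical gene length of 1,500 bases, a Bonferroni-style threshold over 20,000 genes, and 90% power. All names and defaults are illustrative assumptions.

```python
import math
from statistics import NormalDist


def samples_needed(background_rate, fold_increase, gene_length=1500,
                   alpha=0.05 / 20000, power=0.9):
    """Approximate number of patients needed to detect a gene.

    background_rate: assumed per-base somatic mutation rate.
    fold_increase: mutation rate of the gene relative to background.
    Normal-approximation sample size for a one-sided test of
    proportions; an illustrative assumption, not the paper's model.
    """
    # Probability that a patient carries at least one mutation in the gene.
    p0 = 1.0 - (1.0 - background_rate) ** gene_length
    p1 = 1.0 - (1.0 - fold_increase * background_rate) ** gene_length
    if p1 <= p0:
        raise ValueError("fold_increase must be greater than 1")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    z_power = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(p0 * (1.0 - p0))
                 + z_power * math.sqrt(p1 * (1.0 - p1)))
    return math.ceil((numerator / (p1 - p0)) ** 2)


# Hypothetical background of 1 mutation per megabase.
for fold in (2, 3, 5):
    print(fold, samples_needed(1e-6, fold))
```

Under these assumptions the required cohort shrinks as the fold increase grows, which matches the qualitative point that weakly enriched genes demand far larger cohorts.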