Benjamin J Raphael, Jason R Dobson, Layla Oesper, Fabio Vandin.
Abstract
High-throughput DNA sequencing is revolutionizing the study of cancer and enabling the measurement of the somatic mutations that drive cancer development. However, the resulting sequencing datasets are large and complex, obscuring the clinically important mutations in a background of errors, noise, and random mutations. Here, we review computational approaches to identify somatic mutations in cancer genome sequences and to distinguish the driver mutations that are responsible for cancer from random, passenger mutations. First, we describe approaches to detect somatic mutations from high-throughput DNA sequencing data, particularly for tumor samples that comprise heterogeneous populations of cells. Next, we review computational approaches that aim to predict driver mutations according to their frequency of occurrence in a cohort of samples, or according to their predicted functional impact on protein sequence or structure. Finally, we review techniques to identify recurrent combinations of somatic mutations, including approaches that examine mutations in known pathways or protein-interaction networks, as well as de novo approaches that identify combinations of mutations according to statistical patterns of mutual exclusivity. These techniques, coupled with advances in high-throughput DNA sequencing, are enabling precision medicine approaches to the diagnosis and treatment of cancer.
Year: 2014 PMID: 24479672 PMCID: PMC3978567 DOI: 10.1186/gm524
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Figure 1. Somatic mutation detection in tumor samples. DNA-sequence reads from a tumor sample are aligned to a reference genome (shown in gray). Single-nucleotide differences between reads and the reference genome indicate germline single-nucleotide variants (SNVs; green circles), somatic SNVs (red circles), or sequencing errors (black diamonds). (a) In a pure tumor sample, a location containing mismatches or single-nucleotide substitutions in approximately half of the reads covering the location indicates a heterozygous germline SNV or a heterozygous somatic SNV, assuming that there is no copy number aberration at the locus. Algorithms for detecting SNVs distinguish true SNVs from sequencing errors by requiring multiple reads with the same single-letter substitution to be aligned at the position (gray boxes). (b) As tumor purity decreases, the fraction of reads containing somatic mutations decreases: cancerous and normal cells, and the reads originating from each, are shown in blue and orange, respectively. The number of reads reporting a somatic mutation decreases with tumor purity, diminishing the signal to distinguish true somatic mutations from sequencing errors. In this example, only one heterozygous somatic SNV and one heterozygous germline SNV are detected (gray boxes), as the mutation in the middle set of aligned reads is not distinguishable from sequencing errors.
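The effect of tumor purity described in the caption can be illustrated with a small sketch. It assumes a clonal somatic SNV, a simple mixture of tumor and normal cells, and a binomial model of sequencing errors; the function names, the default copy numbers, and the 1% per-base error rate are illustrative assumptions, not parameters taken from any of the tools reviewed here.

```python
from math import comb


def expected_vaf(purity: float, tumor_copies: int = 2,
                 mutant_copies: int = 1, normal_copies: int = 2) -> float:
    """Expected fraction of reads carrying a clonal somatic SNV.

    purity is the fraction of cells in the sample that are cancerous.
    The defaults describe a heterozygous SNV with no copy number aberration.
    """
    tumor_alleles = purity * tumor_copies
    normal_alleles = (1.0 - purity) * normal_copies
    return purity * mutant_copies / (tumor_alleles + normal_alleles)


def error_tail_probability(depth: int, alt_reads: int,
                           error_rate: float = 0.01) -> float:
    """P(X >= alt_reads) for X ~ Binomial(depth, error_rate).

    A small value means that seeing this many mismatching reads is
    unlikely to be explained by sequencing errors alone.
    """
    below = sum(
        comb(depth, k) * error_rate**k * (1.0 - error_rate) ** (depth - k)
        for k in range(alt_reads)
    )
    return max(0.0, 1.0 - below)


if __name__ == "__main__":
    for purity in (1.0, 0.4, 0.1):
        print(purity, round(expected_vaf(purity), 3))
    # One mismatching read out of 30 is easily explained by errors;
    # three reads with the same substitution are much harder to explain.
    print(error_tail_probability(30, 1), error_tail_probability(30, 3))
```

Under these assumptions a pure sample gives an expected variant fraction of 0.5, while a sample with 40% tumor cells gives 0.2, so at fixed read depth fewer supporting reads are available to separate the mutation from errors.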
Table 1. Methods for detecting somatic mutations
| Category | Mutation type | Software | Description |
|---|---|---|---|
| Somatic mutation detection | SNV | MuTect | Designed to detect low-frequency mutations in both whole-genome and exome data. |
| | | Strelka | Can be applied to both whole-genome and whole-exome data. Uses stringent post-call filtration. |
| | | VarScan 2 | Demonstrates high sensitivity for detecting SNVs in relatively pure tumor samples from both whole-genome and exome data. |
| | | JointSNVMix | A probabilistic model that describes the observed allelic counts in both tumor and normal samples. |
| | CNA or SV | BIC-Seq | Detects CNAs from whole-genome data. |
| | | APOLLOH | Predicts loss of heterozygosity regions from whole-genome sequencing data. |
| | | CoNIFER | Detects CNAs from exome data. |
| | | BreakDancer | Clusters paired-end alignments to detect SVs. One version detects large aberrations and another detects smaller indels. |
| | | VariationHunter-CommonLaw | Clusters paired reads, including reads with multiple possible alignments. Supports simultaneous analysis of multiple samples. |
| | | GASV/GASVPro | Combines paired-read and read-depth analysis to detect SVs. |
| | | Meerkat | Combines paired-end, split-read, and multiple alignment information to detect structural aberrations. |
| | | Delly | Combines paired-end and split-read signals to detect structural aberrations. |
| Tumor purity estimation | SNV | ABSOLUTE | Originally designed for SNP array data, but may be adapted for whole-genome sequencing data. Handles subclonal populations as outliers. |
| | | ASCAT | Designed for SNP array data, but may be adapted for whole-genome sequencing data. Only considers a single tumor population. |
| | CNA | THetA | Able to consider multiple subclonal tumor populations, but only if they differ by large CNAs. Designed for whole-genome sequencing data. |
| | | SomatiCA | Only uses aberrations that are identified as clonal to estimate tumor purity. |
CNA, copy number aberration; SNV, single-nucleotide variant; SV, structural variant.
A representative list of software available for the detection of somatic mutations from high-throughput sequencing data of cancer genomes. Some methods detect more than one type of mutation but are listed only once for clarity.
Figure 2. Overview of strategies for cancer-genome sequencing. A cancer-genome sequencing project begins with whole-genome or whole-exome sequencing. Various methods are used to detect somatic mutations in the resulting sequence (see Table 1), yielding a long list of somatic mutations. Several strategies can then be employed to prioritize these mutations for experimental or functional validation. These strategies include testing for recurrent mutations, predicting functional impact, and assessing combinations of mutations (see Table 2). None of these approaches is perfect, and each returns a subset of driver mutations as well as passenger mutations. The mutations returned by these approaches can then be validated using a variety of experimental techniques.
Table 2. Methods for prediction of driver mutations and genes
| Category | Subcategory | Software | Description |
|---|---|---|---|
| Recurrent somatic mutation identification | SNV | MutSigCV | Uses coverage information and genomic features (e.g. DNA replication time) to estimate the background mutation rate of a gene. |
| | | MuSiC | Uses a per-gene background mutation rate; allows for user-defined regions of interest. |
| | | Youn | Includes predicted impact on protein function in determining recurrent mutations. |
| | | Sjöblom | Defines a cancer mutation prevalence score for each gene. |
| | | DrGaP | Uses a Bayesian approach to estimate the background mutation rate; helpful for cancer types with a low mutation rate. |
| | CNA | GISTIC2 | Uses ‘peel-off’ techniques to find smaller recurrent aberrations inside larger aberrations. |
| | | CMDS | Identifies recurrent CNAs from unsegmented data. |
| | | ADMIRE | Multi-scale smoothing of copy number profiles. |
| Functional impact prediction | General | SIFT | Uses conservation of amino acids to predict functional impact of a non-synonymous amino-acid change. |
| | | Polyphen-2 | Infers functional impact of non-synonymous amino-acid changes through alignments of related peptide sequences and a machine-learning-based probabilistic classifier. |
| | | MutationAssessor | Uses protein homologs to calculate a score based on the divergence in conservation caused by an amino-acid change. |
| | | PROVEAN | Benchmarks favorably against MutationAssessor, Polyphen-2 and SIFT. |
| | Cancer-specific | CHASM | Uses a machine-learning approach to classify mutations as drivers or passengers based on sequence conservation, protein domains, and protein structure. |
| | | Oncodrive-FM | Combines scores from SIFT, Polyphen-2, and MutationAssessor into a single ranking. |
| | Positional or structural clustering | NMC | Finds clusters of non-synonymous mutations across patients. Typically used with missense mutations to detect so-called ‘activating’ mutations. |
| | | iPAC | Extends the NMC approach to search for clusters of mutations in three-dimensional space using crystal structures of proteins. |
| Pathway analysis and combinations of mutations | Known pathways | GSEA | A general technique for testing ranked lists of genes for enrichment in known gene sets. Can be used on rankings derived from significance of observed mutations. |
| | | PathScan | Finds pathways with excess of mutations in a gene set (pathway), by combining |
| | | Patient-oriented gene sets | Tests known pathways using a binary indicator for a pathway in each patient. |
| | Interaction networks | NetBox | Finds network modules in a user-provided list of genes. Significance depends only on the topology of the genes in the network, and not on mutation scores. |
| | | HotNet | Finds subnetworks with significantly more aberrations than would be expected by chance, using both network topology and user-defined gene or protein scores. |
| | | MEMo | Finds subnetworks whose interacting pairs of genes have mutually exclusive aberrations |
| | | Dendrix | Identifies groups of genes with mutually exclusive aberrations. |
| | | Multi-Dendrix | Simultaneously finds multiple groups of genes with mutually exclusive aberrations. |
| | | RME | Finds groups of genes with mutually exclusive aberrations by building from gene pairs; best results obtained when restricting to genes with high mutation frequencies (e.g. |
CNA, copy number aberration; SNV, single-nucleotide variant.
A representative list of software available to predict driver mutations or genes by detecting their recurrence across multiple samples, functional impact, or interactions with other mutations in pathways or combinations. Some methods fall into multiple categories but are listed only once for clarity.
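Several of the methods listed above search for groups of genes with mutually exclusive aberrations. The text does not specify how each tool scores a candidate group, so the sketch below uses one simple, assumed score: the number of samples covered by the group minus the number of overlapping mutations. The gene and sample names are hypothetical.

```python
def coverage_and_overlap(mutated_samples: dict[str, set[str]],
                         genes: list[str]) -> tuple[int, int]:
    """Return (coverage, overlap) for a candidate gene group.

    coverage: number of samples with a mutation in at least one gene.
    overlap:  number of extra mutations beyond one per covered sample,
              which is zero when the group is perfectly exclusive.
    """
    covered = set().union(*(mutated_samples[g] for g in genes))
    total_mutations = sum(len(mutated_samples[g]) for g in genes)
    return len(covered), total_mutations - len(covered)


def exclusivity_score(mutated_samples: dict[str, set[str]],
                      genes: list[str]) -> int:
    """Assumed score: reward coverage, penalize co-occurring mutations."""
    coverage, overlap = coverage_and_overlap(mutated_samples, genes)
    return coverage - overlap


if __name__ == "__main__":
    # Hypothetical data: gene -> set of samples in which it is mutated.
    mutated = {
        "GENE_A": {"s1", "s2"},
        "GENE_B": {"s3", "s4"},
        "GENE_C": {"s1", "s2", "s3"},
    }
    print(exclusivity_score(mutated, ["GENE_A", "GENE_B"]))  # exclusive pair
    print(exclusivity_score(mutated, ["GENE_A", "GENE_C"]))  # overlapping pair
```

A score of this kind favors groups that are mutated in many samples but rarely mutated more than once in the same sample. Searching over all possible groups is the expensive part and is not shown.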
Figure 3. Overview of approaches to predict driver mutations. (a) Recurrent mutations that are found in more samples than would be expected by chance are good candidates for driver mutations. To identify such recurrent mutations, a statistical test is performed (see Table 2), which usually collapses all of the non-synonymous mutations in a gene into a binary mutation matrix that indicates the mutation status of a gene in each sample. (b) Assessing combinations of mutations overcomes some limitations of single-gene tests of recurrence. Three approaches to identify combinations of driver mutations are: (1) to identify recurrent mutations in predefined groups (such as pathways and protein complexes from databases); (2) to identify recurrent mutations in large protein-protein interaction networks; (3) de novo identification of combinations, without relying on a priori definition of gene sets. These approaches sequentially decrease the amount of prior information in the gene sets that are tested, thus allowing the discovery of novel combinations of driver mutations. However, the decrease in prior knowledge comes at the expense of a steep increase in the number of hypotheses considered, posing computational and statistical challenges. Different methods to identify combinations of driver mutations lie on different positions of the spectrum that represents the trade-off between prior knowledge and number of hypotheses tested.
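The binary mutation matrix and single-gene recurrence test in panel (a) can be sketched as follows. This is a minimal illustration, assuming a single fixed background probability per gene and a one-sided binomial test; the tools in Table 2 estimate the background rate in more elaborate ways, and the function names and example calls here are hypothetical.

```python
from math import comb


def binary_mutation_matrix(calls, samples, genes):
    """Collapse (sample, gene) mutation calls into a 0/1 matrix.

    Rows follow `samples`, columns follow `genes`. Several non-synonymous
    mutations in the same gene and sample collapse to a single 1.
    """
    mutated = set(calls)
    return [[1 if (s, g) in mutated else 0 for g in genes] for s in samples]


def recurrence_p_value(n_mutated: int, n_samples: int,
                       background_prob: float) -> float:
    """P(X >= n_mutated) for X ~ Binomial(n_samples, background_prob).

    background_prob is the assumed chance that a sample carries at least
    one passenger mutation in the gene.
    """
    below = sum(
        comb(n_samples, k) * background_prob**k
        * (1.0 - background_prob) ** (n_samples - k)
        for k in range(n_mutated)
    )
    return max(0.0, 1.0 - below)


if __name__ == "__main__":
    samples = ["s1", "s2", "s3"]
    genes = ["GENE_A", "GENE_B"]
    calls = [("s1", "GENE_A"), ("s1", "GENE_A"),
             ("s2", "GENE_A"), ("s3", "GENE_B")]
    matrix = binary_mutation_matrix(calls, samples, genes)
    for j, gene in enumerate(genes):
        n_mutated = sum(row[j] for row in matrix)
        print(gene, n_mutated, recurrence_p_value(n_mutated, len(samples), 0.05))
```

Because one test is run per gene, the resulting p-values would still need a multiple-testing correction before any gene is reported as significantly mutated.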
Figure 4. Overview of the HotNet algorithm. HotNet [102] uses a heat-diffusion process to identify significantly mutated subnetworks within an interaction network. (a) Heat is assigned to each gene according to the proportion of samples containing a single-nucleotide variant (SNV) or copy number aberration (CNA) in the gene. (b) The initial heat then spreads on the edges of the network for a fixed amount of time. Removing cold edges connecting genes that do not exchange large amounts of heat breaks the network into smaller subnetworks. (c) HotNet assesses the number and size of the resulting subnetworks using a two-stage statistical test.
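Steps (a) and (b) of the caption can be illustrated with a toy sketch. This is not the HotNet implementation: it assumes an unweighted undirected graph, approximates diffusion with small explicit time steps, and uses an assumed definition of the heat exchanged across an edge. The statistical test in step (c) is omitted, and the graph, heat values, and threshold are hypothetical.

```python
def diffuse_from(graph, source, time=0.5, steps=100):
    """Approximate how unit heat placed on `source` spreads after `time`."""
    dt = time / steps
    heat = {node: 0.0 for node in graph}
    heat[source] = 1.0
    for _ in range(steps):
        # Net flow into each node is the sum of differences with neighbors.
        flow = {
            node: sum(heat[nb] - heat[node] for nb in graph[node])
            for node in graph
        }
        heat = {node: heat[node] + dt * flow[node] for node in graph}
    return heat


def hot_subnetworks(graph, initial_heat, threshold, time=0.5):
    """Drop cold edges, then return connected components with >= 2 genes."""
    kernel = {node: diffuse_from(graph, node, time) for node in graph}
    kept = {node: set() for node in graph}
    for u in graph:
        for v in graph[u]:
            # Assumed measure: heat sent from u to v plus heat sent from v to u.
            exchanged = (initial_heat[u] * kernel[u][v]
                         + initial_heat[v] * kernel[v][u])
            if exchanged >= threshold:
                kept[u].add(v)
                kept[v].add(u)
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(kept[node] - component)
        seen |= component
        if len(component) > 1:
            components.append(component)
    return components


if __name__ == "__main__":
    # Hypothetical path network: two frequently mutated pairs joined by
    # two genes that are never mutated.
    graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
             "d": {"c", "e"}, "e": {"d", "f"}, "f": {"e"}}
    heat = {"a": 0.5, "b": 0.5, "c": 0.0, "d": 0.0, "e": 0.5, "f": 0.5}
    print(hot_subnetworks(graph, heat, threshold=0.2))
```

In this toy network the edges touching the unmutated genes carry too little heat to survive the threshold, so the two mutated pairs are returned as separate subnetworks.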