
Comprehensive Outline of Whole Exome Sequencing Data Analysis Tools Available in Clinical Oncology.

Abstract

Whole exome sequencing (WES) enables the analysis of all protein coding sequences in the human genome. This technology enables the investigation of cancer-related genetic aberrations that are predominantly located in the exonic regions. WES delivers high-throughput results at a reasonable price. Here, we review analysis tools enabling utilization of WES data in clinical and research settings. Technically, WES initially allows the detection of single nucleotide variants (SNVs) and copy number variations (CNVs), and data obtained through these methods can be combined and further utilized. Variant calling algorithms for SNVs range from standalone tools to machine learning-based combined pipelines. Tools for CNV detection compare the number of reads aligned to a dedicated segment. Both SNVs and CNVs help to identify mutations resulting in pharmacologically druggable alterations. The identification of homologous recombination deficiency enables the use of PARP inhibitors. Determining microsatellite instability and tumor mutation burden helps to select patients eligible for immunotherapy. To pave the way for clinical applications, we have to recognize some limitations of WES, including its restricted ability to detect CNVs, low coverage compared to targeted sequencing, and the missing consensus regarding references and minimal application requirements. Recently, Galaxy became the leading platform in non-command line-based WES data processing. The maturation of next-generation sequencing is reinforced by Food and Drug Administration (FDA)-approved methods for cancer screening, detection, and follow-up. WES is on the verge of becoming an affordable and sufficiently evolved technology for everyday clinical use.


Keywords: bioinformatics; cancer; whole exome sequencing

Year: 2019 PMID: 31690036 PMCID: PMC6895801 DOI: 10.3390/cancers11111725

Source DB: PubMed Journal: Cancers (Basel) ISSN: 2072-6694 Impact factor: 6.639

1. Introduction

In the last decade, the price of genome sequencing has shrunk significantly, most of the work has become automated, and preparation guidelines have evolved. Due to these achievements, sequencing a whole genome has become a readily available possibility. Sequencing only targeting regions or the exome, however, implies a significantly smaller financial burden. In whole exome sequencing (WES), we primarily target specific fragments of the genome, the protein-coding part, and we therefore are able to identify genetic variants that will affect proteins. Since most of the known disease-causing mutations fall into this category, exome sequencing is a method that significantly reduces sequencing costs and therefore represents a clinically feasible approach for patient diagnostics. In this paper, we provide a summary of bioinformatic methods addressing the detection of the most frequent genetic aberrations influencing the development and progression of cancer. Cancer is characterized by a set of essential steps that each renegade cell has to master before it can evolve to cancer [1]. The multitude of experimental methods that are at hand to investigate these cancer hallmarks have been systematically reviewed recently [2]. Whole exome sequencing provides a versatile tool to simultaneously monitor multiple different genomic changes in the tumor tissue. Mutations in both coding and noncoding DNA sequence regions have proven to be influential in the development of cancer [3,4]. Nucleic acid changes in the exome can result in amino acid changes in protein sequences. Amino acid changes lead to weakened activity of tumor suppressors, such as APC in colorectal cancer, VHL in renal cell cancer, or BRCA in breast cancer [5,6,7]. Copy number changes in cell cycle regulators, such as TP53 and RB1 [8], as well as malfunctions in repair mechanisms including the homologous recombination and DNA mismatch repair systems, predispose cells to cancer development. 
The activity of these repair systems can be monitored by measuring tumor mutational burden or microsatellite instability [9,10].

2. First Steps of Whole Exome Sequencing

At present, there are two main categories of next-generation sequencing (NGS) methods, consisting of DNA amplification-based sequencing (Illumina, Ion Torrent) and single molecule real-time sequencing (Pacific Biosciences, Oxford Nanopore). The investigated tissue samples can be freshly frozen, formalin-fixed and paraffin-embedded (FFPE), or liquid-based (blood sample); typically, each of these samples has its own isolation kits. A critical initial step of NGS is adequate pathological examination, as a properly selected and dissected tissue sample is a necessity for any further investigation [11]. Samples should contain a sufficient proportion of tumor cells to differentiate germline and somatic mutations. DNA from an adjacent normal tissue or from a blood sample is needed to identify all germ-line mutations. DNA quality deteriorates with time and after FFPE conservation, which has a degrading effect on the DNA. As the fragmentation of the DNA increases, the genome assembly following sequencing becomes more challenging [12]. During library construction, the exons are captured after an initial fragmentation step. Exome capture can be microarray-based or magnetic-bead based. In this second case, specific probes are hybridized to the sample, which are then pulled out using the magnetic beads. Then, the intronic sequences are discarded, and sequencing is performed using all the exonic sequences. The magnetic-bead-based capture methods are more widespread due to their simplicity [13]. To reach sufficient depth of coverage, properly capturing the targeted regions is necessary. Overall, currently used technologies deliver high efficiency [14]. Actual sequencing comes following exome capture and PCR amplification. The overall process of WES, including data processing and utilization, is summarized in Figure 1.

Figure 1

From tissue to data: steps of whole exome sequencing. Tissue preprocessing starts with the identification of tumor regions by an experienced pathologist, followed by DNA extraction, library construction, and amplification. Data processing commences with a quality check of the reads. If the quality of the trimmed reads is sufficient, the reads are aligned to a reference genome. Once the Binary Alignment Map (BAM) files have been processed, single nucleotide variants, insertions and deletions, and copy number variants are called using one or more of the numerous existing algorithms. The data can be further utilized to detect microsatellite instability status, intratumor heterogeneity, tumor mutational burden, and homologous recombination deficiency.

Usually, data processing starts with quality control and trimming, during which low-quality reads are removed. This step is followed by alignment of the reads to a chosen reference genome, a second quality check, and removal of duplicate reads. After these steps, the workflow branches at variant calling, where a plethora of tools are available depending on the clinical question one is attempting to answer.
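The trimming step described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming Phred+33 encoded FASTQ quality strings and a simple 3'-end cutoff; the function names `phred_scores` and `trim_read` and the thresholds are hypothetical, and dedicated trimming tools apply more elaborate rules (adapter removal, sliding windows).

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_read(sequence, quality_string, min_quality=20, min_length=30):
    """Trim low-quality bases from the 3' end; return None if the read gets too short."""
    scores = phred_scores(quality_string)
    end = len(scores)
    # Walk back from the 3' end while the base quality is below the threshold.
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    if end < min_length:
        return None  # read is discarded
    return sequence[:end], quality_string[:end]


# Toy reads (sequence, quality); 'I' encodes Phred 40 and '#' encodes Phred 2.
reads = [
    ("ACGTACGTAC", "IIIIIIII##"),
    ("ACGTACGTAC", "##########"),
]
for seq, qual in reads:
    print(trim_read(seq, qual, min_quality=20, min_length=5))
```

In this toy input the first read loses its two low-quality terminal bases, while the second read is dropped entirely because nothing of sufficient quality remains.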

3. Short Nucleotide Variants

Whole exome sequencing is capable of delivering information for all protein-coding regions of the genome, which makes it a useful tool to identify germline and somatic mutations from a tumor sample (Figure 2). Compared to targeted sequencing, WES has the advantage of being able to elucidate the whole exome profile of a sample and to provide information on those low-frequency mutations that can collectively ground a complex phenotypic appearance [15]. Single nucleotide variants are able to increase the expression of key druggable targets, as has been suggested in lung [16], breast [17], colon [18], and gastric cancer [19].

Figure 2

Effects of sequence alterations. Sequence variants in regulatory regions can activate or inhibit transcription. Mutations in exons result in an altered mRNA. Repair mechanisms, such as nonsense-mediated mRNA decay (NMD), can eliminate such abnormal mRNAs. As a result, missense mutations cause amino acid changes, while synonymous mutations result in the original amino acid sequence. Premature stop codons result in terminated amino acid sequences. Base insertions or deletions lead to frameshift mutations resulting in completely different proteins.
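As a rough illustration of the coding-sequence categories in Figure 2, the following Python sketch classifies a single-base substitution within one codon. It assumes the standard genetic code and a codon given in frame on the coding strand; the function name `classify_substitution` is hypothetical, and regulatory variants, splicing effects, and NMD are outside its scope.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks a stop codon.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon, position, alt_base):
    """Classify a single-base change at `position` (0-2) of an in-frame codon."""
    mutated = codon[:position] + alt_base + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    if ref_aa == alt_aa:
        return "synonymous"   # amino acid sequence unchanged
    if alt_aa == "*":
        return "nonsense"     # premature stop codon
    if ref_aa == "*":
        return "stop-lost"    # original stop codon removed
    return "missense"         # amino acid change


print(classify_substitution("GAA", 2, "G"))  # GAA -> GAG
print(classify_substitution("GAA", 1, "T"))  # GAA -> GTA
print(classify_substitution("CAG", 0, "T"))  # CAG -> TAG
```

Insertions and deletions are not handled here, because a frameshift changes every downstream codon rather than a single one.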

Accurate variant calling is a crucial component in the identification of such short variants. Currently, the most common variant caller tools in use include MuTect [20], VarScan2 [21], SomaticSniper [22], Strelka [23], and FreeBayes [24]. In addition, several clinical studies used a combination of these applications for variant calling [25,26,27,28,29,30,31,32,33,34,35]. A comprehensive list of all available tools is presented in Table 1, and the most common tools are presented in Figure 3.

Table 1

Bioinformatic methods available for single nucleotide variant calling. Tools marked with an asterisk (*) are suitable for both whole genome sequencing (WGS) and whole exome sequencing (WES) data analysis.

Name	Published	Cited in 2018	Control Needed	InDel detection	Contamination Correction	Trained on Cancer Data	Environment	Ref
Varscan2	2012	2229	+	+	−	+	Java, Perl, R, Galaxy	[21]
MuTect2 *	2013	2005	+	−	+	+	Java, R	[20]
FreeBayes	2012	1121	−	+	−	+	C, C++, Galaxy	[24]
Strelka *	2012	759	+	+	−	+	C++, Perl	[23]
Platypus *	2014	462	−	+	−	+	C, Cython, Python	[36]
SomaticSniper *	2012	373	+	−	−	+	C, Galaxy	[22]
LoFreq *	2012	349	−	+	+	+	Python	[37]
VarDict *	2016	171	−	+	−	+	Perl	[38]
JointSNVMix *	2012	160	+	−	−	+	C, C++, Python, Galaxy	[39]
MutationSeq *	2012	108	+	−	−	+	C++, Python	[40]
EBCall *	2013	85	+	+	−	+	C++, Perl, R, Shell	[41]
MuSE *	2016	65	+	−	+	+	C, C++	[42]
RADIA	2014	53	+	−	+	+	Python	[43]
Virmid	2013	49	+	−	+	+	Java	[44]
deepSNV *	2014	47	+	−	−	+	R	[45]
Shimmer *	2013	45	+	−	+	+	C, Perl, R	[46]
qSNP *	2013	40	+	−	+	−	Java	[47]
BAYSIC	2014	39	+	−	−	+	R	[48]
SomaticSeq *	2015	38	+	+	−	+	Python, R	[49]
CaVEMan *	2016	31	+	−	+	+	C	[50]
SNooPer *	2016	26	−	+	+	+	Perl	[51]
SNVSniffer *	2016	17	−	+	−	+	C++	[52]
HapMuC	2014	15	−	+	−	+	C++, Python, Ruby	[53]
FaSD-somatic	2014	13	−	−	−	+	C, C++	[54]
LocHap *	2016	8	+	+	+	+	g++ compiler, GNU Make	[55]
LoLoPicker *	2017	6	+	−	+	+	Python	[56]

Figure 3

Overview of the most common methods for aberration detection useful in cancer diagnostics.

According to a comparative analysis [57], selection of the right variant caller algorithm depends on the variants of interest. Some tools excel when dealing with low-coverage data (SomaticSniper [22], FaSD-somatic [54], and SNVSniffer [52]), while others perform better when analyzing low-frequency variants in high-coverage data (Strelka [23], MuTect [20], LoFreq [37], EBCall [41], deepSNV [45], LoLoPicker [56], and MuSE [42]). Other investigations also supported the approach of using specific variant callers: VarScan identified more high-quality single nucleotide variants (SNVs), while MuTect performed better in the detection of low-quality ones; therefore, the combined use of these tools can provide improved accuracy [58]. Examination of data from five breast cancer patients with nine variant caller algorithms affirmed the discrepant effect of coverage variability on the results [59]. Comparison of the four most frequently used applications (MuTect2, Strelka, VarScan2, and SomaticSniper) led to comparable results [60]. Each caller delivered a divergent outcome, although MuTect2 and Strelka outperformed VarScan2 and SomaticSniper in some cases. In the end, the authors concluded that combining tools could increase performance, but at the cost of discarding a large number of detected calls [60]. Similar conclusions about complementary algorithms were drawn in another study evaluating four variant callers using whole exome sequencing and simulated data [61]. These researchers also noted differences depending on the aligner used. A further study also underlined the importance of pairing the aligner and the variant caller appropriately and recommended the BWA-MEM aligner with SAMtools for SNP calling and BWA-MEM with the GATK HaplotypeCaller for indel detection [62].
It is important to note that in most comparative studies, the authors used the default settings of the tools; thus, for several methods, the performance might be improved by fine tuning and customization of filters.

4. Integrated Tools

Overall, different algorithms produce divergent results. Combined pipelines can filter out false positive hits and provide a platform for customizing variant calling to the designated research objective. Applications developed to deliver consensus Variant Call Format (VCF) files include VCFtools [63], NGS-pipe [64], VariantTools [65], vcfr [66], and myVCF [67]. These tools are notably useful when one aims to build pipelines that analyze VCF files generated by other tools (listed in the previous section). Other algorithms, such as Cake, can use BAM files as inputs. Cake runs each variant caller separately and then unites the SNVs confirmed by at least two of the callers. Cake also offers numerous postprocessing filtering options [68]. Isma, an R package for the integrative analysis of mutations detected by multiple pipelines, provides a common platform for Strelka, MuTect/MuTect2, MuSE, SomaticSniper, and VarScan2. Isma evaluates the calling algorithms used and highlights outlier results [69]. Using machine learning methods might further improve the specificity, sensitivity, and comparability of these applications. BAYSIC integrates, among others, FreeBayes, SAMtools, and GATK, and it can accept input from any variant caller algorithm [48]. SomaticSeq merges five algorithms (MuTect, VarScan2, SomaticSniper, JointSNVMix2, and VarDict), providing another machine learning-based ensemble application for SNV and indel identification [49]. SMuRF is another machine learning-based pipeline combining MuTect2, FreeBayes, VarDict, and VarScan. Its main advantage is computing speed: while SMuRF outperformed several methods, it showed slightly poorer results than SomaticSeq, but it needed only a fraction of the computing time (10 min vs. 24 h) [70].
NeoMutate, a recently developed framework, also has the advantages of a mixture of separate tools and a machine learning-based perspective [71]. The application of machine learning ensemble methods has become increasingly accepted and shows a possible path for the development of future variant calling methods. However, currently implemented tools have an important drawback, as their sensitivity depends on that of the included algorithms.
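The consensus idea behind such combined pipelines can be sketched as a simple vote over call sets. The following Python example is only a minimal illustration in the spirit of the "confirmed by at least two callers" rule mentioned above; it is not the implementation of Cake or of any other cited tool, and the function name `consensus_variants`, the caller labels, and the variant coordinates are made up for the example.

```python
from collections import Counter


def consensus_variants(calls_by_caller, min_support=2):
    """Return the variants reported by at least `min_support` callers."""
    support = Counter()
    for variants in calls_by_caller.values():
        # A set guards against a caller listing the same variant twice.
        support.update(set(variants))
    return sorted(v for v, n in support.items() if n >= min_support)


# Variants as (chromosome, position, ref, alt); all values are illustrative.
calls = {
    "caller_a": [("chr17", 7577120, "C", "T"), ("chr13", 32914438, "T", "G")],
    "caller_b": [("chr17", 7577120, "C", "T")],
    "caller_c": [("chr13", 32914438, "T", "G"), ("chr1", 115256529, "T", "C")],
}
print(consensus_variants(calls))
print(consensus_variants(calls, min_support=3))
```

Raising `min_support` removes more false positives but also discards true variants that only one sensitive caller detects, which mirrors the drawback noted above: an ensemble cannot be more sensitive than the algorithms it contains.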

5. Galaxy—An Open Source, Web-Based Platform

To use the applications discussed above, one has to possess advanced or at least intermediate programming skills, not to mention that many of these algorithms require different programming languages. Numerous user-friendly platforms have been established in recent years to overcome this obstacle. Generally, these provide an environment in which users can build workflows from genomic analysis tools. Researchers can use local workflow management systems like Taverna [72] or KNIME [73]. However, computing power is then limited by the performance of the local computer. Cloud computing can serve as a possible solution for this issue [74]. Platforms like Cancer Genomics Cloud (CGC) [75], GenePattern [76], or Galaxy [77] are becoming more and more popular amongst scientists. Additional platforms are listed in Table 2. Of these tools, Galaxy is the most widespread, due to the wide range of tools included and its free availability. Users can utilize publicly available Galaxy servers or set up their own private server.

Table 2

Platforms available for bioinformatic analysis.

Name	Description	Year	Citation	License	System type	Ref.
Galaxy	Open-source web-platform with several analysis tools	2005	1977	free	cloud-based	[77]
GenePattern	Workflow management system, provides access to multiple genomic analysis tools	2006	1573	free	cloud-based	[76]
KNIME	Software enabling creation, analysis, and visualization of data	2008	1476	free	local installation needed	[73]
UGENE	Workflow management system installed on a local computer	2012	876	free	local installation needed	[78]
Taverna	Open source software tool for designing and executing workflows	2013	643	free	local installation needed	[72]
Cancer Genomics Cloud	Provides access to data, tools, and computing resources	2017	32	commercial	cloud-based	[75]
SciApps	Platform for building, running, and sharing scientific workflows	2018	5	free	cloud-based	[79]
Terra	Bioinformatic workspace, including a repository of public best practices, methods, and public data sets	−	−	commercial	cloud-based	−

When setting up a private server, one can include any of more than 5500 tools and algorithms from the Galaxy toolshed, which serves as an “AppStore” of applications [80]. However, establishing a private server requires constant maintenance and a skilled system administrator. Using a publicly available server, on the other hand, requires only a registration to the designated server, and the leading Galaxy servers already contain most commonly used tools. In addition to accessible research, Galaxy also has two additional important advantages: it makes it easier to reproduce analyses and provides a platform for users to communicate. In regard to variant calling, Galaxy ToolShed provides numerous algorithms. The Galaxy training materials suggest a few recommended tools: VarScan for the identification of germline and somatic variants from tumor-normal sample pairs and FreeBayes for germ line variant calling. As the clinical significance of variant caller methods expands, demands are increasing to solve specific problems. These problems include the detection of low-frequency variants—one possible solution could be utilization of unique molecule identifiers—and the accommodation of non-Illumina platforms. The perpetual improvement of the algorithmic tools is foreseeable if they want to compete with deep learning algorithms [57]. On the other hand, it is important to note that even the most well-established pipelines can be inefficient if the quality of utilized data is poor, e.g., inadequate exome capture, low coverage or modest sequencing quality [62].

6. Copy Number Variations

Copy number variations (CNVs) are structural changes of DNA, ranging in size from a couple of hundred bases to amplifications or deletions of millions of base pairs [81]. The clinical relevance of CNVs in oncology has risen in the past several years, and CNVs have been indicated to be important in several types of cancer, such as adenomatous polyposis coli, familial breast cancer, and ovarian cancer [8]. The clinically used gold standards for CNV detection are array Comparative Genome Hybridization (aCGH), Fluorescent In Situ Hybridization (FISH), and qPCR [82]. Current Food and Drug Administration (FDA)-approved methods for CNV detection are mainly FISH-based, such as the “Dako TOP2A FISH PharmDX kit” for the detection of Topoisomerase 2-alpha aberrations, or rely on targeted sequencing, such as the “FoundationOneCDx” NGS panel, which is capable of measuring the copy number changes in 324 genes. Each of the gold standard techniques is relatively inexpensive and provides reliable clinical data. Nonetheless, sequencing can provide a substantial amount of additional data with versatile further utility. Using whole genome sequencing (WGS) data for CNV detection has already been demonstrated to be useful [83]. However, due to its cost, WGS is unlikely to become a clinical tool in the near future. WES, on the other hand, is a more affordable option to identify CNV changes. Currently, dozens of algorithms and pipelines exist to detect CNVs from WES data; we have summarized these in Table 3, and the most common tools are listed in Figure 3. Most of the algorithms are based on the read depth approach, and they attempt to measure the CNV changes based on the number of reads aligned to a dedicated segment [84]. Although these algorithms can be relatively precise, normalization problems and other biases remain as limitations of NGS technology.
These limitations include contamination with normal cells, multiple clones within one sample, and other experimental noise [85]. Only a few of the methods are capable of detecting CNVs from cancer data, and substantial discrepancies can be observed when comparing these tools. Although several studies have been conducted to compare these applications, only a few have focused on patients suffering from cancer as the study population.
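The read depth approach can be illustrated with a small Python sketch that compares tumor and matched-normal read counts per segment. It is a simplified example under stated assumptions: library-size normalization only (no GC-content, purity, or ploidy correction), a pseudocount to avoid division by zero, and arbitrary log2 thresholds of ±0.3. The function names `log2_ratios` and `classify` and the segment labels are hypothetical and do not correspond to any tool in Table 3.

```python
import math


def log2_ratios(tumor_counts, normal_counts, pseudocount=0.5):
    """Per-segment log2(tumor/normal) after scaling by total library size."""
    tumor_total = sum(tumor_counts.values())
    normal_total = sum(normal_counts.values())
    ratios = {}
    for segment, tumor_reads in tumor_counts.items():
        normal_reads = normal_counts[segment]
        tumor_fraction = (tumor_reads + pseudocount) / tumor_total
        normal_fraction = (normal_reads + pseudocount) / normal_total
        ratios[segment] = math.log2(tumor_fraction / normal_fraction)
    return ratios


def classify(ratio, gain=0.3, loss=-0.3):
    """Label a segment using assumed, illustrative log2 thresholds."""
    if ratio >= gain:
        return "gain"
    if ratio <= loss:
        return "loss"
    return "neutral"


# Illustrative read counts per captured segment.
tumor = {"exon_A": 200, "exon_B": 100, "exon_C": 50}
normal = {"exon_A": 100, "exon_B": 100, "exon_C": 100}
for segment, ratio in log2_ratios(tumor, normal).items():
    print(segment, round(ratio, 2), classify(ratio))
```

Note that exon_B has identical raw counts in both samples yet receives a slightly negative ratio after library-size scaling. This is a small instance of the normalization problem described above: a large gain elsewhere shifts the baseline for every other segment.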

Table 3

Computational methods available for copy number variation estimation from whole exome sequencing data. Tools marked with an asterisk are suitable for both WGS and WES data analysis.

Name	Published	Control Needed	Contamination Correction	GC-Content Correction	Trained on Cancer Data	Cited in 2018	Environment	Ref.
Varscan2	2012	+	−	−	+	2229	Java, Perl, R, Galaxy	[21]
CNVnator	2011	+	−	+	−	767	C++	[86]
CNV-Seq	2009	+	−	−	−	463	Perl, R	[87]
CoNIFER	2012	−	+	−	−	378	Python	[88]
Control-FREEC *	2012	−	+	+	+	342	C, C++, R	[89]
ExomeCNV	2011	+	+	−	+	338	R	[90]
XHMM	2012	−	+	+	+	322	C++	[91]
ExomeDepth	2012	+	−	+	−	264	R	[92]
cn.MOPS	2012	−	+	+	−	249	R	[93]
Cnvkit *	2016	+	+	+	+	219	Python, Galaxy	[94]
CONTRA	2012	−	−	+	−	194	Python, R	[95]
Sequenza *	2015	+	−	+	+	167	Python, R	[96]
EXCAVATOR	2013	+	+	+	+	155	Perl	[97]
CODEX	2015	−	+	+	+	72	R	[98]
ADTEx	2014	+	+	−	+	57	Python, R	[99]
Seqgene	2011	+	−	−	+	43	R	[100]
FishingCNV	2013	−	−	−	−	41	Java, R	[101]
HMZDelFinder	2017	−	−	−	−	33	R	[102]
ExoCNVTest	2012	+	−	−	−	27	Java, R	[103]
CLAMMS	2016	−	−	+	−	23	C	[104]
falcon	2015	+	+	−	+	22	C	[105]
saasCNV *	2015	+	+	−	+	17	R	[106]
WISExome	2017	−	−	−	−	1	C, C++	[107]

Zare et al. examined five algorithms on tumor samples and concluded that some applications achieved relatively good specificity and sensitivity [108]. In particular, ExomeCNV [90] showed high specificity and sensitivity with a moderate false discovery rate. saasCNV [106] might be a useful tool for CNV detection; however, the specificity and sensitivity of the algorithm are inferior to those of the array methods [109]. Regarding overall specificity and sensitivity using simulated data [110], ADTEx [99] produced the best results followed by Control-FREEC [89], VarScan2, and ExomeCNV, but ExomeCNV and VarScan2 missed several homozygous deletions. Using breast cancer data in the same comparative study, ExomeCNV [90] showed the best results, although it produced only moderate concordance with SNP arrays. Overall, Control-FREEC proved to be the best algorithm due to its balanced performance on both simulated and cancer data [110]. According to a study examining six methods (ADTEx, CONTRA [95], Control-FREEC, EXCAVATOR, ExomeCNV, and VarScan2), these can identify homozygous deletions or large gains from WES data, but heterozygous deletions or low-level amplifications cannot be detected with sufficient consistency [111]. The results provided by ADTEx and EXCAVATOR were the most reliable [111]. Taken together, all the cited studies compare algorithms that were designed for somatic CNV detection from cancer-related data, and each came to a similar conclusion. At present, neither sensitivity nor specificity is high enough to compete with the existing non-WES methods. Furthermore, multiple studies highlighted that these algorithms perform better on simulated data than on patient data, which indicates that the tools are not sufficiently fine-tuned to address tumor complexity, although some of them, such as ADTEx and ExomeCNV, have a built-in tool to tackle this issue.
Each application has different strengths and weaknesses; for instance, ADTEx can detect medium-sized CNVs, while EXCAVATOR is suitable for the identification of larger CNVs. Similar to SNVs, merging, fine tuning and recalibration of these tools could be a means of improving the specificity and sensitivity [112,113]. It is important to mention, however, that these discrepancies are not specific to somatic mutation detection, as similar issues appeared in germline mutation-based comparison [84]. Dealing with NGS data demands well-trained bioinformaticians because most of the algorithms can only be used in command line-based platforms. The availability of the aforementioned applications in Galaxy is slightly limited—to date, VarScan2 and a CNV caller part of the bcftools package are available in the basic Galaxy setup. Several further algorithms can be installed in the case of a private Galaxy server.

7. Homologous Recombination Deficiency

DNA double-strand breaks are one of the most mutagenic forms of DNA damage [114,115]. Cells have developed multiple solutions to counter this damage, such as homologous recombination and nonhomologous end-joining [116]. Germline mutations of the BRCA genes have been described as reliable markers to identify homologous recombination deficiency (HRD). Currently, one FDA-approved clinical tool is available to detect germline BRCA mutations, the BRACAnalysis CDx platform (Myriad Genetics; Salt Lake City, UT, USA), which is used to identify BRCA status in patients with ovarian cancer. The presence of a BRCA mutation enables treatment with a PARP inhibitor. PARP repairs single-strand breaks, and the loss of both double-strand and single-strand break repair renders the tumor highly vulnerable to chemotherapy. HRDetect is a WGS-based method to identify the presence of homologous recombination repair mechanism mutations; this tool has proven to be effective and reliable regardless of germline or somatic mutation origin or tissue type. However, using this tool on WES data revealed a considerable decrease in detection sensitivity [117]. Another recent WES-based tool, which relies on genomic scar analysis, promises results comparable with SNP array examinations and might be useful for detecting BRCA status [118]. Since HRD detection mainly focuses on BRCA status, we currently lack an application capable of measuring overall HRD status involving all related genetic aberrations. Meanwhile, several other genes have also been shown to play important roles in HRD [119]. An improved future WES-based algorithm could enable the simultaneous investigation of all involved genes.

8. Response to Immunotherapy

Immune checkpoint inhibitors and immunomodulatory agents have become standard treatments for solid tumors, including renal-cell carcinoma, melanoma, and non-small cell lung cancer (NSCLC) [120]. The number of mutations per coding sequence in the tumor genome is a reliable predictive biomarker of immunotherapy response [121]. At present, the application of WES to detect tumor mutational burden (TMB) is a widely accepted gold standard. In addition, multiple targeted panels have been accepted, as targeted sequencing shows results comparable to exome sequencing in the detection of TMB status [122]. Although TMB bears strong potential as a predictive biomarker, there is no clear consensus on its correct determination, definition, and cut-off values. To address this issue, the Friends of Cancer Research established a working group to create a universal reference and harmonize these methods [123]. Because of the lack of solid guidelines, various studies have used numerous methods and computational techniques for TMB status determination. We evaluated eleven phase II and III clinical studies, and MuTect was the most frequently used tool for somatic variant detection, while the applications applied for InDel detection showed a wide variety [25,26,27,28,29,30,31,32,33,34,35]. A significant set of publications use the pipeline proposed by the Genome Analysis Toolkit (GATK), supplemented with additional tools; this pipeline recommends GATK Mutect2, which is based on MuTect, and the GATK HaplotypeCaller. Another concept recently gaining attention is the examination of mutational signatures. Mutations in cancer can originate from different mutagenic effects or defects in repair mechanisms. Each genetic aberration has its unique mutational signature, which can include base substitutions, small insertions and deletions, CNV changes, or genomic rearrangements [124]. As the number of explored signatures is growing, a systematic and curated archive of genetic patterns is needed.
The Catalogue of Somatic Mutations in Cancer (COSMIC) provides such a repository for mutational signatures and specific summary vignettes. Deciphering characteristic mutational patterns in a chosen cancer type requires bioinformatic analysis as well. Currently, there are several algorithms designed for mutational landscape identification, such as SigProfiler [125], deconstructSigs [126], and MutationalPatterns [127]. HRDetect, a mutational signature-based algorithm designed for the identification of homologous recombination repair deficiency, has already been discussed in a separate section. Accepted analysis standards for these methods are still missing [128]. Clinical cancer diagnostics might benefit from the application of mutational signature detection, as aberration patterns can be useful for targeted treatment selection [129]. A different predictive biomarker for response to immunomodulatory treatment is microsatellite instability (MSI). Since the FDA approved pembrolizumab for the treatment of adult and pediatric microsatellite instability-high (MSI-H) or mismatch repair-deficient (dMMR) solid tumors, MSI detection has gathered significant clinical attention [29]. A recent study suggests that impaired mismatch repair activity might result in a higher mutational burden, which in turn augments the response to immunomodulatory agents [130]. The current method for MSI detection, a combination of PCR with fluorescent primers and capillary electrophoresis, is becoming obsolete with the introduction of WES and targeted gene panel sequencing [131]. At present, the number of applications for MSI identification from exome sequencing data is not as high as the number of those for CNV or short variant detection. A comparison of some of these tools in six cancer types revealed that MANTIS produces better sensitivity and specificity than MSIsensor and mSINGS [132].
MSIseq shows results comparable to MSIsensor and mSINGS, while the MSIseq R package runs much faster than the other two [133]. MSIseq and MSIpred have the advantage that they can measure MSI from tumor data only. In a comparison using TCGA data, MSIpred exhibited higher accuracy and sensitivity than MSIseq [134]. MIRMMR also displayed accuracy and sensitivity similar to MSIsensor and mSINGS [135]. A recently implemented tool based on the examination of 5930 tumor exomes across 18 cancer types, called MOSAIC, produced remarkable sensitivity and specificity [136]. Overall, out of the seven algorithms available, MOSAIC has the strongest and most well-established analytical background, while MSIpred shows better performance than the others, with the advantage that it can operate without a normal reference sample. Unfortunately, no specific tool has been developed for MSI detection from exome sequencing data for those who have less experience in command line coding. This indicates that the Galaxy platform is the only alternative. Finally, there is an additional option for predicting response to immunotherapy: according to a recent paper, elevated DNA damage might be a possible biomarker of response [137].
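Because TMB is in essence a normalized mutation count, the arithmetic can be shown in a few lines of Python. This is a minimal sketch under explicit assumptions: TMB is expressed as somatic mutations per megabase of adequately covered coding sequence, and the cut-off of 10 mutations/Mb is purely illustrative, since, as noted above, no consensus exists on cut-off values or on which variant types to count. The function names are hypothetical.

```python
def tumor_mutational_burden(n_somatic_mutations, covered_bases):
    """Somatic mutations per megabase of covered coding sequence."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return n_somatic_mutations / (covered_bases / 1_000_000)


def tmb_category(tmb, cutoff=10.0):
    """Dichotomize TMB with an assumed, illustrative cut-off (mutations/Mb)."""
    return "TMB-high" if tmb >= cutoff else "TMB-low"


# Illustrative sample: 450 somatic mutations over 30 Mb of covered exome.
tmb = tumor_mutational_burden(450, 30_000_000)
print(tmb, tmb_category(tmb))
```

The denominator matters as much as the numerator: the same mutation count yields different TMB values depending on how much of the exome or panel was covered well enough to call variants, which is one reason results from different assays need harmonization.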

9. Tumor Heterogeneity

Tumor heterogeneity refers to diversity within a single tumor, in which several distinct cell populations coexist. These cancerous populations share a special microenvironment with normal cells and infiltrating immune-related cells. The subclonal populations can evolve cooperatively and are even capable of adapting to altered circumstances, including the emergence of therapy-resistant clones following systemic anticancer treatments [138]. Currently, there is no broadly accepted consensus method for the estimation of tumor heterogeneity. Clonal subpopulations can be identified by all three sequencing methods (WGS, WES, and targeted panels) and by single-cell approaches. A widely accepted way to measure the genetic heterogeneity of a tumor sample is to use WES and calculate Shannon’s diversity index from the estimated SNVs [139]. Determining tumor clonality and evolutionary background from bulk sequencing data is a multistep process. It begins with estimation of the cancer cell fraction, continues with the identification of tumor subclones, and ends with the construction of a phylogenetic tree based on the distribution of somatic variants and/or CNV status. Finally, temporal differentiation can assist in distinguishing between passenger and driver mutations [140]. In addition to the aforementioned approach, numerous algorithms have been developed to illuminate subclone phylogenesis. Unfortunately, because comparative studies are scarce, only limited guidance on algorithm selection is available at this time. In a study of nine methods, LICHeE and CloneFinder achieved good accuracy compared to the others [141]. In a recent comparative study, currently available only on a preprint server, the authors examined seven clonality prediction methods. CloneFinder, MACHINA, Treeomics, and LICHeE showed the best performance, but it is important to mention that none of the applications performed poorly across all of the simulated datasets [142]. Overall, the examination of tumor heterogeneity by NGS-based methods has a short history, and for this reason many of the existing methods require further fine-tuning. In vitro experiments might serve as guidance for adequate algorithm calibration and could provide further information on the selection of detection thresholds and coverage cut-off values. Recently, we have shown that cellular movement can also lead to a significant technical bias when using NGS to determine the clonal composition of a tumor [143]. With the technical development of both bulk sequencing and single-cell methods, we will soon be able to obtain an accurate picture of a cancer population in its complete heterogeneity.
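Two of the quantities mentioned above, the cancer cell fraction and Shannon's diversity index, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of the cited studies: the cancer cell fraction uses a commonly applied approximation that assumes the tumor purity, the local total copy number, and the number of mutated copies (multiplicity) are already known, and the diversity index assumes that subclone proportions have already been estimated. All function and parameter names are hypothetical.

```python
import math


def cancer_cell_fraction(vaf, purity, total_cn=2, multiplicity=1, normal_cn=2):
    """Approximate fraction of cancer cells carrying a somatic variant.

    vaf          variant allele frequency observed in the bulk sample
    purity       fraction of tumor cells in the sample (0 < purity <= 1)
    total_cn     total copy number of the locus in tumor cells
    multiplicity number of mutated copies per mutated tumor cell
    normal_cn    copy number of the locus in normal cells
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in the interval (0, 1]")
    # Average number of copies of the locus per cell in the mixed sample.
    mean_copies = purity * total_cn + (1 - purity) * normal_cn
    ccf = vaf * mean_copies / (purity * multiplicity)
    # Sampling noise can push the estimate above 1, so cap it.
    return min(ccf, 1.0)


def shannon_index(proportions):
    """Shannon's diversity index H = -sum(p * ln p) over subclone proportions."""
    total = sum(proportions)
    if total <= 0:
        raise ValueError("proportions must sum to a positive value")
    h = 0.0
    for p in proportions:
        if p > 0:  # empty subclones contribute nothing
            q = p / total  # renormalize so the proportions sum to 1
            h -= q * math.log(q)
    return h
```

A tumor consisting of a single clone gives an index of zero, and the index grows as more subclones of similar size are present, which is why it is a convenient single-number summary of heterogeneity.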

10. Discussion

The first U.S. Food and Drug Administration (FDA) approval for NGS technology was issued in 2013, and a few years later the first diagnostic and screening tests were approved. We provide an overview of NGS-based tests approved for somatic or germline mutation detection in Table 4.

Table 4

Food and Drug Administration (FDA)-approved next-generation sequencing (NGS)-based methods suitable for cancer predisposition identification, cancer detection, or follow-up.

Tradename	Description	Year	Target	Tumor	Utility
Illumina MiSeqDX platform	High throughput DNA sequence analyzer	2013	-	-	technology
FoundationFocus CDxBRCA	NGS oncology panel, somatic or germline variant detection system	2016	BRCA	ovarian	diagnosis
MSK-IMPACT	NGS-based tumor profiling test	2017	468 genes	various	predisposition, diagnosis
FoundationOne CDx	NGS oncology panel, somatic or germline variant detection system	2017	324 genes	various	predisposition, diagnosis
Oncomine Dx Target Test	NGS oncology panel, somatic or germline variant detection system	2017	24 genes	lung	diagnosis
Praxis Extended RAS Panel	NGS oncology panel, somatic or germline variant detection system	2017	RAS	colon	diagnosis
Adaptive Biotechnologies clonoSEQ	DNA-based test for minimal residual disease for hematologic malignancies	2018	BCL1, BCL2	leukemia, myeloma	follow-up

We are now in the era of big data, brought about by the vast amount of data delivered by new sequencing methods. Deciphering this information requires complex bioinformatic analysis tools. At the same time, we have to account for the unquestionable weaknesses of exome sequencing [144]. These disadvantages include the limited power to detect structural alterations such as gene fusions and the limited ability to determine tumor purity and to distinguish tumor cells from normal cell contamination. The previously discussed machine learning algorithms for short variant detection can improve the accuracy of TMB and MSI detection, as precise short variant identification is a crucial part of both. Improved detection of copy number changes can lead to more accurate HRD and tumor heterogeneity analysis [145]. The main conclusion of our paper is that, because of discrepancies among the tools used during sample preparation, data preprocessing, and processing, it is almost impossible to define a gold-standard guideline that names the most suitable algorithms. Of note, users can customize the selected algorithms for their own experiments rather than running them with default settings. The clinical significance of NGS-based methods is steadily expanding. Although discrepancies can be observed among the currently available tools, continuous fine-tuning and the combined use of these applications pave the way for clinically reliable applications in the coming years. Overall, WES is emerging as a future “Swiss army knife” of cancer genome profiling. Once bioinformatic processes have evolved into trustworthy pipelines, WES will be an affordable and mature technology for everyday clinical use.

