Optimized pipeline of MuTect and GATK tools to improve the detection of somatic single nucleotide polymorphisms in whole-exome sequencing data.

Ítalo Faria do Valle¹,², Enrico Giampieri¹, Giorgia Simonetti³, Antonella Padella³, Marco Manfrini³, Anna Ferrari³, Cristina Papayannidis³, Isabella Zironi¹, Marianna Garonzi⁴, Simona Bernardi⁵, Massimo Delledonne⁴,⁶, Giovanni Martinelli³, Daniel Remondini⁷, Gastone Castellani¹.

Abstract

BACKGROUND: Detecting somatic mutations in whole exome sequencing data of cancer samples has become a popular approach for profiling cancer development, progression and chemotherapy resistance. Several studies have proposed software packages, filters and parametrizations. However, many research groups reported low concordance among different methods. We aimed to develop a pipeline which detects a wide range of single nucleotide mutations with high validation rates. We combined two standard tools - Genome Analysis Toolkit (GATK) and MuTect - to create the GATK-LODN method. As proof of principle, we applied our pipeline to exome sequencing data of hematological (Acute Myeloid and Acute Lymphoblastic Leukemias) and solid (Gastrointestinal Stromal Tumor and Lung Adenocarcinoma) tumors. We performed experiments on simulated data to test the sensitivity and specificity of our pipeline.
RESULTS: MuTect presented the highest validation rate (90 %) for mutation detection, but detected a limited number of somatic mutations. GATK detected a high number of mutations, but with low specificity. GATK-LODN increased the performance of GATK variant detection (from 5 of 14 to 3 of 4 confirmed variants), while preserving mutations not detected by MuTect. However, GATK-LODN filtered more variants in the hematological samples than in the solid tumors. Experiments on simulated data demonstrated that GATK-LODN increased both the specificity and the sensitivity of GATK results.
CONCLUSION: We presented a pipeline that detects a wide range of somatic single nucleotide variants, with good validation rates, from exome sequencing data of cancer samples. We also showed the advantage of combining standard algorithms to create the GATK-LODN method, which increased the specificity and sensitivity of GATK results. This pipeline can be helpful in discovery studies that aim to profile the somatic mutational landscape of cancer genomes.

Keywords: Cancer; Somatic single nucleotide variants; Whole exome sequencing

Year: 2016 PMID: 28185561 PMCID: PMC5123378 DOI: 10.1186/s12859-016-1190-7

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

Somatic mutations play a key role in cancer development, progression and chemotherapy resistance. Therefore, several studies have profiled somatic mutations in cancer samples by applying next generation sequencing technologies, allowing the discovery of drug targets, prognostic DNA markers and protocols for targeted therapies. Whole Exome Sequencing (WES) has become a popular approach because it is cost effective and it detects approximately 25,000 single nucleotide variants (SNVs) in the coding region of human DNA. However, the detection of somatic mutations in normal-cancer paired samples presents unique challenges: 1) detecting low allelic frequency mutations due to tumor heterogeneity, subclonality and copy number variation events; 2) differentiating true mutations from alignment artifacts and sequencing errors; 3) classifying mutations as somatic or germ-line polymorphisms; and 4) analyzing tumor samples contaminated by normal cells and vice versa [1]. Understanding the mutational landscape of cancer genomes requires methods that detect somatic mutations and deal with these challenges. Several studies have compared the performance of different pipelines, software packages and parametrizations [2-7]. In general, the available tools classify somatic mutations by analyzing the tumor and normal samples either independently or simultaneously; but, since they have different prior assumptions and error modeling approaches, many research groups have reported low concordance among methods [4, 8]. The available tools either report too many false positives in order to capture all true positives, or lose too many true positives in order to reduce the number of false positives [9]. In the first case, the researcher spends considerable time and resources validating the set of candidate variants to select the true ones. In the second case, important mutations that explain the biological characteristics of the cancer cells may be missed.
This evidence, along with the variability in the performance of each software package across studies and tumor types, indicates that the research community faces a considerable challenge in choosing the right pipeline among all available options. In this study, we aimed to develop a pipeline that detects a broad, high-confidence profile of single nucleotide variants in sequencing data of cancer samples. Our pipeline brings together the benefits of two standard tools: Genome Analysis Toolkit (GATK) and MuTect. GATK calls variants independently in the normal and tumor samples, while MuTect analyzes the two samples simultaneously. We created the GATK-LODN method, which takes one part of the MuTect algorithm and applies it downstream of the GATK analysis in order to ensure the somatic classification of the GATK results and reduce their false positive calls. As proof of principle, we applied our pipeline to hematological (Acute Myeloid and Acute Lymphoblastic Leukemias) and solid (Gastrointestinal Stromal Tumor and Lung Adenocarcinoma) tumors. We also tested our pipeline on simulated data and technical replicate samples to evaluate its sensitivity and specificity. Our results show that the pipeline performed well, and we believe that it can be helpful in discovery studies that aim to profile the somatic mutational landscape of cancer genomes.

Methods

Sequencing data

Primary samples were collected from Acute Myeloid Leukemia (n = 37) and Acute Lymphoblastic Leukemia patients (n = 41) after obtaining informed consent as approved by the Institutional Ethical Committee (protocol number 253/2013/O/Tess) of Azienda Ospedaliero-Universitaria, Policlinico Sant’Orsola-Malpighi (Bologna, Italy) in accordance with the Declaration of Helsinki. Leukocytes were enriched from bone marrow and peripheral blood samples by separation on a Ficoll density gradient. Saliva samples, used as matched normal samples, were collected with the Oragene Discover kit (DNA Genotek). The DNA was extracted from leukocytes by column purification (AllPrep DNA/RNA/Protein Mini Kit and QIAcube, Qiagen) and from saliva by paramagnetic particles (Maxwell® 16 LEV DNA Blood Purification Kit and Maxwell® MDx Instrument), according to the manufacturers’ protocols. The exonic regions were captured by the TruSeq™ Exome Enrichment Kit and Nextera Rapid Capture Expanded Exome, comprising a targeted region of 62 Mb and 201,121 exonic regions. Illumina HiSeq2000 sequencing produced an average of 55.2 and 63 million 100 bp paired-end reads per sample in the AML and ALL cohorts, respectively. The AML and ALL data sets are available upon request to the Next Generation Sequencing for Targeted Personalized Therapy of Leukemia consortium. We also selected two public datasets of Illumina HiSeq 2000 whole exome sequencing from the NCBI Sequence Read Archive: 1) seven Gastrointestinal Stromal Tumor (GIST) samples, and their matching peripheral blood samples, with an average of 35.5 million 100 bp paired-end reads per sample [SRA: SRR1299130-141 and SRR1299144-147] [10]; and 2) two Lung Adenocarcinoma samples, and their normal counterparts, with an average of 56.5 million 100 bp paired-end reads per sample [SRA: ERR160124, ERR160136, ERR166338, and ERR166339] [11].
After the quality control check, the average final coverages were 72X (±30X), 119X (±28X), 76X (±7X) and 133X (±64X) for AML, ALL, GIST, and Lung Adenocarcinoma, respectively (Additional file 1 provides, for each tumor type, the sample IDs and coverage information).

Pipeline for somatic variant discovery

Initially, the sequencing reads were submitted to a quality control check using the scripts fastq_quality_filter.pl and fastq_quality_trimmer.pl from FASTX-Toolkit [12]. A phred value of 20 was chosen as the minimum threshold for base quality. Reads with more than 80 % of low-quality bases were removed, or had their 3′ end bases trimmed when the minimum threshold was not reached. Afterwards, the reads were aligned to the human reference genome hg19/GRCh37 using BWA-MEM [13] with default parameters, and Picard [14] was applied for post-alignment procedures such as sorting, indexing, and marking duplicates. The alignments were submitted to local realignment around INDELs and base quality score recalibration (BQSR) using the Genome Analysis Toolkit (GATK) version 3.0 [15]. MuTect [16] and GATK (Haplotype Caller) were used for single nucleotide variant calling. GATK variants were filtered with the Variant Quality Score Recalibration tool following the best practices on the GATK website. GATK performs variant calling and filtration in the normal and tumor samples independently; subtracting the normal variants from the tumor variants therefore gave our first set of candidate somatic variants. To ensure the somatic classification of the SNVs called by GATK, we adapted the MuTect algorithm and applied its LODN classifier after the GATK variant calling and filtering. The LODN is a Bayesian classifier that compares the likelihood of two models: (1) the mutation does not exist in the normal sample and all non-reference bases are explained by sequencing noise, and (2) the mutation truly exists in the normal sample as a germ-line heterozygous variant. The logarithm of the ratio of these two likelihoods is called the LOD (log odds) score, and when it exceeds a decision threshold, the mutation can be classified as somatic.
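The two-model comparison described above can be sketched in Python as follows. This is a minimal illustration that follows the per-base likelihood published for MuTect, not the code of the pipeline itself: the function names and the pileup representation (a list of (base, phred quality) pairs observed in the normal sample) are assumptions made for this example. The decision threshold is passed explicitly; the usage note uses 5.5, the value quoted in this paper for dbSNP sites.

```python
import math


def base_likelihood(base, err, ref, alt, f):
    """Probability of observing `base` given the base error rate `err`,
    the reference allele, the alternate allele, and an alternate allele
    fraction `f` (per-base model published for MuTect)."""
    if base == ref:
        return f * err / 3 + (1 - f) * (1 - err)
    if base == alt:
        return f * (1 - err) + (1 - f) * err / 3
    return err / 3


def log10_likelihood(pileup, ref, alt, f):
    """Sum of log10 per-base likelihoods; bases are treated as independent.
    Assumes every phred quality is greater than zero."""
    total = 0.0
    for base, qual in pileup:
        err = 10 ** (-qual / 10)  # phred score -> error probability
        total += math.log10(base_likelihood(base, err, ref, alt, f))
    return total


def lod_normal(pileup, ref, alt):
    """log10 of L(no variant in normal, f = 0) over
    L(germ-line heterozygous variant, f = 0.5)."""
    return (log10_likelihood(pileup, ref, alt, 0.0)
            - log10_likelihood(pileup, ref, alt, 0.5))


def is_somatic(pileup, ref, alt, threshold):
    """A candidate is kept as somatic when the normal-sample LOD
    reaches the decision threshold."""
    return lod_normal(pileup, ref, alt) >= threshold
```

For instance, `is_somatic([("G", 30)] * 20, "G", "A", 5.5)` evaluates a normal sample with twenty reference reads and no alternate reads. With only ten such reads the score falls below 5.5, which illustrates why a minimum read depth in the normal sample is needed before the classifier can support a somatic call.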
For this filtering, we considered only sites with a total read depth greater than or equal to 8 in the normal sample and greater than or equal to 14 in the tumor sample. Our final candidate list consisted of the union of the MuTect and GATK-LODN results. The variants were annotated by ANNOVAR [17], with the Ensembl Gene annotation database for human genome build 37 (http://www.ensembl.org/), and searched for matches in the dbSNP138 and 1000 Genomes data. We selected exonic single nucleotide variants (SNVs) that were non-synonymous or that caused the gain or loss of a stop codon. Variants present in dbSNP138 and 1000 Genomes with minor allele frequency (MAF) greater than 0.05 were removed. Figure 1 shows the summary of the pipeline steps. The scripts for running the main pipeline steps are available at: https://bitbucket.org/BBDA-UNIBO/wes-pipeline.
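The set operations behind the candidate list (tumor-minus-normal subtraction, depth requirement, union with MuTect, and the population-frequency filter) can be sketched as below. This is only an illustration under stated assumptions: the `Variant` record, the function names and the in-memory dictionaries are hypothetical and do not reproduce the published scripts, which operate on VCF and ANNOVAR files.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant key used for set operations."""
    chrom: str
    pos: int
    ref: str
    alt: str


MIN_NORMAL_DEPTH = 8       # minimum total read depth in the normal sample
MIN_TUMOR_DEPTH = 14       # minimum total read depth in the tumor sample
MAX_POPULATION_MAF = 0.05  # variants above this MAF are removed


def gatk_candidates(tumor_calls, normal_calls):
    """First candidate set: GATK tumor calls absent from the normal calls."""
    return set(tumor_calls) - set(normal_calls)


def passes_depth(normal_depth, tumor_depth):
    """Depth requirement applied before the LODN classification."""
    return normal_depth >= MIN_NORMAL_DEPTH and tumor_depth >= MIN_TUMOR_DEPTH


def final_candidates(mutect_calls, gatk_lodn_calls, population_maf):
    """Union of MuTect and GATK-LODN calls, minus common polymorphisms.
    `population_maf` maps a variant to its MAF; variants missing from the
    mapping are treated as absent from the population databases."""
    union = set(mutect_calls) | set(gatk_lodn_calls)
    return {v for v in union
            if population_maf.get(v, 0.0) <= MAX_POPULATION_MAF}
```

Taking the union rather than the intersection keeps variants that only one caller reports, which is the stated motivation for combining the two methods.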

Fig. 1

Pipeline of SNV detection in sequencing data of cancer samples. Summary of steps and their respective tools in the detection of SNVs in paired normal-cancer sequencing data

A subset of variants from the MuTect, GATK and GATK-LODN calls was selected for validation. Variants with allelic frequency higher than 0.2 were validated by Sanger sequencing and those with allelic frequency lower than 0.2 were validated using the Illumina TruSight Myeloid Sequencing Panel and Illumina MiSeq sequencing. Data were analyzed with the VariantStudio software (Illumina), according to the manufacturer’s instructions.

Pipeline testing

As MuTect occasionally miscalled variants already profiled by Sanger sequencing at the time of diagnosis, we tested an adaptation of the MuTect algorithm in which we lowered its two main thresholds (ΘT ≥ 6.5 and ΘN|dbSNP site ≥ 5.5), which determine mutation detection and classification as somatic or germ-line. We calculated the ΘT and ΘN values for each variant in the GATK raw output and set the thresholds to the minimum values that would permit the correct classification of 10 variants previously identified by Sanger sequencing.
We simulated datasets to evaluate the specificity and sensitivity of the three variant calling methods: MuTect, GATK and GATK-LODN. The specificity was evaluated by splitting the sequencing data of the same sample in two, applying the three variant calling methods, and counting the total number of SNVs called. One saliva sample of our AML cohort (80X) had its reads randomized (reads sorted by query name) and was split in two using the bamutils tool of the NGSUtils package [18]. Each variant calling method was then applied to the resulting alignment files.
The sensitivity was calculated by creating artificial tumor samples, applying the variant calling methods, and counting the number of true positives called. We adapted the mutate_sample.py script from the Shimmer package [19] to create mutations in a saliva sample alignment. Three artificial tumors were created with 22, 25 and 25 SNVs, which had variant allelic fractions in the ranges 0.02 to 0.25, 0.5 to 0.86, and 0.97 to 1.0, respectively (Table 1). For each artificial tumor sample, we created subsets by randomly excluding reads, simulating sequencing coverages in the range of 5X to 80X at intervals of 5X. The subsets were created with the DownsampleBam tool of Picard. We then evaluated the performance of each variant calling method at the different coverage levels.
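The coverage simulation amounts to keeping each read pair with a probability equal to the target coverage divided by the original coverage. The sketch below illustrates that arithmetic in Python; it is not the Picard implementation, and the function names, the seed and the representation of read pairs as a list of names are assumptions made for this example.

```python
import random


def keep_fractions(original_cov=80, lowest=5, step=5):
    """Retention probability for each target coverage, assuming coverage
    scales linearly with the fraction of read pairs kept."""
    return {target: target / original_cov
            for target in range(lowest, original_cov + 1, step)}


def downsample(read_pairs, fraction, seed=0):
    """Keep each read pair independently with probability `fraction`.
    Both mates of a pair share one draw, so pairs are never split."""
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    return [pair for pair in read_pairs if rng.random() < fraction]


if __name__ == "__main__":
    for target, fraction in keep_fractions().items():
        print(f"{target}X -> keep {fraction:.4f} of read pairs")
```

Because reads are dropped at random, the realized coverage of each subset fluctuates around its target, and variants with few supporting reads can lose all of them at the lowest coverage levels.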

Table 1

Artificial tumor samples. Coordinate list of the single nucleotide variants inserted in the artificial tumor samples and their variant allelic frequencies

Chromosome	Position	REF > ALT	Artificial tumor VAF (0.02 – 0.26)	Artificial tumor VAF (0.5 – 0.86)	Artificial tumor VAF (0.97 – 1)	Normal VAF
11	19854088	G > A	0.03	0.69	1.00	0
11	36484167	C > T	0.08	0.62	1.00	0.027
11	4608116	T > C	0.13	0.71	1.00	0.020
11	4661826	T > C	0.11	0.60	0.97	0.028
11	4673788	G > A	0.26	0.64	1.00	0.021
11	4928841	T > C	0.13	0.61	1.00	0
11	5372856	A > G	0.24	0.69	1.00	0.023
11	5373562	C > A	0.09	0.68	1.00	0.029
11	5443887	T > C	0.10	0.86	1.00	0
11	5443893	G > A	0.10	0.86	1.00	0
11	5462255	C > G	0.16	0.56	1.00	0
11	5906203	T > G	0.19	0.70	1.00	0
11	6519642	G > A	0.08	0.61	1.00	0
11	824789	T > C	0.11	0.63	1.00	0.026
12	25398281	C > T	0.12	0.63	1.00	0
12	75715330	C > A	0.13	0.60	1.00	0
22	24891418	A > C	0.21	0.70	1.00	0.030
22	44083442	T > C	NA	0.78	1.00	0
13	101289801	C > A	0.13	0.65	1.00	0
20	61537337	G > T	0.13	0.65	1.00	0
17	48557299	G > T	0.11	0.74	1.00	0
5	45262378	G > T	0.08	0.50	1.00	0
1	94476902	T > C	0.15	0.65	1.00	0
2	110372199	G > T	NA	0.57	1.00	0
5	64907465	C > A	0.10	0.57	1.00	0

Results

We built a pipeline for the discovery of single nucleotide variants (SNVs) in whole exome sequencing data and applied it to Acute Myeloid Leukemia (AML), Acute Lymphoblastic Leukemia (ALL), Gastrointestinal Stromal Tumor (GIST), and Lung Adenocarcinoma samples. First, we compared the results of the three variant calling procedures: MuTect, GATK, and GATK-LODN. GATK detected 3 to 20 times more SNVs than MuTect (Fig. 2a), and the results for the Lung Adenocarcinoma dataset presented the highest concordance (30 %) between the two methods. GATK-LODN strongly reduced the number of SNVs in the GATK results for the hematological tumors (Fig. 2b). For the solid tumors, approximately 10 % of GATK-specific SNVs remained after applying GATK-LODN, and, for the GIST dataset, it detected about three times more variants than MuTect.

Fig. 2

The GATK-LODN method reduces the number of GATK false positive calls. Comparison of the number of SNVs between GATK and MuTect before (a) and after (b) applying the GATK-LODN method for each cancer whole exome sequencing dataset. AML: Acute Myeloid Leukemia, ALL: Acute Lymphoblastic Leukemia, GIST: Gastrointestinal Stromal Tumor, LA: Lung Adenocarcinoma

The MuTect algorithm has two main parameters: ΘT and ΘN. We calculated these values for a set of candidate variants (AML dataset) from the GATK results and tested whether we could reduce the number of false negatives by lowering these thresholds. Setting the thresholds to ΘT ≥ 4.5 and ΘN|dbSNP site ≥ 3 permitted the detection of 10 variants previously profiled by Sanger sequencing but not detected by the original MuTect analysis. However, the number of final candidates increased about 1.3 to 10 times in comparison with the original MuTect output (Table 2).
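The fold increase quoted above can be recomputed from the per-patient counts listed in Table 2. The short Python sketch below does only that arithmetic; the variable names are illustrative.

```python
# Candidate counts per patient from Table 2: (MuTect, MuTect adapted).
counts = {
    "a1024": (11, 39),
    "a1025": (31, 41),
    "b1014": (22, 54),
    "b2002": (10, 25),
    "b2035": (43, 419),
    "b2042": (58, 338),
}

# Fold increase = adapted count divided by original count.
fold_increase = {patient: adapted / original
                 for patient, (original, adapted) in counts.items()}

if __name__ == "__main__":
    for patient, fold in sorted(fold_increase.items()):
        print(f"{patient}: {fold:.1f}x")
    print(f"range: {min(fold_increase.values()):.1f}x "
          f"to {max(fold_increase.values()):.1f}x")
```

The smallest ratio comes from patient a1025 (41/31) and the largest from patient b2035 (419/43), which matches the range of about 1.3 to 10 times reported in the text.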

Table 2

Patients	MuTect	MuTect Adapted^a
a1024	11	39
a1025	31	41
b1014	22	54
b2002	10	25
b2035	43	419
b2042	58	338

^a Applying the computation of ΘT and ΘN, from the MuTect algorithm, with lowered threshold values (4.5 and 3, respectively) downstream to the GATK analysis

Relaxing MuTect parameters increases the number of false positive calls. Number of variants found by MuTect, before and after relaxing the ΘT and ΘN parameters, for six Acute Myeloid Leukemia (AML) normal-cancer sample pairs

We selected a set of candidate variants from the AML dataset and performed the validation experiment for each method in two rounds. In the first round, we tested just the tumor samples, in order to evaluate the performance of each method in detecting the mutations. In the second round, we tested both tumor and normal samples, in order to evaluate the performance of each method in classifying mutations as somatic events. We observed that 18 out of 48 and 5 out of 14 GATK variants were correctly detected and classified, respectively, while MuTect presented high performance in both rounds (27 out of 29 and 10 out of 11, respectively). GATK-LODN presented better validation rates than GATK for both mutation detection (from 18 out of 48 to 6 out of 9) and classification (from 5 out of 14 to 3 out of 4) (Table 3).

Table 3

Method	Mutation Detection^a: Tested	Mutation Detection^a: Validated	Mutation Classification^b: Tested	Mutation Classification^b: Validated
GATK-LOD_N - specific	4	1	2	2
GATK-LOD_N (All variants)	9	6	4	3
GATK (without LOD_N) - specific	37	11	9	2
GATK (without LOD_N) (All Variants)	48	18	14	5
MuTect - specific	22	21	8	8
MuTect (All Variants)	29	27	11	10
MuTect & GATK	7	6	3	2

^a Variants tested for correct mutation detection

^b Variants tested for correct classification as somatic events

The GATK-LODN method increases the GATK performance for both mutation detection and classification. The Sanger sequencing validation was performed in two rounds: in the first round we tested whether the methods correctly detected the mutation, and in the second we assessed whether the methods correctly classified the mutations as somatic events. The variant subsets tested (AML dataset) included method-specific variants and variants detected by more than one method

Simulated data permitted the evaluation of the sensitivity and specificity of the three variant calling methods. We measured the specificity by splitting a saliva sample alignment (80X) in two, running the two halves through the pipeline and counting the number of called SNVs. MuTect, GATK, and GATK-LODN resulted in 8, 76 and 35 false positives, respectively. Then, we ran technical replicates of the same saliva sample through the pipeline, which resulted in 7, 84 and 33 false positives, respectively. We measured the sensitivity by simulating three artificial tumors with different Variant Allelic Frequency (VAF) ranges: one with high-frequency variants (n = 25, VAF: 0.97 to 1.0), one with intermediate-frequency variants (n = 25, VAF: 0.5 to 0.86), and another with low-frequency variants (n = 22, VAF: 0.02 to 0.25). MuTect presented a Positive Predictive Value (PPV) of 19/22 for low VAF mutations, and its false negatives comprised one variant with VAF = 0.02 and two variants that had VAF < 0.1 and a total read depth smaller than 24 (Table 4). GATK presented the lowest performance for somatic variants, since it reported 2206 candidates for tumors containing only 22 or 25 true positive variants. GATK-LODN presented a PPV of 17/22 for the low allelic frequency variants, but it missed variants with VAF < 0.095 (Table 4).
MuTect detected all intermediate and high allelic frequency variants, while GATK-LODN presented PPVs of 23/30 and 23/31, respectively (Table 4).

Table 4

The GATK-LODN method presented good performance in artificial tumor samples. Performance of MuTect and GATK-LODN for artificial tumor samples that had variants with diverse allelic frequencies

		Artificial Tumor Samples
		Low Frequency Variants (n = 22) VAF: 0.02 – 0.26	Intermediate Frequency Variants (n = 25) VAF: 0.5 – 0.86	High Frequency Variants (n = 25) VAF: 0.97 – 1
MuTect	Somatic Candidates	22	25	25
	TP	19	25	25
	FN	0	0	0
	FP	3	0	0
	PPV	19/22	25/25	25/25
	FDR	3/22	0/25	0/25
GATK-LOD_N	Somatic Candidates	27	32	33
	TP	17	23	23
	FN	5	5	2
	FP	5	7	8
	PPV	17/22	23/30	23/31
	FDR	5/22	7/30	8/31

TP True positives, FN False negatives, FP False positives, PPV Positive Predictive Value (#TP / (#FP + #TP)), FDR False Discovery Rate (#FP / (#FP + #TP)), VAF Variant Allelic Frequency

GATK results were not reported in the table since it detected more than 2200 candidates out of 22 or 25 TPs
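The PPV and FDR entries of Table 4 follow directly from the true positive and false positive counts. A minimal Python sketch of the two definitions, using the TP and FP values from the table (variable and function names are illustrative):

```python
def ppv(tp, fp):
    """Positive Predictive Value: TP / (TP + FP)."""
    return tp / (tp + fp)


def fdr(tp, fp):
    """False Discovery Rate: FP / (TP + FP), the complement of the PPV."""
    return fp / (tp + fp)


# (TP, FP) per method and artificial tumor, copied from Table 4.
table4 = {
    ("MuTect", "low"): (19, 3),
    ("MuTect", "intermediate"): (25, 0),
    ("MuTect", "high"): (25, 0),
    ("GATK-LODN", "low"): (17, 5),
    ("GATK-LODN", "intermediate"): (23, 7),
    ("GATK-LODN", "high"): (23, 8),
}

if __name__ == "__main__":
    for (method, tumor), (tp, fp) in table4.items():
        print(f"{method:10s} {tumor:13s} "
              f"PPV={ppv(tp, fp):.2f} FDR={fdr(tp, fp):.2f}")
```

Both quantities share the denominator TP + FP, so they always sum to one; the parentheses around the denominator matter, since #TP / #FP + #TP read left to right would give a different number.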

For each artificial tumor, we simulated different sequencing coverages and evaluated the number of false negatives and true positives detected. We observed that, at different coverage levels, GATK-LODN and MuTect presented almost identical performance for the artificial tumors with high- and intermediate-frequency SNVs, except for the number of false negatives produced by GATK-LODN in the coverage interval of 5 to 20X. GATK-LODN detected more true positives than MuTect in the coverage interval of 50 to 55X for high- and intermediate-frequency variants, and at 20X coverage for low-frequency variants (Fig. 3).

Fig. 3

Number of False Negatives and True positives at different coverage levels. Three artificial tumors were created with 22, 25 and 25 SNVs, which had variant allelic fractions range of 0.02 to 0.25, 0.5 to 0.86, and 0.97 to 1.0, respectively. We counted the number of False Negatives (FN) and True positives (TP) for different levels of simulated sequencing coverage

Discussion

Our data show that the combination of two standard tools, Genome Analysis Toolkit (GATK) and MuTect, improves the range of detected single nucleotide variants (SNVs) in whole exome sequencing data of cancer samples. We also developed the GATK-LODN method, which reduced the number of GATK false positive calls. Our study has the advantage of actually combining two different algorithms rather than proposing ways of unifying the results of different tools [9, 20]. Because one method originally produced many false positive calls (type I error) and the other many false negative calls (type II error), GATK-LODN offers a way to broaden the range of detected SNVs without severely compromising sensitivity and specificity. GATK uses a Bayesian model to estimate the likelihood of a genotype given the observed sequence reads that cover the locus. It calls genotypes independently in tumor and normal samples, and somatic mutations are classified as those present only in the tumor sample. However, GATK detects many false positives, likely due to germ-line variants with low sequencing coverage or low allelic frequency that are not called in the normal samples. MuTect jointly analyzes tumor and normal samples, presenting high sensitivity, specificity and validation rates. Each method detects variants that the other does not, and a previous study demonstrated that the SNVs found only by GATK had relatively high validation rates [4]. One option would be to take into account only the results obtained from one tool, but this risks selecting the errors to which that algorithm is vulnerable [21]. Another option would be to take the intersection of multiple variant callers, but this results in high false negative rates, since each tool uniquely identifies true variants [4].
We discarded the option of relaxing the MuTect parameters, since we observed that it recovered variants previously miscalled, but at the cost of including many false positives. Our study demonstrates the advantage of merging the results of MuTect and GATK-LODN, since GATK-LODN reduces the number of GATK false positives and detects variants not detected by MuTect. GATK-LODN increased the performance of GATK in the sequencing validation experiments and in the simulated artificial tumor samples. We observed that GATK-LODN also outperformed MuTect at some simulated sequencing coverages. As sequencing datasets usually present large variability in coverage and quality, the different error modeling approaches and prior assumptions associated with the two methods should permit good performance across a wide range of scenarios. We performed the validation experiments only for variants from the hematological tumors (available in our laboratories), so the validation rate might differ for solid tumors. The results show that GATK-LODN filtered more variants in the hematological tumors than in the solid tumors, and we hypothesize that the normal samples from hematological tumors may be more prone to contamination by cancer cells. Although GATK-LODN provided a small number of variants in the hematological datasets, even a single variant can give insights into the mechanisms of malignant transformation and help design personalized therapeutic approaches [22, 23]. We observed that the Lung Adenocarcinoma dataset presented the highest concordance between methods, possibly because patients with this type of cancer usually present high mutation frequencies and harbor more somatic mutations than patients with other cancer types [24-27]. The results also show that different methods may be biased toward certain nucleotide substitutions, but more studies involving larger groups of tumors are needed.
GATK-LODN is suitable for application together with other post-calling filters, such as those for strand bias, nearby polymorphisms and technology-specific sequencing errors [28-30]. For instance, Carson et al. [7] suggested new thresholds for genotype and variant filters to be used in conjunction with the GATK pipeline analysis, which could increase the GATK-LODN performance in population-based studies. Altogether, GATK-LODN allows enough flexibility to deal with different study designs and requirements about how stringent the analysis must be. Here, we presented a tested pipeline that combines standard tools, aiming to detect a wide range of somatic single nucleotide variants with high specificity and sensitivity. We developed the GATK-LODN method, which can be helpful in large-cohort discovery studies that aim to profile the somatic mutational landscape from whole exome sequencing data of cancer samples.

Conclusion

Next generation sequencing analysis has drastically improved the biological knowledge of human cancers. Several tools and strategies are available to detect single nucleotide variants in normal-cancer paired samples, but many research groups report low concordance among them. In this study, we proposed a pipeline that applies two standard tools (MuTect and GATK) and one adapted method (GATK-LODN) that increased the performance of its original algorithm. The GATK-LODN method improved the overall performance by reducing the number of false positive calls and permitted the detection of variants not detected by MuTect. We believe that the proposed pipeline will help in the understanding of cancer biology through the discovery of somatic single nucleotide variants in cancer sequencing data.

27 in total

1. Optimized filtering reduces the error rate in detecting genomic variants by short-read sequencing.

Authors: Joke Reumers; Peter De Rijk; Hui Zhao; Anthony Liekens; Dominiek Smeets; John Cleary; Peter Van Loo; Maarten Van Den Bossche; Kirsten Catthoor; Bernard Sabbe; Evelyn Despierre; Ignace Vergote; Brian Hilbush; Diether Lambrechts; Jurgen Del-Favero
Journal: Nat Biotechnol Date: 2011-12-18 Impact factor: 54.908

Review 2. Analysis of next-generation genomic data in cancer: accomplishments and challenges.

Authors: Li Ding; Michael C Wendl; Daniel C Koboldt; Elaine R Mardis
Journal: Hum Mol Genet Date: 2010-09-15 Impact factor: 6.150

3. Performance of common analysis methods for detecting low-frequency single nucleotide variants in targeted next-generation sequence data.

Authors: David H Spencer; Manoj Tyagi; Francesco Vallania; Andrew J Bredemeyer; John D Pfeifer; Rob D Mitra; Eric J Duncavage
Journal: J Mol Diagn Date: 2013-11-05 Impact factor: 5.568

Review 4. Identifying disease mutations in genomic medicine settings: current challenges and how to accelerate progress.

Authors: Gholson J Lyon; Kai Wang
Journal: Genome Med Date: 2012-07-26 Impact factor: 11.117

5. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers.

Authors: Michael A Quail; Miriam Smith; Paul Coupland; Thomas D Otto; Simon R Harris; Thomas R Connor; Anna Bertoni; Harold P Swerdlow; Yong Gu
Journal: BMC Genomics Date: 2012-07-24 Impact factor: 3.969

6. Novel scripts for improved annotation and selection of variants from whole exome sequencing in cancer research.

Authors: Marcus Celik Hansen; Line Nederby; Anne Roug; Palle Villesen; Eigil Kjeldsen; Charlotte Guldborg Nyvold; Peter Hokland
Journal: MethodsX Date: 2015-03-12

7. Integrated genomic analyses identify frequent gene fusion events and VHL inactivation in gastrointestinal stromal tumors.

Authors: Guhyun Kang; Hongseok Yun; Choong-Hyun Sun; Inho Park; Seungmook Lee; Jekeun Kwon; Ingu Do; Min Eui Hong; Michael Van Vrancken; Jeeyun Lee; Joon Oh Park; Jeonghee Cho; Kyoung-Mee Kim; Tae Sung Sohn
Journal: Oncotarget Date: 2016-02-09

8. A comparative analysis of algorithms for somatic SNV detection in cancer.

Authors: Nicola D Roberts; R Daniel Kortschak; Wendy T Parker; Andreas W Schreiber; Susan Branford; Hamish S Scott; Garique Glonek; David L Adelson
Journal: Bioinformatics Date: 2013-07-09 Impact factor: 6.937

9. Combining calls from multiple somatic mutation-callers.

Authors: Su Yeon Kim; Laurent Jacob; Terence P Speed
Journal: BMC Bioinformatics Date: 2014-05-21 Impact factor: 3.169

10. Effective filtering strategies to improve data quality from population-based whole exome sequencing studies.

Authors: Andrew R Carson; Erin N Smith; Hiroko Matsui; Sigrid K Brækkan; Kristen Jepsen; John-Bjarne Hansen; Kelly A Frazer
Journal: BMC Bioinformatics Date: 2014-05-02 Impact factor: 3.169
