Sehyun Oh (1,2), Ludwig Geistlinger (1,2), Marcel Ramos (1,2), Martin Morgan (3), Levi Waldron (1,2), Markus Riester (4)
1. Graduate School of Public Health and Health Policy, City University of New York, New York, NY.
2. Institute for Implementation Science and Population Health, City University of New York, New York, NY.
3. Roswell Park Cancer Institute, Buffalo, NY.
4. Novartis Institutes for BioMedical Research, Cambridge, MA.
Abstract
PURPOSE: Allele-specific copy number alteration (CNA) analysis is essential to study the functional impact of single-nucleotide variants (SNVs) and the process of tumorigenesis. However, controversy over whether it can be performed with sufficient accuracy in data without matched normal profiles and a lack of open-source implementations have limited its application in clinical research and diagnosis. METHODS: We benchmark allele-specific CNA analysis performance of whole-exome sequencing (WES) data against gold standard whole-genome SNP6 microarray data and against WES data sets with matched normal samples. We provide a workflow based on the open-source PureCN R/Bioconductor package in conjunction with widely used variant-calling and copy number segmentation algorithms for allele-specific CNA analysis from WES without matched normals. This workflow further classifies SNVs by somatic status and then uses this information to infer somatic mutational signatures and tumor mutational burden (TMB). RESULTS: Application of our workflow to tumor-only WES data produces tumor purity and ploidy estimates that are highly concordant with estimates from SNP6 microarray data and matched normal WES data. The presence of cancer type-specific somatic mutational signatures was inferred with high accuracy. We also demonstrate high concordance of TMB between our tumor-only workflow and matched normal pipelines. CONCLUSION: The proposed workflow provides, to our knowledge, the only open-source option with demonstrated high accuracy for comprehensive allele-specific CNA analysis and SNV classification of tumor-only WES. An implementation of the workflow is available on the Terra Cloud platform of the Broad Institute (Cambridge, MA).
Copy number alterations (CNAs) are typically measured by the ratio of tumor to normal DNA abundance. However, tumor purity and ploidy affect this ratio and must be incorporated to infer absolute copy numbers.[1,2] Information from germline single-nucleotide polymorphisms (SNPs) further allows deconvolution of absolute copy number into the 2 parental copy numbers. This parental or allele-specific copy number provides a direct readout of loss of heterozygosity (LOH; when either the maternal or paternal copy is lost), which can indicate the complete loss of wild-type function when a somatic mutation in a putative tumor suppressor is identified.[3] Inferring allele-specific copy number is further crucial to understanding mutagenesis, allowing determination of clonality and timing of copy number changes at the same locus.[2,4,5]
Whole-exome sequencing (WES) and targeted panel sequencing have become routine applications in the clinic, providing comprehensive data while saving cost and scarce tumor tissue by eliminating the need for multiple single-analyte assays. Therefore, such comprehensive tests may aid treatment decision making by increasing the detection of actionable alterations, which includes point mutations and amplifications of oncogenes in targeted therapies, microsatellite instability (MSI), and tumor mutational burden (TMB) in immunotherapy.[6,7]
Key Objective
The current study explores the feasibility of tumor-only sequencing to determine various complex biomarkers beyond known driver mutations and provides open-source implementations.
Knowledge Generated
We demonstrate that sophisticated algorithms can, in many cases, minimize the need for sequencing matched normal specimens. Our workflow is available for download and in the Terra Cloud platform.
Relevance
Clinical tumor-only sequencing reduces time and cost over matched tumor and normal sequencing and enables analyses of the large number of specimens for which blood samples are unavailable.
Sequencing both tumor and matched normal specimens provides certain benefits over tumor-only sequencing, even in diagnostic settings where alterations of uncertain significance are usually ignored. For example, high-depth sequencing of blood samples can more reliably identify clonal hematopoiesis, hotspot mutations that arose in heme rather than in tumor cells.[8-10] Matched normal samples are also commonly required for existing algorithms to detect complex biomarkers such as MSI, TMB, or LOH. Obtaining comprehensive information from clinical tumor-only sequencing data could reduce time and cost, while enabling analyses of the large number of archived specimens for which blood samples are unavailable. However, the reliability of tumor-only sequencing is not well assessed,[11,12] and validated open-source analysis tools are lacking.
Without matched normal samples, it is necessary to distinguish algorithmically between somatic mutations and germline variants. Existing approaches commonly involve machine learning using public germline and somatic databases, in silico predictions of the functional impact of mutations, and allelic fractions (the ratios of nonreference to total sequencing reads) of mutations and their neighboring SNPs.[13,14] Recently developed tools additionally use allele-specific copy number, allowing the calculation of accurate posterior probabilities for all possible somatic and germline genotypes.[15-17] However, in the absence of complete workflows and thorough benchmarking, controversy has persisted over the reliability of tumor-only sequencing.[12]
We present a complete workflow, along with a Cloud-based implementation, for tumor-only hybrid-capture data. 
The workflow is based on our previously published tool PureCN.[15] We benchmark an improved version against gold standard data sets of matched normal WES and Affymetrix SNP6 microarrays (Affymetrix, Santa Clara, CA) and compare it to alternative recently published methods.[17,18] Using the ovarian carcinoma (OV) and lung adenocarcinoma (LUAD) data sets of The Cancer Genome Atlas (TCGA), which represent opposing extremes with respect to tumor purity, copy number heterogeneity, and TMB, we demonstrate high reliability of tumor-only analyses for inference of allele-specific copy number, identification of functional mutations, LOH, mutational signatures, and TMB.
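The dependence of the tumor-to-normal ratio on purity and ploidy can be illustrated with the standard mixture model used by purity and ploidy callers. The following is a minimal sketch, not the PureCN implementation; the function names are hypothetical, and the model assumes a diploid normal contaminant and a single clonal copy number per segment.

```python
import math


def expected_log2_ratio(purity, ploidy, copy_number):
    """Expected log2 tumor/normal coverage ratio of one segment.

    Assumes normal cells carry 2 copies and all tumor cells carry
    `copy_number` copies of the segment (no subclonality).
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    # Average copies per cell in the segment, mixing tumor and normal cells.
    segment_signal = purity * copy_number + 2.0 * (1.0 - purity)
    # Average copies per cell genome-wide, used for normalization.
    genome_signal = purity * ploidy + 2.0 * (1.0 - purity)
    if segment_signal <= 0.0:
        # Homozygous deletion in a pure tumor: no reads expected.
        return float("-inf")
    return math.log2(segment_signal / genome_signal)


def absolute_copy_number(log2_ratio, purity, ploidy):
    """Invert the model: absolute copy number from an observed log2 ratio."""
    ratio = 2.0 ** log2_ratio
    genome_signal = purity * ploidy + 2.0 * (1.0 - purity)
    return (ratio * genome_signal - 2.0 * (1.0 - purity)) / purity


# A single-copy loss in a 50% pure, diploid tumor.
print(round(expected_log2_ratio(0.5, 2.0, 1), 3))
```

Under this model, the same observed ratio maps to different absolute copy numbers for different purity and ploidy values, which is why both must be estimated before copy numbers can be interpreted.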
METHODS
Data Download
BAM files were downloaded through the GDC Data Transfer Tool using manifest files built by the GenomicDataCommons R/Bioconductor package.[19] The TCGAutils R/Bioconductor package[20] was used to annotate the manifest file: TCGAutils::UUIDtoBarcode for transferring universally unique identifiers to TCGA barcodes and TCGAutils::TCGAbiospec for extracting biospecimen data from TCGA barcodes. Capture kit information was obtained via the GDC API. BAM files mapping to multiple capture kits were excluded. BED files containing the locations of baits based on hg19 were lifted over to GRCh38 using hg19ToHg38 liftover chain file downloaded from the University of California, Santa Cruz Genome Browser.[21] None of the data analyzed in this study were used to develop or tune the algorithm or parameters and thus represent true validation sets.
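The biospecimen fields encoded in a TCGA barcode can be illustrated with a short parser. This is a hypothetical Python analog written for illustration only, not the TCGAutils::TCGAbiospec implementation; it assumes full-length aliquot barcodes and the usual TCGA convention that sample type codes 01 to 09 denote tumor and 10 to 19 denote normal.

```python
import re

# Full aliquot barcode, e.g. TCGA-02-0001-01C-01D-0182-01
BARCODE_PATTERN = re.compile(
    r"^TCGA-(?P<tss>[A-Z0-9]{2})-(?P<participant>[A-Z0-9]{4})"
    r"-(?P<sample>\d{2})(?P<vial>[A-Z])"
    r"-(?P<portion>\d{2})(?P<analyte>[A-Z])"
    r"-(?P<plate>[A-Z0-9]{4})-(?P<center>\d{2})$"
)


def parse_barcode(barcode):
    """Split a TCGA aliquot barcode into its biospecimen fields."""
    match = BARCODE_PATTERN.match(barcode)
    if match is None:
        raise ValueError(f"not a full TCGA aliquot barcode: {barcode!r}")
    fields = match.groupdict()
    code = int(fields["sample"])
    # Sample type ranges follow the TCGA code tables.
    if 1 <= code <= 9:
        fields["sample_class"] = "tumor"
    elif 10 <= code <= 19:
        fields["sample_class"] = "normal"
    else:
        fields["sample_class"] = "other"
    return fields


info = parse_barcode("TCGA-02-0001-01C-01D-0182-01")
print(info["participant"], info["sample_class"])
```

Grouping parsed barcodes by participant and sample class is one way to pair tumor and normal BAM files in a manifest.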
Data Processing
ABSOLUTE analysis of TCGA SNP6 microarray data has been described previously.[2,22,23] The manually curated ABSOLUTE output was obtained from Synapse[24] and lifted over to GRCh38. In addition to the PureCN-based[15] workflow described in detail in the Appendix, we applied the FACETS 0.5.6[18] copy number tool to all samples. Tumor and normal BAM file pairs were processed by snp-pileup with the parameters -g -q15 -Q20 -P100 -r25,0. The resulting pileups were imported using readSnpMatrix and further processed by preProcSample, procSample with cval = 150, and emcncf.
Single-nucleotide variants were called with Mutect 1.1.7[25] (Appendix). SGZ 1.0.0[17] was used to classify the mutation calls as somatic or germline (Appendix). Variants labeled “germline,” “probable germline,” “somatic,” “probable somatic,” or “somatic subclonal” by SGZ were considered called, and all others were considered uncalled. Finally, we applied deconstructSigs[26] to identify the 30 mutational signatures[27] curated by the Wellcome Trust Sanger Institute[28] (Appendix).
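The called versus uncalled rule for SGZ labels translates directly into a call rate per sample. The sketch below is illustrative; the function name is hypothetical, and the example labels other than the five listed above are placeholders for whatever uncalled labels a sample contains.

```python
# Labels treated as "called", taken from the text above.
CALLED_LABELS = frozenset({
    "germline",
    "probable germline",
    "somatic",
    "probable somatic",
    "somatic subclonal",
})


def call_rate(labels):
    """Fraction of variants whose label counts as called."""
    labels = [label.strip().lower() for label in labels]
    if not labels:
        raise ValueError("no variants to evaluate")
    called = sum(label in CALLED_LABELS for label in labels)
    return called / len(labels)


# "ambiguous" stands in for any label outside the called set.
example = ["somatic", "ambiguous", "probable germline", "ambiguous"]
print(call_rate(example))
```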
Statement of Reproducible Research
Analyses presented in this article are reproducible using the code and instructions available through GitHub.[29]
RESULTS
Reliable analysis of clinical tumor-only sequencing data involves multiple nontrivial steps that are distinct from the analysis of matched tumor and normal sequencing. Here, we describe and benchmark a detailed workflow for hybrid-capture tumor-only sequencing data including variant calling, coverage normalization for copy number calling, purity and ploidy inference, and classification of variants by somatic status (Appendix Fig A1).
FIG A1.
Copy number alterations analysis workflow. Raw input data files and the intermediate/processed data files are depicted as blue and gray oval shapes, respectively. R scripts provided by PureCN are depicted by rose squares, and third-party tools are depicted by gray squares. Gray solid lines indicate how the target region information is processed. Blue and red solid lines describe how normal and tumor BAM files are processed, respectively. Dashed and solid teal lines show how germline single-nucleotide polymorphisms and somatic mutations were prepared with or without matched normal, respectively.
Tumor Purity and Ploidy Inference
We selected OV and LUAD WES data from TCGA as complementary, representative data sets for our benchmarking study.[30,31] Among the TCGA data sets, OV shows the highest tumor purity as a result of the availability of large surgical specimens. High purity complicates somatic versus germline classification because of the overlapping distributions of expected allelic fractions. The LUAD data set, obtained by core needle biopsies, in contrast, ranks among the lowest in tumor purity, presenting a different challenge for copy number calling because of the dilution of signal.[15,17] LUAD is additionally challenging because of increased copy number heterogeneity.[2,22] Subclonal copy number changes increase the number of copy states, making ploidy inference often ambiguous.[2,32]
We first compared maximum likelihood purity and ploidy estimates from our workflow using tumor WES with those from manually curated ABSOLUTE SNP6 microarray calls (Figs 1A to 1D, Data Supplement, Appendix). We analyzed 233 OV and 442 LUAD samples and found a high correlation of microarray and WES results for tumor purity (Pearson correlation, r = 0.75 and r = 0.84 for OV and LUAD, respectively) and tumor ploidy for OV (87.1% concordant, defined as ploidy difference < 0.5; Pearson correlation, r = 0.73). Note that since SNP6 and WES data were generated from different tissue slides, a perfect correlation of purity is not expected, whereas ploidy should be in general similar. Ploidy estimates for LUAD were also concordant in the majority of samples (77.1% concordant; Pearson correlation, r = 0.57). In addition, we applied FACETS, a widely used allele-specific CNA analysis tool for tumor and matched normal sequencing,[18] to both OV and LUAD paired WES data (Appendix Fig A2). For 68.9% of all samples, all 3 tools generated concordant purity and ploidy calls (Appendix Fig A3). 
For OV, PureCN showed a higher ploidy concordance with ABSOLUTE than FACETS (87.1% v 73.8%, respectively), whereas for LUAD, its concordance was slightly lower (77.1% v 79.6%, respectively). Samples of discordant ploidy, compared with concordant samples, had lower purity (39.2% v 54.3%, respectively; 2-sided Mann-Whitney, P < .0001) and lower mean coverage (100.4× v 107.2×; 2-sided Mann-Whitney, P = .03).
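The concordance criteria used in this comparison (purity difference < 0.1 and ploidy difference < 0.5) and the Pearson correlation can be computed as follows. This is a minimal sketch with hypothetical function names and made-up example values; it is not the analysis code of this study.

```python
import math


def is_concordant(est_a, est_b, max_purity_diff=0.1, max_ploidy_diff=0.5):
    """Compare two (purity, ploidy) estimates for the same sample."""
    purity_ok = abs(est_a[0] - est_b[0]) < max_purity_diff
    ploidy_ok = abs(est_a[1] - est_b[1]) < max_ploidy_diff
    return purity_ok and ploidy_ok


def concordance_rate(pairs):
    """Fraction of samples with concordant purity and ploidy."""
    if not pairs:
        raise ValueError("no sample pairs given")
    return sum(is_concordant(a, b) for a, b in pairs) / len(pairs)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative (array, WES) estimates; the values are invented.
pairs = [((0.62, 2.1), (0.55, 2.0)), ((0.40, 3.9), (0.45, 2.0))]
print(concordance_rate(pairs))
```

The second example pair illustrates a typical discordance: the purity estimates agree, but one method selects a near-tetraploid solution and the other a diploid one.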
FIG 1.
Accuracy of purity, ploidy, and exome-wide copy number inference. (A-D) Comparison of purity and ploidy estimates from paired SNP6 microarray data (ABSOLUTE[2]) against those from tumor-only whole-exome sequencing (WES) data (PureCN) in ovarian cancer (OV) and lung adenocarcinoma (LUAD) samples. (E-F) Shown are concordances of the major and minor allele copy numbers of ABSOLUTE copy number alteration (CNA) calls with the corresponding tumor-only WES PureCN calls for all altered regions where both tools could make a call. Bubbles on the diagonal represent concordant calls. States where the minor copy number is 0 (1-0, 2-0, 3-0, 4-0) are regions in loss of heterozygosity (LOH). (G-H) Concordance of LOH calls between the 2 analyses was further reviewed on HLA-A/B/C and TP53 loci for the cases with sufficient power to detect LOH: LOH observed in both microarray and WES analyses (both, dark red); absent in both analyses (neither, orange); detected only from microarray data (SNP6 array, dark blue); and detected only from WES data (tumor WES, light blue). TCGA, The Cancer Genome Atlas.
FIG A2.
Purity and ploidy estimates using an alternative tool. Purity and ploidy estimates from paired whole-exome sequencing (WES) data were obtained using FACETS. As in Figure 1, 233 ovarian cancer (OV) and 442 lung adenocarcinoma (LUAD) samples were analyzed and compared with ABSOLUTE calls. (A) Purity and (B) ploidy estimates of OV. (C) Purity (436 cases are plotted because FACETS did not return a purity estimate for 6 of the LUAD samples as a result of insufficient information) and (D) ploidy estimates of LUAD. TCGA, The Cancer Genome Atlas.
FIG A3.
Concordance of PureCN and FACETS with ABSOLUTE. From 233 ovarian cancer (OV) and 436 lung adenocarcinoma (LUAD) cases, concordance was calculated of whole-exome sequencing (WES)–based estimates from PureCN and FACETS with SNP6 array-based ABSOLUTE calls. Concordance was defined as a purity difference < 0.1 and a ploidy difference < 0.5. Estimates agreed by all 3 methods (both, orange); agreed by ABSOLUTE and PureCN only (PureCN, red); or agreed by ABSOLUTE and FACETS only (FACETS, purple); or neither PureCN nor FACETS agreed with ABSOLUTE (neither, blue).
Allele-Specific Copy Number and LOH
We further analyzed the accuracy of allele-specific copy number analysis by comparing ABSOLUTE calls from SNP6 data with the corresponding numbers called by PureCN on WES data. We restricted our comparison to the samples with concordant ploidy calls and tumor purity > 30% and demonstrated a high concordance of copy number calls (Figs 1E and 1F).
In an LOH event, the minor copy number is by definition 0; LOH calling is thus a special case of allele-specific copy number calling. We examined 2 specific loci of main clinical interest, HLA-A/B/C and TP53, in more detail. TP53 is lost most frequently in ovarian cancer, and HLA LOH is the major interest in immunotherapy.[33,34] HLA and TP53 loci were investigated in 143 and 223 OV cases, respectively, where both tumor-only WES and SNP6 array made LOH calls (Data Supplement). For LUAD, the same comparison was done in 298 and 332 samples for HLA and TP53 loci, respectively. In OV, the mean agreement in LOH status between tumor-only WES and SNP6 microarray was 94.2% for HLA and 99.6% for TP53 (Fig 1G). In LUAD, it was 91.0% for HLA and 95.5% for TP53 (Fig 1H), with the discordant samples showing low purity (average of 30.9% v 43.3% tumor purity for discordant v concordant samples, respectively; 2-sided Mann-Whitney, P < .0005).
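Because LOH follows from the allele-specific copy number state, agreement between two platforms at a locus can be computed from (major, minor) copy number pairs. The sketch below uses hypothetical function names and invented calls; it assumes that homozygous deletions (0-0) are not counted as LOH, in line with the LOH states listed in Figure 1 (1-0, 2-0, 3-0, 4-0), and that a missing call is represented by None.

```python
def is_loh(major, minor):
    """LOH: one parental copy is lost while the other is retained."""
    return minor == 0 and major > 0


def loh_agreement(calls_a, calls_b):
    """Fraction of samples with the same LOH status in both call sets.

    Each argument maps sample id -> (major, minor), or None when the
    method could not make a call. Only samples called by both methods
    are compared.
    """
    shared = [
        sample for sample in calls_a
        if calls_a[sample] is not None and calls_b.get(sample) is not None
    ]
    if not shared:
        raise ValueError("no samples called by both methods")
    agree = sum(
        is_loh(*calls_a[sample]) == is_loh(*calls_b[sample])
        for sample in shared
    )
    return agree / len(shared)


# Invented example calls at one locus.
array_calls = {"s1": (2, 0), "s2": (1, 1), "s3": (3, 0), "s4": None}
wes_calls = {"s1": (1, 0), "s2": (2, 1), "s3": (2, 1), "s4": (1, 0)}
print(loh_agreement(array_calls, wes_calls))
```

Note that samples s1 in the example agree on LOH status even though the major copy numbers differ, which is why LOH agreement can exceed exact copy number concordance.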
Classification of Variants by Somatic Status
We next evaluated the somatic status predictions of variants not found in public germline databases. We first compared predictions against a simple model that uses only allelic fractions. This essentially compared the performance of commonly used ad hoc allelic fraction filters such as 0.4 against our model that adjusts allelic fractions for allele-specific copy number. We observed a significant improvement over this simple model in tumors with purity > 30% (Figs 2A and 2B, Data Supplement). At tumor purity < 30%, inclusion of copy number does not provide a benefit for classification because of the large difference in expected allelic fractions of germline and somatic variants. A small number of cases were observed in which the simple model performed slightly better in terms of area under the curve; these were mainly cases with small numbers of CNAs. However, the complex model still provides a benefit in that it returns a probability.
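The effect of purity on classification can be seen from the expected allelic fractions of germline and somatic variants. The following sketch uses the standard mixture formula for a variant present on a given number of tumor chromosome copies; the function name is hypothetical, and the model assumes a heterozygous germline variant, a diploid normal contaminant, and no sequencing bias. It is not the PureCN likelihood model, which additionally accounts for uncertainty.

```python
def expected_allelic_fraction(purity, total_cn, multiplicity, germline):
    """Expected fraction of reads supporting the variant allele.

    purity       -- fraction of tumor cells in the sample
    total_cn     -- total tumor copy number at the locus
    multiplicity -- number of tumor copies carrying the variant
    germline     -- True for a heterozygous germline variant, which
                    is also present on 1 of 2 copies in normal cells
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be in [0, 1]")
    if not 0 <= multiplicity <= total_cn:
        raise ValueError("multiplicity must be between 0 and total_cn")
    normal_variant_copies = 1 if germline else 0
    variant = purity * multiplicity + normal_variant_copies * (1.0 - purity)
    total = purity * total_cn + 2.0 * (1.0 - purity)
    return variant / total


# Diploid locus, variant on one tumor copy, at high and low purity.
for purity in (0.9, 0.2):
    somatic = expected_allelic_fraction(purity, 2, 1, germline=False)
    het_snp = expected_allelic_fraction(purity, 2, 1, germline=True)
    print(purity, round(somatic, 3), round(het_snp, 3))
```

At 90% purity the two expected fractions lie close together, so a fixed allelic fraction cutoff separates them poorly, whereas at 20% purity they are far apart and copy number adds little information.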
FIG 2.
Accuracy of variant classification. (A-B) Gain in area under the curve (AUC) of the somatic status prediction by PureCN over a model that only uses allelic fractions, shown as a function of tumor purity. (C-D) Correlation of tumor purity and call rates in ovarian cancer (OV) and lung adenocarcinoma (LUAD) for PureCN (red) and SGZ (teal).[17] (E-H) Histograms of accuracy rates for all samples. These are the fractions of variants correctly called as somatic (orange) or germline (blue). TCGA, The Cancer Genome Atlas.
We then examined how many variants can be classified as either germline or somatic with reasonable certainty (Data Supplement). As expected, this call rate was largely a function of tumor purity (Figs 2C and 2D). Increasing sequencing coverage also increased these rates (Appendix Figs A4A and A4B). Somatic variants were classified with higher median accuracy than germline variants (96.1% v 88.1%, respectively, in OV; and 97.2% v 96.6%, respectively, in LUAD; Figs 2E and 2F). This is also expected because the somatic group includes subclonal mutations, which are usually easier to classify than monoclonal mutations because of their lower allelic fractions and therefore higher allelic fraction difference compared with germline. We observed a similar median somatic and germline accuracy using SGZ (94.0% and 88.9%, respectively, in OV; and 98.4% and 97.3%, respectively, in LUAD; Figs 2G and 2H),[17] but with lower median call rates (39.5% and 59.5% for OV and LUAD, respectively, for SGZ v 64.4% and 82.2%, respectively, for PureCN).
FIG A4.
Correlation of call rates and median sequencing coverage. Median coverage is plotted against call rate for different purity ranges. LUAD, lung adenocarcinoma; OV, ovarian cancer.
We further investigated the ability to detect functionally important mutations using a driver detection algorithm.[35] Well-defined LUAD driver genes such as TP53, KRAS, KEAP1, and STK11 were called in both tumor-only and paired analyses. We observed a small number of false-positive hits from sequencing artifacts that the matched normals, but not the pool of normals, filtered out (Data Supplement).
TMB
We next sought to investigate the accuracy of the variant classification for determining TMB (Appendix). From the comparison of tumor-only and paired analysis modes, we found a high concordance (Pearson correlation, r = 0.98) and good calibration of somatic mutation rates per megabase in both OV and LUAD (Fig 3A; Data Supplement). The mean absolute difference in somatic mutation rates between the matched and tumor-only pipelines was 0.60 mutations/Mb for OV and 1.74 mutations/Mb for LUAD. A simplified pipeline that removed variants with allelic fraction > 0.4 and was otherwise identical showed differences of 0.9 mutations/Mb for OV and 1.80 mutations/Mb for LUAD compared with the matched pipeline (Data Supplement).
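The TMB comparison can be sketched in a few lines. The function names and the example numbers below are hypothetical; the sketch assumes that TMB is the number of somatic mutations divided by the callable target size in megabases, and it represents the simplified pipeline as a plain allelic fraction filter.

```python
def tmb_per_mb(n_somatic_mutations, callable_bases):
    """Somatic mutations per megabase of callable sequence."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return n_somatic_mutations / (callable_bases / 1e6)


def mean_absolute_difference(xs, ys):
    """Mean absolute difference between paired per-sample values."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and equally long")
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def drop_high_allelic_fraction(variants, max_af=0.4):
    """Simplified filter: remove variants with allelic fraction > max_af."""
    return [variant for variant in variants if variant["af"] <= max_af]


# Invented per-sample TMB values from a matched and a tumor-only run.
matched = [3.0, 10.0]
tumor_only = [3.5, 9.0]
print(mean_absolute_difference(matched, tumor_only))
```

A fixed allelic fraction filter removes clonal somatic mutations in high-purity tumors along with germline variants, which is one reason a copy number aware classifier can be better calibrated.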
FIG 3.
Tumor mutational burden (TMB) and mutational signatures. (A) TMB from ovarian cancer (OV; red) and lung adenocarcinoma (LUAD; teal) samples in tumor-only versus paired modes shown on a log scale. (B-C) Concordance of COSMIC mutational signatures (Sig.) between tumor-only and paired analysis modes: mutational signatures observed in both tumor-only and paired modes of analysis (both, red); absent in both analyses (neither, orange); detected only from tumor-only analysis (tumor only, dark blue); or detected only from paired mode of analysis (paired, light blue). TCGA, The Cancer Genome Atlas.
Mutational Signatures
To further evaluate the clinical utility of our workflow, we assessed the accuracy of mutational signature identification[36] from tumor WES data with or without a matched normal profile. Among 30 validated mutational signatures, we investigated the 2 OV-associated mutational signatures with known etiology in detail (Fig 3B). Signature 1 has been found in all cancer types and is linked to aging. Signature 3 is associated with homologous repair deficiency, a potential biomarker for PARP inhibition in ovarian cancer.[37] We obtained a high agreement for mutational signature calls from tumor-only and paired analyses (77.5% for signature 1 and 88.1% for signature 3), confirming that our workflow can detect mutational signatures without a matched normal profile even in high-purity samples.
In LUAD data, we again reproduced previously associated signatures of known etiology. In addition to the aging signature 1, we found a significant fraction of samples dominated by the APOBEC (signature 2 and signature 13), tobacco (signature 4), and DNA mismatch repair deficiency (signature 6) signatures (Fig 3C, Data Supplement). We observed high agreement between tumor-only and matched normal data for all these signatures (79.3%, 94.8%, 95.7%, and 75.5% for signatures 1, combined 2 and 13, 4, and 6, respectively).
Terra Pipeline
The described workflow is available as a shareable workspace on Terra (formerly known as FireCloud) of the Broad Institute (Cambridge, MA; Appendix). Users can thus easily test the workflow and apply it to their own data stored in Google Cloud Storage (Google, Mountain View, CA) or to data already hosted by Terra, such as TCGA.
DISCUSSION
We present a complete workflow for reliable analysis of clinical tumor-only WES data without matched normal samples. This workflow is validated on OV and LUAD data from TCGA and benchmarked against a gold standard, manually curated analysis of SNP6 microarray data with matched normals. Our workflow estimates tumor purity, ploidy, LOH, TMB, and mutational signatures with high concordance to established workflows for SNP6 and WES data with paired tumor and normal samples.
TMB is an emerging biomarker for response to immunotherapy,[38-41] but the current lack of standards significantly challenges implementing TMB testing in the clinic.[42] To our knowledge, this is the first thoroughly validated open-source, tumor-only TMB pipeline. This open-source reference implementation will help establish standards for TMB calling and support its implementation in standard clinical settings where tumor-only WES is performed.
Although high tumor purity challenges somatic status classification (Figs 2C and 2D), the proposed approach to determining clinically relevant biomarkers such as TMB and somatic signatures is surprisingly robust to varying tumor purity (Fig 3). Notably, signatures of clear etiology such as homologous repair deficiency, APOBEC, or smoking had a significantly higher concordance with matched analyses than broader and less certain signatures, such as those associated with aging. In contrast, we also note that high tumor purity is beneficial for LOH and copy number calling. Still, all parts of the workflow achieved high accuracy in tumors of 40%-60% purity, the range in which most clinical tissue specimens fall.
Increasing sequencing coverage increases the accuracy of somatic status classification[16] and ploidy inference. The results presented in this study are based on relatively low-coverage WES sequencing to an average of 100×. The substantial improvements in sequencing costs and runtimes of current-generation instruments such as Illumina NovaSeq (Illumina, San Diego, CA) make much deeper sequencing of WES feasible. Therefore, we expect accuracies reported here to be pessimistic estimates for assays implemented in the clinic.
The average runtime of a WES sample was approximately 3 hours and required 3.5 GB of RAM on an Intel Xeon E5-2680 v4 cluster node (Intel, Santa Clara, CA). Parallelization could reduce the runtime to approximately 30 minutes per sample, making application in high-throughput clinical settings feasible. The runtime is an order of magnitude longer than that of matched tumor and normal allele-specific CNA callers.[18] These tools usually average coverage and SNP allelic fractions across segments in their likelihood models to reduce data points dramatically. Because the germline status of variants is not available without a matched normal, PureCN in contrast includes this uncertainty in the likelihood model, resulting in the longer runtime.
This study has several limitations. First, we focused on benchmarking our tumor-only workflow where it differs from standard matched tumor and normal analyses. A systematic evaluation of accuracy for the variant calling steps upstream of this workflow is beyond the scope of this study.[43,44] Second, our workflow is currently not designed for whole-genome sequencing (WGS) data. In contrast to gold standard WGS tools, PureCN was designed for high-coverage data (> 100×) and currently does not use information largely unavailable in hybrid-capture data such as split reads or SNP phasing. These would be straightforward additions once high-coverage diagnostic WGS becomes common in oncology. However, support for WGS would likely require the implementation of additional heuristics to achieve acceptable runtimes, for example by averaging information in noncoding regions. Third, as with allele-specific CNA calling in matched tumor and normal data, purity and ploidy inference can be ambiguous in a minority of cases of low purity or of high heterogeneity. Therefore, our pipeline provides tools that allow manual correction of results by trained curators, described in the documentation of the PureCN package. Importantly, the accuracy of TMB calling was robust even to inaccuracies in ploidy, partly because different ploidy solutions can be equivalent for variant classification.[17] Fourth, all samples in this study originated from high-quality fresh frozen samples from only 2 cancer types, and only limited benchmarking on formalin-fixed paraffin-embedded samples was previously done.[15] Cancer types that have proven to be difficult to analyze with ABSOLUTE (eg, chromosomally stable samples from patients with myeloproliferative disease[2]) are expected to be similarly challenging with PureCN. Finally, reliable labeling of clonal hematopoiesis from tumor-only or low-coverage matched normal sequencing remains a shortcoming but is an area of research we are currently pursuing.
As a result of the high concordance with matched tumor and normal sequencing, the proposed workflow supports the clinical application of tumor-only sequencing, especially in diagnostic settings. Furthermore, implementation of the workflow on Terra will enable users, even those with no coding experience, to process their own data in the Cloud.