Kanika Arora1, Minita Shah1, Molly Johnson1, Rashesh Sanghvi1, Jennifer Shelton1, Kshithija Nagulapalli1, Dayna M Oschwald1, Michael C Zody1, Soren Germer1, Vaidehi Jobanputra1, Jade Carter1, Nicolas Robine2.
Abstract
To test the performance of a new sequencing platform, develop an updated somatic calling pipeline and establish a reference for future benchmarking experiments, we performed whole-genome sequencing of 3 common cancer cell lines (COLO-829, HCC-1143 and HCC-1187) along with their matched normal cell lines to great sequencing depths (up to 278x coverage) on both Illumina HiSeqX and NovaSeq sequencing instruments. Somatic calling was generally consistent between the two platforms despite minor differences at the read level. We designed and implemented a novel pipeline for the analysis of tumor-normal samples, using multiple variant callers. We show that, coupled with a high-confidence filtering strategy, the use of a combination of tools improves the accuracy of somatic variant calling. We also demonstrate the utility of the dataset by creating an artificial purity ladder to evaluate the somatic pipeline and benchmark methods for estimating purity and ploidy from tumor-normal pairs. The data and results of the pipeline are made accessible to the cancer genomics community.
Year: 2019 PMID: 31836783 PMCID: PMC6911065 DOI: 10.1038/s41598-019-55636-3
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Homopolymer length and base mismatch comparisons between HiSeqX and NovaSeq. (A) Distribution of the length of the longest stretch of a single nucleotide in HiSeqX and NovaSeq Read 1 and Read 2 FASTQ files. Each dot represents the fraction of reads in a single FASTQ file. The fraction of the total number of reads is shown on a log scale. (B) Single nucleotide mismatches by type in samples sequenced on NovaSeq and HiSeqX, with mapping quality (MQ) ≥10 and base quality (BQ) ≥10 cut-offs. Each bar represents a single sample and is colored by sequencing platform. (C) Average mismatch rates for bases with MQ≥10 and BQ≥10 across the 6 cell line samples for each mismatch type per trinucleotide for HiSeqX (top row), NovaSeq (middle row) and the difference between HiSeqX and NovaSeq (bottom row). (D) Same as (C), but with mismatch type categories collapsed with their respective reverse complements.
Figure 2. Intra- and inter-platform comparison of somatic variants. (A) Comparison of SNVs, Indels and structural variants between two replicates of COLO-829 NovaSeq data (created using reads from mutually exclusive lanes) and between HiSeqX and NovaSeq data for the three cell lines. Orange bars (resp. purple) represent the number of variants called uniquely in the NovaSeq runs (resp. HiSeqX) and the grey bars correspond to the variants called in both samples. The numbers in the grey bars represent the concordance between the two samples, calculated as the percentage (number of variants in the intersect)/(number of variants in the union). (B) Allele frequency of the variants called only in HiSeqX in purple, and for reference the allele frequency of variants called in both platforms in grey. (C) The decomposition in trinucleotide contexts of the SNVs called uniquely by each platform. Substitutions are represented by the pyrimidine of the mutated Watson-Crick base pair. (D) Similar to panel B but for variants uniquely called in NovaSeq. The AllSomatic callsets were used for panels B, C and D.
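The concordance figure described in panel A of Figure 2 is the intersect divided by the union of two callsets, expressed as a percentage. A minimal sketch of that calculation follows. It assumes each variant is reduced to a hashable key and that two calls match only when their keys are identical; the `(chrom, pos, ref, alt)` key, the function name `concordance` and the example calls are hypothetical, and the matching rules actually used for the figure (in particular for structural variants) are not stated in this record.

```python
def concordance(calls_a, calls_b):
    """Return the percent concordance |A ∩ B| / |A ∪ B| * 100 of two callsets.

    Assumption: variants are compared by exact key equality.
    """
    a, b = set(calls_a), set(calls_b)
    union = a | b
    if not union:
        # No calls in either set: concordance is undefined; report 0.0 here.
        return 0.0
    return 100.0 * len(a & b) / len(union)


# Hypothetical variant keys: (chromosome, position, reference allele, alternate allele)
novaseq = {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"), ("chr2", 50, "C", "T")}
hiseqx = {("chr1", 100, "A", "T"), ("chr2", 50, "C", "T"), ("chr3", 75, "G", "A")}

# 2 shared calls out of 4 distinct calls in total
print(concordance(novaseq, hiseqx))  # 50.0
```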
Figure 3. Comparison of somatic variants called on HiSeqX and NovaSeq COLO-829 tumor/normal data downsampled to 80X/40X to the Craig et al. reference dataset.
Figure 4. Precision, recall and F1 scores at different simulated purities for (A) SNVs (left), Indels (center) and SVs (right), and (B) CNVs without (left) and with (right) adjustments of log2 values for purity and ploidy. (C) Ploidy and (D) purity estimation for the purity ladder samples using CELLULOID and HATCHet in single-sample and multi-sample mode.
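Precision, recall and F1, the metrics reported in Figure 4, follow their standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN), and F1 is the harmonic mean of the two. The sketch below computes them for a callset against a truth set. It assumes exact-match variant keys; the function name `precision_recall_f1` and the example data are hypothetical, and the comparison rules used for the figure (for example, breakpoint tolerance for SVs or segment overlap for CNVs) are not given in this record.

```python
def precision_recall_f1(called, truth):
    """Return (precision, recall, F1) of a callset against a truth set.

    Assumption: a call is a true positive only if its key is in the truth set.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # calls present in the truth set
    fp = len(called - truth)   # calls absent from the truth set
    fn = len(truth - called)   # truth variants that were not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical data: 4 truth variants, 3 calls, 2 of which are correct
truth_set = {("chr1", 10), ("chr1", 20), ("chr2", 30), ("chr2", 40)}
call_set = {("chr1", 10), ("chr2", 30), ("chr3", 99)}

p, r, f1 = precision_recall_f1(call_set, truth_set)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.5 0.571
```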
Figure 5. NYGC Somatic Pipeline for tumor-normal whole-genome sequencing samples.