Erin N Smith, Kristen Jepsen, Mahdieh Khosroheidari, Laura Z Rassenti, Matteo D'Antonio, Emanuela M Ghia, Dennis A Carson, Catriona HM Jamieson, Thomas J Kipps, Kelly A Frazer.
Abstract
Accurate allele frequencies are important for measuring subclonal heterogeneity and clonal evolution. Deep-targeted sequencing data can contain PCR duplicates, inflating perceived read depth. Here we adapted the Illumina TruSeq Custom Amplicon kit to include single molecule tagging (SMT) and show that SMT-identified duplicates arise from PCR. We demonstrate that retention of PCR duplicate reads can imply clonal evolution when none exists, while their removal effectively controls the false positive rate. Additionally, PCR duplicates alter estimates of subclonal heterogeneity in tumor samples. Our method simplifies PCR duplicate identification and emphasizes their removal in studies of tumor heterogeneity and clonal evolution.
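The idea of SMT-based duplicate identification can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes each read has already been reduced to a record with hypothetical keys `sample`, `target`, `smt`, and `read_id`, and it treats reads that share a sample, target, and SMT sequence as copies of one original molecule.

```python
from collections import defaultdict

def collapse_by_smt(reads):
    """Group reads by (sample, target, SMT) and keep one read per group.

    `reads` is a list of dicts with the hypothetical keys
    'sample', 'target', 'smt' and 'read_id'.
    Returns (unique_reads, clusters, duplicate_rate).
    """
    reads = list(reads)
    clusters = defaultdict(list)
    for read in reads:
        key = (read["sample"], read["target"], read["smt"])
        clusters[key].append(read)
    # Keep the first read of each cluster as its representative.
    unique_reads = [group[0] for group in clusters.values()]
    # Fraction of reads that are extra copies of an already seen molecule.
    duplicate_rate = 1 - len(unique_reads) / len(reads) if reads else 0.0
    return unique_reads, clusters, duplicate_rate

example_reads = [
    {"sample": "S1", "target": "T1", "smt": "ACGTACGTACGT", "read_id": "r1"},
    {"sample": "S1", "target": "T1", "smt": "ACGTACGTACGT", "read_id": "r2"},
    {"sample": "S1", "target": "T1", "smt": "TTTTGGGGCCCC", "read_id": "r3"},
    {"sample": "S1", "target": "T2", "smt": "ACGTACGTACGT", "read_id": "r4"},
]
unique_reads, clusters, duplicate_rate = collapse_by_smt(example_reads)
```

Keeping the first read of each cluster is an arbitrary choice made for brevity; a real workflow might instead pick the highest-quality read or build a consensus.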
Year: 2014 PMID: 25103687 PMCID: PMC4165357 DOI: 10.1186/s13059-014-0420-4
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Adaptation of Illumina TruSeq Custom Amplicon Kit to allow for single molecule tagging. (A) Schematic of method showing amplification of target DNA using custom probes and flanking primers. The P5-SMT primer is the same as the standard P5 index primer, but contains a degenerate 12 N-mer sequence in place of the index. The incorporation of an Ampure Bead size selection step after two rounds of PCR removes unused P5-SMT, and the P5 primer is added to facilitate downstream amplification. Figure schematic is adapted from Illumina promotional material. (B) Stacked barplot showing number of paired-end reads, split into unique reads (dark grey) and SMT-identified duplicate reads (light grey) in 24 samples (18 germline, 6 tumor). (C) For each of 18 germline samples, we show the number of SMTs by duplicate cluster size (the number of times that an SMT was observed at a given target within a sample). Higher overall duplicate rates within a sample were associated with larger duplicate clusters. Except for the sample with the highest duplicate rate, duplicate cluster sizes were generally less than 10. The length of the SMT (8- versus 12-mer) did not affect the distribution. SMT, single molecule tag.
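The duplicate cluster size described in panel (C) is the number of times an SMT is observed at a given target within a sample. A small sketch of how such a distribution could be tabulated, assuming the observations are available as hypothetical `(sample, target, smt)` tuples, one per read:

```python
from collections import Counter

def cluster_size_histogram(observations):
    """Map cluster size -> number of SMTs with that many observations.

    `observations` holds one (sample, target, smt) tuple per read;
    this input layout is an assumption made for illustration.
    """
    # Count how often each SMT occurs at each target within each sample.
    per_tag = Counter(observations)
    # Count how many SMTs share each cluster size.
    return Counter(per_tag.values())

observations = [
    ("S1", "T1", "AAAACCCC"),
    ("S1", "T1", "AAAACCCC"),
    ("S1", "T1", "GGGGTTTT"),
    ("S1", "T2", "AAAACCCC"),
]
histogram = cluster_size_histogram(observations)
```

Here a cluster of size 1 is a unique read, and any larger cluster contains SMT-identified duplicates.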
Figure 2. Single molecule tag-identified duplicates represent PCR duplicates. (A) Boxplot of target duplicate rate as a function of DNA input. DNA samples from two tumors were diluted and single molecule tag (SMT)-libraries were prepared. Duplicate rate, as identified by SMT sequence, was higher in the lower input DNA samples, consistent with lower starting complexity. (B) Smoothed scatterplot of the relationship between target duplicate rate, adjusted for sample-specific effects and GC content of insert and primers, and depth of coverage (shown on a log scale). (C) Barplot of the number of times that independent SMTs (one SMT per target per sample) were seen across targets and samples (black) compared to expectations by Poisson sampling (grey). (D) Motif identified using the 54 SMT sequences observed 10 or more times across all 14 12-mer germline samples and targets. (E) Agreement of allele calls within duplicate clusters for 12-mer and 8-mer SMT sequences. Percent allele call agreement is the percentage of duplicate clusters where all allele calls at the SNP of interest were consistent.
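The caption does not spell out how the Poisson expectation in panel (C) was derived. One plausible reading, shown here purely as an assumption, is that every possible tag of length k is equally likely, so that with N independent SMT observations each tag is seen a Poisson-distributed number of times with mean N / 4^k. The function name and parameters below are hypothetical.

```python
import math

def expected_tag_counts(n_observations, tag_length, max_times=5):
    """Expected number of distinct tags seen exactly m times (m = 1..max_times).

    Assumes all 4**tag_length tags are equally likely, so each tag's
    count is approximately Poisson with mean n_observations / 4**tag_length.
    """
    n_tags = 4 ** tag_length
    lam = n_observations / n_tags
    return {
        m: n_tags * math.exp(-lam) * lam ** m / math.factorial(m)
        for m in range(1, max_times + 1)
    }

# Size of the tag space for the two SMT lengths mentioned in the captions.
tags_8mer = 4 ** 8
tags_12mer = 4 ** 12

expected = expected_tag_counts(n_observations=1000, tag_length=4, max_times=50)
total_observations = sum(m * count for m, count in expected.items())
```

Under this uniform model, tags observed far more often than expected would point to sequence bias in the degenerate primer, which is the kind of signal a motif analysis such as panel (D) would pick up.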
Figure 3. PCR duplicates inflate the false positive rate of differences between samples and alter measures of clonal heterogeneity. (A) False positive rate (FPR) for tests at heterozygous single nucleotide polymorphisms (SNPs) between groups of reads randomly allocated from the same sample with a varying percentage of simulated duplicates. Dotted line indicates FPR = 0.05. (B) Boxplot indicating higher variability in alternate allele frequency at heterozygous SNPs as duplication rate increases. (C) Increase in FPR for tests at heterozygous SNPs when single molecule tag-duplicates are present (black) or removed (grey). Dotted line indicates FPR = 0.05. (D) Estimates of the number of genetic clusters in two tumor samples (in red and blue, respectively) become variable as duplicate rate increases. The number of clusters was calculated using PyClone.
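The effect in panels (A) and (B) can be reproduced in spirit with a toy simulation. The sketch below is not the authors' procedure: it assumes a heterozygous SNP with a true alternate allele frequency of 0.5, models duplicates as copies of randomly chosen unique molecules, and compares two groups with a pooled two-proportion z-test, chosen here only because it needs nothing beyond the standard library. All function names and parameters are hypothetical.

```python
import math
import random

def simulate_alt_count(depth, dup_fraction, rng, alt_freq=0.5):
    """Alternate-allele read count at one SNP, with duplicates included."""
    n_unique = max(1, round(depth * (1 - dup_fraction)))
    # Each unique molecule carries the alternate allele with prob. alt_freq.
    alleles = [rng.random() < alt_freq for _ in range(n_unique)]
    # Each duplicate read copies the allele of a random unique molecule.
    n_dups = depth - n_unique
    dup_alleles = [alleles[rng.randrange(n_unique)] for _ in range(n_dups)]
    return sum(alleles) + sum(dup_alleles)

def two_proportion_p_value(alt1, n1, alt2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (alt1 + alt2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (alt1 / n1 - alt2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(depth, dup_fraction, n_trials=1000, alpha=0.05, seed=0):
    """Share of trials calling a difference between two identical groups."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        a = simulate_alt_count(depth, dup_fraction, rng)
        b = simulate_alt_count(depth, dup_fraction, rng)
        if two_proportion_p_value(a, depth, b, depth) < alpha:
            hits += 1
    return hits / n_trials

fpr_no_dups = false_positive_rate(depth=200, dup_fraction=0.0)
fpr_heavy_dups = false_positive_rate(depth=200, dup_fraction=0.8)
```

Because both groups are drawn from the same allele frequency, every significant test is a false positive. The test treats each read as independent evidence, so when many reads are copies of a few molecules the apparent depth overstates the information available and the false positive rate rises above the nominal 0.05.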