| Literature DB >> 27454357 |
Mark T W Ebbert1, Mark E Wadsworth1, Lyndsay A Staley1, Kaitlyn L Hoyt1, Brandon Pickett1, Justin Miller1, John Duce1, John S K Kauwe1, Perry G Ridge2.
Abstract
BACKGROUND: Analyzing next-generation sequencing data is difficult because datasets are large, second generation sequencing platforms have high error rates, and because each position in the target genome (exome, transcriptome, etc.) is sequenced multiple times. Given these challenges, numerous bioinformatic algorithms have been developed to analyze these data. These algorithms aim to find an appropriate balance between data loss, errors, analysis time, and memory footprint. Typical analysis pipelines require multiple steps. If one or more of these steps is unnecessary, it would significantly decrease compute time and data manipulation to remove the step. One step in many pipelines is PCR duplicate removal, where PCR duplicates arise from multiple PCR products from the same template molecule binding on the flowcell. These are often removed because there is concern they can lead to false positive variant calls. Picard (MarkDuplicates) and SAMTools (rmdup) are the two main softwares used for PCR duplicate removal.Entities:
Keywords: Next-Generation Sequencing; PCR duplicate removal; Picard; SAMTools
Mesh:
Year: 2016 PMID: 27454357 PMCID: PMC4965708 DOI: 10.1186/s12859-016-1097-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1Our pipeline was identical for every sample, except for how we handled PCR duplicates. Original BAM files with mapped reads were processed using the shown pipeline. PCR duplicates were ignored, removed using SAMTools (rmdup), or marked (left in the file, but ignored in subsequent steps) using Picard (MarkDuplicates). After the duplicates step, all subsequent steps in the analysis pipeline were identical. The final output of the pipeline is a multi-sample VCF file. VCF: Variant Call Format, BQSR: Base Quality Score Recalibration, VQSR: Variant Quality Score Recalibration
Fig. 2We constructed a Venn diagram using the variant datasets. The datasets correspond to the three pipelines: removing duplicates using SAMTools, removing duplicates using Picard, and ignoring duplicates. The blue circle is the Picard dataset, red is the no duplicates removed dataset, and green is the SAMTools dataset
Minimal differences between Picard, SAMTools, and no duplicate removal
| Subset | Total Variants | Ti/Tv Ratio | % Variants in dbSNP | Avg. Population Frequency | % Protein Changing Variants |
|---|---|---|---|---|---|
| All Picard | 16354497 | 2.14 | 72.05 | 0.21 | 0.40 |
| All SAMTools | 16250761 | 2.14 | 71.86 | 0.22 | 0.40 |
| All No Dups | 16494672 | 2.14 | 71.30 | 0.21 | 0.40 |
|
| <2.60e-16 | 1.00 | 0.99 | 0.99 | 1.00 |
| Common to all three | 15688522 | 2.15 | 80.18 | 0.22 | 0.41 |
| Unique to Picard | 307486 | 1.92 | 66.27 | 0.16 | 0.33 |
| Unique to SAMTools | 150474 | 1.80 | 69.59 | 0.19 | 0.26 |
| Unique to No Dups | 398248 | 1.95 | 54.07 | 0.16 | 0.34 |
| Unique to Picard/SAMTools | 181176 | 1.97 | 73.86 | 0.22 | 0.33 |
| Unique to Picard/No Dups | 177313 | 2.07 | 65.30 | 0.21 | 0.31 |
| Unique to SAMTools/No Dups | 230589 | 1.73 | 52.17 | 0.23 | 0.24 |
|
| <2.60e-16 | 1.00 | 0.32 | 0.84 | 1.00 |
Here we present metrics from each portion of the Venn diagram (Fig. 2), including total number of variants, transition/transversion (Ti/Tv) ratios, average population frequency, proportion of novel variants, and proportion of variants that change the protein product. In the top part of the table, variant characteristics are reported for all the variants resulting from duplicate removal using Picard or SAMTools, or no duplicate removal. Variants from the dataset processed using Picard are referred to as Picard, processed using SAMTools as SAMTools, and the dataset without duplicate removal as No Dups. Population frequencies are based on the 1000 Genomes Project, dbSNP variants refer to build 138 and any variant not present in dbSNP is considered novel, and protein changing variants are missense SNVs or frameshifting InDels. We performed a Chi-square goodness-of-fit to test for significant differences amongst values in each column. Two tests were performed for each column: (1) comparing the values for all variants in each main dataset (“All Picard”, “All SAMTools”, and “All No Dups”); and (2) comparing values for variants across all “Unique” groups. There was a significant difference when comparing the number of variants across groups, but none of the other measures were significantly different
Fig. 3The Venn diagram is as described in Fig. 2, except limited to variants in the ACMG genes. The blue circle is the Picard dataset, red is the no duplicates removed dataset, and green is the SAMTools dataset
Differences using only ACMG genes are also minimal
| Subset | Total Variants | Ti/Tv Ratio | % Variants in dbSNP | Avg. Population Frequency | % Protein Changing Variants |
|---|---|---|---|---|---|
| All Picard | 34285 | 2.29 | 67.75 | 0.20 | 1.08 |
| All SAMTools | 34412 | 2.29 | 67.51 | 0.20 | 1.08 |
| All No Dups | 34531 | 2.29 | 67.34 | 0.20 | 1.07 |
|
| 0.64 | 1.00 | 0.99 | 1.00 | 1.00 |
| Common to all three | 34000 | 2.31 | 76.37 | 0.20 | 1.09 |
| Unique to Picard | 261 | 1.01 | 19.64 | 0.88 | 0 |
| Unique to SAMTools | 2 | 1.00 | 0 | NA | 0 |
| Unique to No Dups | 115 | 1.54 | 22.33 | 0.04 | 0 |
| Unique to Picard/SAMTools | 9 | 1.50 | 40 | 0.03 | 0 |
| Unique to Picard/No Dups | 15 | 1.00 | 0 | NA | 0 |
| Unique to SAMTools/No Dups | 401 | 0.98 | 15.86 | 0.12 | 0.32 |
|
| <2.60e-16 | 0.998 | 1.04e-13 | 0.74 | 0.90 |
We performed the same analyses using only the ACMG genes and found similar results
Concordance between SNP chip and NGS data across all three duplicate removal methods are nearly identical
| Chip data | ||||
|---|---|---|---|---|
| homref | het | homalt | ||
| homref | 99.97 | 0.18 | 0.16 | |
| No dup | het | 0.01 | 99.81 | 0.13 |
| homalt | 0.02 | 0.01 | 99.71 | |
| homref | 99.97 | 0.19 | 0.14 | |
| Picard | het | 0.01 | 99.80 | 0.14 |
| homalt | 0.02 | 0.01 | 99.71 | |
| homref | 99.97 | 0.19 | 0.16 | |
| SAMTools | het | 0.01 | 99.80 | 0.13 |
| homalt | 0.02 | 0.01 | 99.71 | |
| homref | 99.91 | 0.06 | 0.18 | |
| No dup ACMG | het | 0.02 | 99.94 | 0.07 |
| homalt | 0.08 | 0.00 | 99.76 | |
| homref | 99.91 | 0.06 | 0.18 | |
| Picard ACMG | het | 0.02 | 99.94 | 0.08 |
| homalt | 0.08 | 0.00 | 99.75 | |
| homref | 99.91 | 0.06 | 0.18 | |
| SAMTools ACMG | het | 0.02 | 99.94 | 0.07 |
| homalt | 0.08 | 0.00 | 99.76 | |
We compared the genotypes from NGS and matched SNP chip data to see if concordance varied by duplicate removal approaches. We performed this comparison for all variants and for ACMG variants only. Reported values are the percentage of total SNP chip genotypes called for a particular group (e.g., homozygous reference) that were correctly called by NGS for a given group. Exactly 99.97 % of genotypes called homozygous reference by SNP chip were also called homozygous reference by NGS across no dup, picard, and SAMTools. Similarly, 99.91 % of ACMG genotypes called homozygous reference by SNP chip were called identically by NGS. In the Table, homref: homozygous for the reference allele, het: heterozygous, and homalt: homozygous for an alternate allele
Fig. 4Density plot of memory usage by SAMTools for duplicate removal. The y-axis is memory in megabytes. Note the difference in magnitude of memory used in this figure compared to Picard (Fig. 5)
Fig. 5Density plot of memory usage by Picard for duplicate removal. The y-axis is memory in megabytes. Note the difference in magnitude of memory used in this figure compared to SAMTools (Fig. 4)
Fig. 6Density plot of execution time for both SAMTools and Picard duplicate removal. Picard is marked by the red line, and SAMTools in blue. The y-axis is execution time measured in minutes