Howard W Huang, James C Mullikin, Nancy F Hansen.
Abstract
BACKGROUND: Despite the tremendous drop in the cost of nucleotide sequencing in recent years, many research projects still sequence pools containing multiple samples to detect sequence variants as a cost-saving measure. Various software tools exist to analyze these pooled sequence data, yet little has been reported on the relative accuracy and ease of use of these different programs.
Year: 2015 PMID: 26220471 PMCID: PMC4518579 DOI: 10.1186/s12859-015-0624-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Program SNV Detection Results for (a) ClinSeq samples and (b) 1000 Genomes samples
| a | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 Pooled Samples | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD |
| | 100 % Sample Coverage | | | | 50 % Sample Coverage | | | | 25 % Sample Coverage | | | | 12.5 % Sample Coverage | | | |
| CRISP | | 7.8 | | | | 7.3 | | | | 6.5 | | | | 6.2 | | |
| SNVer | 81.9 | | 89 | 72.4 | 74.9 | | 85.9 | 59.1 | 62.7 | | 80.1 | 37.3 | 48 | | 73.2 | 16.6 |
| LoFreq | | 8.3 | | | | | | | | | | | | | | |
| VarScan | 46.7 | | 73.3 | 4.7 | 47.7 | | 73.8 | 6.2 | 48.9 | | 74.4 | 8.1 | 45 | | 72.3 | 6.6 |
| GATK | | | | | | 7.4 | | | | 8 | | | | 8.7 | | |
| 8 Pooled Samples | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD |
| | 100 % Sample Coverage | | | | 50 % Sample Coverage | | | | 25 % Sample Coverage | | | | 12.5 % Sample Coverage | | | |
| CRISP | | 7.8 | | | | 7.4 | | | | 6.8 | | | | 6.7 | | |
| SNVer | 79.9 | | 88.1 | 65.7 | 69.4 | | 83.1 | 47.1 | 55.5 | | 76.3 | 25.3 | 42.5 | | 70 | 9.9 |
| LoFreq | | | | | | | | | | | | | | | | |
| VarScan | 28.8 | | 64.4 | 0 | 29.2 | | 64.5 | 0.1 | 29.8 | | 64.8 | 0.1 | 30.4 | | 65.1 | 0.3 |
| GATK | | 8.6 | | | | 8.5 | | | | 10.1 | | | | 11.4 | | |
| 16 Pooled Samples | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD |
| | 100 % Sample Coverage | | | | 50 % Sample Coverage | | | | 25 % Sample Coverage | | | | 12.5 % Sample Coverage | | | |
| CRISP | | 7.7 | | | | 7.6 | | | | 7 | | | | 7 | | |
| SNVer | 66.9 | | 81.7 | 42.9 | 53.7 | | 75.1 | 23.8 | 42.4 | | 69.6 | 10.8 | 33.2 | | 65.1 | 3.6 |
| LoFreq | | 6.4 | | | | 6 | | | | 5.5 | | | | 4.8 | | |
| VarScan | 18.1 | | 59 | 0 | 18.2 | | 59 | 0 | 18.4 | | 59.1 | 0 | 18.7 | | 59.3 | 0 |
| GATK | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 32 Pooled Samples | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD |
| | 100 % Sample Coverage | | | | 50 % Sample Coverage | | | | 25 % Sample Coverage | | | | 12.5 % Sample Coverage | | | |
| CRISP | | 7.9 | | | | 7.6 | | | | 7.1 | | | | 7.3 | | |
| SNVer | 41.8 | | 68.8 | 11 | 34.6 | | 65.2 | 5 | 29.3 | | 62.7 | 2.3 | 24.5 | | 60.4 | 0.6 |
| LoFreq | | 5.4 | | | | 5.4 | | | | 5.3 | | | | 5.2 | | |
| VarScan | 11.4 | | 55.7 | 0 | 11.5 | | 55.7 | 0 | 11.5 | | 55.7 | 0 | 11.6 | | 55.7 | 0 |
| GATK | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| b | ||||||||||||||||
| 4 Pooled Samples | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD |
| | 100 % Sample Coverage | | | | 50 % Sample Coverage | | | | 25 % Sample Coverage | | | | 12.5 % Sample Coverage | | | |
| CRISP | | 4.4 | | | | 4 | | | | 3.7 | | | | 3.3 | | |
| SNVer | 86.5 | 1.3 | 92.6 | 74.8 | 70.9 | 0.9 | 85 | 47.3 | 50.3 | 0.4 | 74.9 | 18.5 | 33.2 | 0.6 | 66.3 | 4.4 |
| LoFreq | | | | | | | | | | | | | | | | |
| VarScan | 44.6 | | 72.3 | 0.9 | 45.1 | | 72.5 | 3.2 | 42.3 | | 71.2 | 3.8 | 33.4 | | 66.7 | 1.4 |
| GATK | | | | | | | | | | | | | | | | |
| 8 Pooled Samples | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD |
| | 100 % Sample Coverage | | | | 50 % Sample Coverage | | | | 25 % Sample Coverage | | | | 12.5 % Sample Coverage | | | |
| CRISP | | 4.3 | | | | 4.1 | | | | 3.7 | | | | 3.2 | | |
| SNVer | 80.5 | 2.1 | 89.2 | 62 | 61.8 | 1.8 | 80 | 30.1 | 44.5 | 0.8 | 71.8 | 9.3 | 30.6 | 0.8 | 64.9 | 1.6 |
| LoFreq | | | | | | | | | | | | | | | | |
| VarScan | 25.5 | | 62.7 | 0 | 26 | | 63 | 0 | 26.9 | | 63.4 | 0 | 25.7 | | 62.8 | 0.1 |
| GATK | | | | | | | | | | | | | | | | |
| 16 Pooled Samples | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD | %Sen | %FP | %BA | %SD |
| | 100 % Sample Coverage | | | | 50 % Sample Coverage | | | | 25 % Sample Coverage | | | | 12.5 % Sample Coverage | | | |
| CRISP | | 4.2 | | | | 4 | | | | 3.5 | | | | 3.3 | | |
| SNVer | 61.3 | 4.4 | 78.4 | 27.8 | 47.6 | 3.3 | 72.1 | 9.8 | 36.1 | 1.6 | 67.3 | 1.7 | 27.3 | 0.9 | 63.2 | 0.3 |
| LoFreq | | | | | | | | | | | | | | | | |
| VarScan | 14.9 | | 57.4 | 0 | 15.1 | | 57.6 | 0 | 15.3 | | 57.6 | 0 | 15.6 | | 57.8 | 0 |
| GATK | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
GATK was unable to process pools of 16 or 32 samples (see runtime results). Pools were run in groups of 8 for the ClinSeq samples and groups of 4 for the 1000 Genomes samples, except for LoFreq, which was run on individual pools; its results were then grouped in sets of 8 (ClinSeq) or 4 (1000 Genomes) to calculate sensitivity (%Sen), false positive rate (%FP), balanced accuracy (%BA), and singleton detection rate (%SD). Numbers reported in bold face represent the better performance values for each column.
Fig. 1 Effects of Pool Size on Program Balanced Accuracies. “Balanced accuracy” is defined as the mean of the sensitivity and 1 minus the false positive rate. No data point is reported for GATK with 16 or 32 samples because runs did not complete within a reasonable timeframe. Values are plotted for (a) ClinSeq and (b) Thousand Genomes pools with read depth equal to 50 % of a typical whole exome, which was 35.1x, on average, for ClinSeq samples and 21.0x, on average, for Thousand Genomes samples.
Fig. 2 Effects of Pool Coverage on Program Balanced Accuracies. “Balanced accuracy” is defined as the mean of the sensitivity and 1 minus the false positive rate. Values are plotted for various fractions of “full coverage” for (a) ClinSeq pools containing eight individuals and (b) Thousand Genomes pools containing four individuals.
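The balanced accuracy definition given in these captions can be checked against the tabulated percentages. The following is a minimal sketch, assuming sensitivity and false positive rate are both expressed in percent as in the tables; the function name `balanced_accuracy` is a hypothetical helper, not part of any of the evaluated programs.

```python
def balanced_accuracy(sensitivity_pct: float, false_positive_pct: float) -> float:
    """Mean of sensitivity and (100 - false positive rate), all in percent.

    Hypothetical helper: mirrors the caption's definition, "the mean of the
    sensitivity and 1 minus the false positive rate", on a 0-100 scale.
    """
    return (sensitivity_pct + (100.0 - false_positive_pct)) / 2.0


if __name__ == "__main__":
    # (label, %Sen, %FP, tabulated %BA), copied from rows that report all three values.
    rows = [
        ("SNVer, 1000 Genomes, 4 samples, 100 % coverage", 86.5, 1.3, 92.6),
        ("CRISP-8 pools, ClinSeq, 50 % coverage", 97.2, 7.4, 94.9),
        ("SNVer-8 pools, ClinSeq, 50 % coverage", 69.4, 3.2, 83.1),
    ]
    for label, sen, fp, reported in rows:
        computed = balanced_accuracy(sen, fp)
        print(f"{label}: computed {computed:.1f}, reported {reported}")
```

Because the published inputs are already rounded to one decimal place, a recomputed value may occasionally differ from the tabulated one in the last digit.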
Fig. 3 ROC Analysis on VCFs generated from ClinSeq eight-sample, 50 % coverage pools with a total of 35.1x depth of coverage, on average, with eight pools per program run. For CRISP and GATK, quality score filtering was gradually increased on a logarithmic scale (0–100, 100–1000, 1000–10,000, etc.) to obtain a full range of sensitivity and false positive scores. LoFreq’s filtering was incremented logarithmically up to 1000, then by 100s, since its quality score range was smaller than those of the other programs. Many of SNVer’s p-values were extremely small (with reported p-values as low as 0), so maximum p-value filtering was set at values from 10^−10 down to 10^−300. Full details of score thresholds used are contained in the worksheet titled “Supp Table S2 Main Paper Figure S3” in Additional file 1: Figure S3.
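The threshold sweep described in this caption can be sketched as follows. This is an illustrative outline only: the input format (a list of `(quality, is_true_variant)` pairs), the function name `roc_points`, the toy data, and the use of a known count of true-negative sites as the false positive denominator are all assumptions made for the example, not details taken from the paper.

```python
from typing import List, Tuple


def roc_points(
    calls: List[Tuple[float, bool]],
    n_true_variants: int,
    n_true_negatives: int,
    thresholds: List[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, sensitivity %, false positive %) for each threshold.

    A call is kept when its quality score is at least the threshold.
    Assumption: the false positive rate is FP / (number of true-negative sites).
    """
    points = []
    for threshold in thresholds:
        kept = [is_true for quality, is_true in calls if quality >= threshold]
        true_positives = sum(kept)
        false_positives = len(kept) - true_positives
        sensitivity = 100.0 * true_positives / n_true_variants
        false_positive_rate = 100.0 * false_positives / n_true_negatives
        points.append((threshold, sensitivity, false_positive_rate))
    return points


if __name__ == "__main__":
    # Logarithmically spaced quality cut-offs: 0, 100, 1000, 10000.
    thresholds = [0.0] + [10.0 ** k for k in range(2, 5)]
    # Toy calls (hypothetical): quality score and whether the site is a real variant.
    toy_calls = [(50, True), (150, True), (150, False), (2000, True), (20, False)]
    for point in roc_points(toy_calls, n_true_variants=4, n_true_negatives=10,
                            thresholds=thresholds):
        print(point)
```

For a program that reports p-values instead of quality scores, as SNVer does, the same loop would apply with the comparison reversed so that calls are kept when the p-value is at most the threshold.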
Effects of submitting multiple and individual pooled BAM files to each program
| a | ||||
|---|---|---|---|---|
| Group Size | Sen% | FP% | BA% | SD% |
| CRISP-2 pools | 97.8 | 10.5 | 93.7 | 95.3 |
| CRISP-4 pools | 96.1 | 7.5 | 94.3 | 91.6 |
| CRISP-8 pools | 97.2 | 7.4 | 94.9 | 94.1 |
| SNVer-1 pool | 72.4 | 3.3 | 84.6 | 52.9 |
| SNVer-2 pools | 71.4 | 3.3 | 84.1 | 51 |
| SNVer-4 pools | 70.4 | 3.3 | 83.6 | 49 |
| SNVer-8 pools | 69.4 | 3.2 | 83.1 | 47.1 |
| VarScan-1 pool | 29.3 | 0.1 | 64.6 | 0.1 |
| VarScan-2 pools | 29.3 | 0.1 | 64.6 | 0.1 |
| VarScan-4 pools | 29.3 | 0.1 | 64.6 | 0.1 |
| VarScan-8 pools | 29.2 | 0.1 | 64.5 | 0.1 |
| GATK-1 pool | 98.2 | 9.1 | 94.6 | 95.9 |
| GATK-2 pools | 98.2 | 9 | 94.6 | 95.8 |
| GATK-4 pools | 98.1 | 8.6 | 94.7 | 95.5 |
| GATK-8 pools | 98 | 8.5 | 94.7 | 95.1 |
| b | ||||
| Group Size | Sen% | FP% | BA% | SD% |
| CRISP-2 pools | 97.1 | 4.1 | 96.5 | 93.2 |
| CRISP-4 pools | 92.2 | 4 | 94.1 | 83.4 |
| SNVer-1 pool | 74.4 | 1 | 86.7 | 53.6 |
| SNVer-2 pools | 73.7 | 1 | 86.3 | 52.5 |
| SNVer-4 pools | 72.8 | 1 | 85.9 | 50.8 |
| VarScan-1 pool | 41 | 0 | 70.5 | 3.1 |
| VarScan-2 pools | 41 | 0 | 70.5 | 3.1 |
| VarScan-4 pools | 40.9 | 0 | 70.5 | 3.1 |
| GATK-1 pool | 98 | 0.2 | 98.9 | 95.2 |
| GATK-2 pools | 97.9 | 0.2 | 98.9 | 95.1 |
| GATK-4 pools | 97.9 | 0.2 | 98.9 | 95 |
In (a), all values were calculated using eight ClinSeq samples per pool with 35.1x average total coverage (50 % of typical full coverage for each sample). In (b), all values were calculated using four Thousand Genomes samples per pool with 21.0x average total coverage (50 % of typical full coverage for each sample).
Program memory allocation and runtimes for pooled BAM files of 4, 8, and 16 ClinSeq samples, 35.1× average coverage each
| Program | CPU Hours per BAM file | Memory Used/Provided |
|---|---|---|
| CRISP | <2 h | <150 MB Used |
| SNVer | 1–5 h | 4–8 GB Provided |
| LoFreq | 1–5 h | ~150 MB Used |
| VarScan | 2–5 h | 6–8 GB Provided |
| GATK | 8 h to >7 days | 4–20 GB Provided |
Java programs required users to specify memory limits. Programs written in C were memory efficient and ran relatively quickly.