| Literature DB >> 26554718 |
Koen Van der Borght1,2, Kim Thys3, Yves Wetzels4, Lieven Clement5, Bie Verbist6, Joke Reumers7, Herman van Vlijmen8, Jeroen Aerssens9.
Abstract
BACKGROUND: Next generation sequencing enables studying heterogeneous populations of viral infections. When the sequencing is done at high coverage depth ("deep sequencing"), low frequency variants can be detected. Here we present QQ-SNV (http://sourceforge.net/projects/qqsnv), a logistic regression classifier model developed for the Illumina sequencing platforms that uses the quantiles of the quality scores, to distinguish true single nucleotide variants from sequencing errors based on the estimated SNV probability. To train the model, we created a dataset of an in silico mixture of five HIV-1 plasmids. Testing of our method in comparison to the existing methods LoFreq, ShoRAH, and V-Phaser 2 was performed on two HIV and four HCV plasmid mixture datasets and one influenza H1N1 clinical dataset.Entities:
Mesh:
Year: 2015 PMID: 26554718 PMCID: PMC4641353 DOI: 10.1186/s12859-015-0812-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Quality score recalibration model trained on FASTQ/1 GAIIx data
| Parameter | Coefficient | Significance ( |
|---|---|---|
| Intercept | 0.84 ( | 2.27e-5 |
|
| −0.16 ( | <2e-16 |
|
| −0.89 ( | <2e-16 |
QQ-SNV model trained on 96 in silico HIV plasmid mixture data samples
| Parameter | Coefficient | Significance (P-value) |
|---|---|---|
| Intercept | −117.66 ( | <2e-16 |
|
| 4.38 ( | <2e-16 |
|
| 7.37 ( | <2e-16 |
|
| 116.78 ( | <2e-16 |
|
| −3.06 ( | <2e-16 |
|
| −0.70 ( | <2e-16 |
Fig. 1QQ-SNV variant calling. Schematic view of QQ-SNV variant calling methodology starting from Illumina NGS data. a QQ-SNV workflow b QQ-SNVD c QQ-SNVHS d QQ-SNVHS-P80, 80th percentile (P80) of the QQ-SNVHS distribution of “error” frequencies was used as frequency cutoff to decrease the number of false positives obtained with QQ-SNVHS
Performance on HIV plasmid mixture dataset 1
| QQ-SNVD | QQ-SNVHS-P80 a | LoFreq | ShoRAH | V-Phaser 2 | |
| Variant % | sens.b | sens.b | sens.b | sens.b | sens.b |
| 88.889 % | 57/61 | 60/61 | 59/61 | 56/61 | 20/61 |
| 10 % | 44/50 | 49/50 | 47/50 | 45/50 | 13/50 |
| 1 % | 28/34 | 30/34 | 23/34 | 26/34 | 10/34 |
| 0.1 % | 2/42 | 20/42 | 0/42 | 7/42 | 22/42 |
| 0.01 % | 0/40 | 1/40 | 0/40 | 3/40 | 8/40 |
| 0.001 % | 0/30 | 0/30 | 0/30 | 0/30 | 0/30 |
| FPc | FPc | FPc | FPc | FPc | |
| 2 | 4 | 0 | 6 | 2 | |
| timede | timede | timedf | timedg | timedg | |
| 14m11s | 14m11s | 53m24s | 6h25m | 33m5s |
a80th percentile of the QQ-SNVHS distribution of “error” frequencies was used as frequency cutoff to decrease the number of false positives obtained with QQ-SNVHS
bsens. is the percentage of true variants that were correctly called as SNV
cFP is the number of variants incorrectly called as SNV
dcomputation time in hours (h), minutes (m) and seconds (s)
eWindows 7 64 bit, 8GB RAM, 3.2GHz
fLinux Ubuntu 12.04.4, 57.6 GB RAM, one core (2.3GHz) used
gLinux Ubuntu 12.04.4, 57.6 GB RAM, 8 cores (2.3GHz) used in parallel
Performance on HIV plasmid mixture dataset 2
| QQ-SNVD | QQ-SNVHS-P80 a | LoFreq | ShoRAH | V-Phaser 2 | |
| Variant % | sens.b | sens.b | sens.b | sens.b | sens.b |
| 40.0 % | 2/2 | 2/2 | 2/2 | 1/2 | nah |
| 39.9 % | 1/1 | 1/1 | 1/1 | 1/1 | nah |
| 39.5 % | 1/1 | 1/1 | 1/1 | 0/1 | nah |
| 38.9 % | 1/1 | 1/1 | 1/1 | 0/1 | nah |
| 38.4 % | 2/2 | 2/2 | 2/2 | 2/2 | nah |
| 34.4 % | 2/2 | 2/2 | 2/2 | 1/2 | nah |
| 34.0 % | 1/1 | 1/1 | 1/1 | 1/1 | nah |
| 33.5 % | 1/1 | 1/1 | 1/1 | 1/1 | nah |
| 33.4 % | 8/8 | 8/8 | 8/8 | 7/8 | nah |
| 6.6 % | 1/1 | 1/1 | 1/1 | 1/1 | nah |
| 6.1 % | 1/1 | 1/1 | 1/1 | 0/1 | nah |
| 5.1 % | 1/1 | 1/1 | 1/1 | 1/1 | nah |
| 5.0 % | 8/8 | 8/8 | 8/8 | 8/8 | nah |
| 1.0 % | 10/10 | 10/10 | 9/10 | 7/10 | nah |
| 0.6 % | 0/1 | 1/1 | 1/1 | 0/1 | nah |
| 0.5 % | 5/8 | 8/8 | 8/8 | 2/8 | nah |
| 0.1 % | 1/10 | 6/10 | 2/10 | 2/10 | nah |
| FPc | FPc | FPc | FPc | FPc | |
| 0 | 1 | 0 | 3 | nah | |
| timede | timede | timedf | timedg | timedg | |
| 2h29m | 2h29m | 2d4h11m | 57m45s | nah |
a80th percentile of the QQ-SNVHS distribution of “error” frequencies was used as frequency cutoff to decrease the number of false positives obtained with QQ-SNVHS
bsens. is the percentage of true variants that were correctly called as SNV
cFP is the number of variants incorrectly called as SNV
dcomputation time in days (d), hours (h), minutes (m) and seconds (s)
eWindows 7 64 bit, 8GB RAM, 3.2GHz
fLinux Ubuntu 12.04.4, 57.6 GB RAM, one core (2.3GHz) used
gLinux Ubuntu 12.04.4, 57.6 GB RAM, 8 cores (2.3GHz) used in parallel
hNo results could be obtained for the V-Phaser 2 algorithm due to failure of the V-Phaser 2 software tool on our server
Performance on HCV plasmid mixture datasets
| QQ-SNVD | QQ-SNVHS-P80 a | LoFreq | ShoRAH | V-Phaser 2 | |||||||
| Variant % | Pair | sens.b | FPc | sens.b | FPc | sens.b | FPc | sens.b | FPc | sens.b | FPc |
| 0.50 % | 1d | 5/5 | 10 | 5/5 | 38 | 5/5 | 36 | 5/5 | 2 | 5/5 | 11 |
| 0.50 % | 2e | 0/5 | 0 | 0/5 | 0 | 0/5 | 0 | 0/5 | 0 | 3/5 | 15 |
| 0.50 % | 1 + 2f | 1/5 | 0 | 5/5 | 0 | 2/5 | 5 | 2/5 | 7 | 3/5 | 32 |
| 1 % | 1d | 5/5 | 11 | 5/5 | 53 | 5/5 | 62 | 5/5 | 2 | 5/5 | 13 |
| 1 % | 2e | 1/5 | 0 | 5/5 | 0 | 5/5 | 1 | 2/5 | 1 | 2/5 | 28 |
| 1 % | 1 + 2f | 3/5 | 1 | 5/5 | 0 | 5/5 | 7 | 4/5 | 1 | 3/5 | 45 |
| 2 % | 1d | 5/5 | 13 | 5/5 | 64 | 5/5 | 72 | 5/5 | 4 | 2/5 | 20 |
| 2 % | 2e | 3/5 | 0 | 5/5 | 0 | 5/5 | 1 | 3/5 | 0 | 3/5 | 14 |
| 2 % | 1 + 2f | 5/5 | 0 | 5/5 | 0 | 5/5 | 7 | 5/5 | 2 | 3/5 | 28 |
| 10 % | 1d | 5/5 | 6 | 5/5 | 42 | 5/5 | 75 | 5/5 | 0 | 3/5 | 10 |
| 10 % | 2e | 5/5 | 0 | 5/5 | 0 | 5/5 | 0 | 5/5 | 0 | 2/5 | 28 |
| 10 % | 1 + 2f | 5/5 | 0 | 5/5 | 0 | 5/5 | 8 | 5/5 | 1 | 2/5 | 56 |
| timegh | timegh | timegi | timegj | timegj | |||||||
| 0.50 % | 1d | 2m27s | 2m27s | 1m30s | 1h45m | 8m7s | |||||
| 0.50 % | 2e | 2m18s | 2m18s | 1m33s | 9h29m | 22m3s | |||||
| 0.50 % | 1 + 2f | 4m42s | 4m42s | 3m36s | 15h4m | 2h31m | |||||
| 1 % | 1d | 2m56s | 2m56s | 2m27s | 1h38m | 10m51s | |||||
| 1 % | 2e | 3m1s | 3m1s | 2m6s | 8h46m | 46m44s | |||||
| 1 % | 1 + 2f | 5m48s | 5m48s | 5m56s | 13h13m | 3h46m | |||||
| 2 % | 1d | 2m35s | 2m35s | 2m6s | 1h34m | 10m25s | |||||
| 2 % | 2e | 2m32s | 2m32s | 1m53s | 8h50m | 41m1s | |||||
| 2 % | 1 + 2f | 4m59s | 4m59s | 4m34s | 14h56m | 1h47m | |||||
| 10 % | 1d | 3m5s | 3m5s | 4m26s | 1h46m | 13m45s | |||||
| 10 % | 2e | 2m30s | 2m30s | 2m35s | 14h13m | 33m24s | |||||
| 10 % | 1 + 2f | 5m29s | 5m29s | 9m53s | 7h57m | 1h47m | |||||
a80th percentile of the QQ-SNVHS distribution of “error” frequencies was used as frequency cutoff to decrease the number of false positives obtained with QQ-SNVHS
bsens. is the percentage of true variants that were correctly called as SNV
cFP is the number of variants incorrectly called as SNV
dReads 1 of paired-end reads
eReads 2 of paired-end reads (sequenced after paired-end turn)
fAll paired-end reads
gcomputation time in hours (h), minutes (m) and seconds (s)
hWindows 7 64 bit, 8GB RAM, 3.2GHz
iLinux Ubuntu 12.04.4, 57.6 GB RAM, one core (2.3GHz) used
jLinux Ubuntu 12.04.4, 57.6 GB RAM, 8 cores (2.3GHz) used in parallel
Fig. 2QQ-SNV vs. other methods: computational efficiency on HCV plasmid mixture datasets. On the x-axis is the coverage, which is the average number of reads per position. On the y-axis is the computation time in minutes. A reference line is shown at 5 and 60 min. The size of the data points (reads 1, reads 2 and reads 1 + 2) shown for the different HCV datasets corresponds to the variant percentage (0.5 %, 1 %, 2 % and 10 %)
Performance on H1N1 clinical sample
| QQ-SNVD | QQ-SNVHS-P80 a | LoFreq | ShoRAH | V-Phaser 2 | |||||||
| Illumina sequencer | Pair | sens.b | FPc | sens.b | FPc | sens.b | FPc | sens.b | FPc | sens.b | FPc |
| GAIIx | 1d | 0/4 | 0 | 4/4 | 2 | 2/4 | 0 | 0/4 | 0 | 1/4 | 0 |
| GAIIx | 2e | 0/4 | 0 | 4/4 | 1 | 3/4 | 0 | 0/4 | 0 | 0/4 | 0 |
| GAIIx | 1 + 2f | 0/4 | 0 | 4/4 | 4 | 3/4 | 0 | 0/4 | 0 | nak | nak |
| MiSeq | 1d | 2/4 | 1 | 4/4 | 5 | 0/4 | 0 | 0/4 | 0 | 0/4 | 0 |
| MiSeq | 2e | 2/4 | 0 | 4/4 | 2 | 0/4 | 0 | nak | nak | 0/4 | 0 |
| MiSeq | 1 + 2f | 4/4 | 3 | 4/4 | 6 | 0/4 | 0 | nak | nak | nak | nak |
| timegh | timegh | timegi | timegj | timegj | |||||||
| GAIIx | 1d | 12m51s | 12m51s | 7m46s | 6h1m | 43m15s | |||||
| GAIIx | 2e | 13m5s | 13m5s | 7m55s | 6h41m | 43m6s | |||||
| GAIIx | 1 + 2f | 21m56s | 21m56s | 19m1s | 2h18m | nak | |||||
| MiSeq | 1d | 1m11s | 1m11s | 26 s | 5h46m | 3m40s | |||||
| MiSeq | 2e | 1m23s | 1m23s | 24 s | nak | 3m55s | |||||
| MiSeq | 1 + 2f | 2m45s | 2m45s | 1 m | nak | nak | |||||
a80th percentile of the QQ-SNVHS distribution of “error” frequencies was used as frequency cutoff to decrease the number of false positives obtained with QQ-SNVHS
bsens. is the percentage of SNVs as identified in [12] that were called as SNV by QQ-SNV in region 405–425 of NA gene in H1N1 BN3 sample
cFP is the number of “non-variants” in [12] that were called as SNV by QQ-SNV
dReads 1 of paired-end reads
eReads 2 of paired-end reads (sequenced after paired-end turn)
fAll paired-end reads
gcomputation time in hours (h), minutes (m) and seconds (s)
hWindows 7 64 bit, 8GB RAM, 3.2GHz
iLinux Ubuntu 12.04.4, 57.6 GB RAM, one core (2.3GHz) used
jLinux Ubuntu 12.04.4, 57.6 GB RAM, 8 cores (2.3GHz) used in parallel
kNo result could be obtained for the ShoRAH/V-Phaser 2 algorithm due to failure of the ShoRAH/V-Phaser 2 software tool on our server
SNVs called by QQ-SNVHS-P80 on H1N1 clinical sample (all paired-end reads)
| GAIIx | MiSeq | SNV identified in [ | |||
|---|---|---|---|---|---|
| Positiona | Percentageb | P(SNV)c | Percentageb | P(SNV)c | |
| 406 | nad | nad | 0.09 % | 0.99923 | No |
| 409 | 0.13 % | 0.00110 | 0.10 % | 0.00517 | No |
| 410 | 0.35 % | 0.05514 | 0.28 % | 0.60047 | Yes |
| 411 | 0.13 % | 0.00016 | 0.09 % | 0.00318 | No |
| 413 | 0.26 % | 0.00482 | 0.14 % | 0.98815 | Yes |
| 415 | 0.25 % | 0.00207 | 0.16 % | 0.98665 | Yes |
| 418 | 0.18 % | 0.00011 | 0.09 % | 0.00769 | No |
| 421 | 0.26 % | 0.04058 | 0.16 % | 0.94675 | Yes |
| 423 | 0.15 % | 0.00012 | 0.08 % | 0.02083 | No |
| 425 | nad | nad | 0.07 % | 0.88219 | No |
aNucleotide position in neuraminidase (NA) gene
bObserved frequency of single nucleotide variant called by QQ-SNV
cProbability that variant is a SNV, as predicted by QQ-SNV
dNot classified as SNV by QQ-SNVHS-P80, P(SNV) ≤ 0.0001
Fig. 3Quality Quantile plot. SNV 410 T > C is called by QQ-SNVHS-P80 both on GAIIx and MiSeq for H1N1 clinical sample BN3 [12] at position 410 of the neuraminidase (NA) gene