Bie Verbist, Lieven Clement, Joke Reumers, Kim Thys, Alexander Vapirev, Willem Talloen, Yves Wetzels, Joris Meys, Jeroen Aerssens, Luc Bijnens, Olivier Thas.
Abstract
BACKGROUND: Deep sequencing allows for an in-depth characterization of sequence variation in complex populations. However, technology-associated errors may impede a powerful assessment of low-frequency mutations. Fortunately, base calls are complemented with quality scores, which for Illumina sequencing are derived from a quadruplet of intensities, one channel for each nucleotide type. The highest intensity of the four channels determines the base that is called. Mismatched bases can often be corrected by the second best base, i.e. the base with the second highest intensity in the quadruplet. A virus variant model-based clustering method, ViVaMBC, is presented that explores quality scores and second best base calls for identifying and quantifying viral variants. ViVaMBC is optimized to call variants at the codon level (nucleotide triplets), which enables immediate biological interpretation of the variants with respect to their antiviral drug responses.
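The relation between the intensity quadruplet, the called base and the second best base can be illustrated with a minimal sketch. The function name `call_bases` and the intensity values are hypothetical and chosen for illustration only; this is not the ViVaMBC implementation, which additionally uses quality scores in a clustering model.

```python
from typing import Dict, Tuple


def call_bases(intensities: Dict[str, float]) -> Tuple[str, str]:
    """Return (called base, second best base) for one intensity quadruplet.

    `intensities` maps each nucleotide (A, C, G, T) to its channel intensity.
    The called base has the highest intensity; the second best base has the
    second highest intensity.
    """
    if set(intensities) != set("ACGT"):
        raise ValueError("expected exactly one intensity per nucleotide A, C, G, T")
    # Rank the four nucleotides by decreasing channel intensity.
    ranked = sorted(intensities, key=intensities.get, reverse=True)
    return ranked[0], ranked[1]


# Hypothetical quadruplet: A is called, G is the second best base.
quadruplet = {"A": 812.0, "C": 95.5, "G": 640.2, "T": 12.3}
best, second = call_bases(quadruplet)
print(best, second)
```

If the called base is a sequencing error, the second best base (here G) is the natural candidate for the true nucleotide.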
Year: 2015 PMID: 25887734 PMCID: PMC4369097 DOI: 10.1186/s12859-015-0458-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Sensitivity of ViVaMBC in plasmid experiment
| Mixing proportion | Estimated frequency (%) | Estimated frequency (%) |
|---|---|---|
| 1:200 | 0.45 | 0.42 |
| 1:100 | 0.92 | 0.91 |
| 1:50 | 2.28 | 2.20 |
| 1:10 | 11.04 | 10.01 |
Two HCV plasmids that differ at two codon positions, 36 and 155, were combined in a sample for Illumina deep sequencing at four different mixing proportions. Their frequencies were estimated with ViVaMBC, which was able to retrieve codon variants with frequencies as low as 0.5%.
Specificity of ViVaMBC in plasmid experiment
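| Mixing proportion | Number of codons (pileup) | Max. frequency of false-positive codons, % (pileup) | Number of codons (ViVaMBC) | Max. frequency of false-positive codons, % (ViVaMBC) |
|---|---|---|---|---|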
| 1:200 | 15,692 | 1.46 | 599 | 0.67 |
| 1:100 | 14,886 | 1.41 | 599 | 0.68 |
| 1:50 | 12,724 | 1.47 | 841 | 0.72 |
| 1:10 | 22,405 | 1.53 | 492 | 0.65 |
The number of codons in NS3 is reported after pileup and after ViVaMBC. Theoretically, 183 (181 + 2) codons are expected, but far more are reported, especially when piling up the raw data. The maximum frequency of the false-positive codons is presented as well. ViVaMBC reduces these frequencies to below 1%, whereas they exceeded 1% after pileup. This illustrates that ViVaMBC drastically reduces the number of false-positive findings and lowers the detection limit above which 100% specificity is expected.
Figure 1. Influence of coverage depth on the estimation of . Datasets with lower coverage were generated by randomly sampling a fraction (f = 0.1, 0.2, …, 0.8, 0.9) of the reads from the original dataset. Ten datasets were generated for each fraction f, resulting in 90 datasets with average coverages ranging between 6,463 and 58,185. The reported variants for all re-sampled datasets were plotted and colored according to the discovered codon. The green dots indicate the true variant and all others are false-positive findings. The average frequency of the true variant (averaged over the ten random samples) is indicated with triangles. The dotted line is the true frequency as estimated from the original dataset. Lowering the coverage increases the bias and the variance of the estimate, as well as the number of false-positive findings.
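The down-sampling design described in this caption (nine fractions, ten replicates each, 90 datasets in total) can be sketched as follows. The function names, the seed and the use of sampling without replacement are assumptions made for illustration; the source does not state how the random sampling was implemented.

```python
import random
from typing import List, Sequence, Tuple


def subsample_reads(reads: Sequence[str], fraction: float, rng: random.Random) -> List[str]:
    """Draw a random subset holding `fraction` of the reads (without replacement, by assumption)."""
    k = round(fraction * len(reads))
    return rng.sample(list(reads), k)


def make_resampled_datasets(
    reads: Sequence[str], n_replicates: int = 10, seed: int = 0
) -> List[Tuple[float, List[str]]]:
    """Generate `n_replicates` datasets for each fraction f = 0.1, 0.2, ..., 0.9."""
    rng = random.Random(seed)  # fixed seed (hypothetical) for reproducibility
    fractions = [i / 10 for i in range(1, 10)]
    return [
        (f, subsample_reads(reads, f, rng))
        for f in fractions
        for _ in range(n_replicates)
    ]


# Toy stand-in for a read set; real input would be the aligned reads.
toy_reads = [f"read_{i}" for i in range(1000)]
datasets = make_resampled_datasets(toy_reads)
print(len(datasets))  # 9 fractions x 10 replicates
```

Each resampled dataset would then be passed through the variant caller, and the estimates for the true variant averaged over the ten replicates of each fraction.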
Sensitivity and specificity of competing methods in plasmid experiment
| SNP (WT) | LoFreq 1:200 | LoFreq 1:100 | LoFreq 1:50 | V-Phaser 2 1:200 | V-Phaser 2 1:100 | V-Phaser 2 1:50 | ShoRAH 1:200 | ShoRAH 1:100 | ShoRAH 1:50 |
|---|---|---|---|---|---|---|---|---|---|
| A (G) | / | 1.03 | 2.41 | 0.59 | 1.06 | 2.37 | / | / | 2.22 |
| G (C) | 0.54 | 1.01 | 2.38 | / | 0.94 | 2.33 | / | / | 2.22 |
| A (C) | 0.66 | 1.03 | 2.16 | / | / | / | 0.44* | 0.80* | 1.78* |
| A (G) | 0.48 | 0.91 | 2.10 | 0.52 | 1.04 | 2.11 | 0.44* | 0.80* | 1.78* |
| A (G) | / | 0.89 | 2.05 | 0.48 | / | 2.07 | / | / | 1.28 |
| Number of false SNPs | 3 | 5 | 2 | 19 | 32 | 24 | 4 | 1 | 1 |
| Max Freq false SNPs | 1.04 | 1.01 | 1.02 | 0.97 | 1.40 | 0.72 | 0.92* | 0.5* | 0.89 |
Frequency estimates of the true SNPs after applying the algorithms LoFreq, V-Phaser 2 and ShoRAH to the plasmid mixtures at 1:200, 1:100 and 1:50. Two SNPs should be present in codon 36, while three SNPs are present in codon 155. In the case of ShoRAH, the frequency is estimated from three overlapping windows, but often the variant is detected in only two out of three windows (denoted with *). None of the methods seems able to retrieve all 5 SNPs at 0.5%. The bottom rows of the table report the total number of false SNPs over the whole NS3 region (543 bp long) together with their maximum frequency. The total number of false-positive findings is very low for all methods, but their frequencies rise close to 1%, which hampers the distinction of true SNPs from these false-positive findings.
Figure 2. Specificity comparison of ViVaMBC with LoFreq, V-Phaser 2 and ShoRAH. The frequencies of all minor variants discovered in the three mixtures 1:200, 1:100 and 1:50 are plotted for ViVaMBC, LoFreq, V-Phaser 2 and ShoRAH. Note that these variants are at the codon level for ViVaMBC and at the SNP level for the other methods. The false-positive variants are indicated with black dots and the true positives with gray crosses. Although far more false-positive findings are discovered with ViVaMBC, they are more clearly separated from the true positives.
Figure 3. Sensitivity and specificity comparison of ViVaMBC with pileup on a clinical HCV sample. a) Comparison of the codon frequencies after piling up the data (x-axis) with the frequencies estimated by ViVaMBC (y-axis). Codons represented with triangles were absent after 454 sequencing of the same sample and are hence assumed to be false-positive findings. Codons colored in grey are present in either one of the two methods. Frequencies of 0.5% and 0.25% are indicated with dotted and dashed lines, respectively. Above 0.5%, and even above 0.25%, a good correlation is observed, and a few false-positive findings are filtered out using ViVaMBC. b) False discovery rates (FDR) for both ViVaMBC and pileup are calculated for varying reporting limits. The FDR is higher and increases more rapidly for the pileup.
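How a false discovery rate changes with the reporting limit, as in panel b, can be sketched with the usual definition FDR = false positives / all reported variants. The source does not give the exact formula used, so this definition, the function name `fdr_at_limit` and the toy data are assumptions; here a variant counts as true when it is confirmed by an independent method such as 454 sequencing.

```python
from typing import List, Optional, Tuple

# Each variant: (estimated frequency in %, confirmed by the independent method?)
Variant = Tuple[float, bool]


def fdr_at_limit(variants: List[Variant], limit: float) -> Optional[float]:
    """FDR among variants reported at or above `limit` (in %).

    Returns None when nothing is reported at this limit.
    """
    reported = [confirmed for freq, confirmed in variants if freq >= limit]
    if not reported:
        return None
    false_positives = sum(1 for confirmed in reported if not confirmed)
    return false_positives / len(reported)


# Hypothetical variant calls, for illustration only.
toy_variants: List[Variant] = [
    (2.1, True), (0.9, True), (0.6, False),
    (0.3, True), (0.2, False), (0.1, False),
]

for limit in (0.5, 0.25, 0.05):
    print(limit, fdr_at_limit(toy_variants, limit))
```

In this toy example the FDR grows as the reporting limit is lowered, which is the qualitative behavior the caption describes for both methods.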
Figure 4. Comparison of LoFreq and V-Phaser with ViVaMBC on the clinical sample. The barplot represents the number of variants (at the SNP or codon level) reported by the different methods for different frequency bins. The bars are colored according to the method. The shaded region in the bars for ViVaMBC corresponds to the fraction of codons that were also discovered with 454.