Khadija Said Mohammed, Nelson Kibinge, Pjotr Prins, Charles N Agoti, Matthew Cotten, D J Nokes, Samuel Brand, George Githinji.
Abstract
Background: High-throughput whole genome sequencing facilitates investigation of minority virus sub-populations from virus-positive samples. Minority variants are useful in understanding within- and between-host diversity and population dynamics, and can potentially assist in elucidating person-to-person transmission pathways. Several minority variant callers have been developed to describe low-frequency sub-populations from whole genome sequence data. These callers differ in the bioinformatics and statistical methods they use to discriminate sequencing errors from low-frequency variants.
Keywords: RSV; concordance; minority variants; performance; variant calling
Year: 2018 PMID: 30483597 PMCID: PMC6234735 DOI: 10.12688/wellcomeopenres.13538.2
Source DB: PubMed Journal: Wellcome Open Res ISSN: 2398-502X
Figure 1. A schematic diagram showing the variant calling workflow.
The artificial datasets (BAM files) were generated using ART-Illumina based on an RSV reference genome. BAMSurgeon was used to spike the resulting BAM files by inserting known variants at known locations across the artificial BAM file.
A breakdown of the performance metrics of the variant callers evaluated on the first dataset, which did not incorporate an error profile.
The samples represent simulated datasets of varying depth of coverage. True positive (TP), true negative (TN), false positive (FP) and false negative (FN) counts were used to calculate the performance metrics of each caller. FPR – false positive rate.
| Sample | Caller | TP | TN | FP | FN | Sensitivity | Specificity | Precision | FPR | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| | freebayes | 87 | 15018 | 21 | 79 | 0.5241 | 0.9986 | 0.8056 | 0.0014 | 0.9934 |
| | lofreq | 29 | 15039 | 0 | 137 | 0.1747 | 1 | 1 | 0 | 0.991 |
| | vardict | 72 | 15039 | 0 | 94 | 0.4337 | 1 | 1 | 0 | 0.9938 |
| | varscan | 11 | 15039 | 0 | 155 | 0.0663 | 1 | 1 | 0 | 0.9898 |
| | freebayes | 118 | 14901 | 138 | 47 | 0.7152 | 0.9908 | 0.4609 | 0.00918 | 0.9878 |
| | lofreq | 67 | 15039 | 0 | 99 | 0.40361 | 1 | 1 | 0 | 0.9935 |
| | vardict | 108 | 15039 | 0 | 58 | 0.6506 | 1 | 1 | 0 | 0.9962 |
| | varscan | 22 | 15039 | 0 | 144 | 0.13253 | 1 | 1 | 0 | 0.9905 |
| | freebayes | 127 | 14454 | 585 | 38 | 0.7697 | 0.9611 | 0.1784 | 0.0389 | 0.959 |
| | lofreq | 57 | 15039 | 0 | 109 | 0.3434 | 1 | 1 | 0 | 0.9928 |
| | vardict | 104 | 15038 | 1 | 62 | 0.6265 | 0.9999 | 0.9905 | 6.65E-05 | 0.9959 |
| | varscan | 40 | 15039 | 0 | 126 | 0.241 | 1 | 1 | 0 | 0.9917 |
| | freebayes | 131 | 12559 | 2480 | 30 | 0.8137 | 0.8351 | 0.0502 | 0.1649 | 0.8349 |
| | lofreq | 60 | 15039 | 0 | 106 | 0.3614 | 1 | 1 | 0 | 0.993 |
| | vardict | 110 | 15029 | 10 | 56 | 0.6627 | 0.9993 | 0.9167 | 6.65E-04 | 0.9957 |
| | varscan | 73 | 15039 | 0 | 93 | 0.4398 | 1 | 1 | 0 | 0.9939 |
| | freebayes | 146 | 14414 | 625 | 20 | 0.8795 | 0.9584 | 0.1894 | 0.0416 | 0.9576 |
| | lofreq | 57 | 15039 | 0 | 109 | 0.3434 | 1 | 1 | 0 | 0.9928 |
| | vardict | 109 | 15036 | 3 | 57 | 0.6567 | 0.9998 | 0.9732 | 1.99E-04 | 0.9961 |
| | varscan | 79 | 15039 | 0 | 87 | 0.4759 | 1 | 1 | 0 | 0.9943 |
| | freebayes | 146 | 14923 | 116 | 20 | 0.8795 | 0.9923 | 0.5571 | 0.0077 | 0.9911 |
| | lofreq | 70 | 15039 | 0 | 96 | 0.4217 | 1 | 1 | 0 | 0.9937 |
| | vardict | 120 | 15039 | 0 | 46 | 0.7229 | 1 | 1 | 0 | 0.997 |
| | varscan | 83 | 15039 | 0 | 83 | 0.5 | 1 | 1 | 0 | 0.9945 |
| | freebayes | 149 | 15020 | 19 | 17 | 0.8976 | 0.9987 | 0.8869 | 0.0013 | 0.9976 |
| | lofreq | 67 | 15039 | 0 | 99 | 0.40366 | 1 | 1 | 0 | 0.9935 |
| | vardict | 117 | 15036 | 3 | 49 | 0.7048 | 0.9998 | 0.975 | 1.99E-04 | 0.9966 |
| | varscan | 78 | 15039 | 0 | 88 | 0.4699 | 1 | 1 | 0 | 0.9942 |
| | freebayes | 145 | 15022 | 17 | 21 | 0.8735 | 0.9989 | 0.8951 | 0.0011 | 0.9975 |
| | lofreq | 72 | 15039 | 0 | 94 | 0.4337 | 1 | 1 | 0 | 0.9938 |
| | vardict | 118 | 15038 | 1 | 48 | 0.7108 | 0.9999 | 0.9916 | 6.65E-05 | 0.9968 |
| | varscan | 97 | 15039 | 0 | 69 | 0.5843 | 1 | 1 | 0 | 0.9955 |
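The metric columns in the table follow from the four counts by the standard confusion-matrix definitions. The sketch below is illustrative rather than the authors' own code; the `Confusion` class and `metrics` function are hypothetical names introduced here. It recomputes the first freebayes row from its TP, TN, FP and FN values.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts for one caller on one sample (hypothetical helper type)."""
    tp: int
    tn: int
    fp: int
    fn: int


def metrics(c: Confusion) -> dict:
    """Standard confusion-matrix metrics; NaN when a denominator is zero."""
    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")

    total = c.tp + c.tn + c.fp + c.fn
    return {
        "sensitivity": ratio(c.tp, c.tp + c.fn),  # TP / (TP + FN)
        "specificity": ratio(c.tn, c.tn + c.fp),  # TN / (TN + FP)
        "precision": ratio(c.tp, c.tp + c.fp),    # TP / (TP + FP)
        "fpr": ratio(c.fp, c.fp + c.tn),          # FP / (FP + TN)
        "accuracy": ratio(c.tp + c.tn, total),    # (TP + TN) / all positions
    }


# Counts taken from the first freebayes row of the table.
row = Confusion(tp=87, tn=15018, fp=21, fn=79)
m = metrics(row)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimal places, these values agree with the tabulated 0.5241, 0.9986, 0.8056, 0.0014 and 0.9934 for that row.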
Figure 2. Proportion of fully concordant positions with respect to sample coverage.
Each plot (A–C) shows the proportion (y-axis) of fully concordant variants with respect to read coverage (x-axis) for the first, second and third datasets, respectively. Concordant positions were defined as positions that were identified by all four variant callers.
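The definition of full concordance amounts to a set intersection over the callers' outputs. The following is a minimal sketch under stated assumptions: the `concordance` function is a hypothetical helper, the genome positions are made-up toy values, and the proportion is taken relative to the positions called by at least one caller, which is an assumed denominator that the caption does not specify.

```python
def concordance(calls: dict) -> tuple:
    """Return (fully concordant positions, proportion of concordant positions).

    `calls` maps a caller name to the set of positions it reported.
    The denominator (positions called by any caller) is an assumption.
    """
    sets = list(calls.values())
    if not sets:
        return set(), 0.0
    concordant = set.intersection(*sets)   # called by every caller
    called_by_any = set.union(*sets)       # called by at least one caller
    proportion = len(concordant) / len(called_by_any) if called_by_any else 0.0
    return concordant, proportion


# Toy positions for illustration only; not data from the study.
toy_calls = {
    "freebayes": {101, 250, 300, 412},
    "lofreq": {101, 300},
    "vardict": {101, 250, 300},
    "varscan": {101, 300, 412},
}
shared, prop = concordance(toy_calls)
print(sorted(shared), prop)
```

Swapping the denominator for the size of the spiked truth set would instead give the fraction of known variants recovered by every caller.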
Figure 3. Heat maps illustrating tool specific concordance for the first artificial dataset.
The red tiles represent variants detected by each caller from the list of 166 variant positions. The panels are arranged left to right (A–H) in order of increasing sample coverage (20, 50, 100, 500, 1000, 2000, 5000 and 10,000). The “not called” column in each panel represents the variants that were not identified by any of the variant callers.
Figure 4. A summary of the relationship between sample coverage and sensitivity (A–C) and between sample coverage and precision (D–F). The x-axis shows the sample coverage and the y-axis shows the sensitivity or precision, respectively. Sensitivity of the callers rose gradually from low- to high-coverage samples. Precision was variable for FreeBayes calls but relatively high for the other three callers.
Figure 5. Box plots showing the distribution of frequencies of the spiked variants and of the corresponding called variants for each variant caller at each coverage (A–H) for the first dataset.