Simone Ciccolella1, Luca Denti1,2, Yuri Pirola3, Marco Previtali1, Paola Bonizzoni1, Gianluca Della Vedova1.
Abstract
BACKGROUND: Efficiently calling variants from the growing volume of sequencing data produced daily from multiple viral strains is of the utmost importance for tracking the spread of those strains across the globe, as the COVID-19 pandemic demonstrated.
Keywords: Genotyping; Lineage classification; SARS-CoV-2; Sequence analysis; Virus
Year: 2022 PMID: 35439933 PMCID: PMC9017741 DOI: 10.1186/s12859-022-04668-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 Precomputed catalog creation. Schematic representation of the pipelines used to create the precomputed SARS-CoV-2 catalogs from the set of public assemblies available on GenBank
GenBank assemblies information
Number of assemblies deposited on GenBank for the most represented lineages (left) and for 5 Variants of Concern (right). Lineages were assigned using pangolin. A total of 626 lineages are present, but 125 of them (about 20%) have only a single sequence and 56 (about 9%) have only two sequences. Sequences were retrieved on April 7, 2021
Fig. 2 Variant genotyping and lineage assignment. Schematic representation of the variant genotyping and lineage assignment module of MALVIRUS
Fig. 3 Report. Example of the final report of MALVIRUS
Samples considered in the experimental evaluation
| SRA ID | Technology | Coverage | GISAID ID | Lineage |
|---|---|---|---|---|
| ERR5026409 | ILL | 62 | 730750 | A.23.1 |
| ERR5040238 | ILL | 725 | 756390 | A.23.1 |
| ERR5053597 | ILL | 366 | 768932 | A.23.1 |
| ERR5174710 | ILL | 753 | 858801 | A.23.1 |
| ERR5189961 | ILL | 707 | 862989 | A.23.1 |
| ERR5074718 | ILL | 123 | 706608 | AP.1 |
| ERR4082432 | ONT | 115 | 425449 | B.1 |
| ERR4246852 | ONT | 285 | 457212 | B.1.1 |
| ERR4437884 | ILL | 2200 | 489377 | B.1.1.134 |
| ERR4584869 | ILL | 1600 | 532794 | B.1.1.253 |
| ERR4668432 | ILL | 483 | 575675 | B.1.1.255 |
| ERR4615778 | ILL | 847 | 539900 | B.1.1.269 |
| ERR4439830 | ILL | 17 | 499533 | B.1.1.323 |
| ERR4438180 | ILL | 367 | 488906 | B.1.157 |
| ERR4848290 | ILL | 509 | 643640 | B.1.160 |
| ERR5011401 | ILL | 838 | 710801 | B.1.1.7 |
| ERR5042696 | ILL | 1000 | 760798 | B.1.1.7 |
| ERR5052852 | ILL | 457 | 769604 | B.1.1.7 |
| ERR5183522 | ILL | 726 | 846539 | B.1.1.7 |
| ERR5184915 | ILL | 854 | 833702 | B.1.1.7 |
| ERR4759202 | ILL | 105 | 595464 | B.1.177 |
| ERR4860691 | ILL | 401 | 646554 | B.1.177.19 |
| ERR5049949 | ILL | 804 | 767261 | B.1.177.8 |
| ERR5082229 | ONT | 136 | 680109 | B.1.258.5 |
| ERR5041219 | ILL | 815 | 762499 | B.1.351 |
| ERR5074602 | ONT | 27 | 764231 | B.1.351 |
| ERR5093255 | ILL | 226 | 819798 | B.1.351 |
| ERR5178844 | ILL | 141 | 812064 | B.1.351 |
| ERR5179824 | ILL | 631 | 821134 | B.1.351 |
| ERR4651735 | ILL | 430 | 567575 | B.1.36.17 |
| SRR13261896 | ILL | 10 | 708631 | B.1.366 |
| ERR4973704 | ILL | 467 | 655669 | B.1.36.9 |
| SRR13606385 | ILL | 55 | 903433 | B.1.427 |
| ERR5042239 | ILL | 983 | 760883 | B.1.525 |
| ERR5176822 | ILL | 459 | 797195 | B.1.525 |
| ERR5181246 | ILL | 564 | 836880 | B.1.525 |
| ERR5181360 | ILL | 633 | 836839 | B.1.525 |
| ERR5190001 | ILL | 89 | 863189 | B.1.525 |
| ERR4366428 | ONT | 13 | 493559 | B.23 |
| SRR11494747 | ILL | 186 | 419918 | B.31 |
| SRR13530301 | ILL | 58 | 873257 | P.1 |
For each sample, we report the SRA Accession ID, the sequencing technology (ILL: Illumina; ONT: Oxford Nanopore Technologies), the coverage in number of bases (in millions), the corresponding GISAID Accession ID (the EPI_ISL_ prefix is omitted for ease of presentation), and the Pango lineage computed by pangolin on the corresponding assembly downloaded from GISAID
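The two conventions used in this table (GISAID IDs without their prefix, coverage expressed in millions of bases) can be undone with a few lines of Python. The following is only a sketch: the function names and the row dictionary are hypothetical, and the depth estimate assumes that every sequenced base maps to the 29,903-base SARS-CoV-2 reference (NC_045512.2), which overstates the true depth.

```python
# Length of the SARS-CoV-2 reference genome NC_045512.2, in bases.
SARS_COV_2_REF_LENGTH = 29_903


def full_gisaid_id(short_id: str) -> str:
    """Restore the EPI_ISL_ prefix that the table omits."""
    return f"EPI_ISL_{short_id}"


def approx_depth(million_bases: float,
                 genome_length: int = SARS_COV_2_REF_LENGTH) -> float:
    """Rough upper bound on mean depth: total bases / genome length.

    Assumption: all sequenced bases align to the reference.
    """
    return million_bases * 1_000_000 / genome_length


# First row of the table, stored in a hypothetical dictionary layout.
row = {
    "sra": "ERR5026409",
    "technology": "ILL",
    "coverage_mbases": 62,
    "gisaid": "730750",
    "lineage": "A.23.1",
}

accession = full_gisaid_id(row["gisaid"])
depth = approx_depth(row["coverage_mbases"])
print(accession, f"{depth:.0f}x")
```

Under these assumptions, even the lowest-coverage samples in the table (10 to 17 million bases) correspond to several hundred reads covering each position on average.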
Results on NCBI catalogs
| Pipeline | MALVIRUS | bcftools | lofreq | |||||
|---|---|---|---|---|---|---|---|---|
| Precision | Recall | Precision | Recall | Precision | Recall | |||
| A | 20 | 50 | .919 | .909 | .778 | .856 | ||
| B | 20 | 50 | .921 | .909 | .788 | .823 | ||
| A | 50 | 100 | .924 | .907 | .757 | .856 | ||
| B | 50 | 100 | .916 | .909 | .763 | .821 | ||
| A | 100 | 100 | .924 | .908 | .738 | .857 | ||
| B | 100 | 100 | .916 | .909 | .744 | .822 | ||
For each catalog, we report the precision and recall achieved by MALVIRUS, BCFtools, and lofreq in calling the variations available in the catalog. The results are shown in terms of average over all the 41 considered samples. We highlighted in bold the best results. We considered 6 different catalogs built using pipeline A or B on the set of assemblies retrieved from NCBI, prefiltered using and then subsampled using different combinations of parameters and
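As an illustration of how per-sample precision and recall can be averaged over a set of samples, here is a minimal sketch. It assumes, hypothetically, that each call set and each truth set is a Python set of (position, ref, alt) tuples restricted to the catalog; the function names and the toy data are illustrative and are not taken from the MALVIRUS evaluation.

```python
from statistics import mean


def precision_recall(called: set, truth: set) -> tuple[float, float]:
    """Precision and recall of one sample's calls against its truth set."""
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def average_over_samples(samples) -> tuple[float, float]:
    """Average precision and recall over (called, truth) pairs."""
    scores = [precision_recall(called, truth) for called, truth in samples]
    return mean(p for p, _ in scores), mean(r for _, r in scores)


# Toy data: two samples, each a (called, truth) pair of variant sets.
samples = [
    (
        {(241, "C", "T"), (3037, "C", "T"), (100, "A", "G")},
        {(241, "C", "T"), (3037, "C", "T")},
    ),
    (
        {(23403, "A", "G")},
        {(23403, "A", "G"), (14408, "C", "T")},
    ),
]

avg_precision, avg_recall = average_over_samples(samples)
print(f"precision={avg_precision:.3f} recall={avg_recall:.3f}")
```

Averaging per-sample scores, rather than pooling all calls, gives each sample the same weight regardless of how many variants it carries.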
Results on NCBI catalogs depending on sequencing technology
| Technology | MALVIRUS | bcftools | lofreq | |||
|---|---|---|---|---|---|---|
| Precision | Recall | Precision | Recall | Precision | Recall | |
| ILL | .942 | .899 | .775 | .817 | ||
| ONT | .946 | .731 | .680 | .845 | ||
We report the precision and recall achieved by MALVIRUS, BCFtools, and lofreq in calling the variations available in the default catalog (NCBI catalog, Pipeline B, , ). The results are shown in terms of average over all the considered samples aggregated by sequencing technology. We highlighted in bold the best results
Results on GISAID catalogs
| Pipeline | | | Min support | Precision | Recall | Time (s) | No. of correct lineages |
|---|---|---|---|---|---|---|---|
| A | 5 | 20 | 2 | .953 | .947 | 38 | 40 |
| A | 5 | 20 | 5 | .951 | .945 | 19 | 40 |
| B | 5 | 20 | 2 | .992 | .967 | 48 | 40 |
| B | 5 | 20 | 5 | .993 | .960 | 21 | 40 |
| A | 50 | 50 | 2 | .933 | .918 | 897 | 40 |
| A | 50 | 50 | 5 | .942 | .948 | 355 | 40 |
| B | 50 | 50 | 2 | .960 | .962 | 2465 | 40 |
| B | 50 | 50 | 5 | .972 | .960 | 677 | 40 |
For each catalog, we report the precision and recall achieved by MALVIRUS in genotyping its variations, the average running times, and the number of input samples (out of 41) assigned to the correct lineage. We considered 8 different catalogs, built using pipeline A or B on the set of assemblies retrieved from GISAID, prefiltered using and then subsampled using different combinations of parameters and . In addition, we also filtered out from the catalogs all variations present in fewer than either 2 or 5 assemblies (Min support column)
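The minimum-support filter described in this caption can be sketched as follows. This is an assumption-laden illustration, not the catalog-building code: the mapping from assembly IDs to variant sets, the function name, and the toy variants are all hypothetical.

```python
from collections import Counter


def filter_by_min_support(assembly_variants: dict, min_support: int) -> set:
    """Keep only variants present in at least `min_support` assemblies."""
    support = Counter()
    for variants in assembly_variants.values():
        # set() ensures an assembly counts at most once per variant.
        support.update(set(variants))
    return {variant for variant, n in support.items() if n >= min_support}


# Toy catalog input: assembly ID -> variants as (position, ref, alt).
assembly_variants = {
    "asm1": {(241, "C", "T"), (3037, "C", "T")},
    "asm2": {(241, "C", "T")},
    "asm3": {(241, "C", "T"), (3037, "C", "T"), (23403, "A", "G")},
}

kept_2 = filter_by_min_support(assembly_variants, min_support=2)
kept_3 = filter_by_min_support(assembly_variants, min_support=3)
print(sorted(kept_2), sorted(kept_3))
```

Raising the threshold shrinks the catalog, which is consistent with the shorter running times reported in the table for a minimum support of 5.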