Marco Cacciabue, Pablo Aguilera, María Inés Gismondi, Oscar Taboga.
Abstract
The epidemiological surveillance of SARS-CoV-2 by means of whole-genome sequencing has revealed the emergence and co-existence of multiple viral lineages or subtypes throughout the world. Moreover, several subtypes of this virus have been shown to display particular phenotypes, such as increased transmissibility or reduced susceptibility to neutralizing antibodies, leading to their designation as Variants of Interest (VOI) or Variants of Concern (VOC). Thus, subtyping of SARS-CoV-2 is a crucial step in the surveillance of this pathogen. Here, we present Covidex, an open-source, alignment-free machine learning subtyping tool. It is a Shiny web app that allows ultra-fast and accurate classification of SARS-CoV-2 genome sequences under the three most widely used nomenclature systems (GISAID, Nextstrain, Pango lineages). It also categorizes input sequences as VOI or VOC, according to current definitions. The program is cross-platform compatible and is available via SourceForge at https://sourceforge.net/projects/covidex or as a web application at http://covidex.unlu.edu.ar.
Keywords: Machine learning; SARS-CoV-2; Subtyping; VOC; VOI; Web-application
Year: 2022 PMID: 35231666 PMCID: PMC8881885 DOI: 10.1016/j.meegid.2022.105261
Source DB: PubMed Journal: Infect Genet Evol ISSN: 1567-1348 Impact factor: 3.342
Fig. 1. Workflow overview of Covidex. First, viral sequences are loaded in FASTA format. Next, normalized k-mer counts are obtained from these sequences. Three random forest models are then used to classify the query sequences, and probability scores are calculated from the number of trees that vote for each class. Finally, the classification results are presented and a report can be generated for download.
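The two numeric steps of this workflow, normalized k-mer counting and vote-based probability scores, can be illustrated with a minimal Python sketch. This is not the Covidex implementation (Covidex is a Shiny app): the function names `kmer_profile` and `vote_probabilities`, the default `k=3`, the choice to divide counts by the total number of valid k-mers, and the skipping of windows containing ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_profile(seq, k=3):
    """Return normalized k-mer frequencies in a fixed (lexicographic) order.

    Assumption: windows containing characters other than A, C, G, T
    (for example N) are skipped, and counts are divided by the number
    of valid windows.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(ALPHABET):
            counts[kmer] += 1
    total = sum(counts.values())
    # One feature per possible k-mer, so every sequence yields a vector
    # of the same length (4**k) regardless of which k-mers it contains.
    return [
        counts["".join(p)] / total if total else 0.0
        for p in product(ALPHABET, repeat=k)
    ]


def vote_probabilities(tree_votes):
    """Turn per-tree class votes into probability scores (vote fractions)."""
    counts = Counter(tree_votes)
    n = len(tree_votes)
    return {label: c / n for label, c in counts.items()}


if __name__ == "__main__":
    profile = kmer_profile("ACGTACGTNNACGT", k=2)
    print(len(profile), round(sum(profile), 6))
    # Hypothetical votes from a four-tree forest.
    print(vote_probabilities(["B.1", "B.1", "B.1", "P.1"]))
```

Because the feature vector has a fixed length, no alignment is needed before classification, which is what makes this kind of approach alignment-free.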
Supplementary Fig. 1. Accuracy score and running time for the random forest algorithm, at different values of k, for a set of 8000 whole SARS-CoV-2 genomes. The black arrow shows the chosen k (highest accuracy with an overall low time).
Fig. 2. Overview of the Covidex app. The user is expected to load a sequence file and press RUN. A results table will be shown. Additionally, the user can download an automatic report.
Classification model statistics. Basic statistics for each classification model. Models were derived from SARS-CoV-2 sequences downloaded on 2021/11/27. Accuracy was calculated as the number of correctly labeled sequences divided by the total number of sequences. Multi-class AUC is the mean AUC over all pairwise class comparisons.
| Model | Number of classes | Number of trees | Training dataset size | Testing dataset size | Accuracy | Multi-class AUC |
|---|---|---|---|---|---|---|
| GISAID | 10 | 1000 | 66,126 | 13,230 | 0.9777 | 0.9931 |
| Nextstrain | 22 | 1000 | 63,972 | 12,810 | 0.9952 | 0.9879 |
| Pango | 1437 | 500 | 65,467 | 12,346 | 0.9656 | 0.9926 |
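The two metrics in this table can be sketched in a few lines of Python. Accuracy follows the definition given in the caption. For the multi-class AUC, the sketch below uses one common reading of "mean AUC from all pairwise class comparisons" (the Hand and Till style average); whether this matches the authors' exact computation is an assumption, and the function names and toy labels are hypothetical.

```python
from itertools import combinations


def accuracy(true_labels, predicted_labels):
    """Correctly labeled sequences divided by the number of sequences."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def auc_pair(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def pairwise_mean_auc(true_labels, probs):
    """Mean AUC over all unordered class pairs.

    probs is a list of dicts mapping each class to its predicted
    probability for the corresponding sample. For a pair (a, b), only
    samples whose true class is a or b are used, and the two directed
    AUCs are averaged (assumed Hand and Till style formulation).
    """
    classes = sorted(set(true_labels))
    pair_aucs = []
    for a, b in combinations(classes, 2):
        a_vs_b = auc_pair(
            [p[a] for t, p in zip(true_labels, probs) if t == a],
            [p[a] for t, p in zip(true_labels, probs) if t == b],
        )
        b_vs_a = auc_pair(
            [p[b] for t, p in zip(true_labels, probs) if t == b],
            [p[b] for t, p in zip(true_labels, probs) if t == a],
        )
        pair_aucs.append((a_vs_b + b_vs_a) / 2)
    return sum(pair_aucs) / len(pair_aucs)


if __name__ == "__main__":
    # Toy data, not taken from the paper.
    truth = ["G", "GR", "GH", "G"]
    predicted = ["G", "GR", "G", "G"]
    print(accuracy(truth, predicted))
```

The nested loop in `auc_pair` is quadratic in the number of samples, which is fine for illustration but would be replaced by a rank-based computation for test sets of the size reported in the table.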
Supplementary Fig. 2. VOC and VOI variant detection. Covidex performance in detecting variants of relevance was analyzed. For each VOC and VOI category, a sample dataset was created by downloading sequences from the GISAID database (with the high coverage and complete filters on). Sequences: number of sequences in the dataset; Accuracy: percentage of correctly labeled variants.