Jnanendra Prasad Sarkar, Indrajit Saha, Nimisha Ghosh, Debasree Maity, Dariusz Plewczynski.
Abstract
Virus classification has long been a concern in virology and epidemiology. Machine learning can help here by predicting the novel coronavirus directly from its sequence. We propose a machine learning-based novel coronavirus prediction technique, called COVID-Predictor, in which 1000 sequences of SARS-CoV-1, MERS-CoV, SARS-CoV-2, and other viruses are used to train a Naive Bayes classifier so that it can predict unknown sequences of these viruses. The model is validated using 10-fold cross-validation and compared with other machine learning techniques. The results show the superiority of our predictor, which achieves an average accuracy of 99.7% on an unseen validation set of viruses. The same pre-trained model is used in a web-based application to which sequences of unknown viruses can be uploaded to predict the novel coronavirus.
Year: 2022 PMID: 35847318 PMCID: PMC9280959 DOI: 10.1021/acsomega.2c00215
Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343
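The abstract describes training a Naive Bayes classifier on virus sequences. As a rough illustration of that idea, the sketch below implements a small multinomial Naive Bayes over k-mer tokens in plain Python. The multinomial variant, the add-one smoothing, the overlapping k-mers, the class labels, and the toy sequences are all assumptions made for this example; the record does not state which Naive Bayes variant or preprocessing COVID-Predictor actually uses.

```python
import math
from collections import Counter


def kmers(sequence, k):
    """Split a nucleotide string into overlapping k-mers (stride 1, an assumption)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


class MultinomialNB:
    """Tiny multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.log_prior = {}
        self.log_lik = {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(tok for d in class_docs for tok in d)
            total = sum(counts.values()) + len(self.vocab)
            self.log_lik[c] = {
                t: math.log((counts[t] + 1) / total) for t in self.vocab
            }
        return self

    def predict(self, doc):
        # Tokens that never appeared in training are skipped.
        def score(c):
            return self.log_prior[c] + sum(
                self.log_lik[c][t] for t in doc if t in self.vocab
            )
        return max(self.classes, key=score)


# Toy, made-up sequences purely for illustration (not data from the paper).
train_seqs = ["ATATATATAT", "TATATATATA", "GCGCGCGCGC", "CGCGCGCGCG"]
train_labels = ["class_a", "class_a", "class_b", "class_b"]
model = MultinomialNB().fit([kmers(s, 3) for s in train_seqs], train_labels)
print(model.predict(kmers("ATATATAT", 3)))
```

Scores are accumulated in log space so that long genomes, which contribute tens of thousands of tokens, do not underflow to zero probability.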
Statistics of the Refined Datasets of Corona and Other Viruses
| virus name | source of sequence | no. of sequence | max length of sequence | avg length of sequence |
|---|---|---|---|---|
| SARS-CoV-1 | NCBI | 340 | 30,311 | 29,514 |
| MERS-CoV | NCBI | 291 | 30,150 | 29,983 |
| SARS-CoV-2 | GISAID | 2391 | 29,986 | 29,512 |
| Other viruses | NCBI | 600 | 19,897 | 15,316 |
Prediction Performance of COVID-Predictor on Validation Data
| source | data samples | accuracy | precision | recall | F1 score | ROC-AUC-score | MCC |
|---|---|---|---|---|---|---|---|
| NCBI | 493 sequences (only SARS-CoV-2) | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
| NCBI + GISAID | 1090 sequences (90 SARS-CoV-1, 200 MERS-CoV, 200 SARS-CoV-2, 600 other viruses) | 0.99908 | 0.99908 | 0.99908 | 0.99908 | 0.99912 | 0.99852 |
| NCBI + GISAID | 2043 sequences (103 SARS-CoV-1, 41 MERS-CoV, 1599 SARS-CoV-2, 300 other viruses) | 0.98217 | 0.98991 | 0.98217 | 0.98602 | 0.98432 | 0.98291 |
| NCBI + GISAID | 3143 sequences (90 SARS-CoV-1, 41 MERS-CoV, 2152 SARS-CoV-2, 860 other viruses) | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
| NCBI + GISAID | 3500 sequences (90 SARS-CoV-1, 250 MERS-CoV, 2410 SARS-CoV-2, 750 other viruses) | 0.99971 | 0.99971 | 0.99971 | 0.99971 | 0.99952 | 0.99940 |
| NCBI + GISAID | 4000 sequences (90 SARS-CoV-1, 220 MERS-CoV, 3030 SARS-CoV-2, 2639 other viruses) | 0.99975 | 0.99975 | 0.99975 | 0.99974 | 0.99949 | 0.99937 |
| GISAID | 4747 sequences (only SARS-CoV-2) | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
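In every row of the table above, recall equals accuracy, which is what support-weighted averaging over classes would produce; this is an inference from the numbers, not something the record states. The sketch below shows, under that assumption, how accuracy, weighted recall, and a multiclass Matthews correlation coefficient (the standard confusion-matrix generalization) can be computed from true and predicted labels. The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with class support as weights (assumed averaging)."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(support[c] * (hits[c] / support[c]) for c in support) / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multiclass MCC from label counts.

    s = number of samples, c = number of correct predictions,
    t_k = how often class k truly occurs, p_k = how often class k is predicted.
    """
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_k = Counter(y_true)
    p_k = Counter(y_pred)
    labels = set(t_k) | set(p_k)
    numerator = c * s - sum(t_k[l] * p_k[l] for l in labels)
    denominator = math.sqrt(
        (s * s - sum(v * v for v in p_k.values()))
        * (s * s - sum(v * v for v in t_k.values()))
    )
    # Convention: return 0.0 when the denominator vanishes.
    return numerator / denominator if denominator else 0.0


# Toy labels for illustration only.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
print(accuracy(y_true, y_pred), weighted_recall(y_true, y_pred),
      multiclass_mcc(y_true, y_pred))
```

For two classes this expression reduces to the familiar binary MCC, so the same function covers both the single-class and the four-class validation sets.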
Figure 1. (a) Word cloud of descriptors generated using k-mer techniques from sequences of SARS-CoV-1, MERS-CoV, SARS-CoV-2, and other viruses. (b) Top 10 n-grams of descriptors generated using k-mer techniques from sequences of SARS-CoV-1, MERS-CoV, SARS-CoV-2, and other viruses. (c) Pipeline of the proposed COVID-Predictor. (d) Performance measures for k = 7 and n = 3. (e) Screenshots of the web-based COVID-Predictor to select a virus sequence file as CSV and results after prediction.
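The caption refers to descriptors built with k-mers and to the most frequent n-grams of those descriptors. One plausible reading, sketched below, treats each k-mer as a "word" and each run of n consecutive k-mer words as an n-gram, then counts the most frequent ones. The overlapping stride, the space-joined representation, and the helper names `kmer_words`, `word_ngrams`, and `top_ngrams` are assumptions for illustration, not the authors' documented procedure.

```python
from collections import Counter


def kmer_words(sequence, k):
    """Overlapping k-mers of a sequence, read as a list of 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def word_ngrams(words, n):
    """Join every run of n consecutive words into one n-gram token."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def top_ngrams(sequences, k, n, top=10):
    """Count n-grams of k-mer words across sequences; return the most common."""
    counts = Counter()
    for seq in sequences:
        counts.update(word_ngrams(kmer_words(seq.upper(), k), n))
    return counts.most_common(top)


# Toy sequence for illustration; real genomes here are roughly 30,000 bases long.
print(top_ngrams(["ATGATGATG"], k=3, n=2, top=3))
```

With k = 7 and n = 3, the setting shown in panel (d), each token would span three consecutive 7-mers under this reading.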
Prediction Performance of Different Machine Learning Techniques after Performing 10-Fold Cross Validation with Different Values of k-mer and n-gram on 1000 Genome Sequences of SARS-CoV-1, MERS-CoV, SARS-CoV-2, and Other Virus Sequences
ROC AUC Score and MCC of Different Machine Learning Techniques after Performing 10-Fold Cross Validation with Different Values of k-mer and n-gram on 1000 Genome Sequences of SARS-CoV-1, MERS-CoV, SARS-CoV-2, and Other Virus Sequences
| method | k-mer | ROC-AUC-score | MCC | ROC-AUC-score | MCC | ROC-AUC-score | MCC | ROC-AUC-score | MCC |
|---|---|---|---|---|---|---|---|---|---|
| NB | 2 | 0.99998 | 0.99746 | 0.99810 | 0.99746 | 0.99998 | 0.99746 | 1.00000 | 0.98987 |
| GSVM | 2 | 0.99997 | 0.93321 | 0.99994 | 0.95759 | 0.99998 | 0.97610 | 0.99998 | 0.98985 |
| RF | 2 | 0.99795 | 0.93321 | 0.99998 | 0.95761 | 0.99872 | 0.97608 | 0.99940 | 0.99873 |
| NB | 3 | 0.99810 | 0.99746 | 0.99999 | 0.99746 | 1.00000 | 0.99873 | 0.99872 | 0.99874 |
| GSVM | 3 | 0.99994 | 0.95759 | 0.99998 | 0.97610 | 0.99998 | 0.98985 | 1.00000 | 0.99874 |
| RF | 3 | 0.99998 | 0.95760 | 0.99872 | 0.97612 | 0.99940 | 0.98987 | 1.00000 | 0.99746 |
| NB | 4 | 0.99872 | 0.99746 | 1.00000 | 0.99873 | 1.00000 | 0.99746 | 1.00000 | 0.99946 |
| GSVM | 4 | 0.99998 | 0.97610 | 0.99998 | 0.98985 | 1.00000 | 0.99874 | 1.00000 | 0.99874 |
| RF | 4 | 0.99997 | 0.97608 | 0.99940 | 0.98985 | 0.99872 | 0.99874 | 1.00000 | 0.99876 |
| NB | 5 | 0.99940 | 0.99873 | 0.99872 | 0.99746 | 0.99811 | 0.99874 | 1.00000 | 0.99876 |
| GSVM | 5 | 0.99998 | 0.98985 | 1.00000 | 0.99874 | 0.99974 | 0.99495 | 1.00000 | 0.99873 |
| RF | 5 | 1.00000 | 0.98987 | 1.00000 | 0.99875 | 0.99798 | 0.99591 | 1.00000 | 0.99874 |
| NB | 6 | 0.99872 | 0.99746 | 1.00000 | 0.99897 | 1.00000 | 0.99998 | 0.99938 | 0.99873 |
| GSVM | 6 | 1.00000 | 0.99874 | 0.99874 | 0.98885 | 1.00000 | 0.99873 | 1.00000 | 0.99620 |
| RF | 6 | 1.00000 | 0.99876 | 1.00000 | 0.98973 | 1.00000 | 0.99877 | 1.00000 | 0.99624 |
| NB | 7 | 1.00000 | 0.99877 | 1.00000 | 1.00000 | 0.99938 | 0.99873 | 0.99938 | 0.99873 |
| GSVM | 7 | 1.00000 | 0.99873 | 1.00000 | 1.00000 | 1.00000 | 0.99620 | 1.00000 | 0.99493 |
| RF | 7 | 1.00000 | 0.99873 | 1.00000 | 1.00000 | 1.00000 | 0.99623 | 0.99998 | 0.99491 |
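The scores above come from 10-fold cross-validation on 1000 sequences. A minimal sketch of how such folds can be generated is given below: the sample indices are shuffled once and dealt into ten disjoint test folds, each paired with the remaining indices for training. The shuffling, the fixed seed, the lack of stratification by virus class, and the name `k_fold_indices` are assumptions for this example; the record does not describe how the folds were drawn.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled, non-stratified k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds disjoint folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# With 1000 samples and 10 folds, each split trains on 900 and tests on 100.
for train_idx, test_idx in k_fold_indices(1000):
    print(len(train_idx), len(test_idx))
```

Each configuration of k-mer and n-gram in the table would be scored by fitting the classifier on `train_idx`, evaluating on `test_idx`, and averaging the metric over the ten folds.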