Walaa Alkady1, Khaled ElBahnasy2, Víctor Leiva3, Walaa Gad2.
Abstract
COVID-19 causes serious respiratory illness, so accurate identification of the viral infection cycle plays a key role in designing appropriate vaccines. The risk of this disease depends on proteins that interact with human receptors. In this paper, we formulate a novel model for COVID-19 named "amino acid encoding based prediction" (AAPred). This model is accurate, classifies the various coronavirus types, and distinguishes SARS-CoV-2 from other coronaviruses. With the AAPred model, we reduce the number of features, selecting the most important ones by statistical criteria, to enhance performance. We analyze the protein sequence of SARS-CoV-2 to understand the viral infection cycle. Six machine learning classifiers (decision tree, k-nearest neighbors, random forest, support vector machine, bagging ensemble, and gradient boosting) are used to evaluate the model in terms of accuracy, precision, sensitivity, and specificity. We implement the obtained results computationally and apply them to real data from the National Genomics Data Center. The experimental results show that the AAPred model reduces the number of features to seven. With features selected by information gain and classified with random forest, the average accuracy under 10-fold cross-validation is 98.69%, precision is 98.72%, sensitivity is 96.81%, and specificity is 97.72%. The proposed model predicts the type of coronavirus while reducing the number of extracted features. We find that SARS-CoV-2 shares physicochemical characteristics with SARS-CoV in some regions. We also report that SARS-CoV-2 has infection cycles and sequences similar to those of SARS-CoV in some regions, indicating the effectiveness of vaccines against SARS-CoV-2. A comparison with deep learning yields results similar to those of our method.
Keywords: ANOVA; Amino acid composition; Artificial intelligence; Bagging ensemble and gradient boosting; Chi-square test; Deep learning; Feature extraction and selection; Information gain; LASSO; Molecular modeling; Protein sequence; SARS-CoV-2
Year: 2022 PMID: 35308181 PMCID: PMC8923015 DOI: 10.1016/j.chemolab.2022.104535
Source DB: PubMed Journal: Chemometr Intell Lab Syst ISSN: 0169-7439 Impact factor: 4.175
Acronyms and definitions of symbols.
| Acronym/Symbol | Meaning | Acronym/Symbol | Meaning |
|---|---|---|---|
| A, G, M, … | One-letter abbreviations of the 20 amino acids | MI | Mutual information |
| AAC | Amino acid composition | MMS between | Mean of the sum of squares between groups |
| AAPred | Amino acid encoding based prediction | MMS within | Mean of the sum of squares in the same group |
| Acc | Accuracy | NCBI | National Center for Biotechnology Information |
| ANOVA | Analysis of variance | NGDC | National Genomics Data Center |
| BE | Bagging ensemble | NSPs | Non-structural proteins |
|  | Number of classes, dataset, feature | Num | Number of occurrences of class |
| COVID-19 | Coronavirus disease 2019 | ORF | Open reading frame |
| CSV | Comma-separated values | Pr | Probability of selecting a sample of a class |
| DNA | Deoxyribonucleic acid | PK | Polynomial kernel |
| DT | Decision tree | Prec | Precision |
| Ent | Entropy | RBD | Receptor-binding domain |
| FASTA | Text format for nucleotide or amino acid sequences | RBF | Radial basis function |
| FN | False-negative | RF | Random forest |
| FP | False-positive | RNA | Ribonucleic acid |
| Freq | Frequency of each amino acid class | SARS-CoV | Severe acute respiratory syndrome coronavirus |
| GB | Gradient boosting | Sens | Sensitivity |
| IG | Information gain | SK | Sigmoid kernel |
| KEGG | Kyoto Encyclopedia of Genes and Genomes | Spec | Specificity |
| KNN | k-nearest neighbor | Spyder | Scientific Python development environment |
| LASSO | Least absolute shrinkage and selection operator | SVM | Support vector machine |
| Len | Length of the protein sequence | TN | True-negative |
| LK | Linear kernel | TP | True-positive |
| MD | Manhattan distance | WHO | World Health Organization |
| MERS-CoV | Middle East respiratory syndrome coronavirus | χ2 | Chi-square |
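The symbols Ent, IG, and Pr in the table above relate to entropy-based feature ranking. As a minimal sketch, assuming the standard Shannon entropy definition and a feature that has already been discretized (the paper's exact procedure is not shown in this record), information gain could be computed as follows. The function names `entropy` and `information_gain` and the toy labels are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # n / total plays the role of Pr, the probability of drawing a sample of a class.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting the labels on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data, made up for illustration (not taken from the NGDC dataset).
labels = ["covid", "covid", "other", "other"]
perfect = ["hi", "hi", "lo", "lo"]  # separates the two classes exactly
useless = ["hi", "lo", "hi", "lo"]  # carries no class information
print(information_gain(perfect, labels))
print(information_gain(useless, labels))
```

A feature that splits the classes cleanly receives the full entropy of the labels as its gain, while an uninformative feature receives zero, which is what makes the score usable for ranking features.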
Fig. 1. Phases of the AAPred model and their connection.
Amino acid classification based on side-chain volumes and dipole values. Source: Taken from Ref. [32], which is licensed under a CC BY 4.0 License (creativecommons.org, accessed on October 24, 2021).
| Class number | Dipole scale | Volume scale | Amino acids |
|---|---|---|---|
| 1 | – | – | A, G, V |
| 2 | – | + | I, L, F, P |
| 3 | + | + | Y, M, T, S |
| 4 | ++ | + | H, N, Q, W |
| 5 | +++ | + | R, K |
| 6 | +'+'+' | + | D, E |
| 7 | + | + | C |
Eight classes of amino acids. Source: Taken from Ref. [33], which is licensed under a CC BY 4.0 License (creativecommons.org, accessed on October 24, 2021).
| Class | Amino acids |
|---|---|
| 0 | X (unknown), B (D or N), Z (E or Q) |
| 1 | A, G, V |
| 2 | I, L, F, P, J (I or L) |
| 3 | Y, M, T, S |
| 4 | H, N, Q, W |
| 5 | R, K |
| 6 | D, E |
| 7 | C |
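The eight-class table can be turned into a fixed-length feature vector per protein. The sketch below assumes that the frequency of a class is its number of occurrences divided by the sequence length (Freq = Num / Len, as the symbol definitions suggest) and that any symbol not listed in the table falls into class 0. The names `CLASS_MEMBERS`, `RESIDUE_TO_CLASS`, and `class_frequencies` are hypothetical, and the example sequence is made up.

```python
from collections import Counter

# Class assignments copied from the eight-class table above.
CLASS_MEMBERS = {
    0: "XBZ",
    1: "AGV",
    2: "ILFPJ",
    3: "YMTS",
    4: "HNQW",
    5: "RK",
    6: "DE",
    7: "C",
}
RESIDUE_TO_CLASS = {aa: c for c, members in CLASS_MEMBERS.items() for aa in members}


def class_frequencies(sequence):
    """Return the fraction of residues that fall in each of the eight classes."""
    seq = sequence.upper()
    # Assumption: symbols absent from the table are treated as unknown (class 0).
    counts = Counter(RESIDUE_TO_CLASS.get(aa, 0) for aa in seq)
    length = len(seq)
    return [counts[c] / length if length else 0.0 for c in range(8)]


# Made-up ten-residue sequence, used only to show the output shape.
print(class_frequencies("MAGVCKRDEX"))
# [0.1, 0.3, 0.0, 0.1, 0.0, 0.2, 0.2, 0.1]
```

Because every residue maps to exactly one class, the eight values always sum to one for a non-empty sequence, so sequences of very different lengths become directly comparable.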
Data presented in the CSV file format.
| Accession | Species | Length | Host |
|---|---|---|---|
| AVP78037 | SARS-Cov-2 | 121 | Homo Sapiens |
| AVP78039 | SARS-Cov-2 | 97 | Homo Sapiens |
| AVP78040 | SARS-Cov | 70 | Homo Sapiens |
| BBE15202 | Alpha | 237 | Felis Catus |
| QBI71705 | Avian | 125 | Gallus Gallus |
| AXM42849 | Porcine | 161 | Sus Scrofa |
| ATG84898 | MERS | 4391 | Homo Sapiens |
Fig. 2. FASTA file sample.
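Since the figure itself is not reproduced here, the following is a minimal sketch of how FASTA text could be read into header and sequence pairs before feature extraction. It assumes the common FASTA layout (a header line starting with `>` followed by one or more sequence lines). The function name `parse_fasta` and the sample records are hypothetical and do not come from the NGDC data.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict that maps each header line to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequences may span several lines
    return {h: "".join(parts) for h, parts in records.items()}


# Hypothetical two-record sample; identifiers and residues are made up.
sample = """>EXAMPLE1 hypothetical protein
MAGV
CKRD
>EXAMPLE2 hypothetical protein
ILFP
"""
records = parse_fasta(sample)
print(records)
```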
Fig. 3. Bar plots of the frequencies of the eight amino acid classes in (a) COVID-19 and (b) non-COVID-19 samples for the NGDC dataset.
AAPred model performance (in %) for the indicated feature selection method and classifier with the NGDC dataset.
| Classifier | IG Acc | IG Sens | IG Spec | IG Prec | ANOVA Acc | ANOVA Sens | ANOVA Spec | ANOVA Prec | χ2 Acc | χ2 Sens | χ2 Spec | χ2 Prec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BE | 98.53 | 98.21 | 98.43 | 98.35 | 96.91 | 96.32 | 96.44 | 96.83 | 98.89 | 98.87 | 98.33 | 98.33 |
| DT | 99.23 | 99.19 | 99.37 | 99.37 | 99.39 | 99.31 | 99.48 | 99.48 | 99.28 | 99.17 | 99.40 | 99.40 |
| GB | 97.61 | 96.80 | 97.85 | 97.51 | 95.62 | 95.34 | 95.64 | 95.64 | 97.81 | 97.56 | 97.74 | 97.73 |
| KNN | 99.66 | 99.65 | 99.67 | 99.67 | 99.63 | 99.55 | 99.71 | 99.71 | 99.69 | 99.65 | 99.72 | 99.72 |
| RF | 99.69 | 99.80 | 99.58 | 99.58 | 99.69 | 99.81 | 99.56 | 99.56 | 99.68 | 99.78 | 99.57 | 99.57 |
| SVM | 95.13 | 95.40 | 94.86 | 94.89 | 95.14 | 95.39 | 94.89 | 94.91 | 95.15 | 95.40 | 94.90 | 94.92 |
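The four columns reported per method are derived from the confusion counts TP, TN, FP, and FN listed among the acronyms. A minimal sketch, assuming the standard definitions of these metrics (the record does not reproduce the paper's formulas), is given below. The function name `classification_metrics` and the counts passed to it are made up for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity, and precision as percentages."""
    return {
        "Acc": 100 * (tp + tn) / (tp + tn + fp + fn),  # share of all correct calls
        "Sens": 100 * tp / (tp + fn),                  # true-positive rate
        "Spec": 100 * tn / (tn + fp),                  # true-negative rate
        "Prec": 100 * tp / (tp + fp),                  # share of positive calls that are right
    }


# Made-up confusion counts, chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
print(metrics)
```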
AAPred model performance (in %) for the indicated feature selection method and classifier with the NGDC dataset using 10-fold cross-validation.
| Classifier | IG Acc | IG Sens | IG Spec | IG Prec | ANOVA Acc | ANOVA Sens | ANOVA Spec | ANOVA Prec | χ2 Acc | χ2 Sens | χ2 Spec | χ2 Prec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BE | 98.53 | 96.21 | 97.43 | 98.35 | 96.51 | 96.32 | 94.44 | 90.83 | 93.91 | 90.32 | 90.44 | 90.83 |
| DT | 89.23 | 89.19 | 89.37 | 89.37 | 94.39 | 96.31 | 90.48 | 90.48 | 89.28 | 89.17 | 89.40 | 89.40 |
| GB | 97.61 | 96.80 | 97.85 | 97.51 | 95.62 | 95.34 | 95.64 | 90.64 | 92.62 | 89.34 | 88.64 | 88.64 |
| KNN | 89.66 | 89.65 | 89.67 | 89.67 | 89.63 | 89.55 | 89.71 | 89.71 | 90.69 | 89.65 | 88.72 | 91.72 |
| RF | 98.69 | 96.81 | 97.72 | 98.72 | 96.69 | 95.81 | 95.56 | 91.56 | 94.68 | 90.78 | 91.57 | 89.57 |
| SVM | 85.13 | 85.40 | 84.86 | 84.89 | 85.14 | 85.39 | 84.89 | 84.91 | 94.15 | 85.40 | 84.90 | 84.92 |
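The values above are averages over ten folds. As a rough sketch of the splitting step only, assuming plain (non-stratified) k-fold partitioning with a shuffled index list, the folds could be built as follows. The helper names `k_fold_indices` and `mean_score`, the seed, and the sample count are assumptions made for this example; the record does not state how the authors built their folds.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Return a list of (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # fixed seed keeps the split reproducible
    folds = [indices[i::k] for i in range(k)]  # k folds whose sizes differ by at most one
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def mean_score(per_fold_scores):
    """Average per-fold values into a single reported score."""
    return sum(per_fold_scores) / len(per_fold_scores)


splits = k_fold_indices(n_samples=25, k=10)
print(len(splits))                              # number of folds
print(sorted(len(test) for _, test in splits))  # sizes of the test folds
```

Each sample lands in exactly one test fold, so every sequence is used for evaluation once and for training nine times. In practice a library routine such as scikit-learn's `cross_val_score` would typically replace this hand-written loop.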
Fig. 4. Proposed model performance based on IG for the NGDC dataset using 10-fold cross-validation.
Fig. 5. Proposed model performance based on ANOVA for the NGDC dataset using 10-fold cross-validation.
Fig. 6. Proposed model performance based on the χ2 method for the NGDC dataset using 10-fold cross-validation.
Fig. 7. Proposed model performance using the spike protein dataset.
AAPred model performance compared to the method proposed in Ref. [14]. Source: The authors.
| Method | Number of features | Acc (%) | Sens (%) | Spec (%) | Prec (%) | Computing time (s) |
|---|---|---|---|---|---|---|
| IG | 7 | 98.69 | 96.81 | 97.72 | 98.72 | 6.43 |
| ANOVA | 7 | 96.69 | 95.81 | 95.56 | 91.56 | 1.17 |
| χ2 | 7 | 94.68 | 90.78 | 91.57 | 89.57 | 1.12 |
| Chen et al. [14] | 35 | 95.90 | 96.10 | 95.70 | 98.60 | 5.62 |
AAPred performance compared to the method in Ref. [15]. Source: The authors.
| Method | Number of features | Acc (%) | Sens (%) | Spec (%) | Prec (%) | Computing time (s) |
|---|---|---|---|---|---|---|
| IG | 7 | 99.01 | 98.56 | 97.02 | 96.41 | 3.58 |
| Qiang et al. [15] | 20 | 98.18 | 99.16 | 97.26 | 96.38 | 4.21 |