Shaojun Pei, Rui Dong, Yiming Bao, Rong Lucy He, Stephen S-T Yau.
Abstract
BACKGROUND: Begomoviruses are widely distributed and cause devastating diseases in many crops. Depending on the number of genomic components, a begomovirus is classified as either monopartite or bipartite. Both monopartite and bipartite begomoviruses have a DNA-A component, which encodes all proteins essential for virus function, while bipartite begomoviruses additionally contain a DNA-B component. Satellite molecules, known as betasatellites, alphasatellites or deltasatellites, are sometimes associated with begomoviruses. The genomic components of begomoviruses are therefore complex and varied, and different components have different gene structures and functions. Classifying the components of begomoviruses is important for studying virus origin and pathogenic mechanisms.
Keywords: Begomovirus; Classification; Recursive feature elimination; Subsequence natural vector; Support vector machine
Year: 2020 PMID: 32832270 PMCID: PMC7409808 DOI: 10.7717/peerj.9625
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Sizes of the training and test datasets used in the classification tasks.
| | Genus classification | Genome components classification | Monopartite or bipartite DNA-A classification |
|---|---|---|---|
| Training dataset | 976 | 491 | 349 |
| Test dataset | 419 | 212 | 120 |
Figure 1. Flowchart of the classification model.
Figure 2. Flowchart of the gene prediction model.
The performance of the classification tasks by SNV-SVM, Random Forest, Naive Bayes and Blastn.
| Task | Method | Accuracy | Precision | Recall | F-measure | AUC | Time (s) |
|---|---|---|---|---|---|---|---|
| Genus classification | SNV-SVM | 0.990 | 0.987 | 0.992 | 0.990 | 0.998 | 0.553 |
| | Random Forest | 0.980 | 0.974 | 0.988 | 0.981 | 0.997 | 3.158 |
| | Naive Bayes | 0.867 | 0.871 | 0.896 | 0.884 | 0.923 | 1.717 |
| | Blastn | 0.990 | 0.990 | 0.990 | 0.990 | 0.995 | 5.718 |
| Genome component classification | SNV-SVM | 0.995 | 0.995 | 0.995 | 0.995 | 0.998 | 1.148 |
| | Random Forest | 0.995 | 0.996 | 0.996 | 0.996 | 0.996 | 2.765 |
| | Naive Bayes | 0.982 | 0.976 | 0.988 | 0.981 | 0.995 | 2.050 |
| | Blastn | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 3.745 |
| Mono- or bipartite | SNV-SVM | 0.987 | 0.985 | 0.991 | 0.988 | 0.998 | 0.293 |
| | Random Forest | 0.952 | 0.947 | 0.953 | 0.952 | 0.980 | 1.862 |
| | Naive Bayes | 0.977 | 0.967 | 0.987 | 0.977 | 0.973 | 0.219 |
| | Blastn | 0.818 | 0.750 | 0.844 | 0.794 | 0.834 | 1.050 |
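As a reading aid for the performance table above, the F-measure column can be cross-checked against the precision and recall columns. The sketch below assumes the F-measure is the usual harmonic mean of precision and recall (F1); the record does not state the exact formula or averaging scheme, so small differences in the third decimal are to be expected from rounding or per-class averaging. The function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1); assumed definition."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) copied from the table above.
rows = {
    "Genus / SNV-SVM": (0.987, 0.992, 0.990),
    "Genus / Naive Bayes": (0.871, 0.896, 0.884),
    "Genome component / Random Forest": (0.996, 0.996, 0.996),
    "Mono- or bipartite / Blastn": (0.750, 0.844, 0.794),
}

for name, (p, r, reported) in rows.items():
    computed = f_measure(p, r)
    # Print both values so any discrepancy is visible at a glance.
    print(f"{name}: computed {computed:.3f}, reported {reported:.3f}")
```

Under this assumption the recomputed values stay within a few thousandths of the reported ones, which is consistent with the table having been rounded to three decimals.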
Figure 3. The confusion matrices and ROC curves of genus, genomic components and monopartite or bipartite DNA-A classification by SVM.
(A) The confusion matrix of genus classification; (B) the ROC curves of genus classification; (C) the confusion matrix of genomic components classification; (D) the ROC curves of genomic components classification; (E) the confusion matrix of monopartite or bipartite DNA-A classification; (F) the ROC curves of monopartite or bipartite DNA-A classification.
Figure 4. Two-dimensional projection of the convex hulls of different genes of different DNA types by Linear Discriminant Analysis.
The X-axis and Y-axis are the two projection directions. (A) Two-dimensional projection of the convex hulls of different genes of DNA-A; (B) two-dimensional projection of the convex hulls of different genes of DNA-B.
Results of the t-test on GC content.
| | Monopartite | Bipartite | T-statistic | P-value |
|---|---|---|---|---|
| GC content (2nd segment) | 0.430(±0.044) | 0.474(±0.047) | −7.578 | 0.000 |
| GC content (10th segment) | 0.493(±0.035) | 0.479(±0.037) | 2.962 | 0.003 |
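The table above compares GC content within individual sequence segments. A minimal sketch of how a per-segment GC content could be computed is shown below. It assumes the genome is cut into a fixed number of contiguous, near-equal-length segments; the record does not state how the segments are defined or how many there are, so `segment_gc_content` and `n_segments` are hypothetical names chosen for illustration.

```python
def segment_gc_content(seq: str, n_segments: int) -> list[float]:
    """GC fraction of each of n_segments contiguous, near-equal pieces of seq.

    The equal-length segmentation is an assumption made for illustration.
    """
    seq = seq.upper()
    if n_segments <= 0 or n_segments > len(seq):
        raise ValueError("n_segments must be between 1 and len(seq)")
    # Integer boundaries; every segment has at least one base.
    bounds = [i * len(seq) // n_segments for i in range(n_segments + 1)]
    fractions = []
    for start, end in zip(bounds, bounds[1:]):
        segment = seq[start:end]
        gc_count = sum(base in "GC" for base in segment)
        fractions.append(gc_count / len(segment))
    return fractions


# Toy sequence, not real begomovirus data.
example = "ATGCATGCGGCCAATT"
print(segment_gc_content(example, 4))
```

Applying such a function to every DNA-A sequence would give one GC value per genome for a chosen segment, and those values could then be compared between the monopartite and bipartite groups with a two-sample t-test.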
Figure 5. Genomic regions, transcripts, and products of a monopartite begomovirus (Tomato yellow leaf curl virus, NC_004005).