Manyun Guo, Yucheng Ma, Wanyuan Liu, Zuyi Yuan.
Abstract
Nucleocapsid protein (NC), part of the group-specific antigen (gag) of retroviruses, is essential for the interactions of most retroviral gag proteins with RNAs. A computational method to predict NCs would benefit subsequent structural analysis and functional studies. However, no computational method to predict the exact locations of NCs in retroviruses has been proposed so far, and the wide length variation of NCs makes the task harder. In this paper, a computational method to identify NCs in retroviruses is proposed. All available retrovirus sequences with NC annotations were collected from NCBI. Models based on random forest (RF) and weighted support vector machine (WSVM) were built to predict the initiation and termination sites of NCs. Factor analysis scales of generalized amino acid information (FASGAI), along with a position weight matrix, were used to generate the feature space. Homology-based gene prediction methods were also compared and integrated to improve prediction performance. Predicted candidate initiation and termination sites were then combined and screened according to their intervals, decision values and alignment scores. All available gag sequences without NC annotations were scanned with the model to detect putative NCs. The geometric means of sensitivity and specificity for the prediction of initiation and termination sites under fivefold cross-validation are 0.9900 and 0.9548, respectively. Of all the collected retrovirus sequences with NC annotations, 90.91% could be predicted entirely correctly by the model combining WSVM, RF and simple alignment. The composite model performs better than the single-method ones. The model detected 235 putative NCs in unannotated gags. Our prediction method performs well on NC recognition and could also be extended to other gene prediction problems, especially those whose training samples have large length variations.
Year: 2022 PMID: 35017554 PMCID: PMC8752852 DOI: 10.1038/s41598-021-03182-2
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
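The abstract states that candidate initiation and termination sites are combined and screened by their intervals and decision values. The following is a minimal sketch of that idea, not the authors' implementation. The function name `pair_boundaries`, the length bounds `min_len` and `max_len`, and the rule of ranking pairs by the sum of the two decision values are all assumptions made for illustration; the alignment-score criterion is omitted.

```python
from typing import List, Optional, Tuple

# A candidate boundary: (residue position, classifier decision value).
Candidate = Tuple[int, float]


def pair_boundaries(
    starts: List[Candidate],
    ends: List[Candidate],
    min_len: int,
    max_len: int,
) -> Optional[Tuple[int, int]]:
    """Pick the (start, end) pair with the highest combined decision value.

    Only pairs whose interval length lies in [min_len, max_len] are kept.
    The bounds and the additive score are hypothetical choices.
    """
    best_pair = None
    best_score = float("-inf")
    for s_pos, s_val in starts:
        for e_pos, e_val in ends:
            length = e_pos - s_pos + 1
            if not (min_len <= length <= max_len):
                continue  # screened out by interval length
            score = s_val + e_val
            if score > best_score:
                best_score = score
                best_pair = (s_pos, e_pos)
    return best_pair


# Made-up candidates, for illustration only.
starts = [(10, 0.90), (40, 0.95)]
ends = [(60, 0.80), (200, 0.99)]
best = pair_boundaries(starts, ends, min_len=30, max_len=100)
```

Screening on interval length first is one way to handle the wide length variation of NCs mentioned in the abstract, since it removes pairs that no plausible NC could span before any scores are compared.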
Predicting performance of models applying WSVM & RF on initiation and termination sites of NCs.
| NC Boundary Type | Algorithm | Sn | Sp | G-mean | Accuracy | MCC |
|---|---|---|---|---|---|---|
| NC Initiation site | WSVM | 0.9869 | 0.9932 | 0.9900 | 0.9922 | 0.9735 |
| NC Initiation site | RF | 1.0000 | 0.9974 | 0.9986 | 0.9978 | 0.9923 |
| NC Termination site | WSVM | 0.9227 | 0.9881 | 0.9548 | 0.9771 | 0.9179 |
| NC Termination site | RF | 0.9481 | 1.0000 | 0.9737 | 0.9913 | 0.9687 |
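The columns of the table above (Sn, Sp, G-mean, Accuracy, MCC) can be computed from a binary confusion matrix. The sketch below assumes the standard definitions of these metrics, which the page itself does not spell out; the function names and the example counts are hypothetical.

```python
import math


def g_mean(sn: float, sp: float) -> float:
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(sn * sp)


def boundary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard binary classification metrics from confusion counts."""
    sn = tp / (tp + fn)  # sensitivity: true sites recovered
    sp = tn / (tn + fp)  # specificity: non-sites rejected
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a marginal is zero; return 0.0 by convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sn": sn, "sp": sp, "g_mean": g_mean(sn, sp), "acc": acc, "mcc": mcc}


# Made-up counts, for illustration only.
example = boundary_metrics(tp=90, fn=10, tn=180, fp=20)

# Consistency check against the WSVM rows of the table above.
wsvm_init = round(g_mean(0.9869, 0.9932), 4)
wsvm_term = round(g_mean(0.9227, 0.9881), 4)
```

Applied to the WSVM rows, the Sn and Sp values reproduce the reported G-means of 0.9900 and 0.9548. The G-mean is a sensible headline metric here because true boundary sites are far rarer than non-sites, so accuracy alone would be dominated by the negative class.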
Predicting performance of different methods on NCs.
| Prediction method | NC sample set | Initiation sites correct (n) | Initiation accuracy | Termination sites correct (n) | Termination accuracy | NCs correct (n) | NC accuracy |
|---|---|---|---|---|---|---|---|
|  | All 77 | 56 | 72.73% | 66 | 85.71% | 47 | 61.04% |
| SA | All 77 | 61 | 79.22% | 65 | 84.42% | 51 | 66.23% |
| WSVM&RF | All 77 | 61 | 79.22% | 61 | 79.22% | 53 | 68.83% |
| SA + | All 77 | 64 | 83.12% | 65 | 84.42% | 54 | 70.13% |
| WSVM&RF + SA | All 77 | 65 | 84.42% | 64 | 83.12% | 54 | 70.13% |
| WSVM&RF + | All 77 | 66 | 85.71% | 65 | 84.42% | 56 | 72.73% |
| WSVM&RF + SA + | All 77 | 67 | 87.01% | 65 | 84.42% | 57 | 74.03% |
| WSVM&RF + SA | 66 with large PM | 59 | 89.40% | 59 | 89.40% | 52 | 78.79% |
Figure 1. Motifs of residues adjacent to the boundaries of NCs in ERV sequences: residues surrounding (A) NC initiation sites and (B) NC termination sites.
Figure 2. The feature importance of the 6 factors of FASGAI.
Figure 3. The model summary of the CNN model.
Predicting performance of WSVM&RF + SA method with different window lengths.
| Window length | Initiation sites correct (n) | Initiation accuracy | Termination sites correct (n) | Termination accuracy | NCs correct (n) | NC accuracy |
|---|---|---|---|---|---|---|
| 1 | 0 | 0% | 1 | 1.30% | 0 | 0% |
| 2 | 10 | 12.99% | 11 | 14.29% | 0 | 0% |
| 3 | 39 | 50.65% | 43 | 55.84% | 32 | 41.56% |
| 4 | 58 | 75.32% | 64 | 83.12% | 49 | 63.64% |
| 5 | 43 | 55.84% | 64 | 83.12% | 35 | 45.45% |
| 6 | 58 | 75.32% | 64 | 83.12% | 50 | 64.94% |
| 7 | 58 | 75.32% | 64 | 83.12% | 49 | 63.64% |
| 8 | 58 | 75.32% | 64 | 83.12% | 49 | 63.64% |
| 9 | 55 | 71.43% | 61 | 79.22% | 42 | 54.55% |
| 10 | 62 | 80.52% | 61 | 79.22% | 50 | 64.94% |
| 11 | 63 | 81.82% | 63 | 81.82% | 53 | 68.83% |
| 12 | 58 | 75.32% | 63 | 81.82% | 47 | 61.04% |
| 13 | 63 | 81.82% | 60 | 77.92% | 51 | 66.23% |
| 14 | 62 | 80.52% | 63 | 81.82% | 51 | 66.23% |
| 15 | 61 | 79.22% | 65 | 84.42% | 51 | 66.23% |
| 16 | 65 | 84.42% | 64 | 83.12% | 54 | 70.13% |
| 17 | 60 | 77.92% | 65 | 84.42% | 50 | 64.94% |
| 18 | 64 | 83.12% | 64 | 83.12% | 53 | 68.83% |
| 19 | 61 | 79.22% | 65 | 84.42% | 51 | 66.23% |
| 20 | 62 | 80.52% | 65 | 84.42% | 52 | 67.53% |
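The window length in the table above controls how many residues around a candidate site are used as features. A minimal sketch of window extraction and position weight matrix (PWM) scoring follows. It is an illustration under stated assumptions: the window is taken symmetrically with `half_width` residues on each side, sequence ends are padded with `X`, the background is uniform over 20 amino acids, and a pseudocount of 1 is used. The page specifies none of these choices, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def extract_window(seq: str, site: int, half_width: int, pad: str = "X") -> str:
    """Residues from site - half_width to site + half_width (0-based site).

    Positions outside the sequence are filled with the pad character.
    """
    return "".join(
        seq[i] if 0 <= i < len(seq) else pad
        for i in range(site - half_width, site + half_width + 1)
    )


def build_pwm(windows: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Log-odds PWM (log2) from equal-length windows, uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    pwm = []
    for col in range(len(windows[0])):
        # Pad characters are ignored when counting.
        counts = Counter(w[col] for w in windows if w[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pwm.append(
            {
                aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                for aa in AMINO_ACIDS
            }
        )
    return pwm


def score_window(pwm: List[Dict[str, float]], window: str) -> float:
    """Sum of per-position log-odds; unknown or pad residues contribute 0."""
    return sum(col.get(aa, 0.0) for col, aa in zip(pwm, window))


# Toy training windows, for illustration only.
pwm = build_pwm(["AAA", "AAC"])
score = score_window(pwm, extract_window("GAAAG", site=2, half_width=1))
```

Very short windows carry too little context to separate true boundaries from background, which is consistent with the near-zero accuracies in the first rows of the table.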
Predicting performance of models applying CNN on initiation and termination sites of NCs.
| NC Boundary Type | Fold Number | Sn | Sp | G-mean |
|---|---|---|---|---|
| NC Initiation site | fivefold | 0.6195 | 0.9717 | 0.7759 |
| NC Initiation site | tenfold | 0.6522 | 0.9413 | 0.7835 |
| NC Initiation site | leave-one-out | 0.6957 | 0.9587 | 0.8167 |
| NC Termination site | fivefold | 0.5978 | 0.9348 | 0.7475 |
| NC Termination site | tenfold | 0.6304 | 0.9370 | 0.7686 |
| NC Termination site | leave-one-out | 0.6739 | 0.9565 | 0.8029 |
Figure 4. The evolutionary relationship of NCs in retroviruses. The leaf nodes denote annotated NCs in the benchmark dataset, and the edge lengths describe the phylogenetic relationship between these nodes.