Feng Gao, Hong-Yu Ou, Ling-Ling Chen, Wen-Xin Zheng, Chun-Ting Zhang.
Abstract
Recently, we developed a coronavirus-specific gene-finding system, ZCURVE_CoV 1.0. In this paper, the system is further improved by taking into account the prediction of cleavage sites of viral proteinases in polyproteins. The cleavage sites of the 3C-like proteinase and papain-like proteinase are highly conserved. In addition to a traditional positional weight matrix trained on the peptides around cleavage sites, the present method also considers the length conservation of the non-structural proteins cleaved by the 3C-like proteinase and papain-like proteinase, which reduces the false positive prediction rate. The improved system, ZCURVE_CoV 2.0, has been run on each of the 24 completely sequenced coronavirus genomes in GenBank. Consequently, all the non-structural proteins in the 24 genomes are accurately predicted. Compared with known annotations, the performance of the present method is satisfactory. The software ZCURVE_CoV 2.0 is freely available at http://tubic.tju.edu.cn/sars/.
Year: 2003 PMID: 14572668 PMCID: PMC7232748 DOI: 10.1016/s0014-5793(03)01091-3
Source DB: PubMed Journal: FEBS Lett ISSN: 0014-5793 Impact factor: 4.124
Lengths (aa) of the 11 non-structural proteins cleaved by the 3C-like proteinase
| Genome | nsp2 | nsp3 | nsp4 | nsp5 | nsp6 | nsp7 | nsp9 | nsp10 | nsp11 | nsp12 | nsp13 |
| TOR2 | 306 | 290 | 83 | 198 | 113 | 139 | 932 | 601 | 527 | 346 | 298 |
| HCoV-229E | 302 | 279 | 83 | 195 | 109 | 135 | 927 | 597 | 518 | 348 | 300 |
| MHV | 303 | 287 | 92 | 194 | 110 | 137 | 928 | 600 | 521 | 374 | 299 |
| BCoV | 303 | 287 | 89 | 197 | 110 | 137 | 928 | 603 | 521 | 374 | 299 |
| IBV | 307 | 293 | 83 | 210 | 111 | 145 | 940 | 600 | 521 | 338 | 302 |
| TGEV | 302 | 294 | 83 | 195 | 111 | 135 | 929 | 599 | 519 | 339 | 300 |
| PEDV | 302 | 280 | 83 | 195 | 108 | 135 | 927 | 597 | 517 | 339 | 301 |
| Average length | 304 | 287 | 85 | 198 | 110 | 138 | 930 | 600 | 521 | 351 | 300 |
| Standard deviation | 2.07 | 5.87 | 3.76 | 5.59 | 1.60 | 3.60 | 4.67 | 2.15 | 3.26 | 16.07 | 1.35 |
These proteins are cleaved by the 3C-like proteinase within polyprotein 1ab derived from the seven coronavirus genomes annotated by NCBI.
The cleavage sites have been confirmed by experimental evidence in these genomes.
The genomes that have the maximum lengths for nsp2–13 except nsp8 are IBV, TGEV, MHV, IBV, TOR2, IBV, IBV, BCoV, TOR2, MHV (BCoV) and IBV, respectively. The genomes that have the minimum lengths for nsp2–13 except nsp8 are HCoV-229E (TGEV, PEDV), HCoV-229E, TOR2 (HCoV-229E, IBV, TGEV, PEDV), MHV, PEDV, HCoV-229E (TGEV, PEDV), HCoV-229E (PEDV), HCoV-229E (PEDV), PEDV, IBV and TOR2, respectively.
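One way the length conservation tabulated above could be used to discard implausible cleavage sites is sketched here. The record does not spell out the exact rule used by ZCURVE_CoV 2.0, so the minimum/maximum window, the `slack` parameter, and the function names are all assumptions made for illustration only.

```python
# Observed nsp2 and nsp3 lengths (aa) for the seven genomes in the table above.
OBSERVED = {
    "nsp2": [306, 302, 303, 303, 307, 302, 302],
    "nsp3": [290, 279, 287, 287, 293, 294, 280],
}

def length_window(lengths, slack=0):
    """Return (shortest - slack, longest + slack) for one product.

    The window rule and the slack parameter are assumptions of this sketch.
    """
    return min(lengths) - slack, max(lengths) + slack

def filter_candidates(previous_site, candidates, lengths, slack=0):
    """Keep candidate sites whose implied product length is plausible.

    previous_site: index of the P1 residue of the preceding cleavage site.
    candidates:    P1 residue indices proposed by some site scorer
                   (hypothetical input, e.g. from a weight matrix).
    """
    low, high = length_window(lengths, slack)
    return [c for c in candidates if low <= c - previous_site <= high]

if __name__ == "__main__":
    # TGEV nsp2 spans residues 2879-3180, so the preceding P1 residue is 2878.
    print(filter_candidates(2878, [3000, 3180, 3400], OBSERVED["nsp2"]))
```

With the TGEV coordinates used in the example, only the candidate at residue 3180 implies an nsp2 length (302 aa) inside the observed range of 302 to 307 aa.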
Lengths (aa) of the non-structural proteins cleaved by the papain-like proteinase
| Genome | PCP CP1 | PCP CP2 | PCP CP3 | PCP CP4 |
| IBV | – | 673 | 2106 | – |
| TOR2 | 179 | 639 | 2422 | – |
| MHV | 247 | 585 | 2501 | – |
| BCoV | 246 | 605 | 1899 | 496 |
| HCoV-229E | 111 | 786 | 1587 | 481 |
| TGEV | 108 | 771 | 1509 | 490 |
| PEDV | 110 | 785 | 1622 | 480 |
| Average length | 144 | 737 | 1654 | 487 |
| Standard deviation | 68.18 | 88.10 | 169.87 | 7.63 |
These proteins are cleaved by the papain-like proteinase within polyprotein 1ab derived from the seven coronavirus genomes annotated by NCBI.
These cleavage products have been confirmed by experimental evidence.
The average length and standard deviation are calculated based on the genomes of BCoV, HCoV-229E, TGEV and PEDV.
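The summary rows of the table above can be reproduced from the four genomes named in the note. The following is a minimal sketch, assuming the standard deviation is the sample (n − 1) form, which matches the tabulated values; the dictionary and function names are hypothetical.

```python
from statistics import mean, stdev

# Lengths (aa) copied from the table above for the four genomes
# that have all four papain-like proteinase cleavage products.
PCP_LENGTHS = {
    "BCoV":      {"PCP CP1": 246, "PCP CP2": 605, "PCP CP3": 1899, "PCP CP4": 496},
    "HCoV-229E": {"PCP CP1": 111, "PCP CP2": 786, "PCP CP3": 1587, "PCP CP4": 481},
    "TGEV":      {"PCP CP1": 108, "PCP CP2": 771, "PCP CP3": 1509, "PCP CP4": 490},
    "PEDV":      {"PCP CP1": 110, "PCP CP2": 785, "PCP CP3": 1622, "PCP CP4": 480},
}

def summarize(lengths_by_genome):
    """Return {product: (mean, sample standard deviation)}."""
    products = next(iter(lengths_by_genome.values())).keys()
    summary = {}
    for product in products:
        values = [row[product] for row in lengths_by_genome.values()]
        summary[product] = (mean(values), stdev(values))
    return summary

if __name__ == "__main__":
    for product, (avg, sd) in summarize(PCP_LENGTHS).items():
        print(f"{product}: average {avg:.0f} aa, standard deviation {sd:.2f}")
```

IBV, TOR2 and MHV are left out because the table lists no PCP CP4 for them, so their PCP CP3 lengths are not comparable with those of the other four genomes.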
Fig. 1. Schematic comparison of the N-terminal sequences of polyprotein 1ab in MHV and BCoV. The additional cleavage site that the present method predicts within the annotated PCP CP3 of MHV lies at the position corresponding to the cleavage between PCP CP3 and PCP CP4 in BCoV. Cleavage sites annotated by NCBI are indicated by black arrows; the cleavage site predicted by the present method is indicated by an open arrow.
Fig. 2. Conservation of the sites cleaved by coronavirus proteinases. Two separate gap-free multiple alignments around the P1|P1′ positions of the sites cleaved by the 3C-like proteinase (A) and the papain-like proteinase (B) in the training set are converted to logo presentations, in which the size of an amino acid is proportional to its conservation at the specific position and to the sampling size. Amino acid conservation is measured in bits of information, plotted on a vertical axis whose upper limit is determined by the natural diversity of amino acids (20), expressed as a base-2 logarithm [16]. Seventy-seven sites cleaved by the 3C-like proteinase were used to generate the logo in A, and 17 sites cleaved by the papain-like proteinase were used to generate the logo in B.
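The conservation measure in the caption can be illustrated with a short calculation. The sketch below computes the per-position information content as log2(20) minus the Shannon entropy of each alignment column. It is an illustration only: no small-sample correction is applied, the function name is hypothetical, and the example peptides are five 3C-like sites taken from the TGEV rows of the comparison table in this record, not the 77-site training set.

```python
import math
from collections import Counter

MAX_BITS = math.log2(20)  # upper limit for 20 amino acids, about 4.32 bits

def information_content(peptides):
    """Per-column information (bits) for gap-free, equal-length peptides.

    Uses R = log2(20) - H, where H is the Shannon entropy of the column.
    No small-sample correction is applied (an assumption of this sketch).
    """
    if len({len(p) for p in peptides}) != 1:
        raise ValueError("peptides must all have the same length")
    bits = []
    for column in zip(*peptides):
        counts = Counter(column)
        total = len(column)
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in counts.values())
        bits.append(MAX_BITS - entropy)
    return bits

# P4..P1 and P1'..P4' residues of five 3C-like sites (cleavage mark removed).
SITES = ["STLQSGLR", "VNLQAGKV", "STVQSKLT", "TILQSVAS", "TKLQNNEI"]
POSITIONS = ["P4", "P3", "P2", "P1", "P1'", "P2'", "P3'", "P4'"]

if __name__ == "__main__":
    for position, value in zip(POSITIONS, information_content(SITES)):
        print(f"{position}: {value:.2f} bits")
```

In this small sample the P1 column contains only glutamine, so it reaches the full log2(20) bits, while variable columns score lower.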
Comparison of the predicted results for TGEV and PEDV with those annotated by NCBI
| Number | Genome | Start (bp) | Stop (bp) | Start (aa) | Stop (aa) | Length (aa) | Cleavable peptide | Feature |
| 1 | TGEV | 315 | 638 | 1 | 108 | 108 | – | PCP CP1 |
| | | 639 | 2 951 | 109 | 879 | 771 | KIARTG|RGAIYV | PCP CP2 |
| | | 2 952 | 7 478 | 880 | 2 388 | 1 509 | YNKMGG|GDKTVS | PCP CP3 |
| | | 7 479 | 8 948 | 2 389 | 2 878 | 490 | VSPKSG|SGFFDV | PCP CP4 |
| | | 8 949 | 9 854 | 2 879 | 3 180 | 302 | STLQ|SGLR | nsp2 |
| | | 9 855 | 10 736 | 3 181 | 3 474 | 294 | VNLQ|AGKV | nsp3 |
| | | 10 737 | 10 985 | 3 475 | 3 557 | 83 | STVQ|SKLT | nsp4 |
| | | 10 986 | 11 570 | 3 558 | 3 752 | 195 | TILQ|SVAS | nsp5 |
| | | 11 571 | 11 903 | 3 753 | 3 863 | 111 | TKLQ|NNEI | nsp6 |
| | | 11 904 | 12 308 | 3 864 | 3 998 | 135 | VRLQ|AGKP | nsp7 |
| | | 12 309 | 15 094 | 3 999 | 4 927 | 929 | TSMQ|SFTV | nsp9 |
| | | 15 095 | 16 891 | 4 928 | 5 526 | 599 | TVLQ|AAGM | nsp10 |
| | | 16 892 | 18 448 | 5 527 | 6 045 | 519 | IGLQ|AKPE | nsp11 |
| | | 18 449 | 19 465 | 6 046 | 6 384 | 339 | KALQ|SLEN | nsp12 |
| | | 19 466 | 20 365 | 6 385 | 6 684 | 300 | PQLQ|SAEW | nsp13 |
| 2 | PEDV | 297 | 626 | 1 | 110 | 110 | – | PCP CP1 |
| | | 627 | 2 981 | 111 | 895 | 785 | FGRRGG|NIVPVD | PCP CP2 |
| | | 2 982 | 7 847 | 896 | 2 517 | 1 622 | FKKKGG|GDVKFS | PCP CP3 |
| | | 7 848 | 9 287 | 2 518 | 2 997 | 480 | ANKKGA|GLPSFS | PCP CP4 |
| | | 9 288 | 10 193 | 2 998 | 3 299 | 302 | STLQ|AGLR | nsp2 |
| | | 10 194 | 11 033 | 3 300 | 3 579 | 280 | VNLQ|GGYV | nsp3 |
| | | 11 034 | 11 282 | 3 580 | 3 662 | 83 | SSVQ|SKLT | nsp4 |
| | | 11 283 | 11 867 | 3 663 | 3 857 | 195 | SMLQ|SVAS | nsp5 |
| | | 11 868 | 12 192 | 3 858 | 3 965 | 108 | VKLQ|NNEI | nsp6 |
| | | 12 191 | 12 596 | 3 966 | 4 100 | 135 | VRLQ|AGKQ | nsp7 |
| | | 12 597 | 15 376 | 4 101 | 5 027 | 927 | SIMQ|STDM | nsp9 |
| | | 15 377 | 17 167 | 5 028 | 5 624 | 597 | AVLQ|SAGL | nsp10 |
| | | 17 168 | 18 718 | 5 625 | 6 141 | 517 | SDLQ|ANEG | nsp11 |
| | | 18 719 | 19 735 | 6 142 | 6 480 | 339 | NNLQ|GLEN | nsp12 |
| | | 19 736 | 20 638 | 6 481 | 6 781 | 301 | PQLQ|ASEW | nsp13 |
Of the 24 coronavirus genomes, the results predicted by ZCURVE_CoV 2.0 agree completely with the NCBI annotations for all but TGEV and PEDV. The reasons for the conflicts in these two genomes are analyzed in this table.
This conflict with the annotation is caused by a problematic annotation.
The locations differ from the annotation because of a questionable additional insertion of an amino acid residue in nsp9.
This conflict with the annotation is caused by a non-standard frameshift.
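Because location conflicts of this kind come down to coordinate arithmetic, a quick consistency check on the tabulated coordinates may be useful. The sketch below assumes that a product read in a single frame occupies exactly three nucleotides per residue and that neighbouring products are contiguous. A product that spans a ribosomal frameshift would not satisfy the first rule, so only the first five TGEV rows are used. The class and function names are hypothetical.

```python
from typing import NamedTuple

class Product(NamedTuple):
    feature: str
    bp_start: int
    bp_stop: int
    aa_start: int
    aa_stop: int
    length_aa: int

# First five rows of the TGEV entry in the table above.
TGEV_ROWS = [
    Product("PCP CP1", 315, 638, 1, 108, 108),
    Product("PCP CP2", 639, 2951, 109, 879, 771),
    Product("PCP CP3", 2952, 7478, 880, 2388, 1509),
    Product("PCP CP4", 7479, 8948, 2389, 2878, 490),
    Product("nsp2", 8949, 9854, 2879, 3180, 302),
]

def is_consistent(p):
    """True if the aa span equals the length and the bp span is three times it."""
    aa_ok = p.aa_stop - p.aa_start + 1 == p.length_aa
    bp_ok = p.bp_stop - p.bp_start + 1 == 3 * p.length_aa
    return aa_ok and bp_ok

def is_contiguous(rows):
    """True if each product starts right after the previous one ends."""
    return all(b.bp_start == a.bp_stop + 1 and b.aa_start == a.aa_stop + 1
               for a, b in zip(rows, rows[1:]))

if __name__ == "__main__":
    for row in TGEV_ROWS:
        print(row.feature, "consistent" if is_consistent(row) else "check")
    print("contiguous:", is_contiguous(TGEV_ROWS))
```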
Results predicted by the present method for BCoVL and SARS-CoV BJ01
| Number | Genome | Start (bp) | Stop (bp) | Start (aa) | Stop (aa) | Length (aa) | Cleavable peptide | Feature |
| 1 | BCoVL | 211 | 948 | 1 | 246 | 246 | – | PCP CP1 |
| | | 949 | 2 763 | 247 | 851 | 605 | IRGYRG|VKPLLY | PCP CP2 |
| | | 2 764 | 8 460 | 852 | 2 750 | 1 899 | WRVPCA|GRRVTF | PCP CP3 |
| | | 8 461 | 9 948 | 2 751 | 3 246 | 496 | FSLKGG|AVFSYF | PCP CP4 |
| | | 9 949 | 10 857 | 3 247 | 3 549 | 303 | SFLQ|SGIV | nsp2 |
| | | 10 858 | 11 718 | 3 550 | 3 836 | 287 | IKLQ|SKRT | nsp3 |
| | | 11 719 | 11 985 | 3 837 | 3 925 | 89 | SQFQ|SKLT | nsp4 |
| | | 11 986 | 12 576 | 3 926 | 4 122 | 197 | TVLQ|ALQS | nsp5 |
| | | 12 577 | 12 906 | 4 123 | 4 232 | 110 | TVLQ|NNEL | nsp6 |
| | | 12 907 | 13 317 | 4 233 | 4 369 | 137 | VRLQ|AGTA | nsp7 |
| | | 13 318 | 16 100 | 4 370 | 5 297 | 928 | TTVQ|SKDT | nsp9 |
| | | 16 101 | 17 909 | 5 298 | 5 900 | 603 | AVMQ|SVGA | nsp10 |
| | | 17 910 | 19 472 | 5 901 | 6 421 | 521 | TRVQ|CSTN | nsp11 |
| | | 19 473 | 20 594 | 6 422 | 6 795 | 374 | TKLQ|SLEN | nsp12 |
| | | 20 595 | 21 491 | 6 796 | 7 094 | 299 | PRLQ|AASD | nsp13 |
| 2 | BJ01 | 246 | 782 | 1 | 179 | 179 | – | PCP CP1 |
| | | 783 | 2 699 | 180 | 818 | 639 | TRELNG|GAVTRY | PCP CP2 |
| | | 2 700 | 8 465 | 819 | 2 740 | 1 922 | FRLKGG|APIKGV | PCP CP3 |
| | | 8 466 | 9 965 | 2 741 | 3 240 | 500 | ISLKGG|KIVSTC | PCP CP4 |
| | | 9 966 | 10 883 | 3 241 | 3 546 | 306 | AVLQ|SGFR | nsp2 |
| | | 10 884 | 11 753 | 3 547 | 3 836 | 290 | VTFQ|GKFK | nsp3 |
| | | 11 754 | 12 002 | 3 837 | 3 919 | 83 | ATVQ|SKMS | nsp4 |
| | | 12 003 | 12 596 | 3 920 | 4 117 | 198 | ATLQ|AIAS | nsp5 |
| | | 12 597 | 12 935 | 4 118 | 4 230 | 113 | VKLQ|NNEL | nsp6 |
| | | 12 936 | 13 352 | 4 231 | 4 369 | 139 | VRLQ|AGNA | nsp7 |
| | | 13 353 | 16 147 | 4 370 | 5 301 | 932 | PLMQ|SADA | nsp9 |
| | | 16 148 | 17 950 | 5 302 | 5 902 | 601 | TVLQ|AVGA | nsp10 |
| | | 17 951 | 19 531 | 5 903 | 6 429 | 527 | ATLQ|AENV | nsp11 |
| | | 19 532 | 20 569 | 6 430 | 6 775 | 346 | TRLQ|SLEN | nsp12 |
| | | 20 570 | 21 463 | 6 776 | 7 073 | 298 | PKLQ|ASQA | nsp13 |
The alternative cleavage site predicted by the present method is at QALQ|SEFV (Gln-3928|Ser-3929).
This cleavage site is predicted by the present method in addition to those given in the annotation.