Prediction of proteinase cleavage sites in polyproteins of coronaviruses and its applications in analyzing SARS-CoV genomes.

Feng Gao, Hong-Yu Ou, Ling-Ling Chen, Wen-Xin Zheng, Chun-Ting Zhang.

Abstract

Recently, we have developed a coronavirus-specific gene-finding system, ZCURVE_CoV 1.0. In this paper, the system is further improved by taking the prediction of cleavage sites of viral proteinases in polyproteins into account. The cleavage sites of the 3C-like proteinase and papain-like proteinase are highly conserved. Based on the method of traditional positional weight matrix trained by the peptides around cleavage sites, the present method also sufficiently considers the length conservation of non-structural proteins cleaved by the 3C-like proteinase and papain-like proteinase to reduce the false positive prediction rate. The improved system, ZCURVE_CoV 2.0, has been run for each of the 24 completely sequenced coronavirus genomes in GenBank. Consequently, all the non-structural proteins in the 24 genomes are accurately predicted. Compared with known annotations, the performance of the present method is satisfactory. The software ZCURVE_CoV 2.0 is freely available at http://tubic.tju.edu.cn/sars/.

Year:  2003        PMID: 14572668      PMCID: PMC7232748          DOI: 10.1016/s0014-5793(03)01091-3

Source DB:  PubMed          Journal:  FEBS Lett        ISSN: 0014-5793            Impact factor:   4.124


Introduction

Due to the severity of a life-threatening disease, referred to as severe acute respiratory syndrome (SARS), the World Health Organization (WHO) has issued a global alert for the illness. SARS apparently began in Guangdong province of China in November 2002, and has spread to Hong Kong, Singapore, Vietnam, Canada, the USA and several European countries [1], [2], [3], [4], [5], [6]. By early June 2003, more than 700 SARS-related deaths were recorded by WHO (http://www.who.int/csr/sars/country/en/). A novel coronavirus, called SARS-coronavirus or SARS-CoV, has been proved to be the cause of SARS. The coronaviruses (order Nidovirales, family Coronaviridae, genus Coronavirus) are members of a family of large, enveloped, positive-stranded RNA viruses that replicate in the cytoplasm of animal host cells [7]. There are three groups of coronaviruses; groups I and II contain mammalian viruses, while group III contains only avian viruses. The viruses are associated with a variety of diseases in humans and domestic animals, including gastroenteritis and diseases of the upper and lower respiratory tract. Many researchers have analyzed the phylogeny of SARS-CoV and concluded that it is not closely related to any of the previously characterized coronaviruses and forms a distinct group (group IV) within the genus Coronavirus [7], [8]. At the time this paper was written, there were 12 strains of SARS-CoV complete genome sequences available from GenBank [7], [8], [9]. Among these genomes, six have been annotated manually, and the remaining six have not been annotated yet. The genomic organization of SARS-CoV is that of a typical coronavirus, with the order of the characteristic genes being replicase [rep], spike [S], envelope [E], membrane [M], nucleocapsid [N] from the 5′ to the 3′ terminus. SARS-CoV also encodes a number of non-structural proteins located between S and E, between M and N, or downstream of N with unknown functions. 
We have developed a coronavirus-specific gene-finding system ZCURVE_CoV 1.0 [10], which is especially suitable for gene recognition in SARS-CoV genomes. The software has the advantages of simplicity, reliability, high accuracy and quickness and can be obtained freely at the website http://tubic.tju.edu.cn/sars/. The system ZCURVE_CoV 1.0 has been run for each of the 12 SARS-CoV genomes. In addition to the polyprotein chains Orf1a and Orf1b and the four genes encoding the major structural proteins, S, E, M and N, respectively, ZCURVE_CoV 1.0 also predicts five to six putative proteins between 39 and 274 amino acids in length, with unknown functions in SARS-CoV genomes. However, the cleavage sites of viral proteinase in replicases are not predicted in ZCURVE_CoV 1.0. The coronavirus replicases are encoded by two large, 5′-proximal open reading frames (ORFs) that comprise approximately two-thirds of the genome. Polyproteins ORF1a and ORF1b are connected by a ribosomal frameshift site, which is believed to occur at the conserved ‘slippery sequence’, UUUAAAC. It results in the translation of an ORF1a protein and a carboxyl-extended ORF1ab frameshift protein, which are also known as replicase polyproteins pp1a and pp1ab [11]. The ORF1a and ORF1ab translation products are polyprotein precursors, which are cleaved by viral proteinases, resulting in a minimum of 13 non-structural proteins, including a 3C-like proteinase, an RNA-dependent RNA polymerase, an ATPase/helicase and other function-unknown non-structural proteins [11]. These proteins in turn are responsible for replicating the viral genome as well as generating nested transcripts that are used in the synthesis of viral proteins. In this paper, all the putative non-structural proteins resulting from the cleavage by viral proteinases in the polyproteins are precisely predicted using ZCURVE_CoV 2.0.

Materials and methods

Seven genomic sequences of coronaviruses and the annotation information were downloaded from the NCBI RefSeq project. These coronaviruses include avian infectious bronchitis virus (IBV) (NC_001451), bovine coronavirus (BCoV) (NC_003045), human coronavirus 229E (HCoV-229E) (NC_002645), murine hepatitis virus (MHV) (NC_001846), porcine epidemic diarrhea virus (PEDV) (NC_003436), SARS coronavirus TOR2 (TOR2) (NC_004718) and transmissible gastroenteritis virus (TGEV) (NC_002306). The above genomes have been annotated by NCBI and the sequences of mature peptides are available. According to the annotation, a total of 77 sites cleaved by the 3C-like proteinase and 17 sites cleaved by the papain-like proteinase were extracted from the above seven genomes. Octapeptides cleaved by the 3C-like proteinase and 12-mer peptides cleaved by the papain-like proteinase were used to train the corresponding positional weight matrix (PWM) [12]. The cleavage site is at the center of the octapeptide or 12-mer peptide. The length distribution of non-structural proteins within ORF1ab was also derived from the annotated genomes. At the time this paper was written, there were 24 complete sequences of coronavirus genomes available in the GenBank database, of which 12 are SARS-CoVs and 12 are other groups of coronaviruses. 
The former comprises SARS-CoV TOR2 (NC_004718), Urbani (AY278741), HKU-39849 (AY278491), CUHK-W1 (AY278554), BJ01 (AY278488), CUHK-Su10 (AY282752), SIN2500 (AY283794), SIN2748 (AY283797), SIN2679 (AY283796), SIN2774 (AY283798), SIN2677 (AY283795) and TW1 (AY291451), whereas the latter comprises IBV (NC_001451), BCoV (NC_003045), bovine coronavirus strain Mebus (BCoVM) (U00735), bovine coronavirus isolate BCoV-LUN (BCoVL) (AF391542), bovine coronavirus strain Quebec (BCoVQ) (AF220295), HCoV-229E (NC_002645), MHV (NC_001846), murine hepatitis virus strain ML-10 (MHVM) (AF208067), murine hepatitis virus strain 2 (MHV2) (AF201929), murine hepatitis virus strain Penn 97-1 (MHVP) (AF208066), PEDV (NC_003436) and TGEV (NC_002306).
The mature peptides cleaved by the 3C-like proteinase are highly conserved in length among different groups of coronaviruses, while those cleaved by the papain-like proteinase are less conserved. The lengths of all the non-structural proteins cleaved by the 3C-like proteinase within polyprotein 1ab are listed in Table 1, while the lengths of the non-structural proteins cleaved by the papain-like proteinase are listed in Table 2. The average length and standard deviation were calculated for each non-structural protein. As shown in Tables 1 and 2, the lengths of the non-structural proteins cleaved by the 3C-like proteinase are highly conserved, while the lengths and the number of the papain-like cysteine proteinase cleavage products (abbreviated as PCP CP) appear to be irregular. Since the NCBI annotations are not always correct, the annotations of the cleavage products of the papain-like proteinase may be incomplete. It is observed that the size of the annotated PCP CP3 of SARS-CoV, MHV and IBV is approximately the sum of the sizes of PCP CP3 and PCP CP4 of the other mammalian coronaviruses listed in Table 2. Therefore, the PCP CP3 of SARS-CoV, MHV and IBV may be further cleaved, i.e. it is possible that another papain-like proteinase cleavage site is present in the PCP CP3 of SARS-CoV, MHV and IBV. Based on the above analysis, a cleavage model of the papain-like proteinase is presented schematically in Fig. 1. According to this model, all coronaviruses have four non-structural proteins cleaved by the papain-like proteinase. Consequently, the cleavage products of the papain-like proteinase predicted by this model are conserved in both length and number. The average length and standard deviation for each papain-like proteinase cleavage product are estimated from the genomes of BCoV, HCoV-229E, TGEV and PEDV, in which four papain-like proteinase cleavage products are annotated (see Table 2).
Fig. 2A,B shows the conservation of the sites cleaved by the 3C-like proteinase and papain-like proteinase, respectively. It can be seen that both the 3C-like proteinase and the papain-like proteinase have conserved cleavage sites. The identical order of the cleavage products in polyprotein 1ab, the similar sizes of the non-structural proteins and the conserved residues in the cleavable peptides form the basis of the present algorithm to predict cleavage sites of polyproteins. Here, the method is described briefly as follows.
Table 1

Lengths (aa) of the 11 non-structural proteins(a) cleaved by the 3C-like proteinase

Genome              nsp2   nsp3   nsp4   nsp5   nsp6   nsp7   nsp9   nsp10  nsp11  nsp12  nsp13
TOR2                306    290    83     198    113    139    932    601    527    346    298
HCoV-229E(b)        302    279    83     195    109    135    927    597    518    348    300
MHV(b)              303    287    92     194    110    137    928    600    521    374    299
BCoV                303    287    89     197    110    137    928    603    521    374    299
IBV(b)              307    293    83     210    111    145    940    600    521    338    302
TGEV                302    294    83     195    111    135    929    599    519    339    300
PEDV                302    280    83     195    108    135    927    597    517    339    301
Average length(c)   304    287    85     198    110    138    930    600    521    351    300
Standard deviation  2.07   5.87   3.76   5.59   1.60   3.60   4.67   2.15   3.26   16.07  1.35

(a) These proteins are cleaved by the 3C-like proteinase within polyprotein 1ab derived from the seven coronavirus genomes annotated by NCBI.

(b) The cleavage sites have been confirmed by experimental evidence in these genomes.

(c) The genomes that have the maximum lengths for nsp2–13 (except nsp8) are IBV, TGEV, MHV, IBV, TOR2, IBV, IBV, BCoV, TOR2, MHV (BCoV) and IBV, respectively. The genomes that have the minimum lengths for nsp2–13 (except nsp8) are HCoV-229E (TGEV, PEDV), HCoV-229E, TOR2 (HCoV-229E, IBV, TGEV, PEDV), MHV, PEDV, HCoV-229E (TGEV, PEDV), HCoV-229E (PEDV), HCoV-229E (PEDV), PEDV, IBV and TOR2, respectively.
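The averages and standard deviations in Table 1 can be re-derived from the listed lengths. The following is a minimal Python sketch, not part of ZCURVE_CoV; the helper name `length_stats` is hypothetical. It assumes that the table reports the sample standard deviation (n − 1 denominator), an assumption that reproduces the tabulated values for nsp6 and nsp13.

```python
from statistics import mean, stdev

# nsp6 and nsp13 lengths (aa) copied from Table 1, in the order
# TOR2, HCoV-229E, MHV, BCoV, IBV, TGEV, PEDV.
NSP6 = [113, 109, 110, 110, 111, 111, 108]
NSP13 = [298, 300, 299, 299, 302, 300, 301]


def length_stats(lengths):
    """Return (mean rounded to an integer, sample standard deviation to 2 decimals)."""
    return round(mean(lengths)), round(stdev(lengths), 2)


nsp6_stats = length_stats(NSP6)
nsp13_stats = length_stats(NSP13)
print("nsp6:", nsp6_stats)
print("nsp13:", nsp13_stats)
```

These two quantities are what the method later uses to place and size the scanning window for each cleavage site.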

Table 2

Lengths (aa) of the non-structural proteins(a) cleaved by the papain-like proteinase

Genome                 PCP CP1  PCP CP2  PCP CP3  PCP CP4
IBV                    673(b), 2106 [column positions not recoverable from the extracted table]
TOR2                   179      639      2422     -
MHV                    247(b)   585(b)   2501     -
BCoV                   246      605      1899     496
HCoV-229E              111(b)   786      1587(b)  481
TGEV                   108      771      1509     490
PEDV                   110      785      1622     480
Average length(c)      144      737      1654     487
Standard deviation(c)  68.18    88.10    169.87   7.63

(a) These proteins are cleaved by the papain-like proteinase within polyprotein 1ab derived from the seven coronavirus genomes annotated by NCBI.

(b) These cleavage products have been confirmed by experimental evidence.

(c) The average length and standard deviation are calculated based on the genomes of BCoV, HCoV-229E, TGEV and PEDV.

Fig. 1

Comparison between the N-terminal sequences of the polyprotein 1abs in MHV and BCoV is shown schematically. The additional cleavage site in the annotated PCP CP3 predicted by the present method for MHV is situated at the corresponding position where the PCP CP3 and PCP CP4 are cleaved in BCoV. Cleavage sites that have been annotated by NCBI are indicated by black arrows, while the cleavage site predicted by the present method is indicated by an open arrow.

Fig. 2

Conservation of the sites cleaved by coronavirus proteinases. Two separate multiple, gap-free alignments around the P1|P1′ positions of the sites cleaved by the 3C-like proteinase (A) and papain-like proteinase (B) in the training set are converted to logo presentations in which the size of an amino acid is proportional to its conservation at the specific position and the sampling size. The amino acid conservation is measured in bits of information plotted on a vertical axis whose upper limit is determined by the natural diversity of amino acids (20) expressed as a logarithm of 2 [16]. Seventy-seven sites cleaved by the 3C-like proteinase were used to generate the logo in A, and 17 sites cleaved by the papain-like proteinase were used to generate the logo in B.

First, ORF1ab and the slippery sequences are identified using ZCURVE_CoV 1.0. Subsequently, the predicted ORF1ab is translated into an amino acid sequence. Starting from the C-terminus of the predicted ORF1ab polyprotein, the candidate cleavage site of nsp13 is searched within a particular region using the sliding-window technique. The distance between the center of the scanning region and the C-terminus of polyprotein 1ab should be equal to the average length of nsp13. Denoting the center position by c, a window with an octapeptide size slides from position c−3δ to c+3δ, where δ is the standard deviation of the length distribution for nsp13 (see Table 1). Given an octapeptide within the region, S=X4X3X2X1X1′X2′X3′X4′, where X_i (i=4, 3, 2, 1, 1′, 2′, 3′, 4′) represents the amino acid at position P_i, the score of the octapeptide is computed as
$$\mathrm{score}(S)=\prod_{i} f(i, X_i),\qquad i=4, 3, 2, 1, 1', 2', 3', 4',$$
where f(i, X_i) is the frequency of amino acid X_i occurring at position P_i, which is an element of the corresponding positional weight matrix. The site with the maximum score is selected as the candidate site. Consequently, the cleavage site of nsp12|13 is determined and nsp13 is found. Prediction of the other cleavage sites is performed in a recurrent way. Once the cleavage site of nsp12|13 is determined, the next cleavage site to be predicted is nsp11|12, then nsp10|11, and so forth until nsp1|2. Generally, if the site of nspk|(k+1) is determined, the next target is to predict the site of nsp(k−1)|k, where k=12, 11, …, 2, but k≠8 (see the explanation below). For clarity, take k=6 as an example, where the site of nsp6|7 is known. 
First, the center position and the sliding window used for identifying the site of nsp5|6 need to be determined. The center position c is situated upstream of the site of nsp6|7. The distance between the center position c and the site of nsp6|7 should be equal to l_6, the average length of nsp6. In Table 1, we find l_6=110 aa, and the standard deviation δ of the length distribution for nsp6 is 1.6. A window with an octapeptide size thus slides from position c−3δ≈c−5 to c+3δ≈c+5. Second, the site with the highest score is predicted to be the candidate site of nsp5|6. Note that in some cases the scores may be zero because of the limited training samples. In this case, a very small quantity (0.001) is assigned to the zero elements in the positional weight matrix. Also note that the nsp7|8 site is cleaved in polyprotein 1a, while the nsp7|9 site is cleaved in polyprotein 1ab. Therefore, the cleavage sites of nsp7|8 and nsp7|9 are in fact the same, which is why k≠8. Furthermore, if the following two conditions are satisfied, the site with the second-highest score is also taken into account in addition to the site with the maximum score: (i) Gln and Leu are found at the P1 and P2 positions, respectively; (ii) the distance between the two sites is less than five amino acid residues. This procedure allows the prediction of two adjacent cleavage sites in the scanning window. Consequently, the two alternative cleavage sites annotated by NCBI are also found in the genomes of MHV and BCoV. Note that such cases occur rarely in the genomes studied. Repeating the above procedure 11 times, all of the mature peptides cleaved by the 3C-like proteinase are identified one by one. Then, the papain-like proteinase cleavage products are searched within the remaining regions of polyprotein 1ab. A similar recurrent procedure is performed to search for the papain-like proteinase cleavage sites. The scores of 12-mer peptides are calculated as described above. 
The center position and the size of the sliding window used to search for the papain-like proteinase cleavage sites are determined in a way similar to that used for the 3C-like proteinase. The sites associated with the maximum scores in the corresponding scanning regions are predicted to be cleavage sites. Consequently, three papain-like proteinase cleavage sites are predicted for each genome.
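The scanning step described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the ZCURVE_CoV 2.0 implementation: the function names (`train_pwm`, `score`, `scan`) are hypothetical, the score is assumed to be the product of the position-specific frequencies (which is consistent with the remark that a single zero element yields a zero score), and the rule for a second adjacent site is omitted. The four training octapeptides are TGEV sites taken from Table 3; the 54-residue "polyprotein" is invented purely for demonstration.

```python
from collections import Counter

PSEUDO = 0.001  # value the text assigns to zero elements of the weight matrix


def train_pwm(peptides):
    """Column-wise amino acid frequencies from equal-length, gap-free peptides."""
    width = len(peptides[0])
    pwm = []
    for i in range(width):
        counts = Counter(p[i] for p in peptides)
        pwm.append({aa: n / len(peptides) for aa, n in counts.items()})
    return pwm


def score(pwm, peptide):
    """Product of position-specific frequencies (assumed form of the score)."""
    s = 1.0
    for column, aa in zip(pwm, peptide):
        s *= column.get(aa, PSEUDO)
    return s


def scan(protein, pwm, downstream_site, mean_len, sd):
    """Best-scoring site within mean_len +/- 3*sd upstream of a known site.

    A site is the 0-based index of its P1' residue, so protein[:site] is
    everything N-terminal to the cleaved bond.
    """
    half = len(pwm) // 2
    center = downstream_site - mean_len
    radius = round(3 * sd)
    best_site, best_score = None, -1.0
    for site in range(center - radius, center + radius + 1):
        if site - half < 0 or site + half > len(protein):
            continue  # window would run off the sequence
        s = score(pwm, protein[site - half:site + half])
        if s > best_score:
            best_site, best_score = site, s
    return best_site, best_score


# Toy data: four 3C-like proteinase sites of TGEV from Table 3.
TRAINING = ["STLQSGLR", "VNLQAGKV", "STVQSKLT", "TILQSVAS"]
pwm = train_pwm(TRAINING)

# Hypothetical polyprotein with one planted site; protein[24] is P1'.
protein = "M" + "A" * 19 + "STLQSGLR" + "G" * 26
best_site, best_score = scan(protein, pwm, downstream_site=len(protein),
                             mean_len=28, sd=1.6)
print(best_site, best_score)
```

In a full run, the site returned by `scan` would become the `downstream_site` of the next call, with `mean_len` and `sd` taken from the next column of Table 1, so that the sites are found one by one from the C-terminus towards the N-terminus.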

Results and discussion

Replicase polyprotein processing is carried out by two or three ORF1a-encoded viral proteinases. Coronaviruses encode a chymotrypsin-like proteinase, 3C-like proteinase, which is analogous to the main picornaviral proteinase, 3C proteinase [11]. As mentioned above, the cleavage sites of the 3C-like proteinase are highly conserved. As shown in Fig. 2A, the P1 position of the peptide sequence is exclusively occupied by Gln. Leu is dominant at the P2 position (more than 75%) and Val, Ser, Thr and Pro are clearly favored at the P4 position. At the P1′ position, small, aliphatic residues (Ser, Ala, Asn, Gly and Cys) are found, of which the content of Ser is more than 50%. There are no highly favored residues at the P3, P2′, P3′ and P4′ positions. The length distributions of each of the 11 non-structural proteins cleaved by the 3C-like proteinase in the annotated genomes are listed in Table 1. Of these non-structural proteins, nsp2 is the putative 3C-like proteinase; nsp3 contains a hydrophobic domain; nsp7 is known as a growth-factor-like protein; nsp9 is the putative RNA-dependent RNA polymerase; nsp10 contains a metal ion-binding domain and NTPase/helicase domain. Recently the mRNA cap-1 methyltransferase function has been assigned to nsp13 [13]. The functions of other non-structural proteins are unknown. Moreover, coronaviruses also encode one (group III) or two (groups I and II) papain-like proteinases, which are analogous to the foot and mouth disease virus leader proteinase. SARS-CoV appears to contain only one papain-like proteinase domain in the predicted gene product of ORF1a [7]. For the papain-like proteinase, the cleavage sites are also conserved, but not as conserved as those of the 3C-like proteinase. Gly and Ala are found at the P1 position and Gly accounts for more than 75%. At the P2 and P1′ positions, Gly is also the dominant residue, which accounts for more than 45% and 50%, respectively. No residues exceed 40% at other positions. 
In this study, similar sizes of non-structural proteins and conserved cleavage sites form the basis of the present algorithm. The performance of the algorithm is satisfactory by comparing the predicted results with known annotations. Although all the SARS genomes have been annotated by in silico analysis so far, some annotations for other coronaviruses, such as IBV, MHV and HCoV-229E, are supported by experimental evidence [11]. The jack-knife (leave-one-out) test has been performed here to ensure the validation of the prediction results for the cleavage sites of the 3C-like proteinase. By the jack-knife test, each genome out of the seven genomes under study is singled out in turn, and used as a testing genome. The remaining six genomes are used as the training set. Based on the data derived from the six training genomes, the cleavage sites of the 3C-like proteinase in the testing genome are predicted and evaluated. The jack-knife test was finished by repeating the above procedure seven times. Consequently, the predicted results by the jack-knife test are found to be as good as those by a self-consistency test mentioned previously, suggesting that the prediction results are reliable. The prediction results for TGEV and PEDV, which are different from the annotations of NCBI RefSeq projects, are listed in Table 3 . The prediction results for other genomes can be obtained from the supplementary materials (http://tubic.tju.edu.cn/sars/). The coronavirus −1 frameshift site [14] is believed to occur at the ‘slippery sequence’, UUUAAAC. This assumption has been supported by experimental evidence [15]. But the annotated frameshift sites are not always consistent with this pattern, as in the case of PEDV, whose frameshift site lies upstream of the UUUAAAC sequence according to the annotation. This may be due to the questionable annotation. 
For example, the genomes of MHV and BCoV were originally annotated by their authors as having a non-standard frameshift site; however, these annotations were later corrected by the NCBI re-annotations, which assign standard frameshift sites. In light of this, we adopt UUUAAAC as the standard slippery sequence.
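The jack-knife (leave-one-out) test described earlier in this section amounts to the loop sketched below. This is only an outline of the data split; the function name `jackknife_splits` is hypothetical, and the retraining and prediction steps are indicated by a comment rather than implemented.

```python
# The seven annotated RefSeq genomes used for training in this paper.
GENOMES = ["IBV", "BCoV", "HCoV-229E", "MHV", "PEDV", "TOR2", "TGEV"]


def jackknife_splits(genomes):
    """Yield (test_genome, training_genomes), holding each genome out once."""
    for i, held_out in enumerate(genomes):
        yield held_out, genomes[:i] + genomes[i + 1:]


splits = list(jackknife_splits(GENOMES))
for held_out, training in splits:
    # A real run would rebuild the weight matrix and the length statistics
    # from `training` only, then predict the 3C-like proteinase cleavage
    # sites of `held_out` and compare them with its annotation.
    print(held_out, "<-", ", ".join(training))
```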
Table 3

Comparison of the predicted results for TGEV and PEDV with those annotated by NCBI(a)

No.  Genome     Start (bp) Stop (bp)  Start (aa)  Stop (aa)  Length (aa)  Cleavable peptide  Feature
1    TGEV       315        638        1           108        108          -                  PCP CP1
     NC_002306  639        2951       109         879        771          KIARTG|RGAIYV      PCP CP2
                2952       7478       880         2388       1509         YNKMGG|GDKTVS      PCP CP3
                7479       8948       2389        2878       490          VSPKSG|SGFFDV      PCP CP4
                8949       9854       2879        3180       302          STLQ|SGLR          nsp2
                9855       10736      3181        3474       294          VNLQ|AGKV          nsp3
                10737      10985      3475        3557       83           STVQ|SKLT          nsp4
                10986      11570      3558        3752       195          TILQ|SVAS          nsp5
                11571      11903      3753        3863       111          TKLQ|NNEI          nsp6
                11904      12308      3864        3998       135          VRLQ|AGKP          nsp7
                12309      15094      3999        4927       929          TSMQ|SFTV          nsp9(b)
                15095(c)   16891(c)   4928        5526       599          TVLQ|AAGM          nsp10
                16892(c)   18448(c)   5527        6045       519          IGLQ|AKPE          nsp11
                18449(c)   19465(c)   6046        6384       339          KALQ|SLEN          nsp12
                19466(c)   20365(c)   6385        6684       300          PQLQ|SAEW          nsp13
2    PEDV       297        626        1           110        110          -                  PCP CP1
     NC_003436  627        2981       111         895        785          FGRRGG|NIVPVD      PCP CP2
                2982       7847       896         2517       1622         FKKKGG|GDVKFS      PCP CP3
                7848       9287       2518        2997       480          ANKKGA|GLPSFS      PCP CP4
                9288       10193      2998        3299       302          STLQ|AGLR          nsp2
                10194      11033      3300        3579       280          VNLQ|GGYV          nsp3
                11034      11282      3580        3662       83           SSVQ|SKLT          nsp4
                11283      11867      3663        3857       195          SMLQ|SVAS          nsp5
                11868      12192      3858        3965       108          VKLQ|NNEI          nsp6
                12191      12596      3966        4100       135          VRLQ|AGKQ          nsp7
                12597      15376      4101        5027       927          SIMQ|STDM          nsp9(d)
                15377      17167      5028        5624       597          AVLQ|SAGL          nsp10
                17168      18718      5625        6141       517          SDLQ|ANEG          nsp11
                18719      19735      6142        6480       339          NNLQ|GLEN          nsp12
                19736      20638      6481        6781       301          PQLQ|ASEW          nsp13

(a) For the 24 coronavirus genomes, the results predicted by ZCURVE_CoV 2.0 are in complete agreement with those annotated by NCBI, except for the genomes of TGEV and PEDV. The reasons for these conflicts are analyzed in this table.

(b) This conflict with the annotation is caused by the problematic annotation.

(c) The locations are different from the annotation, which is caused by a questionable additional insertion of an amino acid residue in nsp9.

(d) This conflict with the annotation is caused by the non-standard frameshift.

Using the present method, only a few false positive predictions occur. The tedious calculations for deriving a cutoff value can be avoided by restricting the sizes of the scanning regions and selecting only the site with the maximum score within each region. The annotated cleavage sites often correspond to the highest scores measured by the PWM method. However, the sites scored high by the PWM method do not always correspond to cleavage sites, and vice versa. Restricting the scanning region for each cleavage site is therefore an efficient way to reduce the false positive prediction rate. For the prediction of the 3C-like proteinase cleavage sites, there are only two conflicts between the predicted results and the annotations, which are marked in Table 3. The first conflict lies in the locations of the non-structural proteins downstream of nsp9 in TGEV, which may be due to a problematic annotation. The length of the amino acid sequence for ORF 1ab (315–20 368 bp) should be 6684 aa, instead of the 6685 aa annotated by NCBI. The questionable additional insertion of an amino acid residue in nsp9 shifts all downstream locations by one residue. 
The second is caused by a non-standard frameshift site in PEDV, which causes the difference of five amino acid residues between the non-standard frameshift site and the standard frameshift site. For this reason, the octapeptide predicted by the present method is SIMQ|STDM instead of the annotated SIMQ|STDY. Using the cleavage model of the papain-like proteinase presented here, the additional cleavage sites in the annotated PCP CP3 predicted by this method for SARS-CoV TOR2, MHV and IBV are ISLKGG|KIVSTC, FSLKGG|AVFSRM and VEKKAG|GIVSGT, respectively. The predicted cleavable peptides are similar to those annotated by NCBI, for example, the cleavable peptide FSLKGG|AVFSRM in MHV is different from the annotated peptide FSLKGG|AVFSYF in BCoV only at the P5′ and P6′ positions. Comparison between the N-terminal sequences of the polyprotein 1abs in MHV and BCoV is shown in Fig. 1. The additional cleavage site in the annotated PCP CP3 predicted by this method for MHV is situated at the corresponding position where the PCP CP3 and PCP CP4 are cleaved in BCoV. Cleavage sites that have been annotated by NCBI are indicated by black arrows, whereas that predicted by the present method is indicated by the open arrow. Therefore, the annotated PCP CP3 of SARS-CoV TOR2, MHV and IBV may be a precursor, which can be cleaved further. Based on the present method, the genomes without annotation have been annotated. To save printing space, only the results of BCoVL and SARS-CoV BJ01 are summarized in Table 4 . The detailed annotations for other coronavirus genomes are accessible at http://tubic.tju.edu.cn/sars/.
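The figure of 6684 aa quoted above for TGEV ORF 1ab (315–20 368 bp) can be checked with simple arithmetic. The sketch below is illustrative only; the function name is hypothetical, and it assumes that a −1 frameshift causes one nucleotide to be read twice and that the stop coordinate is the last base of the stop codon (consistent with nsp13 ending at 20 365 bp in Table 3).

```python
def orf1ab_length_aa(start_bp, stop_bp, frameshifts=1):
    """Amino acid count for an ORF spanning start_bp..stop_bp (1-based, inclusive).

    Assumes each -1 frameshift re-reads one nucleotide and that stop_bp is
    the last base of the stop codon.
    """
    nucleotides = stop_bp - start_bp + 1 + frameshifts
    codons, remainder = divmod(nucleotides, 3)
    if remainder:
        raise ValueError("coordinates do not describe whole codons")
    return codons - 1  # the stop codon encodes no residue


tgev_length = orf1ab_length_aa(315, 20368)
print(tgev_length)
```

Under these assumptions the span contains 20 055 translated nucleotides, i.e. 6685 codons including the stop codon, which gives 6684 residues.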
Table 4

The results predicted by the present method for BCoVL and SARS-CoV BJ01

No.  Genome     Start (bp) Stop (bp)  Start (aa)  Stop (aa)  Length (aa)  Cleavable peptide  Feature
1    BCoVL      211        948        1           246        246          -                  PCP CP1
     AF391542   949        2763       247         851        605          IRGYRG|VKPLLY      PCP CP2
                2764       8460       852         2750       1899         WRVPCA|GRRVTF      PCP CP3
                8461       9948       2751        3246       496          FSLKGG|AVFSYF      PCP CP4
                9949       10857      3247        3549       303          SFLQ|SGIV          nsp2
                10858      11718      3550        3836       287          IKLQ|SKRT          nsp3
                11719      11985      3837        3925       89           SQFQ|SKLT          nsp4
                11986      12576      3926        4122       197          TVLQ|ALQS(a)       nsp5
                12577      12906      4123        4232       110          TVLQ|NNEL          nsp6
                12907      13317      4233        4369       137          VRLQ|AGTA          nsp7
                13318      16100      4370        5297       928          TTVQ|SKDT          nsp9
                16101      17909      5298        5900       603          AVMQ|SVGA          nsp10
                17910      19472      5901        6421       521          TRVQ|CSTN          nsp11
                19473      20594      6422        6795       374          TKLQ|SLEN          nsp12
                20595      21491      6796        7094       299          PRLQ|AASD          nsp13
2    BJ01       246        782        1           179        179          -                  PCP CP1
     AY278488   783        2699       180         818        639          TRELNG|GAVTRY      PCP CP2
                2700       8465       819         2740       1922         FRLKGG|APIKGV      PCP CP3
                8466       9965       2741        3240       500          ISLKGG|KIVSTC(b)   PCP CP4
                9966       10883      3241        3546       306          AVLQ|SGFR          nsp2
                10884      11753      3547        3836       290          VTFQ|GKFK          nsp3
                11754      12002      3837        3919       83           ATVQ|SKMS          nsp4
                12003      12596      3920        4117       198          ATLQ|AIAS          nsp5
                12597      12935      4118        4230       113          VKLQ|NNEL          nsp6
                12936      13352      4231        4369       139          VRLQ|AGNA          nsp7
                13353      16147      4370        5301       932          PLMQ|SADA          nsp9
                16148      17950      5302        5902       601          TVLQ|AVGA          nsp10
                17951      19531      5903        6429       527          ATLQ|AENV          nsp11
                19532      20569      6430        6775       346          TRLQ|SLEN          nsp12
                20570      21463      6776        7073       298          PKLQ|ASQA          nsp13

(a) The alternative cleavage site predicted by the present method is at QALQ|SEFV (Gln-3928|Ser-3929).

(b) Compared with the annotation, this cleavage site is predicted additionally by the present method.

Conclusion

SARS is an extremely severe disease, which has spread to many countries around the world. Evidence shows that SARS is caused by a new coronavirus, i.e. SARS-CoV. A system, called ZCURVE_CoV 1.0, has been developed previously to recognize protein-coding genes in coronavirus genomes, especially suitable for SARS-CoV genomes [10]. Here an improved version of the system, ZCURVE_CoV 2.0, has been developed to identify all the non-structural proteins cleaved by viral proteinases in the polyproteins. Consequently, all the non-structural proteins in the 24 completely sequenced coronavirus genomes are predicted. Compared with the known annotations, including those based on experimental evidence, the performance of the present method is satisfactory.
  16 in total

1.  Characterization of a novel coronavirus associated with severe acute respiratory syndrome.

Authors:  Paul A Rota; M Steven Oberste; Stephan S Monroe; W Allan Nix; Ray Campagnoli; Joseph P Icenogle; Silvia Peñaranda; Bettina Bankamp; Kaija Maher; Min-Hsin Chen; Suxiong Tong; Azaibi Tamin; Luis Lowe; Michael Frace; Joseph L DeRisi; Qi Chen; David Wang; Dean D Erdman; Teresa C T Peret; Cara Burns; Thomas G Ksiazek; Pierre E Rollin; Anthony Sanchez; Stephanie Liffick; Brian Holloway; Josef Limor; Karen McCaustland; Melissa Olsen-Rasmussen; Ron Fouchier; Stephan Günther; Albert D M E Osterhaus; Christian Drosten; Mark A Pallansch; Larry J Anderson; William J Bellini
Journal:  Science       Date:  2003-05-01       Impact factor: 47.728

2.  Sequence logos: a new way to display consensus sequences.

Authors:  T D Schneider; R M Stephens
Journal:  Nucleic Acids Res       Date:  1990-10-25       Impact factor: 16.971

3.  A major outbreak of severe acute respiratory syndrome in Hong Kong.

Authors:  Nelson Lee; David Hui; Alan Wu; Paul Chan; Peter Cameron; Gavin M Joynt; Anil Ahuja; Man Yee Yung; C B Leung; K F To; S F Lui; C C Szeto; Sydney Chung; Joseph J Y Sung
Journal:  N Engl J Med       Date:  2003-04-07       Impact factor: 91.245

4.  A cluster of cases of severe acute respiratory syndrome in Hong Kong.

Authors:  Kenneth W Tsang; Pak L Ho; Gaik C Ooi; Wilson K Yee; Teresa Wang; Moira Chan-Yeung; Wah K Lam; Wing H Seto; Loretta Y Yam; Thomas M Cheung; Poon C Wong; Bing Lam; Mary S Ip; Jane Chan; Kwok Y Yuen; Kar N Lai
Journal:  N Engl J Med       Date:  2003-03-31       Impact factor: 91.245

5.  Identification of severe acute respiratory syndrome in Canada.

Authors:  Susan M Poutanen; Donald E Low; Bonnie Henry; Sandy Finkelstein; David Rose; Karen Green; Raymond Tellier; Ryan Draker; Dena Adachi; Melissa Ayers; Adrienne K Chan; Danuta M Skowronski; Irving Salit; Andrew E Simor; Arthur S Slutsky; Patrick W Doyle; Mel Krajden; Martin Petric; Robert C Brunham; Allison J McGeer
Journal:  N Engl J Med       Date:  2003-03-31       Impact factor: 91.245

6.  Characterization of ribosomal frameshifting for expression of pol gene products of human T-cell leukemia virus type I.

Authors:  S H Nam; T D Copeland; M Hatanaka; S Oroszlan
Journal:  J Virol       Date:  1993-01       Impact factor: 5.103

7.  ZCURVE_CoV: a new system to recognize protein coding genes in coronavirus genomes, and its applications in analyzing SARS-CoV genomes.

Authors:  Ling-Ling Chen; Hong-Yu Ou; Ren Zhang; Chun-Ting Zhang
Journal:  Biochem Biophys Res Commun       Date:  2003-07-25       Impact factor: 3.575

8.  Mutational analysis of the "slippery-sequence" component of a coronavirus ribosomal frameshifting signal.

Authors:  I Brierley; A J Jenner; S C Inglis
Journal:  J Mol Biol       Date:  1992-09-20       Impact factor: 5.469

9.  mRNA cap-1 methyltransferase in the SARS genome.

Authors:  Marcin von Grotthuss; Lucjan S Wyrwicz; Leszek Rychlewski
Journal:  Cell       Date:  2003-06-13       Impact factor: 41.582

10.  A complete sequence and comparative analysis of a SARS-associated virus (Isolate BJ01).

Authors:  E'de Qin; Qingyu Zhu; Man Yu; Baochang Fan; Guohui Chang; Bingyin Si; Bao'an Yang; Wenming Peng; Tao Jiang; Bohua Liu; Yongqiang Deng; Hong Liu; Yu Zhang; Cui'e Wang; Yuquan Li; Yonghua Gan; Xiaoyu Li; Fushuang Lü; Gang Tan; Wuchun Cao; Ruifu Yang; Jian Wang; Wei Li; Zuyuan Xu; Yan Li; Qingfa Wu; Wei Lin; Weijun Chen; Lin Tang; Yajun Deng; Yujun Han; Changfeng Li; Meng Lei; Guoqing Li; Wenjie Li; Hong Lü; Jianping Shi; Zongzhong Tong; Feng Zhang; Songgang Li; Bin Liu; Siqi Liu; Wei Dong; Jun Wang; Gane K-S Wong; Jun Yu; Huanming Yang
Journal:  Chin Sci Bull       Date:  2003
