ZCURVE_CoV: a new system to recognize protein coding genes in coronavirus genomes, and its applications in analyzing SARS-CoV genomes.

Ling-Ling Chen¹, Hong-Yu Ou, Ren Zhang, Chun-Ting Zhang.

Abstract

A new system to recognize protein-coding genes in coronavirus genomes, especially suitable for the SARS-CoV genomes, is proposed in this paper. Compared with some existing systems, the new program package has the merits of simplicity, high accuracy, reliability, and speed. ZCURVE_CoV has been run on each of the 11 newly sequenced SARS-CoV genomes. Consequently, six genomes not annotated previously have been annotated, and some problems of previous annotations in the remaining five genomes have been pointed out and discussed. In addition to the polyprotein chain ORFs 1a and 1b and the four genes coding for the major structural proteins, spike (S), small envelope (E), membrane (M), and nucleocapsid (N), ZCURVE_CoV also predicts 5–6 putative proteins of unknown function, between 39 and 274 amino acids in length. Some single nucleotide mutations within these putative coding sequences have been detected and their biological implications are discussed. A web service is provided, by which a user can obtain the annotated result immediately by pasting a SARS-CoV genome sequence into the input window on the web site (http://tubic.tju.edu.cn/sars/). The software ZCURVE_CoV can also be downloaded freely from the same web address and run on computers under Windows or Linux.

Substances：
Viral Proteins

Year: 2003 PMID： 12859968 PMCID： PMC7134609 DOI： 10.1016/s0006-291x(03)01192-6

Source DB: PubMed Journal: Biochem Biophys Res Commun ISSN： 0006-291X Impact factor: 3.575

An outbreak of a life-threatening disease, referred to as severe acute respiratory syndrome (SARS), has spread to many countries around the world [1], [2], [3], [4], [5], [6]. By late May 2003, the World Health Organization (WHO) had recorded more than 7000 cases of SARS and more than 600 SARS-related deaths, and a global alert for the illness was issued because of the severity of the disease (http://www.who.int/csr/sars/en/). A growing body of evidence has convincingly shown that SARS is caused by a novel coronavirus, called SARS-coronavirus or SARS-CoV. Currently, the complete genomes of 11 strains of SARS-CoV isolated from SARS patients have been sequenced [7], [8], [9], and more complete genome sequences are expected. The SARS-CoV genomes are about 30 kb in length. For such short genome sequences, there is currently no reliable software for the identification of protein-coding genes. Therefore, most sequenced genomes were annotated manually or not annotated at all: among the 11 complete sequences, six have not yet been annotated and the remaining five were annotated manually.
Currently, most algorithms for gene identification in prokaryotic genomes, such as GeneMark.hmm [10] and Glimmer [11], are based either on a higher-order Markov chain model or on a hidden Markov model, in which thousands of parameters need to be trained. The large number of parameters may result in less adaptability, especially for small genomes. In contrast, ZCURVE [12] is a newly developed system for gene recognition in bacterial and archaeal genomes, in which only 33 parameters are used and the recognition accuracy is high. The ZCURVE algorithm thus captures the coding properties of protein-coding genes with a relatively small number of parameters, which makes it suitable not only for large genomes but especially for small ones.
In this paper, we describe a system, called ZCURVE_CoV, based on a coronavirus-specific ZCURVE algorithm, which is especially suitable for gene recognition in SARS-CoV genomes. The system has the advantages of simplicity, reliability, high accuracy, and quickness. The software system ZCURVE_CoV is freely available at http://tubic.tju.edu.cn/sars/.

Materials and methods

Six genome sequences of coronaviruses and the annotation information were downloaded from the web site of the NCBI RefSeq project (http://www.ncbi.nih.gov/RefSeq/). These coronaviruses are avian infectious bronchitis virus (NC_001451), bovine coronavirus (NC_003045), human coronavirus 229E (NC_002645), murine hepatitis virus (NC_001846), porcine epidemic diarrhea virus (NC_003436), and transmissible gastroenteritis virus (NC_002306). A total of 48 genes were extracted from these six genomes and used to train the gene-finding algorithm.
Currently, 15 genome sequences of SARS coronavirus (SARS-CoV) strains are available in the GenBank database, of which 11 are complete and four are partial. The complete genomes are SARS-CoV TOR2 (Accession No. AY274119), Urbani (AY278741), HKU-39849 (AY278491), CUHK-W1 (AY278554), BJ01 (AY278488), CUHK-Su10 (AY282752), SIN2500 (AY283794), SIN2748 (AY283797), SIN2679 (AY283796), SIN2774 (AY283798), and SIN2677 (AY283795); the partial genomes are SARS-CoV BJ02 (AY278487), BJ03 (AY278490), BJ04 (AY279354), and GZ01 (AY278489).
The gene-finding algorithm presented in this paper is based on the Z curve [13], which is a graphic representation of DNA sequences. The Z curve method has been used to recognize protein-coding genes in the budding yeast genome [14]. A new ab initio gene-finding system for bacterial and archaeal genomes has been developed recently, based on the Z curve method [12]. Here the method, with some modifications, is used to recognize protein-coding genes in coronavirus genomes. It is presented briefly as follows. Suppose that the occurrence frequencies of the bases A, C, G, and T (U) at the first, second, and third codon positions in an ORF are denoted by $a_i$, $c_i$, $g_i$, and $t_i$, respectively, where $i = 1, 2, 3$.
The four numbers $a_i$, $c_i$, $g_i$, and $t_i$ are mapped onto a point in a 3-dimensional space $V_i$ with the coordinates
$$x_i = (a_i + g_i) - (c_i + t_i), \quad y_i = (a_i + c_i) - (g_i + t_i), \quad z_i = (a_i + t_i) - (g_i + c_i), \qquad x_i, y_i, z_i \in [-1, 1], \quad i = 1, 2, 3.$$
Then, each ORF may be represented by a point or a vector in a 9-dimensional space $V$, where $V = V_1 \oplus V_2 \oplus V_3$ and the symbol $\oplus$ denotes the direct sum of subspaces. The nine components $u_1$–$u_9$ of the space $V$ are defined as follows:
$$u_1 = x_1,\; u_2 = y_1,\; u_3 = z_1,\; u_4 = x_2,\; u_5 = y_2,\; u_6 = z_2,\; u_7 = x_3,\; u_8 = y_3,\; u_9 = z_3.$$
To train the system, two sets of samples are needed: positive samples corresponding to protein-coding genes (seed ORFs) and negative samples corresponding to non-coding sequences. In the Z curve method, gene recognition is essentially based on the compositional asymmetry of the three codon positions in coding sequences. It was shown that the overall extent of codon usage bias in RNA viruses is low and that there is little variation in bias between genes [15]. Coronaviruses belong to the family Coronaviridae, and the G + C content of the published coronavirus genomes ranges from 37% to 42% [7]. Therefore, it is reasonable to deduce that the published coronavirus genomes have similar codon usage. Based on this consideration, gene-finding parameters derived from some published coronavirus genomes may be applied to recognize genes in other coronavirus genomes.
Because the SARS-CoV genomes are relatively small (≈30 kb), it is difficult to obtain enough seed ORFs from the SARS-CoV genomes themselves. Therefore, we used other published coronavirus genomes to train the gene-finding parameters: avian infectious bronchitis virus, bovine coronavirus, human coronavirus 229E, murine hepatitis virus, porcine epidemic diarrhea virus, and transmissible gastroenteritis virus, from which 48 seed ORFs were selected. Detailed information about the 48 seed ORFs is given in Table 1 of the supplementary materials (see: http://tubic.tju.edu.cn/sars/). Below we describe the strategy to produce the negative samples.
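Before turning to the negative samples, the mapping from an ORF to the nine components u1–u9 described above can be illustrated with a minimal Python sketch. The function name `zcurve_features` and the handling of ambiguous bases are assumptions made for illustration; this is not the ZCURVE_CoV source code.

```python
from collections import Counter


def zcurve_features(orf: str) -> list[float]:
    """Return the nine Z-curve components u1..u9 of an ORF.

    Hypothetical helper: frequencies are computed separately for codon
    positions 1, 2 and 3, then turned into (x, y, z) for each position.
    """
    seq = orf.upper().replace("U", "T")  # accept RNA or DNA alphabet
    if len(seq) == 0 or len(seq) % 3 != 0:
        raise ValueError("ORF length must be a positive multiple of 3")
    features = []
    for phase in range(3):
        bases = seq[phase::3]            # every third base, one codon position
        counts = Counter(bases)
        n = len(bases)                   # assumption: ambiguous bases stay in n
        a, c, g, t = (counts[b] / n for b in "ACGT")
        x = (a + g) - (c + t)            # purine vs. pyrimidine
        y = (a + c) - (g + t)            # amino vs. keto
        z = (a + t) - (g + c)            # weak vs. strong hydrogen bonding
        features.extend([x, y, z])
    return features


# Toy input: position 1 is always A, position 2 always T, position 3 always G.
print(zcurve_features("ATGATGATG"))
# [1.0, 1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
```

Each component lies in [-1, 1] because the four frequencies at a codon position sum to at most 1.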
It is rather difficult to produce an appropriate set of non-coding sequences from coronavirus genomes, because the amount of non-coding sequence in these genomes is too small to be used. A method to produce negative samples has been developed previously and has been shown to be an effective way to solve the problem [12]. The same method is used in the current study. In this method, a negative sample is derived from a seed ORF. Generally speaking, if the regular structure of a coding sequence is completely destroyed, it is transformed into a non-coding one. Therefore, a negative sample may be obtained simply by shuffling the corresponding coding sequence sufficiently (20,000 times in the current study). The resulting random sequences from all 48 seed ORFs were used as non-coding sequences. The major difference between a seed ORF and its shuffled counterpart is that the former has some regular structures, whereas the latter is a random sequence. Strictly speaking, a random sequence is not a true non-coding sequence, but it is a good approximation. As shown below, this approximation generally leads to good gene-finding results.
The Fisher linear equation for discriminating the positive and negative samples in the 9-dimensional space $V$ represents a hyperplane, described by a vector $\mathbf{c}$ with nine components $c_1, c_2, \ldots, c_9$. For more details about the Fisher discriminant algorithm, refer to, for example, [14]. Based on the data in the training set (including the positive and negative samples), the vector $\mathbf{c}$ and the threshold $c_0$ are obtained. The decision of coding/non-coding for each ORF and negative sample is made by the criterion $\mathbf{c} \cdot \mathbf{u} > c_0$ (coding) or $\mathbf{c} \cdot \mathbf{u} < c_0$ (non-coding), that is, $Z(\mathbf{u}) > 0$ or $Z(\mathbf{u}) < 0$, where $Z(\mathbf{u}) = \mathbf{c} \cdot \mathbf{u} - c_0$. $Z(\mathbf{u})$ is called the Z score or Z index of an ORF or a fragment of DNA sequence. Finally, the strategy used here to deal with overlapping ORFs is similar to that described in the previous paper [12].
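The training procedure described above (shuffled negatives, Fisher discriminant, Z score) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: "shuffling 20,000 times" is interpreted here as 20,000 random swaps of two positions, the threshold c0 is placed at the midpoint of the projected class means (the paper does not state how c0 is chosen), a small ridge term is added for numerical stability, and the toy "seed ORFs" are synthetic sequences with made-up base preferences per codon position. All function names are hypothetical.

```python
import random

import numpy as np


def zcurve_features(orf):
    """Nine Z-curve components: (x, y, z) for each of the three codon positions."""
    seq = orf.upper().replace("U", "T")
    u = []
    for phase in range(3):
        bases = seq[phase::3]
        a, c, g, t = (bases.count(b) / len(bases) for b in "ACGT")
        u += [(a + g) - (c + t), (a + c) - (g + t), (a + t) - (g + c)]
    return u


def shuffle_orf(orf, rng, n_swaps=20000):
    """Destroy the codon structure by repeated random swaps of two positions."""
    s = list(orf)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(s)), rng.randrange(len(s))
        s[i], s[j] = s[j], s[i]
    return "".join(s)


def fisher_train(pos, neg, ridge=1e-6):
    """Return (c, c0) for a two-class Fisher linear discriminant."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    m_pos, m_neg = pos.mean(axis=0), neg.mean(axis=0)
    dp, dn = pos - m_pos, neg - m_neg
    # Pooled within-class scatter; the ridge term is an assumption for stability.
    s_w = dp.T @ dp + dn.T @ dn + ridge * np.eye(pos.shape[1])
    c = np.linalg.solve(s_w, m_pos - m_neg)
    c0 = 0.5 * float(c @ (m_pos + m_neg))  # assumption: midpoint threshold
    return c, c0


def z_score(u, c, c0):
    """Z(u) = c.u - c0; positive means 'coding' under the criterion in the text."""
    return float(np.dot(c, u) - c0)


# Synthetic seed ORFs with hypothetical base weights (A, C, G, T) per codon position.
rng = random.Random(0)
WEIGHTS = [(0.3, 0.2, 0.4, 0.1), (0.3, 0.25, 0.15, 0.3), (0.2, 0.2, 0.1, 0.5)]


def toy_orf(n_codons=300):
    return "".join(
        rng.choices("ACGT", weights=WEIGHTS[k % 3])[0] for k in range(3 * n_codons)
    )


seeds = [toy_orf() for _ in range(20)]
pos = [zcurve_features(s) for s in seeds]
neg = [zcurve_features(shuffle_orf(s, rng)) for s in seeds]
c, c0 = fisher_train(pos, neg)
correct = sum(z_score(u, c, c0) > 0 for u in pos) + sum(z_score(u, c, c0) < 0 for u in neg)
print(f"training accuracy on toy data: {correct / (len(pos) + len(neg)):.2f}")
```

Shuffling preserves the overall base composition of each seed ORF, so the discriminant can only exploit the position-specific asymmetry, which is the property the Z curve method relies on.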

Results and discussions

Comparison with the existing system—GeneMark.hmm

No coronavirus-specific annotation system has been available so far. Currently, GeneMark.hmm is commonly used for gene-finding in virus genomes [10]. We submitted the SARS-CoV TOR2 genome to the GeneMark.hmm website using default settings; the prediction result is listed in Table 1. The predicted ‘gene 1’ is questionable because of its short length and the lack of a start codon. An important structural protein gene (small envelope protein E), located from 26117 to 26347, was not predicted by GeneMark.hmm. Moreover, when we submitted the same genome sequence to the website several times, the prediction results were not always identical, indicating that the system is unstable. In some of the results, an important structural protein gene (N protein), located from 28120 to 29388, was predicted as ‘gene 10’ and ‘gene 11’ (marked with * in Table 1). Sometimes ‘gene 9’ (marked with * in Table 1), an ORF that is well conserved in all of the 11 SARS-CoV genomes mentioned above, was not predicted. For gene-finding in the SARS-CoV genomes, ZCURVE_CoV performs better than GeneMark.hmm (see Table 3 in the supplementary materials).

Table 1

The genes predicted by GeneMark.hmm for the SARS-CoV, TOR2 strain

Gene	Start	Stop	Gene length (bp)

1	<3	53	51
2	265	13,413	13,149
3	13,599	21,485	7887
4	21,492	25,259	3768
5	25,268	26,092	825
6	26,398	27,063	666
7	27,074	27,265	192
8	27,273	27,641	369
9*	27,864	28,118	255
10*	28,130	28,426	297
11*	28,423	29,388	966

The same genome was submitted several times to the website, however, the prediction results were not identical at all times, indicating that the system is unstable. An important structural protein gene (N protein), which is located from 28120 to 29388, was predicted as ‘gene 10’ and ‘gene 11’ in some predicted results. Sometimes, ‘gene 9,’ a quite conserved ORF in all of the 11 SARS-CoV genomes mentioned above, was not predicted. In addition, the gene coding for a structural protein (small envelope protein E) was also missed by the prediction. For more details, see the text.

Apply ZCURVE_CoV to analyze the SARS-CoV genomes

Currently, the genome sequences of 15 SARS-CoV strains are available in the GenBank/EMBL databases, of which 11 are complete and four are partial. The gene-finding software ZCURVE_CoV Version 1.0 has been run on each of the 11 complete SARS-CoV genomes. To save space, the detailed results are listed in Table 3 of the supplementary materials (see also the discussion below). In addition to the polyprotein chain ORFs 1a and 1b, the program predicts the four genes coding for the major structural proteins, i.e., spike (S), small envelope (E), membrane (M), and nucleocapsid (N), in all 11 SARS-CoV genomes. ZCURVE_CoV 1.0 also predicts 5–6 putative proteins with lengths between 39 and 274 amino acids for the 11 genomes. These putative genes might code for non-structural proteins in the SARS-CoV genomes.
To compare the gene-finding result of ZCURVE_CoV 1.0 with a known annotation, the SARS-CoV TOR2 strain is used as an example. The genome of the TOR2 strain was annotated manually [8]; this annotation is listed on the left part of Table 2, whereas the annotation produced by ZCURVE_CoV 1.0 is listed on the right part of Table 2. The two annotations are in good agreement, except for three ORFs. These three ORFs, i.e., ORF 4, ORF 13, and ORF 14 annotated by Marra et al. [8], are not predicted by ZCURVE_CoV 1.0. They are completely embedded, in a different reading frame, within genes coding for structural proteins. The absence of transcription regulating sequences (TRSs) at the 5′ end of these ORFs [8] suggests that they are unlikely to be protein-coding genes. The principal component analysis performed below further supports this conjecture. As mentioned in the Materials and methods section, each ORF is represented by a point in a 9-dimensional (9-D) space.
Consequently, the positive samples (genes) and negative samples (non-coding sequences) are represented by two groups of points in the 9-D space, respectively. For the TOR2 strain, the 12 putative genes predicted by ZCURVE_CoV and ORF 4, ORF 13, and ORF 14 are represented by the corresponding points in the 9-D space, respectively. We project the points in the 9-D space onto the 3-D space spanned by the first, second, and third principal axes based on the principal component analysis. The fraction of the first three principal components accounts for about 70% of the total inertia of the 9-D space. Fig. 1 shows the distribution of the corresponding points in the 3-D space, where green and orange balls represent the positive samples (genes) and negative samples (non-coding sequences), respectively. Blue balls correspond to the genes predicted by ZCURVE_CoV for the TOR2 strain, while red balls correspond to ORF 4, ORF 13, and ORF 14 annotated by Marra et al. [8]. It is clear that the three red balls are located at the side of non-coding sequences, indicating that ORF 4, ORF 13, and ORF 14 are very unlikely to code for proteins.
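The projection step described above can be sketched with a small NumPy example. It is only an illustration: the 9-D points below are synthetic random numbers standing in for the real training samples and query ORFs, the function names are hypothetical, and fitting the principal axes on the pooled positive and negative samples is an assumption, since the text does not say which points define the axes.

```python
import numpy as np


def pca_fit(points, n_components=3):
    """Return (mean, principal axes, fraction of total inertia explained)."""
    x = np.asarray(points, dtype=float)
    mean = x.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal axes,
    # squared singular values are proportional to the inertia per axis.
    _, s, vt = np.linalg.svd(x - mean, full_matrices=False)
    inertia = s ** 2
    explained = float(inertia[:n_components].sum() / inertia.sum())
    return mean, vt[:n_components], explained


def pca_project(points, mean, axes):
    """Project 9-D points onto the retained principal axes."""
    return (np.asarray(points, dtype=float) - mean) @ axes.T


# Synthetic stand-ins for the 9-D points (NOT data from the paper).
rng = np.random.default_rng(0)
genes = rng.normal(0.2, 0.1, size=(48, 9))       # positive samples
non_genes = rng.normal(-0.2, 0.1, size=(48, 9))  # shuffled negatives
mean, axes, explained = pca_fit(np.vstack([genes, non_genes]))

# Query ORFs (e.g. questionable ORFs) are projected with the same axes.
query = rng.normal(-0.2, 0.1, size=(3, 9))
coords = pca_project(query, mean, axes)
print(coords.shape, f"inertia retained: {explained:.2%}")
```

Projecting query ORFs with axes fitted on the training samples keeps all four groups of points in one coordinate system, so their positions can be compared directly.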

Table 2

Comparison of the genes annotated and those predicted by ZCURVE_CoV 1.0, for the SARS-CoV, TOR2 strain

Genes annotated					Genes predicted by ZCURVE_CoV 1.0
Start	Stop	bp	a.a.	Feature	Start	Stop	bp	a.a.	Feature

265	13,398	13,134	4377	ORF 1a	265	13,398a	13,134	4377	ORF 1a
13,398	21,485	8088	2695	ORF 1b	13,398a	21,485	8088	2695	ORF 1b
21,492	25,259	3768	1255	S protein	21,492	25,259	3768	1255	S protein
25,268	26,092	825	274	ORF 3	25,268	26,092	825	274	Sars274
25,689	26,153	465	154	ORF 4
26,117	26,347	231	76	E protein	26,117	26,347	231	76	E protein
26,398	27,063	666	221	M protein	26,398	27,063	666	221	M protein
27,074	27,265	192	63	ORF 7	27,074	27,265	192	63	Sars63
27,273	27,641	369	122	ORF 8	27,273	27,641	369	122	Sars122
27,638	27,772	135	44	ORF 9	27,638	27,772	135	44	Sars44
27,779	27,898	120	39	ORF 10	27,779	27,898	120	39	Sars39
27,864	28,118	255	84	ORF 11	27,864	28,118	255	84	Sars84
28,120	29,388	1269	422	N protein	28,120	29,388	1269	422	N protein
28,130	28,426	297	98	ORF 13
28,583	28,795	213	70	ORF 14

The program ZCURVE_CoV 1.0 has two options. The default option is to use the heptamer UUUAAAC as the conservative ‘slippery sequence’ to find the coronavirus −1 frameshift site [16]. Once the heptamer is found in the upstream sequence near the ending site of ORF 1a originally predicted, the ending site of ORF 1a and starting site of ORF 1b are both corrected to the frameshift site (13398 in this genome) according to this ‘slippery sequence.’ Otherwise, if this heptamer cannot be found, only the original sites predicted for ORF 1a and ORF 1b are displayed in the output file. The second option is to ignore the −1 frameshift, and the original sites predicted for ORF 1a and ORF 1b are always displayed, regardless of the existence of the heptamer UUUAAAC.
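The default option described in this footnote can be sketched as a simple heptamer search. This is a minimal illustration, not the program's actual logic: the function name, the size of the upstream search window, and the convention that the frameshift site is the last base of the heptamer are all assumptions.

```python
def find_frameshift_site(genome, orf1a_stop, window=300, heptamer="TTTAAAC"):
    """Return the 1-based position of the last base of the slippery heptamer
    closest to (and upstream of) the predicted end of ORF 1a, or None.

    `orf1a_stop` is the 1-based, inclusive end of ORF 1a as originally
    predicted; `window` (hypothetical value) limits how far upstream to look.
    """
    seq = genome.upper().replace("U", "T")  # treat RNA and DNA alike
    start = max(0, orf1a_stop - window)
    # rfind picks the occurrence nearest to the ORF 1a end within the window.
    idx = seq.rfind(heptamer, start, orf1a_stop)
    if idx == -1:
        return None                 # keep the originally predicted sites
    return idx + len(heptamer)      # 0-based start + 7 == 1-based last base


# Synthetic sequence: heptamer occupies 1-based positions 101..107.
toy = "A" * 100 + "UUUAAAC" + "G" * 15
print(find_frameshift_site(toy, orf1a_stop=122))        # 107
print(find_frameshift_site("A" * 122, orf1a_stop=122))  # None
```

When a site is found, both the end of ORF 1a and the start of ORF 1b would be set to the returned position, matching the shared coordinate 13,398 shown for both ORFs in Table 2.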

Fig. 1

Distribution of the mapping points corresponding to genes, non-genes, predicted genes, and questionable ORFs for the SARS-CoV, TOR2 strain in a 3-dimensional (3-D) space. Each gene or ORF is mapped onto a point in a 9-D space. To visualize the distribution, the mapping points are projected onto the 3-D space spanned by the first three principal axes based on the principal component analysis. The first, second, and third principal vectors are denoted by the X-, Y-, and Z-axes, respectively. The fraction of the first three principal components accounts for 69.59% of the total inertia of the 9-D space. Green and orange balls represent the positive samples (genes) and negative samples (non-coding sequences), respectively. Blue balls correspond to the genes predicted by ZCURVE_CoV for the TOR2 strain, while red balls correspond to ORF 4, ORF 13, and ORF 14 annotated by Marra et al. [8]. It is clear that the three red balls are situated at the side of non-coding sequences, indicating that ORF 4, ORF 13, and ORF 14 are very unlikely to code for proteins.

A similar analysis was performed for the Urbani strain [7]. 
The result is listed in Table 3, in which the putative gene X2 annotated by Rota et al. [7], corresponding to ORF 4 in Marra et al. [8], is not predicted by ZCURVE_CoV. Based on the above analysis, X2 is also very unlikely to code for a protein. Of the 11 complete SARS-CoV genomes, six have not yet been annotated. We have run the program ZCURVE_CoV on each of the 11 genomes, so that the genomes already annotated have been re-annotated and those not yet annotated have been annotated for the first time. All of the annotation results are listed in Table 3 of the supplementary materials.

Table 3

Comparison of the genes annotated and those predicted by ZCURVE_CoV1.0, for the SARS-CoV, Urbani strain

Genes annotated					Genes predicted by ZCURVE_CoV 1.0
Start	Stop	bp	a.a.	Feature	Start	Stop	bp	a.a.	Feature

265	13,398	13,134	4377	ORF 1a	265	13,398a	13,134	4377	ORF 1a
13,398	21,485	8088	2695	ORF 1b	13,398a	21,485	8088	2695	ORF 1b
21,492	25,259	3768	1255	S protein	21,492	25,259	3768	1255	S protein
25,268	26,092	825	274	X1	25,268	26,092	825	274	Sars274
25,689	26,153	465	154	X2
26,117	26,347	231	76	E protein	26,117	26,347	231	76	E protein
26,398	27,063	666	221	M protein	26,398	27,063	666	221	M protein
27,074	27,265	192	63	X3	27,074	27,265	192	63	Sars63
27,273	27,641	369	122	X4	27,273	27,641	369	122	Sars122
					27,638	27,772	135	44	Sars44
					27,779	27,898	120	39	Sars39
27,864	28,118	255	84	X5	27,864	28,118	255	84	Sars84
28,120	29,388	1269	422	N protein	28,120	29,388	1269	422	N protein

See the footnote in Table 2.

Analyze the mutations of the six putative non-structural genes by sequence alignment

To examine nucleotide mutations in the predicted genes coding for non-structural proteins, we aligned the coding sequences of Sars274, Sars63, Sars122, Sars44, Sars39, and Sars84 across the 11 complete SARS-CoV genomes using ClustalW 1.8 [17]. The multiple sequence alignments for these six predicted genes are listed in Fig. 1 of the supplementary materials. For the three ORFs Sars122, Sars44, and Sars84, the nucleotide sequences are fully conserved in the 11 SARS-CoV genomes, suggesting that these ORFs might have crucial biological functions. If so, mutations in these sequences would result in loss of important functions, and these coding sequences might therefore serve as candidate targets for designing drugs against SARS. In contrast, Sars39 is not found in the strains SIN2677 and SIN2748, and a nucleotide mutation at position 49 leads to a Cys → Arg change in the strains BJ01 and CUHK-W1. The variability of Sars39 suggests that it is probably not a key protein for SARS-CoV. For Sars63, two nucleotide mutations are observed at base positions 38 and 170, leading to the amino acid changes Glu → Gly and Pro → Leu in the strains SIN2677 and BJ01, respectively. See Fig. 1 in the supplementary materials for details.
The result of the ClustalW alignment for Sars274 is shown in Fig. 2. Four nucleotide mutations, located at positions 31, 302, 406, and 783, have been detected in three different strains. The first three cause amino acid changes (Fig. 2): at position 31, G → A (TOR2) ⇒ Gly → Arg; at position 302, T (U) → A (HKU-39849) ⇒ Met → Lys; and at position 406, A → C (BJ01) ⇒ Lys → Gln. The last substitution, at position 783, is a synonymous codon mutation that does not lead to an amino acid change.
On the other hand, Marra et al. [8] reported three trans-membrane regions in the Sars274 sequence, spanning approximately nucleotide positions 102 → 168 (residues 34 → 56), 231 → 297 (77 → 99), and 309 → 375 (103 → 125). The three amino acid changes (at residues 11, 101, and 136) therefore all occur outside the predicted trans-membrane regions. Note that the second amino acid change (Met → Lys) is a non-conservative one, since Met is a hydrophobic amino acid, whereas Lys is a strongly hydrophilic one. At present, we cannot know whether these mutations cause severe conformational changes in the tertiary structure of this putative protein. The relatively high mutation rate of Sars274 implies either that it might be a relatively unimportant protein for SARS-CoV, or that the mutations do not change its biological function dramatically. Finally, for the time being we still cannot rule out the possibility that all or some of these mutations are caused by sequencing errors.
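The position arithmetic behind this argument can be checked with a short Python sketch. The helper names are hypothetical; the trans-membrane residue ranges are taken from the text, with the last range ending at residue 125 because nucleotide 375 is the third base of codon 125. The actual codons of Sars274 are not given in the text, so only the Met codon (necessarily ATG in the standard genetic code) is used directly, and the Gly case enumerates all four possible GGN codons.

```python
# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Trans-membrane regions of Sars274 as residue ranges (inclusive).
TM_REGIONS = [(34, 56), (77, 99), (103, 125)]


def locate(nt_pos):
    """Map a 1-based nucleotide position in a CDS to (residue, position in codon)."""
    return (nt_pos - 1) // 3 + 1, (nt_pos - 1) % 3 + 1


def in_tm_region(residue):
    return any(lo <= residue <= hi for lo, hi in TM_REGIONS)


def mutate_codon(codon, offset, new_base):
    """Return (original amino acid, mutated amino acid) for a point mutation."""
    mutated = codon[: offset - 1] + new_base + codon[offset:]
    return CODON_TABLE[codon], CODON_TABLE[mutated]


for nt_pos in (31, 302, 406, 783):
    residue, offset = locate(nt_pos)
    print(nt_pos, residue, offset, in_tm_region(residue))
# 31 11 1 False
# 302 101 2 False
# 406 136 1 False
# 783 261 3 False

print(mutate_codon("ATG", 2, "A"))  # ('M', 'K'): Met -> Lys at position 302
# G -> A at the first base of a Gly codon gives Arg only for GGA and GGG.
print({n: mutate_codon("GG" + n, 1, "A")[1] for n in "TCAG"})
# {'T': 'S', 'C': 'S', 'A': 'R', 'G': 'R'}
```

The last line shows that the reported Gly → Arg change at position 31 is consistent with an original codon of GGA or GGG, while GGT or GGC would have produced Ser instead.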

Fig. 2

Nucleotide mutations of the predicted gene Sars274 based on the alignment of the corresponding coding sequences in the 11 complete genome sequences. A total of four point mutations are detected, of which one is a silent mutation and the other three cause amino acid changes in the putative protein. The point mutations occur at nucleotide positions 31, 302, 406, and 783, respectively. At the 31st position, G → A (TOR2) ⇒ Gly → Arg. Similarly, at the 302nd position, T (U) → A (HKU-39849) ⇒ Met → Lys; at the 406th position, A → C (BJ01) ⇒ Lys → Gln; and at the 783rd position, A → C (BJ01), but no amino acid change.

Supplementary materials

The detailed supplementary materials related to this study are available from the website http://tubic.tju.edu.cn/sars/, which includes the following content: (a) Table 1. The 48 seed ORFs and the six coronavirus genomes from which the seed ORFs are derived. (b) Table 2. The Fisher coefficients and threshold obtained from the seed ORFs. (c) Table 3. Results of gene-finding using ZCURVE_CoV for the 11 SARS-CoV complete genomes. (d) Fig. 1. The results of multiple sequence alignment of the six predicted genes coding for non-structural proteins, Sars274, Sars63, Sars122, Sars44, Sars39, and Sars84, respectively.

Online service and availability of the program ZCURVE_CoV

A web interface of the ZCURVE_CoV system has been constructed. When a user pastes a SARS-CoV genome sequence to the input window of the website, the gene-finding result will be returned to the user immediately. A user may also download the executable version of the program ZCURVE_CoV and run it on the computers under the platforms of either Windows (95/98/NT/Me/2000 or higher), or Linux (Redhat 7.1 or higher), or SGI IRIX 6.5. For more detailed information, visit: http://tubic.tju.edu.cn/sars/.

Conclusion

Severe acute respiratory syndrome (SARS) is an extremely severe disease that has spread to many countries around the world. Accumulating evidence has shown that SARS is caused by a new coronavirus, i.e., SARS-CoV. A new system to recognize protein-coding genes in SARS-CoV genomes, called ZCURVE_CoV, has been reported in this paper. By applying the program to 11 complete SARS-CoV genomes, six genomes not annotated previously have been annotated, and some problems of previous annotations in the remaining five genomes have been pointed out and discussed. It is shown that the three protein-coding ORFs annotated by Marra et al. [8], i.e., ORF 4, ORF 13, and ORF 14, are very unlikely to code for proteins. In addition to ORF1a, ORF1b, and the four genes coding for the major structural proteins S, E, M, and N, the new system ZCURVE_CoV also predicts 5–6 putative genes coding for non-structural proteins. Aligning each of the non-structural gene sequences based on the 11 complete genomes, some mutations have been detected. The biological implications of the mutations have been discussed.

17 in total

1. Heuristic approach to deriving models for gene finding.

Authors: J Besemer; M Borodovsky
Journal: Nucleic Acids Res Date: 1999-10-01 Impact factor: 16.971

2. Recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the Z curve.

Authors: C T Zhang; J Wang
Journal: Nucleic Acids Res Date: 2000-07-15 Impact factor: 16.971

3. ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes.

Authors: Feng-Biao Guo; Hong-Yu Ou; Chun-Ting Zhang
Journal: Nucleic Acids Res Date: 2003-03-15 Impact factor: 16.971

4. Analysis of distribution of bases in the coding sequences by a diagrammatic technique.

Authors: C T Zhang; R Zhang
Journal: Nucleic Acids Res Date: 1991-11-25 Impact factor: 16.971

5. A major outbreak of severe acute respiratory syndrome in Hong Kong.

Authors: Nelson Lee; David Hui; Alan Wu; Paul Chan; Peter Cameron; Gavin M Joynt; Anil Ahuja; Man Yee Yung; C B Leung; K F To; S F Lui; C C Szeto; Sydney Chung; Joseph J Y Sung
Journal: N Engl J Med Date: 2003-04-07 Impact factor: 91.245

6. A cluster of cases of severe acute respiratory syndrome in Hong Kong.

Authors: Kenneth W Tsang; Pak L Ho; Gaik C Ooi; Wilson K Yee; Teresa Wang; Moira Chan-Yeung; Wah K Lam; Wing H Seto; Loretta Y Yam; Thomas M Cheung; Poon C Wong; Bing Lam; Mary S Ip; Jane Chan; Kwok Y Yuen; Kar N Lai
Journal: N Engl J Med Date: 2003-03-31 Impact factor: 91.245

7. Identification of severe acute respiratory syndrome in Canada.

Authors: Susan M Poutanen; Donald E Low; Bonnie Henry; Sandy Finkelstein; David Rose; Karen Green; Raymond Tellier; Ryan Draker; Dena Adachi; Melissa Ayers; Adrienne K Chan; Danuta M Skowronski; Irving Salit; Andrew E Simor; Arthur S Slutsky; Patrick W Doyle; Mel Krajden; Martin Petric; Robert C Brunham; Allison J McGeer
Journal: N Engl J Med Date: 2003-03-31 Impact factor: 91.245

8. Microbial gene identification using interpolated Markov models.

Authors: S L Salzberg; A L Delcher; S Kasif; O White
Journal: Nucleic Acids Res Date: 1998-01-15 Impact factor: 16.971

9. A novel coronavirus associated with severe acute respiratory syndrome.

Authors: Thomas G Ksiazek; Dean Erdman; Cynthia S Goldsmith; Sherif R Zaki; Teresa Peret; Shannon Emery; Suxiang Tong; Carlo Urbani; James A Comer; Wilina Lim; Pierre E Rollin; Scott F Dowell; Ai-Ee Ling; Charles D Humphrey; Wun-Ju Shieh; Jeannette Guarner; Christopher D Paddock; Paul Rota; Barry Fields; Joseph DeRisi; Jyh-Yuan Yang; Nancy Cox; James M Hughes; James W LeDuc; William J Bellini; Larry J Anderson
Journal: N Engl J Med Date: 2003-04-10 Impact factor: 91.245

10. A complete sequence and comparative analysis of a SARS-associated virus (Isolate BJ01).

Authors: E'de Qin; Qingyu Zhu; Man Yu; Baochang Fan; Guohui Chang; Bingyin Si; Bao'an Yang; Wenming Peng; Tao Jiang; Bohua Liu; Yongqiang Deng; Hong Liu; Yu Zhang; Cui'e Wang; Yuquan Li; Yonghua Gan; Xiaoyu Li; Fushuang Lü; Gang Tan; Wuchun Cao; Ruifu Yang; Jian Wang; Wei Li; Zuyuan Xu; Yan Li; Qingfa Wu; Wei Lin; Weijun Chen; Lin Tang; Yajun Deng; Yujun Han; Changfeng Li; Meng Lei; Guoqing Li; Wenjie Li; Hong Lü; Jianping Shi; Zongzhong Tong; Feng Zhang; Songgang Li; Bin Liu; Siqi Liu; Wei Dong; Jun Wang; Gane K-S Wong; Jun Yu; Huanming Yang
Journal: Chin Sci Bull Date: 2003
