Ling-Ling Chen, Hong-Yu Ou, Ren Zhang, Chun-Ting Zhang.
Abstract
A new system to recognize protein-coding genes in coronavirus genomes, especially suitable for the SARS-CoV genomes, is proposed in this paper. Compared with some existing systems, the new program package has the merits of simplicity, high accuracy, reliability, and speed. The system, ZCURVE_CoV, has been run on each of the 11 newly sequenced SARS-CoV genomes. Consequently, six genomes not annotated previously have been annotated, and some problems with previous annotations of the remaining five genomes have been pointed out and discussed. In addition to the polyprotein chain ORFs 1a and 1b and the four genes coding for the major structural proteins, spike (S), small envelope (E), membrane (M), and nucleocapsid (N), ZCURVE_CoV also predicts 5-6 putative proteins of between 39 and 274 amino acids in length with unknown functions. Some single nucleotide mutations within these putative coding sequences have been detected and their biological implications are discussed. A web service is provided, by which a user can obtain the annotated result immediately by pasting the SARS-CoV genome sequence into the input window on the web site (http://tubic.tju.edu.cn/sars/). The software ZCURVE_CoV can also be downloaded freely from the web address mentioned above and run on computers under Windows or Linux.
Year: 2003 PMID: 12859968 PMCID: PMC7134609 DOI: 10.1016/s0006-291x(03)01192-6
Source DB: PubMed Journal: Biochem Biophys Res Commun ISSN: 0006-291X Impact factor: 3.575
Table 1. The genes predicted by GeneMark.hmm for the SARS-CoV, TOR2 strain
| Gene | Start | Stop | Gene length (bp) |
|---|---|---|---|
| 1 | <3 | 53 | 51 |
| 2 | 265 | 13,413 | 13,149 |
| 3 | 13,599 | 21,485 | 7887 |
| 4 | 21,492 | 25,259 | 3768 |
| 5 | 25,268 | 26,092 | 825 |
| 6 | 26,398 | 27,063 | 666 |
| 7 | 27,074 | 27,265 | 192 |
| 8 | 27,273 | 27,641 | 369 |
| 9 | 27,864 | 28,118 | 255 |
| 10 | 28,130 | 28,426 | 297 |
| 11 | 28,423 | 29,388 | 966 |
The same genome was submitted to the website several times, but the prediction results were not always identical, indicating that the system is unstable. In some results, an important structural protein gene (the N protein gene, located from 28,120 to 29,388) was predicted as two separate genes, ‘gene 10’ and ‘gene 11.’ In others, ‘gene 9,’ an ORF that is well conserved in all 11 SARS-CoV genomes mentioned above, was not predicted. In addition, the gene coding for a structural protein (the small envelope protein, E) was missed by the prediction. For more details, see the text.
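The coordinate arithmetic behind these observations can be checked with a short sketch. It assumes 1-based, inclusive coordinates, so that a gene's length is stop − start + 1; the function names are hypothetical, and the E protein coordinates (26,117 to 26,347) are taken from the TOR2 annotation listed elsewhere in this document.

```python
# GeneMark.hmm predictions for TOR2, copied from the table above:
# (gene number, start, stop, reported length in bp).
# Gene 1 is listed with start "<3" (a partial gene); 3 is used here.
GENEMARK = [
    (1, 3, 53, 51),
    (2, 265, 13413, 13149),
    (3, 13599, 21485, 7887),
    (4, 21492, 25259, 3768),
    (5, 25268, 26092, 825),
    (6, 26398, 27063, 666),
    (7, 27074, 27265, 192),
    (8, 27273, 27641, 369),
    (9, 27864, 28118, 255),
    (10, 28130, 28426, 297),
    (11, 28423, 29388, 966),
]


def span_length(start, stop):
    """Length in bp of a 1-based, inclusive interval (assumed convention)."""
    return stop - start + 1


def overlapping_genes(region_start, region_stop, genes=GENEMARK):
    """Numbers of the predicted genes that share at least one base with the region."""
    return [
        gene
        for gene, start, stop, _ in genes
        if start <= region_stop and region_start <= stop
    ]


if __name__ == "__main__":
    # Every reported length agrees with stop - start + 1.
    print(all(span_length(s, e) == bp for _, s, e, bp in GENEMARK))
    # Predicted genes overlapping the annotated N protein gene.
    print(overlapping_genes(28120, 29388))
    # Predicted genes overlapping the annotated E protein gene.
    print(overlapping_genes(26117, 26347))
```

Under these assumptions, the N protein region is covered by two separate predictions (genes 10 and 11), while no prediction overlaps the E protein region, which matches the problems described above.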
Table 2. Comparison of the genes annotated and those predicted by ZCURVE_CoV 1.0 for the SARS-CoV, TOR2 strain
| Annotated start | Annotated stop | bp | a.a. | Annotated feature | Predicted start | Predicted stop | bp | a.a. | Predicted feature |
|---|---|---|---|---|---|---|---|---|---|
| 265 | 13,398 | 13,134 | 4377 | ORF 1a | 265 | 13,398 | 13,134 | 4377 | ORF 1a |
| 13,398 | 21,485 | 8088 | 2695 | ORF 1b | 13,398 | 21,485 | 8088 | 2695 | ORF 1b |
| 21,492 | 25,259 | 3768 | 1255 | S protein | 21,492 | 25,259 | 3768 | 1255 | S protein |
| 25,268 | 26,092 | 825 | 274 | ORF 3 | 25,268 | 26,092 | 825 | 274 | Sars274 |
| 25,689 | 26,153 | 465 | 154 | ORF 4 | | | | | |
| 26,117 | 26,347 | 231 | 76 | E protein | 26,117 | 26,347 | 231 | 76 | E protein |
| 26,398 | 27,063 | 666 | 221 | M protein | 26,398 | 27,063 | 666 | 221 | M protein |
| 27,074 | 27,265 | 192 | 63 | ORF 7 | 27,074 | 27,265 | 192 | 63 | Sars63 |
| 27,273 | 27,641 | 369 | 122 | ORF 8 | 27,273 | 27,641 | 369 | 122 | Sars122 |
| 27,638 | 27,772 | 135 | 44 | ORF 9 | 27,638 | 27,772 | 135 | 44 | Sars44 |
| 27,779 | 27,898 | 120 | 39 | ORF 10 | 27,779 | 27,898 | 120 | 39 | Sars39 |
| 27,864 | 28,118 | 255 | 84 | ORF 11 | 27,864 | 28,118 | 255 | 84 | Sars84 |
| 28,120 | 29,388 | 1269 | 422 | N protein | 28,120 | 29,388 | 1269 | 422 | N protein |
| 28,130 | 28,426 | 297 | 98 | ORF 13 | | | | | |
| 28,583 | 28,795 | 213 | 70 | ORF 14 | | | | | |
The program ZCURVE_CoV 1.0 has two options. The default option uses the heptamer UUUAAAC as the conserved ‘slippery sequence’ to find the coronavirus −1 frameshift site [16]. If the heptamer is found upstream of, and near, the originally predicted end of ORF 1a, the end of ORF 1a and the start of ORF 1b are both corrected to the frameshift site (13,398 in this genome) according to this ‘slippery sequence.’ If the heptamer cannot be found, the originally predicted sites for ORF 1a and ORF 1b are displayed in the output file. The second option ignores the −1 frameshift: the originally predicted sites for ORF 1a and ORF 1b are always displayed, regardless of whether the heptamer UUUAAAC is present.
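The two options described above can be illustrated with a minimal sketch. This is not the ZCURVE_CoV source code: the function names, the `window` size that defines "near," and the choice to report the frameshift site as the last base of the heptamer are all assumptions made for illustration.

```python
SLIPPERY = "UUUAAAC"  # conserved heptamer named in the text


def find_frameshift_site(genome, orf1a_end, window=60):
    """Return the 1-based position of the last base of the nearest heptamer
    lying within `window` bases upstream of `orf1a_end`, or None.

    `window` and the "last base" convention are assumptions of this sketch.
    """
    rna = genome.upper().replace("T", "U")
    lo = max(0, orf1a_end - window)
    # Search only rna[lo:orf1a_end]; rfind picks the occurrence closest to the end.
    idx = rna.rfind(SLIPPERY, lo, orf1a_end)
    if idx == -1:
        return None
    return idx + len(SLIPPERY)  # 0-based start -> 1-based last base


def correct_boundaries(genome, orf1a, orf1b, use_frameshift=True, window=60):
    """Apply the default option (use_frameshift=True) or the second option.

    `orf1a` and `orf1b` are (start, stop) pairs as originally predicted.
    """
    if use_frameshift:
        site = find_frameshift_site(genome, orf1a[1], window)
        if site is not None:
            # Both the end of ORF 1a and the start of ORF 1b move to the site.
            return (orf1a[0], site), (site, orf1b[1])
    # Heptamer absent, or frameshift ignored: keep the original sites.
    return orf1a, orf1b


if __name__ == "__main__":
    # Toy sequence (not SARS-CoV): heptamer occupies positions 21-27.
    toy = "A" * 20 + "TTTAAAC" + "G" * 15 + "A" * 30
    print(correct_boundaries(toy, (1, 42), (50, 72)))
    print(correct_boundaries(toy, (1, 42), (50, 72), use_frameshift=False))
```

In the toy example, the default option moves both boundaries to position 27, while the second option returns the original coordinates unchanged.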
Fig. 1. Distribution of the mapping points corresponding to genes, non-genes, predicted genes, and questionable ORFs for the SARS-CoV, TOR2 strain in a 3-dimensional (3-D) space. Each gene or ORF is mapped onto a point in a 9-D space. To visualize the distribution, the mapping points are projected onto the 3-D space spanned by the first three principal axes obtained by principal component analysis. The first, second, and third principal vectors are denoted by the X-, Y-, and Z-axes, respectively. The first three principal components account for 69.59% of the total inertia of the 9-D space. Green and orange balls represent the positive samples (genes) and negative samples (non-coding sequences), respectively. Blue balls correspond to the genes predicted by ZCURVE_CoV for the TOR2 strain, while red balls correspond to ORF 4, ORF 13, and ORF 14 annotated by Marra et al. [8]. The three red balls clearly lie on the side of the non-coding sequences, indicating that ORF 4, ORF 13, and ORF 14 are very unlikely to code for proteins.
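The projection and the quoted inertia fraction can be written out as follows. This is the standard formulation of principal component analysis, not notation taken from the paper: the symbols $\mathbf{u}_i$ (the 9-D mapping point of sample $i$), $\bar{\mathbf{u}}$ (the sample mean), $C$, $\lambda_k$, and $\mathbf{v}_k$ are assumptions introduced here for illustration.

```latex
\begin{aligned}
C &= \frac{1}{n}\sum_{i=1}^{n}
     (\mathbf{u}_i - \bar{\mathbf{u}})(\mathbf{u}_i - \bar{\mathbf{u}})^{\mathsf T},
\qquad
C\,\mathbf{v}_k = \lambda_k \mathbf{v}_k,
\quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_9 \ge 0, \\[4pt]
(X_i,\, Y_i,\, Z_i) &=
\bigl(\mathbf{v}_1^{\mathsf T}(\mathbf{u}_i - \bar{\mathbf{u}}),\;
      \mathbf{v}_2^{\mathsf T}(\mathbf{u}_i - \bar{\mathbf{u}}),\;
      \mathbf{v}_3^{\mathsf T}(\mathbf{u}_i - \bar{\mathbf{u}})\bigr), \\[4pt]
\frac{\lambda_1 + \lambda_2 + \lambda_3}{\sum_{k=1}^{9} \lambda_k} &\approx 0.6959 .
\end{aligned}
```

The normalization of $C$ (by $n$ or by $n-1$) cancels in the last ratio, so it does not affect the reported 69.59%.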
Table 3. Comparison of the genes annotated and those predicted by ZCURVE_CoV 1.0 for the SARS-CoV, Urbani strain
| Annotated start | Annotated stop | bp | a.a. | Annotated feature | Predicted start | Predicted stop | bp | a.a. | Predicted feature |
|---|---|---|---|---|---|---|---|---|---|
| 265 | 13,398 | 13,134 | 4377 | ORF 1a | 265 | 13,398 | 13,134 | 4377 | ORF 1a |
| 13,398 | 21,485 | 8088 | 2695 | ORF 1b | 13,398 | 21,485 | 8088 | 2695 | ORF 1b |
| 21,492 | 25,259 | 3768 | 1255 | S protein | 21,492 | 25,259 | 3768 | 1255 | S protein |
| 25,268 | 26,092 | 825 | 274 | X1 | 25,268 | 26,092 | 825 | 274 | Sars274 |
| 25,689 | 26,153 | 465 | 154 | X2 | | | | | |
| 26,117 | 26,347 | 231 | 76 | E protein | 26,117 | 26,347 | 231 | 76 | E protein |
| 26,398 | 27,063 | 666 | 221 | M protein | 26,398 | 27,063 | 666 | 221 | M protein |
| 27,074 | 27,265 | 192 | 63 | X3 | 27,074 | 27,265 | 192 | 63 | Sars63 |
| 27,273 | 27,641 | 369 | 122 | X4 | 27,273 | 27,641 | 369 | 122 | Sars122 |
| | | | | | 27,638 | 27,772 | 135 | 44 | Sars44 |
| | | | | | 27,779 | 27,898 | 120 | 39 | Sars39 |
| 27,864 | 28,118 | 255 | 84 | X5 | 27,864 | 28,118 | 255 | 84 | Sars84 |
| 28,120 | 29,388 | 1269 | 422 | N protein | 28,120 | 29,388 | 1269 | 422 | N protein |
See the footnote in Table 2.
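The relation between the coordinate, bp, and a.a. columns, and the labels of the putative proteins, can be reproduced with a small sketch. It relies on two patterns observed in the tables rather than on any documented rule: the a.a. value equals bp/3 − 1 in every row, and each putative protein label is "Sars" followed by its a.a. value. The function names are hypothetical.

```python
def protein_length(start, stop):
    """Amino acid count for a 1-based, inclusive interval.

    Follows the pattern seen in the tables: a.a. = bp / 3 - 1.
    """
    bp = stop - start + 1
    if bp % 3 != 0:
        raise ValueError(f"length {bp} bp is not a multiple of 3")
    return bp // 3 - 1


def putative_name(start, stop):
    """Label in the style used for the putative proteins, e.g. 'Sars274'."""
    return f"Sars{protein_length(start, stop)}"


if __name__ == "__main__":
    # Coordinates of the putative proteins listed in the table above.
    for start, stop in [(25268, 26092), (27074, 27265), (27273, 27641),
                        (27638, 27772), (27779, 27898), (27864, 28118)]:
        print(start, stop, putative_name(start, stop))
```

Applied to the six putative coding sequences, the sketch yields the same labels as the predicted-feature column, from Sars274 down to Sars39.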
Fig. 2. Nucleotide mutations of the predicted gene Sars274 based on the alignment of the corresponding coding sequences in 11 complete genome sequences. A total of four point mutations are detected, of which one is a silent mutation and the other three cause amino acid changes in the putative protein. The point mutations occur at nucleotide positions 31, 302, 406, and 783, respectively. At the 31st position, G → A (TOR2) ⇒ Gly → Arg. Similarly, at the 302nd position, T (U) → A (HKU-39849) ⇒ Met → Lys; at the 406th position, A → C (BJ01) ⇒ Lys → Gln; and at the 783rd position, A → C (BJ01), with no amino acid change.
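Where each mutation falls within its codon can be worked out from the nucleotide position alone. The sketch below assumes positions are counted from 1 at the first base of the Sars274 coding sequence; the function names are hypothetical. Only the Met → Lys change is replayed at codon level, because ATG is the only codon for Met; the actual codons at the other three sites are not given in the caption and are not guessed here.

```python
def codon_position(nt_pos):
    """Map a 1-based nucleotide position in a coding sequence to
    (1-based codon number, 1-based position within that codon)."""
    index, offset = divmod(nt_pos - 1, 3)
    return index + 1, offset + 1


def mutate(codon, pos_in_codon, new_base):
    """Return the codon with the base at the 1-based position replaced."""
    return codon[:pos_in_codon - 1] + new_base + codon[pos_in_codon:]


# (position, reference base, mutant base, strain) as listed in the caption.
MUTATIONS = [
    (31, "G", "A", "TOR2"),
    (302, "T", "A", "HKU-39849"),
    (406, "A", "C", "BJ01"),
    (783, "A", "C", "BJ01"),
]

# Only the two codons needed for the Met -> Lys example (standard genetic code).
CODON_TO_AA = {"ATG": "Met", "AAG": "Lys"}

if __name__ == "__main__":
    for pos, ref, alt, strain in MUTATIONS:
        codon, within = codon_position(pos)
        print(f"{pos}: {ref}->{alt} ({strain}) codon {codon}, base {within}")
    mutant = mutate("ATG", codon_position(302)[1], "A")
    print(CODON_TO_AA["ATG"], "->", CODON_TO_AA[mutant])
```

Under this numbering, the three amino acid changes fall on the first or second base of their codons (codons 11, 101, and 136), and the silent change falls on the third base of codon 261, where a substitution often leaves the encoded amino acid unchanged.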