Feng-Biao Guo, Chun-Ting Zhang.
Abstract
BACKGROUND: It is necessary to use highly accurate, statistics-based systems for viral and phage genome annotation. The GeneMark systems for gene-finding in virus and phage genomes suffer from some basic drawbacks. This paper puts forward an alternative approach to viral and phage gene-finding to improve the quality of annotations, particularly for newly sequenced genomes.
Year: 2006 PMID: 16401352 PMCID: PMC1352377 DOI: 10.1186/1471-2105-7-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The numbers of annotated and additional genes found by ZCURVE_V and GeneMark gene-finding family, respectively, for 30 viral genomes with different chromosome lengths a
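| GenBank information | | | | ZCURVE_V | | | GeneMark b | | | Glimmer d | | |
| Organisms | Sequence length (bp) | GC content (%) | No. of annotated genes | No. of predicted genes | Sensitivity (%) | Specificity (%) | No. of predicted genes | Sensitivity (%) | Specificity (%) | No. of predicted genes | Sensitivity (%) | Specificity (%) |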
| CNPV | 359,853 | 30.37 | 328 | 342 | 99.4 | 95.3 | 327 | 97.9 | 98.2 | 351 | 99.1 | 92.6 |
| FPV | 288,539 | 30.89 | 261 | 282 | 95.4 | 88.3 | 257 | 93.5 | 94.9 | 307 | 95.0 | 80.8 |
| THV | 195,859 | 66.61 | 158 | 199 | 90.5 | 71.9 | 109 | 57.6 | 83.5 | 207 | 50.6 | 38.6 |
| ASFV | 170,101 | 38.95 | 151 | 164 | 96.0 | 88.4 | 148 | 93.4 | 95.3 | 185 | 95.4 | 77.8 |
| MYXV | 161,773 | 43.56 | 170 | 170 | 98.8 | 98.8 | 172 | 97.6 | 96.5 | 180 | 98.2 | 92.8 |
| SFV | 159,857 | 39.53 | 165 | 170 | 96.4 | 93.5 | 168 | 97.0 | 95.2 | 188 | 98.2 | 86.2 |
| YLDV | 144,575 | 27.00 | 152 | 156 | 99.3 | 96.8 | 155 | 98.7 | 96.8 | 165 | 98.7 | 90.9 |
| ORFV | 139,962 | 63.44 | 130 | 131 | 92.3 | 91.6 | 133 | 91.5 | 89.5 | 187 | 97.7 | 67.9 |
| BPSV | 134,431 | 64.50 | 131 | 144 | 96.9 | 88.2 | 135 | 93.9 | 91.1 | 150 | 86.3 | 75.3 |
| AcNPV | 133,894 | 40.70 | 155 | 155 | 97.4 | 97.4 | 152 | 94.8 | 96.7 | 174 | 96.2 | 86.2 |
| BmNPV | 128,413 | 40.40 | 143 | 139 | 97.2 | 100 | 139 | 95.1 | 97.8 | 157 | 95.8 | 87.3 |
| PhopGV | 119,217 | 35.7 | 130 | 132 | 96.2 | 94.7 | 130 | 93.1 | 93.1 | 168 | 96.9 | 75.0 |
| AdhoNPV | 113,220 | 35.64 | 125 | 121 | 92.0 | 95.0 | 125 | 94.4 | 94.4 | 143 | 95.2 | 83.2 |
| LCDV-1 | 102,653 | 29.07 | 110 | 112 | 97.3 | 95.5 | 110 | 96.4 | 96.4 | 114 | 98.2 | 94.7 |
| PxGV | 100,999 | 40.69 | 120 | 116 | 93.3 | 96.6 | 123 | 92.5 | 90.2 | 131 | 94.2 | 86.3 |
| AdorGV | 99,657 | 34.49 | 119 | 121 | 97.5 | 95.9 | 116 | 94.1 | 96.6 | 137 | 95.8 | 83.2 |
| NeleNPV | 81,755 | 33.31 | 93 | 101 | 92.5 | 85.1 | 73 | 75.3 | 95.9 | 125 | 91.4 | 68.0 |
| FAdV-9 | 45,063 | 53.78 | 29 | 48 | 100 | 60.4 | 35 | 96.6 | 80 | 60 | 100 | 48.3 |
| PAdV-5 | 32,621 | 50.50 | 30 | 35 | 90.0 | 77.1 | 27 | 76.7 | 85.2 | 39 | 90.0 | 69.2 |
| IBV | 27,608 | 37.93 | 10 | 10 | 90.0 | 90.0 | 7 | 70 | 100 | 10 | 70.0 | 70.0 |
| CTV | 19,296 | 45.27 | 11 | 11 | 100 | 100 | 8 | 72.7 | 100 | 12 | 90.9 | 83.3 |
| SHFV | 15,717 | 50.11 | 11 | 11 | 100 | 100 | 8 | 54.5 | 75 | -- | -- | -- |
| BYV | 15,480 | 46.03 | 8 | 8 | 100 | 100 | 7 | 87.5 | 100 | 11 | 100 | 72.7 |
| FDLV | 15,378 | 43.13 | 8 | 8 | 100 | 100 | 8 | 100 | 100 | 11 | 100 | 72.7 |
| EAV | 12,704 | 51.66 | 9 | 9 | 88.8 | 88.8 | 6 | 55.5 | 83.3 | 11 | 77.8 | 63.6 |
| SFV | 11,442 | 53.22 | 2 | 2 | 100 | 100 | 2 | 100 | 100 | -- | -- | -- |
| BCMV | 9612 | 42.22 | 1 | 1 | 100 | 100 | 2 | 100 | 50 | 3 | 100 | 33.3 |
| GLV | 8363 | 43.86 | 6 | 6 | 100 | 100 | 4 | 66.7 | 100 | 7 | 50 | 42.9 |
| FMV | 7743 | 35.36 | 7 | 7 | 100 | 100 | 7 | 100 | 100 | 7 | 100 | 100 |
| SCMV | 4194 | 51.55 | 4 | 3 | 75 | 100 | 2 | 50 | 100 | 2 | 50 | 100 |
| Average (upper 15)c | - | - | - | - | 95.9 | 92.8 | - | 92.5 | 94.0 | - | 93.05 | 81.0 |
| Average (lower 15)c | - | - | - | - | 95.6 | 93.2 | - | 80.0 | 91.1 | - | 85.84 | 69.8 |
| Average (30)c | - | - | - | - | 95.7 | 93.0 | - | 86.2 | 92.5 | - | 89.70 | 75.8 |
a The viruses are listed in descending order of chromosome sequence length. Abbreviated virus names are used; see the text for details.
b For the genomes of canarypox virus (CNPV), orf virus (ORFV), bovine papular stomatitis virus (BPSV) and neodiprion lecontei nucleopolyhedrovirus (NeleNPV), genes were predicted directly by the GeneMarkS program, whereas for the other 26 viral genomes the data deposited in the GeneMark VIOLIN database were used.
c The values are averaged over the upper 15, the lower 15 and all 30 viral genomes, respectively.
d Glimmer 2.02 predicted no genes for simian hemorrhagic fever virus (SHFV) and semliki forest virus (SFV) genomes.
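The two percentage columns reported for each gene finder in the table above are consistent with sensitivity (annotated genes found divided by annotated genes) and specificity (annotated genes found divided by predicted genes). The following is a minimal Python sketch under that assumption. The function name is illustrative, and the counts of genes found (326 for CNPV, 139 for BmNPV) are inferred from the reported percentages rather than stated in the table.

```python
def sensitivity_specificity(found, annotated, predicted):
    """Return (sensitivity, specificity) in percent, rounded to one decimal.

    found     -- annotated genes that the gene finder recovered
    annotated -- genes in the reference (GenBank) annotation
    predicted -- all genes output by the gene finder
    """
    if annotated <= 0 or predicted <= 0:
        raise ValueError("annotated and predicted must be positive")
    sn = 100.0 * found / annotated
    sp = 100.0 * found / predicted
    return round(sn, 1), round(sp, 1)


# CNPV, ZCURVE_V columns: 328 annotated, 342 predicted.
# The value found=326 is an inferred count, not a figure from the table.
cnpv = sensitivity_specificity(found=326, annotated=328, predicted=342)

# BmNPV, ZCURVE_V columns: 143 annotated, 139 predicted (found=139 inferred).
bmnpv = sensitivity_specificity(found=139, annotated=143, predicted=139)
```

With these inferred counts the sketch reproduces the tabulated ZCURVE_V values for CNPV (99.4, 95.3) and BmNPV (97.2, 100).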
The numbers of annotated and additional genes found by ZCURVE_V and the GeneMark VIOLIN database, respectively, for the five genomes with particular features a
| GenBank information | | | ZCURVE_V | | GeneMark b | |
| Organisms | Chromosome sequence length (bp) | No. of annotated genes | No. of annotated genes found | No. of additional genes predicted | No. of annotated genes found | No. of additional genes predicted |
| CYDV-RPV satRNA | 322 | 0 | 0 | 0 | 0 | 0 |
| SatPaMV c | 826 | 2 | 2 | 0 | 1 | 0 |
| SV-MWLMV | 1168 | 1 | 1 | 0 | 0 | 1 |
| SLRSV | 1118 | 1 | 1 | 0 | 1 | 0 |
| AmEPV | 232,392 | 294 | 245 | 5 | 239 | 323 |
a Of the five viral genomes, cereal yellow dwarf virus-RPV satellite RNA (CYDV-RPV satRNA), panicum mosaic satellite virus (satPaMV), satellite maize white line mosaic virus (SV-MWLMV) and strawberry latent ringspot virus satellite RNA (SLRSV) are less than or slightly larger than 1000 bp in length, whereas amsacta moorei entomopoxvirus (AmEPV) has probably the lowest GC content among the sequenced organisms (17.78%).
b Data deposited in the GeneMark VIOLIN database are used.
c For this genome, we adjusted the default settings, i.e., using the 'single-stranded virus' option.
Genes annotated and predicted by ZCURVE_V and the GeneMark VIOLIN database for human immunodeficiency virus 1 (HIV-1) a
| Genes annotated | | | | Genes predicted by ZCURVE_V | | | Genes predicted by GeneMark | | |
| Start | Stop | Length (aa) | Gene | Start | Stop | Length (aa) | Start | Stop | Length (aa) |
| 336 | 1838 | 501 | 336 | 1838 | 501 | 336 | 1838 | 501 | |
| 1631 | 4642 | 1004 | 1904 | 4642 | 913 | 1904 | 4642 | 913 | |
| 4587 | 5165 | 193 | 4587 | 5165 | 193 | ||||
| 5105 | 5341 | 79 | 5105 | 5341 | 237 | 5105 | 5341 | 237 | |
| 5377 | 7970 | 87 | 73 | ||||||
| 5516 | 8199 | 117 | |||||||
| 5608 | 5856 | 83 | 5608 | 5856 | 83 | ||||
| 5771 | 8341 | 857 | 5771 | 8341 | 857 | 5771 | 8341 | 857 | |
| 8343 | 8714 | 124 | 8343 | 8714 | 124 | 8343 | 8714 | 124 | |
a Bold denotes a gene found by adapting the default settings of ZCURVE_V, i.e., keeping the overlapping genes. Bold italic figures denote a predicted gene whose 3' end is not consistent with the annotated one but which is embedded within the annotated gene.
Genes annotated and predicted by ZCURVE_V and the GeneMark VIOLIN database for hepatitis B virus (HBV) a
| Genes annotated | | | | Genes predicted by ZCURVE_V | | | Genes predicted by GeneMark | | |
| Start | Stop | Length (aa) | Gene | Start | Stop | Length (aa) | Start | Stop | Length (aa) |
| 1 | 1623 | 541 | P | 421 | 1623 | 401 | 421 | 1623 | 401 |
| 155 | 835 | 227 | S | ||||||
| 1374 | 1838 | 155 | X | 1374 | 1838 | 155 | |||
| 1901 | 2452 | 184 | C | 1814 | 2452 | 213 | |||
| 2307 | 3215 | 303 | P | ||||||
a Bold denotes a gene found by adapting the default settings of ZCURVE_V, i.e., keeping the overlapping genes. Bold italic figures denote a predicted gene that is embedded within the annotated gene.
The relationship between the values of VZ score and functions of predicted proteins for the bacteriophage P4
| Genes annotated | | | | Genes predicted by ZCURVE_V | | | |
| Start | Stop | Strand | Function | Start | Stop | Strand | VZ score |
| 247 | 648 | + | Hypothetical protein | 247 | 648 | + | 0.162 |
| 651 | 1718 | + | Hypothetical protein | 651 | 1718 | + | 0.111 |
| 1746 | 2540 | - | Hypothetical protein | 1746 | 2540 | - | 0.071 |
| 2607 | 3926 | - | Integrase | 2607 | 3926 | - | 0.345 |
| __ | __ | __ | __ | 3954 | 4103 | - | 0.307 |
| 4096 | 4431 | - | Hypothetical protein | 4096 | 4431 | - | 0.159 |
| 4636 | 6969 | - | DNA primase | 4636 | 6969 | - | 0.500 |
| 6984 | 7304 | - | Hypothetical protein | 6984 | 7304 | - | 0.434 |
| 7440 | 7895 | - | Hypothetical protein | 7440 | 7895 | - | 0.380 |
| 7888 | 8175 | - | helper derepression protein | 7888 | 8256 | - | 0.425 |
| 8168 | 8584 | - | Putative CI repressor | 8168 | 8812 | - | 0.377 |
| 8764 | 9030 | - | Transcriptional regulator | 8764 | 9030 | - | 0.408 |
| __ | __ | __ | __ | 8991 | 9173 | - | 0.264 |
| 9583 | 10317 | + | Head size determination protein sid | 9583 | 10317 | + | 0.426 |
| 10314 | 10814 | + | Transactivation protein | 10314 | 10814 | + | 0.353 |
| 10888 | 11460 | + | Amber mutation-suppressing protein | 10888 | 11460 | + | 0.419 |
Joint applications of ZCURVE_V and GeneMark for the four viral genomes a
| Organisms | | CLYVV | LCDV-1 | TGEV | YLDV |
| Annotated genes | | 5 | 110 | 9 | 152 |
| ZCURVE_V | Annotated genes found | 4 | 107 | 8 | 151 |
| | Additional genes found | 0 | 5 | 0 | 5 |
| GeneMark VIOLIN | Annotated genes found | 4 | 106 | 8 | 150 |
| | Additional genes found | 0 | 4 | 0 | 5 |
| Joint | Annotated genes found | 5 | 108 | 9 | 152 |
| | Additional genes found | 0 | 8 | 0 | 9 |
a The four genomes are those of clover yellow mosaic virus (CLYMV), lymphocystis disease virus 1 (LCDV-1), transmissible gastroenteritis virus (TGEV) and yaba-like disease virus (YLDV), respectively.
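The "Joint" rows above are consistent with taking the union of the two programs' predictions: the joint count of annotated genes found is at least as large as either program's alone, and the joint count of additional genes is at most the sum of the two. The following is a minimal Python sketch under that assumption, which is not confirmed by the table itself. The function name and the toy coordinates are hypothetical, and genes are matched here by exact (start, stop, strand), which is a simplification.

```python
def joint_counts(annotated, zcurve, genemark):
    """Combine two prediction sets by union and score them against the annotation.

    Each argument is a set of (start, stop, strand) tuples.
    Returns (annotated genes found, additional genes predicted).
    """
    joint = zcurve | genemark        # a gene counts if either program predicts it
    found = joint & annotated        # predictions matching an annotated gene
    additional = joint - annotated   # predictions absent from the annotation
    return len(found), len(additional)


# Hypothetical toy data, not taken from any genome in the table.
annotated = {(100, 400, "+"), (450, 900, "+"), (1000, 1300, "-")}
zcurve = {(100, 400, "+"), (450, 900, "+"), (1500, 1700, "+")}
genemark = {(100, 400, "+"), (1000, 1300, "-"), (1500, 1700, "+"), (1800, 2000, "-")}

result = joint_counts(annotated, zcurve, genemark)
```

In this toy case each program alone finds two of the three annotated genes, while the union finds all three. The additional gene predicted by both programs is counted once, which is how a joint total can be smaller than the sum of the individual totals.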