
Improving gene annotation of complete viral genomes.

Ryan Mills, Michael Rozanov, Alexandre Lomsadze, Tatiana Tatusova, Mark Borodovsky.

Abstract

Gene annotation in viruses often relies upon similarity search methods. These methods possess high specificity but some genes may be missed, either those unique to a particular genome or those highly divergent from known homologs. To identify potentially missing viral genes we have analyzed all complete viral genomes currently available in GenBank with a specialized and augmented version of the gene finding program GeneMarkS. In particular, by implementing genome-specific self-training protocols we have better adjusted the GeneMarkS statistical models to sequences of viral genomes. Hundreds of new genes were identified, some in well studied viral genomes. For example, a new gene predicted in the genome of the Epstein-Barr virus was shown to encode a protein similar to alpha-herpesvirus minor tegument protein UL14 with heat shock functions. Convincing evidence of this similarity was obtained after only 12 PSI-BLAST iterations. In another example, several iterations of PSI-BLAST were required to demonstrate that a gene predicted in the genome of Alcelaphine herpesvirus 1 encodes a BALF1-like protein which is thought to be involved in apoptosis regulation and, potentially, carcinogenesis. New predictions were used to refine annotations of viral genomes in the RefSeq collection curated by the National Center for Biotechnology Information. Importantly, even in those cases where no sequence similarities were detected, GeneMarkS significantly reduced the number of primary targets for experimental characterization by identifying the most probable candidate genes. The new genome annotations were stored in VIOLIN, an interactive database which provides access to similarity search tools for up-to-date analysis of predicted viral proteins.

Year:  2003        PMID: 14627837      PMCID: PMC290248          DOI: 10.1093/nar/gkg878

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Currently, the complete genome of a virus can be sequenced within days. The next step towards understanding the details of a virus life cycle is to identify the whole complement of viral genes and proteins. This information can provide critical insights on many occasions. For instance, for a team working on an antiviral drug design, promising drug targets would be those viral proteins that are basically identical in all major strains of a virus and are significantly different from the proteins in the host, e.g. human. At the time of this study, the GenBank database (1) contained ∼3000 annotated complete viral genome sequences. In most cases, research groups providing the original annotation are unable to detect and confirm all genes experimentally by the time of submission. Computational approaches have therefore been commonly used since the time of pioneer projects such as the sequencing and annotation of phage λ (2). There are two major approaches to gene identification, intrinsic and extrinsic (3). The intrinsic approach, which can be also called an ab initio statistical approach, uses statistical patterns of nucleotide frequencies and nucleotide ordering observed in a given genome. These patterns are not the same in protein-coding and non-coding DNA sequences; hence a properly trained intrinsic method can recognize protein-coding regions. Extrinsic methods seek to identify evolutionarily conserved sequences in protein-coding regions. These sequences can be detected by similarity searches. The extrinsic method is thus dependent on external information residing outside the sequence of interest. Intrinsic and extrinsic methods have complementary strengths. Tests of their predictive power performed with sets of sequences containing known genes show that the intrinsic methods have higher sensitivity than the extrinsic methods which usually have higher specificity. Using intrinsic and extrinsic methods in concert is therefore a worthwhile approach (3). 
So far, the use of computational gene identification methods in viral genomes by the groups of researchers submitting genomic data to GenBank was primarily restricted to similarity searches. To reduce the risk of missing real genes, a simple statistics-based rule is frequently applied to take into account the difference in length distributions of real genes and random open-reading frames (ORFs). This rule suggests annotating ‘long enough’ ORFs as genes. For instance, in the rat cytomegalovirus genome any ORF longer than 300 nt not overlapping an adjacent ORF to an extent larger than 60% was annotated as a gene (4). Such a simplistic rule, however, could cause substantial over-annotation, especially in genomes with high G+C content. Another frequently used simplification is the annotation of a gene start by the ‘longest ORF’ rule (assignment of a gene start to the 5′-most ATG codon). A screening of GenBank identified 26 complete viral genomes with a total of 4400 genes, all annotated using this rule. It was nevertheless shown earlier that the true start may not be pinpointed by this rule in ∼25% of cases (5). Viral genomes are different from the genomes of their hosts in several aspects that hamper immediate successful application of the gene finding methods developed for their hosts. An important factor is the rather small size of a viral genomic sequence. Currently, the RefSeq collection (19) contains 891 viral genomes shorter than 10 kb with a total of 2900 genes annotated, 169 genomes with lengths between 10 and 100 kb (3500 genes) and 47 genomes longer than 100 kb (7900 genes). A rather short genome size makes it either impossible to apply previously developed training procedures to derive parameters of high order statistical models (for the shortest viral genomes) or significantly limits the accuracy of these models (even in the case of the longest viral genomes). 
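
The two annotation shortcuts described above, the 'long enough' ORF rule and the 'longest ORF' start rule, can be illustrated with a minimal Python sketch. It scans the forward strand only, uses the 300 nt threshold quoted for the rat cytomegalovirus annotation, and ignores the overlap criterion. The function name `long_orfs` and the toy sequence are illustrative assumptions and do not come from any published annotation pipeline.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def long_orfs(seq, min_len=300):
    """Return (start, end) of forward-strand ORFs, 0-based, end exclusive.

    Each ORF starts at the 5'-most ATG found in its frame after the previous
    stop codon (the 'longest ORF' rule) and includes the stop codon. Only
    ORFs of at least `min_len` nt are reported (the 'long enough' rule).
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        start = None  # 5'-most ATG seen since the last stop in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS:
                if start is not None and i + 3 - start >= min_len:
                    found.append((start, i + 3))
                start = None
    return sorted(found)

# Toy sequence (assumption): a 306 nt ORF after a 2 nt leader.
toy = "CC" + "ATG" + "GCT" * 100 + "TAA" + "GG"
print(long_orfs(toy))
```

Because such a rule looks only at length and the first ATG, it can neither reject a long random ORF nor choose a downstream start, which is the weakness the text attributes to it.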
Another important feature of viral genome organization is the high frequency of gene overlaps that occur in viruses of both prokaryotic and eukaryotic hosts. The gene overlaps in viral genomes appear to be considerably longer than those seen in prokaryotic and, much more rarely, eukaryotic genomes. Furthermore, some annotated and experimentally confirmed viral genes are completely overlapped by others. Repetitive DNA may occupy a large portion of a viral genome; for example, in the Epstein–Barr virus genome (NC_001345) repetitive regions amount to ∼30% of the genomic sequence (6), thus making model training more complicated.
In spite of the difficulties mentioned above, several groups have attempted to apply earlier developed statistical gene prediction programs to viral genome annotation. For instance, the GeneMark program (7) was used to identify genes in the genomes of Bovine herpesvirus 4 (8), bacteriophage φKZ of Pseudomonas aeruginosa (9), Mycoplasma virus P1 (10), Mycobacteriophage D29 (11), Stx 2e-encoding phage φP27 (12), coliphage T4 and the marine cyanophage S-PM2 (13), as well as to identify genes in genomes of virulence plasmids in Rhodococcus equi (14), Shigella flexneri (15) and Escherichia coli (16). Still, these initial attempts did not use a tool developed specifically for the problem at hand (except perhaps the case of T4, where the GeneMark models were adjusted to the genomic T4 sequence).
A significant difference may sometimes exist between the GenBank record and the original publication. For instance, the annotation of the white spot bacilliform virus (GenBank record AF332093) lists 531 protein-coding genes in comparison with only 181 genes mentioned in the original publication (17). On the other hand, only 23 genes are annotated in Rana tigrina ranavirus (GenBank record AF389451), while the original publication (18) describes 105 genes.
In order to improve the quality of DNA sequence annotation, the National Center for Biotechnology Information (NCBI) has created the RefSeq collection. While the original GenBank genomic record is maintained as suggested by the authors, the RefSeq record of the same sequence is continuously updated with regard to new relevant data that become available. There were 1191 RefSeq records for complete genomes of viruses of prokaryotic and eukaryotic hosts as of August 2002. Several attempts have been made to organize data on viral genomes in interactive databases providing tools for analysis of viral genes and proteins (20–22). These projects have been typically focused on specific classes of viruses. To provide a tool for accurate ab initio gene identification in viral genomes we have modified the earlier developed GeneMarkS program (5) to make it suitable for analysis and gene prediction in viral genomes of different types. As a result of the application of this tool, we have created new annotation records for viral genomes present in GenBank (including its RefSeq part). These records have been compiled in the database VIOLIN (viral genomes online) accessible online at http://opal.biology.gatech.edu/GeneMark/VIOLIN/.

MATERIALS AND METHODS

Materials

A set of 2945 complete viral genome records was downloaded from GenBank. Since several genomic variants (strains, mutants, isolates) were determined for many viral species, many viral genome records had several other almost identical entries. To filter out this redundancy we have specifically focused on the analysis of viral genomes from the RefSeq collection containing 1191 complete genomic records of viruses of eukaryotic (1071) and prokaryotic (120) hosts. RefSeq contains only one record for each virus species. Notably, these 1191 RefSeq viral genome annotations included 86 records that had been updated with the aid of our new predictions. In what follows, these 86 records have been treated differently in terms of comparison of predicted and annotated genes.

Methods

For phage genomes with prokaryotic-type gene organization, computer methods of prokaryotic gene finding could be adjusted rather easily. The prokaryotic version of GeneMark.hmm as well as its self-training version GeneMarkS were previously shown to possess high accuracy both in detecting prokaryotic genes as a whole and in exactly pinpointing gene starts (23,24). Therefore, GeneMarkS was the natural choice as the tool to be applied and adjusted for the analysis of phage genomes. For viruses of eukaryotic hosts, the situation is more complex. Current eukaryotic gene finding algorithms are unable to predict the gene overlaps frequently seen in genomes of viruses of eukaryotic hosts. On the other hand, according to the RefSeq annotation of ∼11 000 genes in 1015 genomes of viruses of eukaryotic hosts, only ∼300 genes have introns. Therefore, use of the program able to predict overlapping genes provides more benefits than the one predicting exon–intron structures. The program suitable for immediate use and further modifications was again the prokaryotic GeneMarkS, which could identify overlapping protein-coding ORFs while rarely occurring exons would be predicted as separate ORFs. A viral genomic sequence might not provide enough training data to determine parameters of Markov chain models used in GeneMark.hmm. We turned, therefore, to the heuristic training technique described earlier (24), which is able to derive the parameters of the required models from a DNA sequence as short as 400 nt. For larger viral genomes, the statistical models initially defined by the heuristic procedure could be iteratively refined further by the unsupervised training procedure implemented in GeneMarkS (24). This iterative procedure used simultaneous training and gene prediction to build models of protein-coding and non-coding sequences. 
For larger phage genomes, GeneMarkS also derived a model for the ribosomal binding site (RBS) and its spacer (the sequence between the rightmost nucleotide of the RBS and the first nucleotide of the start codon). Parameters of both models were determined from the multiple alignment of the nucleotide sequences situated upstream of the predicted gene starts, with the alignment constructed by the Gibbs Motif Sampler (25). For large enough genomes of viruses of eukaryotic hosts, parameters of a model for the Kozak pattern associated with the translational initiation site were determined by GeneMarkS with yet another modification. This GeneMarkS version allowed the use of the Kozak model for gene start prediction. Further modifications were done to adjust the program to different types of viral genome organization. Since a linear viral genome cannot have a partial coding region at either terminus, a specific restriction imposed at the program initialization stage excluded this possibility. Conversely, an additional post-processing step was implemented for circular viral genomes to detect genes possibly divided by the split point chosen in the original annotation. For the single-stranded RNA (ssRNA) positive strand viruses whose genes are located in one strand only, an additional procedure identified the strand where gene predictions clustered predominantly and the opposing strand was assigned as completely non-coding. For every viral genome the training procedure had to determine whether the sequence data were only sufficient for obtaining heuristic models or if a full training cycle of GeneMarkS could be initiated. If GeneMark.hmm with the initially defined heuristic models predicted fewer than a certain number of genes, Nr, then the procedure stopped and these initial predictions were not refined further. Otherwise, the full cycle of GeneMarkS training was initiated. The number 50 was assigned as the default Nr number. 
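
The two decision steps described above (heuristic models only versus a full self-training cycle, and the single coding strand of positive-strand ssRNA viruses) can be sketched as follows. This is a hedged illustration, not the GeneMarkS code: the function names are hypothetical, predictions are assumed to be `(start, end, strand)` tuples, and "clustered predominantly" is interpreted here as a simple majority count, which the text does not specify.

```python
DEFAULT_NR = 50  # default value of the threshold Nr given in the text

def choose_training_mode(n_initial_predictions, nr=DEFAULT_NR):
    """Return 'heuristic' if the run with heuristic models predicted fewer
    than `nr` genes (predictions are then kept as they are), otherwise
    'self-training' (the full iterative cycle is started)."""
    return "heuristic" if n_initial_predictions < nr else "self-training"

def dominant_strand(predictions):
    """Strand ('+' or '-') carrying the majority of predictions.
    Assumption: majority by count; ties go to '+'."""
    plus = sum(1 for _, _, strand in predictions if strand == "+")
    minus = len(predictions) - plus
    return "+" if plus >= minus else "-"

def keep_dominant_strand(predictions):
    """Drop predictions on the minority strand, which is then treated as
    completely non-coding (positive-strand ssRNA genomes only)."""
    strand = dominant_strand(predictions)
    return [p for p in predictions if p[2] == strand]
```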
In the training process, if several repetitive copies of some predicted protein-coding ORFs were identified, all copies but one were excluded from the training set of protein-coding regions to reduce bias in the protein-coding sequence model. Predicted ORFs longer than 500 nt that appeared in predicted intergenic regions were excluded from the set of non-coding regions to exclude possible ‘contamination’ of the non-coding training set. For viral genomes with a total size of predicted non-coding regions <10 kb, the training set of non-coding regions was augmented with an additional 10 kb sequence generated by the simplest multinomial model, simulating a sequence with the frequencies of the four nucleotides identical to those observed in the native non-coding region (26). The step-wise diagram of GeneMarkS self-training and gene prediction for the genome of a virus of prokaryotic host is shown in Figure 1. For a virus of a eukaryotic host, a reference to the Kozak model should replace the reference to the RBS model. The evaluation of the RBS model fitness was done by assessing both the variance of the RBS signal localization and the information content of the RBS model derived by the Gibbs Sampler. The Kozak model was evaluated in a similar manner. The self-training procedure was terminated as soon as two subsequent iterations produced the same gene predictions. However, in some cases exact convergence was not achieved due to small cyclic variations observed in subsequent iterations. In these cases the self-training was stopped and the reported sequence parse into coding and non-coding regions was the one with the larger number of predicted genes.
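
The augmentation of a small non-coding training set can be sketched in a few lines. The example below assumes that "the simplest multinomial model" means drawing each nucleotide independently with the frequencies observed in the native non-coding regions; the function name, the fixed random seed and the list-of-strings representation are illustrative assumptions.

```python
import random
from collections import Counter

def augment_noncoding(noncoding_regions, extra_len=10_000, min_total=10_000, seed=0):
    """Return the non-coding training set, extended by a simulated sequence
    when the native regions total fewer than `min_total` nucleotides.

    The simulated sequence is drawn position by position (zero-order
    multinomial model) with the A/C/G/T frequencies of the native regions.
    """
    native = "".join(noncoding_regions).upper()
    if len(native) >= min_total:
        return list(noncoding_regions)
    counts = Counter(base for base in native if base in "ACGT")
    if not counts:
        raise ValueError("no A/C/G/T found in the non-coding regions")
    bases = sorted(counts)
    weights = [counts[base] for base in bases]
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    simulated = "".join(rng.choices(bases, weights=weights, k=extra_len))
    return list(noncoding_regions) + [simulated]
```

A simulated sequence of this kind preserves only single-nucleotide composition, so it adds volume to the non-coding model without introducing any higher-order signal.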
Figure 1

Flowchart of the statistical gene identification procedure applied to a complete genome of a virus of a prokaryotic host. For viruses of eukaryotic hosts, the Kozak model is used instead of the RBS model.

Assessment of the accuracy of computer gene prediction is a critically important issue. To characterize errors of two sorts, false positive and false negative, we used two parameters of accuracy, sensitivity and specificity. The value of sensitivity (Sn) is defined as the ratio of the number of true predictions to the number of genes in a test set. The fewer the number of false negatives, the higher the sensitivity. The value of specificity (Sp) is defined as the ratio of the number of true predictions to the total number of predictions made. The fewer the number of false positives, the higher the specificity. To determine sensitivity and specificity values for a particular gene prediction method, one needs a test set of nucleotide sequences with experimentally verified genes. To further define the terms we say that a gene is ‘detected’ if its 3′ end coincides with the 3′ end of a verified one. Additionally, a gene is ‘predicted exactly’ if the positions of both ends coincide with the verified gene ends. The accuracy of ‘exact prediction’ in our terms is the same as the accuracy of the ‘gene start prediction’. This value is defined by the fraction of ‘exactly predicted genes’ among ‘detected’ genes. The BLAST searches used to characterize newly predicted proteins were conducted using standard parameters: BLOSUM62; penalty for gap ‘10’; penalty for gap extension ‘1’; low-complexity filtering ‘on’. In PSI-BLAST searches, the parameters were the same with the exception that the low-complexity filtering was ‘off’.
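
The accuracy measures defined above translate directly into code. The sketch below assumes that a 'true prediction' is a 'detected' gene (matching 3′ end), that genes are given as `(left, right, strand)` tuples, and that no two genes in a set share a 3′ end; the function names and the toy coordinates are hypothetical.

```python
def three_prime(gene):
    """3' end of a gene given as (left, right, strand)."""
    left, right, strand = gene
    return (right, strand) if strand == "+" else (left, strand)

def accuracy(predicted, verified):
    """Return (Sn, Sp, start accuracy) as fractions.

    Sn = detected / verified genes, Sp = detected / all predictions,
    start accuracy = exactly predicted / detected genes.
    """
    verified_by_end = {three_prime(g): g for g in verified}
    detected = [p for p in predicted if three_prime(p) in verified_by_end]
    exact = [p for p in detected if verified_by_end[three_prime(p)] == p]
    sn = len(detected) / len(verified)
    sp = len(detected) / len(predicted)
    start_acc = len(exact) / len(detected) if detected else 0.0
    return sn, sp, start_acc

# Toy data (assumption): three of four verified genes are detected,
# and one of the three detected genes also has the correct start.
verified = [(100, 400, "+"), (500, 900, "+"), (1000, 1300, "-"), (1500, 1800, "+")]
predicted = [(100, 400, "+"), (530, 900, "+"), (1000, 1270, "-"), (2000, 2300, "+")]
sn, sp, start_acc = accuracy(predicted, verified)
```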

RESULTS AND DISCUSSION

The overall statistics of the results of our analysis of complete viral genomes from GenBank is shown in Table 1. Our major focus here is on the genomes from the RefSeq collection. Those 86 viral genomes that had previously been reannotated in RefSeq with the aid of our analysis were excluded from our comparisons.
Table 1.

Summary of the results of the analysis of viral genomes currently available in GenBank and those viral genomes for which reference sequences (RefSeq collection) have already been created at NCBI

Category | GenBank^a total | RefSeq total | RefSeq, eukaryotic hosts | RefSeq, prokaryotic hosts
Database summary
Number of viral genomes analyzed | 1750 | 1107 | 1015 | 92
Prediction and annotation comparison
Exact match between prediction and annotation | 15703 | 10425 | 8011 | 2414
Predicted gene differs in start location from annotated one | 1479 | 931 | 368 | 563
Predicted gene overlaps with an intron-containing annotated gene | 382 | 209 | 190 | 19
Annotated gene was not predicted (possible false negative) | 3885 (25%)^b | 2720 (26%)^c | 2231 (28%)^c | 489 (20%)^c
Newly predicted genes (possible false positive) | 3520 (22%)^b | 1360 (13%)^c | 1047 (13%)^c | 313 (13%)^c
Analysis of newly predicted genes
Prediction has a BLASTP and CD-Search hit with E-value <0.005 | 622 | 99 | 89 | 10
Prediction has a BLASTP hit with E-value <0.005, no CD-Search hit | 1248 | 336 | 243 | 93
Prediction has a CD-Search hit with E-value <0.005, no BLASTP hit | 35 | 6 | 6 | 0
Prediction has no BLASTP or CD-Search hit with E-value <0.005 | 1615 | 919 | 709 | 210

The numbers in the RefSeq columns do not reflect 86 genomes annotated in RefSeq with the aid of the VIOLIN data. Newly predicted genes have been further analyzed by BLASTP and these results are shown in the bottom rows.

^a The GenBank records used in the current analysis did not include RefSeq records; however, the original records for each RefSeq record were included in this GenBank set of genomes.

^b The percentage value is defined with regard to the number of predicted genes exactly matching the annotation in GenBank.

^c The percentage value is defined with regard to the number of predicted genes exactly matching the annotation in RefSeq.

As shown in the RefSeq section of Table 1, 8011 protein-coding genes predicted in 1015 complete genomes of viruses of eukaryotic hosts matched the earlier annotation exactly. However, 1047 gene predictions did not match any previously annotated gene, and for 332 out of these 1047 new predictions, hits to known proteins with E-values <10⁻⁵ were found by BLASTP search (27). Interestingly, 135 out of these 332 similarity search supported predictions overlapped with annotated genes but the reading frames were different. A rather large number of 2231 genes in the RefSeq annotated genomes of viruses of eukaryotic hosts were not confirmed by our analysis. In 92 RefSeq phage genomes, 2414 gene predictions matched the existing annotation exactly. There were 313 entirely new predictions, and 103 of them were corroborated by the BLASTP search with hits to known proteins (E-value <10⁻⁵). Again, approximately one-third of predictions corroborated by the similarity search (36 out of 103) overlapped already annotated genes with different reading frames. Our analysis did not confirm 489 genes annotated in phage genomes from the RefSeq collection. Those 2720 (2231 + 489) genes that were annotated in the RefSeq viral genomes but were not predicted in this study are of special interest. Subsequent BLASTP searches of the protein products of these genes against the non-redundant database detected similarity to other known proteins only for 848 out of the 2231 genes annotated in genomes of viruses of eukaryotic hosts and for 137 out of the 489 genes annotated in phages. Overall, we arrived at 985 as the total number of genes not predicted by the ab initio method even though these annotated genes had significant similarity with other known proteins. Therefore, given the total of 14 076 genes annotated in 1107 viral genomes, the false negative rate of the ab initio prediction method might be estimated at <10%.
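
The arithmetic behind the "<10%" estimate can be checked with a few lines of Python. The counts are taken from the paragraph above; the variable and function names are illustrative assumptions.

```python
# Annotated genes that were missed by the ab initio method although their
# products show significant similarity to known proteins (counts from the text).
missed_with_similarity = {"eukaryotic hosts": 848, "prokaryotic hosts": 137}
annotated_total = 14076  # genes annotated in the 1107 RefSeq genomes

def false_negative_rate(missed, total):
    """Fraction of annotated genes counted as likely false negatives."""
    return sum(missed.values()) / total

rate = false_negative_rate(missed_with_similarity, annotated_total)
print(f"{sum(missed_with_similarity.values())} missed genes, rate = {rate:.1%}")
# prints "985 missed genes, rate = 7.0%"
```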
Interestingly, in 620 RefSeq viral genomes no annotated gene was missed in predictions. As is indicated in Table 1, analysis of the original GenBank genomic records produced a larger fraction of newly predicted genes than determined in the genomes from the RefSeq collection. In turn, a larger fraction (28%) of these new genes produced significant BLASTP hits in comparison with the fraction of new genes in RefSeq (10%) supported by BLASTP search. The gene prediction results for the RefSeq complete viral genomes were grouped together by virus length and type (Tables 2 and 3). Interestingly, a large number of new genes were identified in genomes shorter than 10 kb (891 genomes). For example, in the 8454 nt long genome of single-stranded DNA (ssDNA) enterobacteria phage IF1 (NC_001954) we identified a new 192 nt long gene coding for a homolog of Vibrio cholerae RasR protein. In contrast to all other known genes of this phage, this new gene was located in the DNA strand complementary to the ssDNA present in the virion. The largest numbers of newly identified genes or genes with new start predictions turned out to reside in 193 genomes of double-stranded DNA (dsDNA) viruses and 418 genomes of positive-strand ssRNA viruses (Table 3).
Table 2.

Distribution of the results of the comparative analysis of gene prediction and annotation for viral genomes from the RefSeq collection with the three sets of viruses clustered by genome length

Category^a | L < 10 000 nt^b (891)^c | 10 000 ≤ L ≤ 100 000 nt^b (169)^c | L > 100 000 nt^b (47)^c
Exact match | 1772 | 2493 | 6160
Different start | 225 (12.7%) | 483 (19.4%) | 223 (3.6%)
Overlap with interrupted gene | 79 (4.5%) | 43 (1.7%) | 87 (1.4%)
Annotated gene not predicted | 731 (41.3%) | 499 (20.0%) | 1490 (24.1%)
New predictions | 331 (18.7%) | 350 (14.0%) | 679 (11.0%)
Analysis of newly predicted genes
BLASTP and CD-Search hit | 26 | 34 | 39
BLASTP only hit | 51 | 104 | 181
CD-Search only hit | 1 | 0 | 5
No hits | 253 | 212 | 454

^a The meaning of the categories in this column is the same as in the left-most column in Table 1.

^b The genome length is designated as L.

^c The number in parentheses designates the number of genomes of a given category.

Table 3.

Distribution of the results of the comparative analysis of gene prediction and annotation for viral genomes from the RefSeq collection joined in classes defined by viral classification

Category^a | dsDNA (193)^b | ssDNA (185)^b | dsRNA (127)^b | ssRNA positive strand (418)^b | ssRNA negative strand (82)^b | Retroid (65)^b | Satellite (27)^b | Virus not classified (6)^b | Phage not classified (3)^b
Exact match85324401427502521511212132
Different start64456511536320142
Overlap with interrupted gene1252451324000
Annotated gene not predicted20532751224532454649
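New predictions | 1025 | 88 | 54 | 72 | 32 | 53 | 4 | 3 | 29
Analysis of newly predicted genes
BLASTP and CD-Search hit | 79 | 2 | 0 | 5 | 0 | 13 | 0 | 0 | 0
BLASTP only hit | 279 | 21 | 6 | 12 | 3 | 8 | 1 | 0 | 6
CD-Search only hit | 5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
No hits | 662 | 65 | 48 | 54 | 29 | 32 | 3 | 3 | 23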

^a The meaning of the categories in this column is the same as in the left-most column in Table 1.

^b The number in parentheses designates the number of genomes of a given category.

Quite a few new predictions among those that had no BLASTP search support were found to overlap already annotated genes. This occurred 274 times (20% of newly predicted genes) in the RefSeq genomes. In 117 of these cases the product of the annotated gene showed similarity to a protein in another species. Nevertheless, the fact of overlap does not indicate a likely false positive prediction per se. Gene overlap is a quite frequent phenomenon in viral genomes, as 52% of viral genes annotated in RefSeq overlap each other. Ideally, the characteristics of gene prediction accuracy, sensitivity and specificity (defined in Methods), should be determined for a test set of sequences containing experimentally verified genes. However, any given viral genome, except perhaps several of tiny size, would not have a large fraction of genes annotated experimentally. For this reason, we have compiled sets of so-called ‘trustable’ genes and used them as the test sets. For instance, in nine genomes of human herpesviruses (Table 4) we identified as trustable the genes both annotated and ab initio predicted. Also, we included in this category those genes that were either annotated or predicted and possessed additional ‘extrinsic’ evidence for being a real gene. This could be an experimentally characterized function or statistically significant sequence similarity to previously characterized proteins. For this compiled set of trustable genes of human herpesviruses, we obtained the average values of Sn = 91% and Sp = 84% as the estimates of the accuracy of our method.
Table 4.

Gene prediction accuracy assessment for nine human herpesviruses

Virus | Number of genes predicted | Number of genes annotated | Number of genes in test set | Number of correct predictions | Prediction sensitivity (%) | Prediction specificity (%)
HHV-1 (HSV-1) | 76 | 73 | 75 | 69 | 92 | 90
HHV-2 (HSV-2) | 77 | 71 | 71 | 65 | 92 | 84
HHV-3 (VZV) | 72 | 71 | 71 | 69 | 97 | 96
HHV-4 (EBV) | 90 | 94 | 78 | 70 | 89 | 78
HHV-5 (HCMV) | 164 | 198 | 148 | 125 | 84 | 76
HHV-6A | 115 | 121 | 119 | 104 | 87 | 90
HHV-6B | 114 | 91 | 85 | 81 | 95 | 71
HHV-7 | 109 | 107 | 104 | 90 | 87 | 83
HHV-8 (KSHV) | 96 | 82 | 88 | 83 | 94 | 86
Total | 913 | 908 | 839 | 767 | 91 | 84

The test set was compiled as explained in text.
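
The compilation of the 'trustable' test set and the percentages in the Total row of Table 4 can be sketched as set operations. Gene identifiers, function names and the toy sets are hypothetical; only the totals 913, 839 and 767 are taken from Table 4.

```python
def trustable_genes(annotated, predicted, extrinsic_support):
    """Genes that are both annotated and predicted, plus genes that are
    either annotated or predicted and also have extrinsic evidence."""
    annotated, predicted = set(annotated), set(predicted)
    supported = (annotated | predicted) & set(extrinsic_support)
    return (annotated & predicted) | supported

def sn_sp(predicted, test_set):
    """Sensitivity and specificity of `predicted` against the test set."""
    predicted, test_set = set(predicted), set(test_set)
    correct = predicted & test_set
    return len(correct) / len(test_set), len(correct) / len(predicted)

# Toy example with hypothetical gene identifiers.
annotated = {"g1", "g2", "g3"}
predicted = {"g2", "g3", "g4", "g5"}
test_set = trustable_genes(annotated, predicted, extrinsic_support={"g1", "g4"})

# Total row of Table 4: 767 correct predictions, 839 test genes, 913 predictions.
total_sn = round(100 * 767 / 839)
total_sp = round(100 * 767 / 913)
```

Since genes confirmed only by the prediction itself enter the test set, the resulting Sn and Sp values are best read as estimates rather than as measurements against fully independent evidence.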

Length comparison between newly predicted genes and genes annotated but missed in predictions indicated that the newly predicted genes tend to be shorter than the ones supposedly real but missed in predictions (Fig. 2). The ratio of newly predicted genes to missed genes decreased from 3.81 for genes shorter than 300 nt to 0.49 for genes longer than 300 nt. This observation seems to be related to a preference in the original records to have longer ORFs annotated as genes. The longer ORFs are generally assumed to be more likely to be real genes while ORFs shorter than 300 nt are difficult to discriminate from random non-coding ORFs and are more risky to annotate as genes. This conventional wisdom could lead to over-annotation of ORFs longer than 300 nt as genes while some short genes could be missed. As Figure 2 shows, many ‘long’ annotated genes were indeed not confirmed while quite a few new ‘short’ genes were predicted.
Figure 2

Length distributions of several categories of genes predicted or annotated in 1107 RefSeq viral genomes. Dark gray bars are used for genes annotated but not predicted; light gray bars are used for predicted but not annotated genes whose protein products produce BLASTP hits with E-values <10⁻⁵; white bars are used for predicted but not annotated genes whose protein products do not produce BLASTP hits with E-values <10⁻⁵.

Assessing and improving the gene start prediction accuracy is another important issue. As described above, for more precise gene start prediction we used the RBS model for long enough viruses of prokaryotic hosts and the Kozak model for viruses of eukaryotic hosts. To give an example, the positional frequency matrices of RBS models specific for phage T4 and phage λ are visualized in ‘logo’ images (28) in Figure 3b and c. Notably, these images emphasize the similarity of the nucleotide frequency patterns existing in the RBS of phages to the pattern known for E.coli (Fig. 3a). This observation could be expected given that T4 and λ use the E.coli translational mechanism. While the positional frequency matrix of the RBS model has a fixed length and variable pattern of positional frequencies, the model of the RBS spacer allows for sequences of variable lengths (distances between RBS and start codon) with an invariant positional frequency pattern of the non-coding region.
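
A positional frequency matrix and the per-position information content displayed in a sequence logo can be computed as in the following sketch. It assumes gap-free aligned sites of equal length, a uniform background and no small-sample correction; the example sites are invented for illustration and are not taken from the T4 or λ models.

```python
import math

def position_frequency_matrix(sites, pseudocount=0.0):
    """Column-wise nucleotide frequencies for aligned sites of equal length."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("aligned sites must have equal length")
    matrix = []
    for pos in range(width):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos].upper()] += 1  # KeyError on non-ACGT symbols
        total = sum(counts.values())
        matrix.append({base: counts[base] / total for base in "ACGT"})
    return matrix

def information_content(matrix):
    """Information per position in bits: 2 + sum(p * log2(p))."""
    return [
        2.0 + sum(p * math.log2(p) for p in column.values() if p > 0)
        for column in matrix
    ]

# Hypothetical RBS-like sites (assumption, for illustration only).
sites = ["AGGAGG", "AGGAGA", "AGGTGG", "AAGAGG"]
ic = information_content(position_frequency_matrix(sites))
```

Fully conserved positions reach 2 bits, while positions with mixed nucleotides score lower, which is how a logo makes a weaker signal visible as shorter letter stacks.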
Figure 3

The positional nucleotide frequency patterns of the GeneMarkS models of the RBS pattern for phage T4 (b) and phage λ (c) are shown in the logo form (28), as compared with the RBS pattern of E.coli shown in (a). Similarly, the Kozak patterns for human herpesvirus 4 (e) and human herpesvirus 8 (f) are shown in the logo form, with the Kozak pattern for human genes shown in (d).

The logos for the Kozak model determined for the Epstein–Barr virus (HHV4) and for Kaposi’s sarcoma herpesvirus (HHV8) shown in Figure 3e and f clearly indicate that the information content of these signals is lower than that of RBS. However, the Kozak patterns observed in these viruses are still similar to the Kozak pattern known for the genome of the human host (Fig. 3d). Accurate evaluation of the gene start prediction accuracy requires a set of genes with experimentally verified gene starts. Evaluation of GeneMarkS performance was done earlier on the test set of E.coli genes with 5′ ends verified by sequencing of N-terminals of encoded proteins (29). In this test the accuracy of start prediction was observed to be as high as 94% (5). A comparison of predictions for phage T4 both with and without the use of the RBS model was carried out (Supplementary Material, Table 1). This comparison showed that predictions made with the use of the RBS model made an almost 10% better match with the annotation, which we consider sufficiently accurate for this well studied phage genome. Considering viruses of eukaryotic hosts, we compiled a set of genes from nine human herpesviruses with translation starts confirmed by similarity search on a protein level. The 5′ end of the protein having the highest BLASTP hit (excluding one or several self hits) was compared with the 5′ end of the query protein to assess the accuracy of the gene start prediction. After selection of the most unambiguous cases, we obtained an estimate of the accuracy of start prediction as 85% (Supplementary Material, Table 2). The whole set of newly predicted genes was used further to search for similarity and reconstruct possible orthologous relationships. A database of 1360 newly predicted proteins was compiled and was cross-searched using BLASTP. 
We found that 237 predicted proteins had some similarity to other members in the database and could be further grouped into 106 protein clusters (Supplementary Material, Table 3). Some of these clusters show highly conserved regions; for instance, a cluster of protein products of new genes identified in poxviruses. Now we take a closer look at several individual gene predictions. In the well studied genome of Bacteriophage λ (J02459) we identified as many as five new genes. These genes have already been included in the RefSeq version of the phage λ annotation (NC_001416). Two genes, coding for a putative envelope protein (NP_597781) and Bor protein precursor (NP_597780), are similar to genes in prophage CP-933X, which is part of the E.coli O157 genome (NC_002655). A gene for superinfection exclusion protein B (NP_597779) must have been known for some time since its protein product had been included into the PIR database (P03762). The other two genes were classified as hypothetical. Our predictions of 16 new genes in Porcine adenovirus A (NC_001997) were corroborated by similarity search. For instance, the protein encoded by predicted ORF6 is a member of a family of DNA polymerases present in 39 other adenoviruses. A potentially important finding was a gene located in positions 10443–11138 of the genome of Alcelaphine herpesvirus 1 (NC_002531) coding for a 231 amino acid long putative protein (NP_597933). Initially, the new protein was shown to be similar to the uncharacterized putative protein ORF E4 (NP_042601, AAC13792) of unclassified γ-herpesvirus Equine herpesvirus 2. A subsequent PSI-BLAST search revealed a striking similarity between these two proteins and recently discovered antagonists of the lymphocryptovirus antiapoptotic BCL-2 proteins (30).
Later, the sequence of a third non-lymphocryptovirus protein, the hypothetical v-BCL2 of another unclassified γ-herpesvirus (Porcine lymphotropic herpesvirus 1), was released (31), and we found its sequence to be very similar to the newly identified protein (NP_597933). The PSI-BLAST search profile built from the three proteins further identified similarity with the ORF1 protein of Callitrichine herpesvirus 3 (a lymphocryptovirus BALF1-like, BCL-2-like protein) and with the BALF1 protein (AAK01916) of Herpesvirus papio (a lymphocryptovirus), with E-values of 8 × 10⁻⁴ and 0.007, respectively. This range of E-values has been characterized as indicative of significant sequence similarity (32,33). The output of the third iteration of PSI-BLAST included all the BALF1-like proteins at the top of the list. Human GRS protein and other BCL-2-like non-viral proteins were also present in the list at a substantial score distance. In the next round of analysis, RPS-BLAST (the NCBI program that compares protein sequences with the Conserved Domain Database) readily detected a BCL motif in all three non-lymphocryptovirus proteins. Moreover, multiple alignment by hierarchical clustering (34) of the newly predicted protein (NP_597933) with proteins NP_042601, AAM22111 and all the lymphocryptovirus BALF1 proteins (Fig. 4) further supported the probable functional significance of the observed pairwise similarity by making evident the patterns of amino acids conserved in all sequences. Interestingly enough, a TBLASTN search failed to reveal additional un-annotated homologs of NP_597933. It is tempting to speculate that, given the function of BALF1 (30), the newly identified BALF1-like protein may be involved in a complex regulation of host cell apoptosis, presumably as an antagonist of the herpesvirus antiapoptotic BCL-2 proteins, and, perhaps, as a part of a gene network involved in carcinogenesis.
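The grouping of the 237 mutually similar predicted proteins into 106 clusters, reported earlier in this section, can be illustrated with a small sketch. The text does not state which clustering rule was used; the code below assumes single-linkage grouping (any reported BLASTP hit joins two proteins into the same cluster), and the function name and toy identifiers are hypothetical.

```python
from collections import defaultdict


def cluster_by_hits(proteins, hits):
    """Single-linkage clustering (an assumed rule): proteins connected by
    any chain of reported hits end up in the same cluster."""
    parent = {p: p for p in proteins}

    def find(p):
        # Walk up to the cluster representative, shortening the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for query, subject in hits:
        if query != subject:  # self hits carry no clustering information
            parent[find(query)] = find(subject)

    groups = defaultdict(set)
    for p in proteins:
        groups[find(p)].add(p)
    # Singletons had no similar partner in the database, so they are dropped.
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Toy input: identifiers and hit pairs are made up for illustration.
proteins = ["p1", "p2", "p3", "p4", "p5"]
hits = [("p1", "p2"), ("p2", "p3"), ("p4", "p4")]
clusters = cluster_by_hits(proteins, hits)
```

With a stricter rule (for example, requiring reciprocal hits or an E-value cutoff) the same input could produce more, smaller clusters.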
Figure 4

MultAlin alignment of (putative) BALF1-like proteins (34). The variable N- and C-termini are shown in lower case. Protein names are abbreviated as follows: AHV-1 BALF1, BALF1 homolog (NP_597933) predicted by GeneMarkS in the genome of Alcelaphine herpesvirus 1 (NC_002531); PLHV-1 vbcl2, Porcine lymphotropic herpesvirus 1 hypothetical v-bcl2 (AAM22111); CALHV-3 ORF1, Callitrichine herpesvirus 3 ORF1 (AAK38208); HVP BALF1, Herpesvirus papio BALF1 (AAK01916); PoHV-3 BALF1, Pongine herpesvirus 3 BALF1 (AAK60342); HHV-4 BALF1, Human herpesvirus 4 BALF1 (NP_039912); PoHV-1 BALF1, Pongine herpesvirus 1 (AAK01917); CeHV-15 BALF1, Cercopithecine herpesvirus 15 (AAK95480); EHV-2 ORF E4, Equine herpesvirus 2 ORF E4 protein (NP_042601). The conserved positions are color coded based on the type of amino acid residue as indicated in the consensus line, where h and a stand for hydrophobic residues (A, C, F, I, L, M, V, W, Y: yellow background in alignment) and for aromatic residues (F, Y, W), respectively; b stands for ‘large’ residues (E, K, R, I, L, M, F, Y, W: gray background); p stands for polar residues (D, E, H, K, N, Q, R, S, T: shown in pink); s and u stand for small residues (A, C, S, T, D, N, V, G, P: green background) and tiny residues (G, A, S), respectively; c and + stand for charged residues (K, R, D, E, H: shown in pink) and positively charged residues (K, R), respectively. Invariant amino acid residues (in 85% or more sequences) are highlighted with black background.
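The consensus symbols defined in this caption can be made concrete with a short sketch. The residue classes and the 85% threshold are taken from the caption; the rule for choosing among several qualifying classes (smallest class first) and the treatment of gaps are assumptions made here for illustration, not a description of how the figure was actually produced.

```python
from collections import Counter

# Residue classes exactly as listed in the Figure 4 caption.
CLASSES = {
    "+": set("KR"),         # positively charged
    "a": set("FYW"),        # aromatic
    "u": set("GAS"),        # tiny
    "c": set("KRDEH"),      # charged
    "h": set("ACFILMVWY"),  # hydrophobic
    "b": set("EKRILMFYW"),  # 'large'
    "p": set("DEHKNQRST"),  # polar
    "s": set("ACSTDNVGP"),  # small
}


def consensus_symbol(column, threshold=0.85):
    """Return a consensus symbol for one alignment column.

    Assumptions: gaps ("-") count as non-matching sequences, and when
    several classes qualify the smallest (most specific) one is reported.
    """
    column = column.upper()
    if not column:
        return "."
    total = len(column)
    residue, count = Counter(column).most_common(1)[0]
    if residue != "-" and count / total >= threshold:
        return residue  # a single residue dominates the column
    for symbol, members in sorted(CLASSES.items(), key=lambda kv: len(kv[1])):
        if sum(r in members for r in column) / total >= threshold:
            return symbol
    return "."  # no residue or class reaches the threshold
```

For example, a column of alternating K and R would be reported as `+`, while a column mixing I, L, V, M, A and C would be reported as `h`.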

Another interesting new finding was a gene (ORF65) predicted in the genome of Epstein–Barr virus (HHV-4, NC_001345). Initially, the protein product of this gene was found to be significantly similar (with an E-value of <10⁻⁵) to uncharacterized ORF26/ORF35 proteins of other γ-herpesviridae. The subsequent PSI-BLAST search revealed, after four iterations, a similarity (with an E-value of 6 × 10⁻⁴) to the ORF26/ORF35 protein family and the ORF48 protein of Equine herpesvirus 4, an α-herpesvirus. The ORF48 protein belongs to the UL14 family of proteins, which are present in a minor component of the virion tegument and possess heat shock protein-like functions (35). Eight further PSI-BLAST iterations brought up all the members of this family. Multiple alignment of the ORF26/ORF35 and UL14-like protein sequences (Fig. 5) highlights common features that could not be readily seen in pairwise alignments, in particular similar patterns of distribution of charged residues. The observed sequence similarity strongly indicates a common function, which remains to be determined by direct experiments. It is likely that these proteins play an important role, since the members of the ORF26/ORF35 protein family are now confirmed to be present in all complete genomes of γ-herpesviruses. Interestingly, none of the β-herpesvirus genomes has a TBLASTN-detectable homolog of ORF26/ORF35 or UL14, which indicates that ORF26/ORF35 proteins are likely to fulfill a subfamily-level function.
Figure 5

Alignment of the sequences of ORF26/ORF35 and UL14-like proteins. For most sequences, the N- and C-termini are not shown. The coloring is as in Figure 4. The protein gi numbers and the organism names are: HHV-4 GeneMark_65 prediction (positions 1–139) (Human herpesvirus 4); SaHV-2 ORF35 (1–147), 9625991 (Saimiriine herpesvirus 2); HHV-3 MTP (minor tegument protein, positions 11–159), 9625920 (Human herpesvirus 3); CeHV-7 unknown (11–159), 13242439 (Cercopithecine herpesvirus 7); AtHV-3 ORF35 (1–147), 9631227 (Ateline herpesvirus 3); EHV-4 ORF48 (7–155), 9629775 (Equine herpesvirus 4); BoHV-4 unknown (4–150), 13095612 (Bovine herpesvirus 4); RRV unknown (3–146), 18653842 (Rhesus rhadinovirus, Macaca mulatta rhadinovirus); HHV-1 UL14 (7–151), 9629394 (Human herpesvirus 1); HHV-2 UL14 (7–155), 9629283 (Human herpesvirus 2); HHV-1 (HSV1/17) UL14 (3–155), 136823 [Herpes simplex virus (type 1/strain 17)]; EHV-1 ORF48 (7–155), 9626785 (Equine herpesvirus 1); EHV-2 ORF35 (5–150), 9628038 (Equine herpesvirus 2); CalHV-3 ORF26 (3–148), 13676668 (Callitrichine herpesvirus 3); HHV-8 ORF35 (3–147), 18846002 (Human herpesvirus 8); PLHV-1 unknown (3–149), 20453822 (Porcine lymphotropic herpesvirus 1); AlHV-1 ORF35 (2–148), 10140956 (Alcelaphine herpesvirus 1); GaHV-2 UL14 (19–161), 9635049 (Gallid herpesvirus 2); MeHV-1 UL14 MTP (13–156), 12084842 (Meleagrid herpesvirus 1); GaHV-3 UL14 (8–156), 10834883 (Gallid herpesvirus 3); MuHV-4 unknown (3–149), 9629576 (Murid herpesvirus 4); PsHV-1 UL14 (15–163), 13094667 (Psittacid herpesvirus 1); BoHV-1 unknown (18–170), 9629861 (Bovine herpesvirus 1); GaHV-1 UL14 (62–210), 5708112 (Gallid herpesvirus 1); CHV unnamed (1–112, the entire sequence; appears to be incomplete), 1066253 (Canine herpesvirus); SuHV-1 UL14 (6–159, end of sequence), 267201 [Suid herpesvirus 1 (strain NIA-3)].

Some coding regions in viral genomes were missed in the earlier annotation because of their unusual organization. For instance, some viral genes contain a weak, read-through stop codon, which the original annotation treats as the end of the gene; thus, a part of the real gene (and protein) is missed. In Barmah Forest virus, a GeneMarkS prediction (ORF2) recovers the second part of the non-structural polyprotein gene in positions 5679–7298, missed in the original record U73745. Only after these two parts are combined does the protein (NC_001786) show full-length similarity to the complete polyprotein encoded, for instance, in Ross River virus.

The vast majority of genes in viral genomes have no introns. There are, however, a few genes with introns and even some with whole separate genes located inside introns, such as an IE glycoprotein gene, HCMVUL37, in Human herpesvirus 5 (NC_001347). Genes interrupted by introns were identified by GeneMarkS as series of separate protein-coding ORFs. For instance, in Enterobacteria phage T4 (introns may appear not only in viruses of eukaryotic hosts but in phages as well) a gene for the DNA topoisomerase small subunit protein (NC_000866) consists of two exons, both predicted by GeneMarkS as separate ORFs.

Developing an ab initio approach for exact prediction of introns in viral genes is a challenging problem. However, quite frequently the combination of data obtained by intrinsic and extrinsic methods allows the exon–intron structure to be delineated further by expert analysis. For instance, in the complete genome of Human adenovirus D (Human adenovirus type 17), GeneMarkS revealed 32 potential genes or gene fragments missed in the original annotation (AF108105). Only 11 of them appeared to be complete genes, while the other 21 predicted coding regions were manually assembled into nine genes in the RefSeq record (NC_002067).
The examples discussed above, in which new ab initio predictions were confirmed and functionally characterized by subsequent application of an extrinsic method, make it quite plausible that many not yet confirmed ab initio predictions will be supported extrinsically as more DNA and protein data become available. Still, the absence of similarity to known proteins may also indicate a unique protein whose expression and function can be established only by direct experiments.

The VIOLIN database

Newly defined genome annotations were compiled in the VIOLIN database (http://opal.biology.gatech.edu/GeneMark/VIOLIN/). This database currently has a flat text file architecture. Differences between the VIOLIN and GenBank annotations are visualized by color codes (Fig. 6). The VIOLIN web site provides hypertext links to the NCBI similarity search programs directly from a genome annotation record. For a gene exactly matching an already known one, the line citing its coordinates is linked to the original gene record in GenBank as well as to the BLink program, which provides up-to-date information on the protein product (the BLink program, ‘BLAST Link’, displays the prerecorded results of BLAST searches that have been done for every protein sequence in the Entrez proteins data domain). For a predicted gene with no exact or partial match to the previous annotation, links to the programs PSI-BLAST and RPS-BLAST allow one to proceed with further up-to-date characterization of the putative protein. Genes annotated in a GenBank record but not confirmed by our analysis are shown at the bottom of the VIOLIN record, with links to the BLink, PSI-BLAST and RPS-BLAST programs to help re-analyze the previously annotated genes.
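The categories used in a VIOLIN record (exact match, partial match, new prediction, and annotated gene not confirmed) can be sketched as a comparison of gene coordinates. The text does not define a partial match; the sketch below assumes it means the same strand and the same 3′ end with a different start, which is a common convention for gene finders. The `Gene` type, the function names and the toy coordinates are hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    start: int   # left coordinate, 1-based
    end: int     # right coordinate, 1-based, inclusive
    strand: str  # "+" for direct strand, "-" for complement


def stop_key(gene):
    # The 3' end is the right coordinate on the direct strand
    # and the left coordinate on the complementary strand.
    return gene.strand, gene.end if gene.strand == "+" else gene.start


def compare(predicted, annotated):
    """Label each prediction and list annotated genes left unconfirmed."""
    annotated_set = set(annotated)
    annotated_stops = {stop_key(g) for g in annotated}
    labels = {}
    for g in predicted:
        if g in annotated_set:
            labels[g] = "exact"
        elif stop_key(g) in annotated_stops:
            labels[g] = "partial"  # assumed meaning: same 3' end, different start
        else:
            labels[g] = "new"
    predicted_stops = {stop_key(g) for g in predicted}
    unconfirmed = [g for g in annotated if stop_key(g) not in predicted_stops]
    return labels, unconfirmed


# Toy coordinates, made up for illustration only.
predicted = [Gene(100, 399, "+"), Gene(450, 749, "+"), Gene(900, 1199, "-")]
annotated = [Gene(100, 399, "+"), Gene(480, 749, "+"), Gene(2000, 2299, "+")]
labels, unconfirmed = compare(predicted, annotated)
```

In this toy case the first prediction is an exact match, the second shares only its 3′ end with an annotated gene, the third is new, and the last annotated gene is not confirmed.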
Figure 6

Snapshot of a sample viral genome record as it appears at the VIOLIN web site.

VIOLIN has been regularly used by the NCBI curators to improve the annotation of viral genomes in the RefSeq collection (36). Gene predictions have been subjected to additional analysis and manual curation by NCBI staff for quality control and functional assignment. Some of the new findings that originally appeared in VIOLIN and that are now included in the annotations of 86 viral genomes in the RefSeq collection are shown in Table 5. For example, in Fowl adenovirus D (NC_000899) 14 proteins have been added to the 15 existing in the original GenBank record AF083975. This was a particularly difficult case because many of the newly added genes were disrupted by frameshifts that likely resulted from sequencing errors. The new tentative protein sequences were assembled from fragments predicted by GeneMarkS using the ORF Finder (R. Tatusov and T. Tatusova, unpublished results) and BLASTP searches. In another example, in Lymphocystis disease virus (NC_001824) 110 coding regions were identified, while the original GenBank record (L63545) contained only one gene for a major capsid protein.
Table 5.

Sample of the newly added RefSeq genes identified by the statistical gene finding methods described in this work

Prediction; Predicted length; Best BLASTP hit; BLASTP length; Score; E-value; Annotated function (rows are listed by group and genome)

Group: dsDNA

Alcelaphine herpesvirus 1 NC_002531
  10443–11138; 231; gi|9628007|; 183; 66.3; 4.00E–10; Putative BALF1 homolog
Amsacta moorei entomopoxvirus NC_002520
  complement (114621–114773); 50; gi|9629968|; 52; 65.6; 9.00E–11; Conotoxin-like protein
Ateline herpesvirus 3 NC_001987
  73911–75053; 380; gi|331012|; 384; 603; 1.00E–171; Immediate-early phosphoprotein (transactivator)
Avian adenovirus CELO NC_001720
  26793–27119; 108; gi|9633186|; 302; 95.6; 2.00E–19; Late 33 kDa protein
Bovine adenovirus 2 NC_002513
  10583–12295; 570; gi|13487865|; 573; 755; 0; Peripentonal hexon-associated protein
  12347–13783; 478; gi|13487866|; 471; 793; 0; Penton protein
  15888–16382; 164; gi|13487870|; 233; 201; 4.00E–51; Minor capsid protein VI precursor
  16628–19324; 898; gi|13487871|; 910; 1546; 0; Hexon protein
  21366–23579; 737; gi|13487873|; 722; 1004; 0; Hexon assembly-associated 100 kDa protein
  complement (30406–30735); 109; gi|13487881|; 245; 101; 3.00E–21; 245R protein homolog
  complement (30823–31383); 186; gi|13487880|; 253; 188; 5.00E–47; 253R protein homolog
Deer papillomavirus NC_001523
  3914–4048; 44; gi|137747|; 44; 85.4; 9.00E–17; E5 transforming protein
Equine herpesvirus 1 NC_001491
  complement (112994–113785); 263; gi|15235673|; 608; 179; 5.00E–44; Glycine-rich protein
Fowl adenovirus 8 NC_000899
  14583–16211; 542; gi|9628848|; 575; 799; 0; Peripentonal hexon associated protein
  complement (38665–40446); 593; gi|3845680|; 195; 381; 1.00E–104; Glycine-rich protein
Fowlpox virus NC_002188
  52914–54572; 552; gi|1083970|; 552; 1122; 0; Rifampicin resistance N3L protein
Human adenovirus type 2 NC_001405
  30444–30830; 128; gi|119063|; 128; 264; 5.00E–70; Early E3B protein
  complement (30852–31019); 55; gi|9626584|; 53; 143; 4.00E–09; U protein
  complement (35146–35532); 128; gi|119716|; 283; 246; 1.00E–64; E4 protein
Human adenovirus type 12 NC_001460
  25202–25558; 118; gi|9626562|; 211; 135; 2.00E–31; 33 kDa phosphoprotein
  complement (31183–31407); 74; gi|93525|; 74; 154; 1.00E–37; Early E4 17 kDa protein
Human adenovirus type 17 NC_002067
  560–1138; 192; gi|4323354|; 251; 316; 1.00E–85; Early E1A protein
  1491–2117; 208; gi|4323357|; 182; 377; 1.00E–104; Small T-antigen fragment
  2165–2533; 122; gi|4323358|; 495; 214; 5.00E–55; Small T-antigen fragment
  2530–2976; 148; gi|4323358|; 495; 301; 5.00E–81; Small T-antigen fragment
  3033–3359; 108; gi|4323358|; 495; 227; 4.00E–59; Small T-antigen fragment
  complement (3888–4499); 203; gi|130244|; 448; 408; 1.00E–113; IVa2 maturation protein
  complement (4501–4935); 144; gi|130244|; 448; 250; 8.00E–66; IVa2 maturation protein
  15724–15960; 78; gi|9626191|; 368; 74.5; 2.00E–13; V minor core protein
  16177–16713; 178; gi|9626570|; 358; 148; 4.00E–35; V minor core protein
  16798–16953; 51; gi|9626571|; 70; 74.5; 2.00E–13; L2 protein mu precursor
  17754–18065; 103; gi|780528|; 947; 161; 3.00E–39; Hexon capsid protein
  18068–20617; 849; gi|780528|; 947; 1595; 0; Hexon capsid protein
  complement (21293–21745); 150; gi|118737|; 517; 238; 3.00E–62; E2A DNA binding protein
  complement (21724–22503); 259; gi|118735|; 512; 341; 6.00E–93; E2A DNA binding protein
  23513–23779; 88; gi|209871|; 652; 99.8; 7.00E–21; Hexon assembly-associated protein
  23799–24956; 385; gi|9626180|; 805; 331; 1.00E–89; Hexon assembly-associated protein
  25472–25774; 100; gi|9626578|; 233; 129; 9.00E–30; pVIII protein
  27021–27494; 157; gi|1279435|; 166; 314; 7.00E–85; HLA-binding protein
  29892–30287; 131; gi|6940696|; 130; 264; 4.00E–70; E3B protein
  30280–30672; 130; gi|6940697|; 130; 272; 1.00E–72; E3B protein
  complement (30770–30919); 49; gi|9626584|; 53; 54.3; 2.00E–07; U protein
  complement (32308–32970); 220; gi|3913555|; 292; 464; 1.00E–130; E4 protein
  complement (33116–33478); 120; gi|1699394|; 120; 259; 2.00E–68; E4 protein
  complement (33481–33834); 117; gi|1699393|; 117; 243; 7.00E–64; E4 protein
  complement (33831–34058); 75; gi|1699392|; 130; 142; 5.00E–34; E4 protein
  complement (34266–34463); 65; gi|1699391|; 125; 132; 7.00E–31; E4 protein
Human herpesvirus 3 NC_001348
  10678–10905; 75; gi|13242466|; 87; 112; 9.00E–25; Membrane protein
Human herpesvirus 4 NC_001345
  503–805; 100; gi|330387|; 365; 160; 5.00E–39; Latent membrane protein
  1546–1680; 44; gi|330387|; 365; 85.8; 7.00E–17; Latent membrane protein
  166576–166920; 114; gi|126379|; 497; 257; 4.00E–68; Latent membrane protein
  complement (169031–169474); 147; gi|126373|; 386; 224; 6.00E–58; Latent membrane protein
Human herpesvirus 5 NC_001347
  160003–160173; 56; gi|7542409|; 176; 97.1; 3.00E–20; Interleukin-10-like protein
Human herpesvirus 6B NC_000898
  23343–23774; 143; gi|11346494|; 305; 300; 1.00E–80; G-protein coupled receptor
Human herpesvirus 7 NC_001716
  129708–129848; 46; gi|2746315|; 153; 101; 2.00E–21; Membrane glycoprotein
Human papillomavirus type 1a NC_001356
  812–2650; 612; gi|137646|; 612; 1251; 0; Replication protein E1
Human papillomavirus type 53 NC_001593
  892–1140; 82; gi|9627323|; 631; 125; 1.00E–28; Replication protein E1
  1391–1591; 66; gi|9627323|; 631; 104; 2.00E–22; Replication protein E1
Human papillomavirus type 56 NC_001594
  895–1149; 84; gi|9628585|; 630; 112; 1.00E–24; Replication protein E1
  1395–2804; 469; gi|9628585|; 630; 927; 0; Replication protein E1
Human papillomavirus type 71 NC_002644
  559–828; 89; gi|1491685|; 100; 99.8; 8.00E–21; Transforming protein E7
  3004–3858; 284; gi|9626037|; 383; 264; 1.00E–69; Regulatory protein E2
  4443–5783; 446; gi|13186281|; 524; 583; 1.00E–165; Minor capsid protein L2
  5776–7341; 521; gi|3845719|; 505; 689; 0; Late major capsid protein L1
Macaca mulatta rhadinovirus NC_003401
  70403–70888; 161; gi|13506781|; 234; 279; 2.00E–74; bZIP transcription factor
  71468–72160; 230; gi|13506783|; 275; 292; 4.00E–78; Glycoprotein R8.1
Murine adenovirus type 1 NC_000942
  2897–3175; 92; gi|209749|; 97; 187; 2.00E–47; Early E1A protein
  complement (29726–30076); 116; gi|9800520|; 810; 67.9; 6.00E–11; Tropoelastin
Ovine papillomavirus 1 NC_001789
  747–2624; 625; gi|9627078|; 611; 744; 0; Replication protein E1
  2611–3780; 389; gi|9627069|; 416; 379; 1.00E–104; Regulatory protein E2
  3780–3941; 53; gi|137747|; 44; 66.3; 5.00E–11; Transforming protein E5
  4268–5623; 451; gi|9627086|; 447; 445; 1.00E–124; Minor capsid protein L2
Ovine papillomavirus 2 NC_001790
  745–2628; 627; gi|9627078|; 611; 753; 0; Replication protein E1
  2615–3778; 387; gi|9627069|; 416; 369; 1.00E–101; Regulatory protein E2
  3778–3930; 50; gi|137747|; 44; 65.2; 1.00E–10; E5 protein
  4122–5615; 497; gi|9627086|; 477; 525; 1.00E–148; Minor capsid protein L2
Tupaia herpesvirus NC_002794
  complement (60731–61684); 317; gi|9845327|; 478; 120; 2.00E–26; US22 family protein
Vaccinia virus NC_001559
  complement (5422–5526); 34; gi|3096964|; 351; 67.1; 3.00E–11; TNF receptor II
  complement (6231–6377); 48; gi|3096965|; 586; 96.7; 4.00E–20; K1R protein (ankyrin repeat protein)
  76530–76721; 63; gi|11346541|; 63; 130; 3.00E–30; RNA polymerase
  162151–162264; 37; gi|401315|; 193; 58.9; 9.00E–09; Guanylate kinase
  183524–183640; 38; gi|3096966|; 672; 66.7; 4.00E–11; D4L protein (ankyrin repeat protein)
  185397–185507; 36; gi|3096965|; 586; 70.6; 3.00E–12; K1R protein (ankyrin repeat protein)
  186212–186316; 34; gi|3096964|; 351; 67.1; 3.00E–11; TNF receptor II

Group: ssDNA

Chloris striate mosaic virus NC_001466
  complement (1864–2376); 170; gi|137410|; 295; 348; 3.00E–95; Replication-associated protein
Periplaneta fuliginosa densovirus NC_000936
  complement (5134–5388); 84; gi|5689346|; 291; 83.1; 6.00E–16; Structural protein

Group: Phage

Bacteriophage bIL311 NC_002670
  2252–2464; 70; gi|15673928|; 68; 139; 4.00E–33; ps3 protein 14-like transcriptional regulator
Bacteriophage L5 NC_003695
  2–340; 112; gi|4098413|; 348; 216; 8.00E–56; Integrase
Bacteriophage lambda NC_001416
  34482–35036; 184; gi|140702|; 183; 374; 1.00E–103; Superinfection exclusion protein B
  complement (46459–46752); 97; gi|137520|; 97; 196; 5.00E–50; Bor protein precursor
  complement (47042–47575); 177; gi|16128541|; 150; 309; 2.00E–83; Putative envelope protein
Bacteriophage VT2-Sa provirus NC_000902
  complement (11467–11595); 42; gi|15830439|; 217; 59.7; 5.00E–09; c1 repressor protein
Chlamydia phage phiCPAR39 NC_002180
  1–147; 48; gi|9634956|; 84; 104; 2.00E–22; Non-structural protein
  4425–4532; 35; gi|9634956|; 84; 75.3; 1.00E–13; Non-structural protein
Enterobacteria phage HK022 virion NC_002166
  19015–20130; 371; gi|9634179|; 321; 270; 2.00E–71; Tail fiber protein
  complement (26155–26307); 50; gi|9634191|; 50; 104; 1.00E–22; kil protein
  32436–33047; 203; gi|15832758|; 188; 106; 3.00E–22; Endonuclease
  33876–34316; 146; gi|9910800|; 146; 294; 4.00E–79; Protein Nin B
  35667–36029; 120; gi|9634210|; 120; 244; 4.00E–64; Holliday-junction resolvase
Enterobacteria phage Mu NC_000929
  complement (33531–34064); 177; gi|96899|; 177; 360; 8.00E–99; Tail fiber assembly protein
  complement (34067–35053); 328; gi|96901|; 536; 678; 0; Tail fiber
Roseophage SIO1 NC_002519
  complement (39527–39826); 99; gi|9964612|; 271; 124; 3.00E–28; gp5-like protein
Streptococcus thermophilus bacteriophage 7201 NC_002185
  3148–3330; 60; gi|9634634|; 218; 116; 3.00E–26; Erf protein
Streptococcus thermophilus bacteriophage Sfi21 NC_000872
  37175–37687; 170; gi|9635004|; 167; 317; 6.00E–86; DNA binding protein
Sulfolobus Virus 1 NC_001338
  12585–13001; 138; gi|75696|; 144; 270; 7.00E–72; Structural protein VP1

Group: Retroid

Abelson murine leukemia virus NC_001499
  4425–4580; 51; gi|332031|; 636; 104; 2.00E–22; env polyprotein
Feline immunodeficiency virus NC_001482
  9006–9170; 54; gi|128015|; 122; 118; 1.00E–26; nef protein
Friend spleen focus-forming virus NC_001500
  2173–2292; 39; gi|11120675|; 1733; 79.6; 5.00E–15; gag polyprotein
  2289–2543; 84; gi|510896|; 538; 168; 2.00E–41; gag polyprotein
Human foamy virus NC_001736
  11054–11827; 257; gi|227764|; 356; 562; 1.00E–159; bel-2 protein
Human T-cell lymphotropic virus type 2 NC_001488
  6–119; 37; gi|6539751|; 48; 77.6; 2.00E–14; tax protein
Moloney murine sarcoma virus NC_001502
  2485–2967; 160; gi|9626961|; 1737; 271; 5.00E–72; pol polyprotein
  2945–3388; 147; gi|9626961|; 1737; 293; 1.00E–78; pol polyprotein
  4563–4718; 51; gi|332031|; 636; 102; 5.00E–22; Envelope protein
Murine osteosarcoma virus NC_001506
  complement (2305–2706); 133; gi|15822914|; 137; 250; 4.00E–66; Ubiquitin-like protein
Murine sarcoma virus NC_001363
  2970–3452; 160; gi|9626961|; 1737; 271; 5.00E–72; pol polyprotein
  3430–3873; 147; gi|9626961|; 1737; 293; 1.00E–78; pol polyprotein
  5048–5203; 51; gi|332031|; 636; 102; 5.00E–22; spike protein
Simian foamy virus NC_001364
  3–377; 124; gi|9626108|; 417; 279; 1.00E–74; bet protein
Simian immunodeficiency virus NC_001549
  3–335; 110; gi|9627209|; 223; 247; 5.00E–65; nef protein
Simian type D virus 1 NC_001551
  5194–5973; 259; gi|9627214|; 1771; 450; 1.00E–125; pol polyprotein
Y73 sarcoma virus NC_001404
  2865–3194; 109; gi|13508442|; 611; 206; 7.00E–53; Transmembrane envelope protein
Barmah Forest virus NC_001786
  5679–7298; 539; gi|7444406|; 2493; 816; 0; Non-structural polyprotein

Group: ssRNA(+)

Northern cereal mosaic virus NC_002251
  6740–12916; 2058; gi|2961429|; 1967; 536; 1.00E–150; Polymerase
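The "Predicted length" column of Table 5 can be cross-checked against the genomic coordinates in the "Prediction" column. The sketch below assumes that each coordinate range includes the stop codon and that the length counts translated codons only; this rule is inferred from the table values rather than stated in the text, and the function name is hypothetical.

```python
import re


def predicted_length(location):
    """Protein length (amino acids) implied by a Table 5-style location,
    e.g. "10443–11138" or "complement (114621–114773)".

    Assumption: the coordinates span the whole ORF including the stop codon.
    """
    start, end = map(int, re.findall(r"\d+", location))
    nucleotides = end - start + 1
    if nucleotides % 3 != 0:
        raise ValueError(f"{location}: span is not a whole number of codons")
    return nucleotides // 3 - 1  # the stop codon is not translated


# Rows taken from Table 5.
assert predicted_length("10443–11138") == 231                 # Alcelaphine herpesvirus 1
assert predicted_length("complement (114621–114773)") == 50   # Amsacta moorei entomopoxvirus
assert predicted_length("6740–12916") == 2058                 # Northern cereal mosaic virus
```

The strand does not affect the length, so the `complement (...)` wrapper is simply ignored when the two coordinates are extracted.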

CONCLUDING REMARKS

We have demonstrated that GeneMarkS, an ab initio gene finding method, can be adjusted for analysis of viral genomes of different types and can generate useful information. In small viral genomes, any single missed gene could be of significant interest, and the reliable identification of a narrow set of putative proteins for study by extrinsic and experimental methods saves a considerable amount of time and effort. As the never-ending discovery of new viruses brings about new names such as Mimivirus (37) or SARS (38), accurate ab initio computer methods for viral gene identification will remain of great value.

SUPPLEMENTARY MATERIAL

Supplementary Material is available at NAR Online.
  35 in total

1.  The estimation of statistical parameters for local alignment score distributions.

Authors:  S F Altschul; R Bundschuh; R Olsen; T Hwa
Journal:  Nucleic Acids Res       Date:  2001-01-15       Impact factor: 16.971

2.  Viral Genome DataBase: storing and analyzing genes and proteins from complete viral genomes.

Authors:  D Hiscock; C Upton
Journal:  Bioinformatics       Date:  2000-05       Impact factor: 6.937

3.  GenBank.

Authors:  Dennis A Benson; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Barbara A Rapp; David L Wheeler
Journal:  Nucleic Acids Res       Date:  2002-01-01       Impact factor: 16.971

4.  A giant virus in amoebae.

Authors:  Bernard La Scola; Stéphane Audic; Catherine Robert; Liang Jungang; Xavier de Lamballerie; Michel Drancourt; Richard Birtles; Jean-Michel Claverie; Didier Raoult
Journal:  Science       Date:  2003-03-28       Impact factor: 47.728

Review 5.  Comparing the predicted and observed properties of proteins encoded in the genome of Escherichia coli K-12.

Authors:  A J Link; K Robison; G M Church
Journal:  Electrophoresis       Date:  1997-08       Impact factor: 3.535

6.  A conserved genetic module that encodes the major virion components in both the coliphage T4 and the marine cyanophage S-PM2.

Authors:  E Hambly; F Tétart; C Desplats; W H Wilson; H M Krisch; N H Mann
Journal:  Proc Natl Acad Sci U S A       Date:  2001-09-11       Impact factor: 11.205

7.  Sequence analysis of the complete genome of an iridovirus isolated from the tiger frog.

Authors:  Jian G He; Ling Lü; Min Deng; Hua H He; Shao P Weng; Xiao H Wang; Song Y Zhou; Qin X Long; Xun Z Wang; Siu M Chan
Journal:  Virology       Date:  2002-01-20       Impact factor: 3.616

8.  The Genome sequence of the SARS-associated coronavirus.

Authors:  Marco A Marra; Steven J M Jones; Caroline R Astell; Robert A Holt; Angela Brooks-Wilson; Yaron S N Butterfield; Jaswinder Khattra; Jennifer K Asano; Sarah A Barber; Susanna Y Chan; Alison Cloutier; Shaun M Coughlin; Doug Freeman; Noreen Girn; Obi L Griffith; Stephen R Leach; Michael Mayo; Helen McDonald; Stephen B Montgomery; Pawan K Pandoh; Anca S Petrescu; A Gordon Robertson; Jacqueline E Schein; Asim Siddiqui; Duane E Smailus; Jeff M Stott; George S Yang; Francis Plummer; Anton Andonov; Harvey Artsob; Nathalie Bastien; Kathy Bernard; Timothy F Booth; Donnie Bowness; Martin Czub; Michael Drebot; Lisa Fernando; Ramon Flick; Michael Garbutt; Michael Gray; Allen Grolla; Steven Jones; Heinz Feldmann; Adrienne Meyers; Amin Kabani; Yan Li; Susan Normand; Ute Stroher; Graham A Tipples; Shaun Tyler; Robert Vogrig; Diane Ward; Brynn Watson; Robert C Brunham; Mel Krajden; Martin Petric; Danuta M Skowronski; Chris Upton; Rachel L Roper
Journal:  Science       Date:  2003-05-01       Impact factor: 47.728

9.  Genome structure of mycobacteriophage D29: implications for phage evolution.

Authors:  M E Ford; G J Sarkis; A E Belanger; R W Hendrix; G F Hatfull
Journal:  J Mol Biol       Date:  1998-05-29       Impact factor: 5.469

10.  VIDA: a virus database system for the organization of animal virus genome open reading frames.

Authors:  M M Albà; D Lee; F M Pearl; A J Shepherd; N Martin; C A Orengo; P Kellam
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

