Ryan Mills, Michael Rozanov, Alexandre Lomsadze, Tatiana Tatusova, Mark Borodovsky.
Abstract
Gene annotation in viruses often relies upon similarity search methods. These methods possess high specificity but some genes may be missed, either those unique to a particular genome or those highly divergent from known homologs. To identify potentially missing viral genes we have analyzed all complete viral genomes currently available in GenBank with a specialized and augmented version of the gene finding program GeneMarkS. In particular, by implementing genome-specific self-training protocols we have better adjusted the GeneMarkS statistical models to sequences of viral genomes. Hundreds of new genes were identified, some in well studied viral genomes. For example, a new gene predicted in the genome of the Epstein-Barr virus was shown to encode a protein similar to alpha-herpesvirus minor tegument protein UL14 with heat shock functions. Convincing evidence of this similarity was obtained after only 12 PSI-BLAST iterations. In another example, several iterations of PSI-BLAST were required to demonstrate that a gene predicted in the genome of Alcelaphine herpesvirus 1 encodes a BALF1-like protein which is thought to be involved in apoptosis regulation and, potentially, carcinogenesis. New predictions were used to refine annotations of viral genomes in the RefSeq collection curated by the National Center for Biotechnology Information. Importantly, even in those cases where no sequence similarities were detected, GeneMarkS significantly reduced the number of primary targets for experimental characterization by identifying the most probable candidate genes. The new genome annotations were stored in VIOLIN, an interactive database which provides access to similarity search tools for up-to-date analysis of predicted viral proteins.
Year: 2003 PMID: 14627837 PMCID: PMC290248 DOI: 10.1093/nar/gkg878
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Flowchart of the statistical gene identification procedure applied to a complete genome of a virus of a prokaryotic host. For viruses of eukaryotic hosts, the Kozak model is used instead of the RBS model.
Table 1. Summary of the results of the analysis of viral genomes currently available in GenBank and those viral genomes for which reference sequences (RefSeq collection) have already been created at NCBI
| | GenBank^a total | RefSeq total | RefSeq, eukaryotic hosts | RefSeq, prokaryotic hosts |
|---|---|---|---|---|
| Number of viral genomes analyzed | 1750 | 1107 | 1015 | 92 |
| Exact match between prediction and annotation | 15703 | 10425 | 8011 | 2414 |
| Predicted gene differs in start location from annotated one | 1479 | 931 | 368 | 563 |
| Predicted gene overlaps with an intron containing annotated gene | 382 | 209 | 190 | 19 |
| Annotated gene was not predicted (possible false negative) | 3885 (25%)^b | 2720 (26%)^c | 2231 (28%)^c | 489 (20%)^c |
| Newly predicted genes (possible false positive) | 3520 (22%)^b | 1360 (13%)^c | 1047 (13%)^c | 313 (13%)^c |
| Prediction has a BLASTP and CD-Search hit with | 622 | 99 | 89 | 10 |
| Prediction has a BLASTP hit with | 1248 | 336 | 243 | 93 |
| Prediction has a CD-Search hit with | 35 | 6 | 6 | 0 |
| Prediction has no BLASTP or CD-Search hit with | 1615 | 919 | 709 | 210 |
The numbers in the RefSeq columns do not reflect 86 genomes annotated in RefSeq with the aid of the VIOLIN data. Newly predicted genes have been further analyzed by BLASTP and these results are shown in the bottom rows.
^a The GenBank records used in the current analysis did not include RefSeq records; however, the original records for each RefSeq record were included in this GenBank set of genomes.
^b The percentage value is defined with regard to the number of predicted genes exactly matching the annotation in GenBank.
^c The percentage value is defined with regard to the number of predicted genes exactly matching the annotation in RefSeq.
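The percentage convention in the footnotes above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `percent_of_exact` is hypothetical, the counts are copied from the GenBank and RefSeq total columns of the table, and rounding to the nearest whole percent is an assumption that happens to reproduce the printed values.

```python
# Counts copied from the summary table above (GenBank total and RefSeq total columns).
EXACT_MATCH = {"GenBank": 15703, "RefSeq": 10425}
NOT_PREDICTED = {"GenBank": 3885, "RefSeq": 2720}
NEWLY_PREDICTED = {"GenBank": 3520, "RefSeq": 1360}


def percent_of_exact(count, exact):
    """Express a count as a percentage of the exact-match total.

    Rounding to the nearest integer is an assumption made for this sketch.
    """
    return round(100 * count / exact)


for source in ("GenBank", "RefSeq"):
    missed = percent_of_exact(NOT_PREDICTED[source], EXACT_MATCH[source])
    new = percent_of_exact(NEWLY_PREDICTED[source], EXACT_MATCH[source])
    print(f"{source}: not predicted {missed}%, newly predicted {new}%")
```

Note that the denominator is the number of exact matches, not the total number of annotated genes, so these percentages are not conventional false-negative or false-positive rates.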
Distribution of the results of the comparative analysis of gene prediction and annotation for viral genomes from the RefSeq collection with the three sets of viruses clustered by genome length
| a | 10000 <= | ||
|---|---|---|---|
| Exact match | 1772 | 2493 | 6160 |
| Different start | 225 (12.7%) | 483 (19.4%) | 223 (3.6%) |
| Overlap with interrupted gene | 79 (4.5%) | 43 (1.7%) | 87 (1.4%) |
| Annotated gene not predicted | 731 (41.3%) | 499 (20.0%) | 1490 (24.1%) |
| New predictions | 331 (18.7%) | 350 (14.0%) | 679 (11.0%) |
| BLASTP and CD-Search hit | 26 | 34 | 39 |
| BLASTP only hit | 51 | 104 | 181 |
| CD-Search only hit | 1 | 0 | 5 |
| No hits | 253 | 212 | 454 |
^a The meaning of the categories in this column is the same as in the left-most column in Table 1.
^b The genome length is designated as L.
^c The number in parentheses designates the number of genomes of a given category.
Distribution of the results of the comparative analysis of gene prediction and annotation for viral genomes from the RefSeq collection joined in classes defined by viral classification
| ^a | dsDNA (193)^b | ssDNA (185)^b | dsRNA (127)^b | ssRNA positive strand (418)^b | ssRNA negative strand (82)^b | Retroid (65)^b | Satellite (27)^b | Virus not classified (6)^b | Phage not classified (3)^b |
|---|---|---|---|---|---|---|---|---|---|
| Exact match | 8532 | 440 | 142 | 750 | 252 | 151 | 12 | 12 | 132 |
| Different start | 644 | 56 | 5 | 115 | 36 | 32 | 0 | 1 | 42 |
| Overlap with interrupted gene | 125 | 2 | 4 | 51 | 3 | 24 | 0 | 0 | 0 |
| Annotated gene not predicted | 2053 | 275 | 12 | 245 | 32 | 45 | 4 | 6 | 49 |
| New predictions | 1025 | 88 | 54 | 72 | 32 | 53 | 4 | 3 | 29 |
| BLASTP and CD-Search hit | 79 | 2 | 0 | 5 | 0 | 13 | 0 | 0 | 0 |
| BLASTP only hit | 279 | 21 | 6 | 12 | 3 | 8 | 1 | 0 | 6 |
| CD-Search only hit | 5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| No hits | 662 | 65 | 48 | 54 | 29 | 32 | 3 | 3 | 23 |
^a The meaning of the categories in this column is the same as in the left-most column in Table 1.
^b The number in parentheses designates the number of genomes of a given category.
Gene prediction accuracy assessment for nine human herpesviruses
| Virus | Number of genes predicted | Number of genes annotated | Number of genes in test set | Number of correct predictions | Prediction sensitivity (%) | Prediction specificity (%) |
|---|---|---|---|---|---|---|
| HHV-1 (HSV-1) | 76 | 73 | 75 | 69 | 92 | 90 |
| HHV-2 (HSV-2) | 77 | 71 | 71 | 65 | 92 | 84 |
| HHV-3 (VZV) | 72 | 71 | 71 | 69 | 97 | 96 |
| HHV-4 (EBV) | 90 | 94 | 78 | 70 | 89 | 78 |
| HHV-5 (HCMV) | 164 | 198 | 148 | 125 | 84 | 76 |
| HHV-6A | 115 | 121 | 119 | 104 | 87 | 90 |
| HHV-6B | 114 | 91 | 85 | 81 | 95 | 71 |
| HHV-7 | 109 | 107 | 104 | 90 | 87 | 83 |
| HHV-8 (KSHV) | 96 | 82 | 88 | 83 | 94 | 86 |
| Total | 913 | 908 | 839 | 767 | 91 | 84 |
The test set was compiled as explained in text.
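The table does not spell out how the two accuracy columns are computed. A minimal sketch, assuming the definitions common in gene finding (sensitivity = correct predictions / genes in the test set; specificity = correct predictions / genes predicted), reproduces the HHV-3, HHV-6B and Total rows when rounded to whole percentages. The function name `accuracy` and the rounding rule are assumptions of this sketch, not taken from the source.

```python
def accuracy(predicted, test_set, correct):
    """Return (sensitivity %, specificity %) rounded to whole percentages.

    Assumed definitions (not stated in the table itself):
      sensitivity = correct / genes in test set
      specificity = correct / genes predicted
    """
    sensitivity = 100 * correct / test_set
    specificity = 100 * correct / predicted
    return round(sensitivity), round(specificity)


# (genes predicted, genes in test set, correct predictions), copied from the table.
rows = {
    "HHV-3 (VZV)": (72, 71, 69),
    "HHV-6B": (114, 85, 81),
    "Total": (913, 839, 767),
}

for virus, counts in rows.items():
    sn, sp = accuracy(*counts)
    print(f"{virus}: sensitivity {sn}%, specificity {sp}%")
```

Under these definitions "specificity" is what other fields call precision, since true negatives are not counted.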
Figure 2. Length distributions of several categories of genes predicted or annotated in 1047 RefSeq viral genomes. Dark gray bars are used for genes annotated but not predicted; light gray bars are used for predicted but not annotated genes whose protein products produce BLASTP hits with E-values <10^-5; white bars are used for predicted but not annotated genes whose protein products do not produce BLASTP hits with E-values <10^-5.
Figure 3. The positional nucleotide frequency patterns of the GeneMarkS models of the RBS pattern for phage T4 (b) and phage λ (c) are shown in the logo form (27), as compared with the RBS pattern of E. coli shown in (a). Similarly, the Kozak patterns for human herpesvirus 4 (e) and human herpesvirus 8 (f) are shown in the logo form, with the Kozak pattern for human genes shown in (d).
Figure 4. MultAlin alignment of (putative) BALF1-like proteins (33). The variable N- and C-termini are shown in lower case. Protein names are abbreviated as follows: AHV-1 BALF1, BALF1 homolog (NP_597933) predicted by GeneMarkS in the genome of Alcelaphine herpesvirus 1 (NC_002531); PLHV-1 vbcl2, Porcine lymphotropic herpesvirus 1 hypothetical v-bcl2 (AAM22111); CALHV-3 ORF1, Callitrichine herpesvirus 3 ORF1 (AAK38208); HVP BALF1, Herpesvirus papio BALF1 (AAK01916); PoHV-3 BALF1, Pongine herpesvirus 3 BALF1 (AAK60342); HHV-4 BALF1, Human herpesvirus 4 BALF1 (NP_039912); PoHV-1 BALF1, Pongine herpesvirus 1 (AAK01917); CeHV-15 BALF1, Cercopithecine herpesvirus 15 (AAK95480); EHV-2 ORF E4, Equine herpesvirus 2 ORF E4 protein (NP_042601). The conserved positions are color-coded based on the type of amino acid residue as indicated in the consensus line, where h and a stand for hydrophobic residues (A, C, F, I, L, M, V, W, Y: yellow background in alignment) and for aromatic residues (F, Y, W), respectively; b stands for 'large' residues (E, K, R, I, L, M, F, Y, W: gray background); p stands for polar residues (D, E, H, K, N, Q, R, S, T: shown in pink); s and u stand for small residues (A, C, S, T, D, N, V, G, P: green background) and tiny residues (G, A, S), respectively; c and + stand for charged residues (K, R, D, E, H: shown in pink) and positively charged residues (K, R), respectively. Invariant amino acid residues (in 85% or more sequences) are highlighted with a black background.
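The residue classes listed in the Figure 4 caption can be written down as a small lookup table. The sketch below is illustrative only and is not the procedure used by MultAlin or by the authors: the names `CONSENSUS_CLASSES` and `classes_for_column` are hypothetical, and the rule "a class applies when every residue in the column belongs to it" is a simplifying assumption that ignores gaps and the 85% threshold mentioned for invariant residues.

```python
# Residue classes transcribed from the Figure 4 caption.
CONSENSUS_CLASSES = {
    "h": set("ACFILMVWY"),  # hydrophobic
    "a": set("FYW"),        # aromatic
    "b": set("EKRILMFYW"),  # 'large'
    "p": set("DEHKNQRST"),  # polar
    "s": set("ACSTDNVGP"),  # small
    "u": set("GAS"),        # tiny
    "c": set("KRDEH"),      # charged
    "+": set("KR"),         # positively charged
}


def classes_for_column(column):
    """Return the class symbols that contain every residue of an alignment column.

    Assumption for this sketch: `column` is a gap-free string of one-letter
    amino acid codes, one per aligned sequence.
    """
    residues = set(column.upper())
    return {
        symbol
        for symbol, members in CONSENSUS_CLASSES.items()
        if residues <= members
    }


# A column of aromatic residues and a column of positively charged residues.
print(sorted(classes_for_column("FYW")))
print(sorted(classes_for_column("KR")))
```

Because the classes overlap, a column usually satisfies several symbols at once; a real consensus line would pick the most specific one.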
Figure 5. Alignment of the sequences of ORF26/ORF35 and UL14-like proteins. For most sequences, the N- and C-termini are not shown. The coloring is as in Figure 4. The protein gi numbers and the organism names are: HHV-4 GeneMark_65 prediction (positions 1–139) (Human herpesvirus 4); SaHV-2 ORF35 (1–147), 9625991 (Saimiriine herpesvirus 2); HHV-3 MTP (minor tegument protein, positions 11–159), 9625920 (Human herpesvirus 3); CeHV-7 unknown (11–159), 13242439 (Cercopithecine herpesvirus 7); AtHV-3 ORF35 (1–147), 9631227 (Ateline herpesvirus 3); EHV-4 ORF48 (7–155), 9629775 (Equine herpesvirus 4); BoHV-4 unknown (4–150), 13095612 (Bovine herpesvirus 4); RRV unknown (3–146), 18653842 (Rhesus rhadinovirus, Macaca mulatta rhadinovirus); HHV-1 UL14 (7–151), 9629394 (Human herpesvirus 1); HHV-2 UL14 (7–155), 9629283 (Human herpesvirus 2); HHV-1 (HSV1/17) UL14 (3–155), 136823 [Herpes simplex virus (type 1/strain 17)]; EHV-1 ORF48 (7–155), 9626785 (Equine herpesvirus 1); EHV-2 ORF35 (5–150), 9628038 (Equine herpesvirus 2); CalHV-3 ORF26 (3–148), 13676668 (Callitrichine herpesvirus 3); HHV-8 ORF35 (3–147), 18846002 (Human herpesvirus 8); PLHV-1 unknown (3–149), 20453822 (Porcine lymphotropic herpesvirus 1); AlHV-1 ORF35 (2–148), 10140956 (Alcelaphine herpesvirus 1); GaHV-2 UL14 (19–161), 9635049 (Gallid herpesvirus 2); MeHV-1 UL14 MTP (13–156), 12084842 (Meleagrid herpesvirus 1); GaHV-3 UL14 (8–156), 10834883 (Gallid herpesvirus 3); MuHV-4 unknown (3–149), 9629576 (Murid herpesvirus 4); PsHV-1 UL14 (15–163), 13094667 (Psittacid herpesvirus 1); BoHV-1 unknown (18–170), 9629861 (Bovine herpesvirus 1); GaHV-1 UL14 (62–210), 5708112 (Gallid herpesvirus 1); CHV unnamed (1–112, the entire sequence; appears to be incomplete), 1066253 (Canine herpesvirus); SuHV-1 UL14 (6–159, end of sequence), 267201 [Suid herpesvirus 1 (strain NIA-3)].
Figure 6. Snapshot of a sample viral genome record as it appears at the VIOLIN web site.
Sample of the newly added RefSeq genes identified by the statistical gene finding methods described in this work
| Prediction | Predicted length | Best BLASTP hit | BLASTP length | Score | E-value | Annotated function | |
|---|---|---|---|---|---|---|---|
| 10443–11138 | 231 | gi|9628007| | 183 | 66.3 | 4.00E–10 | Putative BALF1 homolog | |
| complement (114621–114773) | 50 | gi|9629968| | 52 | 65.6 | 9.00E–11 | Conotoxin-like protein | |
| 73911–75053 | 380 | gi|331012| | 384 | 603 | 1.00E–171 | Immediate-early phosphoprotein (transactivator) | |
| 26793–27119 | 108 | gi|9633186| | 302 | 95.6 | 2.00E–19 | Late 33 kDa protein | |
| 10583–12295 | 570 | gi|13487865| | 573 | 755 | 0 | Peripentonal hexon-associated protein | |
| 12347–13783 | 478 | gi|13487866| | 471 | 793 | 0 | Penton protein | |
| 15888–16382 | 164 | gi|13487870| | 233 | 201 | 4.00E–51 | Minor capsid protein VI precursor | |
| 16628–19324 | 898 | gi|13487871| | 910 | 1546 | 0 | Hexon protein | |
| 21366–23579 | 737 | gi|13487873| | 722 | 1004 | 0 | Hexon assembly-associated 100 kDa protein | |
| complement (30406–30735) | 109 | gi|13487881| | 245 | 101 | 3.00E–21 | 245R protein homolog | |
| complement (30823–31383) | 186 | gi|13487880| | 253 | 188 | 5.00E–47 | 253R protein homolog | |
| 3914–4048 | 44 | gi|137747| | 44 | 85.4 | 9.00E–17 | E5 transforming protein | |
| complement (112994–113785) | 263 | gi|15235673| | 608 | 179 | 5.00E–44 | Glycine-rich protein | |
| 14583–16211 | 542 | gi|9628848| | 575 | 799 | 0 | Peripentonal hexon associated protein | |
| complement (38665–40446) | 593 | gi|3845680| | 195 | 381 | 1.00E–104 | Glycine-rich protein | |
| 52914–54572 | 552 | gi|1083970| | 552 | 1122 | 0 | Rifampicin resistance N3L protein | |
| 30444–30830 | 128 | gi|119063| | 128 | 264 | 5.00E–70 | Early E3B protein | |
| complement (30852–31019) | 55 | gi|9626584| | 53 | 143 | 4.00E–09 | U protein | |
| complement (35146–35532) | 128 | gi|119716| | 283 | 246 | 1.00E–64 | E4 protein | |
| 25202–25558 | 118 | gi|9626562| | 211 | 135 | 2.00E–31 | 33 kDa phosphoprotein | |
| complement (31183–31407) | 74 | gi|93525| | 74 | 154 | 1.00E–37 | Early E4 17 kDa protein | |
| 560–1138 | 192 | gi|4323354| | 251 | 316 | 1.00E–85 | Early E1A protein | |
| 1491–2117 | 208 | gi|4323357| | 182 | 377 | 1.00E–104 | Small T-antigen fragment | |
| 2165–2533 | 122 | gi|4323358| | 495 | 214 | 5.00E–55 | Small T-antigen fragment | |
| 2530–2976 | 148 | gi|4323358| | 495 | 301 | 5.00E–81 | Small T-antigen fragment | |
| 3033–3359 | 108 | gi|4323358| | 495 | 227 | 4.00E–59 | Small T-antigen fragment | |
| complement (3888–4499) | 203 | gi|130244| | 448 | 408 | 1.00E–113 | IVa2 maturation protein | |
| complement (4501–4935) | 144 | gi|130244| | 448 | 250 | 8.00E–66 | IVa2 maturation protein | |
| 15724–15960 | 78 | gi|9626191| | 368 | 74.5 | 2.00E–13 | V minor core protein | |
| 16177–16713 | 178 | gi|9626570| | 358 | 148 | 4.00E–35 | V minor core protein | |
| 16798–16953 | 51 | gi|9626571| | 70 | 74.5 | 2.00E–13 | L2 protein mu precursor | |
| 17754–18065 | 103 | gi|780528| | 947 | 161 | 3.00E–39 | Hexon capsid protein | |
| 18068–20617 | 849 | gi|780528| | 947 | 1595 | 0 | Hexon capsid protein | |
| complement (21293–21745) | 150 | gi|118737| | 517 | 238 | 3.00E–62 | E2A DNA binding protein | |
| complement (21724–22503) | 259 | gi|118735| | 512 | 341 | 6.00E–93 | E2A DNA binding protein | |
| 23513–23779 | 88 | gi|209871| | 652 | 99.8 | 7.00E–21 | Hexon assembly-associated protein | |
| 23799–24956 | 385 | gi|9626180| | 805 | 331 | 1.00E–89 | Hexon assembly-associated protein | |
| 25472–25774 | 100 | gi|9626578| | 233 | 129 | 9.00E–30 | pVIII protein | |
| 27021–27494 | 157 | gi|1279435| | 166 | 314 | 7.00E–85 | HLA-binding protein | |
| 29892–30287 | 131 | gi|6940696| | 130 | 264 | 4.00E–70 | E3B protein | |
| 30280–30672 | 130 | gi|6940697| | 130 | 272 | 1.00E–72 | E3B protein | |
| complement (30770–30919) | 49 | gi|9626584| | 53 | 54.3 | 2.00E–07 | U protein | |
| complement (32308–32970) | 220 | gi|3913555| | 292 | 464 | 1.00E–130 | E4 protein | |
| complement (33116–33478) | 120 | gi|1699394| | 120 | 259 | 2.00E–68 | E4 protein | |
| complement (33481–33834) | 117 | gi|1699393| | 117 | 243 | 7.00E–64 | E4 protein | |
| complement (33831–34058) | 75 | gi|1699392| | 130 | 142 | 5.00E–34 | E4 protein | |
| complement (34266–34463) | 65 | gi|1699391| | 125 | 132 | 7.00E–31 | E4 protein | |
| 10678–10905 | 75 | gi|13242466| | 87 | 112 | 9.00E–25 | Membrane protein | |
| 503–805 | 100 | gi|330387| | 365 | 160 | 5.00E–39 | Latent membrane protein | |
| 1546–1680 | 44 | gi|330387| | 365 | 85.8 | 7.00E–17 | Latent membrane protein | |
| 166576–166920 | 114 | gi|126379| | 497 | 257 | 4.00E–68 | Latent membrane protein | |
| complement (169031–169474) | 147 | gi|126373| | 386 | 224 | 6.00E–58 | Latent membrane protein | |
| 160003–160173 | 56 | gi|7542409| | 176 | 97.1 | 3.00E–20 | Interleukin-10-like protein | |
| 23343–23774 | 143 | gi|11346494| | 305 | 300 | 1.00E–80 | G-protein coupled receptor | |
| 129708–129848 | 46 | gi|2746315| | 153 | 101 | 2.00E–21 | Membrane glycoprotein | |
| 812–2650 | 612 | gi|137646| | 612 | 1251 | 0 | Replication protein E1 | |
| 892–1140 | 82 | gi|9627323| | 631 | 125 | 1.00E–28 | Replication protein E1 | |
| 1391–1591 | 66 | gi|9627323| | 631 | 104 | 2.00E–22 | Replication protein E1 | |
| 895–1149 | 84 | gi|9628585| | 630 | 112 | 1.00E–24 | Replication protein E1 | |
| 1395–2804 | 469 | gi|9628585| | 630 | 927 | 0 | Replication protein E1 | |
| 559–828 | 89 | gi|1491685| | 100 | 99.8 | 8.00E–21 | Transforming protein E7 | |
| 3004–3858 | 284 | gi|9626037| | 383 | 264 | 1.00E–69 | Regulatory protein E2 | |
| 4443–5783 | 446 | gi|13186281| | 524 | 583 | 1.00E–165 | Minor capsid protein L2 | |
| 5776–7341 | 521 | gi|3845719| | 505 | 689 | 0 | Late major capsid protein L1 | |
| 70403–70888 | 161 | gi|13506781| | 234 | 279 | 2.00E–74 | bZIP transcription factor | |
| 71468–72160 | 230 | gi|13506783| | 275 | 292 | 4.00E–78 | Glycoprotein R8.1 | |
| 2897–3175 | 92 | gi|209749| | 97 | 187 | 2.00E–47 | Early E1A protein | |
| complement (29726–30076) | 116 | gi|9800520| | 810 | 67.9 | 6.00E–11 | Tropoelastin | |
| 747–2624 | 625 | gi|9627078| | 611 | 744 | 0 | Replication protein E1 | |
| 2611–3780 | 389 | gi|9627069| | 416 | 379 | 1.00E–104 | Regulatory protein E2 | |
| 3780–3941 | 53 | gi|137747| | 44 | 66.3 | 5.00E–11 | Transforming protein E5 | |
| 4268–5623 | 451 | gi|9627086| | 447 | 445 | 1.00E–124 | Minor capsid protein L2 | |
| 745–2628 | 627 | gi|9627078| | 611 | 753 | 0 | Replication protein E1 | |
| 2615–3778 | 387 | gi|9627069| | 416 | 369 | 1.00E–101 | Regulatory protein E2 | |
| 3778–3930 | 50 | gi|137747| | 44 | 65.2 | 1.00E–10 | E5 protein | |
| 4122–5615 | 497 | gi|9627086| | 477 | 525 | 1.00E–148 | Minor capsid protein L2 | |
| complement (60731–61684) | 317 | gi|9845327| | 478 | 120 | 2.00E–26 | US22 family protein | |
| complement (5422–5526) | 34 | gi|3096964| | 351 | 67.1 | 3.00E–11 | TNF receptor II | |
| complement (6231–6377) | 48 | gi|3096965| | 586 | 96.7 | 4.00E–20 | K1R protein (ankyrin repeat protein) | |
| 76530–76721 | 63 | gi|11346541| | 63 | 130 | 3.00E–30 | RNA polymerase | |
| 162151–162264 | 37 | gi|401315| | 193 | 58.9 | 9.00E–09 | Guanylate kinase | |
| 183524–183640 | 38 | gi|3096966| | 672 | 66.7 | 4.00E–11 | D4L protein (ankyrin repeat protein) | |
| 185397–185507 | 36 | gi|3096965| | 586 | 70.6 | 3.00E–12 | K1R protein (ankyrin repeat protein) | |
| 186212–186316 | 34 | gi|3096964| | 351 | 67.1 | 3.00E–11 | TNF receptor II | |
| complement (1864–2376) | 170 | gi|137410| | 295 | 348 | 3.00E–95 | Replication-associated protein | |
| complement (5134–5388) | 84 | gi|5689346| | 291 | 83.1 | 6.00E–16 | Structural protein | |
| 2252–2464 | 70 | gi|15673928| | 68 | 139 | 4.00E–33 | ps3 protein 14-like transcriptional regulator | |
| 2–340 | 112 | gi|4098413| | 348 | 216 | 8.00E–56 | Integrase | |
| 34482–35036 | 184 | gi|140702| | 183 | 374 | 1.00E–103 | Superinfection exclusion protein B | |
| complement (46459–46752) | 97 | gi|137520| | 97 | 196 | 5.00E–50 | Bor protein precursor | |
| complement (47042–47575) | 177 | gi|16128541| | 150 | 309 | 2.00E–83 | Putative envelope protein | |
| complement (11467–11595) | 42 | gi|15830439| | 217 | 59.7 | 5.00E–09 | c1 repressor protein | |
| 1–147 | 48 | gi|9634956| | 84 | 104 | 2.00E–22 | Non-structural protein | |
| 4425–4532 | 35 | gi|9634956| | 84 | 75.3 | 1.00E–13 | Non-structural protein | |
| 19015–20130 | 371 | gi|9634179| | 321 | 270 | 2.00E–71 | Tail fiber protein | |
| complement (26155–26307) | 50 | gi|9634191| | 50 | 104 | 1.00E–22 | kil protein | |
| 32436–33047 | 203 | gi|15832758| | 188 | 106 | 3.00E–22 | Endonuclease | |
| 33876–34316 | 146 | gi|9910800| | 146 | 294 | 4.00E–79 | Protein Nin B | |
| 35667–36029 | 120 | gi|9634210| | 120 | 244 | 4.00E–64 | Holliday-junction resolvase | |
| complement (33531–34064) | 177 | gi|96899| | 177 | 360 | 8.00E–99 | Tail fiber assembly protein | |
| complement (34067–35053) | 328 | gi|96901| | 536 | 678 | 0 | Tail fiber | |
| complement (39527–39826) | 99 | gi|9964612| | 271 | 124 | 3.00E–28 | gp5-like protein | |
| 3148–3330 | 60 | gi|9634634| | 218 | 116 | 3.00E–26 | Erf protein | |
| 37175–37687 | 170 | gi|9635004| | 167 | 317 | 6.00E–86 | DNA binding protein | |
| 12585–13001 | 138 | gi|75696| | 144 | 270 | 7.00E–72 | Structural protein VP1 | |
| 4425–4580 | 51 | gi|332031| | 636 | 104 | 2.00E–22 | env polyprotein | |
| 9006–9170 | 54 | gi|128015| | 122 | 118 | 1.00E–26 | nef protein | |
| 2173–2292 | 39 | gi|11120675| | 1733 | 79.6 | 5.00E–15 | gag polyprotein | |
| 2289–2543 | 84 | gi|510896| | 538 | 168 | 2.00E–41 | gag polyprotein | |
| 11054–11827 | 257 | gi|227764| | 356 | 562 | 1.00E–159 | bel-2 protein | |
| 6–119 | 37 | gi|6539751| | 48 | 77.6 | 2.00E–14 | tax protein | |
| 2485–2967 | 160 | gi|9626961| | 1737 | 271 | 5.00E–72 | pol polyprotein | |
| 2945–3388 | 147 | gi|9626961| | 1737 | 293 | 1.00E–78 | pol polyprotein | |
| 4563–4718 | 51 | gi|332031| | 636 | 102 | 5.00E–22 | Envelope protein | |
| complement (2305–2706) | 133 | gi|15822914| | 137 | 250 | 4.00E–66 | Ubiquitin-like protein | |
| 2970–3452 | 160 | gi|9626961| | 1737 | 271 | 5.00E–72 | pol polyprotein | |
| 3430–3873 | 147 | gi|9626961| | 1737 | 293 | 1.00E–78 | pol polyprotein | |
| 5048–5203 | 51 | gi|332031| | 636 | 102 | 5.00E–22 | spike protein | |
| 3–377 | 124 | gi|9626108| | 417 | 279 | 1.00E–74 | bet protein | |
| 3–335 | 110 | gi|9627209| | 223 | 247 | 5.00E–65 | nef protein | |
| 5194–5973 | 259 | gi|9627214| | 1771 | 450 | 1.00E–125 | pol polyprotein | |
| 2865–3194 | 109 | gi|13508442| | 611 | 206 | 7.00E–53 | Transmembrane envelope protein | |
| 5679–7298 | 539 | gi|7444406| | 2493 | 816 | 0 | Non-structural polyprotein | |
| 6740–12916 | 2058 | gi|2961429| | 1967 | 536 | 1.00E–150 | Polymerase |