Doug Hyatt, Gwo-Liang Chen, Philip F Locascio, Miriam L Land, Frank W Larimer, Loren J Hauser.
Abstract
BACKGROUND: The quality of automated gene prediction in microbial organisms has improved steadily over the past decade, but there is still room for improvement. Increasing the number of correct identifications, both of genes and of the translation initiation sites for each gene, and reducing the overall number of false positives, are all desirable goals.
Year: 2010 PMID: 20211023 PMCID: PMC2848648 DOI: 10.1186/1471-2105-11-119
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Pseudocode description of the Prodigal algorithm.
Dynamic Programming Connections in Prodigal
| Left Node | Right Node | Connection Type | Connection Score |
|---|---|---|---|
| 5' forward | 3' forward | Gene | Start+coding score |
| 3' reverse | 5' reverse | Gene | Start+coding score |
| 3' forward | 5' forward | Intergenic Space | Distance modifiers |
| 3' forward | 3' reverse | Intergenic Space | Distance modifiers |
| 5' reverse | 3' reverse | Intergenic Space | Distance modifiers |
| 5' reverse | 5' forward | Intergenic Space | Distance modifiers |
| 3' forward | 3' forward | Overlapping Genes | Score of 2nd gene |
| 3' reverse | 3' reverse | Overlapping Genes | Score of 2nd gene |
| 3' forward | 5' reverse | Opposite Strand Overlap | Score of 2nd gene |
Table 1 shows the types of dynamic programming connections in the algorithm. Each end of a gene is a node, and connections between these nodes represent either genes or the space between genes. The more complicated connections indicate overlapping genes.
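The connection types in Table 1 can be read as a lookup from a (left node, right node) pair to a connection type and the quantity that scores it. The following is a minimal sketch of that lookup, not Prodigal's implementation: the names `Connection`, `CONNECTIONS` and `connection` are hypothetical, and the table entries are copied from Table 1 as plain labels.

```python
from typing import NamedTuple, Optional


class Connection(NamedTuple):
    """One row of Table 1 (hypothetical container, for illustration only)."""
    kind: str   # connection type
    score: str  # description of what contributes to the connection score


# Keys are (left node, right node); values are copied from Table 1.
CONNECTIONS = {
    ("5' forward", "3' forward"): Connection("Gene", "Start+coding score"),
    ("3' reverse", "5' reverse"): Connection("Gene", "Start+coding score"),
    ("3' forward", "5' forward"): Connection("Intergenic Space", "Distance modifiers"),
    ("3' forward", "3' reverse"): Connection("Intergenic Space", "Distance modifiers"),
    ("5' reverse", "3' reverse"): Connection("Intergenic Space", "Distance modifiers"),
    ("5' reverse", "5' forward"): Connection("Intergenic Space", "Distance modifiers"),
    ("3' forward", "3' forward"): Connection("Overlapping Genes", "Score of 2nd gene"),
    ("3' reverse", "3' reverse"): Connection("Overlapping Genes", "Score of 2nd gene"),
    ("3' forward", "5' reverse"): Connection("Opposite Strand Overlap", "Score of 2nd gene"),
}


def connection(left: str, right: str) -> Optional[Connection]:
    """Return the Table 1 entry for a node pair, or None if the pair is not listed."""
    return CONNECTIONS.get((left, right))


if __name__ == "__main__":
    print(connection("5' forward", "3' forward"))
    print(connection("5' forward", "5' forward"))  # not listed in Table 1
```

Any node pair absent from the dictionary, such as 5' forward to 5' forward, is treated here as having no connection.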
Shine-Dalgarno RBS Motifs in Prodigal
| Bin # | RBS Motif | RBS Spacer |
|---|---|---|
| 0 | None | None |
| 1 | | 3-4 bp |
| 2 | | 13-15 bp |
| 3 | | 13-15 bp |
| 4 | | 11-12 bp |
| 5 | | 3-4 bp |
| 6 | | 11-12 bp |
| 7 | | 11-12 bp |
| 8 | | 3-4 bp |
| 9 | | 5-10 bp |
| 10 | | 13-15 bp |
| 11 | | 3-4 bp |
| 12 | | 11-12 bp |
| 13 | | 5-10 bp |
| 14 | | 5-10 bp |
| 15 | | 5-10 bp |
| 16 | | 5-10 bp |
| 17 | | 11-12 bp |
| 18 | | 3-4 bp |
| 19 | | 5-10 bp |
| 20 | | 11-12 bp |
| 21 | | 3-4 bp |
| 22 | | 5-10 bp |
| 23 | | 3-4 bp |
| 24 | | 5-10 bp |
| 25 | | 11-12 bp |
| 26 | | 3-4 bp |
| 27 | | 5-10 bp |
Table 2 shows the default bins for the RBS motifs. An 'x' in the middle of a motif indicates a mismatch is allowed. The right column shows the spacer distance allowed between the translation start and the motif. The leftmost column indicates the initial "score" assigned to these bins, i.e. higher bins are better. In subsequent iterations, however, these values may change, and, in non-SD-using organisms, bin 0 (no RBS) may emerge as the highest scoring.
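As a rough illustration of how the spacer column and the initial bin ordering interact, the sketch below groups the 27 RBS bins of Table 2 by spacer range and returns the highest-numbered bin whose range contains a given distance. It makes a strong simplifying assumption: the motif itself is ignored, because the motif sequences are not reproduced in the table here, so a real assignment would also require a motif match. The names `SPACER_BINS` and `best_initial_bin` are hypothetical.

```python
# Spacer ranges (in bp, inclusive) and the bins from Table 2 that use them.
SPACER_BINS = {
    (3, 4): [1, 5, 8, 11, 18, 21, 23, 26],
    (5, 10): [9, 13, 14, 15, 16, 19, 22, 24, 27],
    (11, 12): [4, 6, 7, 12, 17, 20, 25],
    (13, 15): [2, 3, 10],
}


def best_initial_bin(spacer_bp: int) -> int:
    """Highest-numbered bin whose spacer range contains spacer_bp.

    Assumption: motif matching is skipped entirely. Bin 0 (no RBS) is
    returned when no spacer range fits.
    """
    for (low, high), bins in SPACER_BINS.items():
        if low <= spacer_bp <= high:
            return max(bins)  # initially, higher bins are better
    return 0


if __name__ == "__main__":
    for distance in (4, 7, 12, 14, 20):
        print(distance, best_initial_bin(distance))
```

This reflects only the initial ordering; as the caption notes, the values may change in subsequent iterations.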
Figure 2. Illustration of the dynamic programming connections in Prodigal. The red arrows represent gene connections, and the black arrows represent intergenic connections. (a) 5' forward to 3' forward: Gene on the forward strand. (b) 3' forward to 5' forward: Intergenic space between two forward strand genes. (c) 3' forward to 3' forward: Overlapping genes on the forward strand. (d) 3' forward to 5' reverse: Forward and reverse strand genes whose 3' ends overlap. (e) 5' reverse to 3' reverse: Intergenic space between two reverse strand genes. (f) 3' reverse to 5' reverse: Gene on the reverse strand. (g) 3' reverse to 3' reverse: Overlapping genes on the reverse strand. (h) 5' reverse to 5' forward: Intergenic space between two opposite strand genes. (i) 3' forward to 3' reverse: Intergenic space between two opposite strand genes.
Gene Prediction Performance
| Organism | %GC | Verified | Prodigal 1.20 | Prodigal 1.20+TriTisa | Prodigal 1.20+TiCo | GeneMarkHMM 2.6 | EasyGene 1.2 | Glimmer 3.02 | MED 2.0 |
|---|---|---|---|---|---|---|---|---|---|
| | 50.8 | 884 | 884/853 | 884/840 | 884/843 | 882/835 | 880/809 | 880/804 | 875/810 |
| | 68.0 | 550 | 549/533 | 549/525 | 549/520 | 548/510 | 544/494 | 549/478 | 531/418 |
| | 63.4 | 321 | 320/314 | 320/314 | 320/313 | 321/307 | 314/300 | 320/304 | 315/265 |
| | 43.5 | 148 | 148/144 | 148/145 | 148/144 | 147/145 | 144/139 | 144/140 | 146/142 |
| | 56.3 | 131 | 131/128 | 131/127 | 131/128 | 130/123 | 130/124 | 130/121 | 131/116 |
| | 47.8 | 102 | 102/99 | 102/98 | 102/93 | 102/92 | 101/87 | 102/84 | 100/88 |
| | 66.6 | 122 | 118/116 | 118/113 | 118/115 | 115/105 | 122/112 | 120/113 | 117/113 |
| | 65.6 | 62 | 62/58 | 62/58 | 62/57 | 61/54 | 62/58 | 61/55 | 60/56 |
| | 38.2 | 67 | 67/66 | 67/67 | 67/67 | 67/65 | 67/67 | 67/65 | 66/65 |
| | 35.8 | 56 | 56/51 | 56/49 | 56/49 | 56/48 | 56/51 | 56/49 | 56/50 |
| All Genomes | --- | 2443 | 2437/2362 | 2437/2336 | 2437/2329 | 2429/2284 | 2420/2241 | 2429/2213 | 2397/2123 |
Table 3 shows the performance of gene-finding algorithms on ten sets of experimentally verified genes with experimentally verified translation initiation sites. The first number in each entry indicates the number of 3' ends of genes correctly identified. The second number in each entry indicates the number of 5'+3' ends (genes and their correct starts) exactly identified. Beneath these numbers are % representations for each of those values. The final row shows the performance over the entire set of organisms.
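The percentage values mentioned in the caption do not appear in the table as reproduced here, but they can be recomputed from the counts. The sketch below does this for an entry of the form "3' ends / 5'+3' ends", using the "All Genomes" row (2443 verified genes) as input. The function name `entry_percentages` and the rounding to one decimal place are assumptions made for illustration.

```python
def entry_percentages(entry: str, verified: int) -> tuple[float, float]:
    """Convert a "3' ends/5'+3' ends" table entry to percentages of verified genes.

    Rounding to one decimal place is an illustrative choice.
    """
    three_prime, both_ends = (int(part) for part in entry.split("/"))
    return (
        round(100 * three_prime / verified, 1),
        round(100 * both_ends / verified, 1),
    )


if __name__ == "__main__":
    verified_total = 2443  # "All Genomes" row, Verified column
    print("Prodigal 1.20:", entry_percentages("2437/2362", verified_total))
    print("MED 2.0:", entry_percentages("2397/2123", verified_total))
```

The same calculation applies to any single-organism row by passing that row's Verified count.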
Comparison with Genbank Annotations
| Organism | Genbank Genes with no Joins | Prodigal 1.20 | Prodigal 1.20+TiCo | Prodigal 1.20+TriTisa | GeneMarkHMM 2.6 | Glimmer 3.02 | EasyGene 1.2 | MED 2.0 |
|---|---|---|---|---|---|---|---|---|
| | 4268 | 4118/3823 | 4118/3779 | 4118/3778 | 4122/3685 | 4076/3563 | 3977/3565 | 4102/3711 |
| | 2110 | 2062/1857 | 2062/1809 | 2061/1790 | 2042/1676 | 2054/1609 | 2018/1692 | 2008/1469 |
| | 2661 | 2630/2398 | 2630/2358 | 2630/2348 | 2624/2251 | 2622/2220 | 2548/2271 | 2586/1953 |
| | 4174 | 4113/3705 | 4113/3678 | 4113/3679 | 4136/3713 | 4102/3569 | 3977/3578 | 4127/3596 |
| | 1699 | 1670/1430 | 1670/1363 | 1670/1353 | 1672/1364 | 1671/1317 | 1652/1389 | 1689/1309 |
| | 3171 | 3146/2587 | 3146/2364 | 3146/2447 | 3124/2337 | 3123/2236 | 3053/2288 | 3126/2192 |
| | 5565 | 5514/5038 | 5514/4885 | 5514/4821 | 5484/4698 | 5491/4705 | 5522/4761 | 5292/4539 |
Table 4 shows the performance of gene-finding algorithms on seven Genbank files. The first number in each entry indicates the number of 3' ends of genes correctly identified. The second number in each entry indicates the number of 5'+3' ends (genes and their correct starts) exactly identified. Beneath these numbers are % representations for each of those values. It should be noted that Genbank genes are not experimentally verified; this table is just meant to provide a snapshot of performance over entire genomes.
Number of Genes Predicted By Each Method
| Organism | Genbank | EasyGene 1.2 | Prodigal 1.20 | GeneMarkHMM 2.6 | Glimmer 3.02 | MED 2.0 |
|---|---|---|---|---|---|---|
| | 4321 (1.00) | 4099 (0.95) | 4305 (1.00) | 4378 (1.01) | 4476 (1.04) | 4811 (1.11) |
| | 2110 (1.00) | 2097 (0.99) | 2101 (1.00) | 2085 (0.99) | 2141 (1.01) | 2385 (1.13) |
| | 2661 (1.00) | 2587 (0.97) | 2678 (1.01) | 2685 (1.01) | 2720 (1.02) | 3111 (1.17) |
| | 4177 (1.00) | 4019 (0.96) | 4224 (1.01) | 4354 (1.04) | 4429 (1.06) | 4601 (1.10) |
| | 1700 (1.00) | 1686 (0.99) | 1717 (1.01) | 1738 (1.02) | 1789 (1.05) | 2419 (1.42) |
| | 3172 (1.00) | 3089 (0.97) | 3306 (1.04) | 3462 (1.09) | 3677 (1.16) | 3778 (1.19) |
| | 5566 (1.00) | 5910 (1.06) | 5679 (1.02) | 5712 (1.03) | 5878 (1.06) | 6709 (1.21) |
Table 5 shows the number of genes predicted by each method on seven Genbank files. The Genbank column indicates the number of genes in the Genbank file. The number in parentheses indicates the number of predicted genes divided by the number of genes in the Genbank file, e.g. 1.10 indicates 10% more genes predicted than the Genbank file.
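The parenthesised ratios in Table 5 can be reproduced by dividing each predicted count by the Genbank count for the same row. The sketch below formats one cell that way, using the first row of the table as input; the function name `format_cell` is hypothetical.

```python
def format_cell(predicted: int, genbank: int) -> str:
    """Format a Table 5 cell as "count (ratio)", with the ratio to two decimals."""
    return f"{predicted} ({predicted / genbank:.2f})"


if __name__ == "__main__":
    genbank_genes = 4321  # first row, Genbank column
    first_row = {
        "EasyGene 1.2": 4099,
        "Prodigal 1.20": 4305,
        "GeneMarkHMM 2.6": 4378,
        "Glimmer 3.02": 4476,
        "MED 2.0": 4811,
    }
    for method, predicted in first_row.items():
        print(method, format_cell(predicted, genbank_genes))
```

A ratio above 1.00 means the method predicted more genes than the Genbank file contains, and a ratio below 1.00 means fewer.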