Michael E Wall, Sindhu Raghavan, Judith D Cohn, John Dunbar.
Abstract
Recent studies have noted extensive inconsistencies in gene start sites among orthologous genes in related microbial genomes. Here we provide the first documented evidence that imposing gene start consistency improves the accuracy of gene start-site prediction. We applied an algorithm using a genome majority vote (GMV) scheme to increase the consistency of gene starts among orthologs. We used a set of validated Escherichia coli genes as a standard to quantify accuracy. Results showed that the GMV algorithm can correct hundreds of gene prediction errors in sets of five or ten genomes while introducing few errors. Using a conservative calculation, we project that GMV would resolve many inconsistencies and errors in publicly available microbial gene maps. Our simple and logical solution provides a notable advance toward accurate gene maps.
Year: 2011 PMID: 22131910 PMCID: PMC3219611 DOI: 10.1371/journal.pcbi.1002284
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. Flow diagram for the pipeline implementing the Genome Majority Vote algorithm.
Individual steps A–E are explained in the text (Methods).
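The majority-vote idea can be illustrated with a minimal sketch. It is not the authors' pipeline, whose steps are described in Methods. It assumes that each predicted start in an ortholog set has already been mapped to a column of a multiple sequence alignment; the function name `majority_vote_starts`, the input layout, and the strict-majority rule are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote_starts(start_columns):
    """Return (consensus_column, genomes_to_change) for one ortholog set.

    start_columns maps a genome name to the alignment column of its
    predicted start (hypothetical layout, assumed for illustration).
    """
    counts = Counter(start_columns.values())
    column, votes = counts.most_common(1)[0]
    # Assumed rule: act only on a strict majority, otherwise leave the set alone.
    if votes * 2 <= len(start_columns):
        return None, []
    outliers = sorted(g for g, c in start_columns.items() if c != column)
    return column, outliers


# Toy ortholog set: four genomes agree on column 12, one disagrees.
example = {"genome_A": 12, "genome_B": 12, "genome_C": 12,
           "genome_D": 30, "genome_E": 12}
consensus, to_change = majority_vote_starts(example)
print(consensus, to_change)
```

In this toy case only `genome_D` would be a candidate for a start-site change. A real implementation would also need to confirm that a valid start codon exists at the consensus position.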
Figure 2. Example of a GMV modification of gene starts that is typical in terms of ortholog sequence identity, change in the length of the gene, and the start codon before and after the change.
Table 1. Consistency statistics for ortholog sets.
| | 5 genomes | | | | 10 genomes | | | |
| | Low | Medium | High | Very High | Low | Medium | High | Very High |
| Total # of ortholog sets generated in the pipeline | 3633 | 2446 | 1414 | 988 | 3271 | 2133 | 1317 | 380 |
| # of ortholog sets for which Prodigal starts were initially inconsistent | 213 (5.9%) | 536 (21.9%) | 574 (40.6%) | 547 (55.4%) | 251 (7.7%) | 614 (28.8%) | 634 (48.1%) | 235 (61.8%) |
| # of ortholog sets for which Prodigal starts were already consistent | 3420 (94.1%) | 1910 (78.1%) | 840 (59.4%) | 441 (44.6%) | 3020 (92.3%) | 1519 (71.2%) | 683 (51.9%) | 145 (38.2%) |
| # of inconsistent ortholog sets that were made consistent by GMV | 74 (34.7%) | 278 (51.9%) | 204 (35.5%) | 74 (13.5%) | 89 (35.5%) | 286 (46.6%) | 227 (35.8%) | 31 (13.2%) |
| # of ortholog sets with consistent starts after GMV | 3494 (96.2%) | 2188 (89.5%) | 1044 (73.8%) | 515 (52.1%) | 3109 (95.0%) | 1805 (84.6%) | 910 (69.1%) | 176 (46.3%) |
| # of ortholog sets with at least one consistent start | 3626 (99.8%) | 2428 (99.3%) | 1326 (93.8%) | 863 (87.3%) | 3269 (99.9%) | 2098 (98.4%) | 1215 (92.3%) | 310 (81.6%) |
The genomes in each set are listed in Supplementary Table S1.
Percentage is with respect to total # of ortholog sets generated in the pipeline.
Percentage is with respect to # of ortholog sets for which Prodigal starts were initially inconsistent.
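The two percentage bases in the footnotes can be checked with a short calculation. The sketch below uses the 5-genome, medium-diversity column of the table above; the helper `pct` and the variable names are illustrative only.

```python
def pct(part, whole):
    """Percentage rounded to one decimal place, as reported in the table."""
    return round(100.0 * part / whole, 1)


# 5 genomes, medium diversity (values from the table above).
total_sets = 2446
initially_inconsistent = 536
initially_consistent = total_sets - initially_inconsistent  # 1910
made_consistent_by_gmv = 278

# Relative to all ortholog sets generated in the pipeline.
pct_inconsistent = pct(initially_inconsistent, total_sets)
pct_consistent_after = pct(initially_consistent + made_consistent_by_gmv, total_sets)

# Relative to the initially inconsistent sets only.
pct_fixed = pct(made_consistent_by_gmv, initially_inconsistent)

print(pct_inconsistent, pct_fixed, pct_consistent_after)
```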
Codon change statistics for GMV start site changes in medium and high diversity genome test sets.
| | | 5 genomes | | 10 genomes | |
| Codon before change | Codon after change | Medium | High | Medium | High |
| | | 243 | 184 | 354 | 249 |
| | | 47 | 26 | 66 | 41 |
| | | 16 | 10 | 33 | 26 |
| | | 31 | 15 | 42 | 22 |
| | | 8 | 5 | 5 | 7 |
| | | 0 | 1 | 0 | 1 |
| | | 9 | 10 | 14 | 11 |
| | | 3 | 1 | 7 | 5 |
| | | 0 | 0 | 1 | 1 |
| Total Changes | | 357 | 252 | 522 | 363 |
| Same codon | | 251 | 189 | 360 | 257 |
| Different codon | | 106 | 63 | 162 | 106 |
Validation statistics for ortholog sets.
| | 5 genomes | | | | 10 genomes | | | |
| | Low | Medium | High | Very High | Low | Medium | High | Very High |
| # of ortholog sets for which | 833 | 683 | 457 | 274 | 800 | 618 | 414 | 129 |
| # of ortholog sets for which | 825 (99.0%) | 613 (89.8%) | 382 (83.6%) | 245 (89.4%) | 787 (98.4%) | 546 (88.3%) | 329 (79.5%) | 107 (82.9%) |
| # of ortholog sets for which | 8 (0.96%) | 70 (10.2%) | 75 (16.4%) | 29 (10.6%) | 13 (1.63%) | 72 (11.7%) | 85 (20.5%) | 22 (17.1%) |
| # of ortholog sets with start sites matching a validated | 799 (95.9%) | 664 (97.2%) | 444 (97.2%) | 271 (98.9%) | 769 (96.1%) | 602 (97.4%) | 406 (98.1%) | 126 (97.7%) |
| # of ortholog sets with start sites matching a validated | 792 (96.0%) | 609 (99.3%) | 381 (99.7%) | 245 (100%) | 760 (96.6%) | 544 (99.6%) | 328 (99.7%) | 107 (100%) |
| # of ortholog sets with start sites matching a validated | 7 (87.5%) | 55 (78.6%) | 63 (84.0%) | 26 (89.7%) | 9 (69.2%) | 58 (80.6%) | 78 (91.8%) | 19 (86.3%) |
Percentage is with respect to total # of ortholog sets.
Percentage is with respect to # of ortholog sets for which E. coli validation was available and for which all Prodigal predictions were already consistent. This represents accuracy of the consistent subset.
Percentage is with respect to # of ortholog sets for which E. coli validation was available and for which Prodigal predictions were inconsistent. This represents accuracy of the inconsistent subset.
Validation statistics for GMV algorithm corrections to Prodigal gene maps.
| | 5 genomes | | | | 10 genomes | | | |
| | Low | Medium | High | Very High | Low | Medium | High | Very High |
| # of ortholog sets with an incorrect | 34 | 19 | 13 | 3 | 31 | 16 | 8 | 3 |
| # of corrected validated starts in | 0 | 9 | 11 | 3 | 1 | 8 | 7 | 3 |
| # of | 1 | 2 | 2 | 1 | 0 | 0 | 0 | 0 |
| Error Rate (E) | 1.00 | 0.182 | 0.154 | 0.25 | 0.5 | 0.111 | 0.125 | 0.25 |
| Sensitivity (S) | 0 | 0.474 | 0.846 | 1.0 | 0.032 | 0.5 | 0.875 | 1.00 |
| Total # of changes in | 13 | 51 | 41 | 12 | 9 | 38 | 21 | 4 |
| Total # of changes in all genomes | 92 | 357 | 252 | 88 | 169 | 522 | 363 | 40 |
| Total # of changes that agree with a validated start | 9 | 76 | 82 | 31 | 20 | 114 | 126 | 28 |
| Total # of changes that disagree with a validated start | 4 | 7 | 6 | 1 | 4 | 15 | 0 | 0 |
E = FP/(TP+FP), where TP = number of true positives (second row), and FP = number of false positives (third row).
Estimated by adding one additional false positive to obtain a nonzero value.
S = TP/GP, where TP = number of true positives (second row), and GP = number of ground truth positives (first row).
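The error rate and sensitivity definitions above can be written out as a short sketch. The function names are illustrative, and applying the one-false-positive adjustment whenever FP is zero is an assumption based on the footnote, not a statement of how the authors implemented it.

```python
def error_rate(tp, fp):
    """E = FP / (TP + FP).

    Assumption: when no false positives were observed, one is added so the
    estimate is nonzero, following the table footnote.
    """
    if fp == 0:
        fp = 1
    return fp / (tp + fp)


def sensitivity(tp, gp):
    """S = TP / GP, where GP is the number of ground truth positives."""
    return tp / gp


# 5 genomes, medium diversity: GP = 19, TP = 9, FP = 2.
print(round(error_rate(9, 2), 3), round(sensitivity(9, 19), 3))

# 10 genomes, medium diversity: GP = 16, TP = 8, FP = 0 (adjusted to 1).
print(round(error_rate(8, 0), 3), round(sensitivity(8, 16), 3))
```

With these inputs the sketch gives 0.182 and 0.474 for the first column and 0.111 and 0.5 for the second, matching the table.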
Figure 3. Impact of gene prediction changes in high diversity genome sets.
The numbers of correct and incorrect changes are estimated using validated starts in E. coli, as described in the text. A) E. coli; B) All genomes.
Ortholog set yield calculated for medium and high diversity genome test sets.
| | 5 genomes | | 10 genomes | |
| | Medium | High | Medium | High |
| Maximum possible # ortholog sets, M | 4282 | 4332 | 4151 | 3710 |
| Ortholog set yield, | 57.1% | 32.6% | 51.3% | 35.4% |
| Increase in consistency after applying GMV, | 11.4% | 14.4% | 13.4% | 17.2% |
Calculated as percentage of M using values from the first row of Table 1.
Calculated as percentage of the number of actual ortholog sets by subtracting the third from the fifth row of Table 1.
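Both derived quantities can be reproduced from Table 1 as described in the footnotes. The sketch below does so for the 5-genome, medium-diversity column; the variable names are illustrative only.

```python
# 5 genomes, medium diversity.
max_possible_sets = 4282      # M, from the table above
actual_sets = 2446            # Table 1, first row
consistent_before = 1910      # Table 1, third row
consistent_after = 2188       # Table 1, fifth row

# Ortholog set yield: actual sets as a percentage of M.
yield_pct = round(100.0 * actual_sets / max_possible_sets, 1)

# Increase in consistency: sets gained by GMV as a percentage of actual sets.
increase_pct = round(100.0 * (consistent_after - consistent_before) / actual_sets, 1)

print(yield_pct, increase_pct)
```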
Comparison of inconsistencies for Prodigal vs. GenBank or Glimmer3 start sites.
| | | 5 genomes Medium Diversity | 5 genomes High Diversity |
| Prodigal vs. GenBank | # of shared ortholog sets | 2289 | 1234 |
| | # of ortholog sets for which Prodigal starts were initially inconsistent | 455 | 413 |
| | # of ortholog sets for which GenBank starts were initially inconsistent | 925 | 311 |
| | # made consistent by Prodigal | 552 | 50 |
| | # made consistent by GMV | 194 | 47 |
| Prodigal vs. Glimmer3 | # of shared ortholog sets | 2427 | 1398 |
| | # of ortholog sets for which Prodigal starts were initially inconsistent | 532 | 566 |
| | # of ortholog sets for which Glimmer3 starts were initially inconsistent | 869 | 767 |
| | # made consistent by Prodigal | 432 | 248 |
| | # made consistent by GMV | 193 | 155 |
Median sequence identities among orthologs from all genome test sets.
| # Genomes, Diversity | Median Sequence Identity |
| 5, Low | 99.3% |
| 5, Medium | 85.2% |
| 5, High | 71.6% |
| 5, Very High | 64.4% |
| 10, Low | 98.8% |
| 10, Medium | 82.5% |
| 10, High | 69.5% |
| 10, Very High | 51.3% |
The sequence identity used is the minimum value among all gene pairs in each ortholog set. The percentage value is normalized using sequence length information (Methods).
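The summary statistic described above (minimum pairwise identity within each ortholog set, then the median across sets) can be sketched as follows. The pairwise identities are invented toy values, not data from the study, and are assumed to be already length-normalized as described in Methods; the function name `set_identity` is hypothetical.

```python
from statistics import median


def set_identity(pairwise_identities):
    """Identity assigned to an ortholog set: the minimum over all gene pairs."""
    return min(pairwise_identities.values())


# Toy ortholog sets; keys are genome pairs, values are percent identities.
ortholog_sets = [
    {("A", "B"): 90.0, ("A", "C"): 72.5, ("B", "C"): 80.0},
    {("A", "B"): 65.0, ("A", "C"): 70.0, ("B", "C"): 88.0},
    {("A", "B"): 95.0, ("A", "C"): 91.0, ("B", "C"): 93.0},
]

per_set = [set_identity(s) for s in ortholog_sets]
median_identity = median(per_set)
print(per_set, median_identity)
```

Taking the minimum makes each set only as similar as its most divergent gene pair, so the median reflects the hardest comparison in a typical set.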
Figure 4. Sequence identity statistics for the high diversity ortholog sets.
A) 5-genome test set; B) 10-genome test set.