Stephen J Goodswen, Paul J Kennedy, John T Ellis.
Abstract
Next generation sequencing technology is advancing genome sequencing at an unprecedented level. By unravelling the code within a pathogen's genome, every possible protein (prior to post-translational modifications) can theoretically be discovered, irrespective of life cycle stages and environmental stimuli. Now more than ever there is a great need for high-throughput ab initio gene finding. Ab initio gene finders use statistical models to predict genes and their exon-intron structures from the genome sequence alone. This paper evaluates whether existing ab initio gene finders can effectively predict genes to deduce proteins that have so far escaped capture by laboratory techniques. An aim here is to identify possible patterns of prediction inaccuracies for gene finders as a whole, irrespective of the target pathogen. All currently available ab initio gene finders are considered in the evaluation, but only four fulfil high-throughput capability: AUGUSTUS, GeneMark_hmm, GlimmerHMM, and SNAP. These gene finders require training data specific to a target pathogen, and consequently the evaluation results are inextricably linked to the availability and quality of the data. The pathogen Toxoplasma gondii is used to illustrate the evaluation methods. The results support current opinion that exons predicted by ab initio gene finders are inaccurate in the absence of experimental evidence. However, the results reveal some patterns of inaccuracy that are common to all gene finders, and these inaccuracies may provide a focus area for future gene finder developers.
Year: 2012 PMID: 23226328 PMCID: PMC3511556 DOI: 10.1371/journal.pone.0050609
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Gene finders in chronological order based on release year.
| Year | Gene Finder Name | Type | Comments |
| 1991 | GRAIL | | No longer supported |
| 1992 | GeneID | | |
| 1993 | GeneParser | | |
| 1994 | Fgeneh | | Finds single exon only |
| 1996 | Genie | Hybrid | |
| 1996 | PROCRUSTES | Evidence based | |
| 1997 | Fgenes | Hybrid | No download version |
| 1997 | GeneFinder | | Unpublished work |
| 1997 | GenScan | | |
| 1997 | HMMGene | | No download version |
| 1997 | GeneWise | Evidence based | |
| 1998 | GeneMark.hmm | | |
| 2000 | GenomeScan | Comparative | |
| 2001 | Twinscan | Comparative | |
| 2002 | GAZE | Comparative | |
| 2004 | Ensembl | Evidence based | |
| 2004 | GeneZilla/TIGRSCAN | | No longer supported |
| 2004 | GlimmerHMM | | |
| 2004 | SNAP | | |
| 2006 | AUGUSTUS+ | Hybrid | |
| 2006 | N-SCAN | Comparative | |
| 2006 | Twinscan_EST | Comparative+Evidence | |
| 2006 | N_Scan_EST | Comparative+Evidence | |
| 2007 | Conrad | | |
| 2007 | Contrast | | |
| 2009 | mGene | | No longer supported |
Hybrid = ab initio and evidence based; Comparative = genome sequence comparison.
Figure 1. Example of exon location file.
The first column is the feature name. The second and third columns define the start and end locations of the exon relative to the “Einit” feature. The last column is the name of the gene sequence to which the exon locations relate.
Figure 2. Example of flanking nucleotide bases appended to coding segment.
It is required that a set number of nucleotide bases are added before and after the coding segment (CDS) sequence when assembling training genes.
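The flanking step described in this caption can be sketched as follows. This is a minimal illustration, not the procedure used by any of the training programs: the function name `extract_with_flanks`, the 1-based inclusive coordinates, and the truncation of flanks at the sequence ends are all assumptions made for the example.

```python
def extract_with_flanks(sequence: str, cds_start: int, cds_end: int, flank: int) -> str:
    """Return the coding segment plus up to `flank` bases on each side.

    Assumptions for this sketch: coordinates are 1-based and inclusive,
    and a flank is shortened when it would run past either sequence end.
    """
    if flank < 0:
        raise ValueError("flank must be non-negative")
    if not (1 <= cds_start <= cds_end <= len(sequence)):
        raise ValueError("CDS coordinates fall outside the sequence")
    left = max(cds_start - 1 - flank, 0)        # 0-based slice start
    right = min(cds_end + flank, len(sequence))  # 0-based slice end (exclusive)
    return sequence[left:right]


# Toy sequence: the CDS "ATGGGGTAA" occupies positions 7 to 15.
toy = "AAACCCATGGGGTAATTT"
with_flanks = extract_with_flanks(toy, 7, 15, 3)
print(with_flanks)
```

With a flank of 3, the call returns the coding segment preceded and followed by three bases of surrounding sequence.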
Figure 3. Schematic representation of gene prediction evaluation at the nucleotide level.
Abbreviations: C = coding nucleotide located on exon, N = non-coding nucleotide located on intron, TP = true positive, FP = false positive, TN = true negative, and FN = false negative.
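A nucleotide-level comparison of this kind can be sketched in a few lines. The sketch assumes the convention commonly used in gene-prediction evaluation, SN = TP / (TP + FN) and SP = TP / (TP + FP); the record does not state the formulas, so treat both, along with the function name `nucleotide_accuracy`, as assumptions.

```python
def nucleotide_accuracy(annotated: str, predicted: str) -> dict:
    """Compare two equal-length label strings of 'C' (coding) and 'N' (non-coding).

    Assumed convention: SN = TP / (TP + FN), SP = TP / (TP + FP).
    """
    if len(annotated) != len(predicted):
        raise ValueError("label strings must have the same length")
    # Map each (annotated, predicted) label pair to its outcome.
    outcome = {("C", "C"): "TP", ("N", "C"): "FP", ("N", "N"): "TN", ("C", "N"): "FN"}
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for pair in zip(annotated, predicted):
        if pair not in outcome:
            raise ValueError(f"unexpected label pair: {pair}")
        counts[outcome[pair]] += 1
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    counts["SN"] = tp / (tp + fn) if tp + fn else 0.0
    counts["SP"] = tp / (tp + fp) if tp + fp else 0.0
    return counts


result = nucleotide_accuracy("NNCCCCNNNN", "NCCCNNNNCC")
print(result)
```

In this toy case two of the four coding bases are recovered (SN = 0.5) and two of the five predicted coding bases are correct (SP = 0.4).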
Figure 4. Schematic representation of gene prediction evaluation at the exon level.
Exons are represented by shaded rectangles. Introns are represented by the adjoining solid lines. Abbreviations: TP = true positive, FP = false positive, and FN = false negative.
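An exon-level count can be sketched in the same way. The example assumes that a predicted exon is a true positive only when both of its boundaries match an annotated exon exactly, which is a common convention but is not spelled out in this record; the function name `exon_accuracy` and the coordinates are hypothetical.

```python
def exon_accuracy(annotated, predicted):
    """Score predicted exons against annotated exons.

    Each exon is a (start, end) tuple. Assumption: only an exact match of
    both boundaries counts as a true positive.
    """
    annotated_set, predicted_set = set(annotated), set(predicted)
    tp = len(annotated_set & predicted_set)   # exact boundary matches
    fp = len(predicted_set - annotated_set)   # predicted but not annotated
    fn = len(annotated_set - predicted_set)   # annotated but not predicted
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "SN": sn, "SP": sp}


annotated_exons = [(100, 200), (300, 400), (500, 650)]
predicted_exons = [(100, 200), (310, 400), (500, 650), (700, 750)]
exon_result = exon_accuracy(annotated_exons, predicted_exons)
print(exon_result)
```

Under this strict definition the exon (310, 400) counts as a false positive even though it overlaps an annotated exon, which is one reason exon-level scores sit well below nucleotide-level scores.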
Figure 5. The seven classifications of the predicted gene locations relative to a test gene.
Evaluation of gene finders at the gene, exon, and nucleotide level (with 250, 500, and 1000 training genes).
| Gene Finder | # of flanking bases | GENE 250 SN | GENE 250 SP | GENE 500 SN | GENE 500 SP | GENE 1000 SN | GENE 1000 SP | EXON 250 SN | EXON 250 SP | EXON 500 SN | EXON 500 SP | EXON 1000 SN | EXON 1000 SP | NUCLEOTIDE 250 SN | NUCLEOTIDE 250 SP | NUCLEOTIDE 500 SN | NUCLEOTIDE 500 SP | NUCLEOTIDE 1000 SN | NUCLEOTIDE 1000 SP |
| gl | 250 | 0.11 | 0.11 | 0.13 | 0.13 | 0.12 | 0.11 | 0.24 | 0.26 | | | 0.25 | 0.27 | 0.52 | 0.49 | 0.56 | 0.56 | 0.52 | 0.48 |
| | 500 | 0.11 | 0.11 | 0.13 | 0.12 | 0.13 | 0.12 | 0.23 | 0.25 | 0.26 | 0.29 | 0.26 | 0.29 | 0.51 | 0.48 | 0.55 | 0.53 | 0.54 | 0.50 |
| | 1000 | 0.11 | 0.11 | | | | | 0.23 | 0.26 | 0.26 | 0.30 | | 0.29 | 0.51 | 0.47 | 0.56 | | | |
| snap | 250 | 0.11 | 0.07 | 0.15 | 0.10 | 0.15 | 0.10 | 0.40 | 0.27 | 0.42 | 0.30 | | | 0.44 | 0.37 | 0.45 | | | 0.37 |
| | 500 | 0.14 | 0.08 | 0.15 | 0.10 | | | 0.40 | 0.28 | 0.40 | 0.28 | 0.42 | 0.31 | 0.44 | 0.36 | 0.44 | 0.36 | 0.46 | 0.36 |
| | 1000 | 0.13 | 0.08 | 0.14 | 0.10 | 0.15 | 0.10 | 0.37 | 0.26 | 0.32 | 0.23 | 0.37 | 0.27 | 0.43 | 0.35 | 0.42 | 0.32 | 0.41 | 0.32 |
| aug | N/A | 0.24 | 0.28 | 0.27 | 0.31 | | | 0.44 | 0.52 | 0.47 | 0.52 | | | 0.78 | 0.72 | 0.80 | 0.76 | | |
Test genes (299 in total) are excluded from the training genes.
The values underlined indicate the highest accuracy in each accuracy level for each gene finder.
Abbreviations:
SN = sensitivity, SP = specificity.
gl = GlimmerHMM; aug = AUGUSTUS.
N/A = not applicable – the AUGUSTUS training program does not give the option to control the number of bases that precede and follow the coding segment (CDS) sequences of the training genes.
Evaluation of gene finders with various training genes.
| Gene Finder | Training genes | GENE SN | GENE SP | EXON SN | EXON SP | NUCLEOTIDE SN | NUCLEOTIDE SP | Predicted | Matched | Duplicate |
| GlimmerHMM | All validated genes except test genes | 0.16 | 0.15 | 0.27 | 0.30 | 0.54 | 0.50 | 684 | 269 (47) | 52 |
| | All validated genes including test genes | 0.20 | 0.20 | 0.33 | 0.35 | 0.61 | 0.55 | 710 | 273 (64) | 47 |
| | Using a trained model from program creator | Not available for | | | | | | | | |
| | Using a model trained on human genes | 0.02 | 0.01 | 0.04 | 0.05 | 0.23 | 0.14 | 1129 | 247 (5) | 131 |
| SNAP | All validated genes except test genes | 0.18 | 0.12 | 0.44 | 0.33 | 0.46 | 0.35 | 889 | 277 (53) | 172 |
| | All validated genes including test genes | 0.18 | 0.12 | 0.46 | 0.35 | 895 | 279 (54) | 170 | | |
| | Using a trained model from program creator | Not available for | | | | | | | | |
| | Using a model trained on human genes | 0.09 | 0.04 | 0.06 | 0.09 | 0.16 | 0.11 | 1759 | 267 (25) | 315 |
| AUGUSTUS | All validated genes except test genes | 0.33 | 0.38 | 0.54 | 0.57 | 0.81 | 0.78 | 510 | 261 (99) | 2 |
| | All validated genes including test genes | 0.37 | 0.42 | 0.57 | 0.59 | 0.82 | 0.79 | 514 | 265 (111) | 2 |
| | Using a trained model from program creator | 0.36 | 0.42 | 0.57 | 0.56 | 0.78 | 0.84 | 470 | 256 (108) | 0 |
| | Using a model trained on human genes | 0.12 | 0.09 | 0.19 | 0.19 | 0.34 | 0.25 | 114 | 282 (37) | 150 |
| GeneMark_hmm | Using a trained model from program creator | 0.06 | 0.07 | 0.15 | 0.13 | 0.43 | 0.37 | 580 | 240 (19) | 49 |
| GeneMark_hmm ES | Using a self-training procedure, i.e. no training genes required | 0.08 | 0.09 | 0.23 | 0.19 | 0.56 | 0.44 | 630 | 248 (25) | 45 |
The types of training genes used in the training model. The number of validated genes = 3,432 (includes test genes) and the number of test genes = 299.
Number of predicted genes that align entirely or partly with the test genes and meet the criteria E-value = 0 and 100% coverage – a value in brackets is the number of predicted genes that are exactly the same as the test genes i.e. the start and end genomic coordinates of each exon is the same as each test gene exon.
Number of predicted genes that align to the same test gene i.e. the predicted gene is only a part of the entire test gene and there can be one or more predictions per test gene.
Number of matching predicted genes with 299 test genes using BLASTN (with 250, 500, and 1000 training genes).
| Gene Finder | # of flanking bases | Predicted, 250 training genes | Matched, 250 training genes | Duplicate, 250 training genes | Predicted, 500 training genes | Matched, 500 training genes | Duplicate, 500 training genes | Predicted, 1000 training genes | Matched, 1000 training genes | Duplicate, 1000 training genes |
| gl | 250 | 575 | 255 (34) | 43 | 627 | 261 (40) | 44 | 666 | 260 (35) | 58 |
| | 500 | 579 | 254 (33) | 45 | 631 | | 44 | 668 | | 48 |
| | 1000 | 594 | 256 (33) | 47 | 640 | 259 (43) | 44 | 659 | | 45 |
| snap | 250 | 882 | 273 (34) | 192 | 891 | 271 (45) | 186 | 892 | | 169 |
| | 500 | 880 | 266 (40) | 193 | 851 | 265 (44) | 187 | 862 | 270 (48) | 172 |
| | 1000 | 824 | 256 (39) | 184 | 829 | 255 (43) | 189 | 838 | 262 (45) | 182 |
| aug | N/A | 485 | 248 (72) | 5 | 506 | 256 (82) | 6 | 508 | | 2 |
Abbreviations:
gl = GlimmerHMM; aug = AUGUSTUS.
N/A = not applicable – the AUGUSTUS training program does not give the option to control the number of bases that precede and follow the coding segment (CDS) sequence of the training genes.
Number of predicted genes that align entirely or partly with the test genes and meet the criteria E-value = 0 and 100% coverage – a value in brackets is the number of predicted genes that are exactly the same as the test genes i.e. each exon genomic coordinate is the same.
Number of predicted genes that align to the same test gene i.e. the predicted gene is only a part of the entire test gene and there can be one or more predictions per test gene.
The values underlined indicate the highest number of matches for each gene finder.
Protein homology search on translated gene finder predictions.
| Gene Finder | Gene predictions | Homology found | Homology not found |
| AUGUSTUS | 514 | 509 | 5 |
| GeneMark.hmm | 580 | 481 | 99 |
| SNAP | 895 | 734 | 161 |
| GlimmerHMM | 710 | 657 | 53 |
Includes duplicate proteins. Duplicate proteins are when several gene predictions match to the same protein.
Identical proteins found in protein database per number of gene finders.
| No. of proteins | No. of Gene Finders | Gene Finders |
| 923 | 4 | AUGUSTUS, Glimmer, GeneMark, SNAP |
| 257 | 3 | AUGUSTUS, Glimmer, GeneMark |
| 84 | 3 | Glimmer, GeneMark, SNAP |
| 25 | 3 | AUGUSTUS, Glimmer, SNAP |
| 8 | 3 | AUGUSTUS, GeneMark, SNAP |
| 57 | 2 | Glimmer, GeneMark |
| 43 | 2 | AUGUSTUS, Glimmer |
| 25 | 2 | Glimmer, SNAP |
| 14 | 2 | AUGUSTUS, SNAP |
| 8 | 2 | AUGUSTUS, GeneMark |
| 8 | 2 | GeneMark, SNAP |
| 23 | 1 | SNAP |
| 27 | 1 | AUGUSTUS |
| 34 | 1 | GeneMark |
| 67 | 1 | Glimmer |
Statistics for predicted and test genes.
| Statistics for … | AUGUSTUS | GlimmerHMM | SNAP | GeneMark_hmm | Test genes |
| Number of genes | 514 | 710 | 895 | 580 | 299 |
|
| |||||
| Shortest | 270 | 201 | 399 | 303 | 298 |
| Longest | 44325 | 37271 | 22713 | 45369 | 47133 |
| Average | 5733 | 5677 | 4679 | 7979 | 5388 |
| Range | 44055 | 37070 | 22314 | 45066 | 46835 |
| Number of genes containing an N | 10 | 17 | 22 | 19 | 5 |
|
| |||||
| Shortest | 29 | 0 | 0 | 104 | 52 |
| Longest | 31658 | 8549 | 21815 | 4677 | 106560 |
| Average | 3894 | 1398 | 3112 | 664 | 11081 |
| Range | 31629 | 8549 | 21813 | 4573 | 106508 |
| Percentage of overlaps | 0 | 0.4 | 26.9 | 0 | 0 |
|
| |||||
| Length of chromosome | 5023922 | 5023922 | 5023922 | 5023922 | 5023922 |
| Start of first gene | 54055 | 635 | 54055 | 7447 | 78150 |
| End of last gene | 5002530 | 5020498 | 5023141 | 5020134 | 5002376 |
| Range | 4948475 | 5019863 | 4969086 | 5012687 | 4924226 |
| Distance to start of chromosome | 54055 | 635 | 54055 | 7447 | 78150 |
| Distance to end of chromosome | 21392 | 3424 | 781 | 3788 | 21546 |
|
| |||||
| Partition 1 | 26.1 | 26.3 | 25.1 | 27.1 | 26.4 |
| Partition 2 | 24.7 | 23.3 | 25.6 | 24.5 | 25.7 |
| Partition 3 | 24.5 | 25.1 | 24.5 | 24.1 | 21.4 |
| Partition 4 | 24.7 | 25.2 | 24.8 | 24.3 | 26.4 |
|
| |||||
| Total number | 3357 | 3334 | 4746 | 4172 | 2013 |
| Shortest exon | 3 | 5 | 5 | 7 | 3 |
| Longest exon | 9981 | 9981 | 9977 | 9985 | 9981 |
| Average length | 403 | 448 | 364 | 380 | 350 |
| Average number per gene | 7 | 5 | 6 | 8 | 7 |
| Highest number per gene | 46 | 31 | 29 | 47 | 47 |
| Lowest number per gene | 1 | 1 | 2 | 1 | 1 |
| Number of single exons | 67 | 123 | 0 | 63 | 39 |
|
| |||||
| Total number | 2844 | 2624 | 3851 | 3592 | 1714 |
| Shortest intron | 43 | 4 | 4 | 23 | 51 |
| Longest intron | 5834 | 5707 | 6734 | 9961 | 3560 |
| Average length | 560 | 968 | 640 | 848 | 530 |
| Average number per gene | 6 | 4 | 5 | 7 | 6 |
| Highest number per gene | 45 | 30 | 28 | 46 | 46 |
| Lowest number per gene | 1 | 1 | 1 | 1 | 1 |
The target chromosome was divided into four equal parts (partitions 1 to 4). The genomic location of each gene prediction determined the relevant partition allocation.
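The partition allocation described above can be sketched as follows. The record does not say which coordinate of a gene decides its partition, so the sketch assumes the gene start position; the function names and the example start positions are hypothetical, while the chromosome length of 5023922 is taken from the table.

```python
from collections import Counter


def partition_of(position: int, chromosome_length: int, parts: int = 4) -> int:
    """Return the 1-based partition that a 1-based position falls into."""
    if not 1 <= position <= chromosome_length:
        raise ValueError("position falls outside the chromosome")
    # Integer arithmetic avoids rounding problems at partition boundaries.
    return (position - 1) * parts // chromosome_length + 1


def partition_percentages(gene_starts, chromosome_length: int, parts: int = 4):
    """Percentage of genes per partition, assuming the gene start decides it."""
    total = len(gene_starts)
    if total == 0:
        return [0.0] * parts
    counts = Counter(partition_of(s, chromosome_length, parts) for s in gene_starts)
    return [100.0 * counts[p] / total for p in range(1, parts + 1)]


CHROMOSOME_LENGTH = 5023922  # length of the target chromosome from the table
example_starts = [54055, 1300000, 2600000, 5002530]  # illustrative positions only
percentages = partition_percentages(example_starts, CHROMOSOME_LENGTH)
print(percentages)
```

A roughly even spread across the four partitions, as reported in the table, suggests that none of the gene finders favours one region of the chromosome.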
Comparison of genomic start and end locations of gene predictions with 299 test genes.
| Classification | gm | aug | gl | snap | all | aug:gl:snap | aug:gl | aug:snap | aug:gm | gl:snap |
| Start and End | 31 | 152 | 93 | 102 | 89 | 112 | 127 | 125 | 116 | 109 |
| Start | 55 | 47 | 76 | 69 | 70 | 64 | 60 | 65 | 58 | 68 |
| End | 57 | 57 | 85 | 82 | 104 | 92 | 81 | 84 | 82 | 98 |
| Totally Within | 27 | 4 | 21 | 90 | 75 | 47 | 29 | 21 | 39 | 51 |
| Totally Over | 116 | 7 | 27 | 27 | 3 | 3 | 4 | 4 | 6 | 7 |
| Overlaps Start | 42 | 6 | 29 | 49 | 7 | 5 | 7 | 6 | 10 | 13 |
| Overlaps End | 40 | 5 | 27 | 61 | 7 | 7 | 7 | 6 | 10 | 11 |
|
| ||||||||||
| Predictions | 580 | 514 | 710 | 895 | 666 | 624 | 594 | 584 | 585 | 730 |
| Test genes identified | 299 | 273 | 297 | 283 | 267 | 271 | 273 | 271 | 268 | 281 |
| Test genes not identified | 0 | 26 | 2 | 16 | 32 | 28 | 26 | 28 | 31 | 18 |
| Matches with test genes (includes partial predictions) | 368 | 278 | 358 | 480 | 355 | 330 | 315 | 311 | 321 | 357 |
| Partial predictions | 69 | 5 | 61 | 197 | 88 | 59 | 42 | 40 | 53 | 76 |
| Non-matches | 212 | 236 | 352 | 415 | 311 | 294 | 279 | 273 | 264 | 373 |
Abbreviations:
gm = GeneMark_hmm, aug = AUGUSTUS, gl = GlimmerHMM.
See Figure 5 for explanation on classifications.
Number of predicted genes that predict part of an entire gene such that there can be more than one prediction to the same test gene.
Number of predictions that did not overlap the test genes in any way.
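The seven classifications in the table can be sketched as a comparison of two coordinate pairs. Figure 5 itself is not reproduced in this record, so the definitions below are an assumed reading of the class names rather than the authors' exact rules; the function name `classify_location` is hypothetical and coordinates are treated as inclusive.

```python
def classify_location(pred_start: int, pred_end: int, test_start: int, test_end: int):
    """Classify a predicted gene relative to a test gene.

    Returns one of the seven class names, or None when the two do not overlap.
    The class definitions are assumptions inferred from the class names.
    """
    if pred_end < test_start or pred_start > test_end:
        return None  # no overlap: counted as a non-match
    same_start = pred_start == test_start
    same_end = pred_end == test_end
    if same_start and same_end:
        return "Start and End"
    if same_start:
        return "Start"
    if same_end:
        return "End"
    if pred_start > test_start and pred_end < test_end:
        return "Totally Within"
    if pred_start < test_start and pred_end > test_end:
        return "Totally Over"
    if pred_start < test_start:
        return "Overlaps Start"
    return "Overlaps End"


# Test gene at 100 to 200, compared with several hypothetical predictions.
for prediction in [(100, 200), (100, 150), (120, 180), (50, 250), (50, 150), (300, 400)]:
    print(prediction, classify_location(*prediction, 100, 200))
```

Under this reading, several "Totally Within" predictions can map to the same test gene, which corresponds to the partial predictions counted in the table.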
Comparison of test genes not identified by gene finders.
| Statistics for … | AUGUSTUS | GlimmerHMM | SNAP | Test genes |
| Test genes not identified | 26 | 2 | 16 | 299 |
| Reverse strand | 16 | 2 | 12 | 153 |
| Consecutive groups | 3 | 0 | 2 | – |
| Highest consecutive number | 4 | 0 | 3 | – |
| Number containing an N | 0 | 0 | 0 | 5 |
|
| ||||
| Average | 1861 | 573 | 1996 | 5733 |
| Shortest | 342 | 492 | 342 | 298 |
| Longest | 7332 | 654 | 7332 | 47133 |
|
| ||||
| Shortest | 52 | 248 | 52 | 52 |
| Longest | 69635 | 7237 | 69635 | 106560 |
| Average | 11127 | 2271 | 10515 | 11081 |
|
| ||||
| Shortest exon | 14 | 492 | 14 | 3 |
| Longest exon | 4149 | 654 | 1827 | 9981 |
| Average length | 214 | 573 | 119 | 350 |
| Average number per gene | 4 | 1 | 5 | 7 |
| Highest number per gene | 15 | 1 | 15 | 47 |
| Lowest number per gene | 1 | 1 | 1 | 1 |
| Number of single exons | 13 | 2 | 9 | 39 |
|
| ||||
| Shortest intron | 51 | 0 | 51 | 51 |
| Longest intron | 1074 | 0 | 1074 | 3560 |
| Average length | 68 | 0 | 43 | 530 |
| Average number per gene | 3 | 0 | 4 | 6 |
| Highest number per gene | 14 | 0 | 14 | 46 |
| Lowest number per gene | 1 | 0 | 1 | 1 |
Number of groups of test genes not found in which the test genes are located consecutively along the chromosome.
The highest number of test genes in a consecutive group.
Commonality of test genes not identified by gene finders.
| Commonality | Number of genes not found | Single exon gene | Reverse strand | % less than average length |
| AUGUSTUS, Glimmer, SNAP | 1 | 1 | 1 | 89 |
| AUGUSTUS, SNAP | 13 | 7 | 9 | 64 |
| AUGUSTUS, Glimmer | 1 | 1 | 1 | 91 |
The percentage less than the average length of all the test genes.
Comparison of genomic start and end locations of exon predictions with exons in test genes (values are percentages).
| Classification | GeneMark_hmm | GeneMark_hmm ES | AUGUSTUS | GlimmerHMM | SNAP |
| 1. Start and End | 16 | 23 | 57 | 33 | 44 |
| 2. Start | 13 | 15 | 7 | 12 | 19 |
| 3. End | 1 | 3 | 2 | 1 | 2 |
| 4. Totally Within | 3 | 3 | 1 | 2 | 3 |
| 5. Totally Over | 9 | 9 | 2 | 5 | 6 |
| 6. Overlaps Start | 11 | 8 | 5 | 7 | 9 |
| 7. Overlaps End | 7 | 6 | 6 | 6 | 8 |
| Number of exons not classified (no overlap) | 40 | 33 | 20 | 34 | 9 |
Figure 6. Number of BLASTX hits using DNA consensus sequences from AUGUSTUS and GlimmerHMM predictions.
The figure shows the BLASTX hits when using the consensus of predicted sequences from AUGUSTUS and GlimmerHMM as queries in an attempt to find novel Toxoplasma gondii proteins. These consensus sequences were derived from aligning predicted DNA sequences based on overlapping genomic locations (see text for details).
Accuracy of predictions from previous studies (grouped according to target organism).
| Gene finder | Gene SN | Gene SP | Exon SN | Exon SP | Nucleotide SN | Nucleotide SP | Organism | Publication |
| SNAP | 0.54 | 0.47 | 0.83 | 0.81 | 0.97 | 0.95 | | SNAP creator |
| GlimmerHMM | 33% | | 0.71 | 0.79 | 96% | | | gl creator |
| GlimmerHMM | 21% | | 0.36 | 0.49 | 91% | | | gl creator |
| SNAP | 0.51 | 0.38 | 0.79 | 0.67 | 0.94 | 0.87 | | SNAP creator |
| AUGUSTUS | 0.51 | 0.32 | 0.77 | 0.68 | 0.92 | 0.89 | | SNAP creator |
| AUGUSTUS | 0.68 | 0.38 | 0.85 | 0.86 | 0.98 | 0.93 | | aug creator |
| AUGUSTUS | – | – | – | – | 0.92 | 0.88 | | gm creator |
| SNAP | – | – | – | – | 0.94 | 0.86 | | gm creator |
| GeneMark_hmm | | | | | 0.93 | 0.88 | | gm creator |
| AUGUSTUS | 0.47 | 0.51 | 0.71 | 0.79 | – | – | | Independent |
| AUGUSTUS | 0.48 | 0.47 | 0.80 | 0.81 | 0.93 | 0.90 | | aug creator |
| AUGUSTUS | 0.24 | 0.17 | 0.52 | 0.63 | 0.78 | 0.75 | | Independent |
| AUGUSTUS | – | – | 0.64 | 0.63 | 0.81 | 0.78 | | Independent |
| GeneMark_hmm | 0.17 | 0.08 | 0.48 | 0.47 | 0.76 | 0.62 | | Independent |
| GlimmerHMM | – | – | 0.69 | 0.63 | 0.89 | 0.79 | | Independent |
| SNAP | – | – | 0.40 | 0.36 | 0.72 | 0.71 | | Independent |
| AUGUSTUS | 0.37 | 0.38 | 0.57 | 0.59 | 0.82 | 0.79 | | This paper |
| GeneMark_hmm | 0.06 | 0.07 | 0.15 | 0.13 | 0.43 | 0.37 | | This paper |
| GlimmerHMM | 0.20 | 0.20 | 0.33 | 0.35 | 0.61 | 0.55 | | This paper |
| SNAP | 0.18 | 0.12 | 0.44 | 0.33 | 0.46 | 0.35 | | This paper |
% indicates the percentage of genes and nucleotides predicted exactly. There were no SN or SP values for GlimmerHMM at the gene and nucleotide level.
– No values available.