William L Trimble, Kevin P Keegan, Mark D'Souza, Andreas Wilke, Jared Wilkening, Jack Gilbert, Folker Meyer.
Abstract
BACKGROUND: Gene prediction algorithms (or gene callers) are an essential tool for analyzing shotgun nucleic acid sequence data. Gene prediction is a ubiquitous step in sequence analysis pipelines; it reduces the volume of data by identifying the most likely reading frame for a fragment, permitting the out-of-frame translations to be ignored. In this study we evaluate five widely used ab initio gene-calling algorithms (FragGeneScan, MetaGeneAnnotator, MetaGeneMark, Orphelia, and Prodigal) for accuracy on short (75-1000 bp) fragments containing sequence error, drawn from previously published artificial data and "real" metagenomic datasets.
Year: 2012 PMID: 22839106 PMCID: PMC3526449 DOI: 10.1186/1471-2105-13-183
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Running times per gigabase of sequence data on a single 2 GHz processor
| Gene caller | Method | Abbreviation | Running time |
| FragGeneScan | Hidden Markov model | FGS3, FGS5 | 6 hours |
| MetaGeneAnnotator | Codon usage + start site heuristics | MGA | 15 min |
| MetaGeneMark | Codon usage + GC-content heuristics | MGM | 20 min |
| Orphelia | Neural network | OPH | 13 hours |
| Prodigal | Codon usage + dynamic programming | PRD | 30 min |
Compared with downstream analyses, ab initio gene calling is computationally inexpensive.
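As a rough illustration of what these per-gigabase figures imply for a whole dataset, the sketch below scales the running times from the table above to an arbitrary data volume. It assumes running time grows linearly with the amount of sequence, which the source does not state; the function name `estimated_hours` is hypothetical.

```python
# Running times per gigabase on a single 2 GHz processor, taken from the
# table above and converted to hours.
HOURS_PER_GIGABASE = {
    "FragGeneScan": 6.0,
    "MetaGeneAnnotator": 15 / 60,
    "MetaGeneMark": 20 / 60,
    "Orphelia": 13.0,
    "Prodigal": 30 / 60,
}


def estimated_hours(gene_caller: str, gigabases: float) -> float:
    """Estimate single-processor running time in hours.

    ASSUMPTION: running time scales linearly with data volume.
    """
    if gigabases < 0:
        raise ValueError("gigabases must be non-negative")
    return HOURS_PER_GIGABASE[gene_caller] * gigabases


if __name__ == "__main__":
    # Example: a hypothetical 10-gigabase dataset.
    for name in HOURS_PER_GIGABASE:
        print(f"{name}: {estimated_hours(name, 10):.1f} h")
```

Under the linearity assumption, the slowest and fastest tools in the table differ by a factor of about 50 on the same input.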
Figure 1. Reading frame accuracy as a function of fragment length for fragments at varying insertion/deletion error rates. (A) Error-free fragments. (B) Fragments with 0.2% insertion/deletion errors. (C) Fragments with 0.5% insertion/deletion errors. (D) Fragments with 2.8% insertion/deletion errors. For error-free fragments, longer fragments result in more accurate predictions.
Accuracy, sensitivity, specificity, and PPV for benchmark datasets with simulated 454-style errors
| Error rate | FGS | MGA | MGM | OPH | PRD |
| Accuracy | |||||
| 0.00% | 91.0% | 93.6% | 94.5% | 90.5% | 91.7% |
| 0.20% | 87.8% | 81.3% | 82.3% | 76.0% | 78.9% |
| 0.50% | 83.8% | 69.7% | 70.5% | 62.9% | 66.1% |
| 2.80% | 58.4% | 25.9% | 26.0% | 22.1% | 23.2% |
| Sensitivity | |||||
| 0.00% | 95.2% | 94.7% | 95.9% | 92.8% | 95.0% |
| 0.20% | 91.5% | 80.8% | 82.1% | 76.3% | 80.7% |
| 0.50% | 87.1% | 67.7% | 68.6% | 61.3% | 66.5% |
| 2.80% | 59.7% | 18.3% | 18.2% | 15.0% | 19.4% |
| Specificity | |||||
| 0.00% | 59.0% | 84.2% | 82.9% | 71.8% | 68.5% |
| 0.20% | 59.7% | 85.1% | 84.0% | 73.3% | 66.7% |
| 0.50% | 58.8% | 85.9% | 85.6% | 75.5% | 66.4% |
| 2.80% | 49.0% | 89.2% | 89.8% | 81.5% | 65.8% |
| Positive Predictive Value | |||||
| 0.00% | 91.6% | 96.1% | 96.4% | 93.2% | 94.0% |
| 0.20% | 88.9% | 90.9% | 91.2% | 86.5% | 86.6% |
| 0.50% | 85.5% | 86.0% | 85.9% | 80.2% | 78.1% |
| 2.80% | 62.1% | 58.0% | 56.5% | 44.0% | 35.4% |
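The four metrics in this table follow from the counts of true positives, true negatives, false positives, and false negatives. The sketch below uses the standard textbook definitions. How the source assigns each prediction to one of these four classes is not specified here, so the function name and the example counts are hypothetical.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, sensitivity, specificity, and PPV from raw counts.

    Standard definitions; a metric whose denominator is zero is reported
    as None rather than raising an error.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in classification_metrics(tp=90, tn=5, fp=3, fn=2).items():
        print(f"{name}: {value:.3f}")
```

Sensitivity and specificity can move in opposite directions: a caller that predicts fewer genes misses more true genes while also making fewer false calls.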
Figure 2. Receiver operating characteristics for three gene callers at varying rates of error. (A) Three rates of insertion/deletion error in 317 bp fragments. (B) Three rates of substitution error in 700 bp fragments. Colors and symbols indicate gene callers; line style (solid, dashed, dotted) indicates the simulated error rate of the dataset. The default operating point is the rightmost point on each graph. MetaGeneMark and Orphelia do not output confidence scores; consequently their performance is indicated by only one point per error-rate dataset.
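A receiver operating characteristic of this kind can be traced by sweeping a threshold over the confidence scores that a gene caller attaches to its predictions. The sketch below is a generic illustration, not the procedure used in the source: the function name, the scores, and the labels are hypothetical. A caller that reports no scores yields a single operating point.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    scores: confidence score of each prediction (higher means more confident)
    labels: True where the prediction is correct, False otherwise
    Thresholds are swept from strictest to most permissive, so the last
    point corresponds to accepting every prediction.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


if __name__ == "__main__":
    # Hypothetical scores and labels, for illustration only.
    print(roc_points([0.9, 0.8, 0.4, 0.3], [True, False, True, False]))
```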
Figure 3. Example fragment containing an insertion. (A) The FASTA header for an error-free artificial fragment from E. sibiricum that contained a single, artificial insertion near the center of the fragment. This insertion disrupted gene prediction in all five gene callers. (B) RefSeq annotations show this fragment is entirely contained within one annotated gene, though it is artificially split into two reading frames. (C) FragGeneScan predicts an insertion in the wrong place, leading to seven nonsense amino acids adjacent to the insertion. The other gene prediction tools predict one or two shorter fragments, one of which has nonsense residues at the end of the prediction. The alignment-based evaluation technique would count all five as true positives because of the length of the correctly translated regions; the reading-frame technique would count all but FGS as false negatives because of their failure to correctly translate the middle of the fragment.
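To see why a single inserted base splits a gene into two reading frames, the sketch below cuts a sequence into codons before and after an insertion. The sequence and the helper names `codons` and `insert_base` are hypothetical and serve only to illustrate the frameshift; they are not taken from the source.

```python
def codons(sequence: str) -> list:
    """Split a sequence into consecutive triplets, dropping any incomplete tail."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


def insert_base(sequence: str, position: int, base: str) -> str:
    """Return the sequence with one base inserted before the given 0-based position."""
    return sequence[:position] + base + sequence[position:]


if __name__ == "__main__":
    original = "ATGGCCAAATTT"                  # hypothetical coding fragment
    mutated = insert_base(original, 6, "G")   # single-base insertion
    print(codons(original))
    print(codons(mutated))
```

Codons upstream of the insertion are unchanged, while every codon downstream is read in a shifted frame. A translation in a single frame is therefore correct on only one side of the error.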
Figure 4. Predicted coding fractions as a function of position in read for three metagenomic datasets. The black lines are proportional to the read-length histograms. All the gene predictors predict fewer genes at the end of fragments. Compare with Additional file 3: Figure S2.
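One plausible way to obtain a per-position coding fraction is sketched below: for each position, count the reads long enough to reach it and the subset predicted as coding there. The source does not describe the exact computation behind the figure, so the input format (0-based, end-exclusive intervals) and the function name are assumptions made for illustration.

```python
def coding_fraction_by_position(reads):
    """Fraction of reads predicted as coding at each position.

    reads: list of (read_length, [(start, end), ...]) tuples, where each
    interval is a predicted gene in 0-based, end-exclusive coordinates
    (ASSUMED format). Position p is averaged only over reads longer than p.
    """
    if not reads:
        return []
    max_len = max(length for length, _ in reads)
    covered = [0] * max_len  # reads that extend to each position
    coding = [0] * max_len   # reads predicted as coding at each position

    for length, genes in reads:
        flags = [False] * length
        for start, end in genes:
            # Clip each predicted gene to the bounds of the read.
            for pos in range(max(start, 0), min(end, length)):
                flags[pos] = True
        for pos in range(length):
            covered[pos] += 1
            coding[pos] += flags[pos]

    return [c / n for c, n in zip(coding, covered)]


if __name__ == "__main__":
    # Two hypothetical reads: one of length 4, one of length 2.
    print(coding_fraction_by_position([(4, [(0, 3)]), (2, [(1, 2)])]))
```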