Nicholas J Dimonaco1, Wayne Aubrey2, Kim Kenobi3, Amanda Clare2, Christopher J Creevey4.
Abstract
MOTIVATION: The biases in CoDing Sequence (CDS) prediction tools, which have been based on historic genomic annotations from model organisms, impact our understanding of novel genomes and metagenomes. This hinders the discovery of new genomic information because predictions are biased towards existing knowledge. To date, users have lacked a systematic and replicable approach to identify the strengths and weaknesses of any CDS prediction tool and so choose the right tool for their analysis.
Year: 2021 PMID: 34875010 PMCID: PMC8825762 DOI: 10.1093/bioinformatics/btab827
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
An overview of genome composition for the six MOs selected to evaluate CDS prediction tools compiled from data held by Ensembl bacteria
| Model organism [assembly] | Genome size (Mbp) | Genes [CDSs] | Genome density [CDSs] | GC content (%) |
|---|---|---|---|---|
| | 4.04 | 4133 | 88.91% | 43.89 |
| | 4.02 | 3875 | 90.60% | 67.21 |
| | 4.56 | 4257 | 86.28% | 50.80 |
| | 0.58 | 559 | 92.03% | 31.69 |
| | 6.06 | 5266 | 84.75% | 60.13 |
| | 2.76 | 2556 | 83.93% | 32.92 |
Note: Data are presented for all genes and CDS genes in bold square brackets. Note the relatively broad differences in genome size, gene density (percentage covered with annotation) and GC content.
Version number and reference for all tools used in this study
| No. | Tool name | Version | Reference |
|---|---|---|---|
| 1 | Augustus | 3.3.3 | |
| 2 | EasyGene | 1.2 | |
| 3 | GeneMark.hmm | 3.2.5 | |
| 4 | GeneMark | 2.5 | |
| 5 | FGENESB | ‘2020’ | |
| 6 | Prodigal | 2.6.3 | |
| 7 | GeneMarkS | 4.25 | |
| 8 | GeneMarkS 2 | ‘2020’ | |
| 9 | GLIMMER 3 | 3.02 | |
| 10 | GeneMark (H.A) | 3.25 | |
| 11 | TransDecoder | 5.5.0 | |
| 12 | FragGeneScan | 1.3.0 | |
| 13 | MetaGene | 2.24.0 | |
| 14 | MetaGeneMark | ‘2020’ | |
| 15 | MetaGene Annotator | 2008/8/19 | |
Note: Tools 1–5 inclusive are model-based tools. Tools 6–15 inclusive are ab initio-based tools. Where no version number is available, the year when the tool was used is listed in single quotes.
Fig. 1. Illustration of how predicted CDSs are classified as having detected or not detected the CEA genes. Predicted CDSs are compared to the genes held in Ensembl. (A) The predicted CDS covers at least 75% of the Ensembl gene and is in-frame with it, and is therefore recorded as detected. (B) The predicted CDS covers <75% of the Ensembl gene and is therefore recorded as not detected. (C) The predicted CDS covers part of an Ensembl gene but is out of frame (dotted outline) and is therefore recorded as missed. (D) The use of alternative stop codons causes the predicted CDS to be truncated or divided into two CDSs that span the Ensembl gene, which is therefore recorded as missed.
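The detection rule described in the Fig. 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Interval` type, the function names, the 1-based inclusive coordinates and the frame test (comparing stop-side coordinates modulo 3 on the same strand) are all assumptions introduced here. Each predicted CDS is also judged on its own, so a gene split into two predictions (case D) is simply reported as not detected for each fragment that falls below the threshold.

```python
from dataclasses import dataclass

# Hypothetical threshold name; the 75% value comes from the Fig. 1 caption.
MIN_COVERAGE = 0.75


@dataclass(frozen=True)
class Interval:
    """A genomic feature with 1-based inclusive coordinates (assumed convention)."""
    start: int
    end: int
    strand: str  # "+" or "-"

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of positions shared by two intervals (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def in_frame(pred: Interval, ref: Interval) -> bool:
    """Assumed frame test: same strand and stop-side coordinates congruent modulo 3."""
    if pred.strand != ref.strand:
        return False
    if ref.strand == "+":
        return (pred.end - ref.end) % 3 == 0
    return (pred.start - ref.start) % 3 == 0


def is_detected(pred: Interval, ref: Interval) -> bool:
    """True when pred covers at least 75% of ref and is in frame with it."""
    coverage = overlap_length(pred, ref) / ref.length
    return coverage >= MIN_COVERAGE and in_frame(pred, ref)


if __name__ == "__main__":
    gene = Interval(100, 399, "+")                    # made-up reference gene
    print(is_detected(Interval(130, 399, "+"), gene))  # case A: 90% coverage, in frame
    print(is_detected(Interval(250, 399, "+"), gene))  # case B: 50% coverage
    print(is_detected(Interval(101, 400, "+"), gene))  # case C: shifted by one base
```

The coordinates in the usage block are invented solely to exercise cases A to C of the figure.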
Fig. 2. The results of all 15 gene prediction tools (21 with chosen models) on the 6 MO genomes, ordered by the summed ranks across the 12 metrics. The Y axis shows the Percentage of Genes Detected (M1) by each tool in black and the Percentage of Perfect Matches (M5) in white. M5, which represents the ability of a tool to detect the correct start codon, varies more between the tools than M1. Each column on the X axis represents a different tool (some model-based tools were run multiple times). There is considerable variation in how well each tool performs across the different genomes, while all tools perform relatively poorly on the M. genitalium genome.
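Ordering tools by summed ranks, as in the Fig. 2 caption, can be illustrated with a short Python sketch. It is not the study's code and makes several assumptions: every metric is treated as higher-is-better, tied values share the best available rank, and ties in the summed rank are broken alphabetically. The tool names and scores below are hypothetical placeholders, not results from the paper.

```python
from typing import Dict, List

Scores = Dict[str, Dict[str, float]]  # metric name -> {tool name -> value}


def rank_descending(values: Dict[str, float]) -> Dict[str, int]:
    """Return 1-based ranks with the largest value ranked 1; ties share a rank."""
    ordered = sorted(values.values(), reverse=True)
    return {tool: ordered.index(value) + 1 for tool, value in values.items()}


def summed_ranks(scores: Scores) -> Dict[str, int]:
    """Sum each tool's rank over all metrics (a lower total is better)."""
    totals: Dict[str, int] = {}
    for per_tool in scores.values():
        for tool, rank in rank_descending(per_tool).items():
            totals[tool] = totals.get(tool, 0) + rank
    return totals


def order_by_summed_ranks(scores: Scores) -> List[str]:
    """Tools sorted from best to worst summed rank, ties broken by name."""
    totals = summed_ranks(scores)
    return sorted(totals, key=lambda tool: (totals[tool], tool))


if __name__ == "__main__":
    # Hypothetical values for two of the metrics named in the caption.
    example: Scores = {
        "M1": {"tool_a": 95.0, "tool_b": 90.0, "tool_c": 85.0},
        "M5": {"tool_a": 80.0, "tool_b": 70.0, "tool_c": 50.0},
    }
    print(order_by_summed_ranks(example))
```

Summing ranks rather than raw values keeps metrics on different scales from dominating the ordering, at the cost of discarding how large the gaps between tools are.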