Ate van der Burgt, Edouard Severing, Jérôme Collemare, Pierre J G M de Wit.
Abstract
BACKGROUND: Automated gene-calling is still an error-prone process, particularly for the highly plastic genomes of fungal species. Improvement through quality control and manual curation of gene models is a time-consuming process that requires skilled biologists and is only marginally performed. The wealth of available fungal genomes has not yet been exploited by an automated method that applies quality control of gene models in order to obtain more accurate genome annotations.
Year: 2014 PMID: 24433567 PMCID: PMC3898260 DOI: 10.1186/1471-2105-15-19
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Flow diagram of informant gene selection for the alignment-based fungal gene prediction (ABFGP) method.
Figure 2. Flow diagram of the ABFGP method.
Figure 3. ABFGP-based curation of the MFS transporter-encoding gene Cf189922. A. Selected tracks of the GFF results obtained by applying ABFGP to the Cf189922 gene locus using 17 fungal informant genes. The annotated gene model (blue) and the ABFGP-predicted gene model (green and grey) are shown on top. The grey part of the ABFGP prediction indicates an intron-exon boundary with status ‘doubtful’. Below these, the revised introns (orange) and exons (red) are indicated; the red box highlights the site of the second revision. The intron evidence track lists intron-exon boundaries obtained from informants; the colours used in the informant gene similarity track represent a measure of pairwise amino acid similarity. The alignment similarity track is a summed representation of the inferred multiple sequence alignment of all informants. B. Multiple protein sequence alignment of the currently annotated gene models of Cf189922 and its informants. The sequence is restricted to the red box shown in panel A. C. Multiple protein sequence alignment of the ABFGP-revised gene model of Cf189922 and its informants. The sequence is restricted to the red box shown in panel A. The proposed revision is highlighted in the black box.
Table 1. Benchmarking of the ABFGP performance on validated genes compared to GeneMark-ES
| Method | | ABFGP | ABFGP | GeneMark-ES | ABFGP | GeneMark-ES |
| # unigenes | | 6,965 | 956 | 169 | 1,154 | 327 |
| Intron | Sn | 91.16 | 91.5 | 89.3 | 92.2 | 90.7 |
| | Pr | 97.08 | 97.4 | 90.5 | 98.2 | 94.3 |
| Exon | Sn | 88.54 | 89.1 | 88.0 | 90.4 | 85.4 |
| | Pr | 98.91 | 99.4 | 89.1 | 98.9 | 87.9 |
| Nucleotide | Sn | 98.75 | 98.3 | 98.2 | 99.3 | 98.8 |
| | Pr | 99.08 | 99.3 | 97.1 | 99.0 | 97.1 |
| Gene3 | Sn | 79.4 (5,533) | 81.7 (781) | n.a. | 82.1 (947) | n.a. |
Sensitivity (Sn) and precision (Pr) of the gene model components (introns, exons, nucleotides) are expressed as percentages. Sn is calculated as true positives divided by (true positives + false negatives), and Pr as true positives divided by (true positives + false positives) [3].
1 A list of all ten fungal species and results per species are provided in Additional file 4.
2 Formerly named Magnaporthe grisea.
3 The gene sensitivity is the percentage of gene models that are predicted without a single error. The total number of correctly predicted gene models is given in brackets. Gene sensitivity was not provided for GeneMark-ES.
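The definitions of Sn, Pr and gene sensitivity given in the footnotes can be written out as a minimal Python sketch. The function names are illustrative and are not taken from the ABFGP software; the gene-level counts are the bracketed values and unigene totals from the table above.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Sn = TP / (TP + FN), expressed as a percentage."""
    return 100.0 * tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Pr = TP / (TP + FP), expressed as a percentage."""
    return 100.0 * tp / (tp + fp)


def gene_sensitivity(correct_models: int, total_models: int) -> float:
    """Percentage of gene models predicted without a single error."""
    return 100.0 * correct_models / total_models


# (correctly predicted gene models, number of unigenes) for the three ABFGP columns
abfgp_gene_counts = [(5533, 6965), (781, 956), (947, 1154)]

for correct, total in abfgp_gene_counts:
    print(f"{correct}/{total}: {gene_sensitivity(correct, total):.1f}%")
```

Running the loop gives 79.4%, 81.7% and 82.1%, which matches the Gene Sn row of the table.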
Table 2. Gene models in six fungal species re-annotated by the ABFGP method
| Sequence technology | Sanger | 454 | Illumina/454/Sanger | Sanger | Sanger | Sanger | ||||||
| Fold genome coverage | 4.5 | 21 | 34 | 7.1 | 7.5 | 8.9 | ||||||
| # Annotated genes | 16,448 | 14,127 | 12,580 | 10,313 | 10,535 | 10,952 | ||||||
| Annotation pipeline2 | BROAD | GeneMark-ES | JGI | JGI | BROAD | JGI | ||||||
| Annotation year3 | 2005 | 2009 | 2010 | ≤2008 | 2008 | 2008 | ||||||
| Reference | [ | [ | [ | n.a. | [ | [ | ||||||
| Total eligible gene models | 8,503 | 7,574 | 8,090 | 7,283 | 8,362 | 7,893 | ||||||
| Bi Directional Best Hit | 7,165 | 6,990 | 7,511 | 6,773 | 7,814 | 7,317 | ||||||
| Gene Model Error | 1,338 | 584 | 579 | 510 | 548 | 576 | ||||||
| Confirmed/unchanged4 | 4,832 | 57% | 5,823 | 77% | 6,249 | 77% | 4,775 | 66% | 5,390 | 64% | 5,262 | 67% |
| Revised4 | 3,505 | 41% | 1,724 | 23% | 1,770 | 22% | 2,456 | 34% | 2,870 | 34% | 2,553 | 32% |
| Bi Directional Best Hit5 | 2,481 | 35% | 1,304 | 19% | 1,404 | 19% | 2,064 | 30% | 2,511 | 32% | 2,137 | 29% |
| Gene Model Error5 | 1,024 | 77% | 420 | 72% | 366 | 63% | 392 | 77% | 359 | 66% | 416 | 72% |
| Aborted4 | 166 | 2.0% | 27 | 0.4% | 71 | 0.9% | 52 | 0.7% | 102 | 1.2% | 78 | 1.0% |
1 Formerly named Mycosphaerella graminicola.
2 Sequencing centre that sequenced and annotated the genome (BROAD Institute or Joint Genome Institute); C. fulvum was sequenced at Wageningen University and annotated using GeneMark-ES version 2.2 [20].
3 Estimated year in which the gene calling was performed.
4 Number and percentage of all gene models in this category.
5 Number and percentage of revised gene models in this category.
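The percentages in this table use two different denominators (footnotes 4 and 5). The short Python check below, using the first species column, is a sketch of how the counts appear to fit together; the variable names and the helper `pct` are illustrative only.

```python
def pct(part: int, whole: int) -> int:
    """Percentage of `part` in `whole`, rounded to a whole number."""
    return round(100 * part / whole)


# Counts from the first species column of the table
eligible = 8503
bdbh, gme = 7165, 1338                  # Bi Directional Best Hit / Gene Model Error
confirmed, revised, aborted = 4832, 3505, 166
revised_bdbh, revised_gme = 2481, 1024

# The categories partition the eligible gene models
assert bdbh + gme == eligible
assert confirmed + revised + aborted == eligible
assert revised_bdbh + revised_gme == revised

# Footnote 4: percentage of all eligible gene models
print(pct(confirmed, eligible), pct(revised, eligible))

# Footnote 5: percentage of the gene models within each category
print(pct(revised_bdbh, bdbh), pct(revised_gme, gme))
```

The first line prints 57 and 41, the second 35 and 77, in agreement with the table.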
Table 3. Introspection of results obtained by the ABFGP method
| Total number of assessed genes3 | 8,337 | | 7,547 | | 8,019 | | 7,231 | | 8,260 | | 7,815 | | 6,965 |
| Confirmed/unchanged | 4,832 | | 5,823 | | 6,249 | | 4,775 | | 5,390 | | 5,262 | | Correct |
| Labeled ‘ok’4 | 3,942 | 82% | 5,186 | 89% | 5,505 | 88% | 4,216 | 88% | 4,536 | 84% | 4,539 | 86% | 5,015 (TP) |
| Labeled ‘doubtful’4 | 890 | 18% | 637 | 11% | 744 | 12% | 559 | 12% | 854 | 16% | 723 | 14% | 533 (FN) |
| Revised | 3,505 | | 1,724 | | 1,770 | | 2,456 | | 2,870 | | 2,553 | | Incorrect |
| Labeled ‘ok’4 | 2,137 | 61% | 1,160 | 67% | 1,209 | 68% | 1,730 | 70% | 1,864 | 65% | 1,734 | 68% | 899 (FP) |
| Labeled ‘doubtful’4 | 1,368 | 39% | 564 | 33% | 561 | 32% | 726 | 30% | 1,006 | 35% | 819 | 32% | 496 (TN) |
1 Formerly named Mycosphaerella graminicola.
2 Correctly predicted gene models (benchmarked on the full-length unigenes) that were labelled by the introspection procedure as ‘ok’ are true positives (TP) and labelled ‘doubtful’ are false negatives (FN). Genes that were incorrectly predicted and were labelled ‘ok’ are false positives (FP) and labelled ‘doubtful’ are true negatives (TN).
3 Total eligible number of genes minus number of genes aborted during execution (Table 2).
4 Number and percentage of genes that are labelled ‘ok’ and ‘doubtful’ by the introspection procedure in each category.
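The TP/FN/FP/TN assignment described in footnote 2 can be sketched in Python as follows. The function `confusion_label` is a hypothetical helper, not part of ABFGP. The Sn and Pr of the ‘ok’ label computed at the end are plain arithmetic on the benchmark counts in the last column and are not figures reported in this table.

```python
def confusion_label(prediction_correct: bool, introspection: str) -> str:
    """Map a benchmarked gene model and its introspection label to TP/FN/FP/TN."""
    if introspection not in ("ok", "doubtful"):
        raise ValueError(f"unknown introspection label: {introspection!r}")
    if prediction_correct:
        return "TP" if introspection == "ok" else "FN"
    return "FP" if introspection == "ok" else "TN"


# Benchmark counts from the last column of the table
counts = {"TP": 5015, "FN": 533, "FP": 899, "TN": 496}

# How often a correct model is labelled 'ok', and how often 'ok' is right
ok_sensitivity = 100.0 * counts["TP"] / (counts["TP"] + counts["FN"])
ok_precision = 100.0 * counts["TP"] / (counts["TP"] + counts["FP"])

print(confusion_label(True, "doubtful"))
print(f"Sn = {ok_sensitivity:.1f}%, Pr = {ok_precision:.1f}%")
```

Under these counts, roughly 90% of correct predictions receive the label ‘ok’, and roughly 85% of predictions labelled ‘ok’ are correct.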
Table 4. Types of revisions in annotated gene models made by the ABFGP method
| Total revised genes2 | 3,473 | | 1,721 | | 1,761 | | 2,448 | | 2,865 | | 2,552 | |
| Genes containing SE and/or DMs3 | 353 | | 333 | | 176 | | 127 | | 515 | | 66 | |
| Genes split by ABFGP | 195 | | 183 | | 62 | | 91 | | 94 | | 130 | |
| Genes merged by ABFGP | 102 | | 12 | | 16 | | 27 | | 19 | | 28 | |
| Total annotated exons | 12,967 | | 5,675 | | 5,211 | | 7,525 | | 10,709 | | 8,316 | |
| Unrevised | 5,970 | | 2,372 | | 2,078 | | 2,593 | | 4,956 | | 3,116 | |
| Boundary revision4 | 4,851 | | 2,274 | | 2,341 | | 3,230 | | 4,355 | | 3,357 | |
| 5’ or 3’ removed (−) / added (+)5,6 | −783 | +617 | −451 | +252 | −265 | +224 | −297 | +341 | −529 | +333 | −415 | +335 |
| Internal removed (−) / added (+)5,7 | −51 | +616 | −20 | +98 | −24 | +35 | −59 | +74 | −66 | +346 | −76 | +75 |
| Total annotated introns | 9,459 | | 3,947 | | 3,438 | | 5,058 | | 7,838 | | 5,753 | |
| Unrevised | 4,907 | | 2,019 | | 1,740 | | 2,276 | | 4,048 | | 2,836 | |
| Boundary revision8 | 1,799 | | 692 | | 727 | | 889 | | 1,738 | | 839 | |
| Stopless 3n removed (−) / added (+)9 | −447 | +189 | −166 | +146 | −365 | +146 | −1,032 | +99 | −331 | +244 | −953 | +130 |
1 Formerly named Mycosphaerella graminicola.
2 The total number of revisions can exceed the total number of revised genes because a gene model can contain more than one revision.
3 Genes for which the revision(s) include sequence errors or mutations.
4 Exons whose start and/or end coordinate differs between the two gene models.
5 Exons incorporated in only one of the two gene models (not in the ABFGP model/only in the ABFGP model).
6 Omitted and additional exons in recognized false gene splits and fusions were not counted.
7 (Large) intron in one gene model, split into two smaller introns with an intermediary (small) exon in the other gene model.
8 Introns whose donor and/or acceptor site differs between the two gene models.
9 Stopless 3n introns incorporated in only one of the two gene models (not in the ABFGP model/only in the ABFGP model).
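The exon categories in footnotes 4 and 5 can be illustrated with a small Python sketch that compares the exons of an annotated gene model with those of a revised model. This is not the procedure used by the authors. It assumes that exons are given as inclusive (start, end) coordinates and that a ‘boundary revision’ means an annotated exon that overlaps an exon of the other model without sharing both coordinates. The function names and the example coordinates are hypothetical.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive coordinates on the same strand


def overlaps(a: Exon, b: Exon) -> bool:
    """True if the two coordinate ranges share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_exons(annotated: List[Exon], predicted: List[Exon]) -> Dict[str, List[Exon]]:
    """Sort exons into unrevised, boundary-revised, removed and added."""
    predicted_set = set(predicted)
    result: Dict[str, List[Exon]] = {
        "unrevised": [], "boundary_revision": [], "removed": [], "added": [],
    }
    for exon in annotated:
        if exon in predicted_set:
            result["unrevised"].append(exon)          # identical in both models
        elif any(overlaps(exon, p) for p in predicted):
            result["boundary_revision"].append(exon)  # start and/or end differs
        else:
            result["removed"].append(exon)            # only in the annotated model
    for exon in predicted:
        if not any(overlaps(exon, a) for a in annotated):
            result["added"].append(exon)              # only in the revised model
    return result


# Hypothetical coordinates, for illustration only
annotated_model = [(100, 250), (300, 420), (500, 640), (900, 980)]
revised_model = [(100, 250), (310, 420), (500, 640), (700, 760)]

for category, exons in classify_exons(annotated_model, revised_model).items():
    print(category, exons)
```

In this example two exons are unrevised, one has a shifted start coordinate, one annotated exon is dropped and one new exon is added. The same comparison applied to donor and acceptor sites would correspond to the intron categories of footnote 8.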