Avril Coghlan, Tristan J Fiedler, Sheldon J McKay, Paul Flicek, Todd W Harris, Darin Blasiar, Lincoln D Stein.
Abstract
BACKGROUND: While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets across 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase.
Year: 2008 PMID: 19099578 PMCID: PMC2651883 DOI: 10.1186/1471-2105-9-549
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Participating groups and submitted gene sets.
| Research group | Gene prediction software | Gene sets submitted (category:count) |
| --- | --- | --- |
| Blasiar et al, Saint Louis, USA | GESECA (D. Blasiar, unpublished) | cat4:1 |
| Borodovsky et al, Atlanta, USA | GeneMark.hmm | cat1:1 |
| Brent et al, Saint Louis, USA | N-SCAN | cat2:1 |
| Durbin et al, Cambridge, UK | GENOMIX | cat4:2 |
| Guigó et al, Barcelona, Spain | GeneID¹, SGP2 | GeneID: cat1:1, cat4:2; SGP2: cat2:1 |
| Korf et al, Davis, USA | SNAP | cat1:1 |
| Krogh et al, Copenhagen, Denmark | Agene | cat1:1 |
| Liang et al, Cold Spring Harbor, USA | Gramene (Liang et al, unpublished) | cat3:2, cat4:1 |
| Pereira et al, Pennsylvania, USA | Evigan, CRAIG | CRAIG: cat1:1; Evigan: cat4:1 |
| Rätsch et al, Tübingen, Germany | MGENE (Schweikert et al, submitted) | cat1:3, cat2:2, cat3:3 |
| Roos et al, Pennsylvania, USA | GLEAN | cat4:1 |
| Salzberg et al, Maryland, USA | JIGSAW, GlimmerHMM | GlimmerHMM: cat1:1; JIGSAW: cat4:2 |
| Schiex et al, Toulouse, France | EUGENE | cat1:1, cat2:1, cat3:2, cat4:4 |
| Solovyev et al, University of London and Softberry Inc, New York, USA | Fgenesh, Fgenesh++, Fgenesh++C | Fgenesh: cat1:1; Fgenesh++: cat3:1; Fgenesh++C: cat4:1 |
| Stanke, Santa Cruz, USA | AUGUSTUS | cat1:2, cat3:1 |
| Brejová & Vinar, New York, USA | ExonHunter | cat1:1, cat3:2 |
| Yandell et al, Berkeley, USA | MAKER (using SNAP) | cat3:2 |
¹ The GeneID gene set was submitted after the nGASP deadline.
The research groups that participated in nGASP, the software each used to produce gene prediction sets, and the number of gene sets each group submitted in each nGASP category. There were four nGASP categories: category 1 predictions were made by ab initio gene-finders; category 2 predictions by gene-finders that use multi-genome alignments; category 3 predictions by gene-finders that take advantage of EST/mRNA or protein alignments; and category 4 predictions by gene prediction systems that use gene models created by other annotation software, and any of the data used as input for gene-finders in the other three categories. Here 'cat3:2' means that 2 gene sets in category 3 were submitted. In some cases a group submitted to the same nGASP category two gene sets produced using different parameters of their software.
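The 'catN:M' notation can be tallied mechanically. The following is a minimal sketch, not part of nGASP itself: the function name `count_gene_sets` and the `CELLS` list are illustrative, and the cell strings are copied from the last column of the table above. Summing them recovers the 47 submitted gene sets mentioned in the abstract.

```python
import re
from collections import Counter

# One "catN:M" token: nGASP category N, M gene sets submitted.
CAT_TOKEN = re.compile(r"cat([1-4]):(\d+)")


def count_gene_sets(cell: str) -> Counter:
    """Sum the gene sets per nGASP category found in one table cell."""
    counts = Counter()
    for category, n_sets in CAT_TOKEN.findall(cell):
        counts[int(category)] += int(n_sets)
    return counts


# Last-column cells from the participants table, one string per group.
CELLS = [
    "cat4:1",
    "cat1:1",
    "cat2:1",
    "cat4:2",
    "GeneID: cat1:1, cat4:2; SGP2: cat2:1",
    "cat1:1",
    "cat1:1",
    "cat3:2, cat4:1",
    "CRAIG: cat1:1; Evigan: cat4:1",
    "cat1:3, cat2:2, cat3:3",
    "cat4:1",
    "GlimmerHMM: cat1:1; JIGSAW: cat4:2",
    "cat1:1, cat2:1, cat3:2, cat4:4",
    "Fgenesh: cat1:1; Fgenesh++: cat3:1; Fgenesh++C: cat4:1",
    "cat1:2, cat3:1",
    "cat1:1, cat3:2",
    "cat3:2",
]

totals = sum((count_gene_sets(cell) for cell in CELLS), Counter())
print(dict(sorted(totals.items())))  # {1: 14, 2: 5, 3: 13, 4: 15}
print(sum(totals.values()))          # 47
```

The per-category totals (14, 5, 13 and 15) also match the number of rows per category in the evaluation table further down.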
The nGASP test and training genomic regions.
| Set | Region type | Chromosome: coordinates |
| --- | --- | --- |
| Training | High conservation, high gene density, autosomal | II: 2000001–3000000 |
| Training | High conservation, high gene density, autosomal | V: 9000001–10000000 |
| Training | High conservation, low gene density, autosomal | III: 1000001–2000000 |
| Training | High conservation, low gene density, autosomal | IV: 2000001–3000000 |
| Training | Low conservation, high gene density, autosomal | I: 12000001–13000000 |
| Training | Low conservation, high gene density, autosomal | V: 4000001–5000000 |
| Training | Low conservation, low gene density, autosomal | I: 2000001–3000000 |
| Training | Low conservation, low gene density, autosomal | II: 13000001–14000000 |
| Training | High conservation, low gene density, X-chromosome | X: 3000001–4000000 |
| Training | High conservation, low gene density, X-chromosome | X: 2000001–3000000 |
| Test | High conservation, high gene density, autosomal | IV: 7000001–8000000 |
| Test | High conservation, high gene density, autosomal | V: 12000001–13000000 |
| Test | High conservation, low gene density, autosomal | IV: 1–1000000 |
| Test | High conservation, low gene density, autosomal | I: 14000001–15000000 |
| Test | Low conservation, high gene density, autosomal | V: 16000001–17000000 |
| Test | Low conservation, high gene density, autosomal | II: 1–1000000 |
| Test | Low conservation, low gene density, autosomal | IV: 14000001–15000000 |
| Test | Low conservation, low gene density, autosomal | I: 1000001–2000000 |
| Test | High conservation, low gene density, X-chromosome | X: 4000001–5000000 |
| Test | High conservation, low gene density, X-chromosome | X: 8000001–9000000 |
The ten 1-Mb regions of the C. elegans genome provided to the nGASP participants for training their gene-finders, and ten 1-Mb test regions in which they were asked to make gene predictions for the nGASP assessment.
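As a quick consistency check on the table, the region strings can be parsed and compared. This is only a sketch: `Region`, `parse_region` and `overlaps` are hypothetical helper names, and the coordinates are assumed to be 1-based and inclusive (which is what makes `1–1000000` exactly 1 Mb).

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed inclusive

    @property
    def length(self) -> int:
        return self.end - self.start + 1


# Accepts "IV: 7000001–8000000" with either an en dash or a hyphen.
REGION_RE = re.compile(r"^\s*([IVX]+):\s*(\d+)\s*[–-]\s*(\d+)\s*$")


def parse_region(text: str) -> Region:
    match = REGION_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognised region: {text!r}")
    chrom, start, end = match.groups()
    return Region(chrom, int(start), int(end))


def overlaps(a: Region, b: Region) -> bool:
    """True if two regions on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


# Coordinates copied from the table above.
TRAINING = [
    "II: 2000001–3000000", "V: 9000001–10000000", "III: 1000001–2000000",
    "IV: 2000001–3000000", "I: 12000001–13000000", "V: 4000001–5000000",
    "I: 2000001–3000000", "II: 13000001–14000000", "X: 3000001–4000000",
    "X: 2000001–3000000",
]
TEST = [
    "IV: 7000001–8000000", "V: 12000001–13000000", "IV: 1–1000000",
    "I: 14000001–15000000", "V: 16000001–17000000", "II: 1–1000000",
    "IV: 14000001–15000000", "I: 1000001–2000000", "X: 4000001–5000000",
    "X: 8000001–9000000",
]

training = [parse_region(t) for t in TRAINING]
test = [parse_region(t) for t in TEST]

print(sum(r.length for r in test))                          # 10000000
print(any(overlaps(a, b) for a in training for b in test))  # False
```

Under these assumptions the test regions total 10 Mb, matching the abstract, and no test region overlaps a training region.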
Evaluation of submitted gene sets.
| Gene set | Category | Base Sn | Base Sp | Exon Sn | Exon Sp | Isoform Sn | Isoform Sp | Gene Sn | Gene Sp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Agene | 1 | 93.8 | 83.4 | 68.9 | 61.1 | 9.8 | 13.1 | 12.0 | 14.1 |
| AUGUSTUS | 1 | 97.0 | 89.0 | 86.1 | 72.6 | 50.1 | 28.7 | 61.1 | 38.4 |
| AUGUSTUS | 1 | 96.8 | 89.3 | 84.8 | 74.3 | 49.3 | 31.9 | 60.5 | 32.7 |
| CRAIG | 1 | 95.6 | 90.9 | 80.2 | 78.2 | 35.7 | 36.3 | 43.8 | 37.8 |
| EUGENE | 1 | 94.0 | 89.5 | 80.3 | 73.0 | 49.1 | 28.8 | 60.2 | 30.2 |
| ExonHunter | 1 | 95.4 | 86.0 | 72.6 | 62.5 | 15.5 | 18.6 | 19.1 | 19.2 |
| Fgenesh | 1 | 98.2 | 87.1 | 86.4 | 73.6 | 47.1 | 34.6 | 57.8 | 35.4 |
| GeneID | 1 | 93.9 | 88.2 | 77.0 | 68.6 | 36.2 | 22.8 | 44.4 | 25.1 |
| GeneMark.hmm | 1 | 98.3 | 83.1 | 83.2 | 65.6 | 37.7 | 24.0 | 46.3 | 24.5 |
| GlimmerHMM | 1 | 97.6 | 87.6 | 84.4 | 71.4 | 47.3 | 29.3 | 58.0 | 30.6 |
| MGENE | 1 | 97.2 | 91.5 | 84.6 | 78.6 | 44.6 | 40.9 | 54.8 | 42.3 |
| MGENE | 1 | 96.9 | 91.6 | 84.2 | 78.7 | 44.0 | 40.9 | 54.0 | 42.4 |
| MGENE | 1 | 96.9 | 91.6 | 84.2 | 78.6 | 43.5 | 40.5 | 53.4 | 44.8 |
| SNAP | 1 | 94.0 | 84.5 | 74.6 | 61.3 | 32.6 | 18.6 | 40.0 | 19.1 |
| EUGENE | 2 | 96.2 | 87.5 | 82.8 | 72.8 | 50.3 | 30.2 | 61.7 | 31.4 |
| MGENE | 2 | 97.7 | 90.9 | 85.8 | 78.4 | 51.6 | 41.2 | 63.3 | 42.5 |
| MGENE | 2 | 97.7 | 90.9 | 85.8 | 78.3 | 51.2 | 40.9 | 62.7 | 43.8 |
| N-SCAN | 2 | 97.4 | 88.1 | 83.5 | 70.8 | 39.2 | 27.7 | 48.1 | 28.4 |
| SGP2 | 2 | 93.5 | 90.0 | 77.3 | 70.3 | 36.4 | 24.9 | 44.6 | 27.1 |
| AUGUSTUS | 3 | 99.0 | 90.5 | 92.5 | 80.2 | 68.3 | 47.1 | 80.1 | 51.8 |
| EUGENE v1 | 3 | 97.3 | 85.3 | 88.5 | 72.2 | 55.7 | 33.7 | 68.4 | 34.2 |
| EUGENE v2 | 3 | 98.5 | 85.1 | 92.1 | 70.3 | 60.8 | 31.5 | 68.8 | 36.1 |
| ExonHunter v1 | 3 | 97.6 | 87.3 | 83.9 | 69.3 | 38.5 | 31.9 | 47.3 | 32.5 |
| ExonHunter v2 | 3 | 93.7 | 92.0 | 81.2 | 76.9 | 37.2 | 39.7 | 45.6 | 40.5 |
| Fgenesh++ | 3 | 97.6 | 89.7 | 90.4 | 80.9 | 65.5 | 53.4 | 78.3 | 54.2 |
| Gramene v1¹ | 3 | 98.2 | 95.4 | 88.5 | 71.8 | 41.7 | 19.6 | 48.7 | 37.2 |
| Gramene v2¹ | 3 | 98.6 | 94.8 | 88.3 | 67.8 | 38.7 | 16.3 | 46.0 | 39.0 |
| MAKER (using SNAP) | 3 | 92.9 | 88.5 | 80.7 | 66.3 | 41.3 | 19.6 | 50.7 | 47.6 |
| MAKER (using SNAP) | 3 | 92.6 | 91.1 | 80.5 | 69.5 | 40.8 | 23.2 | 50.1 | 28.0 |
| MGENE | 3 | 98.7 | 91.9 | 91.0 | 80.7 | 57.7 | 48.0 | 70.8 | 48.9 |
| MGENE | 3 | 98.9 | 87.9 | 91.9 | 75.9 | 62.6 | 38.7 | 76.9 | 39.5 |
| MGENE | 3 | 98.7 | 91.9 | 91.0 | 80.6 | 57.7 | 48.0 | 70.6 | 51.1 |
| EUGENE v1 | 4 | 98.5 | 85.6 | 90.5 | 75.1 | 60.4 | 39.3 | 75.9 | 39.5 |
| EUGENE v2 | 4 | 99.4 | 85.4 | 94.3 | 72.6 | 63.9 | 35.9 | 74.7 | 42.0 |
| EUGENE v3 | 4 | 98.6 | 85.6 | 90.6 | 74.2 | 63.3 | 36.9 | 79.5 | 37.4 |
| EUGENE v4 | 4 | 99.2 | 85.3 | 94.0 | 71.8 | 67.1 | 33.9 | 77.9 | 39.8 |
| Evigan | 4 | 99.3 | 89.6 | 91.1 | 82.3 | 64.2 | 52.4 | 80.7 | 52.7 |
| Fgenesh++C | 4 | 98.7 | 89.7 | 91.1 | 82.7 | 66.1 | 56.3 | 80.3 | 57.1 |
| GeneID v1 | 4 | 99.3 | 91.5 | 93.0 | 83.8 | 63.9 | 53.3 | 78.3 | 57.7 |
| GeneID v2 | 4 | 99.0 | 92.0 | 90.7 | 85.0 | 61.7 | 55.5 | 77.5 | 57.1 |
| GENOMIX v1 | 4 | 97.1 | 88.6 | 86.2 | 77.4 | 52.4 | 39.0 | 65.9 | 42.2 |
| GENOMIX v2 | 4 | 98.1 | 90.4 | 89.7 | 83.5 | 60.4 | 53.3 | 75.9 | 56.1 |
| GESECA | 4 | 98.8 | 82.8 | 87.6 | 66.8 | 45.1 | 25.9 | 52.6 | 27.4 |
| GLEAN | 4 | 98.9 | 87.3 | 88.3 | 75.4 | 51.4 | 37.0 | 64.7 | 37.6 |
| Gramene¹ | 4 | 97.5 | 80.9 | 82.7 | 48.7 | 22.4 | 6.1 | 27.3 | 30.3 |
| JIGSAW v1 | 4 | 98.9 | 93.2 | 90.5 | 87.4 | 63.6 | 60.2 | 79.9 | 61.0 |
| JIGSAW v2 | 4 | 98.9 | 91.7 | 89.9 | 83.0 | 62.0 | 51.1 | 77.9 | 52.0 |
¹ After the evaluations were complete, the Gramene developers discovered an error in their pipeline that incorrectly moved the end of the coding region by 3 bp in a significant fraction of their gene predictions; this negatively affected the overall performance of Gramene.
The accuracy of the submitted gene sets evaluated using the reference gene sets ref1 and ref2. The sensitivity (Sn) results are given for reference set ref1, and the specificity (Sp) results are given for set ref2. The gene sets are divided according to nGASP category, where category 1 is ab initio gene-finders, 2 is gene-finders that used multi-genome alignments, 3 is gene-finders that used alignments of ESTs, mRNAs and proteins, and 4 is combiners.
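To make the Sn/Sp columns concrete, here is a minimal exon-level sketch. It assumes the definitions conventional in gene-finding assessments, Sn = TP / (TP + FN) and Sp = TP / (TP + FP), with an exon counted as correct only if both boundaries match exactly; this excerpt does not spell these definitions out, so treat them as an assumption. The function names and the toy exon coordinates are hypothetical.

```python
from typing import Iterable, Tuple

# An exon as (chromosome, start, end, strand); toy coordinates only.
Exon = Tuple[str, int, int, str]


def sensitivity(reference: Iterable[Exon], predicted: Iterable[Exon]) -> float:
    """Percentage of reference exons that were predicted exactly."""
    ref, pred = set(reference), set(predicted)
    return 100.0 * len(ref & pred) / len(ref) if ref else 0.0


def specificity(reference: Iterable[Exon], predicted: Iterable[Exon]) -> float:
    """Percentage of predicted exons that exactly match a reference exon."""
    ref, pred = set(reference), set(predicted)
    return 100.0 * len(ref & pred) / len(pred) if pred else 0.0


# Hypothetical stand-ins for the two reference sets named in the caption.
ref1 = [("I", 100, 200, "+"), ("I", 300, 400, "+"),
        ("I", 500, 600, "+"), ("I", 700, 800, "+")]
ref2 = ref1 + [("I", 900, 950, "+")]

predicted = [
    ("I", 100, 200, "+"),
    ("I", 300, 410, "+"),  # wrong 3' boundary, so not an exact match
    ("I", 500, 600, "+"),
    ("I", 700, 800, "+"),
    ("I", 900, 950, "+"),
]

print(sensitivity(ref1, predicted))  # 75.0  (3 of 4 ref1 exons found)
print(specificity(ref2, predicted))  # 80.0  (4 of 5 predictions in ref2)
```

Scoring Sn against one reference set and Sp against another, as the caption describes, simply means passing different `reference` arguments to the two functions.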
Figure 1. Accuracy of the submitted gene sets. Plots of the specificity against sensitivity of the submitted gene sets, at the base level (A), exon level (B), isoform level (C) and gene level (D). The submitted gene sets are coloured by nGASP category, with ab initio (category 1) gene sets in red, gene-finders that used multi-genome alignments (category 2) in black, gene-finders that used transcript/protein alignments (category 3) in blue, and combiners (category 4) in green. The gene sets are labelled as follows: AU: AUGUSTUS, MG: MGENE, CR: CRAIG, AG: Agene, EU: EUGENE, FPC: Fgenesh++C, FP: Fgenesh++, FG: Fgenesh, GE: GeneID, GM: GeneMark.hmm, GX: GENOMIX, GS: GESECA, GN: GLEAN, GL: GlimmerHMM, GR: Gramene, JW: JIGSAW, MK: MAKER (using SNAP), NS: N-SCAN, SG: SGP2, SN: SNAP, EX: ExonHunter, EV: Evigan.
Figure 2. Factors affecting gene-finding accuracy. Plots of gene-level sensitivity against features of genes that are correlated with gene-finding accuracy: (A) the lowest hexamer score of any of the exons in the gene, (B) the number of exons in the gene, (C) the length of the shortest exon in the gene, (D) the length of the longest intron in the gene, (E) the strength of the translation start signal, (F) the lowest score of any of the splice sites in the gene, (G) the percent identity with the C. briggsae ortholog at the amino acid level, (H) the maximum distance to a neighbouring gene, and (I) the number of isoforms in the gene. In each plot, the submitted gene sets are coloured by nGASP category, with ab initio (category 1) gene sets in red, gene-finders that used multi-genome alignments (category 2) in black, and gene-finders that used transcript/protein alignments (category 3) in blue. The solid lines show the median sensitivities of the gene sets in a category, while the dashed lines show the maximum sensitivity of the gene sets in a category.
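The two summary statistics drawn in Figure 2 (median and maximum sensitivity per category) can be illustrated with a short sketch. Note the assumptions: the figure computes them per feature bin, whereas this example uses whole-test-set values, and it takes the second-to-last column of the evaluation table to be gene-level sensitivity. The names `GENE_SN` and `summarise` are hypothetical.

```python
from statistics import median
from typing import Dict, List, Tuple

# Assumed gene-level Sn values (second-to-last column of the evaluation
# table) for the category 1 and category 2 gene sets, in table order.
GENE_SN: Dict[int, List[float]] = {
    1: [12.0, 61.1, 60.5, 43.8, 60.2, 19.1, 57.8,
        44.4, 46.3, 58.0, 54.8, 54.0, 53.4, 40.0],
    2: [61.7, 63.3, 62.7, 48.1, 44.6],
}


def summarise(values_by_category: Dict[int, List[float]]) -> Dict[int, Tuple[float, float]]:
    """Return {category: (median Sn, maximum Sn)}."""
    return {
        category: (median(values), max(values))
        for category, values in values_by_category.items()
    }


summary = summarise(GENE_SN)
for category, (med, best) in sorted(summary.items()):
    print(f"category {category}: median Sn = {med:.1f}, max Sn = {best:.1f}")
# category 1: median Sn = 53.7, max Sn = 61.1
# category 2: median Sn = 61.7, max Sn = 63.3
```

In the figure, the median corresponds to the solid line and the maximum to the dashed line, evaluated separately for each bin of the gene feature on the x-axis.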
Figure 3. A screenshot from the nGASP genome browser. This shows part of an nGASP test region on chromosome I, with the curated WormBase gene models and the ab initio (category 1) gene sets submitted to nGASP for that region.