Jens Keilwagen, Frank Hartung, Michael Paulini, Sven O Twardziok, Jan Grau.
Abstract
BACKGROUND: Genome annotation is of key importance in many research questions. The identification of protein-coding genes is often based on transcriptome sequencing data, ab-initio or homology-based prediction. Recently, it was demonstrated that intron position conservation improves homology-based gene prediction, and that experimental data improves ab-initio gene prediction.
Keywords: Genome annotation; Homology-based gene prediction; RNA-seq
Year: 2018 PMID: 29843602 PMCID: PMC5975413 DOI: 10.1186/s12859-018-2203-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 GeMoMa workflow. Blue items represent input data sets, green boxes represent GeMoMa modules, and grey boxes represent external modules. The GeMoMa Annotation Filter combines predictions from different reference species and produces the final output. RNA-seq data is optional.
Benchmark results on the BRAKER1 test sets
| | MAKER2 + (exonerate) | GeMoMa + without RNA-seq data | GeMoMa + with RNA-seq data | RNAseq-Cufflinks | RNAseq-StringTie | BRAKER1 ∗ | MAKER2 ∗ | CodingQuarry ∗ | MAKER2 + (exonerate, Trinity, Augustus) | MAKER2 + (GeMoMa, Trinity, Augustus) |
|---|---|---|---|---|---|---|---|---|---|---|
| (ref. | | | | | | | | | | |
| Gene Sn | 44.0 | 61.3 | | 28.9 | 35.9 | 64.4 | 51.3 | NA | 56.9 | 57.9 |
| Gene Sp | 47.8 | 65.7 | | 47.9 | 59.1 | 52.0 | 52.5 | NA | 65.7 | 67.8 |
| Transcript Sn | 37.5 | 52.2 | | 26.6 | 33.7 | 55.0 | 43.5 | NA | 48.3 | 49.1 |
| Transcript Sp | 47.8 | 65.7 | 65.3 | 35.6 | 48.3 | 50.9 | 52.5 | NA | 65.7 | |
| Exon Sn | 70.0 | 79.3 | 80.6 | 58.1 | 60.8 | 82.9 | 76.1 | NA | 81.8 | |
| Exon Sp | 81.9 | 86.6 | 87.5 | 81.9 | 87.1 | 79.0 | 76.1 | NA | 87.5 | |
| (ref. | | | | | | | | | | |
| Gene Sn | 26.2 | 39.6 | 49.1 | 18.7 | 22.6 | | 41.0 | NA | 40.5 | 47.3 |
| Gene Sp | 38.0 | 49.9 | | 29.1 | 36.1 | 55.2 | 30.8 | NA | 51.5 | 56.4 |
| Transcript Sn | 21.0 | 30.7 | 39.8 | 16.2 | 20.0 | | 31.3 | NA | 31.4 | 36.2 |
| Transcript Sp | 38.0 | 49.9 | | 24.1 | 30.1 | 53.2 | 30.8 | NA | 51.5 | 56.4 |
| Exon Sn | 50.3 | 64.2 | 67.1 | 54.4 | 59.1 | | 69.4 | NA | 70.5 | 75.2 |
| Exon Sp | 82.6 | 81.5 | | 81.3 | 84.1 | 85.3 | 62.3 | NA | 85.6 | 86.7 |
| (ref. | | | | | | | | | | |
| Gene Sn | 64.3 | 78.2 | | 55.7 | 55.2 | 64.9 | 55.2 | NA | 61.5 | 64.0 |
| Gene Sp | 69.2 | 81.6 | | 71.3 | 73.5 | 59.4 | 46.3 | NA | 69.6 | 71.9 |
| Transcript Sn | 44.1 | 52.9 | | 48.7 | 49.0 | 46.1 | 38.5 | NA | 42.7 | 44.3 |
| Transcript Sp | 69.2 | | 81.2 | 60.1 | 65.7 | 57.9 | 46.3 | NA | 69.6 | 71.9 |
| Exon Sn | 69.0 | 76.3 | | 67.8 | 66.2 | 75.0 | 66.5 | NA | 74.3 | 76.3 |
| Exon Sp | 89.1 | 92.0 | | 85.4 | 88.3 | 81.7 | 66.9 | NA | 88.0 | 89.1 |
| (ref. | | | | | | | | | | |
| Gene Sn | 49.2 | 76.4 | 79.2 | 69.0 | 65.8 | 77.4 | 42.8 | | 71.6 | 74.6 |
| Gene Sp | 59.9 | 84.6 | 88.0 | 93.8 | 92.5 | 80.5 | 68.7 | 72.6 | 88.1 | |
| Transcript Sn | 49.2 | 76.4 | 79.2 | 69.0 | 65.8 | 77.4 | 42.8 | | 71.6 | 74.6 |
| Transcript Sp | 59.9 | 84.6 | 87.6 | 80.5 | 71.3 | 76.5 | 68.7 | 72.6 | 88.1 | |
| Exon Sn | 56.1 | 81.6 | 83.1 | 77.2 | 77.7 | | 50.1 | 79.6 | 79.2 | 81.2 |
| Exon Sp | 73.3 | 88.6 | 91.9 | 87.6 | 81.7 | 83.2 | 71.4 | 81.7 | 92.0 | |
The target species are given in multi-column rows. The same reference species, which is given in brackets, is used for all tools using homology-based gene prediction, indicated by a plus sign. The asterisk indicates that the performance of BRAKER1, MAKER2 and CodingQuarry is given as reported in [11]. The highest value per line is depicted in bold-face.
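The table reports sensitivity (Sn) and specificity (Sp) at the gene, transcript, and exon level, but this excerpt does not define them. The following is a minimal sketch, assuming the convention common in gene-prediction benchmarks: Sn is the percentage of annotated features that are predicted exactly, and Sp is the percentage of predicted features that match the annotation exactly. The function name and the example coordinates are hypothetical and are not taken from the paper.

```python
def sensitivity_specificity(predicted, annotated):
    """Return (Sn, Sp) in percent for two collections of hashable features.

    Assumed definitions (not stated in this excerpt):
      Sn = exact matches / annotated features
      Sp = exact matches / predicted features
    """
    predicted, annotated = set(predicted), set(annotated)
    true_pos = len(predicted & annotated)
    sn = 100.0 * true_pos / len(annotated) if annotated else 0.0
    sp = 100.0 * true_pos / len(predicted) if predicted else 0.0
    return sn, sp


# Hypothetical exons as (chromosome, strand, start, end) tuples.
annotated_exons = {
    ("chr1", "+", 100, 200),
    ("chr1", "+", 300, 400),
    ("chr1", "-", 900, 950),
    ("chr2", "+", 10, 90),
}
predicted_exons = {
    ("chr1", "+", 100, 200),
    ("chr1", "+", 300, 420),  # wrong end coordinate, so not an exact match
    ("chr2", "+", 10, 90),
}

sn, sp = sensitivity_specificity(predicted_exons, annotated_exons)
print(f"Exon Sn = {sn:.1f}, Exon Sp = {sp:.1f}")
```

Under these assumptions, the same function applies at the transcript or gene level once each feature is represented by a hashable key, such as the full tuple of exon coordinates.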
Fig. 2 Benchmark results. The y-axis depicts the difference between the performance of GeMoMa with RNA-seq data and the performance of BRAKER1.
Fig. 3 Gene sensitivity and specificity for D. melanogaster using different or multiple reference species in GeMoMa. The points correspond to the eight reference species. In addition, the dashed line indicates the usage of multiple reference species. Using multiple reference species allows for filtering identical predictions from several references, as indicated by the numbers.
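The idea of filtering by the number of reference species that yield an identical prediction can be illustrated with a short sketch. It is not the GeMoMa Annotation Filter itself; the function name, the gene-model representation, and the species labels are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def filter_by_support(predictions, min_refs):
    """Keep gene models predicted identically from at least `min_refs` references.

    `predictions` is an iterable of (reference_species, gene_model) pairs,
    where gene_model is any hashable description of the predicted structure.
    """
    support = defaultdict(set)
    for ref_species, gene_model in predictions:
        support[gene_model].add(ref_species)
    return {model: refs for model, refs in support.items() if len(refs) >= min_refs}


# Hypothetical gene models: (chromosome, strand, ((exon_start, exon_end), ...))
model_a = ("2L", "+", ((100, 250), (400, 520)))
model_b = ("2L", "+", ((100, 250), (410, 520)))  # differs in one splice site
model_c = ("3R", "-", ((9000, 9300),))

predictions = [
    ("ref_species_1", model_a),
    ("ref_species_2", model_a),
    ("ref_species_3", model_b),
    ("ref_species_1", model_c),
    ("ref_species_3", model_c),
    ("ref_species_4", model_c),
]

kept_2 = filter_by_support(predictions, min_refs=2)
kept_3 = filter_by_support(predictions, min_refs=3)
print(len(kept_2), len(kept_3))
```

Raising `min_refs` keeps fewer, better-supported models, which would typically trade sensitivity for specificity along a curve like the dashed line in the figure.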
Fig. 4 Summary of differences for GeMoMa predictions with tie = 1. The relaxed evaluation (left panel) depicts differences between GeMoMa predictions and the annotation without any filter on the annotation, while the conservative evaluation (right panel) applies additional filters to the annotation (cf. main text). Predictions that do not overlap with any annotated CDS are depicted in yellow, predictions that differ from annotated CDSs only in splice sites are depicted in green, predictions that have additional exons compared to annotated CDSs are depicted in turquoise, predictions that miss some exons compared to annotated CDSs are depicted in blue, predictions with additional and missing exons compared to annotated CDSs are depicted in pink, predictions that differ from annotated CDSs only in the start of the CDS are depicted in red, and any other category is depicted in gray.
Predictions that do not overlap with any high or low confidence annotation
| a) Single-coding-exon predictions | | | |
|---|---|---|---|
| #evidence | tpc = 0 | 0 < tpc < 1 | tpc = 1 |
| 1 | 1 971 (11) | 878 (14) | 1 005 (137) |
| 2 | 204 (19) | 158 (8) | 299 (55) |
| 3 | 200 (16) | 126 (5) | 257 (92) |
| 4 | 91 (17) | 43 (9) | 168 (83) |
| Sum | 2 466 (63) | 1 205 (36) | 1 729 (367) |
| b) Multi-coding-exon predictions | | | |
| #evidence | tie = 0 | 0 < tie < 1 | tie = 1 |
| 1 | 9 671 (287) | 942 (211) | 1 681 (775) |
| 2 | 283 (36) | 86 (32) | 456 (196) |
| 3 | 155 (31) | 64 (43) | 382 (223) |
| 4 | 142 (57) | 55 (37) | 302 (196) |
| Sum | 10 251 (411) | 1 147 (323) | 2 821 (1 390) |
The numbers in parentheses give the predictions that are partially supported by any best BLAT hit of ESTs.
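To make the counts easier to compare, the share of EST-supported predictions can be computed per column. The sketch below uses the column totals for multi-coding-exon predictions as they appear in the table; the variable and function names are hypothetical.

```python
# Column totals for multi-coding-exon predictions, copied from the table:
# category -> (number of predictions, number partially supported by ESTs)
multi_exon_totals = {
    "tie = 0": (10251, 411),
    "0 < tie < 1": (1147, 323),
    "tie = 1": (2821, 1390),
}


def support_percent(total, supported):
    """Percentage of predictions that are partially supported by ESTs."""
    return 100.0 * supported / total


est_support = {
    category: support_percent(total, supported)
    for category, (total, supported) in multi_exon_totals.items()
}

for category, percent in est_support.items():
    print(f"{category}: {percent:.1f}% EST-supported")
```

In these totals, the EST-supported share rises with tie, from a few percent at tie = 0 to roughly half at tie = 1.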