Tomáš Brůna, Katharina J Hoff, Alexandre Lomsadze, Mario Stanke, Mark Borodovsky.
Abstract
Eukaryotic genome annotation remains a challenging task. Only a few genomes can serve as annotation standards, and these were achieved through a tremendous investment in human curation. Even in the best-annotated genomes, the correctness of all alternative isoforms remains a good subject for further investigation. The new BRAKER2 pipeline generates external protein support and integrates it into the iterative process of training and gene prediction by GeneMark-EP+ and AUGUSTUS. BRAKER2 continues the line started by BRAKER1, in which self-training GeneMark-ET and AUGUSTUS made gene predictions supported by transcriptomic data. Among the challenges addressed by the new pipeline was the generation of reliable hints to protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. In comparison with other pipelines for eukaryotic genome annotation, BRAKER2 is fully automatic. It compares favorably under equal conditions with other pipelines, e.g. MAKER2, in terms of accuracy and performance. The development of BRAKER2 should facilitate the harmonization of protein-coding gene annotation across genomes of different eukaryotic species. However, we fully understand that several more innovations are needed in transcriptomic and proteomic technologies, as well as in algorithmic development, to reach the goal of highly accurate annotation of eukaryotic genomes.
Year: 2021 PMID: 33575650 PMCID: PMC7787252 DOI: 10.1093/nargab/lqaa108
Source DB: PubMed Journal: NAR Genom Bioinform ISSN: 2631-9268
Genomes used in the tests; asterisks indicate model organisms
| Species | Annotation version | Genome size (Mb) | # Genes in annotation | # Introns per gene | % Non-canonical or incomplete genes |
|---|---|---|---|---|---|
| Species with early sequenced genomes | | | | | |
| | Tair Araport 11 (Jun 2016) | 119 | 27 445 | 4.9 | 0.3 |
| | WormBase WS271 (May 2019) | 100 | 20 172 | 5.7 | 0.2 |
| | FlyBase R6.18 (Jun 2019) | 138 | 13 929 | 4.3 | 0.3 |
| Other species | | | | | |
| Plantae | | | | | |
| | JGI Ptrichocarpa_533_v4.1 (Nov 2019) | 389 | 34 488 | 4.9 | 0.3 |
| | MtrunA17r5.0-ANR-EGN-r1.6 (Feb 2019) | 430 | 44 464 | 2.9 | 0.0 |
| | Consortium ITAG4.0 (May 2019) | 773 | 33 562 | 3.5 | 14.5 |
| Arthropoda | | | | | |
| | NCBI Annotation Release 102 (Apr 2017) | 249 | 10 581 | 7.1 | 4.7 |
| | VectorBase RproC3.3 (Oct 2017) | 707 | 15 061 | 4.8 | 34.7 |
| | NCBI Annotation Release 101 (May 2017) | 1445 | 18 602 | 7.3 | 18.2 |
| Vertebrata | | | | | |
| | TETRAODON8.99 (Nov 2019) | 359 | 19 589 | 10.4 | 63.8 |
| | Ensembl GRCz11.96 (May 2019) | 1345 | 25 254 | 8.2 | 11.8 |
| | NCBI Annotation Release 104 (Apr 2019) | 1449 | 21 821 | 12.1 | 2.4 |
The average number of introns per gene was computed over all annotated genes in the genome. A gene was considered complete and canonical if at least one of its transcripts was annotated with ATG starting the initial coding exon and with the terminal coding exon ending in TAA, TAG or TGA.
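
The completeness criterion in the table footnote can be illustrated with a minimal sketch. It assumes that each transcript is available as a spliced coding sequence (initial through terminal coding exon, concatenated), so that the start of the initial exon and the end of the terminal exon are simply the first and last three bases. The function names are hypothetical and are not part of BRAKER2.

```python
from typing import Iterable

# Stop codons named in the table footnote.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_canonical_cds(cds: str) -> bool:
    """True if the spliced CDS starts with ATG and ends with TAA, TAG or TGA.

    Assumption: `cds` is the concatenation of all coding exons of one
    transcript, on the coding strand.
    """
    seq = cds.upper()
    # At least a start codon plus a stop codon must be present.
    return len(seq) >= 6 and seq.startswith("ATG") and seq[-3:] in STOP_CODONS


def gene_is_complete_canonical(transcript_cds: Iterable[str]) -> bool:
    """A gene qualifies if at least one of its transcripts qualifies."""
    return any(is_complete_canonical_cds(cds) for cds in transcript_cds)
```

Under this sketch, a gene with one truncated isoform and one isoform running from ATG to a stop codon still counts as complete and canonical, which matches the "at least one transcript" wording of the footnote.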
Figure 1. Flowchart of the BRAKER2 pipeline. Input, intermediate and output data are shown by ovals. The tools and processes of the ProtHint pipeline are shown in orange; other components of BRAKER2 are shown in blue.
Figure 2. Evidence integration in BRAKER2. (A) Target proteins; (B) Introns, gene start and stop sites defined by spliced alignments of target proteins to the genome; (C) CDSpart chains; (D) Genome sequence; (E) Genes predicted by GeneMark-EP+ at a given iteration, with the high-confidence hints enforced (red arrows); (F) Anchored sites: the splice sites and gene ends predicted ab initio and corroborated by protein hints; (G) Anchored introns and intergenic sequences bounded by anchored gene ends, selected for training the non-coding sequence model of GeneMark-EP+; (H) Anchored multi-exon and single-exon genes predicted by GeneMark-EP+ and selected for training AUGUSTUS; (I) Transcripts predicted by AUGUSTUS with the support of external evidence.
Composition of the clades of OrthoDB v10 used by BRAKER2
| # of species in the OrthoDB clade | ||||||||
|---|---|---|---|---|---|---|---|---|
| Species | Genus | Family | Order | Class | Phylum | Kingdom | Name of the largest OrthoDB segment | # of proteins in the OrthoDB segment |
|
|
|
|
| 100 |
| Plantae | 3 510 742 | |
|
|
|
|
| 6 | 7 |
| Metazoa | 8 266 016 |
|
|
|
|
| 148 |
| Arthropoda | 2 601 995 | |
|
| 1 | 5 |
| 100 |
| Plantae | 3 510 742 | |
|
| 1 | 10 |
| 100 |
| Plantae | 3 510 742 | |
|
| 2 | 10 |
| 100 |
| Plantae | 3 510 742 | |
|
| 1 | 7 |
| 148 |
| Arthropoda | 2 601 995 | |
|
| 1 | 1 |
| 148 |
| Arthropoda | 2 601 995 | |
|
| 1 | 1 |
| 10 |
| Arthropoda | 2 601 995 | |
|
| 0 | 1 |
| 50 |
| Chordata | 5 003 104 | |
|
| 1 | 5 |
| 50 |
| Chordata | 5 003 104 | |
|
| 2 | 2 |
| 3 |
| Chordata | 5 003 104 | |
Numbers in black bold show the largest numbers of species used to support gene predictions for a given species (left column). The numbers of species removed from the largest OrthoDB segment in the tests described below are shown in blue. Species whose proteins are not present in OrthoDB v10 are marked with asterisks.
Figure 3. Exon-level Sn and Sp determined for each genome in the three runs of BRAKER2 with protein support, the run of BRAKER1 with RNA-seq support and the run of GeneMark-ES. BRAKER2 was run with support of proteins from OrthoDB excluding proteins (i) of the same species, (ii) of all species of the same taxonomic family, (iii) of all species of the same taxonomic order.
Figure 4. Gene-level Sn and Sp determined in the tests described in the legend for Figure 3.
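
As an illustration of the exon-level measures named in the figure legends, the sketch below computes Sn and Sp from two sets of exons. It rests on two assumptions that the text here does not spell out: an exon counts as correctly predicted only if both of its boundaries match the annotation exactly, and Sp follows the gene-finding convention of the fraction of predicted exons that are correct. The exon tuple layout and the function name are hypothetical.

```python
from typing import Set, Tuple

# Hypothetical exon representation: (sequence id, start, end, strand).
Exon = Tuple[str, int, int, str]


def sn_sp(annotated: Set[Exon], predicted: Set[Exon]) -> Tuple[float, float]:
    """Return (Sn, Sp) in percent for exact-match exons.

    Assumptions: Sn = TP / number of annotated exons,
                 Sp = TP / number of predicted exons.
    """
    true_positives = len(annotated & predicted)
    sn = 100.0 * true_positives / len(annotated) if annotated else 0.0
    sp = 100.0 * true_positives / len(predicted) if predicted else 0.0
    return sn, sp


annotated = {
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr1", 500, 600, "+"),
    ("chr2", 50, 150, "-"),
}
predicted = {
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr1", 500, 600, "+"),
    ("chr1", 700, 800, "+"),  # not annotated
    ("chr2", 60, 150, "-"),   # one boundary off, so not an exact match
}
print(sn_sp(annotated, predicted))  # (75.0, 60.0)
```

The same function applies at the gene level if each set element encodes a whole gene structure instead of a single exon.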
Gene prediction sensitivity of BRAKER2 at the gene and exon levels
| Species | Gene Sn (All) | Gene Sn (Reliable) | Exon Sn (All) | Exon Sn (Reliable) | % Reliable genes |
|---|---|---|---|---|---|
| | 70.2 | 78.8 | 81.5 | 87.9 | 83.5 |
| | 49.8 | 57.8 | 75.7 | 81.0 | 81.1 |
| | 59.5 | 61.6 | 71.9 | 74.4 | 93.2 |
| | 69.3 | 76.4 | 86.2 | 90.4 | 84.6 |
| | 48.3 | 63.2 | 82.7 | 90.0 | 69.6 |
| | 40.7 | 68.0 | 78.5 | 92.1 | 54.4 |
| | 45.7 | 56.7 | 74.6 | 79.5 | 75.1 |
| | 13.2 | 45.5 | 61.4 | 80.2 | 26.4 |
| | 24.6 | 40.2 | 67.9 | 79.9 | 50.6 |
| | 10.4 | 67.7 | 60.6 | 89.5 | 11.2 |
| | 39.1 | 50.3 | 75.6 | 86.3 | 70.8 |
| | 38.9 | 46.3 | 75.3 | 80.0 | 74.8 |
The test sets were (All) all annotated multi-exon genes and (Reliable) all annotated complete multi-exon genes having all introns supported by mapped RNA-seq reads sampled by VARUS (36).
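
The selection of the Reliable test set can be sketched as a simple filter. This is only an illustration under stated assumptions: each gene is reduced to a single transcript, introns are compared by exact coordinates, and the set of RNA-seq supported introns is assumed to have been derived beforehand from the mapped reads. The `Gene` class and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

# Hypothetical intron representation: (sequence id, start, end, strand).
Intron = Tuple[str, int, int, str]


@dataclass
class Gene:
    gene_id: str
    introns: List[Intron]  # empty for a single-exon gene
    complete: bool         # annotated with both start and stop codon


def is_reliable(gene: Gene, supported: Set[Intron]) -> bool:
    """Complete, multi-exon, and every intron supported by RNA-seq."""
    is_multi_exon = len(gene.introns) > 0
    return (
        gene.complete
        and is_multi_exon
        and all(intron in supported for intron in gene.introns)
    )


def select_reliable(genes: List[Gene], supported: Set[Intron]) -> List[str]:
    """Return the ids of genes that pass the Reliable filter."""
    return [g.gene_id for g in genes if is_reliable(g, supported)]
```

A gene with even one unsupported intron is excluded, so the Reliable set is always a subset of the complete multi-exon genes.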
Prediction accuracy of MAKER2 and BRAKER2
| Metric | MAKER2 with recommended protocol | MAKER2 with BRAKER2-like protocol | BRAKER2 | MAKER2 with recommended protocol | MAKER2 with BRAKER2-like protocol | BRAKER2 | MAKER2 with recommended protocol | MAKER2 with BRAKER2-like protocol | BRAKER2 |
|---|---|---|---|---|---|---|---|---|---|
| Gene Sn | 49.3 | 53.9 | 70.6 | 25.5 | 30.4 | 43.7 | 42.6 | 48.0 | 60.0 |
| Gene Sp | 42.1 | 55.6 | 65.8 | 22.1 | 38.9 | 51.3 | 31.1 | 50.3 | 59.5 |
| Gene F1 | 45.4 | 54.7 | 68.1 | 23.7 | 34.1 | 47.2 | 35.9 | 49.2 | 59.7 |
| Exon Sn | 73.5 | 74.7 | 80.6 | 61.7 | 62.6 | 71.9 | 62.9 | 63.7 | 71.3 |
| Exon Sp | 72.6 | 83.0 | 85.8 | 64.5 | 81.4 | 87.1 | 58.7 | 76.0 | 83.2 |
| Exon F1 | 73.0 | 78.6 | 83.1 | 63.1 | 70.8 | 78.8 | 60.7 | 69.3 | 76.8 |
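
The F1 rows of the table are consistent with F1 being the harmonic mean of Sn and Sp. The formula is not stated in the text here, so the sketch below treats it as an assumption and checks it against three Sn/Sp pairs taken from the table. The function name is hypothetical.

```python
def f1_score(sn: float, sp: float) -> float:
    """Harmonic mean of Sn and Sp (both in percent).

    Assumption: F1 = 2 * Sn * Sp / (Sn + Sp).
    """
    if sn + sp == 0:
        return 0.0
    return 2.0 * sn * sp / (sn + sp)


# Sn/Sp pairs taken from the table, first group of columns.
print(round(f1_score(49.3, 42.1), 1))  # 45.4 (Gene, MAKER2 recommended protocol)
print(round(f1_score(70.6, 65.8), 1))  # 68.1 (Gene, BRAKER2)
print(round(f1_score(73.5, 72.6), 1))  # 73.0 (Exon, MAKER2 recommended protocol)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a pipeline cannot reach a high F1 by trading most of its Sp for Sn or the reverse.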