Christian Rödelsperger, Marina Athanasouli, Maša Lenuzzi, Tobias Theska, Shuai Sun, Mohannad Dardiry, Sara Wighard, Wen Hu, Devansh Raj Sharma, Ziduan Han.
Abstract
Nematodes such as Caenorhabditis elegans are powerful systems to study basically all aspects of biology. Their species richness, together with tremendous genetic knowledge from C. elegans, facilitates the evolutionary study of biological functions using reverse genetics. However, the ability to identify orthologs of candidate genes in other species can be hampered by erroneous gene annotations. To improve gene annotation in the nematode model organism Pristionchus pacificus, we performed a genome-wide screen for C. elegans genes with potentially incorrectly annotated P. pacificus orthologs. We initiated a community-based project to manually inspect more than two thousand candidate loci and to propose new gene models based on recently generated Iso-seq and RNA-seq data. In most cases, misannotation of C. elegans orthologs was due to artificially fused gene predictions and completely missing gene models. The community-based curation raised the gene count from 25,517 to 28,036 and increased the single-copy ortholog completeness level from 86% to 97%. This pilot study demonstrates how even small-scale crowdsourcing can drastically improve gene annotations. In the future, similar approaches can be used for other species, gene sets, and even larger communities, thus making manual annotation of large parts of the genome feasible.
Year: 2019 PMID: 31827189 PMCID: PMC6906410 DOI: 10.1038/s41598-019-55359-5
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Comparative assessment of nematode genome quality. Genomic data for 22 nematode species were obtained from WormBase ParaSite (release WBPS13) and evaluated based on the completeness level of gene annotations and genome assembly contiguity. The barplots show the results of a benchmarking of single-copy orthologs (BUSCO[40]) analysis, the number of genes, genome sizes, number of scaffolds, and the N50 measure of assembly contiguity. The genome and annotations of P. pacificus exhibit a comparatively high overall quality. The schematic phylogeny is based on a phylogenomic analysis of 108 nematodes[39]; Roman numerals indicate phylostrata that are used for further analysis.
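As a brief illustration of the N50 contiguity measure mentioned in the caption, the following minimal sketch computes N50 under its usual definition: the length of the scaffold at which the cumulative length, counted from the longest scaffold downwards, first reaches half of the total assembly length. The function name `n50` and the toy scaffold lengths are hypothetical and are not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    N50 is the length L such that scaffolds of length >= L
    together cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no scaffold lengths given")


# Hypothetical toy assembly (arbitrary units), not data from the study.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))  # prints 70
```

A larger N50 at a similar genome size indicates that the assembly is spread over fewer, longer scaffolds.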
Completeness analysis of different P. pacificus data sets.
| Data set | BUSCO complete (%) | BUSCO duplicate (%) | BUSCO fragmented (%) | BUSCO missing (%) | Ref |
|---|---|---|---|---|---|
| Genome assembly (El Paco assembly) | 91.6 (92.9) | 1.3 | 3.1 | 4.0 | [ |
| El Paco annotation v1/WS268 | 84.0 (85.8) | 1.8 | 4.3 | 9.9 | [ |
| | 59.1 (97.1) | 38.0 | 2.6 | 0.3 | [ |
| Iso-Seq assembly | 48.0 (73.3) | 25.3 | 10.9 | 15.8 | [ |
| El Paco annotation v2 | 95.4 (97.1) | 1.7 | 2.0 | 0.9 | this study |
The high level of duplicates in the two transcriptomic data sets is due to the presence of isoforms.
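The parenthesized values in the Complete column are not explained in the table itself. One reading that is consistent with the arithmetic, stated here as an assumption, is that the first value counts complete single-copy orthologs and the value in parentheses adds the duplicated ones. The sketch below checks that reading, and that the four categories sum to 100%, on the values transcribed from the table. The helper name `check_row` and the label for the unnamed row are hypothetical.

```python
# Values transcribed from the table above, in the order:
# (complete, value in parentheses, duplicate, fragmented, missing)
rows = {
    "Genome assembly (El Paco assembly)": (91.6, 92.9, 1.3, 3.1, 4.0),
    "El Paco annotation v1/WS268": (84.0, 85.8, 1.8, 4.3, 9.9),
    "(row without a data set name)": (59.1, 97.1, 38.0, 2.6, 0.3),
    "Iso-Seq assembly": (48.0, 73.3, 25.3, 10.9, 15.8),
    "El Paco annotation v2": (95.4, 97.1, 1.7, 2.0, 0.9),
}


def check_row(complete, in_parentheses, duplicate, fragmented, missing, tol=0.05):
    """Check the assumed reading of one row.

    Assumption: in_parentheses == complete + duplicate, and
    in_parentheses + fragmented + missing == 100 (within rounding).
    """
    sums_to_total = abs(complete + duplicate - in_parentheses) <= tol
    sums_to_100 = abs(in_parentheses + fragmented + missing - 100.0) <= tol
    return sums_to_total and sums_to_100


for name, values in rows.items():
    print(name, check_row(*values))
```

Under this reading, the 97.1% shared by the unnamed transcriptomic row and the v2 annotation refers to complete orthologs including duplicates, which matches the note that isoforms inflate the duplicate counts in the transcriptomic data sets.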
Figure 2. Identification of missing genes. (a) 526 potentially missing genes were identified as C. elegans genes with homologs in the transcriptome assembly but not in the current gene annotations. (b) The 526 missing gene candidates were located in 486 P. pacificus loci, which were classified by community annotators. (c) The genome browser screenshot shows a homolog of C. elegans C29H12.2 that is located in the annotated 5′UTR of a P. pacificus gene. This locus harbors two P. pacificus transcripts with different expression levels, which RNA-seq and Iso-seq data clearly support as non-overlapping transcripts. (d) A homolog of apn-1 is completely missing from the current gene annotations.
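The selection logic in panels (a) and (b) can be sketched as a set difference followed by grouping by locus. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names, gene identifiers, and locus labels are all hypothetical, and how homology was actually called is not described here.

```python
from collections import defaultdict


def missing_gene_candidates(transcriptome_hits, annotation_hits):
    """C. elegans genes with a homolog in the transcriptome assembly
    but without one in the current gene annotations."""
    return sorted(set(transcriptome_hits) - set(annotation_hits))


def group_by_locus(candidates, locus_of):
    """Group candidate genes by the locus their transcript maps to.

    Several candidates can share one locus, which is why the number
    of loci can be smaller than the number of candidate genes.
    """
    loci = defaultdict(list)
    for gene in candidates:
        loci[locus_of[gene]].append(gene)
    return dict(loci)


# Hypothetical toy input, not data from the study.
transcriptome_hits = {"geneA", "geneB", "geneC", "geneD"}
annotation_hits = {"geneA"}
locus_of = {"geneB": "locus1", "geneC": "locus1", "geneD": "locus2"}

candidates = missing_gene_candidates(transcriptome_hits, annotation_hits)
print(candidates)
print(group_by_locus(candidates, locus_of))
```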
Figure 3. Community-based curation of hidden orthologs. (a) We identified 2075 putative C. elegans one-to-one orthologs that were specific to the P. pacificus transcriptome assembly. (b) Community-based curation classified most of the corresponding gene loci as artificial gene fusions. (c) Non-overlapping transcripts corresponding to the P. pacificus orthologs of mvb-12 and D1053.3 are artificially fused in the current gene model. This prevents the detection of a one-to-one ortholog of D1053.3 by a genome-wide approach such as best reciprocal hits.
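To illustrate why a fused gene model hides an ortholog from a best-reciprocal-hits search, the sketch below runs a simple reciprocal best hit computation on two toy scenarios: one in which the orthologs of mvb-12 and D1053.3 sit in a single fused model, and one in which they are split into separate models. The similarity scores, the model names (`fused_model`, `model_1`, `model_2`), and the helper functions are hypothetical; only the two C. elegans gene names come from the caption.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return {(a, b) for a, b in fwd.items() if rev.get(b) == a}


# Scenario 1: both orthologs are fused into one gene model (hypothetical scores).
forward_fused = {("mvb-12", "fused_model"): 200, ("D1053.3", "fused_model"): 150}
reverse_fused = {("fused_model", "mvb-12"): 200, ("fused_model", "D1053.3"): 150}

# Scenario 2: the fusion is split into two separate gene models.
forward_split = {("mvb-12", "model_1"): 200, ("D1053.3", "model_2"): 150}
reverse_split = {("model_1", "mvb-12"): 200, ("model_2", "D1053.3"): 150}

print(reciprocal_best_hits(forward_fused, reverse_fused))
print(sorted(reciprocal_best_hits(forward_split, reverse_split)))
```

In the fused scenario the single model can have only one best hit, so only one of the two C. elegans genes obtains a reciprocal partner. After splitting, each gene pairs with its own model.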