
GeneValidator: identify problems with protein-coding gene predictions.

Monica-Andreea Drăgan, Ismail Moghul, Anurag Priyam, Claudio Bustos, Yannick Wurm.

Abstract

Genomes of emerging model organisms are now being sequenced at very low cost. However, obtaining accurate gene predictions remains challenging: even the best gene prediction algorithms make substantial errors and can jeopardize subsequent analyses. Therefore, many predicted genes must be visually inspected and manually curated, a time-consuming process. We developed GeneValidator (GV) to automatically identify problematic gene predictions and to aid manual curation. For each gene, GV performs multiple analyses based on comparisons to gene sequences from large databases. The resulting report identifies problematic gene predictions and includes extensive statistics and graphs for each prediction to guide manual curation efforts. GV thus accelerates and enhances the work of biocurators and researchers who need accurate gene predictions from newly sequenced genomes.
AVAILABILITY AND IMPLEMENTATION: GV can be used through a web interface or on the command line. GV is open source (AGPL), available at https://wurmlab.github.io/tools/genevalidator
CONTACT: y.wurm@qmul.ac.uk
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2016. Published by Oxford University Press.


Year:  2016        PMID: 26787666      PMCID: PMC4866521          DOI: 10.1093/bioinformatics/btw015

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

The plummeting costs of DNA sequencing (Wetterstrand, 2015) have made de novo genome sequencing accessible to individual laboratories and even individual researchers (Nygaard and Wurm, 2015). However, identifying genes in a newly assembled genome remains challenging. Traditional gene prediction approaches involve either ab initio prediction, via modelling of coding versus non-coding sequence, or similarity-based prediction using independent sources. Relevant sources include protein-coding sequences from other organisms, or peptide or transcriptome sequences from the organism being studied. Modern algorithms combine both approaches (Cantarel; Korf, 2004; Stanke). The recent ability to obtain large amounts of RNA sequence at low cost (Hou) has led to a dramatic improvement in the performance of similarity-based algorithms and thus in gene prediction quality (Goodswen), albeit only for expressed genes. Despite this, the accuracy of gene prediction tools (e.g. Alioto; Cantarel; Keller; Lomsadze; Wilkerson) remains disappointing (Yandell and Ence, 2012). Typical errors include missed exons, retention of non-coding sequence in exons, fragmentation of genes and merging of neighboring genes. Automated tools for evaluating gene prediction quality analyze exon boundaries (Eilbeck; Yandell and Ence, 2012) or focus on subsets of highly conserved genes (Parra). Unfortunately, such tools ignore most of the information present in frequently updated databases such as Swiss-Prot or GenBank NR. Visual analysis is thus required to identify errors, and manual curation is needed to fix them. This requires tens of minutes to days for a single gene (Howe) – a daunting task when considering analyses of dozens of species, each with thousands of genes (Pray, 2008; Simola).
We thus created GeneValidator (GV), a tool to evaluate the quality of protein-coding gene predictions based on comparisons with similar known proteins from public and private databases.
GV provides quality evaluations in text formats for automated analysis and in highly visual formats for inspection by researchers.

2 Approach

For each new gene prediction, BLAST (Camacho) identifies similar sequences in Swiss-Prot (The UniProt Consortium, 2014), GenBank NR (Benson) or other relevant databases. Subsequently, GV performs up to seven comparisons between the gene prediction and the most significant hit sequences or high-scoring segment pairs (HSPs). The result of each comparison indicates whether characteristics of the query gene prediction deviate from those of the hit sequences. The following four comparisons are performed on all queries:

Length: We compare the length of the query sequence to the lengths of the most significant BLAST hits using hierarchical clustering (Fig. 1a, e) and a rank test. A particularly low or high rank can suggest that the query is too short or too long.
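The length rank test described above can be illustrated with a minimal sketch. This is not GV's actual code; it simply ranks the query length among the hit lengths and flags extreme percentiles, with the function name and the cutoff `alpha` chosen here for illustration:

```python
def length_rank_test(query_len, hit_lens, alpha=0.05):
    """Rank the query length among BLAST hit lengths.

    Returns (percentile, flagged): flagged is True when the query is
    extremely short or long relative to its hits.
    """
    if not hit_lens:
        raise ValueError("need at least one hit length")
    below = sum(1 for h in hit_lens if h < query_len)
    equal = sum(1 for h in hit_lens if h == query_len)
    # mid-rank percentile of the query among the hit lengths
    percentile = (below + 0.5 * equal) / len(hit_lens)
    flagged = percentile < alpha or percentile > 1 - alpha
    return percentile, flagged
```

For example, a query of length 30 whose hits all span 90–110 residues lands at the 0th percentile and is flagged as suspiciously short.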
Fig. 1.

Contrasting GV graphs: (a), (e) sequence lengths; (b), (f) HSP offsets; (c), (g) overviews of hit regions; (d), (h) conserved regions. Graphs (a–d) were produced with a sequence for which GV detected no problems. The other graphs show typical problems: (e) query is short; (f), (g) query sequence is a fusion of unrelated genes; (h): query includes sequence absent from first 10 hits

Coverage: We determine whether hit regions match the query sequence more than once using a Wilcoxon test. Significance suggests that the query includes duplicated regions (e.g. resulting from the merging of tandem gene duplicates).

Conserved regions: We align the query to a position-specific scoring matrix profile derived from a multiple alignment of the ten most significant BLAST hits. This identifies potentially missing or extra regions (Fig. 1d, h and Supplementary Fig. S2).

Different genes: Deviation from unimodality of HSP start and stop coordinates indicates that HSPs map to multiple regions of the query. If this is the case, we perform a linear regression between HSP start and stop coordinates, weighting data points proportionally to BLAST significance (see Fig. 1b, c, f, g). Regression slopes between 0.4 and 1.2 (empirically chosen values) suggest that the query prediction combines two different genes (see Supplementary Fig. S1).

Two additional analyses are performed on nucleotide queries:

Main ORF: We expect a single major open reading frame (ORF). Frameshifts, retained introns or merged genes can lead to the presence of multiple major ORFs.

Similarity-based ORFs: We expect all BLAST hits to align within a single ORF. This test is more sensitive than the previous one when a query has HSPs in multiple reading frames.

An additional analysis is performed for MAKER gene predictions:

MAKER RNAseq Quality Index: MAKER gene predictions include a quality index (in the FASTA defline) indicating the extent to which the prediction is supported by RNAseq evidence. GV considers this information when it is available.
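The "different genes" regression can be sketched as follows. This is a simplified illustration, not GV's implementation: HSPs are assumed to arrive as (start, stop, e-value) tuples, weights are taken as -log10 of the e-value, and the 0.4–1.2 slope interval comes from the empirically chosen values quoted above:

```python
import math

def merged_gene_slope(hsps):
    """Weighted least-squares regression of HSP stop on start coordinates.

    hsps: list of (start, stop, evalue) tuples on the query.
    Returns the regression slope.
    """
    # weight each HSP by BLAST significance (more significant -> heavier);
    # cap the weight for e-values that underflow to zero
    pts = [(s, e, -math.log10(ev) if ev > 0 else 300.0) for s, e, ev in hsps]
    wsum = sum(w for _, _, w in pts)
    xbar = sum(w * s for s, _, w in pts) / wsum
    ybar = sum(w * e for _, e, w in pts) / wsum
    num = sum(w * (s - xbar) * (e - ybar) for s, e, w in pts)
    den = sum(w * (s - xbar) ** 2 for s, _, w in pts)
    return num / den

def suggests_merged_genes(slope):
    """Empirical rule from the paper: slopes in (0.4, 1.2) are suspicious
    when HSP coordinates are multimodal."""
    return 0.4 < slope < 1.2
```

In GV the slope test is only applied after the unimodality check; the sketch omits that step.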
Each analysis of each query returns a binary result (i.e. similar or different to BLAST hits) according to a P-value or an empirically determined cutoff. The results for each query are combined into an indicative overall quality score from 0 to 100. These scores allow comparison of the overall quality of different gene sets, and identification of the highest- or lowest-quality gene predictions within a gene set. The individual and global scores are provided in JSON and tab-delimited text formats, and as an HTML report that can be viewed in a web browser (Supplementary Fig. S3). Importantly, this HTML report includes up to five graphs for each gene (Fig. 1), as well as explanations of the analyses and results. These visualizations can be particularly useful to biocurators improving gene predictions.
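The combination of binary test results into a 0–100 score and a JSON record might look like the sketch below. The equal weighting, the field names and the example query id are assumptions for illustration; GV's actual weighting and JSON schema may differ:

```python
import json

def overall_score(results):
    """Combine binary per-test results into an indicative 0-100 score.

    results: dict mapping test name -> True (consistent with BLAST hits)
    or False. This sketch simply scores the fraction of passed tests.
    """
    if not results:
        return 0
    return round(100 * sum(results.values()) / len(results))

# hypothetical per-query record in the spirit of GV's JSON output
report = {"query": "gene_001",
          "validations": {"length": True, "coverage": True,
                          "conserved_regions": False, "different_genes": True}}
report["overall_score"] = overall_score(report["validations"])
print(json.dumps(report))
```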

3 Usage

GV is installed as a Ruby gem (Bonnal). The user provides protein or nucleotide gene predictions in FASTA format; BLAST is run remotely (at NCBI) or against a local database, or the user provides existing BLAST output. Alternatively, a web wrapper provides an elegant graphical interface and a programmatic jQuery API. Finally, GV can already be used from within the Afra genome annotation editor (Priyam et al., unpublished).

4 Discussion

GV’s power comes from leveraging large, frequently updated databases, from using multiple metrics, from input/output format flexibility and, importantly, from its multiple data visualization approaches. Indeed, visualization is crucial for understanding genomic comparisons (Nielsen; Riba-Grognuz). The code underlying GV respects best practices in scientific software development (Wurm, 2015). However, GV’s analyses depend on BLAST identification of homologs in databases that include low-quality sequences, on the expectation of similar gene sequence and structure among homologs, and on empirically chosen cutoffs. The binary results of individual tests are thus indicative rather than infallible. Similarly, GV’s overall quality evaluations are not ground truths but indicate consistency with database sequences. We used two approaches to assess the appropriateness of GV’s scoring system. GV scores for 10 000 randomly selected Swiss-Prot genes were significantly higher than GV scores for 10 000 randomly selected TrEMBL genes (Supplementary Fig. S4). Similarly, 73–90% of recently updated gene models from four eukaryotic genomes had higher GV scores than older versions (Supplementary Table S1; Supplementary Fig. S5). Both results are consistent with GV appropriately quantifying gene prediction improvements due to manual curation or improved gene prediction technologies. Lower GV scores for some gene predictions could be due to the reference databases containing low-quality sequences, to new automated predictions introducing new errors, or to scores being noisy for queries with few BLAST hits.
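The second validation approach above (73–90% of updated gene models scoring higher than their older versions) amounts to a paired comparison of two annotation releases. A minimal sketch, assuming scores are held in dicts keyed by gene id (a hypothetical data layout, not GV's):

```python
def fraction_improved(old_scores, new_scores):
    """Fraction of shared gene models whose updated version has a
    higher GV score than the older version.

    old_scores, new_scores: dicts mapping gene id -> GV score for the
    older and newer annotation releases.
    """
    shared = old_scores.keys() & new_scores.keys()
    if not shared:
        raise ValueError("no gene ids in common")
    improved = sum(1 for g in shared if new_scores[g] > old_scores[g])
    return improved / len(shared)
```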

5 Future work

GV was developed with a plug-in system for adding validation approaches. We plan to extend GV with improved orthology detection, additional validation approaches (e.g. codon usage, explicit RNAseq support) and improved statistics (e.g. evidence weighting based on phylogenetic and database-quality information). In its current form, GV can already save large amounts of time for biologists working with newly obtained gene predictions.
References

1.  MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes.

Authors:  Brandi L Cantarel; Ian Korf; Sofia M C Robb; Genis Parra; Eric Ross; Barry Moore; Carson Holt; Alejandro Sánchez Alvarado; Mark Yandell
Journal:  Genome Res       Date:  2007-11-19       Impact factor: 9.043

2.  Visualizing genomes: techniques and challenges.

Authors:  Cydney B Nielsen; Michael Cantor; Inna Dubchak; David Gordon; Ting Wang
Journal:  Nat Methods       Date:  2010-02-25       Impact factor: 28.547

3.  Visualization and quality assessment of de novo genome assemblies.

Authors:  Oksana Riba-Grognuz; Laurent Keller; Laurent Falquet; Ioannis Xenarios; Yannick Wurm
Journal:  Bioinformatics       Date:  2011-10-12       Impact factor: 6.937

4.  Quantitative measures for the management and comparison of annotated genomes.

Authors:  Karen Eilbeck; Barry Moore; Carson Holt; Mark Yandell
Journal:  BMC Bioinformatics       Date:  2009-02-23       Impact factor: 3.169

5.  Big data: The future of biocuration.

Authors:  Doug Howe; Maria Costanzo; Petra Fey; Takashi Gojobori; Linda Hannick; Winston Hide; David P Hill; Renate Kania; Mary Schaeffer; Susan St Pierre; Simon Twigger; Owen White; Seung Yon Rhee
Journal:  Nature       Date:  2008-09-04       Impact factor: 49.962

6.  Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics.

Authors:  Raoul J P Bonnal; Jan Aerts; George Githinji; Naohisa Goto; Dan MacLean; Chase A Miller; Hiroyuki Mishima; Massimiliano Pagani; Ricardo Ramirez-Gonzalez; Geert Smant; Francesco Strozzi; Rob Syme; Rutger Vos; Trevor J Wennblom; Ben J Woodcroft; Toshiaki Katayama; Pjotr Prins
Journal:  Bioinformatics       Date:  2012-02-12       Impact factor: 6.937

7.  A cost-effective RNA sequencing protocol for large-scale gene expression studies.

Authors:  Zhonggang Hou; Peng Jiang; Scott A Swanson; Angela L Elwell; Bao Kim S Nguyen; Jennifer M Bolin; Ron Stewart; James A Thomson
Journal:  Sci Rep       Date:  2015-04-01       Impact factor: 4.379

8.  GenBank.

Authors:  Dennis A Benson; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Eric W Sayers
Journal:  Nucleic Acids Res       Date:  2009-11-12       Impact factor: 16.971

9.  Evaluating high-throughput ab initio gene finders to discover proteins encoded in eukaryotic pathogen genomes missed by laboratory techniques.

Authors:  Stephen J Goodswen; Paul J Kennedy; John T Ellis
Journal:  PLoS One       Date:  2012-11-30       Impact factor: 3.240

10.  ASPic-GeneID: a lightweight pipeline for gene prediction and alternative isoforms detection.

Authors:  Tyler Alioto; Ernesto Picardi; Roderic Guigó; Graziano Pesole
Journal:  Biomed Res Int       Date:  2013-11-07       Impact factor: 3.411

