SAPP: functional genome annotation and analysis through a semantic framework using FAIR principles.

Jasper J Koehorst¹, Jesse C J van Dam¹, Edoardo Saccenti¹, Vitor A P Martins Dos Santos¹,², Maria Suarez-Diez¹, Peter J Schaap¹.

Abstract

Summary: To unlock the full potential of genome data and to enhance data interoperability and reusability of genome annotations we have developed SAPP, a Semantic Annotation Platform with Provenance. SAPP is designed as an infrastructure supporting FAIR de novo computational genomics but can also be used to process and analyze existing genome annotations. SAPP automatically predicts, tracks and stores structural and functional annotations and associated dataset- and element-wise provenance in a Linked Data format, thereby enabling information mining and retrieval with Semantic Web technologies. This greatly reduces the administrative burden of handling multiple analysis tools and versions thereof and facilitates multi-level large-scale comparative analysis.

Availability and implementation: SAPP is written in Java, runs on Unix-like operating systems and is freely available at https://gitlab.com/sapp. The documentation, examples and a tutorial are available at https://sapp.gitlab.io.

Contact: jasperkoehorst@gmail.com or peter.schaap@wur.nl.

Year: 2018 PMID: 29186322 PMCID: PMC5905645 DOI: 10.1093/bioinformatics/btx767

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Managing the genomic data deluge puts specific emphasis on the ability of machines to automatically find and use the data. To meet this demand and to extract maximum benefit from research investments, digital objects should be Findable, Accessible, Interoperable and Reusable (FAIR) (Wilkinson et al., 2016). Genome annotation data is usually findable and accessible through public repositories in which the data is linked to metadata providing detailed descriptions of the data acquisition and generation process. Interoperability reflects the potential for seamless integration of data from independent sources. Currently, genome comparisons usually involve a laborious process of data retrieval, modification and standardization (canonicalization). Reusability requires rich metadata with provenance for each annotation. Current standard formats (GenBank, EMBL or GFF3) retain the output of the prediction tools (for example for gene identification), but only when it scores better than a predefined, often pragmatic, prediction threshold. Detailed information on the actual prediction scores is lost, which hampers critical re-examination of the results. Because existing genome annotation data is hard to make FAIR and managing FAIR genome annotation data carries a considerable administrative load, we developed SAPP, a semantic framework for large-scale comparative functional genomics studies. SAPP can automatically annotate genome sequences using standard tools. The unique characteristic of SAPP is that the annotation results and their provenance are stored in a Linked Data format, thus enabling the deployment of mining capabilities of the Semantic Web. As the automatic annotations are incorporated into a dynamic framework, SAPP supports periodic querying, comparison and linking of diverse annotation sources, resulting in up-to-date genome annotations.
By interrogating metadata as part of a digital annotation object, annotation data becomes interoperable as the extraction procedure requires no additional standardization process.

2 Implementation

SAPP accepts annotated and non-annotated sequence files, which are converted into an RDF data structure using the GBOL ontology (van Dam et al.). Within SAPP, structural and functional annotation is performed using add-on modules incorporating existing standard annotation tools such as Prodigal and Augustus (Hyatt et al., 2010; Stanke and Morgenstern, 2005). Modules for tRNA, tmRNA, rRNA, protein domain and CRISPR repeat annotation are also available, and new modules can be added. Annotation data and metadata are stored in a compressed graph database (Fernández et al.), as shown in Figure 1A.
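To make the conversion step concrete, the following minimal sketch shows how a single gene prediction could be flattened into N-Triples statements that carry the prediction score alongside the coordinates. It is not SAPP code: the `http://example.org/annotation/` namespace, the predicate names and the function `gene_to_ntriples` are hypothetical placeholders, whereas the real GBOL ontology defines its own terms.

```python
from urllib.parse import quote

# Hypothetical namespace and predicate names, for illustration only.
# The real GBOL ontology defines its own IRIs and properties.
EX = "http://example.org/annotation/"


def iri(local_name: str) -> str:
    """Wrap a local name in the example namespace as an N-Triples IRI."""
    return f"<{EX}{quote(local_name, safe='/')}>"


def literal(value) -> str:
    """Render a value as a plain N-Triples string literal."""
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    text = text.replace("\n", "\\n").replace("\r", "\\r")
    return f'"{text}"'


def gene_to_ntriples(genome_id, gene_id, begin, end, strand, tool, score):
    """Return N-Triples lines for one predicted gene and its provenance."""
    gene = iri(f"{genome_id}/gene/{gene_id}")
    prov = iri(f"{genome_id}/gene/{gene_id}/provenance")
    rows = [
        (gene, iri("type"), iri("Gene")),
        (gene, iri("partOf"), iri(genome_id)),
        (gene, iri("begin"), literal(begin)),
        (gene, iri("end"), literal(end)),
        (gene, iri("strand"), literal(strand)),
        # The score is kept next to the annotation instead of being dropped.
        (gene, iri("provenance"), prov),
        (prov, iri("predictedBy"), literal(tool)),
        (prov, iri("score"), literal(score)),
    ]
    return [f"{s} {p} {o} ." for s, p, o in rows]


if __name__ == "__main__":
    for line in gene_to_ntriples("genomeA", "g1", 10, 400, "+", "Prodigal", 12.5):
        print(line)
```

Because every statement is a self-describing triple, output from further modules can be appended to the same graph without changing a file schema.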

Fig. 1

(A) The conversion module imports genome sequences in common formats. Annotation modules perform common tasks such as gene, tRNA, protein and protein domain annotation. Results are stored as Linked Data and consistency is ensured by the GBOL stack. (B) SPARQL query to retrieve the E-value score of the instances of the protein domain PF00465 across multiple bacterial genomes. (C) Distribution of E-values for protein domain PF00465 across multiple bacterial genomes: note the multimodality of the distribution. (D) Principal component analysis of functional similarities of 100 bacterial genomes from the Streptococcus (blue) and the Staphylococcus (orange) genera. PC1 and PC2 account for 51.4 and 10.1% of the variance in the dataset, respectively.

Genome annotations can be exported to standard formats. All data can be directly queried and compared using the SPARQL endpoint or via the GBOL API (Java/R). Complex queries can be performed on multiple genomes while simultaneously taking meta-data into account. A SPARQL query example is provided in Figure 1B. Examples to query SAPP from R, Java or Python, a tutorial and a list of publications in which SAPP was used can be found at http://sapp.gitlab.io.
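The query of Figure 1B is not reproduced in this text. As a rough sketch of what a query of that kind might look like, the helper below assembles a SPARQL SELECT statement for the E-values of one Pfam domain. The prefix `ex:` and the predicates `ex:partOf`, `ex:hasDomain`, `ex:accession`, `ex:provenance` and `ex:evalue` are assumptions made for illustration and are not GBOL terms; `domain_evalue_query` is likewise a hypothetical name.

```python
import re


def domain_evalue_query(pfam_accession: str, limit: int = 1000) -> str:
    """Build a SPARQL query for the E-values of one Pfam domain.

    All predicate names are hypothetical placeholders, not GBOL terms.
    """
    # Validate the accession so it can be embedded in the query safely.
    if not re.fullmatch(r"PF\d{5}", pfam_accession):
        raise ValueError(f"not a Pfam accession: {pfam_accession!r}")
    return f"""PREFIX ex: <http://example.org/annotation/>
SELECT ?genome ?protein ?evalue
WHERE {{
  ?protein ex:partOf ?genome ;
           ex:hasDomain ?hit .
  ?hit ex:accession "{pfam_accession}" ;
       ex:provenance ?prov .
  ?prov ex:evalue ?evalue .
}}
LIMIT {int(limit)}"""


if __name__ == "__main__":
    print(domain_evalue_query("PF00465"))
```

The resulting string could be submitted to a SPARQL endpoint with any HTTP client. Note that no score cut-off appears in the query, so all stored hits are returned and filtering is left to the analysis.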

3 Results and discussion

Reproducible computational research requires a management system that links data with data provenance. Interoperability requires a strictly defined ontology. Using and sharing Linked Data based on controlled vocabularies and ontologies ensures the interoperability and reusability of the data. SAPP functionalities are unique since none of the existing de novo annotation pipelines implement Semantic Web technologies. SAPP-generated data fulfil the applicable requirements for data FAIRness proposed by Wilkinson et al. (2016). For input and output, the annotation modules interact directly with the database, thereby forcing automatic linkage of data and provenance. In this way there is no need to work with predefined thresholds on the parameters controlling the annotation output. SAPP uses a controlled vocabulary to describe genome annotations. Consistency is ensured through the GBOL Stack (van Dam et al.). The GBOL ontology enables consistent genome annotation while integrating dataset-wise and element-wise provenance. The element-wise provenance is the statistical basis or score of each individual annotation, whereas the dataset-wise provenance refers to the programs, versions thereof and parameters used for the complete annotation of the (set of) sequences under study. GBOL makes use of existing ontologies: PROV-O for activity capturing (Lebo et al.); FOAF for agent information (Brickley and Miller, 2007); BIBO for article information stored within the annotation files (Giasson and D’arcus, 2008); SO for sequence information (Eilbeck et al., 2005); FALDO for genomic location (Bolleman et al., 2016), among many others. We refer the reader to van Dam et al. for detailed information on the integrated ontologies and the data model. Annotations can be evaluated through critical examination of the provenance. The use of SPARQL allows complex queries across data annotated with SAPP and direct comparison of these annotations with external resources, such as UniProt.
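The distinction between the two provenance levels can be illustrated with a small data model. The sketch below is an assumption-laden illustration rather than the GBOL data model: the class names, field names, tool name and example values are all hypothetical. Each hit stores its own score (element-wise provenance) and a reference to the run that produced it (dataset-wise provenance), so a threshold can be chosen when the data is queried instead of when it is stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetProvenance:
    """Dataset-wise provenance: how a whole annotation run was produced."""
    tool: str
    version: str
    parameters: str


@dataclass(frozen=True)
class DomainHit:
    """One annotation together with its element-wise provenance."""
    protein: str
    domain: str
    evalue: float            # element-wise: score of this single prediction
    run: DatasetProvenance   # dataset-wise: shared by every hit of the run


def select_hits(hits, domain, max_evalue):
    """Apply a threshold at query time; the stored hits stay untouched."""
    return [h for h in hits if h.domain == domain and h.evalue <= max_evalue]


# Hypothetical example data.
run = DatasetProvenance(tool="domain-scanner", version="1.0", parameters="default")
hits = [
    DomainHit("protA", "PF00465", 1e-40, run),
    DomainHit("protB", "PF00465", 1e-6, run),
    DomainHit("protC", "PF00465", 0.5, run),
]

if __name__ == "__main__":
    print(len(select_hits(hits, "PF00465", 1e-10)))  # strict cut-off
    print(len(select_hits(hits, "PF00465", 1e-3)))   # lenient cut-off
```

The same stored hits answer both questions, which is the practical consequence of not discarding predictions below a predefined threshold.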
Additionally, for specific questions, likelihood values can be integrated, normalized or corrected for multiple testing. For instance, studying the E-value distribution of instances of a protein domain across multiple genomes can inform optimal threshold selection, as shown in Figure 1C. SAPP implements existing tools: the consistency of SAPP annotation and a comparison with deposited annotations are shown and discussed in Koehorst et al. (2016). By querying multiple consistently annotated genomes simultaneously, large-scale functional comparisons can be performed without additional conversion steps [see Fig. 1D and Koehorst et al. (2016)]. These examples demonstrate that applying FAIR principles to genome annotation facilitates knowledge discovery.
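As one example of such a correction, the sketch below applies the Benjamini-Hochberg procedure to a list of scores retrieved from the annotation graph. It assumes that the retrieved values are p-values; E-values are expected counts rather than probabilities and would first need an appropriate conversion. The text does not state which correction method is used with SAPP, so the choice of Benjamini-Hochberg here is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Because the raw scores remain in the graph, the correction can be recomputed whenever the set of genomes under comparison changes.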

Funding

This work has received funding from the Research Council of Norway, No. 248792 (DigiSal) and from the European Union FP7 and H2020 under grant agreements No. 305340 (INFECT), No. 635536 (EmPowerPutida), Synthetic Biology Investment Theme (KB-32) from Wageningen University & Research, and No. 634940 (MycoSynVac).

Conflict of Interest: none declared.

7 in total

1. Protein domain architectures provide a fast, efficient and scalable alternative to sequence-based methods for comparative functional genomics.

Authors: Jasper J Koehorst; Edoardo Saccenti; Peter J Schaap; Vitor A P Martins Dos Santos; Maria Suarez-Diez
Journal: F1000Res Date: 2016-08-15

2. Prodigal: prokaryotic gene recognition and translation initiation site identification.

Authors: Doug Hyatt; Gwo-Liang Chen; Philip F Locascio; Miriam L Land; Frank W Larimer; Loren J Hauser
Journal: BMC Bioinformatics Date: 2010-03-08

3. The Sequence Ontology: a tool for the unification of genome annotations.

Authors: Karen Eilbeck; Suzanna E Lewis; Christopher J Mungall; Mark Yandell; Lincoln Stein; Richard Durbin; Michael Ashburner
Journal: Genome Biol Date: 2005-04-29

4. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints.

Authors: Mario Stanke; Burkhard Morgenstern
Journal: Nucleic Acids Res Date: 2005-07-01

5. Comparison of 432 Pseudomonas strains through integration of genomic, functional, metabolic and expression data.

Authors: Jasper J Koehorst; Jesse C J van Dam; Ruben G A van Heck; Edoardo Saccenti; Vitor A P Martins Dos Santos; Maria Suarez-Diez; Peter J Schaap
Journal: Sci Rep Date: 2016-12-06

6. The FAIR Guiding Principles for scientific data management and stewardship.

Authors: Mark D Wilkinson; Michel Dumontier; I Jsbrand Jan Aalbersberg; Gabrielle Appleton; Myles Axton; Arie Baak; Niklas Blomberg; Jan-Willem Boiten; Luiz Bonino da Silva Santos; Philip E Bourne; Jildau Bouwman; Anthony J Brookes; Tim Clark; Mercè Crosas; Ingrid Dillo; Olivier Dumon; Scott Edmunds; Chris T Evelo; Richard Finkers; Alejandra Gonzalez-Beltran; Alasdair J G Gray; Paul Groth; Carole Goble; Jeffrey S Grethe; Jaap Heringa; Peter A C 't Hoen; Rob Hooft; Tobias Kuhn; Ruben Kok; Joost Kok; Scott J Lusher; Maryann E Martone; Albert Mons; Abel L Packer; Bengt Persson; Philippe Rocca-Serra; Marco Roos; Rene van Schaik; Susanna-Assunta Sansone; Erik Schultes; Thierry Sengstag; Ted Slater; George Strawn; Morris A Swertz; Mark Thompson; Johan van der Lei; Erik van Mulligen; Jan Velterop; Andra Waagmeester; Peter Wittenburg; Katherine Wolstencroft; Jun Zhao; Barend Mons
Journal: Sci Data Date: 2016-03-15

7. FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation.

Authors: Jerven T Bolleman; Christopher J Mungall; Francesco Strozzi; Joachim Baran; Michel Dumontier; Raoul J P Bonnal; Robert Buels; Robert Hoehndorf; Takatomo Fujisawa; Toshiaki Katayama; Peter J A Cock
Journal: J Biomed Semantics Date: 2016-06-13
