Aravind Venkatesan1,2, Gildas Tagny Ngompe1,2, Nordine El Hassouni1,3,4, Imene Chentli1,2, Valentin Guignon4,5, Clement Jonquet1,2, Manuel Ruiz1,3,4,6, Pierre Larmande1,2,4,7.
Abstract
Recent advances in high-throughput technologies have resulted in a tremendous increase in the amount of omics data produced in plant science. This increase, in conjunction with the heterogeneity and variability of the data, presents a major challenge to adopting an integrative research approach. We are facing an urgent need to effectively integrate and assimilate complementary datasets to understand the biological system as a whole. The Semantic Web offers technologies for the integration of heterogeneous data and their transformation into explicit knowledge thanks to ontologies. We have developed Agronomic Linked Data (AgroLD, www.agrold.org), a knowledge-based system relying on Semantic Web technologies and exploiting standard domain ontologies, to integrate data about plant species of high interest for the plant science community, e.g., rice, wheat, and Arabidopsis. We present some integration results of the project, which initially focused on genomics, proteomics and phenomics. AgroLD is now an RDF (Resource Description Framework) knowledge base of 100M triples created by annotating and integrating more than 50 datasets coming from 10 data sources (such as Gramene.org and TropGeneDB) with 10 ontologies (such as the Gene Ontology and Plant Trait Ontology). Our evaluation results show that users appreciate the multiple query modes, which support different use cases. AgroLD's objective is to offer a domain-specific knowledge platform to solve complex biological and agronomical questions related to the implication of genes/proteins in, for instance, plant disease resistance or high yield traits. We expect the resolution of these questions to facilitate the formulation of new scientific hypotheses to be validated with a knowledge-oriented approach.
Year: 2018 PMID: 30500839 PMCID: PMC6269127 DOI: 10.1371/journal.pone.0198270
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Current plant species included in AgroLD.
Plant species and data sources in AgroLD.
| Data sources | URLs | File format | #tuples | Crops | Ontologies used | #triples produced |
|---|---|---|---|---|---|---|
| | | GAF | 1,160K | R, W, A, M, S | GO, PO, TO, EO | 6,200K |
| | | Custom flat file | 1,718K | R, W, M, A, S | GO, PO, TO, EO | 4,600K |
| | | Custom flat file | 1,400K | R, W, A, M, S | GO, PO | 50,000K |
| | | GFF | 1,100K | R, S, A | GO, SO | 14,800K |
| | | Custom flat file | 22K | R | PO, TO, CO | 300K |
| | | Custom flat file | 2K | R | PO, TO, CO | 20K |
| | | Custom flat file | 100K | R, A | GO, PO | 700K |
| | | HapMap, VCF | 16K | R | GO | 16,000K |
| | | Custom flat file | 2K | R | PO, TO | 20K |
| | | Custom flat file | 17K | R | GO, PO, TO | 160K |
| | | | | | | 92,640K |
The number of tuples indicates how many elements we have annotated from each data source (e.g., 1,160K Gene Ontology annotations). The crops and ontologies are abbreviated as follows: R = rice, W = wheat, A = Arabidopsis, S = sorghum, M = maize, GO = Gene Ontology, PO = Plant Ontology, TO = Plant Trait Ontology, EO = Plant Environment Ontology, SO = Sequence Ontology, CO = Crop Ontology (specific trait ontologies).
Fig 2. ETL workflow for the various datasets and data formats.
The workflow shows two types of process: (1) from relational databases through a CSV file export, in which case the transformation is tailored to the database model using custom Python converter scripts; (2) from standard file formats, in which case the transformation is generic and uses Python packages as converter tools. The workflow output can be produced in various RDF serialization formats such as Turtle, JSON-LD, and XML.
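The first kind of process (a CSV export turned into RDF by a tailored script) can be illustrated with a minimal sketch. This is not the AgroLD converter: the column names, the `http://example.org/agrold/` namespace, the `hasAnnotation` predicate, and the gene identifier `G1` are all hypothetical, chosen only for illustration. The output is N-Triples, a line-based subset of Turtle, and only the Python standard library is used.

```python
import csv
import io

# Hypothetical namespace and predicate, for illustration only (not AgroLD's).
BASE = "http://example.org/agrold/"
OBO = "http://purl.obolibrary.org/obo/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(text):
    """Escape characters that are not allowed raw in an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(csv_text):
    """Turn each CSV row (gene_id, label, go_term) into two N-Triples lines."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}gene/{row['gene_id']}>"
        label = escape_literal(row["label"])
        lines.append(f'{subject} <{RDFS_LABEL}> "{label}" .')
        # Map a CURIE such as GO:0000001 to an OBO-style IRI (GO_0000001).
        term_iri = OBO + row["go_term"].replace(":", "_")
        lines.append(f"{subject} <{BASE}vocab/hasAnnotation> <{term_iri}> .")
    return lines


# A one-row sample standing in for a database export.
SAMPLE = "gene_id,label,go_term\nG1,example gene,GO:0000001\n"
triples = rows_to_ntriples(SAMPLE)
print("\n".join(triples))
```

A generic converter for a standard format (the second kind of process) would follow the same pattern, with the row-to-triple mapping driven by the format specification rather than by one database schema.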
Fig 3. Linking information in AgroLD.
The figure illustrates the linking of various information in AgroLD.
Fig 4. SPARQL query editor.
The figure illustrates the execution of query Q6: (a) Q6 is one of the example queries in the top-right corner (highlighted in red). On executing the query, the results are rendered below the editor; (b) the user can look up specific genes of interest by clicking on the corresponding URI, which points to the original information source (in this case EnsemblPlants).
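To make the idea of such a query concrete, the sketch below builds a SPARQL query in the spirit of Q6 (genes participating in a named pathway) and encodes it as a GET request URL following the SPARQL 1.1 Protocol `query` parameter. The actual text of Q6 is not given here, so everything specific is an assumption: the endpoint `http://example.org/sparql` is a placeholder, and the `ex:participatesIn` predicate is hypothetical rather than part of the AgroLD schema. No network request is made.

```python
from urllib.parse import urlencode

# Placeholder endpoint; the real AgroLD endpoint URL is not assumed here.
ENDPOINT = "http://example.org/sparql"


def build_pathway_query(pathway_label):
    """Return a SPARQL query listing genes linked to the labelled pathway."""
    # Escape backslashes and quotes so the label is a valid string literal.
    safe = pathway_label.replace("\\", "\\\\").replace('"', '\\"')
    # ex:participatesIn is a hypothetical predicate used for illustration.
    return f"""PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex: <http://example.org/vocab/>
SELECT DISTINCT ?gene ?geneLabel
WHERE {{
  ?pathway rdfs:label "{safe}" .
  ?gene ex:participatesIn ?pathway ;
        rdfs:label ?geneLabel .
}}"""


def build_request_url(endpoint, query):
    """Encode the query as a SPARQL 1.1 Protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


query = build_pathway_query("Calvin cycle")
url = build_request_url(ENDPOINT, query)
print(url)
```

The resulting URL could be passed to any HTTP client; each `?gene` binding in the response would be a URI of the kind the editor renders as a clickable link.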
Fig 5. Exploring entity relationships in AgroLD.
The figure shows an alternative view of the results obtained for Q6, using the Explore Relationships tool. The results of Q6 can be visualized by entering the concepts (Calvin cycle and gene) in the left panel. On executing the query, all the genes involved in the chosen pathway are revealed. The visualized graph can be altered according to the user's interest. Additionally, a gene can be selected (circled on the left) and further explored by clicking on the More Info link, which directs the user to the information source.
Fig 6. Advanced search query form.
The figure demonstrates the steps involved in retrieving the results for Q6 using the Advanced Search query form: (a) query Q6 can be executed by selecting the type of entity to search (Pathways, highlighted in red) and entering the name of the entity (Calvin cycle). The API then displays the matched results; (b) clicking on a result displays the genes participating in the Calvin cycle; (c) selecting a gene of interest displays more information pertaining to that gene, for instance, the proteins it encodes and the pathways it participates in.