| Literature DB >> 35283794 |
NLIMED: Natural Language Interface for Model Entity Discovery in Biosimulation Model Repositories
Yuda Munarko, Dewan M Sarwar, Anand Rampadarath, Koray Atalag, John H Gennari, Maxwell L Neal, David P Nickerson.
Abstract
Semantic annotation is a crucial step in ensuring the reusability and reproducibility of biosimulation models in biology and physiology. For this purpose, the COmputational Modeling in BIology NEtwork (COMBINE) community recommends the use of the Resource Description Framework (RDF). This grounding in RDF provides the flexibility to search for entities within models (e.g., variables, equations, or entire models) using the RDF query language SPARQL. However, the rigidity and complexity of SPARQL syntax and the tree-like structure of semantic annotations are challenging for users. We therefore propose NLIMED, an interface that converts natural language queries into SPARQL, and use it to query and discover model entities in repositories of biosimulation models. NLIMED works with the Physiome Model Repository (PMR) and the BioModels database, and potentially with other repositories annotated using RDF. Natural language queries are first "chunked" into phrases and annotated against ontology classes and predicates using different natural language processing tools. The ontology classes and predicates are then composed into SPARQL and finally ranked using our SPARQL Composer and our indexing system. We demonstrate that NLIMED's approach to chunking and annotating queries is more effective than the NCBO Annotator at identifying relevant ontology classes in natural language queries. Comparison of NLIMED's behavior against historical query records in the PMR shows that it adapts appropriately to queries associated with well-annotated models.
Keywords: BioModels; NLP; SPARQL; information retrieval; ontology class; physiome model repository; semantic annotation
Year: 2022 PMID: 35283794 PMCID: PMC8908213 DOI: 10.3389/fphys.2022.820683
Source DB: PubMed Journal: Front Physiol ISSN: 1664-042X Impact factor: 4.566
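As the abstract notes, RDF-grounded annotations make model entities searchable with SPARQL. The sketch below shows the general flavor of such a query issued from Python via the SPARQLWrapper library; the endpoint URL is a placeholder and the annotation predicate (bqbiol:isVersionOf) and identifiers.org URI form are illustrative assumptions, not the exact RDF used by the PMR or BioModels.

```python
# Minimal sketch: querying an RDF-annotated model repository with SPARQL.
# The endpoint URL and the annotation predicate are illustrative assumptions;
# real PMR/BioModels annotations use repository-specific URIs.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://example.org/sparql"  # hypothetical SPARQL endpoint

query = """
PREFIX bqbiol: <http://biomodels.net/biology-qualifiers/>
SELECT ?entity WHERE {
    # Find model entities annotated against 'extracellular space' (GO:0005615).
    ?entity bqbiol:isVersionOf <http://identifiers.org/GO:0005615> .
}
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["entity"]["value"])
```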
Figure 1. NLIMED workflow. We first create a Text Feature Index (TFI) and an RDF Graph Index (RGI) from data available in the PMR, the BioModels database, and ontology dictionaries. A natural language query is first annotated with ontology classes in the NLQ Annotator module and then translated into SPARQL in the SPARQL Generator module.
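The two indexes in Figure 1 can be pictured as simple lookup structures: the TFI maps terms from ontology-class features to candidate classes, and the RGI maps classes to the entity annotation patterns they occur in. A minimal dict-based sketch follows; the names and structure are assumptions for illustration, not NLIMED's actual implementation.

```python
# Illustrative sketch of the two indexes in Figure 1 (not NLIMED's actual
# data structures). The Text Feature Index (TFI) maps terms from ontology
# features to candidate ontology classes; the RDF Graph Index (RGI) maps
# ontology classes to the annotation patterns they appear in.
from collections import defaultdict

text_feature_index = defaultdict(set)  # term -> {ontology class URIs}
rdf_graph_index = defaultdict(set)     # class URI -> {annotation pattern ids}

def index_ontology_class(class_uri, feature_texts):
    """Register every term from a class's textual features in the TFI."""
    for feature_text in feature_texts:
        for term in feature_text.lower().split():
            text_feature_index[term].add(class_uri)

# Example: index a class using its preferred label and a synonym,
# and record one (hypothetical) annotation pattern in the RGI.
index_ontology_class(
    "http://identifiers.org/GO:0005615",
    ["extracellular space", "intercellular space"],
)
rdf_graph_index["http://identifiers.org/GO:0005615"].add("pattern-1")
print(text_feature_index["extracellular"])
```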
Figure 2. Annotation of a model entity representing the concentration of potassium in the extracellular space. (A) RDF/XML code describing the model entity. (B) A tree structure representing the RDF code describing the model entity.
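For intuition, a Turtle rendering of an annotation like the one in Figure 2 can be loaded and walked with rdflib. The predicates, blank-node shape, and ontology URIs below are simplified assumptions rather than the exact RDF stored in the PMR.

```python
# Simplified sketch of a composite annotation like Figure 2's
# "concentration of potassium in extracellular space". The predicates and
# blank-node shape are assumptions, not the PMR's exact RDF.
import rdflib

ttl = """
@prefix bqbiol: <http://biomodels.net/biology-qualifiers/> .
@prefix ex: <http://example.org/model#> .

ex:K_conc
    bqbiol:isVersionOf <http://identifiers.org/opb/OPB_00340> ;  # concentration
    bqbiol:isPropertyOf [
        bqbiol:is <http://identifiers.org/CHEBI:29103> ;         # potassium ion
        bqbiol:isPartOf <http://identifiers.org/GO:0005615>      # extracellular space
    ] .
"""

g = rdflib.Graph()
g.parse(data=ttl, format="turtle")
# Walk the tree-like structure rooted at the model variable.
for s, p, o in g:
    print(s, p, o)
```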
Features representing an ontology class, used by the Phrase Annotator to calculate the degree of association between a phrase and the ontology class.

| Feature | Source | Description |
|---|---|---|
| Preferred label | Ontology dictionary | The primary phrase used to express the concept in the ontology class. |
| Synonym | Ontology dictionary | Alternative phrases used to express the concept in the ontology class. |
| Definition | Ontology dictionary | A detailed explanation of the concept in the ontology class. |
| Parent label | Ontology dictionary | The preferred label of the parent ontology class. |
| Entity description | Biosimulation model | Textual information collected from entities in biosimulation models; used as a shared feature between ontology classes annotating an entity. |
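A sketch of how these features might be combined into a phrase-to-class association score follows. The per-feature weights and the bag-of-words overlap measure are illustrative assumptions, not the paper's published formula.

```python
# Illustrative association score between a phrase and an ontology class,
# combining the features in the table above. The weights and the overlap
# measure are assumptions, not NLIMED's published formula.

FEATURE_WEIGHTS = {
    "preferred_label": 1.0,   # strongest signal
    "synonym": 0.9,
    "definition": 0.5,
    "parent_label": 0.5,
    "entity_description": 0.3,
}

def overlap(phrase, text):
    """Fraction of phrase terms that also occur in the feature text."""
    p, t = set(phrase.lower().split()), set(text.lower().split())
    return len(p & t) / len(p) if p else 0.0

def association_score(phrase, class_features):
    """Weighted sum of per-feature overlaps for one ontology class."""
    return sum(
        FEATURE_WEIGHTS[name] * overlap(phrase, text)
        for name, text in class_features.items()
        if name in FEATURE_WEIGHTS
    )

features = {  # hypothetical feature texts for one ontology class
    "preferred_label": "extracellular space",
    "synonym": "intercellular space",
    "definition": "space external to the outermost structure of a cell",
}
print(association_score("extracellular space", features))  # 1.7
```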
Figure 3. Example NLQ annotation, including chunking, dependency-level recognition, and similarity calculation, for “concentration of potassium in extracellular space.” The left side shows the use of parsers (CoreNLP and Benepar), while the right side shows the use of NER (Stanza and xStanza).
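As a stand-in for the parser-based chunking in Figure 3, the sketch below extracts noun phrases with NLTK's regexp chunker rather than CoreNLP, Benepar, or Stanza; the chunk grammar is an assumption chosen for this example.

```python
# Stand-in NP chunker for the chunking step in Figure 3, using NLTK instead
# of the CoreNLP/Benepar/Stanza tools the paper compares; the chunk grammar
# below is an illustrative assumption.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

GRAMMAR = "NP: {<JJ>*<NN.*>+}"  # adjectives followed by one or more nouns

def chunk_phrases(nlq):
    tokens = nltk.word_tokenize(nlq)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.RegexpParser(GRAMMAR).parse(tagged)
    return [
        " ".join(word for word, _ in subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() == "NP"
    ]

print(chunk_phrases("concentration of potassium in extracellular space"))
# e.g. ['concentration', 'potassium', 'extracellular space']
```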
Figure 4. Example of SPARQL generation in the SPARQL Composer, where (A) phrases and ontology classes produced by the NLQ Annotator (B) are combined and checked for availability in the RGI, (C) then related to available entity annotation patterns, and (D) finally compiled into SPARQL and ranked.
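A toy version of the composition step in Figure 4: each phrase carries ranked candidate ontology classes with association scores, combinations of candidates are scored, and each combination is rendered as a SPARQL query. The URIs, the generic predicate variable, and the product-of-scores ranking rule are all illustrative assumptions.

```python
# Toy version of the SPARQL Composer step in Figure 4. Each phrase has
# ranked candidate ontology classes with association scores; combinations
# are scored by the product of their scores and rendered as SPARQL.
# The URIs and the scoring rule are illustrative assumptions.
from itertools import product

candidates = {
    "potassium": [("http://identifiers.org/CHEBI:29103", 0.9)],
    "extracellular space": [
        ("http://identifiers.org/GO:0005615", 0.8),
        ("http://example.org/class/extracellular", 0.4),
    ],
}

def compose_sparql(candidates):
    ranked = []
    for combo in product(*candidates.values()):
        score = 1.0
        patterns = []
        for uri, s in combo:
            score *= s
            patterns.append(f"?entity ?p <{uri}> .")
        sparql = "SELECT ?entity WHERE { " + " ".join(patterns) + " }"
        ranked.append((score, sparql))
    return sorted(ranked, reverse=True)

for score, q in compose_sparql(candidates):
    print(f"{score:.2f}  {q}")
```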
The performance of NLIMED in annotating NLQs on a test dataset containing 52 NLQs.

| Method | Precision | Recall | F-measure |
|---|---|---|---|
| NCBO Annotator | 0.542 | 0.504 | 0.522 |
| WPL + NoDep + benepar | 0.690 | 0.426 | 0.527 |
| WPL + NoDep + coreNLP | | 0.539 | 0.617 |
| WPL + NoDep + stanza | 0.553 | 0.452 | 0.498 |
| WPL + NoDep + xStanza | 0.624 | 0.548 | 0.583 |
| WPL + Dep + benepar | 0.636 | 0.426 | 0.510 |
| WPL + Dep + coreNLP | 0.650 | | |
| WPL + Dep + stanza | 0.635 | 0.530 | 0.578 |
| WPL + Dep + xStanza | 0.615 | 0.557 | 0.584 |
We modify features by distributing the preferred label to the other features (WPL). CoreNLP demonstrates the highest performance as measured by precision, recall, and F-measure (rows 3 and 7). Moreover, adding the term dependency level to CoreNLP increases recall and F-measure (row 7), although it decreases precision.
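The F-measure reported in the table is the standard harmonic mean of precision and recall, which can be verified against the visible rows:

```python
# F-measure as the harmonic mean of precision and recall; the visible rows
# of the table above are consistent with this definition.
def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.542, 0.504), 3))  # 0.522  (NCBO Annotator)
print(round(f_measure(0.690, 0.426), 3))  # 0.527  (WPL + NoDep + benepar)
```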
Figure 5. Analysis of NLIMED performance over different numbers of terms and phrases in the NLQ. (A) F-measure by the number of terms in the NLQ. (B) F-measure by the number of phrases in the NLQ.