Literature DB >> 30462164

Harmonizing semantic annotations for computational models in biology.

Maxwell Lewis Neal¹, Matthias König², David Nickerson³, Göksel Mısırlı⁴, Reza Kalbasi³, Andreas Dräger^5,6, Koray Atalag³, Vijayalakshmi Chelliah⁷, Michael T Cooling³, Daniel L Cook^8,9, Sharon Crook¹⁰, Miguel de Alba¹¹, Samuel H Friedman¹², Alan Garny³, John H Gennari⁹, Padraig Gleeson¹³, Martin Golebiewski¹⁴, Michael Hucka¹⁵, Nick Juty⁷, Chris Myers¹⁶, Brett G Olivier^17,18, Herbert M Sauro¹⁹, Martin Scharm²⁰, Jacky L Snoep^21,22,23, Vasundra Touré²⁴, Anil Wipat²⁵, Olaf Wolkenhauer^20,26, Dagmar Waltemath²⁰.

Abstract

Life science researchers use computational models to articulate and test hypotheses about the behavior of biological systems. Semantic annotation is a critical component for enhancing the interoperability and reusability of such models as well as for the integration of the data needed for model parameterization and validation. Encoded as machine-readable links to knowledge resource terms, semantic annotations describe the computational or biological meaning of what models and data represent. These annotations help researchers find and repurpose models, accelerate model composition and enable knowledge integration across model repositories and experimental data stores. However, realizing the potential benefits of semantic annotation requires the development of model annotation standards that adhere to a community-based annotation protocol. Without such standards, tool developers must account for a variety of annotation formats and approaches, a situation that can become prohibitively cumbersome and which can defeat the purpose of linking model elements to controlled knowledge resource terms. Currently, no consensus protocol for semantic annotation exists among the larger biological modeling community. Here, we report on the landscape of current annotation practices among the COmputational Modeling in BIology NEtwork community and provide a set of recommendations for building a consensus approach to semantic annotation.

Entities: CellLine Chemical Disease Gene Species

Keywords: computational modeling; data integration; knowledge representation; modeling standards; semantic annotation

Mesh：

Year: 2019 PMID： 30462164 PMCID： PMC6433895 DOI： 10.1093/bib/bby087

Source DB: PubMed Journal: Brief Bioinform ISSN： 1467-5463 Impact factor: 11.622

Introduction

Biological researchers use computational models to articulate hypotheses about the organization and dynamics of biological systems. Modeling has become an essential component of life science research; it provides a basis for investigating system perturbations, predicting experimental outcomes and identifying a system’s most critical processes. Given the increasingly tight coupling between computational biology and biomedical research, the research community benefits when models can be easily repurposed across research groups. However, modelers use a variety of modeling languages and simulation platforms, not all of which interoperate coherently. Researchers aiming to repurpose a published model for their investigations must often resort to hand-coding the model anew, which is a time-consuming and error-prone process. Although much progress has been made in developing standardized model exchange formats such as the Systems Biology Markup Language (SBML) [1] and CellML [2], fundamental roadblocks associated with model reuse remain [3-6]. Here, we identify several critical bottlenecks that impede the reuse of models and model components as well as their integration. We argue that standardizing and expanding the semantic annotations on models remove these bottlenecks by helping researchers more quickly locate models, automate model compositions, translate between modeling formats and integrate the biological knowledge encoded in models. Our perspective on semantic annotation is shaped by our participation in the COmputational Modeling in BIology NEtwork (COMBINE), a community of researchers developing standards for modeling in computational biology [7]. Collectively, we represent teams developing SBML, CellML, the Simulation Experiment Description Markup Language (SED-ML) [8], the Systems Biology Graphical Notation (SBGN) [9], the Synthetic Biology Open Language (SBOL) [10], NeuroML [11, 12], rule-based modeling languages [13, 14], MultiCellDS [15], the COMBINE archive [16], the Semantic Simulation (SemSim) architecture [17], FAIRDOMHub [18], the SABIO-RK database [19], as well as BioModels [20] and the Physiome Model Repository [21, 22]. Here, we present a set of agreed-upon recommendations for harmonizing semantic annotations for computational models across these development efforts. Adopting these recommendations within our communities, together with the implementation of software libraries and tools that adhere to the recommendations, will vastly extend the possibilities for interpreting, comparing and evaluating computational models and thus enhance model reuse and repurposing.

Semantic annotations and their utility

In the context of computational modeling in biology, we define a semantic annotation as a computer-accessible metadata item that captures, entirely or in part, the meaning of a model, model component or data element. For example, an annotation on a model variable might indicate that it represents the concentration of cytosolic glucose in a pancreatic beta cell or it might describe a purely computational feature such as the simulation time step. To further clarify, we make a distinction between semantic annotations and other types of metadata that do not indicate meaning, such as curatorial annotations that indicate model authorship, provenance annotations that indicate the origins of model components (e.g. nanopublications) or tool-dependent annotations that indicate layout information for model visualization. Broadly, semantic annotations are a critical feature of the vision of the Semantic Web, wherein documents are linked to metadata describing the document’s contents, thus facilitating search and retrieval as well as data interoperability [23]. Semantic annotations are used in many fields besides biological modeling, including the geosciences [24], music retrieval [25] and business process modeling [26]. As defined above, semantic annotations describe the meaning of a model’s contents in a computer-accessible manner. These annotations are needed because, like much of the biological community, modelers and tool developers do not use a standardized set of names to indicate the meaning of model elements. One modeler may use a variable named ‘X’ to represent cytosolic glucose concentration in a pancreatic beta cell, and another modeler may use ‘X’ to indicate blood flow rate through the aortic valve. In the absence of community-wide naming conventions, machine-accessible metadata is used to capture the meaning of model elements. This metadata links model elements to terms from knowledge resources such as controlled vocabularies and ontologies, allowing different software tools to recognize when two models represent the same or similar biological features, which in turn helps researchers to align, reuse and merge models. Annotations have contributed to the successful reuse and exploration of biological models and data in tasks such as comparison [27, 28], interpretation [29], retrieval [30-32], integration [33-37], simulation [38], translation between formats [29, 37,39–42] and visualization [37, 43–45]. Semantic annotations are also a key component for model-driven design of synthetic biological systems where they are used in model composition tasks when constructing optimum biological systems built from models [41, 45–48].

Applying semantic annotations to address challenges in model reuse and integration

Several barriers impede the reuse of models and model components. Researchers cannot readily search across multiple repositories to find models of interest, and when a researcher does retrieve a relevant model, they must determine whether it (or part of it) can be repurposed for use in their modeling work. Similarity measures and pattern-matching algorithms can help identify relevant modeling components [28, 49], but existing proposals for cross-repository search and retrieval [32] are mostly theoretical and not yet applicable in practice. Applying semantic annotations in a standardized way across model repositories would provide a common ground for such cross-repository searches, allowing models encoded in diverse formats to be retrieved based on their shared semantic descriptions. The amount of time required to select a model for reuse increases when the researcher must choose from a large set of potentially usable models. With only limited means to compare models automatically, researchers must manually assess the content, scope and underlying assumptions of each candidate model as well as the biological questions each was designed to answer. While semantic annotations cannot yet capture the purpose for which a model was built or modified, they can make the biological content of a model explicit and therefore help researchers decide whether or not to repurpose it. Semantic annotations can also be used to quantify the similarity between model elements [27, 28, 50–54], helping to ensure that the user is presented with the most relevant models following a search. Other types of annotations, such as those that capture provenance information [55-58], can also help researchers determine which models best meet their research needs and make comparisons between different model versions [59]. Another set of barriers is associated with model-to-model integration. Composing new models from existing models remains a mostly manual and error-prone process, and previous work has demonstrated how it can be accelerated using semantic annotations [31, 34, 36, 60]. By examining semantic annotations, software tools can recognize where models overlap in their biological content and then provide recommendations about how the models could be coupled. An additional impediment to model reuse is the lack of integration between model repositories and online collections of experimental data. If such integration existed, modelers could readily find data appropriate for validating or constraining the models they repurpose, and experimentalists could find models for use in data analysis pipelines. If datasets and models were semantically annotated using a standard protocol, software tools could discover relevant connections between them and accelerate research. Efforts such as the SourceData project [61], which aims to make biological data discoverable through the use of metadata annotations, can help achieve this vision. Complementary efforts that link models with associated datasets also contribute to this goal. For example, the Simulation Database [62] component of the JWS Online Model Repository [63] or tellurium-web [64] provides standardized descriptions of reproducible simulation experiments (encoded in SED-ML), manually linked to model code and, if available, original experimental datasets. Using such standardized descriptions, researchers can readily reproduce and evaluate curated simulations against experimental datasets. The opportunities for integrating models and data could substantially increase if model and data repositories such as these used a standard semantic annotation protocol.

Examples of current semantic annotation practices

Participants across several modeling initiatives recognize the importance of semantic annotations; however, annotation practices vary across these efforts and are in need of standardization. To illustrate, we profile the semantic annotation practices of three initiatives within COMBINE in this section. These profiles are not meant to provide a comprehensive summary of all initiatives that focus on annotating biological models, but rather to offer examples of initiatives that, when considered together, demonstrate the need for harmonized annotation approaches.

BioModels

BioModels is a repository of publicly available models maintained at the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI). The most recent BioModels release (#31) includes 640 curated models. The BioModels team has created guidelines for annotating the semantics of these models (https://www.ebi.ac.uk/biomodels-main/annotationtips). These guidelines indicate which components of curated models should be annotated and provide guidance on choosing appropriate reference terms. They have been created following the Minimum Information Requested In the Annotation of Models (MIRIAM) [65] guidelines. Annotations within curated SBML models also follow the instructions given in SBML specification documents [66]. They are primarily encoded as Resource Description Framework (RDF) statements (https://www.w3.org/RDF/), although the SBML format also contains explicit constructs for associating Systems Biology Ontology [38] terms with model elements outside of the RDF content. The BioModels team has developed an in-house annotation tool for capturing the meaning of the biological aspects of SBML models using terms from a defined list of established knowledge resources such as UniProt [67], Chemical Entities of Biological Interest (ChEBI) [68] and the Gene Ontology [69]. Given that a model must be semantically annotated for inclusion in the curated branch of BioModels, the repository contains a broad set of semantic annotations. For details on the number and granularity of these annotations, we refer the reader to Henkel et al. [32] and Alm et al. [27]. Additionally, the EMBL-EBI maintains a service at https://identifiers.org for resolving the Uniform Resource Identifiers (URIs) used in annotations [70]. Formatted according to the identifiers.org guidelines, URIs from various biomedical knowledge resources can be resolved online. Together with the COMBINE-maintained BioModels.net ‘qualifiers’ (https://co.mbine.org/standards/qualifiers), also known as ‘predicates’ or ‘relations’, they allow the construction of complete semantic annotations [65] linking elements of COMBINE formats (e.g. SBML, SED-ML or COMBINE archive metadata) to knowledge resource terms in order to define an element’s biological meaning. Figure 1 shows an RDF-based semantic annotation on an example SBML model from BioModels.

Figure 1

Example RDF-based annotation from SBML model BIOMD0000000239 [71] in BioModels. The annotation block (indicated by a curly brace) defines the biological meaning of a physical compartment in the model. The RDF block within the element links the compartment’s metadata identifier ‘metaid_MT_IMS’ to the Gene Ontology term GO:0005758, which represents the mitochondrial intermembrane space. The use of the bqbiol:is predicate in this link indicates that the compartment is defined as the mitochondrial intermembrane space.

The Physiome Model Repository

The Auckland Bioengineering Institute at the University of Auckland manages the Physiome Model Repository, which currently contains over 800 CellML models as well as models and simulation protocols encoded in various other formats. Annotation of CellML models is currently limited, but a collection of metadata specifications exists that provide recommendations and best practices for annotating models [72, 73]. Although the CellML metadata specification states that semantic annotations should be serialized externally, current tools used in the CellML community such as OpenCOR embed RDF/XML annotations in the CellML documents themselves, using identifiers.org URI formatting and BioModels.net qualifiers. The CellML format focuses on representing the mathematical aspects of models, and it does not include biological constructs as in SBML models. Consequently, a model’s variables must be linked to knowledge resource terms to capture the precise meaning of what a CellML model simulates. This presents challenges, as these variables represent concepts that can be very fine-grained (e.g. concentration of cytosolic glucose in a pancreatic beta cell), and publicly available knowledge resources do not provide adequate coverage for such concepts. Thus, the CellML group has been collaborating with the creators of the SemSim architecture, described next, to develop an annotation approach that uses composite annotations [17] to describe such fine-grained concepts.

The SemSim architecture and SemGen

The SemSim architecture is a logical framework for capturing the biophysical meaning of what is represented in a biological model. Central to this architecture are composite annotations: logical statements that link multiple knowledge resource terms to precisely define a model element. The primary motivation behind the composite annotation approach is that biological models often simulate concepts that are not represented among the set of publicly available biological knowledge resources; therefore, annotators often cannot define a model element via a reference to a single controlled vocabulary term. With composite annotations, annotators can instead build a definition from multiple, more fundamental terms that are available in knowledge resources. For example, a model variable might simulate the concentration of cytosolic glucose in a pancreatic beta cell, but this concept is not explicitly represented in any single publicly available knowledge resource. Applying the SemSim architecture, an annotator can create a composite annotation that captures the variable’s biological meaning by linking the terms Chemical concentration from the Ontology of Physics for Biology (OPB) [74, 75], glucose from ChEBI, Cytoplasm from the Foundational Model of Anatomy (FMA) [76] and Type B cell of pancreatic islet from the FMA. Depending on annotator preferences, synonymous terms from alternative knowledge resources could be used to create this composite annotation. For example, Uberon [77] could be used as an alternative to the FMA. The SemSim development group has created SemGen (https://github.com/SemBioProcess/SemGen), a software tool for applying SemSim-compliant composite annotations to computational models. SemGen also provides capabilities for semantics-based model composition wherein a model’s annotations are leveraged to automate merging and extraction tasks [33, 34]. Using SemGen, composite annotations applied to a computational model can be stored within SemSim models, which are encoded in the RDF-based Web Ontology Language (OWL, https://www.w3.org/OWL/), or within SBML and CellML models as standard RDF. RDF is sufficiently expressive for serializing composite annotations, and SemSim models are encoded in OWL because the SemSim development group is exploring the use of automated inference algorithms for reasoning over the biological knowledge articulated in models [78]. The SemSim development group maintains an informal protocol for annotating a model with SemGen in accordance with the SemSim framework ( https://github.com/SemBioProcess/SemGen/wiki/Annotation-protocol). These guidelines indicate which model components should be semantically annotated and at what level of detail. They also list recommended knowledge resources to use for different components of a composite annotation. For example, the OPB is recommended for physical property terms (chemical concentration, fluid volume, etc.), and ChEBI for small molecules. As evidenced by these three initiatives, there is currently no consensus approach to semantic annotation being applied to the different projects under COMBINE. Every COMBINE initiative either uses a protocol specific to that initiative or has no established protocol for semantic annotation. A consensus approach is essential not only for integration across COMBINE standards but also for integration with external biological data representation efforts. For example, the openEHR (http://www.openehr.org/) and HL7 Fast Healthcare Interoperability Resources (https://www.hl7.org/fhir/) initiatives have established protocols for the semantic annotation of clinical data (called terminology binding). As discussed above, a consensus protocol for annotating models and data would greatly benefit researchers who apply models to understand clinical data. Working toward this goal, members of the openEHR and COMBINE communities have recently begun collaborating to coordinate their semantic annotation approaches. Thus, a primary challenge for the COMBINE community is to develop an approach to semantic annotation that will help ensure annotations among COMBINE standards are widely interoperable and consistent.

Recommendations

Through discussions and face-to-face meetings at COMBINE meetings, we have developed the following seven recommendations for harmonizing semantic annotations. They are summarized in Box 1.

Box 1. Summary of Recommendations.

Use RDF, URIs and qualifiers to encode semantic annotations This recommendation indicates which technical standards to use for storage of semantic annotations. Store annotations in a separate file This practice will help normalize the format in which annotations are stored across community initiatives. Storing annotations apart from the model also allows different research groups to annotate a model using different knowledge resources. Establish a dedicated group for developing a software library that supports semantic annotation standards The proposed group will create and disseminate semantic annotation standards, develop standards-compliant software, promote consistency in annotation practices and ensure that advancements in annotation standards are coordinated across the modeling community. Document which knowledge resources should be used for annotation and why Model annotators have different requirements for disambiguating model content, depending on their research goals. We encourage research groups to articulate their preferred knowledge resources on a group-by-group basis and provide publicly available documentation describing how specific knowledge resources should be used in annotations. This will help the broader community determine how to map annotations between research efforts. Establish a repository of reusable annotations This will help reduce the time required for annotation and promote inter-annotator consistency. Annotations in the repository could also be used for annotating experimental data. Ensure high-quality semantic annotations through training and quality control processes Annotators should be trained so their annotations are specific, complete and as consistent as possible across model repositories. We encourage annotation initiatives to develop quality control protocols to ensure annotations adhere to community standards. Establish and maintain collaborations with knowledge resource developers Concepts represented in biological models are not always present in existing knowledge resources. Establishing and maintaining collaborative relationships with knowledge resource developers will help ensure that new terms needed for annotation become available in a timely manner.

(i) Use RDF, identifiers.org URIs and BioModels.net qualifiers to encode semantic annotations

This recommendation addresses the issue of how to standardize the storage of semantic annotations within documents, an essential technical pre-requisite for harmonizing their representation across model annotation efforts. We recommend using RDF, a World Wide Web Consortium-recommended standard for representing information on the Web, for encoding annotations. RDF has emerged as the de facto standard for encoding semantic annotations among the COMBINE community; all COMBINE standards currently use it. While more expressive knowledge representation formats exist, our experience indicates that RDF is sufficiently expressive for articulating the kinds of semantic annotations required to catalyze the significant advances in model discovery, reuse and integration that we envision. RDF statements are built using subject-predicate-object triples that, as shown in Figure 1, can be used to assert relationships between model components and terms from online knowledge resources. A primer on RDF is available (https://www.w3.org/TR/rdf11-concepts/), and links to examples of RDF-encoded model annotations can be found in the Example Semantic Annotations section below. Example RDF-based annotation from SBML model BIOMD0000000239 [71] in BioModels. The annotation block (indicated by a curly brace) defines the biological meaning of a physical compartment in the model. The RDF block within the element links the compartment’s metadata identifier ‘metaid_MT_IMS’ to the Gene Ontology term GO:0005758, which represents the mitochondrial intermembrane space. The use of the bqbiol:is predicate in this link indicates that the compartment is defined as the mitochondrial intermembrane space. We recommend using the identifiers.org URI format when referencing knowledge resource terms in RDF statements. Identifiers.org supports a vast set of biological knowledge resources used for semantic annotation, including databases as well as ontologies, and because identifiers.org-formatted URIs are web-resolvable. These URIs are widely adopted by the biological research community, particularly to facilitate scientific data interoperability and are beginning to show uptake among high-impact publishing groups [79, 80]. The identifiers.org services are capable of more complex URI resolution compared to alternative services. For example, identifiers.org is specifically built to address downtime and changing endpoints and directs users to an alternative site for a given data record as long as one is listed (one-to-many mappings) whereas persistent uniform resource locator services specify only one endpoint for URI resolution (one-to-one mappings). We also recommend using identifiers.org-formatted URIs because they use a uniform and simple nested structure that facilitates generation and parsing, and because identifiers.org reuses data providers’ record identifiers. We recommend the use of BioModels.net qualifiers in annotation statements that define model elements. These existing qualifiers provide a basic level of coverage needed for articulating semantic annotations in models, and they are specifically intended for use in statements that link computational abstractions of physical phenomena to knowledge resource terms representing the material manifestations of those phenomena. We would welcome efforts to unify BioModels.net qualifiers with alternative collections of qualifiers that do not have an explicit focus on linking elements in computer simulations with physical phenomena, such as the Relations Ontology [81].

(ii) Store annotations in a separate file

Current practice among the COMBINE community is to store semantic annotations within the same file that specifies a model’s computational aspects. This practice is in line with the initial MIRIAM guidelines put forth for annotating models [65]; however, there are several reasons why we recommend storing semantic annotations in a separate file. First, we wish to normalize the format in which annotations are stored across the different COMBINE standards. Currently, the exact format used to store annotations within model files differs slightly from standard to standard. Normalizing the format will simplify the development of software that provides programmatic manipulation of semantic annotations and will allow for better separation between modeling and annotation tasks. This will also remove the burden of supporting semantic annotation from the software teams that are developing software libraries for specific COMBINE standards. We recommend establishing a separate development group that will be responsible for creating and maintaining a common data model and annotation library (Recommendation 3). If the community decides to change the way semantic annotations are stored or processed, this should not require each team of developers to adjust their software. Instead, there should be one development project focused on the programmatic manipulation of semantic annotations, and the development team for that annotation library should implement the community’s decisions. We also recommend storing annotation files separately because we recognize that different research groups may have different preferences for which knowledge resources to use for annotation. Externalizing annotations in a separate file allows a single model file to be referenced by multiple annotation files, allowing different research groups to describe the same modeling resource in different ways. This approach follows the vision of the COMBINE archive, wherein multiple types of modeling files are archived together (as in a zip file) to make simulation experiments readily reproducible and shareable among research groups [16]. When sharing models, we recommend that annotations be distributed along with the files they annotate, and COMBINE archives provide a standardized way to bundle such files together. We recognize that storing annotations in a separate file requires keeping them synchronized. For example, if a variable identifier changes in the model file, that change should be reflected in the annotation file(s) as well. However, ensuring synchrony is an issue regardless of whether or not annotations are stored in a separate file, and we recommend that the community encourage the development of tools that help ensure coordination between a model’s computational aspects and its semantic annotations. An additional advantage of storing annotations in a separate file is that the RDF content can be serialized in various formats including XML or Turtle, whereas currently the serialization is dictated by the model format.

(iii) Establish a dedicated group for developing a software library that supports semantic annotation standards

Programmatically creating, applying and managing semantic annotations on models require substantial software development. The modeling community, especially model curators, would benefit from the creation of a single dedicated group that focuses on creating and disseminating semantic annotation standards and developing standards-compliant software. The organization and operation of the new working group would follow the model of other successful software development efforts within COMBINE, such as those that focus on developing the widely adopted SBML libraries libSBML [82] and JSBML [83]. Using this model, participants within the development group can come from different research labs, and funding to support the group’s efforts can come from multiple sources. Centralized software development systems such as GitHub can be used to develop a semantic annotation library collaboratively among members of the group, and working group meetings at COMBINE conferences/hackathons provide venues for promoting group cohesion and for interfacing with other standards development efforts. Online forums can be used to resolve outstanding software issues, propose enhancements, discuss the working group’s priorities, etc. Although this operational model has worked in the past, it is crucial that funding for the effort is as continuous as possible. It is also important that individuals contributing to the effort can be compensated through alternative funding mechanisms in case a specific funding source ends. The semantic annotation group’s role would be to promote consistency in annotation practices and help ensure that advancements in annotation standards are coordinated across the modeling community. It would also work to ensure the longevity of annotation tools beyond individual software projects within COMBINE. Given that several tools already provide model annotation capabilities, including OpenCOR [84], SemGen [17, 34], COPASI [85], JWS Online [63], CellDesigner [86] and the annotation tools used internally by the BiGG Models Database [87] and BioModels, the working group’s focus would not be to create a one-size-fits-all semantic annotation toolset, but rather provide guidance and core software infrastructure to help ensure that users and developers of modeling software adhere to the community’s annotation standards. This centralized semantic annotation software package should harmonize with other annotation standards, such as the curatorial metadata standards of MIRIAM and should support compliance-checking. We would also encourage package developers to provide multilingual support so that the lexical content associated with models and annotations can be readily converted into a wide variety of languages. This would further enhance researchers’ abilities to comprehend the meaning of model elements.

(iv) Document which knowledge resources should be used for annotation and why

The COMBINE community currently uses terms from a variety of biomedical knowledge resources to capture the meaning of model elements. Some of these knowledge resources overlap in content, and there is no broad consensus about which resources should be used for an annotation. Therefore, the same biophysical concept represented in two different models might be annotated against different knowledge resource terms. This undermines the community’s ability to compare and compose models in an automated fashion, as well as convert between standard formats. Ideally, synonymous content within models should be annotated using the same set of reference terms and qualifiers. Otherwise, the developers of model comparison and composition tools must rely on mappings between knowledge resources to identify synonymous terms. This can be burdensome, given the increased computational cost and the challenges associated with ontology mapping and alignment [88]. It is in the interest of the community to agree upon a set of core knowledge resources that will be used for annotation and also define the scope of use for each resource. That being said, we recognize that different research groups have different requirements for disambiguating model content. For example, UniProt might be a more useful resource compared to the Protein Ontology (PRO) [89] for a group that needs to disambiguate proteins by amino acid sequence. For a group that does not require sequence information and wishes to use taxon-neutral protein entities, the PRO might be more useful. Furthermore, the value of different knowledge resources may change over time. Therefore, we do not propose a short list of recommended knowledge resources here. Instead, we encourage research groups to make recommendations on a group-by-group basis and provide publicly available documentation describing how specific knowledge resources should be used in annotations (see, for example, the SemSim development group’s annotation protocol mentioned above). This will help those outside the research group determine how to map annotations between research efforts. Such mappings represent perhaps the most significant challenge associated with this approach, and we therefore stress the importance of creating and maintaining formal mappings between knowledge resources, such as those provided by BioPortal (https://bioportal.bioontology.org/) and the EMBL-EBI Ontology Xref Service (http://www.ebi.ac.uk/spot/oxo/). We recommend that research groups use knowledge resources that are publicly and freely available, programmatically accessible via web services and mapped to other resources used within the community that overlap in content.

(v) Establish a repository of reusable annotations

The process of annotating a model is time-consuming and susceptible to inter-annotator variability. We, therefore, recommend the creation of an online repository of curated, reusable annotations and integration between this repository and software tools. The repository would allow annotators to quickly find and reuse valid, standards-compliant annotations previously applied to curated models, reducing annotation time and promoting consistency among annotators. This feature is especially relevant for applying composite annotations; they take longer to create compared to singular annotations and require linking multiple knowledge resource terms, resulting in more opportunities for inter-annotator inconsistency. We envision this repository would mine RDF annotation statements from curated models (e.g. in BioModels or the PMR) and integrate with annotation software tools so that annotators can quickly link elements in their models/datasets to knowledge resource terms by reusing these statements. Such a repository could also provide the basis for automating the annotation process; once an annotator begins annotating a model, software tools could search the repository for curated models that have the same annotations and make suggestions for unannotated model elements. This approach would also improve inter-annotator consistency; see Recommendation 4 above. By utilizing the mappings that exist between knowledge resources (e.g. between ChEBI and the Kyoto Encyclopedia of Genes and Genomes [90]), a centralized repository of annotations could also identify and auto-generate synonymous annotations. This feature would help minimize redundant annotations within the repository and allow annotators to retrieve annotations that only use references to their preferred knowledge resources. Additionally, as discussed in Recommendation 3, we encourage the developers of an annotation repository to provide multilingual support. An additional benefit of such a repository is that the annotations it contains could be used for annotating experimental data. Therefore, the repository could act as a hub for accessing biological models, data and associated metadata. Given this, we recommend that the repository’s development team interfaces not only with members of the COMBINE community but with other external data sharing/integration efforts such as the Findable, Accessible, Interoperable, Reusable [91] initiative and Bioschemas [92]. This will help ensure that the repository’s content is readily discoverable, accessible and integrable by a wide community of users and will assist in developing effective metadata governance approaches. We anticipate that outreach will be a significant challenge that repository developers must address.

(vi) Ensure high-quality semantic annotations through training and quality control processes

The success of new software tools for expediting cross-repository model retrieval and model composition depends on linking models to thorough, precise semantic descriptions. Therefore, we recommend that annotators, whether they are the modelers themselves or curation team members, receive training so that their annotations are specific, complete and as consistent as possible across repositories. Annotators must not only thoroughly understand the meaning of a model’s computational elements, but also the scope, organization and limitations of the knowledge resources used for semantic annotation as well as the technical aspects of encoding annotations. We also encourage all groups annotating models to develop quality control protocols that ensure annotations are encoded correctly; to identify overly generic, overly specific and missing annotations; and to check that the annotations accurately capture the meaning of the model elements to which they are linked. We also recommend the enhancement of existing annotation software tools [36, 93, 94] so they prevent the application of inconsistent annotations and can auto-suggest annotations. Another interesting approach for promoting high-quality annotations is the use of Web 2.0 advancements and collaborative technologies to evaluate semantic annotations with user-friendly and interactive infrastructures (see, for example, Kalbasi et al. [24]). Implementing these proposed training and quality control processes can be a challenge. Funds must be obtained to finance their implementation and maintenance. Furthermore, implementation will require specialized expertise to train annotators and develop quality control practices. Therefore, the biological modeling community would benefit from new funding initiatives that aim to establish training programs and quality control operations for annotation practices.

(vii) Establish and maintain collaborations with knowledge resource developers

Modelers do not always organize a system for study in the same way as knowledge resource developers, and sometimes the biological concepts represented in a model are not present in existing knowledge resources. Many biological concepts in models are too fine-grained to be included and maintained in knowledge resources aiming to provide broader classifications for biophysical phenomena. Thus, annotating a model may require sending term requests to resource developers. Therefore, the modeling community should establish and maintain connections with these developers so that terms needed for annotation can be added to resources or modified in a timely fashion. For example, the Kinetic Simulation Algorithm Ontology [38] term submission system on the COMBINE website (http://co.mbine.org/standards/kisao) connects to a SourceForge ticket system. An ongoing dialog between modelers and resource developers will also help modelers better understand the scope and content of knowledge resources, which is a critical component of Recommendation 6. These collaborations, although they will require resource investment by all parties involved, will likely be mutually beneficial: model annotators will be able to annotate faster and more thoroughly, and knowledge resource developers will be able to improve the coverage of their resources via feedback from annotation efforts.

Example semantic annotations

To provide concrete examples of semantic annotations that adhere to our recommendations, we have created two example COMBINE archives that contain annotated models. These archives are included as Supplementary Data, and their latest versions are available at the COMBINE GitHub repository (https://github.com/combine-org/Annotations). One of the archives contains an SBML model retrieved from BioModels (BIOMD0000000176) that simulates glycolysis in yeast [95]. The other contains a CellML model retrieved from the Physiome Model Repository that simulates hemodynamics in the human circulatory system [96]. These two models were chosen because, together, they provide examples of semantic annotations across physical scales and because the phenomena they simulate will be familiar to many biological researchers. In the future, we plan to create example semantic annotations for additional COMBINE standards (e.g. NeuroML and SBGN) and post them to the COMBINE GitHub repository as well. Developers within the COMBINE community are currently enhancing existing tools to support reading and writing semantic annotations that adhere to our recommendations, and recent versions of SemGen and tellurium-web support reading the example archives provided here. We have serialized semantic annotations for both of our example models as separate files within COMBINE archives. The annotations within these standalone files are encoded as RDF subject-predicate-object statements and the URIs for the subjects in these statements serve as pointers to the specific model elements that are annotated. To construct these pointers, we use the relative model file name for the URI base and the ‘metaid’ attribute on the annotated model element for the URI fragment. For example, the URI referencing the cytosol compartment in the SBML model is ./BIOMD0000000176.xml#_525523. This approach allows us to uniquely reference model elements within a COMBINE archive, even if elements from two different models share the same metaid. These URI pointers establish the link between the model file and its annotations, and the original model code is left unedited. By asserting links this way, a model file does not have to be updated when it is annotated, and synchronization of pointers across multiple files is not required. Our examples are based on initial progress toward a full technical specification of how external annotation files should be implemented for COMBINE standards. However, this specification is still under development, and several implementation details must be resolved before it can be completed, including how annotations should be implemented in BioPAX [97] and SBOL documents. Regardless, we hope that these examples will provide an initial reference point for standardization of a COMBINE-wide semantic annotation protocol. We invite those interested in participating in this standardization process to join the ‘COMBINE-annot’ online forum (https://groups.google.com/forum/#!forum/combine-annot), which provides a venue for addressing issues related to the annotation of models and data.

Discussion

As discussed above, semantic annotations can accelerate model reuse by enhancing search and retrieval of models. If a standard semantic annotation protocol is applied to publicly available models, the modeling community can develop more advanced tools that can search across model repositories, improve the relevance of search results and provide users with information that helps them decide whether an existing model is suitable for reuse in a modeling project. A common annotation protocol is also a critical component of model composition. Applying the protocol, model-merging tools such as SemGen can recognize the biological commonalities between models and then use that information to compare the models’ biological content and guide their assembly. Without a harmonized approach to semantic annotation, modelers would be required to identify the biological overlap between models manually. Given that data is an essential component of model-based research, we strongly encourage the computational biology community to develop protocols for data annotation. We hope that the annotation approach articulated here, and implemented in the example annotation files described above, provides a basis from which to build such protocols. It is in the interest of both modelers and experimentalists to adopt a shared approach for semantic annotation: by linking annotations in models and data sources, modelers could discover valuable datasets for tuning and validating their models, and experimentalists could discover simulation models that integrate knowledge about their field of study into systems-level perspectives. We recognize that it takes time and effort to annotate models accurately and consistently. Therefore, we urge the community to develop incentives and tools to make it easier and faster to annotate models. Incentives might include giving credit to annotators in model metadata and on model repositories. New tools that automate compliance-checking of annotations (e.g. https://normsys.h-its.org/validate and https://github.com/opencobra/memote) could help flag errors, omissions and overgeneralizations. Despite these challenges, we believe that semantic annotation is one of the keys for transforming the entire field of biological modeling into one in which models do not languish within single research labs, but are readily shared among the broader research community, easily incorporated into new modeling projects and contain reusable components. As the biotechnology community undertakes more investigations into systems-level biology, standardized semantic annotations will reduce the activation energy required to initiate, repurpose and extend model-based research projects. We hope that the recommendations articulated here help move the greater modeling community, including groups outside of COMBINE, toward the goal of broad model reuse and interoperability.

Key Points

Semantic annotations are a critical component for achieving broad interoperability and reusability of computational models in biology. Semantic annotation practices differ across groups within the larger biological modeling community; a consensus protocol for semantic annotation is needed. We present seven recommendations for harmonizing semantic annotation practices across the biological modeling community, along with example annotation files that adhere to the recommendations.

80 in total

1. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

2. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models.

Authors: M Hucka; A Finney; H M Sauro; H Bolouri; J C Doyle; H Kitano; A P Arkin; B J Bornstein; D Bray; A Cornish-Bowden; A A Cuellar; S Dronov; E D Gilles; M Ginkel; V Gor; I I Goryanin; W J Hedley; T C Hodgman; J-H Hofmeyr; P J Hunter; N S Juty; J L Kasberger; A Kremling; U Kummer; N Le Novère; L M Loew; D Lucio; P Mendes; E Minch; E D Mjolsness; Y Nakayama; M R Nelson; P F Nielsen; T Sakurada; J C Schaff; B E Shapiro; T S Shimizu; H D Spence; J Stelling; K Takahashi; M Tomita; J Wagner; J Wang
Journal: Bioinformatics Date: 2003-03-01 Impact factor: 6.937

3. A reference ontology for biomedical informatics: the Foundational Model of Anatomy.

Authors: Cornelius Rosse; José L V Mejino
Journal: J Biomed Inform Date: 2003-12 Impact factor: 6.317

4. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation.

Authors: P W Lord; R D Stevens; A Brass; C A Goble
Journal: Bioinformatics Date: 2003-07-01 Impact factor: 6.937

5. Web-based kinetic modelling using JWS Online.

Authors: Brett G Olivier; Jacky L Snoep
Journal: Bioinformatics Date: 2004-04-08 Impact factor: 6.937

6. Minimum information requested in the annotation of biochemical models (MIRIAM).

Authors: Nicolas Le Novère; Andrew Finney; Michael Hucka; Upinder S Bhalla; Fabien Campagne; Julio Collado-Vides; Edmund J Crampin; Matt Halstead; Edda Klipp; Pedro Mendes; Poul Nielsen; Herbert Sauro; Bruce Shapiro; Jacky L Snoep; Hugh D Spence; Barry L Wanner
Journal: Nat Biotechnol Date: 2005-12 Impact factor: 54.908

7. Measures of semantic similarity and relatedness in the biomedical domain.

Authors: Ted Pedersen; Serguei V S Pakhomov; Siddharth Patwardhan; Christopher G Chute
Journal: J Biomed Inform Date: 2006-06-10 Impact factor: 6.317

8. COPASI--a COmplex PAthway SImulator.

Authors: Stefan Hoops; Sven Sahle; Ralph Gauges; Christine Lee; Jürgen Pahle; Natalia Simus; Mudita Singhal; Liang Xu; Pedro Mendes; Ursula Kummer
Journal: Bioinformatics Date: 2006-10-10 Impact factor: 6.937

9. Minimal haemodynamic system model including ventricular interaction and valve dynamics.

Authors: Bram W Smith; J Geoffrey Chase; Roger I Nokes; Geoffrey M Shaw; Graeme Wake
Journal: Med Eng Phys Date: 2004-03 Impact factor: 2.242

10. Relations in biomedical ontologies.

Authors: Barry Smith; Werner Ceusters; Bert Klagges; Jacob Köhler; Anand Kumar; Jane Lomax; Chris Mungall; Fabian Neuhaus; Alan L Rector; Cornelius Rosse
Journal: Genome Biol Date: 2005-04-28 Impact factor: 13.583

14 in total

1. SemGen: a tool for semantics-based annotation and composition of biosimulation models.

Authors: Maxwell L Neal; Christopher T Thompson; Karam G Kim; Ryan C James; Daniel L Cook; Brian E Carlson; John H Gennari
Journal: Bioinformatics Date: 2019-05-01 Impact factor: 6.937

2. Setting the basis of best practices and standards for curation and annotation of logical models in biology-highlights of the [BC]2 2019 CoLoMoTo/SysMod Workshop.

Authors: Anna Niarakis; Martin Kuiper; Marek Ostaszewski; Rahuman S Malik Sheriff; Cristina Casals-Casas; Denis Thieffry; Tom C Freeman; Paul Thomas; Vasundra Touré; Vincent Noël; Gautier Stoll; Julio Saez-Rodriguez; Aurélien Naldi; Eugenia Oshurko; Ioannis Xenarios; Sylvain Soliman; Claudine Chaouiya; Tomáš Helikar; Laurence Calzone
Journal: Brief Bioinform Date: 2021-03-22 Impact factor: 11.622

3. A semantics, energy-based approach to automate biomodel composition.

Authors: Niloofar Shahidi; Michael Pan; Kenneth Tran; Edmund J Crampin; David P Nickerson
Journal: PLoS One Date: 2022-06-03 Impact factor: 3.752

4. Addressing barriers in comprehensiveness, accessibility, reusability, interoperability and reproducibility of computational models in systems biology.

Authors: Anna Niarakis; Dagmar Waltemath; James Glazier; Falk Schreiber; Sarah M Keating; David Nickerson; Claudine Chaouiya; Anne Siegel; Vincent Noël; Henning Hermjakob; Tomáš Helikar; Sylvain Soliman; Laurence Calzone
Journal: Brief Bioinform Date: 2022-07-18 Impact factor: 13.994

5. Hierarchical semantic composition of biosimulation models using bond graphs.

Authors: Niloofar Shahidi; Michael Pan; Soroush Safaei; Kenneth Tran; Edmund J Crampin; David P Nickerson
Journal: PLoS Comput Biol Date: 2021-05-13 Impact factor: 4.475

6. libOmexMeta: Enabling semantic annotation of models to support FAIR principles.

Authors: Ciaran Welsh; David P Nickerson; Anand Rampadarath; Maxwell L Neal; Herbert M Sauro; John H Gennari
Journal: Bioinformatics Date: 2021-06-16 Impact factor: 6.931

7. Consistency, Inconsistency, and Ambiguity of Metabolite Names in Biochemical Databases Used for Genome-Scale Metabolic Modelling.

Authors: Nhung Pham; Ruben G A van Heck; Jesse C J van Dam; Peter J Schaap; Edoardo Saccenti; Maria Suarez-Diez
Journal: Metabolites Date: 2019-02-06

8. Model annotation and discovery with the Physiome Model Repository.

Authors: Dewan M Sarwar; Reza Kalbasi; John H Gennari; Brian E Carlson; Maxwell L Neal; Bernard de Bono; Koray Atalag; Peter J Hunter; David P Nickerson
Journal: BMC Bioinformatics Date: 2019-09-06 Impact factor: 3.169

9. Protein ontology on the semantic web for knowledge discovery.

Authors: Chuming Chen; Hongzhan Huang; Karen E Ross; Julie E Cowart; Cecilia N Arighi; Cathy H Wu; Darren A Natale
Journal: Sci Data Date: 2020-10-12 Impact factor: 6.444

10. The first 10 years of the international coordination network for standards in systems and synthetic biology (COMBINE).

Authors: Dagmar Waltemath; Martin Golebiewski; Michael L Blinov; Padraig Gleeson; Henning Hermjakob; Michael Hucka; Esther Thea Inau; Sarah M Keating; Matthias König; Olga Krebs; Rahuman S Malik-Sheriff; David Nickerson; Ernst Oberortner; Herbert M Sauro; Falk Schreiber; Lucian Smith; Melanie I Stefan; Ulrike Wittig; Chris J Myers
Journal: J Integr Bioinform Date: 2020-06-29